Getting the pointer to CUDAMemory of a buffer

konstanty · November 28, 2023, 5:11am

Hi!

I’m using GStreamer 1.22.6 on NixOS with “bad plugins” and the auto-generated Python bindings. I can boot a more traditional distro if needed, but things work well so far and integrate extremely well with everything else I use.

I have the following example pipeline for decoding H.264 on my Nvidia GeForce RTX 3060. The bunny.mp4 can be replaced with any other MP4 file (I wasn’t sure how to make it work with HTTP).

filesrc location=bunny.mp4
! qtdemux name=d d.video_0
! h264parse
! nvh264dec
! cudaconvertscale add-borders=false
! video/x-raw(memory:CUDAMemory), height=100, width=200, format=RGBA
! fakesink name=x

Running the above uses the GPU and finishes faster than avdec, so I’m assuming things are working as intended so far.

I then placed a buffer probe on the fakesink with the Python bindings. Ultimately, I’d like to turn buffers into PyTorch tensors without moving data to host memory. More specifically, I’d like to implement the ... in the following function.

def buf_to_tensor(buf: Gst.Buffer, caps: Gst.Caps) -> torch.Tensor:
    """Converts GStreamer buffer/caps to a PyTorch CUDA tensor."""

    height = caps.get_structure(0).get_value("height")
    width = caps.get_structure(0).get_value("width")

    is_mapped, map_info = buf.map(Gst.MapFlags.READ)

    # Allocate an empty tensor of the right dimensions
    tensor = torch.empty(
        (height, width, 4),
        dtype=torch.uint8,
        device="cuda",
    )

    # Each entry is 1 byte
    n_bytes = height * width * 4

    # Getting CUDA memory pointers
    dest_ptr = tensor.data_ptr()
    source_ptr = ...

    # Copy memory device-to-device
    cuda.memcpy_dtod(dest_ptr, source_ptr, n_bytes)

    return tensor

A few comments on this attempt

GstCuda.is_cuda_memory(map_info.memory) is True for all frames, as expected
map_info.data is a Python memoryview object on the host
ctypes.cast(map_info.data, ctypes.c_void_p) doesn’t work, it’s raw data in there, not a pointer
id(map_info.data) is also not a CUDA memory address
Destination pointer is a valid CUDA memory address and I can easily copy these memory buffers around with PyCuda

I’m aware of (and inspired by) the blog posts by Paul Bridger. They use DeepStream, converting NVMM memory buffers using the NvBufSurface API, which stores the pointer in .surfaceList[0].dataPtr and wraps cudaMemcpy, all in a shared object nvbufsurface.so that ships with DeepStream.

These posts are from 2020, and my understanding is that GStreamer has since got some native support for CUDA memory management. DeepStream is a pretty big dependency with other issues, so I’d like to try doing it in a more modern way!

I can’t find any examples of how I would go about this, neither in C nor in Python. I understand the GstCuda API is unstable; it’s already immensely useful though! I’d love to understand more and try to contribute. I would be very grateful for any suggestions on where to look next. I saw that many commits related to these APIs are due to @seungha. Thank you so much for this amazing work!

Thanks for reading. Have a great day!

Best wishes,

Konstanty

PS (one more possibly related experiment)

I’ve been trying to use GstCuda.CudaContext.new to get a new CUDA context.

import gi

gi.require_version("Gst", "1.0")
gi.require_version("GstCuda", "1.0")

from gi.repository import (
    Gst,
    GstCuda,
)

if not Gst.init_check(None):
    raise Exception("GStreamer failed to init")

GstCuda.CudaContext.new(0)

I get a segmentation fault in the above. Couldn’t find any relevant DEBUG logs. I see that my GPU is detected and a new CUDA context is created on launch. These are also present in my original pipeline.

0:00:00.452209979 85569       0x40c250 INFO                   nvenc gstnvenc.c:999:gst_nvenc_load_library: API version 11.1 load done
0:00:00.452214727 85569       0x40c250 INFO                   nvenc gstnvenc.c:1008:gst_nvenc_load_library: nvEncSetIOCudaStreams is supported
0:00:00.452296412 85569       0x40c250 INFO             cudacontext gstcudacontext.c:245:gst_create_cucontext: GPU #0 supports NVENC: yes (NVIDIA GeForce RTX 3060) (Compute SM 8.6)
0:00:00.528084197 85569       0x40c250 INFO             cudacontext gstcudacontext.c:269:gst_create_cucontext: Created CUDA context 0x16c4730 with device-id 0

I don’t necessarily need this I believe, just the pointer, as PyCuda can find the default context for the main thread. Interesting nonetheless!

seungha · November 28, 2023, 10:56am

To access the CUDA memory in a GstCudaMemory, you need to pass the GST_MAP_CUDA flag too.

In C, you need to do

GstVideoFrame frame;
GstVideoInfo info;
GstCudaMemory *mem;

gst_video_info_from_caps (&info, caps);
// Map with GST_MAP_CUDA so the plane pointers are device addresses
gst_video_frame_map (&frame, &info, buf,
    (GstMapFlags) (GST_MAP_READ | GST_MAP_CUDA));

mem = (GstCudaMemory *) gst_buffer_peek_memory (buf, 0);

// cuCtxPushCurrent ()
gst_cuda_context_push (mem->context);

// GstCuda uses alloc2d, so rows may be padded (i.e., width in bytes != stride)
for (guint i = 0; i < GST_VIDEO_FRAME_N_PLANES (&frame); i++) {
  CUDA_MEMCPY2D param = { 0, };
  param.srcMemoryType = CU_MEMORYTYPE_DEVICE;
  param.srcDevice = (CUdeviceptr) GST_VIDEO_FRAME_PLANE_DATA (&frame, i);
  param.srcPitch = GST_VIDEO_FRAME_PLANE_STRIDE (&frame, i);
  param.WidthInBytes = GST_VIDEO_FRAME_COMP_WIDTH (&frame, i) *
    GST_VIDEO_FRAME_COMP_PSTRIDE (&frame, i);
  param.Height = GST_VIDEO_FRAME_COMP_HEIGHT (&frame, i);
  // Fill dst param accordingly
  
  cuMemcpy2D (&param);
}

// cuCtxPopCurrent (nullptr);
gst_cuda_context_pop (NULL);

// Release the mapping once the copy is done
gst_video_frame_unmap (&frame);

I’m not sure whether all of the above is available in Python.

konstanty · November 29, 2023, 7:10pm

Thank you, that was very helpful.

GST_MAP_CUDA exists in the Python bindings, but adding it to the flags doesn’t change the value of map_info.data; it’s still a CPU memoryview. It seems what I’m missing are the macro casts. I think maybe some GLib/GObject functionality is needed here. I’ll leave it for another thread when I learn more.
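For Python readers, one route that might sidestep the memoryview is to call gst_buffer_map from C through ctypes and read the data field of GstMapInfo as a raw address. The following is only a sketch under assumptions, untested against a real pipeline: the struct layout and the GST_MAP_CUDA value are copied from the GStreamer headers, the library file name is a guess that will differ per system (on NixOS a full store path is likely needed), and load_gst and map_cuda_buffer are hypothetical helper names.

```python
import ctypes

# Flag values as defined in the GStreamer headers:
# GST_MAP_CUDA is GST_MAP_FLAG_LAST << 1.
GST_MAP_READ = 1 << 0
GST_MAP_FLAG_LAST = 1 << 16
GST_MAP_CUDA = GST_MAP_FLAG_LAST << 1


class GstMapInfo(ctypes.Structure):
    """ctypes mirror of the C GstMapInfo struct."""

    _fields_ = [
        ("memory", ctypes.c_void_p),
        ("flags", ctypes.c_int),
        # With GST_MAP_CUDA this should hold a device address, not host data.
        ("data", ctypes.c_void_p),
        ("size", ctypes.c_size_t),
        ("maxsize", ctypes.c_size_t),
        ("user_data", ctypes.c_void_p * 4),
        ("_gst_reserved", ctypes.c_void_p * 4),
    ]


def load_gst(path="libgstreamer-1.0.so.0"):
    """Load libgstreamer and declare the two functions used below.

    The default path is an assumption; adjust it for your system.
    """
    lib = ctypes.CDLL(path)
    lib.gst_buffer_map.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(GstMapInfo), ctypes.c_int]
    lib.gst_buffer_map.restype = ctypes.c_int
    lib.gst_buffer_unmap.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(GstMapInfo)]
    lib.gst_buffer_unmap.restype = None
    return lib


def map_cuda_buffer(lib, buf_addr):
    """Map the GstBuffer at address buf_addr for CUDA read access."""
    info = GstMapInfo()
    flags = GST_MAP_READ | GST_MAP_CUDA
    if not lib.gst_buffer_map(buf_addr, ctypes.byref(info), flags):
        raise RuntimeError("gst_buffer_map failed")
    return info
```

A possible use inside the probe would be info = map_cuda_buffer(lib, hash(buf)), reading info.data as the source pointer, and calling lib.gst_buffer_unmap(hash(buf), ctypes.byref(info)) after the copy. That relies on hash(buf) returning the underlying C pointer, which is a PyGObject implementation detail rather than a documented guarantee. The rows behind that pointer may still be padded, so the stride has to be taken into account when copying.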

I’ve been trying to use your suggestion in C, but I’m very unfamiliar with CUDA_MEMCPY2D and how to fill the destination fields. I also believe Torch tensors, which would be the destination here, are allocated in 1D only.

Is there a format I can convert to with cudaconvertscale such that the memory layout will be 1D? Or perhaps I can copy the 2D memory with 1D memcpy and resolve any stride issues in the tensor?
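A packed 1D destination can still be the target of a 2D copy: its pitch is simply the row width in bytes, with no padding. The sketch below shows one way the fields could be filled, assuming a single-plane RGBA frame. packed_copy_params and run_copy are hypothetical helper names; run_copy uses PyCUDA's Memcpy2D, needs a current CUDA context, and has not been run against GStreamer memory.

```python
def packed_copy_params(src_ptr, src_stride, dst_ptr, width, height,
                       bytes_per_pixel=4):
    """Describe a copy from a pitched source into a packed destination."""
    width_in_bytes = width * bytes_per_pixel
    if src_stride < width_in_bytes:
        raise ValueError("source stride is smaller than one row of pixels")
    return {
        "src_device": src_ptr,
        "src_pitch": src_stride,       # padded row length of the source
        "dst_device": dst_ptr,
        "dst_pitch": width_in_bytes,   # packed: rows follow each other directly
        "width_in_bytes": width_in_bytes,
        "height": height,
    }


def run_copy(params):
    """Issue the device-to-device copy with PyCUDA's Memcpy2D."""
    import pycuda.driver as cuda  # lazy import: only needed on a CUDA machine

    copy = cuda.Memcpy2D()
    copy.set_src_device(params["src_device"])
    copy.set_dst_device(params["dst_device"])
    copy.src_pitch = params["src_pitch"]
    copy.dst_pitch = params["dst_pitch"]
    copy.width_in_bytes = params["width_in_bytes"]
    copy.height = params["height"]
    copy(aligned=True)
```

With the caps from the pipeline above (width=200, height=100, RGBA), dst_pitch and width_in_bytes both come out as 800, while src_pitch is whatever stride the CUDA allocator chose for the frame.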

I also found NNStreamer, which adds an other/tensors type to GStreamer and supports the Torch C++ API. Unfortunately, it seems they assume data is decoded on the CPU and then uploaded to the GPU with Torch. I’ll ask there for more details.

seungha · December 1, 2023, 1:32pm

Copying a frame in CUDA memory is not that different from copying one in system memory (cuMemcpy or memcpy, depending on the memory type).

You can do line by line copy for RGBA like

guint8 *src;
guint8 *dst;

src = (guint8 *) GST_VIDEO_FRAME_PLANE_DATA (&frame, 0);
dst = /* your destination CUDA device memory pointer */

// RGBA: 4 bytes per pixel
guint width_in_bytes = GST_VIDEO_FRAME_WIDTH (&frame) * 4;
guint src_stride = GST_VIDEO_FRAME_PLANE_STRIDE (&frame, 0);
guint dst_stride = width_in_bytes; // assume dst 1D memory has no padding

for (guint i = 0; i < GST_VIDEO_FRAME_HEIGHT (&frame); i++) {
   cuMemcpy ((CUdeviceptr) dst, (CUdeviceptr) src, width_in_bytes);
   dst += dst_stride;
   src += src_stride;
}
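The same pointer arithmetic can be checked on the host before touching the GPU. The sketch below mimics the loop above with Python bytearrays standing in for device memory; copy_rows is a hypothetical helper, and the 12-byte stride is made up for the toy frame.

```python
def copy_rows(src, src_stride, width_in_bytes, height):
    """Copy `height` rows out of a padded buffer into a packed bytearray."""
    dst = bytearray(width_in_bytes * height)
    dst_stride = width_in_bytes  # packed destination, no padding
    for row in range(height):
        src_off = row * src_stride
        dst_off = row * dst_stride
        dst[dst_off:dst_off + width_in_bytes] = \
            src[src_off:src_off + width_in_bytes]
    return dst


# Toy frame: 2 rows of 2 RGBA pixels (8 bytes), padded to a 12-byte stride.
width_in_bytes, stride, height = 8, 12, 2
padded = bytearray()
for row in range(height):
    # Pixel bytes carry the row number; padding bytes are 0xff.
    padded += bytes([row + 1] * width_in_bytes)
    padded += b"\xff" * (stride - width_in_bytes)

packed = copy_rows(padded, stride, width_in_bytes, height)
```

Here packed holds only the pixel bytes of both rows, with the 0xff padding dropped, which is the layout a contiguous (height, width, 4) tensor expects.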

konstanty · January 1, 2024, 6:49pm

Thank you, did exactly that, worked perfectly. For future readers, I defined the PyTorch tensor and its pointer like this

auto tensor = torch::empty(
    { height, width, 4 },
    torch::TensorOptions()
    .dtype(torch::kByte)
    .device(torch::kCUDA)
);

auto dst_ptr = tensor.contiguous().data_ptr();

Sayyam-Jain · June 5, 2024, 1:56pm

Hi @konstanty, I’m also trying to achieve the same thing: converting GStreamer buffers (on the GPU) to Torch tensors directly, and I’m facing the same issue. Using Paul Bridger’s memmove code block throws an error because the NvBufSurface address is an integer while the gst_buffer pointer is of type memoryview. How did you solve this? Could you please share the Python code for it (I’m not very familiar with C/C++)? Thanks