Hi!
I’m using GStreamer 1.22.6 on NixOS with the “bad” plugins and the auto-generated Python bindings. I can boot a more traditional distro if needed, but things work well so far and integrate extremely well with everything else I use.
I have the following example pipeline for decoding H.264 on my Nvidia GeForce RTX 3060. The bunny.mp4 can be replaced with any other MP4 file (I wasn’t sure how to make it work over HTTP).
```
filesrc location=bunny.mp4
  ! qtdemux name=d d.video_0
  ! h264parse
  ! nvh264dec
  ! cudaconvertscale add-borders=false
  ! video/x-raw(memory:CUDAMemory), height=100, width=200, format=RGBA
  ! fakesink name=x
```
Running the above uses the GPU and finishes faster than avdec, so I’m assuming things are working as intended so far.
I then placed a buffer probe on the fakesink with the Python bindings. Ultimately, I’d like to turn buffers into PyTorch tensors without moving the data to host memory. More specifically, I’m after implementing the ... in the following function.
```python
def buf_to_tensor(buf: Gst.Buffer, caps: Gst.Caps) -> torch.Tensor:
    """Converts GStreamer buffer/caps to a PyTorch CUDA tensor."""
    height = caps.get_structure(0).get_value("height")
    width = caps.get_structure(0).get_value("width")
    is_mapped, map_info = buf.map(Gst.MapFlags.READ)
    # Allocate an empty tensor of the right dimensions
    tensor = torch.empty(
        (height, width, 4),
        dtype=torch.uint8,
        device="cuda",
    )
    # Each entry is 1 byte
    n_bytes = height * width * 4
    # Getting CUDA memory pointers
    dest_ptr = tensor.data_ptr()
    source_ptr = ...
    # Copy memory device-to-device
    cuda.memcpy_dtod(dest_ptr, source_ptr, n_bytes)
    return tensor
```
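For reference, I invoke this function from a buffer probe wired up roughly like this (a minimal sketch of my harness, not verified line for line; it assumes bunny.mp4 in the working directory and the `buf_to_tensor` function above):

```python
PIPELINE_DESC = (
    "filesrc location=bunny.mp4 "
    "! qtdemux name=d d.video_0 "
    "! h264parse ! nvh264dec "
    "! cudaconvertscale add-borders=false "
    "! video/x-raw(memory:CUDAMemory),height=100,width=200,format=RGBA "
    "! fakesink name=x"
)


def run_with_probe(buf_to_tensor):
    """Parse the pipeline, attach a buffer probe to the fakesink's sink
    pad, and block until EOS or an error. Requires GStreamer with the
    nvcodec plugin, so the imports are deferred into the function."""
    import gi

    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)
    pipeline = Gst.parse_launch(PIPELINE_DESC)

    def on_buffer(pad, info):
        # Called once per decoded, converted frame reaching the fakesink.
        buf_to_tensor(info.get_buffer(), pad.get_current_caps())
        return Gst.PadProbeReturn.OK

    sink_pad = pipeline.get_by_name("x").get_static_pad("sink")
    sink_pad.add_probe(Gst.PadProbeType.BUFFER, on_buffer)

    pipeline.set_state(Gst.State.PLAYING)
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(
        Gst.CLOCK_TIME_NONE,
        Gst.MessageType.EOS | Gst.MessageType.ERROR,
    )
    pipeline.set_state(Gst.State.NULL)
```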
A few comments on this attempt:

- `GstCuda.is_cuda_memory(map_info.memory)` is True for all frames, as expected
- `map_info.data` is a Python `memoryview` object on the host
- `ctypes.cast(map_info.data, ctypes.c_void_p)` doesn’t work; it’s raw data in there, not a pointer
- `id(map_info.data)` is also not a CUDA memory address
- The destination pointer is a valid CUDA memory address, and I can easily copy these memory buffers around with PyCuda
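Relatedly, the `id()` point can be demonstrated with plain host memory: on CPython, `id()` is the address of the wrapping Python object, not of the bytes it exposes. A stdlib-only sketch:

```python
import ctypes

data = bytearray(16)          # a writable host buffer
view = memoryview(data)

# id() is (in CPython) the address of the memoryview *object*,
# not of the 16 payload bytes it exposes.
object_addr = id(view)

# ctypes.from_buffer wraps the existing storage without copying,
# so addressof() gives the address of the payload itself.
wrapped = (ctypes.c_ubyte * view.nbytes).from_buffer(view)
payload_addr = ctypes.addressof(wrapped)

print(object_addr != payload_addr)  # True: the two addresses differ
```

The same `from_buffer` trick only works on writable views, which is part of why getting at the address behind a read-only `map_info.data` is awkward.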
I’m aware of (and inspired by) the blog posts by Paul Bridger. They use DeepStream, converting NVMM memory buffers with the NvBufSurface API, which stores the pointer in `.surfaceList[0].dataPtr` and wraps `cudaMemcpy`, all in a shared object `nvbufsurface.so` that ships with DeepStream.
These posts are from 2020, and my understanding is that GStreamer has since gained some native support for CUDA memory management. DeepStream is a pretty big dependency with issues of its own, so I’d like to try doing this in a more modern way!
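One direction I’ve been eyeing, which is completely untested, so please correct me: the C headers define an extra map flag, `GST_MAP_CUDA`, in gst/cuda/gstcudamemory.h as `GST_MAP_FLAG_LAST << 1`, and my reading of the C code is that mapping a CUDA memory with `GST_MAP_READ | GST_MAP_CUDA` yields the device pointer rather than a host copy. Whether the Python bindings surface this usefully is an assumption on my part, and the helper names below are mine:

```python
import ctypes

# From the C headers (assumption: the numeric values carry over):
#   gst/gstmemory.h:          GST_MAP_FLAG_LAST = (1 << 16)
#   gst/cuda/gstcudamemory.h: GST_MAP_CUDA      = (GST_MAP_FLAG_LAST << 1)
GST_MAP_CUDA = 1 << 17


def buffer_payload_address(mv) -> int:
    """Address of the first byte a buffer-like object exposes, without
    copying or dereferencing it (from_buffer only wraps the storage)."""
    return ctypes.addressof((ctypes.c_ubyte * len(mv)).from_buffer(mv))


def device_ptr_of(buf):  # buf: Gst.Buffer backed by CUDA memory
    """Hypothetical helper: map with GST_MAP_CUDA and return what should
    then be the CUdeviceptr. Untested against the actual bindings."""
    from gi.repository import Gst  # deferred: this part is a sketch

    ok, map_info = buf.map(Gst.MapFlags(Gst.MapFlags.READ | GST_MAP_CUDA))
    if not ok:
        raise RuntimeError("map with GST_MAP_CUDA failed")
    try:
        # If the C semantics carry over, map_info.data now *starts at*
        # the device address; take its address, never read its contents.
        return buffer_payload_address(map_info.data)
    finally:
        buf.unmap(map_info)
```

Open questions I see with this sketch: whether PyGObject accepts the unknown flag bit at all, and whether `map_info.data` comes back writable enough for `from_buffer` (reading or copying it would dereference device memory on the host, which must be avoided).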
I can’t find any examples of how I would go about this, in either C or Python. I understand the GstCuda API is unstable; it’s already immensely useful, though! I’d love to understand more and try to contribute, and I would be very grateful for any suggestions on where to look next. I saw that many commits related to these APIs are by @seungha. Thank you so much for this amazing work!
Thanks for reading. Have a great day!
Best wishes,
Konstanty
PS (one more possibly related experiment)
I’ve been trying to use GstCuda.CudaContext.new to get a new CUDA context.
```python
import gi

gi.require_version("Gst", "1.0")
gi.require_version("GstCuda", "1.0")
from gi.repository import (
    Gst,
    GstCuda,
)

if not Gst.init_check(None):
    raise Exception("GStreamer failed to init")

GstCuda.CudaContext.new(0)
```
I get a segmentation fault running the above, and I couldn’t find any relevant DEBUG logs. I can see that my GPU is detected and a new CUDA context is created on launch; the same lines also appear when running my original pipeline:
```
0:00:00.452209979 85569 0x40c250 INFO nvenc gstnvenc.c:999:gst_nvenc_load_library: API version 11.1 load done
0:00:00.452214727 85569 0x40c250 INFO nvenc gstnvenc.c:1008:gst_nvenc_load_library: nvEncSetIOCudaStreams is supported
0:00:00.452296412 85569 0x40c250 INFO cudacontext gstcudacontext.c:245:gst_create_cucontext: GPU #0 supports NVENC: yes (NVIDIA GeForce RTX 3060) (Compute SM 8.6)
0:00:00.528084197 85569 0x40c250 INFO cudacontext gstcudacontext.c:269:gst_create_cucontext: Created CUDA context 0x16c4730 with device-id 0
```
I believe I don’t necessarily need this, just the pointer, as PyCuda can find the default context for the main thread. Interesting nonetheless!