Nvh265enc directly from gpu memory

nitola · April 22, 2024, 7:50am

Hello everybody,
I am trying to build a GStreamer pipeline like the following (GStreamer 1.24):

appsrc->nvh265enc->h265parse->qtmux->filesink

I create the gst_cuda_context object/stream like so:

  CUcontext cuda_context{};
  auto result = cuDevicePrimaryCtxRetain(&cuda_context, 0);
  gst_cuda_context = gst_cuda_context_new_wrapped(cuda_context, 0);
  gst_cuda_stream = gst_cuda_stream_new(gst_cuda_context);

I push frames to the appsrc like so:

auto cuda_memory = gst_cuda_allocator_alloc_wrapped(
    nullptr, gst_cuda_context, gst_cuda_stream, gst_video_info,
    CUdeviceptr(image_in_device_memory), nullptr, &FreeFunction);
GstBuffer* push_buffer = gst_buffer_new();
gst_buffer_insert_memory(push_buffer, -1, GST_MEMORY_CAST(cuda_memory));

The input to the appsrc is a CUDA-processed image stored in device memory behind the pointer “image_in_device_memory” (allocated via cudaMalloc). The format is RGBA8888.

In principle this works: I get a playable video on my hard disk.

Now to the problem:
When viewing the process in the NVIDIA profiler, I see the following behaviour:

GStreamer issues a device-to-host copy of the image followed by a host-to-device copy (the latter even performed by a different CUDA context). Only then does the RGB2YUV kernel run, and after that the NVENC/HEVC hardware is activated.

This additional transfer to the host and subsequent reupload confuses me immensely as the data is already on the device to start with.

Can anybody point me in the right direction as to why this unnecessary round trip happens? How can I avoid it?

Best regards and thank you

seungha · April 22, 2024, 4:19pm

The encoder copies the input buffer if the input CUDA memory belongs to a different GstCudaContext. You should share your GstCudaContext with the pipeline.

To assign your CUDA context to the pipeline:

Create a GstContext → gst_context_new_cuda_context()
Listen for the need-context message using a SYNC bus handler
Parse the context type → gst_message_parse_context_type()
If the context type == GST_CUDA_CONTEXT_TYPE, set your context on the message source (the source is the encoder in your case) → gst_element_set_context()

Note 1) nvh265enc always copies the input buffer into the encoder’s internal device memory. nvcudah265enc, which supports zero-copy, is recommended instead.

Note 2) nvcudah265enc does not support the RGBA format in 1.24 (support was recently added on the current main branch).

nitola · April 23, 2024, 6:58am

Dear Seungha,
thank you very much for your help. I feel honored that the developer of the plugin took the time to answer my question.

Your solution works. The essential step was indeed gst_element_set_context(), which avoids the second context. The extra copy is now gone and my program runs much more smoothly.

Thank you for your work on nvh265enc and nvcudah265enc, it helps others a lot.