Spawning independent GStreamer pipelines with dynamically assigned CUDA contexts

On Ubuntu 22.04, we use GStreamer 1.22 to create several hardware-accelerated encoding pipelines. Our software is written in Go, so we wrote custom cgo bindings to GStreamer - it all works great.

I tried declaring which CUDA device the nvh264enc element should use, but GStreamer threw an error claiming, roughly, that the CUDA device property could not be written/set.

Before we create an encoding pipeline, the main thread kicks off the main GStreamer loop. Clearly, the main thread is picking up the default CUDA context (device 0) and running with that. When I run under Docker and set an environment variable so that only CUDA device 1 is visible, the whole pipeline uses device 1.

We would like to be able to select any of our N CUDA devices at runtime when constructing a pipeline. Separately, when we start a pipeline, we wrap the call in a go func() {} so it runs on its own goroutine.

After poking around, I have a hunch that I should be able to use yet another cgo wrapper around CUDA so that, within that thread, I could set any CUDA device. My concern is whether that would cause issues, considering that the construction of the pipeline was done in a parent thread with a different CUDA context.

I don’t know if I have to rewrite my application so that constructing the pipeline AND starting it happen together, at the point where I declare a particular CUDA context.

I’ve already spent several hours trying to find an example where several GStreamer pipelines run concurrently with unique CUDA contexts - any guidance or pointers would be greatly appreciated!

I am not sure why you think a CUDA context and a thread are associated. CUDA device selection and context construction are unrelated to the calling thread, and a thread would not inherit anything from its parent thread.

Regarding GPU selection, the cuda-device-id property of the encoder/decoder is a read-only property meant to help with GPU selection. To select a GPU, you need to construct the matching element via its factory name.

Assuming all your NVIDIA GPUs support encoding and decoding:
- nvh264enc and nvh264dec always require the CUDA context corresponding to cuda-device-id=0.
- The encoder/decoder elements corresponding to cuda-device-id != 0 have the factory names nvh264device{index}enc / nvh264device{index}dec. For example, the encoder with cuda-device-id=1 has the factory name nvh264device1enc.

You can find the list of available encoder/decoder elements with the gst-inspect-1.0 nvcodec command.
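For reference, a minimal C sketch of this selection scheme (the make_nvh264_encoder helper and the device index used in main are my own illustration, not part of the nvcodec plugin):

```c
#include <gst/gst.h>

/* Illustrative helper (not part of GStreamer): map a CUDA device index to
 * the per-device NVENC factory name described above and create the element.
 * Device 0 keeps the plain "nvh264enc" name; device N != 0 uses
 * "nvh264device{N}enc". Returns NULL if no such factory is registered,
 * e.g. when that GPU is not visible to the nvcodec plugin. */
static GstElement *
make_nvh264_encoder (guint cuda_device_id)
{
  gchar *factory_name;
  GstElement *encoder;

  if (cuda_device_id == 0)
    factory_name = g_strdup ("nvh264enc");
  else
    factory_name = g_strdup_printf ("nvh264device%uenc", cuda_device_id);

  encoder = gst_element_factory_make (factory_name, NULL);
  if (encoder == NULL)
    g_printerr ("No '%s' factory found; is that GPU visible?\n", factory_name);

  g_free (factory_name);
  return encoder;
}

int
main (int argc, char **argv)
{
  gst_init (&argc, &argv);

  /* Example: request the encoder bound to CUDA device 1
   * (assumes a second NVIDIA GPU is present). */
  GstElement *enc = make_nvh264_encoder (1);
  if (enc != NULL)
    gst_object_unref (enc);

  return 0;
}
```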

Seungha, first of all thank you very much for your quick response.

I was able to dynamically use different device ids by using the device{index} format! This was the only thing related to the scheme you’ve described that I could find in the docs. I’d be more than happy to contribute to the documentation for others who might have trouble comprehending the capabilities.

I also noticed that I was able to use the nvcudah264enc element with the same scheme and target a CUDA device through nvcudah264device{index}enc. Is that normal behavior for GStreamer elements, or unique to this scenario?

We’re interested in exploring the differences between the CUDA-based encoder and pure NVENC, and I found that the CUDA-based element still utilizes the NVENC encoder, but substantially less. Do you happen to know what’s fundamentally different about using CUDA versus not? I was under the impression that no NVENC would be used, but that’s not the case. I tried to find resources that would explain this, but search results did not yield anything helpful.

I also noticed that I was able to use the nvcudah264enc element with the same scheme and target a CUDA device through nvcudah264device{index}enc. Is that normal behavior for GStreamer elements, or unique to this scenario?

GStreamer has no strict naming rule for this case as far as I can tell, but all the hardware plugins I wrote (nvcodec, d3d11/12, qsv, and amfcodec) use this naming rule for the multi-GPU scenario.

We’re interested in exploring the differences between the CUDA-based encoder and pure NVENC

Both nvh264enc and nvcudah264enc use the same CUDA + NVENC API. nvcudah264enc is a new NVENC implementation written during the 1.22 development cycle to support zero-copy encoding when the upstream element supports CUDA memory (e.g., nvh264dec, cudaconvert).

Because of the zero-copy feature, the new nvcudah264enc consumes less GPU memory than the old one and can avoid unnecessary CUDA memcpy. Note that there are additional optimizations added to nvcudah264enc in the main branch.
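As an illustration of the zero-copy path (the file names and the explicit memory:CUDAMemory capsfilter are my assumptions, not from the post), here is a hedged C sketch of a transcode branch where decoded frames stay in CUDA memory so nvcudah264enc can consume them directly:

```c
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* Hypothetical zero-copy transcode: nvh264dec outputs GstCudaMemory, the
   * capsfilter keeps buffers on the GPU, and nvcudah264enc consumes them
   * directly instead of copying through system memory.
   * input.h264 / output.h264 are placeholder file names. */
  GstElement *pipeline = gst_parse_launch (
      "filesrc location=input.h264 ! h264parse ! nvh264dec ! "
      "video/x-raw(memory:CUDAMemory) ! nvcudah264enc ! h264parse ! "
      "filesink location=output.h264",
      &error);

  if (pipeline == NULL) {
    g_printerr ("Failed to build pipeline: %s\n", error->message);
    g_clear_error (&error);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  /* Block until EOS or an error, then shut down. */
  GstBus *bus = gst_element_get_bus (pipeline);
  GstMessage *msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
  if (msg != NULL)
    gst_message_unref (msg);

  gst_object_unref (bus);
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```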

Another note: I plan to replace nvh264enc with nvcudah264enc after the 1.24 release (remove the old nvh264enc implementation and rename nvcudah264enc to nvh264enc).


Thank you for sharing this insight and the heads up for future versions.

One last question to settle the multi-GPU scenario: if the elements linked upstream of the encoder are cudaupload, cudaconvert, and a capsfilter with memory:CUDAMemory caps, do all of those elements also need a CUDA device explicitly assigned? Or does GStreamer automatically assign the appropriate CUDA device based on the device declared by the encoder element?

The device context search order is downstream → upstream → global (in the pipeline), so I expect the encoder’s CUDA context to be assigned automatically to the linked upstream elements.

But I recommend setting the cuda-device-id property on the linked upstream cudaupload / cudaconvert if you know which device should be assigned to the corresponding encoding branch, because CUDA context propagation may not behave for some reason.
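A hedged C sketch of that recommendation, assuming a second GPU (cuda-device-id=1) and an nvcudah264device1enc encoder per the naming scheme above; the test source and sink are placeholders:

```c
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  gst_init (&argc, &argv);

  /* Illustrative encoding branch pinned to a single GPU, as recommended
   * above: cuda-device-id is set explicitly on the upstream CUDA elements
   * instead of relying on context propagation. The device index, the
   * per-device encoder factory name, and the test source/sink are
   * assumptions for the example. */
  const guint device_id = 1;

  GstElement *pipeline = gst_pipeline_new ("encode-branch");
  GstElement *src = gst_element_factory_make ("videotestsrc", NULL);
  GstElement *upload = gst_element_factory_make ("cudaupload", NULL);
  GstElement *convert = gst_element_factory_make ("cudaconvert", NULL);
  GstElement *enc = gst_element_factory_make ("nvcudah264device1enc", NULL);
  GstElement *sink = gst_element_factory_make ("fakesink", NULL);

  if (!pipeline || !src || !upload || !convert || !enc || !sink) {
    g_printerr ("Missing element(s); check that the nvcodec plugin sees GPU 1\n");
    return 1;
  }

  /* Pin the upstream CUDA elements to the same device as the encoder. */
  g_object_set (upload, "cuda-device-id", device_id, NULL);
  g_object_set (convert, "cuda-device-id", device_id, NULL);

  gst_bin_add_many (GST_BIN (pipeline), src, upload, convert, enc, sink, NULL);
  if (!gst_element_link_many (src, upload, convert, enc, sink, NULL)) {
    g_printerr ("Failed to link the branch\n");
    return 1;
  }

  gst_object_unref (pipeline);
  return 0;
}
```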

Alternatively, nvautogpuh264enc offers another way of selecting the GPU.
nvautogpuh264enc will initialize its encoding session using the CUDA context of the first received GstMemory, and other CUDA elements (such as cudaconvert) can/will update their underlying CUDA resources if the CUDA context of a received GstCudaMemory is different from the element’s own context.

Thus, if you configure a branch like `cudaupload cuda-device-id=1 ! cudaconvert ! capsfilter ! nvautogpuh264enc`, the branch will automatically use the GPU and CUDA context corresponding to `cuda-device-id=1`.
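For completeness, the same branch as a parse-launch string embedded in C (the videotestsrc source and fakesink are placeholders I added so the sketch is self-contained):

```c
#include <gst/gst.h>

int
main (int argc, char **argv)
{
  GError *error = NULL;

  gst_init (&argc, &argv);

  /* The auto-GPU branch described above: cudaupload pins the CUDA context
   * to device 1, and nvautogpuh264enc picks that context up from the first
   * GstCudaMemory it receives. videotestsrc and fakesink are placeholders. */
  GstElement *pipeline = gst_parse_launch (
      "videotestsrc num-buffers=300 ! cudaupload cuda-device-id=1 ! "
      "cudaconvert ! video/x-raw(memory:CUDAMemory) ! "
      "nvautogpuh264enc ! h264parse ! fakesink",
      &error);

  if (pipeline == NULL) {
    g_printerr ("Failed to build branch: %s\n", error->message);
    g_clear_error (&error);
    return 1;
  }

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  GstBus *bus = gst_element_get_bus (pipeline);
  GstMessage *msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
  if (msg != NULL)
    gst_message_unref (msg);

  gst_object_unref (bus);
  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (pipeline);
  return 0;
}
```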

Woah, what an awesome feature! Is this documented somewhere and I missed it?

I assume that the nvautogpuh264enc is using the same logic as the nvcudah264enc and not the nvh264enc?

Again, much appreciated guidance :)

Its purpose and the differences are not well documented; they are mentioned in the 1.22 release notes, though.

I assume that the nvautogpuh264enc is using the same logic as the nvcudah264enc and not the nvh264enc?

nvautogpuh264enc and nvcudah264enc use the same implementation and code, except for the device selection part.

@seungha Hi, how can I choose the device in decodebin? I provide the CUDA context via the NEED_CONTEXT message, but it does not work. After debugging, I found that the decoder silently refuses the context because its cuda-device-id is 0, while I want to use device 1. Why is the default value of cuda-device-id 0 and not -1? I found that if it is -1, it will accept my CUDA context and use the device of that context.
I also found that cuda-device-id is a read-only property. Is there a good way to tell decodebin to use the device I specify?

Hi again @seungha

Earlier you mentioned:

 Thus, if you configure a branch like `cudaupload cuda-device-id=1 ! cudaconvert ! capsfilter ! nvautogpuh264enc`, the branch will automatically use the GPU and CUDA context corresponding to `cuda-device-id=1`.

Today I’m working on strategically selecting CUDA devices for our encoding pipelines. Here is the nvidia-smi output on this machine:

Fri Nov 22 21:19:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070 Ti     Off | 00000000:01:00.0 Off |                  N/A |
|  0%   33C    P8              20W / 290W |   3647MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:41:00.0 Off |                  N/A |
|  0%   32C    P8              31W / 350W |   1018MiB / 24576MiB |     24%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:42:00.0 Off |                  N/A |
| 57%   31C    P8              44W / 420W |   1021MiB / 24576MiB |     39%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN V                 Off | 00000000:61:00.0 Off |                  N/A |
| 29%   43C    P2              29W / 250W |   1226MiB / 12288MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off | 00000000:62:00.0 Off |                  Off |
|  0%   36C    P5              26W / 450W |   1884MiB / 24564MiB |     34%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         

I’ve noticed that the CUDA device ID assigned in the pipeline does not match the device ID reported by nvidia-smi.

| GPU actually used (nvidia-smi index) | cuda-device-id set in the pipeline |
|---|---|
| 4 | 0 |
| 1 | 1 |
| 2 | 2 |
| 0 | 3 |
| 3 | 4 |

I.e., when I create a cudaupload element and set the device ID to 0, I would expect it to use the 3070 Ti (nvidia-smi device 0), but instead it uses the 4090 (nvidia-smi device 4).

Is this a bug or expected behavior? Is there a way to ask GStreamer which CUDA devices it can see and what type they are?