Using nvcodec nvh264dec on multi-gpu machines

Hello! This topic is a follow-up to my question in the GStreamer GitLab issue tracker: "cudadownload init failure on multi-gpu setup if first device is out of memory" (#3173, GStreamer / gstreamer).

Question in brief: how can nvh264dec instances be distributed across several GPUs on a multi-GPU machine when using decodebin?

Question in more detail:
We run our project on a machine equipped with four NVIDIA A5000 GPUs. We use GStreamer to receive video and decode frames to RGB. A separate pipeline is created for each camera with gst_parse_launch, but all of them run in the same process.
There are ~85 Full HD cameras in our system, with 15-25 fps input streams and a 5 fps limit on decoding.

To speed up processing we use the nvcodec plugin, so the pipeline looks like this: rtspsrc ! decodebin ! videorate ! cudaconvert ! cudascale ! cudadownload ! appsink. GPU memory usage is high: 76 cameras in our case require ~30 GB of GPU memory, so the streams have to be distributed among the GPUs, and we would like that distribution to be as uniform as possible. In the issue mentioned above, Seungha Yang advised using decodebin’s autoplug-* signals. I tried using the autoplug-sort signal to reorder the decoder factories so that decodebin picks the GPU I want. I ran into several problems:

  • I do not understand how to get the feature name from a factory provided by the autoplug-sort signal. I had to use the factory's long name, which feels like an ugly solution. Is there another way to distinguish factories?
  • After filtering for NVDEC decoders, I have to get the corresponding GPU id to match it against the memory usage reported by the NVML library. The only way I found is to extract the device id from the factory long name with a regex, which feels ugly too.
  • Stateless decoders do not have the ‘with device N’ suffix in their long name in GStreamer 1.22 (Ubuntu 23.04), while they do in 1.20 (Ubuntu 22.04). This makes it impossible to pick the proper one.
  • The most significant issue is that decodebin seems to ignore the sorted factories, and as a consequence the first GPU (GPU id = 0) is used more than the others.
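
For reference, the regex extraction from the second bullet can be sketched roughly as follows. This is a minimal sketch of the get_gpu_id helper used in the code further down, not the exact implementation: it assumes the long name ends with the ‘with device N’ suffix seen in GStreamer 1.20, and it returns an empty optional when the suffix is missing (which is what happens for the stateless decoders in 1.22).

```cpp
#include <optional>
#include <regex>
#include <string>

// Sketch of a helper that parses the GPU index from a factory long name.
// Assumption: the long name ends with "with device N" (case-insensitive).
// Returns std::nullopt when no such suffix is present.
std::optional<unsigned> get_gpu_id(const std::string &longname)
{
    static const std::regex pattern(R"(with device (\d+)\s*$)",
                                    std::regex::icase);
    std::smatch match;
    if (!std::regex_search(longname, match, pattern))
        return std::nullopt;
    // match[1] holds only digits, so std::stoul can parse it directly.
    return static_cast<unsigned>(std::stoul(match[1].str()));
}
```

Returning std::optional lets the caller decide what a missing suffix means; in the comparator below such factories are simply treated as not suitable.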

My questions:

  • Is the approach described above correct at all?
  • What is the best way to distribute nvh264dec instances uniformly among several GPUs?
  • Does decodebin strictly follow the order of the sorted factories returned by autoplug-sort?

Here is my code:

	g_signal_connect(decodebin_, "autoplug-sort", G_CALLBACK(::autoplug_sort_callback), nullptr);

// Orders two factories so that the "suitable" one (a hardware decoder on the
// preferred GPU) comes first.
static gint compare_factories(gconstpointer a, gconstpointer b)
{
    const GValue *val_a = (const GValue *)a;
    const GValue *val_b = (const GValue *)b;
    GstElementFactory *factory_a = (GstElementFactory *)g_value_get_object(val_a);
    GstElementFactory *factory_b = (GstElementFactory *)g_value_get_object(val_b);
    const char *class_a = gst_element_factory_get_klass(factory_a);
    const char *class_b = gst_element_factory_get_klass(factory_b);
    const char *name_a = gst_element_factory_get_longname(factory_a);
    const char *name_b = gst_element_factory_get_longname(factory_b);

    // Hardware decoders are preferred over software ones.
    bool a_is_hw = (strcasestr(class_a, "/Hardware") != nullptr);
    bool b_is_hw = (strcasestr(class_b, "/Hardware") != nullptr);
    bool a_is_suitable = a_is_hw;
    bool b_is_suitable = b_is_hw;
    if(a_is_hw and b_is_hw)
    {
        bool a_is_sl = (strcasestr(name_a, "stateless") != nullptr);
        bool b_is_sl = (strcasestr(name_b, "stateless") != nullptr);

        if(a_is_sl == b_is_sl)
        {
            // Same decoder kind: choose by GPU.
            auto a_gpu_id = get_gpu_id(name_a);
            auto b_gpu_id = get_gpu_id(name_b);
            // (reconstructed) a factory whose GPU id cannot be parsed is not suitable
            if(!a_gpu_id)
                a_is_suitable = false;
            if(!b_gpu_id)
                b_is_suitable = false;

            if(a_is_suitable and b_is_suitable)
            {
                auto optimal = get_best_of(a_gpu_id.value(), b_gpu_id.value());
                a_is_suitable = (optimal == a_gpu_id);
                b_is_suitable = !a_is_suitable;
            }
        }
        else
        {
            // Different decoder kinds: prefer the one that is not stateless.
            a_is_suitable = !a_is_sl;
            b_is_suitable = !b_is_sl;
        }
    }

    // (reconstructed) the suitable factory sorts first
    if(a_is_suitable and !b_is_suitable)
        return -1;
    if(!a_is_suitable and b_is_suitable)
        return 1;

    return 0;
}

static GValueArray *autoplug_sort_callback (GstElement */*bin*/, GstPad */*pad*/, GstCaps */*caps*/, GValueArray *factories, gpointer /*udata*/)
{
    // Return a sorted copy; the array passed in by decodebin is left untouched.
    GValueArray *ret = g_value_array_copy(factories);
    g_value_array_sort(ret, compare_factories);
    return ret;
}
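
The get_best_of helper called from compare_factories is not shown above. A minimal sketch of the idea, under assumptions: the real helper takes only the two GPU ids and reads memory usage from NVML, whereas this version takes a hypothetical FreeMemory snapshot (free bytes indexed by GPU id) as an extra argument so that it has no NVML dependency. In practice the snapshot would be filled from nvmlDeviceGetMemoryInfo.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of free GPU memory in bytes, indexed by GPU id.
// Assumption: the values are refreshed elsewhere, e.g. from the `free` field
// of nvmlMemory_t as filled by nvmlDeviceGetMemoryInfo.
using FreeMemory = std::vector<std::uint64_t>;

// Return whichever of the two GPU ids has more free memory.
// An id outside the snapshot counts as having no free memory;
// on a tie the first id (`a`) is returned.
unsigned get_best_of(const FreeMemory &free_bytes, unsigned a, unsigned b)
{
    const std::uint64_t free_a = a < free_bytes.size() ? free_bytes[a] : 0;
    const std::uint64_t free_b = b < free_bytes.size() ? free_bytes[b] : 0;
    return free_b > free_a ? b : a;
}
```

One caveat with this kind of matching: NVML enumerates devices in PCI bus order, while CUDA orders them differently by default unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, so the id parsed from the factory name and the NVML index may not refer to the same card.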