Qsvh264enc: performance issue

Hello! My system is running on the 11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz CPU (equipped with Intel® UHD Graphics 750), screen resolution is FullHD (1920x1080).

I’m using the following pipeline:

ximagesrc display-name=mydisplay show-pointer=true use-damage=false \
            remote=true blocksize=16384 enable-navigation-events=true \
            ! video/x-raw,framerate=60/1 \
            ! timeoverlay \
            ! videoconvert \
            ! qsvh264enc bitrate=10000 low-latency=true \
            ! video/x-h264,profile=baseline \
            ! queue
...
intel-gpu-top -  833/ 874 MHz;    0% RC6; ----- (null);      619 irqs/s

      IMC reads:   ------ (null)/s
     IMC writes:   ------ (null)/s

          ENGINE      BUSY                                      MI_SEMA MI_WAIT
     Render/3D/0   16.01% |█████▌                             |      0%      0%
       Blitter/0    0.00% |                                   |      0%      0%
         Video/0   19.76% |██████▉                            |     15%      0%
  VideoEnhance/0    0.00% |                                   |      0%      0%

So one encoding stream consumes ~20% of the GPU. When I run 6+ streams, the FPS on the receiver side starts to drop from 60 FPS to 55 FPS and so on (a drop of approx. 5 FPS per stream). So my GPU is able to encode only 6 streams in real time at 60 FPS.
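
As a rough sanity check of that figure, here is a minimal sketch that extrapolates from the single-stream Video engine reading in the intel-gpu-top output above. It assumes the load scales linearly with the number of streams and that the Video engine is the only limit, which may not hold in practice. The function name is made up for illustration.

```python
import math

def estimate_max_streams(busy_percent_per_stream, budget_percent=100.0):
    """Linear extrapolation: how many streams fit into the engine's budget.

    Assumption: every additional stream costs the same share of the engine.
    """
    return budget_percent / busy_percent_per_stream

# Video/0 busy value for one stream, taken from the intel-gpu-top output above.
estimate = estimate_max_streams(19.76)
whole_streams = math.floor(estimate)
print(f"{estimate:.2f} -> {whole_streams} full streams")
```

The estimate lands in the same range as the observed limit of about 6 streams, which suggests the Video engine is close to saturated at that point.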

Now when I read posts like “Wow! QuickSync on newer gen Intels are transcode beasts!”, where people are getting “24 simultaneous 1080P to 720P transcodes” on UHD630, I’m curious how that is possible.

  1. UHD630 is worse than UHD750 by as much as 80% in some benchmarks (I realize it might be unrelated to QuickSync but still…);
  2. Transcoding is a more resource intensive operation than encoding (as transcoding usually requires both encoding and decoding).

So what am I missing when interpreting these results? Thanks.

The performance is very dependent on framerate as well.

Note that there are a few performance-related properties in the qsv encoder elements.

  • low-latency: there is a trade-off between performance and low latency
  • target-usage: you might be able to see better performance with 7, default is 4
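
A minimal sketch of how these properties could be applied to the encoder element from the pipeline above. It only composes the element description as a string. The helper name is hypothetical, and the property names and values are the ones mentioned in this thread.

```python
def qsv_encoder_element(bitrate=10000, low_latency=True, target_usage=7):
    """Build the qsvh264enc part of a pipeline description.

    Only properties mentioned in this thread are used; the default
    target-usage is 4 according to the reply above, 7 is the suggested value.
    """
    props = {
        "bitrate": bitrate,
        "low-latency": str(low_latency).lower(),  # GStreamer booleans: true/false
        "target-usage": target_usage,
    }
    return "qsvh264enc " + " ".join(f"{key}={value}" for key, value in props.items())

element = qsv_encoder_element()
print(element)
```

The resulting string can replace the `qsvh264enc ...` segment of the original pipeline so that different `target-usage` and `low-latency` combinations can be compared under the same load.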

Thanks Seungha! Using target-usage=7 reduces GPU resource usage by ~25%; now I can get 7 x 60 FPS streams without FPS loss on the receiver side. But it’s still far from the 24 streams people are getting. Even if those 24 streams are at 30 FPS, I’m not getting close to even 12 streams. low-latency=false doesn’t make any difference, btw.
Could you tell me from your experience: is the QuickSync hardware the same across different Intel products (CPUs and dGPUs), or may it differ? For NVENC, a single GPU may have from 1 to 4 NVENC chips (https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new); is the same true for Intel products? And if there are products with multiple physical encoders, would the driver distribute the load equally?

I’m not an Intel employee, so it’s hard to answer :)

Anyway, 1920x1080 has about 2x (2.25x, to be exact) as many pixels as 1280x720, so you need to take that into account.
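
To put numbers on this, here is a minimal sketch that compares encoder output pixel rates. It assumes the 24 transcodes ran at 30 FPS (the frame rate is not stated in this thread), and it ignores the decode half of a transcode entirely, so it is only a rough comparison.

```python
def pixel_rate(width, height, fps):
    """Pixels per second the encoder has to produce for one stream."""
    return width * height * fps

# Per-frame size ratio between 1080p and 720p.
frame_ratio = (1920 * 1080) / (1280 * 720)

# Per-stream rates: 1080p at 60 FPS vs. 720p at an assumed 30 FPS.
rate_1080p60 = pixel_rate(1920, 1080, 60)
rate_720p30 = pixel_rate(1280, 720, 30)
stream_ratio = rate_1080p60 / rate_720p30

# Aggregate encode output: 7 x 1080p60 (reported above) vs. 24 x 720p30.
total_7x1080p60 = 7 * rate_1080p60
total_24x720p30 = 24 * rate_720p30

print(frame_ratio, stream_ratio)
print(total_7x1080p60, total_24x720p30)
```

Under these assumptions one 1080p60 stream costs as much encode output as 4.5 streams of 720p30, and 7 x 1080p60 already pushes more encoded pixels per second than 24 x 720p30.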

I thought you had been testing on different Intel hardware while developing the qsv plugin.
“1080P to 720P transcodes” means decoding from 1080P and encoding to 720P, if I’m not mistaken, so it should be as or more resource intensive compared to pure 1080P encoding (which is what I’m trying to do), I believe.

FYI, we had similar issues decoding on Windows with Intel+GStreamer.

The decoder element used was d3d11h264dec, which uses the Intel hardware internally. We observed that the limit for an 8th gen or newer i7 with an iGPU is much lower than for a 7th gen or older i7. It was very strange.

More info about the issue can be found at: GitHub - rgonzalezfluendo/intel_d3d11_perf

Tested on a 12th Generation Intel(R) Core™ i9-12900H CPU featuring the UHD770 iGPU with two QuickSync chips, yielding the following result for 7 encoding threads:

intel-gpu-top: Intel Alderlake_p (Gen12) @ /dev/dri/card0 - 1001/1224 MHz;  10% RC6;  3.92/44.18 W;     2635 irqs/s

         ENGINES     BUSY                                                                                                             MI_SEMA MI_WAIT
       Render/3D   56.12% |███████████████████████████████████████████████████████████                                              |      0%      0%
         Blitter    0.00% |                                                                                                         |      0%      0%
           Video   28.19% |█████████████████████████████▋                                                                           |      0%      0%
    VideoEnhance    0.00% |                                                                                                         |      0%      0%

Thus, this is about half of the load we observed on the UHD 750. Beyond that I ran into CPU limitations, but I believe that with additional CPU cores the UHD770 could handle 12 concurrent encoding tasks.
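
The estimate of 12 can be cross-checked against the intel-gpu-top readings above. This is a minimal sketch, assuming the load on each engine scales linearly with the number of streams and that all of the measured load comes from the 7 encoding streams. Neither assumption is confirmed here, and the function name is made up for illustration.

```python
import math

def estimate_capacity(busy_by_engine, measured_streams):
    """Find the busiest engine and extrapolate how many streams saturate it."""
    # Idle engines cannot be the bottleneck and would divide by zero below.
    active = {name: busy for name, busy in busy_by_engine.items() if busy > 0}
    bottleneck = max(active, key=active.get)
    per_stream = active[bottleneck] / measured_streams
    return bottleneck, math.floor(100.0 / per_stream)

# Busy percentages for 7 streams, copied from the UHD770 output above.
uhd770 = {"Render/3D": 56.12, "Blitter": 0.00, "Video": 28.19, "VideoEnhance": 0.00}
bottleneck, capacity = estimate_capacity(uhd770, measured_streams=7)
print(bottleneck, capacity)
```

With these readings the Render/3D engine, not the Video engine, is the first to saturate, at roughly 12 streams.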