Optimizing Video Frame Processing with GStreamer: GPU Acceleration and Parallel Processing

Hello! I’ve developed an open-source application that performs face detection and applies scramble effects to facial areas in videos. The app works well, thanks to GStreamer, but I’m looking to optimize its performance.

My pipeline currently:

  1. Reads video files using filesrc and decodebin
  2. Processes frames one-by-one using appsink/appsrc for custom frame manipulation
  3. Performs face detection with an ONNX model
  4. Applies scramble effects to the detected facial regions
  5. Re-encodes…

The full implementation is available on GitHub: altunenes/scramblery, in scramblery/video-processor/src/lib.rs (main branch).
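For context, here is a rough sketch of how the appsink end of such a pipeline is typically driven from Rust with the gstreamer and gstreamer-app crates. Element names, caps, and the input path are placeholders rather than the actual scramblery code, and exact gstreamer-rs function names vary a little between crate versions:

```rust
use gstreamer as gst;
use gstreamer_app as gst_app;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Decode half of the pipeline; frames are handed to the application
    // through an appsink instead of being rendered.
    let pipeline = gst::parse_launch(
        "filesrc location=input.mp4 ! decodebin ! videoconvert \
         ! video/x-raw,format=RGB ! appsink name=sink",
    )?
    .downcast::<gst::Pipeline>()
    .expect("top-level element should be a pipeline");

    let appsink = pipeline
        .by_name("sink")
        .expect("appsink named `sink`")
        .downcast::<gst_app::AppSink>()
        .expect("element should be an appsink");

    pipeline.set_state(gst::State::Playing)?;

    // Pull decoded frames one by one -- this per-frame loop is the part
    // the question below is about parallelizing.
    while let Ok(sample) = appsink.pull_sample() {
        if let Some(buffer) = sample.buffer() {
            let map = buffer.map_readable()?;
            let _pixels = map.as_slice();
            // ... face detection + scramble here, then push the result into
            // an appsrc feeding the encode half of the pipeline ...
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}
```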

My question: is there a “general” way to modify the pipeline to process multiple frames in parallel rather than one-by-one? What’s the recommended approach for parallelizing custom frame processing in GStreamer while maintaining synchronization? Of course, I’m not expecting “code”; I’m just looking for insight or an example on this topic so that I can study it and experiment with it. :slight_smile:

I saw some comments about replacing elements like x264enc with GPU-accelerated encoders (like nvenc or vaapih264enc), but I think those changes will be more meaningful after I make my pipeline parallel (?)… :thinking:

I got an answer from here:
https://www.reddit.com/r/gstreamer/comments/1ixzrdk/optimizing_video_frame_processing_with_gstreamer/

But I’d always be grateful for additional comments :slight_smile:

I could not view the suggestion on Reddit in a browser (it strictly forces using the app for that post). To handle your use case entirely in GStreamer, you’d need to develop a variant of funnel that actually puts the frames back in the right order; then you’d simply use a combination of tee, queue and that sync-funnel. I’ll keep that in mind, perhaps as a project for later. It fits well with our goal of keeping the AI processing in the graph (we have ONNX plugins and AI metadata).

Now, since you already use an appsrc, and your explanation suggests the scrambling is done outside of the pipeline, my suggestion would be to create a number of threads. Each thread, when activated, would process one frame and then signal that it is done.

A very simple scheduling approach that provides both parallelism and ordering is to feed incoming frames to the worker threads in round-robin fashion, and in a separate thread wait for the processed frames to come back, waiting on one worker at a time in that same order. This should work for a handful of threads, but if you aim for a bigger number, be aware that the ping-pong quickly becomes expensive and inefficient (e.g. it won’t scale to 100 cores).
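A minimal, GStreamer-free sketch of that scheme (all names here are illustrative, nothing is taken from the scramblery repo): N worker threads each process one frame at a time, frames are dealt out round-robin, and a collector drains the workers in the same round-robin order, so output order matches input order.

```rust
use std::sync::mpsc;
use std::thread;

type Frame = Vec<u8>;

fn scramble(mut frame: Frame) -> Frame {
    frame.reverse(); // placeholder for the real face-detection + scramble step
    frame
}

fn main() {
    let num_workers = 4;
    let frames: Vec<Frame> = (0..16).map(|i| vec![i as u8; 8]).collect();

    // One input and one output channel per worker.
    let mut to_workers = Vec::new();
    let mut from_workers = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..num_workers {
        let (in_tx, in_rx) = mpsc::channel::<Frame>();
        let (out_tx, out_rx) = mpsc::channel::<Frame>();
        to_workers.push(in_tx);
        from_workers.push(out_rx);
        handles.push(thread::spawn(move || {
            for frame in in_rx {
                // Each worker processes one frame at a time and "signals" it
                // by sending the result back on its own channel.
                out_tx.send(scramble(frame)).unwrap();
            }
        }));
    }

    // Feed incoming frames to the workers round-robin...
    for (i, frame) in frames.into_iter().enumerate() {
        to_workers[i % num_workers].send(frame).unwrap();
    }
    drop(to_workers); // close the inputs so the workers exit when done

    // ...and collect results in the same round-robin order, which restores the
    // original frame order (this is the ping-pong that stops scaling for very
    // large worker counts).
    let mut ordered = Vec::new();
    'outer: loop {
        for rx in &from_workers {
            match rx.recv() {
                Ok(frame) => ordered.push(frame),
                Err(_) => break 'outer, // all workers finished and drained
            }
        }
    }

    for h in handles {
        h.join().unwrap();
    }
    println!("processed {} frames in order", ordered.len());
}
```

In a real pipeline you would probably want bounded channels (std::sync::mpsc::sync_channel) so the decoder cannot run arbitrarily far ahead of the workers.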

The encoding must stay in order, so using a faster encoder, such as one of the VA encoders, is the best optimization there.


Thank you so much! I’ll experiment with this idea :slight_smile:

For your information, we have open-sourced our solution for maximizing encoding performance using all the cores. It is based on time-slicing parallelization (fixed-duration slices, in this case).

The concept can be used not only for encoding but also for any type of custom frame processing.
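As a rough illustration of the fixed-duration time-slicing idea (this is not the Fluendo implementation, just a sketch under the assumption that slices can be processed independently): split the stream into fixed-size chunks of frames, process each chunk on its own thread, and stitch the results back together in order.

```rust
use std::thread;

type Frame = Vec<u8>;

fn process(mut frame: Frame) -> Frame {
    frame.reverse(); // stand-in for encoding or any custom per-frame work
    frame
}

fn main() {
    let fps = 30usize;
    let slice_seconds = 2usize;
    let frames_per_slice = fps * slice_seconds; // fixed-duration slices
    let frames: Vec<Frame> = (0..300).map(|i| vec![(i % 256) as u8; 4]).collect();

    // Scoped threads let each worker borrow its slice of the input directly.
    let processed: Vec<Frame> = thread::scope(|s| {
        let handles: Vec<_> = frames
            .chunks(frames_per_slice)
            .map(|slice| {
                s.spawn(move || slice.iter().cloned().map(process).collect::<Vec<Frame>>())
            })
            .collect();

        // Joining in spawn order keeps the slices, and therefore the frames, in order.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    println!(
        "processed {} frames in {} slices",
        processed.len(),
        (processed.len() + frames_per_slice - 1) / frames_per_slice
    );
}
```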

You can read our blog post [1]; the implementation, done in Rust, is on GitHub [2].

[1] HYPE: HYbrid Parallel Encoder – Fluendo
[2] fluendo/flu-plugins-oss on GitHub (hype branch)


Great resource, thank you!