Introduction and context
Hi everyone, I will try to explain my problem as clearly as possible. Thank you to everyone who finds the time to help me.
Context
For a project at work I have to develop a program which receives multiple live SRT streams of the same scene as input, processes them in real time with a custom in-house element, and then outputs multiple NDI streams.
The cameras are placed in the same physical location and all send their SRT stream over the Internet to another location, where our machine resides. Our machine is a server with plenty of processing power.
Primary requirement for our implementation
The must-have requirement is that our processing pipeline preserves the synchronization between frames of the input SRT streams; in particular, we must keep the frame-by-frame synchronization information.
Stream Specification
The SRT streams are composed of an H.264 video track and an audio track.
Our pipeline
So our task is to keep the streams synchronized after our processing. Our complete pipeline for each of the SRT streams is the following (a rough sketch of such a pipeline follows the list):
- receive the SRT stream
- decode the video stream
- process the video with our plugin (already implemented)
- passthrough the audio
- reassemble the audio and the processed video stream
- output an NDI stream
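To make the discussion concrete, here is a rough sketch of what I imagine one such per-stream pipeline could look like, written in Python around Gst.parse_launch. It assumes the SRT stream carries MPEG-TS with AAC audio; ourprocessor is a placeholder name for our in-house GPU element, avdec_h264 is just a stand-in for whatever (possibly hardware) decoder we end up using, and the NDI elements (ndisinkcombiner, ndisink) are the ones from gst-plugins-rs, so the exact names may differ. This is a sketch of the topology, not a tested pipeline.

```python
#!/usr/bin/env python3
# Sketch of one per-stream pipeline: SRT in -> decode -> custom processing -> NDI out.
# "ourprocessor" is a placeholder for our in-house element; the NDI elements
# come from gst-plugins-rs and are assumed here, not verified.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

PIPELINE = """
srtsrc uri=srt://0.0.0.0:7001?mode=listener ! tsdemux name=demux

ndisinkcombiner name=combiner ! ndisink ndi-name="cam-1-processed"

demux. ! queue ! h264parse name=parser ! avdec_h264 ! videoconvert
       ! ourprocessor name=processor
       ! videoconvert ! combiner.video

demux. ! queue ! aacparse ! avdec_aac ! audioconvert ! audioresample
       ! combiner.audio
"""

pipeline = Gst.parse_launch(PIPELINE)
pipeline.set_state(Gst.State.PLAYING)

loop = GLib.MainLoop()
try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```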
This project faces multiple challenges, the most relevant being:
- How do we even insert synchronization information into the source SRT streams?
- Supposing we can embed synchronization timestamps in the sources, how can we maintain the synchronization while processing the decoded frames on the GPU?
I will try to outline what we have devised so far and what we are still missing, walking through the decisions that brought us to where we are now.
Challenge 1: Video synchronization strategies
While this is not strictly our problem but rather that of whoever sends us the video streams, it becomes ours once we consider that the synchronization strategy determines what we can do to maintain it.
After many days of research on the web, it appears that the common way of synchronizing video (or audio/subtitle tracks) is by using timecode. For the most part we found articles describing what timecode is in video production and suggesting timecode generators to sync multi-camera setups in order to ease post-production. The only references to remote production and synchronization of SRT/RTMP/RTP streams are in descriptions of commercial software; see the links in the following section:
Commercial software mentioning timecode
- Synchronization of multiple cameras: problem and solution
- Synchronize media workflows with live stream content using timecode | Live Stream API | Google Cloud
- BytePlus | Business growth through superior technology
- Softvelum news: Nimble Streamer, Larix Broadcaster and more: SEI metadata NTP time sync support in Nimble Streamer
- Synchronizing streams by NTP-based timecodes
- Time synchronization in Larix Broadcaster
- Putting timecodes in your outputs - MediaConvert
- https://www.reddit.com/r/VIDEOENGINEERING/comments/nzqr48/comment/h1rsu59/, which in turn recommends professional hardware (Makito X4 Video Encoder | Haivision)
Judging by these resources, it seems that synchronizing several cameras whose streams are transmitted over a network requires two steps:
- the clocks of the cameras must be synchronized, ideally with dedicated devices such as timecode generators connected to each camera, or with NTP if hardware generators are not available
- the timecode of each frame must be transmitted with the video stream
So, assuming the clocks are synchronized, we need timecodes for every frame.
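Just to fix ideas on what "timecodes for every frame" means in practice, here is a tiny sketch of how a per-frame timecode could be derived from an NTP-synchronized wall clock, assuming a constant integer frame rate (25 fps here is only an example; drop-frame rates such as 29.97 would need the usual drop-frame handling, which I am ignoring):

```python
import time

FPS = 25  # assumed constant, integer frame rate

def wallclock_to_timecode(unix_ts: float, fps: int = FPS):
    """Map an (NTP-synchronized) wall-clock timestamp to (hh, mm, ss, ff).

    Cameras sharing a synchronized clock and the same frame rate produce
    the same timecode for frames captured within the same 1/fps interval.
    """
    secs = int(unix_ts)
    frame = int((unix_ts - secs) * fps)  # frame index within the second
    t = time.gmtime(secs)                # one agreed time base (UTC here)
    return t.tm_hour, t.tm_min, t.tm_sec, frame

print(wallclock_to_timecode(time.time()))
```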
As we are dealing with H.264 streams, I found that timecode can be embedded in SEI messages, or we could use the more general VITC timecode. Let us consider only timecode in SEI messages for the sake of this conversation, even if VITC might still be a viable option should the primary approach fail.
Suppose we decide to use Unregistered User Data SEI messages, in which we embed timecodes as 4-byte payloads (one byte each for hours, minutes, seconds and frame number).
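For reference, the H.264 spec requires a user data unregistered SEI message to start with a 16-byte UUID, followed by arbitrary payload bytes, so our 4-byte timecode would sit right after a UUID agreed upon with the sender. A small sketch of packing/unpacking such a payload (the UUID below is made up for the example):

```python
import struct
import uuid

# Arbitrary UUID agreed upon between sender and receiver to recognize
# "our" SEI messages among other unregistered user data (made up here).
TIMECODE_SEI_UUID = uuid.UUID("d2f89a5b-3c41-4e8a-9f10-6b7c0d1e2f30")

def pack_timecode_payload(hh: int, mm: int, ss: int, ff: int) -> bytes:
    """16-byte UUID + 4 bytes: hours, minutes, seconds, frame number."""
    return TIMECODE_SEI_UUID.bytes + struct.pack("4B", hh, mm, ss, ff)

def unpack_timecode_payload(payload: bytes):
    """Return (hh, mm, ss, ff) if this payload carries our UUID, else None."""
    if len(payload) < 20 or payload[:16] != TIMECODE_SEI_UUID.bytes:
        return None
    return struct.unpack("4B", payload[16:20])

payload = pack_timecode_payload(13, 37, 42, 12)
print(unpack_timecode_payload(payload))  # (13, 37, 42, 12)
```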
Question 1
Is this approach for synchronizing videos the correct one? Is there something else commonly used that I did not mention?
Challenge 2: Keep timecode synchronized between input and output
Now, supposing that we use SEI messages to store the timecode information, our pipeline should read the SRT stream, decode the frames while annotating them with their timecode, process the decoded frames with our algorithm on the GPU, and then reapply the original timecode to each frame in the NDI output.
I found some references to SEI and timecode in the GStreamer docs, but I have not been able to grasp how this process works: extracting the timecode from a frame, decoding the frame, processing it with our algorithm on the GPU, re-encoding the frame, and finally putting back the timecode extracted at the start of the pipeline.
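My current, untested mental model, pieced together from the Discourse topics listed below, is based on pad probes: one probe before the decoder parses the timecode out of the still-encoded frame and stashes it keyed by the buffer PTS (assuming the decoder and our element preserve the PTS), and one probe after our GPU element looks the timecode up again and attaches it to the outgoing frame. The extract_timecode_from_sei() helper is hypothetical and is exactly the piece I do not know how to implement properly (scanning the byte stream myself, or reading a GstMeta if the parser already exposes one). The element names refer to the pipeline sketch above.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

# buffer PTS -> (hh, mm, ss, ff); filled before decoding, consumed after processing
pending_timecodes = {}

def extract_timecode_from_sei(encoded_frame: bytes):
    """Hypothetical helper: scan the encoded H.264 access unit for our
    user data unregistered SEI and return (hh, mm, ss, ff), or None."""
    ...  # to be implemented (see the appsink experiment further down)

def on_parsed_buffer(pad, info):
    """Probe on h264parse's src pad: runs on still-encoded frames."""
    buf = info.get_buffer()
    ok, mapinfo = buf.map(Gst.MapFlags.READ)
    if ok:
        tc = extract_timecode_from_sei(bytes(mapinfo.data))
        buf.unmap(mapinfo)
        if tc is not None:
            pending_timecodes[buf.pts] = tc
    return Gst.PadProbeReturn.OK

def on_processed_buffer(pad, info):
    """Probe after our GPU element: the PTS is assumed unchanged, so we can
    look the timecode up again and attach it to the outgoing frame
    (e.g. as a meta, or mapped onto the NDI frame's timecode field)."""
    buf = info.get_buffer()
    tc = pending_timecodes.pop(buf.pts, None)
    if tc is not None:
        pass  # attach tc to the outgoing frame here
    return Gst.PadProbeReturn.OK

# Wiring, using the element names from the pipeline sketch above:
# pipeline.get_by_name("parser").get_static_pad("src").add_probe(
#     Gst.PadProbeType.BUFFER, on_parsed_buffer)
# pipeline.get_by_name("processor").get_static_pad("src").add_probe(
#     Gst.PadProbeType.BUFFER, on_processed_buffer)
```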
Some of the resources I consulted are:
Also, I searched here in the Discourse, and the relevant topics are:
Relevant discourse topics
- https://discourse.gstreamer.org/t/send-frame-number-over-udp/1035/2: here @slomo suggests the idea of using SEI timecodes
- https://discourse.gstreamer.org/t/associating-additional-data-with-a-frame-webrtc/198: here there are some other suggestions about SEI timecodes and how to handle them. As far as I can tell, I should implement a custom element to process the timecodes
- https://discourse.gstreamer.org/t/inserting-sei-metadata-in-h264/840: here the asker is using a function to insert metadata, which gave me the idea of using some video filter to accomplish my task (I should research this further)
- https://discourse.gstreamer.org/t/plugin-to-modify-sei-data/1118: this user is trying to implement a similar feature, but only to modify SEI data in place
- https://discourse.gstreamer.org/t/no-python-gstreamer-api-extract-h264-sei/1175/2: here the asker is trying to extract SEI with Python, and @ndufresne answers to “create a pipeline with a parser and an appsink”, which I do not fully understand (my current interpretation is sketched right after this list). I have actually been able to extract some SEI metadata with pyav, which is the second suggestion.
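For completeness, this is how I currently read the "parser and appsink" suggestion: force byte-stream, AU-aligned output out of h264parse, pull the access units from an appsink, and scan them for SEI NAL units (type 6) ourselves. This is a file-based experiment on a raw H.264 elementary stream (input.h264 is a placeholder), not the live SRT pipeline, and it only locates the SEI NAL units without parsing their payload:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Parse an H.264 elementary stream and hand every access unit to an appsink
# (byte-stream format, so we can look for Annex B start codes ourselves).
pipeline = Gst.parse_launch(
    "filesrc location=input.h264 ! h264parse "
    "! video/x-h264,stream-format=byte-stream,alignment=au "
    "! appsink name=sink sync=false"
)
sink = pipeline.get_by_name("sink")
pipeline.set_state(Gst.State.PLAYING)

while True:
    sample = sink.emit("pull-sample")  # returns None at EOS
    if sample is None:
        break
    buf = sample.get_buffer()
    ok, mapinfo = buf.map(Gst.MapFlags.READ)
    if not ok:
        continue
    data = bytes(mapinfo.data)
    buf.unmap(mapinfo)
    # Walk Annex B start codes and report NAL unit types; type 6 is SEI.
    pos = data.find(b"\x00\x00\x01")
    while pos != -1:
        nal_type = data[pos + 3] & 0x1F
        if nal_type == 6:
            print(f"PTS {buf.pts}: SEI NAL at offset {pos}")
        pos = data.find(b"\x00\x00\x01", pos + 3)

pipeline.set_state(Gst.State.NULL)
```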
Question 2
How can I implement this logic of reading the timecode of a frame in the H.264 input stream, holding on to it somewhere, decoding the frame and processing it on the GPU, and then reattaching the timecode to the frame?
What I obtained so far
I was able to insert timecodes in SEI messages with AWS MediaConvert, following this guide: Putting timecodes in your outputs - MediaConvert. I could insert them both as timecode SEI messages and as User Data Unregistered SEI messages.
With FFmpeg, too, I could insert metadata (but only on the first frame) with the h264_metadata bitstream filter for H.264, and read it back (actually, printing it to the screen rather than reading it programmatically) with FFmpeg's showinfo filter or with ffprobe -show_frames.
With GStreamer I tried to generate the timecode in SEI messages with the following pipeline, but I received many warnings:
gst-launch-1.0 -e videotestsrc ! timecodestamper ! x264enc ! h264parse update-timecode=1 ! matroskamux ! filesink location=timecodestamper_out.mkv
The warnings look like this:
WARNING: from element /GstPipeline:pipeline0/GstH264Parse:h264parse0: Element doesn't implement handling of this stream. Please file a bug.
Additional debug info:
../gst/videoparsers/gsth264parse.c(2938): gst_h264_parse_create_pic_timing_sei (): /GstPipeline:pipeline0/GstH264Parse:h264parse0:
timecode update was requested but VUI doesn't support timecode
Moreover, the produced file does not seem to contain any timecode information.