Multiple SRT stream synchronization with timecode

Introduction and context

Hi everyone, I will try to explain my problem as clearly as possible. Thank you to everyone who finds the time to help me.

Context

For a project at work I have to develop a program that receives multiple live SRT streams of the same scene as input, processes them in real time with a custom element built in-house, and then outputs multiple NDI streams.

The cameras are placed in the same physical location and all send their SRT streams over the Internet to another location, where our machine resides. Our machine is a server with enough processing power.

Primary requirement for our implementation

The must-have requirement is that our processing pipeline preserves the synchronization between frames of the input SRT streams. In particular, we must keep the frame-by-frame synchronization information.

Stream Specification

Each SRT stream is composed of an H.264 video track and an audio track.

Our pipeline

So our task is to keep the streams synchronized after our processing. Our complete pipeline for each SRT stream is the following:

  • receive the SRT stream
  • decode the video stream
  • process the video with our plugin (already implemented)
  • pass the audio through unchanged
  • reassemble the audio and the processed video stream
  • output an NDI stream

This project faces multiple challenges, the most relevant being:

  1. How do we even insert synchronization information in the source SRT streams?
  2. Supposing we can embed synchronization timestamps in the sources, how can we maintain the synchronization while processing the decoded frames on the GPU?

I will try to outline what we devised for now, and what we are still missing, walking through the decisions that brought us where we are now.

Challenge 1: Video synchronization strategies

Strictly speaking, this is a problem for whoever sends us the video streams rather than for us. It still matters to us, because the synchronization strategy chosen at the source determines what we can do to maintain it.

After many days of research on the web, it appears that the common way of synchronizing videos (or audio/subtitle tracks) is by using timecode. Most of what we found are articles describing what timecode is in video production, which suggest using timecode generators to sync multi-camera setups and ease post-production. The only references to remote production and synchronization of SRT/RTMP/RTP streams are in commercial software descriptions, for example see the links in the following section:

Commercial software mentioning timecode

Judging by the resources in these links, it seems that synchronizing several cameras whose streams are transmitted over a network requires two steps:

  1. the clocks of the cameras must be synchronized, ideally with dedicated devices such as timecode generators connected to each camera, or with NTP if hardware generators are not available.
  2. the timecode of each frame must be transmitted with the video stream.

So, assuming the clocks are synchronized, we need timecodes for every frame.
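To make the idea of "a timecode for every frame" concrete, here is a minimal sketch of how a sender with an NTP-synchronized clock might derive a timecode from the time of day. The function name `timecode_from_utc` is hypothetical, and the sketch assumes an integer frame rate and non-drop-frame counting; fractional rates such as 29.97 fps would need drop-frame handling that is not shown here.

```python
from datetime import datetime, timezone


def timecode_from_utc(now: datetime, fps: int = 25) -> tuple[int, int, int, int]:
    """Map a time of day to a non-drop-frame (hours, minutes, seconds, frames) timecode.

    Assumption: `now` is timezone-aware and comes from a clock that is
    synchronized across all cameras (for example via NTP).
    """
    now = now.astimezone(timezone.utc)
    # Index of the frame within the current second, rounded down.
    frames = now.microsecond * fps // 1_000_000
    return (now.hour, now.minute, now.second, frames)


if __name__ == "__main__":
    hours, minutes, seconds, frames = timecode_from_utc(datetime.now(timezone.utc))
    print(f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}")
```

If every camera computes the timecode this way from a synchronized clock, frames captured at the same instant should carry the same value, up to the accuracy of the clock synchronization.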

As we are dealing with H.264 streams, I found that timecode can be embedded in SEI messages, or we can use the more general VITC timecode. For the sake of this conversation let us consider only timecode in SEI messages, even if VITC might still be a viable fallback if we fail to implement the primary option.

Suppose we decide to use User Data Unregistered SEI messages, in which we embed each timecode as a 4-byte string (one byte each for hours, minutes, seconds and frame number).
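As an illustration of that 4-byte layout, here is a rough sketch of packing and unpacking the payload. A User Data Unregistered SEI payload starts with a 16-byte UUID followed by the user bytes; the UUID value and the function names below are made up for this example, and wrapping the payload into an actual SEI NAL unit (payload type and size, emulation prevention bytes) is left out.

```python
import struct
import uuid

# Hypothetical identifier chosen only for this sketch; a real deployment
# would agree on its own UUID between sender and receiver.
TIMECODE_UUID = uuid.UUID("a5b1c2d3-e4f5-4a6b-8c7d-9e0f1a2b3c4d")


def pack_timecode_payload(hours: int, minutes: int, seconds: int, frames: int) -> bytes:
    """Build the SEI payload: 16-byte UUID followed by one byte per timecode field."""
    limits = (("hours", hours, 24), ("minutes", minutes, 60),
              ("seconds", seconds, 60), ("frames", frames, 256))
    for name, value, limit in limits:
        if not 0 <= value < limit:
            raise ValueError(f"{name} out of range: {value}")
    return TIMECODE_UUID.bytes + struct.pack("4B", hours, minutes, seconds, frames)


def unpack_timecode_payload(payload: bytes):
    """Return (hours, minutes, seconds, frames), or None if this is not our payload."""
    if len(payload) < 20 or payload[:16] != TIMECODE_UUID.bytes:
        return None
    return struct.unpack("4B", payload[16:20])


if __name__ == "__main__":
    payload = pack_timecode_payload(13, 5, 9, 12)
    print(payload.hex(), unpack_timecode_payload(payload))
```

The UUID check lets the receiver ignore unregistered SEI messages written by other tools, for example encoder version strings.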

Question 1

Is this the correct approach for synchronizing videos? Is there another commonly used method that I did not mention?

Challenge 2: Keep timecode synchronized between input and output

Now, supposing that we use SEI messages to store timecode information, our pipeline should read the SRT stream, decode the frames while recording their timecodes, process the decoded frames with our algorithm on the GPU, and then reapply the original timecode to each frame in the NDI output.

I found some references to SEI and timecode in the GStreamer docs, but I have not been able to grasp how the whole process works: extracting the timecode from a frame, decoding the frame, processing it with our algorithm on the GPU, re-encoding the frame, and finally putting back the timecode that was extracted at the start of the pipeline.

Some of the resources I consulted are:

Also, I searched here in the Discourse, and the relevant topics are:

Relevant discourse topics

Question 2

How can I implement this logic: read the timecode of a frame in the H.264 input stream, keep it on hold somewhere, decode the frame and process it on the GPU, and then reattach the timecode to the processed frame?
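To show the kind of logic I have in mind, here is a library-agnostic sketch of a side table that holds timecodes while the frames are being processed. It is not GStreamer code: the class `TimecodeStore` and its methods are hypothetical, and it assumes that the presentation timestamp (PTS) of a frame passes unchanged through decoding and GPU processing, so that it can be used as the lookup key. Keying by PTS rather than using a plain FIFO matters when the stream has B-frames, since decode order then differs from presentation order.

```python
from collections import OrderedDict


class TimecodeStore:
    """Holds timecodes read before decoding until the processed frame comes back."""

    def __init__(self, max_entries: int = 256):
        self._by_pts = OrderedDict()
        self._max_entries = max_entries

    def remember(self, pts: int, timecode: tuple) -> None:
        """Call when a timecode is parsed from the compressed stream."""
        self._by_pts[pts] = timecode
        # Drop the oldest entries so frames discarded downstream do not leak memory.
        while len(self._by_pts) > self._max_entries:
            self._by_pts.popitem(last=False)

    def reattach(self, pts: int):
        """Call after processing; returns the stored timecode, or None if unknown."""
        return self._by_pts.pop(pts, None)


if __name__ == "__main__":
    store = TimecodeStore()
    # Timecodes arrive in decode order (PTS values in nanoseconds, 25 fps).
    store.remember(0, (13, 5, 9, 12))
    store.remember(80_000_000, (13, 5, 9, 14))
    store.remember(40_000_000, (13, 5, 9, 13))
    # Processed frames come back in presentation order.
    for pts in (0, 40_000_000, 80_000_000):
        print(pts, store.reattach(pts))
```

Whether the PTS really survives our custom element unchanged is exactly the part I am unsure about, so this only illustrates the bookkeeping.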

What I obtained so far

I was able to insert timecodes with AWS MediaConvert, following the guide "Putting timecodes in your outputs - MediaConvert". This worked both with timecode SEI messages and with User Data Unregistered SEI messages.

With FFmpeg I could also insert metadata, but only on the first frame, using the h264_metadata bitstream filter. I could read it back (actually just printing it to the screen, not reading it programmatically) with FFmpeg and the showinfo bitstream filter, or with ffprobe -show_frames.

With GStreamer I tried to generate the timecode in SEI messages with the following pipeline, but I received many errors:

gst-launch-1.0 -e videotestsrc ! timecodestamper ! x264enc ! h264parse update-timecode=1 ! matroskamux ! filesink location=timecodestamper_out.mkv

The errors are like this:

WARNING: from element /GstPipeline:pipeline0/GstH264Parse:h264parse0: Element doesn't implement handling of this stream. Please file a bug.
Additional debug info:
../gst/videoparsers/gsth264parse.c(2938): gst_h264_parse_create_pic_timing_sei (): /GstPipeline:pipeline0/GstH264Parse:h264parse0:
timecode update was requested but VUI doesn't support timecode

and the produced file does not seem to contain any timecode information.