Sync woes with mpeg-ts/srt and packet/frame drops

Hello hackers!

We ingest streams from multiple sources that all need to be in sync, preferably at the frame level.
To achieve this we make use of the frame capture time, embedded in an SEI message of the H.264/H.265 stream.

Our pipeline looks roughly like this, and uses an NTP clock:
srtsrc → tsdemux → rtpbin → a journey in k8s for post-processing.

To get sync, again down at the frame level, a step in this pipeline re-timestamps the buffers before they reach the rtpbin, so that the RTP NTP-64 header extension / RTCP carries a PTS with the absolute capture time of the frame.

Sort of using these two methods:

use anyhow::{anyhow, Error};
use gstreamer as gst;
use gst::prelude::*;

pub fn calculate_spiideo_sei_diff(
    pad: &gst::Pad,
    buffer: &gst::Buffer,
) -> Option<gst::ClockTimeDiff> {
    // get_spiideo_sei_ntp_time is our own helper that parses the NTP
    // capture time out of the SEI message.
    let sei_time = get_spiideo_sei_ntp_time(buffer)?;
    let buffer_timestamp = get_ntp_time_from_pad(pad, buffer).ok()?;
    let buffer_ntp_ns = buffer_timestamp.nseconds() as i64;
    let sei_ntp_ns = sei_time.nseconds() as i64;
    // diff = buffer NTP time - SEI capture time
    buffer_ntp_ns.checked_sub(sei_ntp_ns)
}

pub fn get_ntp_time_from_pad(
    pad: &gst::Pad,
    buffer: &gst::BufferRef,
) -> Result<gst::ClockTime, Error> {
    let pts = buffer
        .pts()
        .ok_or_else(|| anyhow!("failed to get pts"))?;
    let event = pad
        .sticky_event::<gst::event::Segment>(0)
        .ok_or_else(|| anyhow!("failed to get segment"))?;
    let segment = event.segment();

    let base_time = pad
        .parent_element()
        .ok_or_else(|| anyhow!("failed to get parent element"))?
        .base_time()
        .ok_or_else(|| anyhow!("failed to get base time"))?;

    // PTS -> running time, then add the base time: since the pipeline
    // runs on an NTP clock, the result is an absolute NTP time.
    let running_time = segment
        .downcast_ref::<gst::ClockTime>()
        .ok_or_else(|| anyhow!("failed to downcast segment"))?
        .to_running_time(pts)
        .ok_or_else(|| anyhow!("failed to get running time"))?;

    let ntp_time = running_time.checked_add(base_time).ok_or_else(|| {
        let msg = format!("ntp time overflow: running: {running_time} base: {base_time}");
        tracing::warn!("{msg}");
        anyhow!(msg)
    })?;

    Ok(ntp_time)
}

Then we use this to alter the PTS before sending out the RTP.
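The shift itself is just nanosecond arithmetic; a minimal standalone sketch of the idea (`apply_sei_diff` is a hypothetical helper, not our exact code):

```rust
/// Hypothetical helper: shift a PTS (in ns) by the SEI diff (in ns),
/// where diff = buffer NTP time - SEI capture time. Subtracting the
/// diff maps the buffer timestamp back to the absolute capture time.
fn apply_sei_diff(pts_ns: u64, sei_diff_ns: i64) -> Option<u64> {
    if sei_diff_ns >= 0 {
        pts_ns.checked_sub(sei_diff_ns as u64)
    } else {
        pts_ns.checked_add(sei_diff_ns.unsigned_abs())
    }
}
```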

This works. Most of the time. We get a good inter-stream sync. But!

When there is packet loss, we get woes! The SEI diff starts going wild and grows by whole seconds! I suspect this is somehow connected to mpeg-ts skew correction(?) but I am not sure. I feel a bit lost.

I tried to use mpegtslivesrc hoping it would help, but then I am forced onto a monotonic clock, right? And I am not sure how I would go about re-timestamping using the SEI then …

I also tried to set skew-corrections=false and that made the diff stable … but then I seemed to get issues downstream when trying to create fmp4 items of the stream … I have not dug deep.

Does anyone have any insight? Is our setup sound? Or are there inherent issues with the way we re-timestamp? Any tips on how to improve?

Thanks
Jonas

I dunno, though, this is getting me confused: I realize that I only see this “drift” in the diff with H.265 … does that make sense?

Discussion on Matrix about this for posterity:

slomo: so you have absolute NTP times in SEI for the H264/5 frames, no other streams (e.g. audio), and you want to send out things according to those NTP times via RTP? for RTP, do you need RTCP too, do you make use of any time-related header extensions? is your pipeline also using that same NTP clock (or could it)?

jonasdn: There is audio as well, the approach we have taken is to get the diff using the sei of video and applying it to audio as well, we use the ntp-64 header for instant sync, like in your blog.

We use NTP clock for all pipeline involved … in our distributed network of pipelines

and on the camera sending the SRT

slomo: what i’d do: no mpegtslivesrc (no need to estimate the PCR, you know it’s equal to the NTP clock), tsdemux with skew-corrections=false (as MPEG-TS is timestamped according to NTP == pipeline clock), after h264parse set the exact NTP time (- pipeline base time) as timestamps for the video (remember the diff, that diff should stay mostly constant), apply the same diff to the audio timestamps (need to make sure to block audio until you get the diff), then send things out via RTP (if you need RTCP make sure to configure it so that the RTCP NTP times are based on the buffer clock times)
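In code, that recipe looks roughly like this (a sketch with made-up names, all times in nanoseconds; not a real implementation):

```rust
// Illustrative sketch: retimestamp video to the exact SEI NTP time
// (minus the pipeline base time), remember the diff, and shift audio
// by the same diff. Audio stays blocked until the diff is known.
struct SeiRetimestamper {
    // ntp_capture_time - original_pts, established by the first video frame
    diff_ns: Option<i64>,
}

impl SeiRetimestamper {
    fn on_video(&mut self, pts_ns: i64, sei_ntp_ns: i64, base_time_ns: i64) -> i64 {
        // Remember the diff; it should stay mostly constant.
        self.diff_ns.get_or_insert(sei_ntp_ns - pts_ns);
        // New PTS: exact NTP capture time, minus the pipeline base time.
        sei_ntp_ns - base_time_ns
    }

    fn on_audio(&self, pts_ns: i64, base_time_ns: i64) -> Option<i64> {
        // None = keep audio blocked until video has established the diff.
        let diff = self.diff_ns?;
        Some(pts_ns + diff - base_time_ns)
    }
}
```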

that will work fine as long as the sender is timestamping data correctly, and unless you run into one of the tsdemux bugs (see @bilboed’s talk at the conference, also for the solution to that) that sometimes make the tsdemux timestamps become completely wrong after packet loss or other kinds of resyncs

jonasdn: thx slomo !

jonasdn: @bilboed: so what patches / gst version do I need to be safe(+)?

(+) as far as we know and pray

slomo: you say skew-corrections=false broke things for you with fmp4. how does fmp4 come into play here?

also one thing you might have to check / adjust is the pipeline latency. as you’re setting just the NTP timestamps you’re going to have all running times late when ignoring latency. tsdemux adds 700ms latency so that might be enough, but depending on your setup you might have to explicitly configure the latency on the pipeline (you can disable automatic latency configuration and just set a value). gstreamer can’t calculate the correct latency as that depends on how long it takes from timestamping on the sender side to the sinks after rtpbin
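Pinning the pipeline latency as suggested could look something like this (untested sketch using the gstreamer crate; the 700 ms value is just a starting point to tune):

```rust
use gstreamer as gst;
use gst::prelude::*;

// Pin the pipeline latency instead of relying on automatic latency
// calculation; GStreamer cannot compute the sender-to-sink delay here.
// Passing gst::ClockTime::NONE would restore the automatic behaviour.
fn pin_latency(pipeline: &gst::Pipeline) {
    pipeline.set_latency(gst::ClockTime::from_mseconds(700));
}
```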

slomo:

so what patches / gst version do I need to be safe(+)?

git main is the safest that currently exists, but the underlying problem is not solved (only worked around to some degree) and requires a complete re-architecting of the demuxer (see the last slide)

jonasdn: yeah, sorry, might be confusion / frustration from sitting with this issue for a while … I am not sure it did; I just stressed the poor Axis camera with a lot of bandwidth, so I don’t know the root cause anymore. Using sane bandwidth it seems to work fine with skew-corrections=false … fmp4 comes into play in our mad k8s circus of pods and nodes, where some persist the RTP stream to fmp4 to be able to do HLS things later

slomo: ok so something to look into separately and maybe not even a problem anymore?

jonasdn: that is true, when all this blows up in our faces later, we might come calling to one of your companies to save us, but for now we are hanging in

bilboed: What Sebastian said. I’m currently working on a reimplementation of the demuxer, but it won’t be ready for another few months