How to Achieve Immediate A/V Interleaving When Muxing in GStreamer


I am working on an application that muxes audio and video streams for subsequent RTMP or SRT streaming. Both input sources are live and timestamped in NTP time, and both are received via the same rtpbin. However, I am facing a specific issue that I’d like to resolve.

In my GStreamer pipeline, I have the following scenario:

(1) The video source is received an arbitrary number of seconds after the audio source.
(2) The video source has approximately 2 s higher latency than the audio source.

If the audio has then been received for 10 s before the video starts, the output also consists of 10 s of audio-only data before video and audio become interleaved. I understand this might be intentional design, so as not to lose any data; however, I would like the application to discard buffers until I have synchronised A/V that can be muxed and streamed, so that interleaved A/V frames are sent from the start.

Why do I want this? I’ve observed playback and transcoding issues downstream that I suspect are related to this initial buffer accumulation. For instance, some decoders require A/V packets to arrive interleaved from the start, or else they might skip decoding the audio track entirely. I also suspect that, during transcoding (on some other receiver platforms), the A/V offset of the first received frames can cause a sync offset between audio and video in the transcoded file.

How would one go about solving this problem, i.e. delaying the muxing process until there is input on both pads, preferably with synchronised buffers right from the first packet?

I have not yet found any properties on the muxers (flvmux/mpegtsmux in this case) that seem to do the trick, nor any other elements that could help. Currently I have a somewhat hacky solution that throws away buffers around the muxer until data is received on both pads, but that does not feel like a smart way to go.

Any help and ideas on how to achieve this are greatly appreciated.

To reproduce with my pipeline…

  1. Start the restreamer.
  2. Start the receiver.
  3. Start sending audio.
  4. Wait a few seconds.
  5. Start sending video.
  6. Observe the difference in PTS between video and audio frames.


Restreamer

gst-launch-1.0 srtsink name=sink wait-for-connection=false uri="srt://:7001" sync=true \
    mpegtsmux name=muxer alignment=7 \
    rtpbin name=rtpbin \
    ntp-sync=true buffer-mode=synced \
    ntp-time-source=ntp \
    max-rtcp-rtp-time-diff=-1 \
    latency=3700 \
    udpsrc caps=application/x-rtp,media=video,payload=96,encoding-name=H264,clock-rate=90000 port=5000 buffer-size=16777216 name=udpsrc_0 ! rtpbin.recv_rtp_sink_0 \
    udpsrc caps=application/x-rtcp port=5001 name=udpsrc_1 ! rtpbin.recv_rtcp_sink_0 \
    udpsrc caps=application/x-rtp,media=audio,payload=96,encoding-name=MP4A-LATM,clock-rate=48000 port=5004 buffer-size=16777216 name=udpsrc_2 ! rtpbin.recv_rtp_sink_1 \
    udpsrc caps=application/x-rtcp port=5005 name=udpsrc_3 ! rtpbin.recv_rtcp_sink_1 \
    rtpbin. ! queue ! rtph264depay name=video_depay ! queue ! capsfilter caps="video/x-h264" ! h264parse config-interval=1 ! queue ! muxer. \
    rtpbin. ! queue ! rtpmp4adepay name=audio_depay ! queue ! capsfilter caps="audio/mpeg,codec_data=(buffer)1188" ! aacparse ! queue ! muxer. \
    muxer. ! queue ! sink.

Audio sender

gst-launch-1.0 \
	rtpbin name=rtpbin "sdes=application/x-rtp-source-sdes,cname=(string)\"sender\",tool=(string)GStreamer" ntp-time-source=ntp rtcp-sync-send-time=false ntp-sync=true \
	audiotestsrc is-live=true ! audio/x-raw,rate=48000 \
	! faac \
	! rtpmp4apay pt=96 \
	! rtpbin.send_rtp_sink_0 \
	rtpbin.send_rtp_src_0 ! udpsink host= port=5004 \
	rtpbin.send_rtcp_src_0 ! udpsink host= port=5005 sync=false async=false

Video sender

gst-launch-1.0 \
	rtpbin name=rtpbin "sdes=application/x-rtp-source-sdes,cname=(string)\"sender\",tool=(string)GStreamer" ntp-time-source=ntp rtcp-sync-send-time=false ntp-sync=true \
	videotestsrc is-live=true \
	! video/x-raw,width=1920,height=1080,framerate=25/1 \
	! timeoverlay halignment=right valignment=bottom text="Stream time:" shaded-background=true font-desc="Sans, 24" \
	! x264enc \
	! rtph264pay \
	! rtpbin.send_rtp_sink_0 \
	rtpbin.send_rtp_src_0 ! udpsink host= port=5000 \
	rtpbin.send_rtcp_src_0 ! udpsink host= port=5001 sync=false async=false

SRT receiver to file

gst-launch-1.0 -e srtsrc "uri=srt://" ! queue ! filesink location=test.ts

FFmpeg to probe frames

ffprobe -of json -show_frames test.ts > test.json

I haven’t tried it myself, but if you know that the video is lagging by a constant/max amount, then perhaps the avwait element could help?

Thanks @tpm for your suggestion.
I see that the element only accepts raw caps, which is not the best fit here since I’d like to avoid an extra decode/encode step. But it sounds like it could do the trick. If you have other ideas, please let me know. I also tried the streamsynchronizer element in the pipeline before muxing, but ran into issues with not getting any data out of that element; I’m not sure whether it would help with this problem anyway.

If you have an application, you could simply add a buffer pad probe that drops all buffers until all streams have seen data (optionally with minimum-timestamp coordination).
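As a rough illustration, here is the decision logic such a probe callback could consult, in plain Python with the GStreamer plumbing elided. `StartGate` and its method names are invented for this sketch; in a real probe you would feed in the buffer PTS and return `Gst.PadProbeReturn.DROP` whenever `should_pass` is `False`:

```python
# Hypothetical sketch: gating logic for a buffer pad probe.
# "StartGate" is an invented helper; attaching the probe and returning
# DROP/OK to GStreamer is left out so the logic stands on its own.

class StartGate:
    """Drop buffers until every pad has seen data, then only pass
    buffers from the latest first-timestamp onwards, so all streams
    start together (the optional min-timestamp coordination)."""

    def __init__(self, pad_names):
        self.first_ts = {name: None for name in pad_names}

    def should_pass(self, pad_name, ts):
        # Remember the first timestamp seen on this pad.
        if self.first_ts[pad_name] is None:
            self.first_ts[pad_name] = ts
        # Drop while any pad has not produced a buffer yet.
        if any(t is None for t in self.first_ts.values()):
            return False
        # Pass only buffers at or after the latest first-timestamp.
        return ts >= max(self.first_ts.values())

gate = StartGate(["audio", "video"])
assert gate.should_pass("audio", 0.0) is False  # video not seen yet: drop
assert gate.should_pass("audio", 5.0) is False  # still audio-only: drop
assert gate.should_pass("video", 10.0) is True  # both seen, ts >= 10: pass
assert gate.should_pass("audio", 9.9) is False  # before common start: drop
assert gate.should_pass("audio", 10.5) is True  # aligned from here on
```

That way the audio branch never emits the leading audio-only stretch, and both branches start at the same running time.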

That is what I am doing at the moment: using pad probes and dropping buffers until we detect flow on both sink pads. It does feel a bit wrong to do this manually, though. After doing so, I also ended up throwing away buffers after the muxer to ensure that audio and video are “synced” from their respective first frames; it seemed some buffer accumulation was still present in the pipeline when the gates were opened (canal-lock analogy). This also leads to another side-effect: not knowing whether the first output frame will be a keyframe…

I was hoping that there would be either some other element that could be used or a strategy to avoid my issue. But if writing my own element or manually dropping buffers is the way forward here, then that brings peace of mind. I guess some timestamp coordination could help with the side-effects, as you mentioned.
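For the keyframe side-effect, one approach alongside the drop-until-both-pads-have-data probe is to hold the video branch back until the first keyframe before opening its gate. A plain-Python sketch (`KeyframeGate` is an invented name; in GStreamer the `is_keyframe` input would come from checking the buffer for `GST_BUFFER_FLAG_DELTA_UNIT` — delta unit set means it is not a keyframe):

```python
# Hypothetical sketch: additionally hold video until the first keyframe,
# so the muxed output starts with a decodable frame. The is_keyframe
# flag would be derived from GST_BUFFER_FLAG_DELTA_UNIT on the buffer.

class KeyframeGate:
    def __init__(self):
        self.opened = False

    def should_pass_video(self, is_keyframe):
        # Open permanently once the first keyframe arrives.
        if not self.opened and is_keyframe:
            self.opened = True
        return self.opened

g = KeyframeGate()
assert g.should_pass_video(False) is False  # delta frame: drop
assert g.should_pass_video(True) is True    # keyframe: gate opens
assert g.should_pass_video(False) is True   # subsequent frames pass
```

Combining this with the minimum-timestamp coordination would mean opening the video gate only on a keyframe at or after the common start time, and dropping audio until that same point.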