I have a pipeline that connects to an RTSP source. When the user requests a different resolution from the stream, I request a dynamic tee pad on the decoded frames, then resize and create encoders for the requested resolution.
The issue I am having is when my pipeline runs for about 4+ hours and I create a new tee pad, the pipeline seems to freeze/hang. The pipeline is big and has multiple branches.
I added a whole bunch of queue probes to check for “current-level-buffers”. Most of the time they are 0. But when the pipeline hangs, I see that the queue before the decoder “current-level-buffers” goes up to 15. I also see that a pad probe I added on the demuxer which is right after the rtspsrc and before the queue stops reporting that it received buffers.
Here is the static pipeline simplified
hlssink2 name=ingest1 playlist-location=/manifest.m3u8 location=/video/%t_hevc_orig.ts
hlssink2 name=ingest2 playlist-location=/stream.m3u8 location=%t.ts rtspsrc latency=100 location=rtsp://<rtspurl> protocols=0x00000004 ntp-sync=false ntp-time-source=3 buffer-mode=0 do-rtsp-keep-alive=true name=basesrc
basesrc. ! rtph265depay name=depay ! tee name=t
t.! queue name=q_ingest1 ! h265parse config-interval=10 ! ingest1.video
t.! queue ! h265parse config-interval=10 ! ingest2.video
t.! queue name=q_decoder ! h265parse config-interval=1 ! vah265dec name=video_decoder ! tee name=one_decode
one_decode. ! queue name=q_720p ! videorate ! video/x-raw,framerate=15/1 ! vapostproc ! video/x-raw,width=1280,height=720 ! tee name=resized_720p_t
resized_720p_t. ! valve name=aiValve drop=true ! queue leaky=upstream ! videorate ! video/x-raw,framerate=3/1 ! vapostproc ! video/x-raw,width=640,height=384 ! videoconvert ! motionplugin motion-still-path-format=%Y-%m-%dT%H-%M-%SZ.jpg name=motion ! fakesink async=false
Here the q_decoder queue shows that it has 15 buffers right when the pipeline hangs. I am creating a pad from the one_decode tee.
This takes about 4+ hours to reproduce, so enabling GST_DEBUG=5 for everything fills up my disk. Which specific GST_DEBUG categories can I enable to narrow down the problem?
Thanks a lot for your help!!
With a large pipeline, it's nice to add a way in your software to dump the visual pipeline state. For non-interactive tools (since I see you are on Linux), the simplest would be to add a Unix signal handler and call GST_DEBUG_BIN_TO_DOT_FILE_WITH_TS(). That will let you check that your pipeline topology is the one you intended and that no linking accident occurred. It will also help you share visually in here what is going on, as I’m not too sure myself from the reading.
For the rest, you seem to be going in the right direction, which is to first locate the point of the stall. A queue filling up is a sign that something is stalled downstream of it. In complex dynamic pipelines, a leaked blocking pad probe is quite likely. There are also cases where you may have lost a segment event. That could cause sinks to wait forever due to a time skew.
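On the question of narrower logging: a minimal sketch of launching the pipeline process with a targeted GST_DEBUG value instead of a global level 5. The category list and levels are only an assumed starting point for a stall or preroll problem, and the program name and paths passed to `pipeline_command` are hypothetical. GST_DEBUG_FILE redirects the log to a file, and GST_DEBUG_DUMP_DOT_DIR has to be set for the dot dump macro to write anything.

```rust
use std::process::Command;

/// Join (category, level) pairs into the comma-separated form GST_DEBUG expects,
/// e.g. "GST_STATES:4,basesink:5".
fn gst_debug_spec(categories: &[(&str, u8)]) -> String {
    categories
        .iter()
        .map(|(name, level)| format!("{}:{}", name, level))
        .collect::<Vec<_>>()
        .join(",")
}

/// Assumed starting set for a stall: state changes, pad linking, events
/// (segment, caps), sink preroll and queue fill activity.
/// Levels: 4 = INFO, 5 = DEBUG.
fn stall_categories() -> Vec<(&'static str, u8)> {
    vec![
        ("GST_STATES", 4),
        ("GST_PADS", 4),
        ("GST_EVENT", 5),
        ("basesink", 5),
        ("queue_dataflow", 5),
    ]
}

/// Prepare (but do not spawn) the pipeline process with targeted logging.
/// `program`, `log_file` and `dot_dir` are placeholders chosen by the caller.
fn pipeline_command(program: &str, log_file: &str, dot_dir: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env("GST_DEBUG", gst_debug_spec(&stall_categories()))
        .env("GST_DEBUG_FILE", log_file)
        .env("GST_DEBUG_NO_COLOR", "1")
        .env("GST_DEBUG_DUMP_DOT_DIR", dot_dir);
    cmd
}
```

In a container the same four variables can simply be passed as environment variables at start. If the log still grows too fast over 4+ hours, dropping GST_EVENT or lowering one level at a time is a reasonable next step.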
Hi Nicolas,
Thanks a lot for your response!
I added a signal as you suggested and generated a dot file when the pipeline pauses. I could not see much from the diagram as to what could be wrong. The pipeline is too big to be uploaded in this forum. I just cut the portion where the queue builds up and stalls.
I request a tee pad from the one_decode tee and logged pad probes on the src pads of all the elements downstream of it. enc_1620_q → enc_1620_videorate → … enc_h264timestamper keep receiving buffers, and then suddenly they just stop.
As you mentioned, I tried to check for a leaked pad probe. The only place where I add a blocking probe is when the tee pad is created. I added logs around the place where the probe is removed, and I see that all of them are removed.
I have built my pipeline in Rust, and here is how I am adding and removing probes.
// request a new pad from video
let tee_video_pad = one_decode.request_pad_simple("src_%u").unwrap();
log::debug!("got a new video pad from one_decode for res={} padname={}", height, tee_video_pad.name());
let video_block = tee_video_pad.add_probe(PadProbeType::BLOCK_DOWNSTREAM, |_pad, _info| {
    PadProbeReturn::Ok
}).unwrap();
// get the video ghost pad and link with enc bin
let video_sink_pad = enc_bin.static_pad("video").unwrap();
// use link_full to link the pads, needed for sauron as the tee src pad returns empty caps
tee_video_pad.link_full(&video_sink_pad, gst::PadLinkCheck::TEMPLATE_CAPS).unwrap();
// get encoded audio
let audio_t = pipe_bin.by_name("audio_t").unwrap();
let tee_audio_pad = audio_t.request_pad_simple("src_%u").unwrap();
let _tee_audio_vod_pad = audio_t.request_pad_simple("src_%u").unwrap();
log::debug!("got a new audio pad from one_decode for res={} padname={}", height, tee_audio_pad.name());
let audio_block: PadProbeId = tee_audio_pad.add_probe(PadProbeType::BLOCK_DOWNSTREAM, |_pad, _info| {
    PadProbeReturn::Ok
}).unwrap();
let audio_sink_pad = enc_bin.static_pad("audio").unwrap();
tee_audio_pad.link(&audio_sink_pad).unwrap();
enc_bin.call_async(move |bin| {
    if bin.sync_state_with_parent().is_err() {
        log::error!("could not set the enc_bin_{} to playing", height);
    }
    tee_video_pad.remove_probe(video_block);
    tee_audio_pad.remove_probe(audio_block);
    log::info!("add_encoding_ladder enc_bin_{} is to playing with audio. REMOVED BLOCKING PROBE", height);
});
Can you also help me understand what you mean by a lost segment? How do I detect it?
I appreciate your help!
The pipeline you shared is unreadable, best to share in SVG form, so we can zoom.
I think before digging into the hypothesis of timing, you should catch the hang in a debugger, and dump all threads backtrace. From there you can find what the pipeline is waiting on, if there is anything suspicious, like waiting on a clock entry, then investigate further.
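A minimal sketch of grabbing those backtraces without an interactive session, assuming gdb is installed next to the process and may attach to it (inside Docker this may need the SYS_PTRACE capability). The helper names are hypothetical; the gdb flags used are `-p` to attach, `-batch` to exit when done, and `-ex` to run a command.

```rust
use std::io;
use std::process::Command;

/// Arguments for a non-interactive gdb run that attaches to `pid`,
/// prints a backtrace of every thread and then detaches.
fn gdb_backtrace_args(pid: u32) -> Vec<String> {
    vec![
        "-p".to_string(),
        pid.to_string(),
        "-batch".to_string(),
        "-ex".to_string(),
        "thread apply all bt".to_string(),
    ]
}

/// Run gdb against the hung process and return whatever it printed.
/// Fails with an io::Error if gdb cannot be started.
fn dump_backtraces(pid: u32) -> io::Result<String> {
    let output = Command::new("gdb").args(gdb_backtrace_args(pid)).output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

In the output, streaming threads parked in a condition wait underneath gst_pad_push would point toward a blocking probe, while a sink thread sitting in gst_base_sink_wait_preroll or gst_clock_id_wait would point toward preroll or timing. Installing debug symbols for GStreamer makes the frames much easier to read.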
Hi Nicolas,
I will try to add it in a debugger. My pipeline runs in a docker container in a linux environment. I will update you on my findings. I have not been able to upload SVG. Here is a link to my gist. pipeline_stalled.svg
Thanks a lot for your help
In that branch, you have only two sinks that are async=TRUE (the default): a udpsink and a fakesink. With this many tees, pre-rolling can be tedious, and perhaps does not make much sense for a re-streamer. It's cheap to test at least: disable pre-roll completely by setting async=false on the remaining sinks.
Hi Nicolas,
Looks like hlssink2 only has an async-handling=true flag. When I set that, the pipeline hangs every time I create a tee pad. I tried setting this on all the hlssink2 elements that I create but got the same result every time.
I then removed that flag and set async=true on the fakesink and the udpsink. But what you said might be true. I added probes on all the elements downstream of the tee (which are still in the PAUSED state). They all seem to get one buffer/frame and then it all freezes, and the elements stay in the PAUSED state.
Also, I was under the impression that setting async=FALSE keeps the state change from depending on preroll.
So it seems like it might be the prerolling that is failing? But I am not able to determine which of the sink elements is causing this. Is async-handling the same as the async flag?
If I have to check for prerolling issues, can you please let me know which debug logs I should look at to see which sink is causing it?
I truly appreciate your guidance—thank you!