H264 via RTP through isofmp4mux DTS/PTS issues

I have a production application running on GStreamer 1.22.12 that we are switching from the older mp4mux to isofmp4mux. Our pipeline looks like: rtpbin -> rtph264depay -> h264parse -> isofmp4mux -> filesink (there’s an audio branch, but it doesn’t matter in this case). The isofmp4mux immediately emits an error message that DTS is missing, so, per another thread here on Discourse and the GStreamer patch notes, I inserted an h264timestamper after the parser to see if that would resolve the timestamping issues. Instead, my log floods with this:

2024-05-02T20:06:37.423648Z 22705 0x7f1fc0001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:2240:gstfmp4::fmp4mux::imp:b82752c0/9de2eeef/a3653636 Draining at timeout
2024-05-02T20:06:37.423653Z 22705 0x7f1fc0001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1875:gstfmp4::fmp4mux::imp:b82752c0/9de2eeef/a3653636 Starting to drain at 0:00:14.009291666 (fragment start 0:00:14.009291666, fragment end 0:00:24.009291666, chunk start 0:00:14.009291666, chunk end --:--:--.---------)
2024-05-02T20:06:37.423659Z 22705 0x7f1fc0001aa0 WARN                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1936:gstfmp4::fmp4mux::imp:b82752c0/9de2eeef/a3653636 Don't have a complete GOP for the first stream on timeout in a live pipeline

The resulting file plays back fine, so I could simply mute the logs for fmp4mux but I don’t want to accidentally paper over an issue if there is one.

I have also tried moving the timestamper ahead of the parser, but that made no difference in the behavior; seemingly every buffer was triggering this trio of messages.
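For reference, the video branch is roughly equivalent to this sketch (our real application builds the pipeline programmatically; the udpsrc port and caps here are placeholders, and error handling is trimmed down):

  #include <gst/gst.h>

  /* Sketch of the video branch only; the audio branch is omitted. */
  GError *error = NULL;
  GstElement *pipeline = gst_parse_launch (
      "rtpbin name=rtpbin "
      "udpsrc port=5000 caps=\"application/x-rtp,media=video,clock-rate=90000,"
      "encoding-name=H264,payload=96\" ! rtpbin.recv_rtp_sink_0 "
      "rtpbin. ! rtph264depay ! h264parse ! h264timestamper "
      "! isofmp4mux ! filesink location=recording.mp4",
      &error);

  if (error != NULL)
    g_error ("Failed to build the pipeline: %s", error->message);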

Does anyone have experience dealing with this new element in a pipeline like this (H264)?

This is a problem with your incoming stream / the configuration of fmp4mux.

Every fMP4 fragment should ideally start with a keyframe, but your incoming stream does not have a keyframe within fragment-duration. Because this is a live pipeline, the muxer times out at that point, tries to drain, notices that it does not even have a single complete GOP (group of pictures), and complains. It should still produce valid files, but it warns you that, because of this, the latency configuration of the pipeline cannot be followed.

Old mp4mux did not care about such details and didn’t handle latency in live pipelines at all.

We have a blocking probe that drops everything until a buffer is present that is a keyframe:

  if ((info->type & GST_PAD_PROBE_TYPE_BUFFER) && (buffer = gst_pad_probe_info_get_buffer (info))) {
    // if the delta unit flag is set, then it's NOT a keyframe.
    is_keyframe = !GST_BUFFER_FLAG_IS_SET (buffer, GST_BUFFER_FLAG_DELTA_UNIT);
  }
  // other logic confirming we want to start with this keyframe, then returns
  // GST_PAD_PROBE_REMOVE rather than _DROP

Is there an example somewhere showing this expectation? This probe logic has never caused us an issue on the old muxer.

Thank you for your insight.

I’ve been combing the mediasoup docs for a while now and have not found how to tune the keyframe periodicity in that library. However, one of my colleagues found a ticket against our application from another team that essentially states our generated files have abnormally long keyframe intervals (the ticket is from a year ago :grimacing:).

Do you by chance have experience with mediasoup’s SFU?

mediasoup (very) likely does not change anything about the codec parameters at all. The distance between keyframes is decided entirely by the sender, before mediasoup is ever involved.

This only makes sure that the very first frame is a keyframe. What is required here is that every fragment the muxer produces starts with a keyframe, so you need to ensure that the encoder on the other side regularly inserts keyframes at the right positions to cover the fragment-duration property on the muxer.

I see.

I was working towards requesting keyframes using fragment-duration as a sort of timeout (make the request every fragment-duration minus some delta, where the delta compensates for the round-trip latency). It sounds like you’re saying that won’t work because the keyframe may or may not land exactly at the start of the fragment. Am I understanding you correctly?
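For context, the shape of what I had in mind is roughly this (just a sketch: gst_video_event_new_upstream_force_key_unit is the standard API for requesting a keyframe upstream, but whether the request actually reaches the browser depends on the RTCP feedback path (PLI/FIR) through mediasoup, and mux_video_sinkpad is a placeholder for a pad we already hold a reference to):

  #include <gst/video/video.h>

  /* Sketch: periodically push an upstream force-key-unit event from the
   * muxer's video sink pad so it travels back towards the sender. */
  static gboolean
  request_keyframe_cb (gpointer user_data)
  {
    GstPad *mux_video_sinkpad = user_data;

    /* running-time NONE means "as soon as possible"; all-headers asks for
     * SPS/PPS to be resent with the keyframe. */
    GstEvent *event = gst_video_event_new_upstream_force_key_unit (
        GST_CLOCK_TIME_NONE, TRUE, 0);
    gst_pad_push_event (mux_video_sinkpad, event);

    return G_SOURCE_CONTINUE; /* keep the periodic timeout running */
  }

  /* e.g. for fragment-duration = 10 s with ~1 s of slack:
   * g_timeout_add_seconds (9, request_keyframe_cb, mux_video_sinkpad); */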

Thank you again for all your help!

As long as the keyframes land approximately at those times it will be fine. You should be able to configure some slack on the muxer via the latency property. If a keyframe arrives later than that, though, you would get this warning.


I added a non-blocking probe to verify how often my keyframes are coming through; they’re 1 second apart. I then configured fragment-duration to 1e9 (1 second, since the property is in nanoseconds), and as soon as the stream begins, the muxer posts an error on the bus:

Panicked: assertion failed: timeout || all_eos || stream.sinkpad.is_eos() || stream.queued_gops.get(1).map(|gop| gop.final_earliest_pts) == Some(true) || settings.chunk_duration.is_some();

(I edited it to remove all the backslash escapes so it’s a bit more readable.)

I have tried setting latency to half a second as well, but it still instantly quits with this error.

On the other hand, if I leave fragment-duration at 10 seconds and latency at 2 seconds, and let the producer keep pushing keyframes every second, I get the same flood of log messages as before, despite keyframes absolutely landing within the 10 ± 2 second window (they’re steadily arriving every second according to my log).
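For reference, the non-blocking probe that produces that log is essentially this (a sketch; the static variable and the default debug category are simplifications of what our application actually does):

  /* Sketch: log the spacing between keyframes; attached to the h264parse src
   * pad with gst_pad_add_probe (pad, GST_PAD_PROBE_TYPE_BUFFER,
   * keyframe_interval_probe, NULL, NULL). */
  static GstPadProbeReturn
  keyframe_interval_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
  {
    static GstClockTime last_keyframe_pts = GST_CLOCK_TIME_NONE;
    GstBuffer *buffer = gst_pad_probe_info_get_buffer (info);

    if (buffer != NULL && !GST_BUFFER_FLAG_IS_SET (buffer, GST_BUFFER_FLAG_DELTA_UNIT)) {
      if (GST_CLOCK_TIME_IS_VALID (last_keyframe_pts))
        GST_INFO ("keyframe interval: %" GST_TIME_FORMAT,
            GST_TIME_ARGS (GST_BUFFER_PTS (buffer) - last_keyframe_pts));
      last_keyframe_pts = GST_BUFFER_PTS (buffer);
    }

    /* Never drop or block anything, just observe. */
    return GST_PAD_PROBE_OK;
  }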

I’m trying to review the in-tree examples to see if there’s something I’m missing here, but I’m new to Rust, so I may well be overlooking something.

Any ideas what might be wrong?

That’s a bug, and was reported here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs/-/issues/521

Do you maybe have a way to reproduce this? That would make it much easier to fix.

That’s too bad. Unfortunately no; I only see this when running the application in question live against a browser-based producer. When I run my unit tests, which are quite a lot like that person’s pipeline, I get none of the problems I’ve been having with the live application.

For example, I have a separate issue with the older qtmux producing non-fragmented files from audio and video: in the live application, sending an EOS down the front of the pipeline segments that lead to each of the muxer’s pads often does not result in the EOS passing through the muxer and on to my filesink. Instead, it deadlocks down in the GstAggregator SRC_LOCK (if I recall correctly) and all of the intermediate pads are stuck in their chain functions.

I have unit tests replicating that exact scenario, with video and audio test source elements followed by RTP payloaders, capsfilters, etc., to keep everything as close to the deployed application as I can. Those tests pass and never deadlock on EOS, yet the runtime application locks up the vast majority of the time. It used to work fine and nothing in the GStreamer application changed; it’s as if a client browser update or an update to some intermediate library in a service broke the whole functionality of the muxer for us.

This current ticket, switching over to isofmp4mux, was next on deck for me, so here I am. I can get good recordings, but I basically have to mute the log for the element, because it’s unclear how to replicate whatever the correct environmental setup for it might be. From the conference last year, it sounded like this would be a drop-in replacement for qtmux.

Following the tests, I set chunk-duration equal to fragment-duration and latency to half of that (so 1, 1, and 0.5 seconds). Now I get no warnings, but there is constant log spam at INFO level: “Draining at timeout” and “Starting to drain”. Should those logs be at a lower level? The base class has the related timeout log at DEBUG. More to the point: should these INFO messages concern me, other than making my logs completely unusable unless I mute them via GST_DEBUG=fmp4mux:WARNING,...?
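For concreteness, that combination is just this (a sketch; mux stands for the isofmp4mux instance in our application, and all three properties take nanoseconds):

  /* 1 s fragments and chunks, 0.5 s of latency slack on the aggregator. */
  g_object_set (mux,
      "fragment-duration", (guint64) GST_SECOND,
      "chunk-duration", (guint64) GST_SECOND,
      "latency", (guint64) (GST_SECOND / 2),
      NULL);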

BTW, the “Starting to drain” / “Draining at timeout” pairs are coming at roughly 100 microsecond intervals with the above settings. It’s as if it enters that loop in drain() with timeout stuck at true (despite it being set to false after the comment // Only the first iteration is considered a timeout).

2024-05-10T18:03:31.870814Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1875:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Starting to drain at 0:00:13.158503856 (fragment start 0:00:12.791510240, fragment end 0:00:12.791510241, chunk start 0:00:13.158503856, chunk end 0:00:13.158503857)
2024-05-10T18:03:31.870908Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:2240:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Draining at timeout
2024-05-10T18:03:31.870952Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1875:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Starting to drain at 0:00:13.158503856 (fragment start 0:00:12.791510240, fragment end 0:00:12.791510241, chunk start 0:00:13.158503856, chunk end 0:00:13.158503857)
2024-05-10T18:03:31.871016Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:2240:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Draining at timeout
2024-05-10T18:03:31.871058Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1875:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Starting to drain at 0:00:13.158503856 (fragment start 0:00:12.791510240, fragment end 0:00:12.791510241, chunk start 0:00:13.158503856, chunk end 0:00:13.158503857)
2024-05-10T18:03:31.871094Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:2240:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Draining at timeout
2024-05-10T18:03:31.871106Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:1875:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Starting to drain at 0:00:13.158503856 (fragment start 0:00:12.791510240, fragment end 0:00:12.791510241, chunk start 0:00:13.158503856, chunk end 0:00:13.158503857)
2024-05-10T18:03:31.871123Z 47610 0x7f212c001aa0 INFO                 fmp4mux mux/fmp4/src/fmp4mux/imp.rs:2240:gstfmp4::fmp4mux::imp:b615dab7/76291291/c58a49b5 Draining at timeout

Maybe this will make more sense next week, but right now it seems like there’s something wrong here with how the element or the aggregator base class is behaving.

I think this discussion would be better continued on GitLab :slight_smile: The spurious “Draining at timeout” behavior is a separate problem from the assertion and should get its own ticket. Any additional information you can put into the ticket about the assertion would also be very helpful; maybe then I can make sense of it.

In both cases a debug log with GST_DEBUG=fmp4*:9 would be useful.

Alright, done: RTP H264 through isofmp4mux results in INFO log flood (#3539) in the GStreamer GitLab issue tracker.

I’m going to spend some time today trying to reproduce these messages in a test environment.
