Guidance on adding support for a new audio codec

inkychris · February 22, 2024, 9:51am

I’m trying to add support for a custom audio codec, with the aim of playing it over MPEG-DASH using GStreamer. I have a functioning parser and decoder element with typefind support, but realtime performance is poor when used with uridecodebin pointing to a DASH manifest. It is possible that the DASH manifest is also to blame, since I’ve modified shaka-packager in order to generate it, but everything works fine when streaming to a file, which leads me to believe it’s some kind of buffering/latency/timestamping issue somewhere. I could do with some feedback on the overall approach I’ve taken, since I’ve mostly been reverse engineering the wav, wavpack, and flac parsers/decoders!

For context, the codec uses RIFF for the bitstream, with a config chunk to initialise the decoder, an optional seek table chunk, and then subsequent packet chunks to be passed to the decoder, which typically contain on the order of 10 ms of audio and optionally a hash of the packet data.
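To make the layout above concrete, here is a minimal sketch of walking the subchunks of a RIFF stream. The chunk IDs (`conf`, `pckt`) and the form type (`XCDC`) are hypothetical placeholders, since the thread does not name the real ones; only the generic RIFF framing (4-byte ID, 4-byte little-endian size, payload padded to an even length) is assumed.

```python
import struct


def iter_riff_chunks(data: bytes):
    """Yield (fourcc, payload) for each complete subchunk of a RIFF stream."""
    if len(data) < 12 or data[:4] != b"RIFF":
        raise ValueError("not a RIFF stream")
    # Bytes 4..8: little-endian size of everything after the size field.
    # Bytes 8..12: form type. Subchunks start at offset 12.
    riff_size = struct.unpack_from("<I", data, 4)[0]
    end = min(len(data), 8 + riff_size)
    pos = 12
    while pos + 8 <= end:
        fourcc, size = struct.unpack_from("<4sI", data, pos)
        payload = data[pos + 8 : pos + 8 + size]
        if len(payload) < size:
            break  # truncated chunk: more data is needed
        yield fourcc, payload
        pos += 8 + size + (size & 1)  # chunks are padded to an even length


def make_chunk(fourcc: bytes, payload: bytes) -> bytes:
    """Build one RIFF subchunk (helper for the demo below)."""
    pad = b"\x00" if len(payload) & 1 else b""
    return fourcc + struct.pack("<I", len(payload)) + payload + pad


# Hypothetical stream: one config chunk followed by two packet chunks.
body = (
    b"XCDC"
    + make_chunk(b"conf", b"\x01\x02\x03")
    + make_chunk(b"pckt", b"abcd")
    + make_chunk(b"pckt", b"efgh")
)
stream = b"RIFF" + struct.pack("<I", len(body)) + body
chunks = list(iter_riff_chunks(stream))
```

Each packet chunk carries an 8-byte header (ID plus size) in front of the codec data, which is the part a parser would strip before handing the packet to the decoder.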

I currently have a parser element which pulls the config data and sets it on the caps, and then passes the pure codec packet data to the decoder element (also removing the packet hash field if it was present). It does this by constructing a new buffer and setting it as the frame’s out_buffer, with timestamps determined by the packet duration and a global packet counter. I couldn’t find any useful examples of parser elements using out_buffer, but it does “appear to work”.
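As an illustration of the timestamping described above, the following sketch derives a packet’s PTS and duration from the packet counter. It assumes the config chunk provides a fixed number of samples per packet and a sample rate; both parameter names are hypothetical. Computing each timestamp from the total sample count, rather than multiplying a pre-rounded duration by the counter, keeps consecutive packets contiguous even when the packet duration is not a whole number of nanoseconds.

```python
GST_SECOND = 1_000_000_000  # GStreamer timestamps are expressed in nanoseconds


def packet_times(index: int, samples_per_packet: int, sample_rate: int):
    """Return (pts, duration) in nanoseconds for the packet at `index`."""
    start = index * samples_per_packet * GST_SECOND // sample_rate
    end = (index + 1) * samples_per_packet * GST_SECOND // sample_rate
    # Duration is the gap to the next packet's PTS, so timestamps never drift.
    return start, end - start


# 480 samples at 48 kHz is exactly 10 ms per packet.
pts, duration = packet_times(3, 480, 48000)
```

In a C element the same scaling would normally be done with `gst_util_uint64_scale()` to avoid overflow in the intermediate product.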

I don’t make any use of the skipsize value, which seems to allow you to align the parser’s incoming frame. I tried an alternative implementation using this to pass the packet data without using out_buffer, but it both performed worse (took longer to decode the whole stream to file) and appeared to introduce a large number of discont events, where the timestamps encountered were interspersed with invalid ones (max time value), so I clearly didn’t configure something correctly.

Is this a reasonable approach to take? I was wondering if perhaps the parser should simply chop up the RIFF stream into the discrete RIFF chunks, and then have the decoder parse the RIFF aspect of the stream too?

Am I seeing realtime performance issues when streaming dash (not from file) simply because my codec packet duration is very short and so it’s exposing a lack of buffering out of the decoder that other codecs perhaps mask?

pipeline reading from file (performs fine):

gst-launch-1.0.exe filesrc location=.build/example.riff ! decodebin ! audioconvert ! audioresample ! autoaudiosink

pipeline reading from DASH server (hiccups as new segments are pulled):

gst-launch-1.0.exe uridecodebin uri=http://localhost/manifest.mpd ! audioconvert ! audioresample ! autoaudiosink

tpm · February 22, 2024, 10:21am

Have you tried uridecodebin3 (so it uses the new adaptivedemux2 implementation instead)?

Might also be worth using a specific audiosink and perhaps checking what kind of buffer sizes it configures itself to by default.

inkychris · February 22, 2024, 12:00pm

Just gave that a go. It seems to be spawning more than one parser element which in its current form is then failing since it’s not seeing the start of the stream. Is there some way to stop it from doing that? It’s possible I’ve cargo culted some baseparse option over that just needs setting to false.

Using uridecodebin3 does appear to have the first parser instance submit multiple packets before the pipeline starts rolling, although hard to tell with the error currently being thrown.

inkychris · February 23, 2024, 1:03pm

There were a few issues with the parser that were causing problems.

uridecodebin3 is passing my raw codec packet data to a second parser instance. I hadn’t accounted for passthrough mode, so in handle_frame I needed to check whether the data was a raw codec packet or the RIFF bitstream. In my case I check the sink pad caps for framed == true, which is set on the src caps by the parser that processes the RIFF bitstream.
During adaptive mode, for whatever reason uridecodebin3 is first choosing the lowest bitrate DASH representation before switching to the highest. It doesn’t do this for an AAC test stream but that might just be because the datarates are much lower. My parser wasn’t correctly identifying the new stream due to the way I was handling the incoming frames. I switched to using skipsize to align the incoming frames with RIFF sync-points (i.e. start of stream, and then each subchunk). The parser can then still use the frame->out_buffer method to chop off the RIFF chunk header and any other extra fields (in my case a packet hash) and return the frame as it was before.
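The skipsize-based alignment described above can be modelled outside GStreamer roughly as follows. This is only a sketch of the decision logic, not the GstBaseParse API: the function returns a skip count when the input is not aligned to a sync point, asks for more data when a chunk is incomplete, and otherwise returns the chunk size together with the payload after the 8-byte chunk header. The chunk IDs are hypothetical, and the optional packet hash is not modelled because its layout is not given in the thread.

```python
import struct

SYNC_IDS = (b"RIFF", b"conf", b"pckt")  # hypothetical chunk IDs


def handle_frame(buf: bytes):
    """Return (skipsize, frame_size, payload) for the data currently available.

    skipsize > 0       : drop that many bytes, then call again.
    both sizes == 0    : aligned, but more data is needed.
    frame_size > 0     : one complete unit; payload is None for the RIFF header.
    """
    hits = [i for i in (buf.find(s) for s in SYNC_IDS) if i >= 0]
    if not hits:
        # No sync point: skip everything except a possible partial ID at the end.
        return max(len(buf) - 3, 0), 0, None
    start = min(hits)
    if start > 0:
        return start, 0, None  # align the next call to the sync point
    if len(buf) < 8:
        return 0, 0, None
    fourcc, size = struct.unpack_from("<4sI", buf, 0)
    if fourcc == b"RIFF":
        # Consume only the 12-byte RIFF header (ID, size, form type).
        return (0, 12, None) if len(buf) >= 12 else (0, 0, None)
    total = 8 + size + (size & 1)  # header + payload + pad byte
    if len(buf) < total:
        return 0, 0, None
    # Copy out the payload without the chunk header (the "chop" step).
    return 0, total, buf[8 : 8 + size]


# Leading garbage followed by one hypothetical packet chunk.
data = b"\x00\x00junk" + b"pckt" + struct.pack("<I", 4) + b"abcd"
first = handle_frame(data)             # asks to skip to the sync point
second = handle_frame(data[first[0]:])  # returns the stripped packet
```

A scan like this can find false sync points inside packet payloads, so a real parser would likely only scan when it has lost alignment (for example after a representation switch) and otherwise step from chunk to chunk using the size field.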

I’m now facing an issue with complete stream dropout after a few seconds when streaming the highest datarate, but lower datarates appear to play back without issue using playbin3.

It would also be nice if I could avoid the copy to chop the front of the frame buffer, but the copy at the moment seems cleaner than making another skipsize roundtrip just to drop ~8 bytes.