Hi y’all,
I’ve already built a speech processing pipeline like so in C++, without GStreamer:
Mic input → SSE (Speech Signal Enhancement) → VAD (Voice Activity Detection) → ASR (Automatic Speech Recognition)
Note: The VAD detects starts and ends of sentences, not single words
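For reference, here is roughly the GStreamer pipeline I am picturing. This is only a sketch: sse, vad and asr stand for the custom elements I would write myself (the names are my own invention), while autoaudiosrc, audioconvert and audioresample are standard elements; asr here acts as the sink, i.e. option (a) of my first question below.

```cpp
#include <gst/gst.h>

int main(int argc, char *argv[]) {
  gst_init(&argc, &argv);

  // sse, vad and asr are hypothetical custom elements I would implement;
  // they do not exist in GStreamer itself.
  GError *err = nullptr;
  GstElement *pipeline = gst_parse_launch(
      "autoaudiosrc ! audioconvert ! audioresample ! sse ! vad ! asr",
      &err);
  if (pipeline == nullptr) {
    g_printerr("Failed to build pipeline: %s\n", err->message);
    g_clear_error(&err);
    return 1;
  }

  gst_element_set_state(pipeline, GST_STATE_PLAYING);
  // ... run a main loop and watch the bus here ...
  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(pipeline);
  return 0;
}
```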
I’m considering switching to GStreamer, as its paradigm is quite compelling for my use case, but I’m concerned that some parts of my pipeline might bend it a bit too far.
Here are some of my current questions:
- How should my ASR element be implemented?
  - As a sink, pushing the transcribed text to the GStreamer message bus? (A rough sketch of what I imagine is below.)
  - As a filter element with a text source pad (which the application would then need to retrieve, though I’m unsure how)?
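To make the first option concrete, this is roughly what I imagine, based on my reading of the bus API; the “asr/transcript” structure and its “text” field are names I made up:

```cpp
#include <gst/gst.h>

// Inside the ASR element, once a whole sentence has been transcribed:
static void post_transcript(GstElement *self, const gchar *text) {
  // "asr/transcript" and "text" are my own naming, not GStreamer API.
  GstStructure *s = gst_structure_new("asr/transcript",
                                      "text", G_TYPE_STRING, text,
                                      NULL);
  gst_element_post_message(self,
      gst_message_new_application(GST_OBJECT(self), s));
}

// In the application, pop the transcripts off the pipeline's bus:
static void drain_transcripts(GstBus *bus) {
  GstMessage *msg;
  while ((msg = gst_bus_timed_pop_filtered(bus, GST_CLOCK_TIME_NONE,
                                           GST_MESSAGE_APPLICATION))) {
    const GstStructure *s = gst_message_get_structure(msg);
    if (s != NULL && gst_structure_has_name(s, "asr/transcript"))
      g_print("Transcript: %s\n", gst_structure_get_string(s, "text"));
    gst_message_unref(msg);
  }
}
```

For the second option, I suppose the text source pad could be linked to an appsink that the application pulls samples from, but for sparse text results a bus message feels more natural to me.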
- In my current implementation without GStreamer, the VAD outputs variable-length audio buffers (depending on the sentence length) to a queue, from which the ASR pops the next audio buffer (one sentence) to transcribe. Note that the ASR must process the whole sentence at once, not in a streaming way. What would be the best approach here? (A sketch of the marker idea follows this list.)
  - Using markers: I’ve seen GST_BUFFER_FLAG_MARKER.
    - The doc says: “for audio this is the start of a talkspurt”, which seems perfect, but I cannot find a matching marker for the end of the speech segment.
    - In that case the ASR element would only copy incoming buffers into its internal one between those markers.
  - Using signals, with the same accumulation mechanism in the ASR element as for markers.
  - Using a queue?
    - I’ve seen mentions of queue elements in GStreamer, but I don’t quite understand how they work: a queue implies that what it buffers has a start and an end, which seems at odds with a streaming paradigm?
    - Does this concept of “indivisible” buffers exist in GStreamer?
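To illustrate the marker idea, here is a rough sketch of what I imagine the ASR element’s pad functions would look like. Since I cannot find an end-of-talkspurt flag, I have sketched the end signal as a custom downstream event named “vad/end-of-speech” that my VAD element would push; that event name is purely my invention, not GStreamer API:

```cpp
#include <gst/gst.h>
#include <vector>

// Accumulates the raw samples of the sentence currently being spoken.
static std::vector<guint8> sentence;

static GstFlowReturn asr_chain(GstPad *pad, GstObject *parent, GstBuffer *buf) {
  // GST_BUFFER_FLAG_MARKER on the first buffer of a talkspurt:
  // start accumulating a new sentence.
  if (GST_BUFFER_FLAG_IS_SET(buf, GST_BUFFER_FLAG_MARKER))
    sentence.clear();

  GstMapInfo map;
  if (gst_buffer_map(buf, &map, GST_MAP_READ)) {
    sentence.insert(sentence.end(), map.data, map.data + map.size);
    gst_buffer_unmap(buf, &map);
  }
  gst_buffer_unref(buf);
  return GST_FLOW_OK;
}

static gboolean asr_sink_event(GstPad *pad, GstObject *parent, GstEvent *event) {
  // Hypothetical end-of-speech notification that my VAD element would push
  // from its source pad, e.g.:
  //   gst_pad_push_event(vad_srcpad,
  //       gst_event_new_custom(GST_EVENT_CUSTOM_DOWNSTREAM,
  //                            gst_structure_new_empty("vad/end-of-speech")));
  if (GST_EVENT_TYPE(event) == GST_EVENT_CUSTOM_DOWNSTREAM &&
      gst_event_has_name(event, "vad/end-of-speech")) {
    // The whole sentence is now in `sentence`: run the ASR on it here
    // and report the transcript (e.g. via the bus message from question 1).
    gst_event_unref(event);
    return TRUE;
  }
  return gst_pad_event_default(pad, parent, event);
}
```

As for the queue idea: my understanding is that a queue element mainly decouples threads and passes through whatever GstBuffers flow into it, so it would only help here if the VAD already emitted one buffer per sentence, and I am unsure how well multi-second, variable-length buffers fit the streaming model.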
Thank you for reading, have a nice day!
Yves