I’m working on a speaker diarization system that uses GStreamer for audio preprocessing, PyAnnote 3.0 for segmentation (it can’t handle overlapping speech), WeSpeaker (wespeaker_en_voxceleb_CAM) for speaker identification, and the Whisper small model for transcription. Speaker comparison is done with cosine similarity over the embeddings. The whole thing is written in Rust, using gstreamer-rs.
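For reference, the comparison step is plain cosine similarity over the WeSpeaker embeddings, roughly like the following (the zero-norm guard and whatever decision threshold I apply on top are my own choices, not something prescribed by WeSpeaker):

```rust
/// Cosine similarity between two speaker embeddings.
/// Returns a value in [-1.0, 1.0]; higher means "more likely the same speaker".
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // degenerate embedding; treat as "no match"
    }
    dot / (norm_a * norm_b)
}
```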
My current approach already reaches roughly 80%+ accuracy for speaker identification, and I’m looking for ways to improve the results further.
Current Pipeline:
- audioqueue → audioamplify → audioconvert → audioresample → capsfilter (16kHz, mono, F32LE); a gstreamer-rs sketch of this chain follows after this list
- Tried higher-quality resampling (kaiser method, full sinc table, cubic interpolation)
- Experimented with webrtcdsp for noise suppression and echo cancellation
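For concreteness, here is a rough gstreamer-rs sketch of that chain, built with gst::parse_launch (gst::parse::launch in newer releases). The uridecodebin source, the appsink, the amplification value, and the exact webrtcdsp properties are placeholders from my experiments, so treat them as approximate rather than authoritative:

```rust
use gstreamer as gst;

fn build_preprocessing_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::init()?;

    // Roughly the chain described above. The audioresample properties map to the
    // kaiser / full-sinc-table / cubic-interpolation experiment; webrtcdsp is the
    // optional denoise step (echo-cancel left off here because this sketch has no
    // playback path / webrtcechoprobe). The audioconvert elements around webrtcdsp
    // take care of the sample format it expects.
    gst::parse_launch(concat!(
        "uridecodebin uri=file:///path/to/input.mp4 ! ",
        "queue ! ",
        "audioamplify amplification=1.0 ! ",
        "audioconvert ! ",
        "audioresample resample-method=kaiser sinc-filter-mode=full ",
        "sinc-filter-interpolation=cubic quality=10 ! ",
        "audioconvert ! webrtcdsp echo-cancel=false noise-suppression=true ! ",
        "audioconvert ! ",
        "audio/x-raw,format=F32LE,channels=1,rate=16000 ! ",
        "appsink name=diarization_sink",
    ))
}
```

The appsink is just a stand-in for wherever the samples get handed off to PyAnnote/WeSpeaker/Whisper.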
Current challenges:
- Results vary between different video sources, e.g. sometimes the kaiser method gives better results and sometimes it doesn’t.
- Some videos produce great diarization results while others perform poorly.
To be clear: over-normalizing the audio can improve transcription quality while degrading speaker identification. Transcription quality is not a problem for me; my goal is to improve the quality of speaker identification.
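One direction I’m considering (sketch only, nothing validated) is to stop feeding both models from the same processed stream: split with a tee so Whisper gets the normalized branch while WeSpeaker sees audio with its original dynamics. The sink names and the amplification value below are made up for illustration:

```rust
use gstreamer as gst;

fn build_split_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::init()?;
    gst::parse_launch(concat!(
        "uridecodebin uri=file:///path/to/input.mp4 ! ",
        "audioconvert ! audioresample ! ",
        "audio/x-raw,format=F32LE,channels=1,rate=16000 ! ",
        "tee name=split ",
        // Transcription branch: normalization/amplification is fine here.
        "split. ! queue ! audioamplify amplification=2.0 clipping-method=clip ! ",
        "appsink name=whisper_sink ",
        // Speaker-ID branch: leave the dynamics untouched for the embeddings.
        "split. ! queue ! appsink name=wespeaker_sink",
    ))
}
```

I’m not sure whether that kind of per-task split is the right general approach, which is part of why I’m asking.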
I know the limitations of the models, so what I’m looking for is more of a general paradigm for using these models as effectively as possible:
- What’s the recommended GStreamer preprocessing pipeline for speaker diarization?
- Are there specific elements or properties I should add/modify?
- Any experience with optimal audio preprocessing for speaker identification?
Thank you…