I’m working on a speaker diarization system that uses GStreamer for audio preprocessing, PyAnnote 3.0 for segmentation (it can’t handle overlapping speech), WeSpeaker (wespeaker_en_voxceleb_CAM) for speaker identification, and the Whisper small model for transcription. Speaker comparison is done with cosine similarity over the embeddings. The whole thing is written in Rust, using gstreamer-rs.
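For reference, the comparison step is plain cosine similarity over the WeSpeaker embeddings, roughly like the following (the zero-norm guard and whatever decision threshold I apply on top are my own choices, not something prescribed by WeSpeaker):

```rust
/// Cosine similarity between two speaker embeddings.
/// Returns a value in [-1.0, 1.0]; higher means "more likely the same speaker".
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // degenerate embedding; treat as "no match"
    }
    dot / (norm_a * norm_b)
}
```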
My current approach already reaches roughly 80%+ accuracy for speaker identification, and I’m looking for ways to improve the results further.
Current Pipeline:
- audioqueue → audioamplify → audioconvert → audioresample → capsfilter (16kHz, mono, F32LE); a gstreamer-rs sketch of this chain follows after this list
- Tried higher-quality resampling (kaiser method, full sinc table, cubic interpolation)
- Experimented with webrtcdsp for noise suppression and echo cancellation
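For concreteness, here is a rough gstreamer-rs sketch of that chain, built with gst::parse_launch (gst::parse::launch in newer releases). The uridecodebin source, the appsink, the amplification value, and the exact webrtcdsp properties are placeholders from my experiments, so treat them as approximate rather than authoritative:

```rust
use gstreamer as gst;

fn build_preprocessing_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::init()?;

    // Roughly the chain described above. The audioresample properties map to the
    // kaiser / full-sinc-table / cubic-interpolation experiment; webrtcdsp is the
    // optional denoise step (echo-cancel left off here because this sketch has no
    // playback path / webrtcechoprobe). The audioconvert elements around webrtcdsp
    // take care of the sample format it expects.
    gst::parse_launch(concat!(
        "uridecodebin uri=file:///path/to/input.mp4 ! ",
        "queue ! ",
        "audioamplify amplification=1.0 ! ",
        "audioconvert ! ",
        "audioresample resample-method=kaiser sinc-filter-mode=full ",
        "sinc-filter-interpolation=cubic quality=10 ! ",
        "audioconvert ! webrtcdsp echo-cancel=false noise-suppression=true ! ",
        "audioconvert ! ",
        "audio/x-raw,format=F32LE,channels=1,rate=16000 ! ",
        "appsink name=diarization_sink",
    ))
}
```

The appsink is just a stand-in for wherever the samples get handed off to PyAnnote/WeSpeaker/Whisper.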
Current challenges:
- Results vary between different video sources, e.g. sometimes the kaiser method gives better results and sometimes it doesn’t.
- Some videos produce great diarization results while others perform poorly.
To be clear: over-normalizing the audio can improve transcription quality while degrading speaker identification. Transcription quality is not a problem for me; my goal is to improve the quality of speaker identification.
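One direction I’m considering (sketch only, nothing validated) is to stop feeding both models from the same processed stream: split with a tee so Whisper gets the normalized branch while WeSpeaker sees audio with its original dynamics. The sink names and the amplification value below are made up for illustration:

```rust
use gstreamer as gst;

fn build_split_pipeline() -> Result<gst::Element, gst::glib::Error> {
    gst::init()?;
    gst::parse_launch(concat!(
        "uridecodebin uri=file:///path/to/input.mp4 ! ",
        "audioconvert ! audioresample ! ",
        "audio/x-raw,format=F32LE,channels=1,rate=16000 ! ",
        "tee name=split ",
        // Transcription branch: normalization/amplification is fine here.
        "split. ! queue ! audioamplify amplification=2.0 clipping-method=clip ! ",
        "appsink name=whisper_sink ",
        // Speaker-ID branch: leave the dynamics untouched for the embeddings.
        "split. ! queue ! appsink name=wespeaker_sink",
    ))
}
```

I’m not sure whether that kind of per-task split is the right general approach, which is part of why I’m asking.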
I know the limitations of the models, so what I’m looking for is more of a general paradigm for using these models as effectively as possible:
- What’s the recommended GStreamer preprocessing pipeline for speaker diarization?
- Are there specific elements or properties I should add/modify?
- Any experience with optimal audio preprocessing for speaker identification?
Thank you…