RTP server for live transcription and streaming responses


I’d like to create an RTP server in Rust that would accept connections from various clients (say a telephone exchange, computer audio, etc.). Its task would be to stream the incoming media to Azure Speech-to-Text and stream back responses from Azure Text-to-Speech. The Azure SDK has functions to stream bytes to speech-to-text and get a stream of bytes back from text-to-speech. So now I need to write the glue that connects RTP to the Azure services.

After my initial research, I suppose GStreamer RTP and https://webrtc.rs/ could be the libraries I need.

Can anyone point me in the right direction, is GStreamer the right tool for this use case?

Thanks for any advice,

You could also use GStreamer’s WebRTC support. That avoids the complications you’ll have with properly integrating with webrtc.rs. You can find various examples using GStreamer’s webrtcbin from Rust in gst-examples, for example the sendrecv example.

Apart from that, GStreamer seems like a good choice for what you’re trying to do. There’s also an existing transcriber element making use of the Amazon AWS transcribe API.


I haven’t yet understood all the concepts of the audio streaming world, but my current understanding is that I need a plain RTP server.

Something like this:

The telephone PBX is Asterisk, and it has the capability to stream the phone call over RTP. The most straightforward way to connect my service to the PBX would be to use the SIP protocol and act like an endpoint device (a VoIP telephone). But this comes with the disadvantage that my service is then a SIP client and has to connect from the service to the PBX, not the other way around. So I cannot really create a server application to which multiple Asterisk instances can connect.

If you have to do SIP then that’s limiting your options, yes. Also webrtc.rs / WebRTC in general is irrelevant then.

It is not necessary to use SIP. I already have a solution using it and I want to get rid of it and replace it with custom WS communication and RTP forwarding.

At the moment, the PBX uses strictly G.711 A-law, so my goal is to forward the audio stream into Azure STT and respond with a stream from Azure TTS. And I’d like to make it as simple as possible just to prove the idea. (Meaning I don’t want the service to be a SIP client, but make it an RTP server instead.)
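Since it’s strictly G.711 A-law, the decoding step itself is small enough to do by hand. Below is a sketch of the standard A-law-to-16-bit-PCM expansion (per ITU-T G.711; the function names `alaw_to_linear` and `decode_alaw_payload` are mine, not from any library):

```rust
/// Decode one G.711 A-law byte into a linear 16-bit PCM sample.
/// Standard A-law expansion (ITU-T G.711).
fn alaw_to_linear(byte: u8) -> i16 {
    let a = byte ^ 0x55; // undo the even-bit inversion applied by the encoder
    let mut t = ((a & 0x0f) as i16) << 4; // 4 mantissa bits
    let seg = (a & 0x70) >> 4; // 3 segment (exponent) bits
    match seg {
        0 => t += 8,
        1 => t += 0x108,
        _ => {
            t += 0x108;
            t <<= seg - 1;
        }
    }
    // The top bit carries the sign.
    if a & 0x80 != 0 { t } else { -t }
}

/// Decode a whole RTP payload of A-law bytes into PCM samples.
fn decode_alaw_payload(payload: &[u8]) -> Vec<i16> {
    payload.iter().map(|&b| alaw_to_linear(b)).collect()
}
```

The resulting `Vec<i16>` (8 kHz mono PCM) is what you’d feed into the STT input stream.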

Would it be enough to just create a UDP socket for now, parse the RTP packets and not use any framework (gstreamer or webrtc.rs)?

You could do that if it’s really just a matter of UDP/RTP/A-law. If you also need to handle RTCP, then using GStreamer will be a lot easier, and the same goes if you need to handle many different codecs.
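For the plain UDP/RTP route, the receive side really is small. Here is a minimal sketch of a UDP socket plus a parser for the fixed RTP header from RFC 3550 — no RTCP, no extension-header or padding handling, and the `parse_rtp`/`run` names and the hard-coded PCMA check are my own choices, not anything standard:

```rust
use std::net::UdpSocket;

/// The fixed part of an RTP header (RFC 3550, 12 bytes).
#[derive(Debug)]
struct RtpHeader {
    payload_type: u8,
    marker: bool,
    sequence: u16,
    timestamp: u32,
    ssrc: u32,
}

/// Parse the RTP header and return it together with the payload.
/// Extension headers and padding are not handled in this sketch.
fn parse_rtp(buf: &[u8]) -> Option<(RtpHeader, &[u8])> {
    if buf.len() < 12 || buf[0] >> 6 != 2 {
        return None; // too short, or not RTP version 2
    }
    let csrc_count = (buf[0] & 0x0f) as usize;
    let header_len = 12 + 4 * csrc_count;
    if buf.len() < header_len {
        return None;
    }
    Some((
        RtpHeader {
            marker: buf[1] & 0x80 != 0,
            payload_type: buf[1] & 0x7f,
            sequence: u16::from_be_bytes([buf[2], buf[3]]),
            timestamp: u32::from_be_bytes([buf[4], buf[5], buf[6], buf[7]]),
            ssrc: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
        },
        &buf[header_len..],
    ))
}

fn run(port: u16) -> std::io::Result<()> {
    let socket = UdpSocket::bind(("0.0.0.0", port))?;
    let mut buf = [0u8; 2048];
    loop {
        let (len, peer) = socket.recv_from(&mut buf)?;
        if let Some((hdr, payload)) = parse_rtp(&buf[..len]) {
            // Payload type 8 is PCMA (A-law) among the static RTP payload types.
            if hdr.payload_type == 8 {
                // Here: decode the A-law payload and forward PCM to the STT stream.
                println!("{} bytes of A-law from {}, seq {}", payload.len(), peer, hdr.sequence);
            }
        }
    }
}
```

A real server would also want to track streams per SSRC/peer, reorder by sequence number (or at least detect gaps), and handle extension headers and padding, which is roughly where a jitter buffer and eventually GStreamer start to pay off.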

webrtc.rs is not going to help you much with either of these in any case; it’s meant for use with WebRTC. You’ll be able to re-use parts of it, but I would expect you’d run into mismatches / missing features in a few places.

ok, I’ll try to start with plain UDP/RTP and continue with GStreamer if necessary. Thanks a lot for your input 🙂