Enhancing Voice Activity Detection for Real-World Conversational AI
Machine Learning @ Verloop
Our Use Case: Building a Responsive Conversational AI System for Voice
At Verloop, we deploy conversational Voice AI Agents for our clients across a variety of use cases. These agents are built using our Recipes (agent workflows), which integrate LLMs at different stages to deliver intelligent and dynamic conversations.
In most voice-based conversational systems today, the typical processing pipeline looks something like this:
Customer Utterance → Speech-to-Text (STT) → LLM → Text-to-Speech (TTS) → Response
While direct audio-to-audio processing is becoming increasingly feasible, its steerability for real-world use cases still presents significant challenges.
In voice-based conversational systems, accurately detecting when a customer starts and stops speaking is critical. This helps minimize latency and ensures the system generates appropriate and timely responses.
Detecting when a customer has stopped speaking is particularly important as we don't want to send incomplete transcriptions from STT to the LLM mid-sentence. Instead, we aim to forward the text only after the customer has finished speaking, ensuring that the LLM generates a coherent and contextually accurate response.
Consider the following utterance:
"Hi" <brief pause> "Can I get more information about this promotion" <brief pause> "and can you also please send me the details by email" <long pause>
With a streaming STT, the system continuously outputs words and phrases as the customer speaks. However, we don't want to trigger response generation mid-sentence after every short pause. Instead, we monitor for longer pauses that indicate the customer has finished their thought, making it the right time to process and respond.
Similarly, consider another scenario where the agent is speaking, and the customer interrupts. In such cases, it's crucial for the agent to pause—just as a human would in a natural conversation—listen to the customer, and then generate an appropriate response.
This is precisely where Voice Activity Detection (VAD) plays a crucial role.
Voice Activity Detection in Conversational AI
In conversational AI, especially in voice-based systems, understanding when a user starts and stops speaking is as important as understanding what they are saying. Voice Activity Detection (VAD) is the technology that enables this — it analyzes incoming audio to determine whether speech is present or not.
A reliable VAD system forms the backbone of a smooth, human-like conversation between a customer and a voice agent. It ensures that the agent listens actively, responds at the right moments, and doesn’t interrupt or lag unnecessarily.
In traditional applications like telephony or speech recognition, VAD is often used to save bandwidth or optimize transcription accuracy. However, in conversational AI, the requirements are much stricter:
Timely Response Generation: The system must quickly recognize when a user has finished speaking to minimize response latency.
Natural Turn-Taking: Just like in human conversations, the agent should know when to pause if a user interrupts or interjects.
Efficient Use of Compute Resources: By accurately detecting the end of a speech segment, we avoid sending partial or incomplete utterances to downstream components like LLMs, improving efficiency and coherence.
Noise Robustness: In real-world environments, VAD needs to distinguish between actual speech and background noise, music, or other non-speech events.
This led us to experiment with various VAD options available and ultimately build a customized strategy on top of them, specifically focused on reliably detecting speaker state changes.
Experiments and Findings
Back in 2022, we began experimenting with various Voice Activity Detection (VAD) approaches. Many of these methods remain effective today. Our primary evaluation criteria focused on achieving predictions with minimal latency—ideally within a few milliseconds—suitable for real-time streaming inference on CPU-based systems. Additionally, we prioritized models that offered high recall and precision. Below are some of the approaches we evaluated.
Energy based VAD
This is a simple, threshold-based filtering approach which works by assessing the short-term energy levels in a signal. If the energy of a frame exceeds a predefined threshold, it is classified as voice; otherwise, it is classified as non-voice.
There are also more sophisticated variations of this method, which rely on analyzing the signal's spectral properties using a Fast Fourier Transform (FFT) or by applying dynamic, adaptive thresholds based on background noise estimation.
In our experiments, we started with a simple power computation over small chunks of the signal and plotted the energy levels across different audio samples to study voice and non-voice patterns.
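A minimal sketch of that power computation, in pure Python. The frame size and the fixed threshold here are illustrative assumptions; in practice the threshold would be tuned per deployment or adapted to estimated background noise:

```python
import math

def frame_energy(samples):
    """Mean power of one audio frame (a list of PCM sample values)."""
    return sum(s * s for s in samples) / len(samples)

def energy_vad(samples, frame_size=512, threshold=0.01):
    """Classify each frame as voice (True) or non-voice (False) by
    comparing its short-term energy to a fixed threshold."""
    decisions = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        decisions.append(frame_energy(frame) > threshold)
    return decisions

# Example: a loud sine-wave "speech" frame followed by near-silence
loud = [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(512)]
quiet = [0.001 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(512)]
print(energy_vad(loud + quiet))  # → [True, False]
```

Plotting `frame_energy` over time for different recordings is exactly how we studied the voice/non-voice patterns mentioned above.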
Silero VAD
Silero VAD is an open-source, lightweight voice activity detection model accessible via PyTorch Hub and pip. It employs a multi-head attention neural network architecture with Short-Time Fourier Transform (STFT) features, enabling efficient speech detection. Optimized for low-latency applications, it processes audio chunks as small as 30 milliseconds in under 1 millisecond on a single CPU thread. Trained on a diverse dataset encompassing over 100 languages, Silero VAD generalizes well across various audio conditions. Its compact size (approximately 1–2 MB) and support for 8 kHz and 16 kHz sampling rates make it suitable for real-time streaming scenarios.
Webrtc VAD
The WebRTC (Web Real-Time Communication) VAD was released by Google, and WebRTC is actively maintained by the Google WebRTC team. We tested it through a Python interface to the WebRTC Voice Activity Detector.
Speechbrain VAD
The SpeechBrain VAD model, hosted on Hugging Face as speechbrain/vad-crdnn-libriparty, is a voice activity detection system implemented using the SpeechBrain toolkit. It employs a Convolutional Recurrent Deep Neural Network (CRDNN) architecture, consisting of convolutional layers for feature extraction, followed by bidirectional LSTM layers, and fully connected layers. The model processes audio segments and outputs frame-level posterior probabilities, which are thresholded to classify segments as speech or non-speech. Trained on the LibriParty dataset, the model is designed to handle both short and long speech recordings. While the exact frame length is not specified, the model's architecture suggests it may not be optimized for real-time streaming scenarios.
Nemo VAD
The NeMo (vad_marblenet) model was released by NVIDIA. It is based on the MarbleNet architecture and takes MFCC features as input. MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolutions, batch normalization, ReLU, and dropout layers.
Pyannote VAD
Pyannote VAD is an open-source voice activity detection system hosted on Hugging Face. While primarily designed for voice activity detection, it can also perform speaker segmentation. The model processes audio in chunks of around 5 seconds, making it better suited to analyzing pre-recorded audio files than to real-time streaming scenarios.
Findings
Test input came from both live microphone audio and pre-recorded WAV files. The samples were curated to cover a variety of languages and background noises (clicking sounds, people speaking, background music, etc.) so we could test the effectiveness of these models.
✅ Clear Speech
All models performed well with clear, audible speech in English, Hindi, and other Indic languages.
🔊 Background Static
Most of the models were confounded by the clicking sounds. WebRTC offers only binary outputs, while Silero provides probabilistic scores that can be thresholded (e.g., at 0.8–0.9) to mitigate false positives. Even so, Silero occasionally assigned high probabilities to static noise.
🗣️ Background Voice
Background speech was the most challenging scenario. Threshold-based models performed better when the primary speaker was significantly louder than the background voices; however, raising the energy threshold risks missing quieter speakers.
🎶 Background Music
Models like Silero and WebRTC handled plain background music effectively. Silero maintained low probability scores (below 0.8) when only music was present.
☕ Café Ambience
In noisy environments like crowded cafés, Silero performed well as long as the speaker's voice was much louder than the background noise. WebRTC tended to produce more false positives in such settings.
🔇 Low-Volume Speech
When the speaker's voice is soft, models may fail to detect speech at all, and raising the energy threshold to filter background noise makes missing low-volume speech even more likely.
⚠️ Streaming Considerations
Models requiring larger audio chunks, such as NeMo, SpeechBrain, and pyannote, were not suitable for us owing to our real-time streaming requirements.
We then tried an ensemble approach in which Silero is combined with a simple energy-threshold-based VAD.
Detecting Speaker State Changes: Start and Stop Detection
While experimenting with various models, we found that Silero VAD offers high precision with impressively low latency: around 1 ms per 32 ms audio chunk (roughly 32 ms of compute per second of audio). This makes it highly suitable for production-grade applications.
However, despite its accuracy, Silero VAD occasionally produced false positives, detecting speech even when no speaker was active. To mitigate this, we coupled it with an Energy-based VAD, which led to noticeable improvements in both precision and recall.
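A toy sketch of the ensemble idea. The specific thresholds and the simple AND combination below are illustrative assumptions, not the exact production logic: a frame counts as speech only when Silero's probability and the frame's energy both clear their thresholds, which suppresses Silero's occasional false positives on low-energy noise:

```python
def ensemble_vad(silero_prob, frame_energy,
                 prob_threshold=0.8, energy_threshold=0.01):
    """Illustrative ensemble rule: require both the neural VAD's
    probability and the frame's short-term energy to clear their
    thresholds before declaring speech."""
    return silero_prob >= prob_threshold and frame_energy >= energy_threshold

# Static noise: Silero is (wrongly) confident, but the energy gate rejects it
print(ensemble_vad(0.92, 0.001))  # → False
# Clear speech: both detectors agree
print(ensemble_vad(0.95, 0.2))    # → True
```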
While VAD detects whether a user is speaking in a given speech segment, our use case required us to track state changes: when a user starts speaking and when they stop. We also wanted to smooth out the short pauses that are common in human speech.
With accuracy more or less sorted, we turned our focus to controlling the sensitivity of the VAD module. Sensitivity, in this context, refers to how aggressively the VAD detects speech, especially based on the duration of speech segments. Here's how we approached it:
🎯 High sensitivity: Helps detect short utterances like “hi”
🛑 Low sensitivity: Better suited for capturing longer sentences, filtering out brief noise or quick words.
To implement this control, we introduced a state buffer: a mechanism that stores the output from the ensemble VAD module over the last x seconds. From this:
We calculate a state probability, which represents the average likelihood of speech across the buffer.
This is then compared against a state probability threshold.
A higher threshold and a longer buffer window make the VAD less sensitive to short utterances, reducing false positives.
The system tracks transitions between three possible speaker states:
🗣️ Speaker was speaking → Speaker stopped speaking
🔇 Speaker was not speaking → Speaker started speaking
🔁 No change: The speaker remains in the same state (either continuously speaking or not speaking)
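The buffer-and-threshold mechanism above can be sketched as follows. The buffer length and state probability threshold are hypothetical values for illustration; the production values are tuned per deployment:

```python
from collections import deque

class SpeakerStateTracker:
    """Smooths per-chunk VAD outputs over a rolling buffer and reports
    speaker state transitions."""

    def __init__(self, buffer_chunks=3, state_threshold=0.6):
        self.buffer = deque(maxlen=buffer_chunks)  # last N chunk probabilities
        self.threshold = state_threshold
        self.speaking = False

    def update(self, speech_prob):
        """Feed one chunk's speech probability; return the transition:
        'started', 'stopped', or 'no_change'."""
        self.buffer.append(speech_prob)
        # State probability: average likelihood of speech across the buffer
        state_prob = sum(self.buffer) / len(self.buffer)
        if not self.speaking and state_prob >= self.threshold:
            self.speaking = True
            return "started"
        if self.speaking and state_prob < self.threshold:
            self.speaking = False
            return "stopped"
        return "no_change"

tracker = SpeakerStateTracker(buffer_chunks=3, state_threshold=0.6)
probs = [0.1, 0.2, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
events = [tracker.update(p) for p in probs]
print(events)
# → ['no_change', 'no_change', 'no_change', 'started',
#    'no_change', 'no_change', 'stopped', 'no_change']
```

Note how the single low-probability chunk at index 5 does not immediately trigger a "stopped" event: the rolling average smooths over short pauses, which is exactly the sensitivity control described above.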
Deployment of VAD systems for low latency
We needed to use VAD in a streaming fashion, where the VAD model receives one chunk (512 / 256 samples depending on sample rate) at a time and runs inference on it in real time. To facilitate this, we used Redis Streams as the messaging backbone for our VAD service.
However, deploying Silero VAD can be tricky due to some subtle challenges that aren't immediately obvious during initial research. In our first experiment, we hosted multiple models with several parallel workers handling audio streams. In this setup, chunks from any stream could be processed by any worker. While this configuration performed well in experiments, further analysis showed it to be suboptimal.
🧠 Statefulness of Silero VAD
Since Silero VAD is a stateful model, it relies on the context of previous audio chunks to accurately detect speech. For example, if a chunk with background noise (no speaker) is followed by a chunk of clear speech (no noise), the VAD might still fail to detect the speaker because it is missing that continuity. With chunks distributed randomly across workers, this statefulness was lost, leading to poor detection accuracy.
In subsequent experiments, we implemented a mechanism that binds each stream to a specific VAD model instance. This preserved state across chunks and improved accuracy, but introduced a new bottleneck: only X streams could be processed concurrently, and the rest had to wait. These limits could, of course, be alleviated by scaling the service horizontally and increasing the number of workers.
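A simplified sketch of such a stream-to-model binding. `model_factory` stands in for loading a Silero instance, and the pool size caps concurrent streams; the class name and structure are our illustration, not the actual service code:

```python
class VadModelPool:
    """Binds each audio stream to a dedicated stateful VAD model so that
    chunk context is preserved across the stream's lifetime."""

    def __init__(self, model_factory, max_streams=8):
        self.factory = model_factory
        self.max_streams = max_streams
        self.bindings = {}  # stream_id -> model instance

    def acquire(self, stream_id):
        """Return the model bound to this stream, creating one if needed."""
        if stream_id in self.bindings:
            return self.bindings[stream_id]
        if len(self.bindings) >= self.max_streams:
            # All workers busy; in the real service the stream would queue here
            raise RuntimeError("all VAD workers busy; stream must wait")
        model = self.factory()
        self.bindings[stream_id] = model
        return model

    def release(self, stream_id):
        """Free the binding when the call ends."""
        self.bindings.pop(stream_id, None)

pool = VadModelPool(model_factory=object, max_streams=2)
m1 = pool.acquire("call-1")
assert pool.acquire("call-1") is m1  # same stream always gets the same model
```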
⚠️ Memory Management Challenges
While this solved the latency and throughput issues to some extent, we ran into a critical memory issue. Silero VAD wasn’t releasing memory properly — and under load, memory usage ballooned to over 30 GB across pods. Each VAD model was only about 1 MB, yet individual instances were consuming 800MB+.
Even attempts to reset the model state didn’t free up memory. Upon investigation, we discovered that Silero VAD’s memory usage grows with the duration of the longest active utterance — not the number of streams. So, if a single 30-minute call was handled by one VAD, that instance would retain memory for the entire duration. Shorter calls processed afterward wouldn’t reduce this memory footprint.
✅ Switching to ONNX Silero VAD
After testing multiple strategies, including a rolling window buffer, the most effective solution was a simple one: switching to the ONNX version of Silero VAD. This cut the memory footprint by a factor of two to three, and memory was released properly. Detection accuracy was unchanged, and latency dropped by a further 50%.
This switch eliminated the memory bloat issue without sacrificing performance — and gave us a much more stable, production-ready deployment.
Metrics and Evaluation: How We Measured VAD Success
Since there’s no open-source dataset available for VAD evaluation under real-world noisy conditions, we collected our own data: a test set of call recordings with different background noises and speakers to benchmark the VAD system.
To assess the model’s performance, we used precision, recall and F1 Score as our core evaluation metrics. These helped us measure how accurately the VAD was able to detect state changes—i.e., transitions between speaking and not speaking.
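Computing these metrics over state-change events requires matching predicted transitions to annotated ones within some tolerance. The matching scheme below is an illustrative simplification of the evaluation, with hypothetical chunk indices:

```python
def prf1(predicted, actual, tolerance=1):
    """Greedily match predicted state-change positions (chunk indices)
    to annotated ones within +/- tolerance, then compute precision,
    recall, and F1 over the matches."""
    unmatched = list(actual)
    tp = 0
    for p in predicted:
        for a in unmatched:
            if abs(p - a) <= tolerance:
                unmatched.remove(a)  # each annotation matches at most once
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Predicted transitions at chunks 10, 42, 99; ground truth at 11, 40, 70
p, r, f = prf1([10, 42, 99], [11, 40, 70], tolerance=2)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```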
However, annotating audio data manually is notoriously difficult and time-consuming. To build an intuition for how the VAD was doing, we generated animations that visually represented its state changes over time, clearly marking when the system detected the speaker as active or silent. By comparing these against the actual audio content, we could estimate precision and recall far more efficiently.
Here's what one of those animations looked like 👇
This gave us an intuitive understanding of how well the model and its hyperparameters were performing. We then used the VAD predictions, along with human corrections, to build a better ground-truth test set.
We found that accuracy varied with many parameters. On noisy audio, we got F1 scores of ~82%; on clear audio, the F1 scores were much higher.
Future roadmap
We’ve been actively exploring turn-taking detection systems to enhance the voice interaction experience. While VAD-based solutions perform well for detecting speech activity, they often lack context awareness—particularly in managing conversational flow. That’s where turn detection comes in. Our ongoing work focuses on combining turn detection mechanisms with VAD to create a more natural and responsive conversational experience.