Quick take
Voice AI works when you treat it like plumbing, not magic. Keep perceived latency under 500ms, treat interruptions as a first-class concern, and keep the task scope narrow. The architecture choice between a modular pipeline and an end-to-end model matters less than your streaming strategy.
The gap between a voice AI demo and a voice AI product is about six months of work on things nobody finds exciting: latency tuning, interruption handling, and figuring out what happens when the user mumbles, changes their mind, or goes silent for eight seconds.
I’ve been involved in voice interface projects going back to a travel startup I built, and more recently in voice-first support tools. The models have gotten dramatically better. The engineering around them hasn’t kept pace.
Two architectures, one tradeoff
You have two practical options for a voice AI system:
Modular pipeline: Separate services for transcription, reasoning, and synthesis. You can swap components, instrument each stage, and debug failures in isolation. The cost is latency at every boundary.
mic -> STT service -> LLM -> TTS service -> speaker
         ~200ms      ~800ms    ~300ms
End-to-end model: A single model like GPT-4o that handles audio natively. Lower latency and a more natural feel, but harder to debug, and you’re locked to one provider’s capabilities.
I lean modular for anything going to production. Here’s why: when a user reports “the bot said something weird,” I need to know whether it was a transcription error, a reasoning failure, or a synthesis artifact. With an end-to-end model, that’s a black box.
The streaming architecture that matters
The biggest latency win isn’t model speed. It’s streaming. Start synthesizing audio before the full response is generated. In Go, it looks something like:
type VoiceSession struct {
	sttClient   STTClient
	llm         LLMClient
	ttsClient   TTSClient
	audioOut    chan []byte
	interrupted atomic.Bool
}
func (s *VoiceSession) HandleUtterance(ctx context.Context, audio []byte) error {
	// Transcribe
	transcript, err := s.sttClient.Transcribe(ctx, audio)
	if err != nil {
		return fmt.Errorf("transcription failed: %w", err)
	}

	// Stream LLM response, pipe chunks directly to TTS
	stream, err := s.llm.StreamChat(ctx, transcript)
	if err != nil {
		return fmt.Errorf("llm stream failed: %w", err)
	}

	var buf strings.Builder
	for chunk := range stream {
		if s.interrupted.Load() {
			return nil // User interrupted, stop generating
		}
		buf.WriteString(chunk.Text)

		// Flush to TTS at sentence boundaries
		if isSentenceEnd(buf.String()) {
			audioChunk, err := s.ttsClient.Synthesize(ctx, buf.String())
			if err != nil {
				buf.Reset()
				continue // Degrade gracefully: drop this chunk, keep streaming
			}
			s.audioOut <- audioChunk
			buf.Reset()
		}
	}

	// Flush remaining text
	if buf.Len() > 0 {
		audioChunk, err := s.ttsClient.Synthesize(ctx, buf.String())
		if err != nil {
			return fmt.Errorf("final synthesis failed: %w", err)
		}
		s.audioOut <- audioChunk
	}
	return nil
}
The key insight: flush to TTS at sentence boundaries, not at the end of the full response. The user hears the first sentence while the model is still generating the third. Perceived latency drops from 1300ms to under 500ms.
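The isSentenceEnd helper in the loop above is left undefined. A minimal version just checks the trailing character; a real one needs to handle abbreviations ("Dr.", "e.g.") and decimals ("3.5"), so treat this as a naive sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// isSentenceEnd reports whether the buffered text ends at a plausible
// sentence boundary. Naive: it only inspects the trailing character and
// will false-positive on abbreviations and decimal numbers.
func isSentenceEnd(s string) bool {
	t := strings.TrimRight(s, " \n")
	if t == "" {
		return false
	}
	switch t[len(t)-1] {
	case '.', '!', '?':
		return true
	}
	return false
}

func main() {
	fmt.Println(isSentenceEnd("Your order is confirmed."))
	fmt.Println(isSentenceEnd("Your order is"))
}
```

Even this crude check is enough to start streaming; the worst case of a false boundary is a slightly odd pause in the synthesized audio, not a broken response.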
Interruptions aren’t edge cases
People interrupt. They talk over the bot. They say “wait, no, actually…” halfway through a sentence. If your system can’t handle this, users will hate it within 30 seconds.
The interrupt handler needs to do three things fast:
- Stop audio output immediately. Not after the current sentence. Now.
- Cancel pending TTS and LLM generation. Don’t waste compute on a response nobody will hear.
- Accept the new input without resetting the conversation. Context should carry over.
func (s *VoiceSession) HandleInterrupt(ctx context.Context, newAudio []byte) error {
	s.interrupted.Store(true)

	// Drain the audio output channel
	for len(s.audioOut) > 0 {
		<-s.audioOut
	}

	s.interrupted.Store(false)
	return s.HandleUtterance(ctx, newAudio)
}
This is simplified, but the pattern holds. The atomic.Bool flag propagates interrupts to the streaming loop without complex synchronization.
When voice is the wrong interface
Voice is great when:
- The user’s hands are busy (driving, cooking, field work)
- The task has a narrow, predictable vocabulary
- The expected output is short – a confirmation, a lookup, a simple action
Voice is terrible when:
- The user needs to compare options visually
- The output is complex or structured (tables, code, lists)
- Precision matters more than speed (medical, legal, financial details)
I keep seeing teams try to build “voice-first everything” products. Don’t do this. Voice should be one input mode in a system that gracefully falls back to text or visual UI when the task demands it.
Operational concerns that will bite you
Transcription accuracy varies wildly by accent, background noise, and microphone quality. Test with real users in real environments, not in a quiet office with a studio mic. I learned this the hard way: a prototype that worked perfectly in our office fell apart in a warehouse with forklift noise.
Track these metrics from day one:
- Transcription word error rate by user segment
- Time to first audio byte (perceived latency)
- Interruption rate and recovery success
- Conversation completion rate vs. abandonment
- Fallback-to-text rate
Cost adds up fast. A 30-second voice interaction can involve an STT call, an LLM call with conversation history, and a TTS call. Multiply that by thousands of daily users and you need a cost model before you launch, not after.
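The back-of-envelope math is simple enough to encode. The per-call rates below are placeholders, not real vendor pricing; substitute the numbers from your providers' price sheets:

```go
package main

import "fmt"

// dailyCostUSD estimates daily spend for voice interactions, where each
// interaction makes one STT, one LLM, and one TTS call. Rates are
// per-call costs in USD; the values passed in main are hypothetical.
func dailyCostUSD(interactions int, sttPerCall, llmPerCall, ttsPerCall float64) float64 {
	perInteraction := sttPerCall + llmPerCall + ttsPerCall
	return float64(interactions) * perInteraction
}

func main() {
	// 5,000 interactions/day at made-up illustrative rates.
	fmt.Printf("$%.2f/day\n", dailyCostUSD(5000, 0.002, 0.010, 0.005))
}
```

The point isn't the arithmetic; it's that LLM calls with growing conversation history usually dominate, so the model forces you to notice which stage to optimize first.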
Keep it boring
The best voice AI products I’ve seen are boring. They do one thing, they do it fast, and they handle failure gracefully. A voice ordering system that works for 50 menu items. A voice-controlled inventory check. A hands-free incident report dictation tool.
Nobody is going to have a deep philosophical conversation with your voice bot. They want to get something done and move on. Design for that.
The tech is ready. The hard part is the discipline to ship something narrow and reliable instead of something ambitious and fragile.