Quick take
Voice AI works when you treat it like plumbing, not magic. Keep perceived latency under 500ms, treat interruptions as a first-class concern, and keep the task scope narrow. The architecture choice between a modular pipeline and an end-to-end model matters less than your streaming strategy.
The gap between a voice AI demo and a voice AI product is about six months of work on things nobody finds exciting: latency tuning, interruption handling, and figuring out what happens when the user mumbles, changes their mind, or goes silent for eight seconds.
I’ve been involved in voice interface projects going back to a travel startup I built, and more recently in voice-first support tools. The models have gotten dramatically better. The engineering around them hasn’t kept pace.
Two architectures, one tradeoff
You have two practical options for a voice AI system:
Modular pipeline: Separate services for transcription, reasoning, and synthesis. You can swap components, instrument each stage, and debug failures in isolation. The cost is latency at every boundary.
mic -> STT service -> LLM -> TTS service -> speaker
         ~200ms      ~800ms    ~300ms
End-to-end model: A single model like GPT-4o that handles audio natively. Lower latency and a more natural feel, but harder to debug, and you’re locked to one provider’s capabilities.
I lean modular for anything going to production. Here’s why: when a user reports “the bot said something weird,” I need to know whether it was a transcription error, a reasoning failure, or a synthesis artifact. With an end-to-end model, that’s a black box.
The streaming architecture that matters
The biggest latency win isn’t model speed. It’s streaming. Start synthesizing audio before the full response is generated. In Go, it looks something like:
type VoiceSession struct {
	sttClient   STTClient
	llm         LLMClient
	ttsClient   TTSClient
	audioOut    chan []byte
	interrupted atomic.Bool
}
func (s *VoiceSession) HandleUtterance(ctx context.Context, audio []byte) error {
	// Transcribe
	transcript, err := s.sttClient.Transcribe(ctx, audio)
	if err != nil {
		return fmt.Errorf("transcription failed: %w", err)
	}

	// Stream LLM response, pipe chunks directly to TTS
	stream, err := s.llm.StreamChat(ctx, transcript)
	if err != nil {
		return fmt.Errorf("llm stream failed: %w", err)
	}

	var buf strings.Builder
	for chunk := range stream {
		if s.interrupted.Load() {
			return nil // User interrupted, stop generating
		}
		buf.WriteString(chunk.Text)

		// Flush to TTS at sentence boundaries
		if isSentenceEnd(buf.String()) {
			audioChunk, err := s.ttsClient.Synthesize(ctx, buf.String())
			if err != nil {
				buf.Reset()
				continue // Degrade gracefully: drop this chunk, keep streaming
			}
			s.audioOut <- audioChunk
			buf.Reset()
		}
	}

	// Flush remaining text
	if buf.Len() > 0 {
		audioChunk, err := s.ttsClient.Synthesize(ctx, buf.String())
		if err != nil {
			return fmt.Errorf("final synthesis failed: %w", err)
		}
		s.audioOut <- audioChunk
	}
	return nil
}
The key insight: flush to TTS at sentence boundaries, not at the end of the full response. The user hears the first sentence while the model is still generating the third. Perceived latency drops from 1300ms to under 500ms.
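The isSentenceEnd helper in the loop above is left undefined. A minimal version just checks the trailing character; a real one needs to handle abbreviations ("Dr.", "e.g.") and decimals ("3.5"), so treat this as a naive sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// isSentenceEnd reports whether the buffered text ends at a plausible
// sentence boundary. Naive: it only inspects the trailing character and
// will false-positive on abbreviations and decimal numbers.
func isSentenceEnd(s string) bool {
	t := strings.TrimRight(s, " \n")
	if t == "" {
		return false
	}
	switch t[len(t)-1] {
	case '.', '!', '?':
		return true
	}
	return false
}

func main() {
	fmt.Println(isSentenceEnd("Your order is confirmed."))
	fmt.Println(isSentenceEnd("Your order is"))
}
```

Even this crude check is enough to start streaming; the worst case of a false boundary is a slightly odd pause in the synthesized audio, not a broken response.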
Interruptions aren’t edge cases
People interrupt. They talk over the bot. They say “wait, no, actually…” halfway through a sentence. If your system can’t handle this, users will hate it within 30 seconds.
The interrupt handler needs to do three things fast:
- Stop audio output immediately. Not after the current sentence. Now.
- Cancel pending TTS and LLM generation. Don’t waste compute on a response nobody will hear.
- Accept the new input without resetting the conversation. Context should carry over.
func (s *VoiceSession) HandleInterrupt(ctx context.Context, newAudio []byte) error {
	s.interrupted.Store(true)

	// Drain the audio output channel
	for len(s.audioOut) > 0 {
		<-s.audioOut
	}

	s.interrupted.Store(false)
	return s.HandleUtterance(ctx, newAudio)
}
This is simplified, but the pattern holds. The atomic.Bool flag propagates interrupts to the streaming loop without complex synchronization.
When voice is the wrong interface
Voice is great when:
- The user’s hands are busy (driving, cooking, field work)
- The task has a narrow, predictable vocabulary
- The expected output is short – a confirmation, a lookup, a simple action
Voice is terrible when:
- The user needs to compare options visually
- The output is complex or structured (tables, code, lists)
- Precision matters more than speed (medical, legal, financial details)
I keep seeing teams try to build “voice-first everything” products. Don’t do this. Voice should be one input mode in a system that gracefully falls back to text or visual UI when the task demands it.
Operational concerns that will bite you
Transcription accuracy varies wildly by accent, background noise, and microphone quality. Test with real users in real environments, not in a quiet office with a studio mic. I learned this the hard way: a prototype that worked perfectly in our office fell apart in a warehouse with forklift noise.
Track these metrics from day one:
- Transcription word error rate by user segment
- Time to first audio byte (perceived latency)
- Interruption rate and recovery success
- Conversation completion rate vs. abandonment
- Fallback-to-text rate
Cost adds up fast. A 30-second voice interaction can involve an STT call, an LLM call with conversation history, and a TTS call. Multiply that by thousands of daily users and you need a cost model before you launch, not after.
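The back-of-envelope math is simple enough to encode. The per-call rates below are placeholders, not real vendor pricing; substitute the numbers from your providers' price sheets:

```go
package main

import "fmt"

// dailyCostUSD estimates daily spend for voice interactions, where each
// interaction makes one STT, one LLM, and one TTS call. Rates are
// per-call costs in USD; the values passed in main are hypothetical.
func dailyCostUSD(interactions int, sttPerCall, llmPerCall, ttsPerCall float64) float64 {
	perInteraction := sttPerCall + llmPerCall + ttsPerCall
	return float64(interactions) * perInteraction
}

func main() {
	// 5,000 interactions/day at made-up illustrative rates.
	fmt.Printf("$%.2f/day\n", dailyCostUSD(5000, 0.002, 0.010, 0.005))
}
```

The point isn't the arithmetic; it's that LLM calls with growing conversation history usually dominate, so the model forces you to notice which stage to optimize first.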
Keep it boring
The best voice AI products I’ve seen are boring. They do one thing, they do it fast, and they handle failure gracefully. A voice ordering system that works for 50 menu items. A voice-controlled inventory check. A hands-free incident report dictation tool.
Nobody is going to have a deep philosophical conversation with your voice bot. They want to get something done and move on. Design for that.
The tech is ready. The hard part is the discipline to ship something narrow and reliable instead of something ambitious and fragile.