Quick take
Don’t make reasoning models your default path. Route by complexity, run expensive calls async, set per-request budgets, and cache aggressively. The model is the easy part. The routing and cost control are where you earn your keep.
I spent the last month integrating reasoning models into a production service. The short version: they’re genuinely better at complex analysis tasks. The long version: they’ll wreck your UX and budget if you treat them like a drop-in replacement for fast models.
This post covers the architecture I landed on, with real Go code. When I started this work, most posts I found offered hand-wavy “use async patterns” advice with zero implementation detail.
The problem, concretely
Standard LLM calls in our pipeline take 1-3 seconds. Reasoning model calls take 8-45 seconds. That’s not a rounding error. It’s a completely different product experience.
Cost scales the same way. A reasoning call can burn 10-50x the tokens of a standard call for the same input because the model does internal chain-of-thought before producing output. On a high-traffic endpoint, that adds up fast.
At one company, someone enabled a reasoning model as the default for their support chatbot. The monthly API bill went from $2,000 to $34,000 in three weeks. Most of those calls were “what are your business hours?” Not exactly a problem that requires deep reasoning.
When reasoning models actually help
I’ve found three categories where the latency and cost trade-off is worth it:
Multi-step analysis. Reviewing a contract clause, debugging a complex data pipeline, synthesizing information from multiple sources. Tasks where a wrong answer costs more than a slow answer.
Code review and debugging. Reasoning models catch logic errors and subtle bugs that fast models miss entirely. I use them in our CI pipeline for reviewing diffs on critical paths. Nobody cares if that takes 30 seconds.
Planning and decomposition. Breaking a complex task into subtasks, reasoning about dependencies, identifying risks. The model needs to hold a lot of context and think through implications.
Where they’re a waste: simple Q&A, classification, extraction, and anything high-volume or latency-sensitive. Route those to fast models and save money.
The routing layer
The core insight is simple: not every request deserves the same model. Here’s the router I use in Go:
type ComplexityLevel int

const (
	ComplexityLow ComplexityLevel = iota
	ComplexityMedium
	ComplexityHigh
)

type Router struct {
	fastModel      string
	reasoningModel string
	classifier     *ComplexityClassifier
	client         ModelClient // used by callModel below
}
func (r *Router) Route(ctx context.Context, req Request) (Response, error) {
	level := r.classifier.Assess(req)
	switch level {
	case ComplexityLow:
		return r.callModel(ctx, r.fastModel, req, defaultBudget)
	case ComplexityMedium:
		resp, err := r.callModel(ctx, r.fastModel, req, defaultBudget)
		if err != nil || resp.Confidence < 0.7 {
			return r.callModel(ctx, r.reasoningModel, req, premiumBudget)
		}
		return resp, nil
	case ComplexityHigh:
		return r.callModel(ctx, r.reasoningModel, req, premiumBudget)
	default:
		return r.callModel(ctx, r.fastModel, req, defaultBudget)
	}
}
The complexity classifier doesn’t need to be fancy. Ours uses a combination of input length, certain keywords (like “analyze”, “compare”, “debug”), and whether the request references multiple documents. A simple heuristic gets you 80% of the way there.
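To make that concrete, here’s a minimal version of such a heuristic. The thresholds and keyword list are illustrative, not our tuned values, and I’ve repeated the ComplexityLevel type so the snippet stands alone:

```go
package main

import "strings"

type ComplexityLevel int

const (
	ComplexityLow ComplexityLevel = iota
	ComplexityMedium
	ComplexityHigh
)

// assessComplexity scores a request on three cheap signals: input length,
// reasoning keywords, and whether multiple documents are referenced.
// Thresholds and keywords here are illustrative, not tuned values.
func assessComplexity(input string, docCount int) ComplexityLevel {
	score := 0
	if len(input) > 2000 {
		score++
	}
	lower := strings.ToLower(input)
	for _, kw := range []string{"analyze", "compare", "debug"} {
		if strings.Contains(lower, kw) {
			score++
			break
		}
	}
	if docCount > 1 {
		score++
	}
	switch {
	case score >= 2:
		return ComplexityHigh
	case score == 1:
		return ComplexityMedium
	default:
		return ComplexityLow
	}
}
```

Two signals firing means high complexity; one means medium, which feeds the escalation path below.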
The medium-complexity path is where this gets interesting. Try the fast model first. If confidence is low, escalate to reasoning. This keeps costs down for tasks that turn out to be simpler than they look.
Async execution for expensive calls
Any reasoning model call that might take more than a few seconds shouldn’t block your HTTP handler. Here’s the pattern I use:
type Job struct {
	ID        string
	Status    string
	Request   Request
	Response  *Response
	CreatedAt time.Time
	mu        sync.Mutex // guards Status and Response once the worker goroutine starts
}

type AsyncExecutor struct {
	jobs   sync.Map
	router *Router
	notify func(jobID string, resp Response)
}

func (e *AsyncExecutor) Submit(ctx context.Context, req Request) (string, error) {
	job := &Job{
		ID:        generateID(),
		Status:    "pending",
		Request:   req,
		CreatedAt: time.Now(),
	}
	e.jobs.Store(job.ID, job)
	go func() {
		// Detached from the request context: the job outlives the HTTP call.
		resp, err := e.router.Route(context.Background(), req)
		job.mu.Lock()
		if err != nil {
			job.Status = "failed"
			job.mu.Unlock()
			return
		}
		job.Response = &resp
		job.Status = "completed"
		job.mu.Unlock()
		e.notify(job.ID, resp)
	}()
	return job.ID, nil
}

func (e *AsyncExecutor) Poll(jobID string) (*Job, bool) {
	val, ok := e.jobs.Load(jobID)
	if !ok {
		return nil, false
	}
	return val.(*Job), true // callers should hold job.mu before reading Status or Response
}
The caller gets a job ID back immediately. They can poll for status, or we can push a notification when it’s done. The UX team shows a “thinking deeply about this…” indicator. Users are surprisingly tolerant of waiting when you tell them why.
In production, you want a proper job queue (we use Redis) and persistence. But the pattern is the same.
Per-request cost budgets
This is the piece most teams skip, and it’s what prevents surprise bills. Every model call gets a token budget:
type Budget struct {
	MaxInputTokens  int
	MaxOutputTokens int
	MaxCostCents    int
	TimeoutSeconds  int
}

var (
	defaultBudget = Budget{
		MaxInputTokens:  4000,
		MaxOutputTokens: 1000,
		MaxCostCents:    5,
		TimeoutSeconds:  10,
	}
	premiumBudget = Budget{
		MaxInputTokens:  16000,
		MaxOutputTokens: 4000,
		MaxCostCents:    50,
		TimeoutSeconds:  60,
	}
)
func (r *Router) callModel(ctx context.Context, model string, req Request, budget Budget) (Response, error) {
	ctx, cancel := context.WithTimeout(ctx, time.Duration(budget.TimeoutSeconds)*time.Second)
	defer cancel()

	if req.EstimatedInputTokens() > budget.MaxInputTokens {
		return Response{}, fmt.Errorf("input exceeds budget: %d > %d tokens",
			req.EstimatedInputTokens(), budget.MaxInputTokens)
	}

	resp, err := r.client.Complete(ctx, model, req.ToPrompt(),
		WithMaxTokens(budget.MaxOutputTokens),
	)
	if err != nil {
		return Response{}, fmt.Errorf("model call failed: %w", err)
	}

	costCents := estimateCost(model, resp.Usage)
	if costCents > budget.MaxCostCents {
		log.Printf("WARN: call exceeded cost budget: %d > %d cents", costCents, budget.MaxCostCents)
	}
	return parseResponse(resp), nil
}
The budget is enforced before and during the call. Context timeouts prevent runaway reasoning. Token limits prevent ballooning inputs. Cost estimation after the call feeds monitoring and alerting.
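estimateCost is doing real work there, so here’s roughly what it looks like. The Usage struct and the per-model rates below are placeholders, not real prices; plug in your provider’s actual pricing:

```go
package main

import "math"

// Usage mirrors the token accounting most completion APIs return.
// This shape is an assumption for the sketch, not a specific SDK type.
type Usage struct {
	InputTokens  int
	OutputTokens int
}

// estimateCost converts token usage into cents via a per-model rate
// table. Rates are placeholder values in cents per 1K tokens.
func estimateCost(model string, u Usage) int {
	// {input rate, output rate}, cents per 1K tokens
	rates := map[string][2]float64{
		"fast":      {0.05, 0.15},
		"reasoning": {1.0, 4.0},
	}
	r, ok := rates[model]
	if !ok {
		r = rates["reasoning"] // price unknown models at the expensive tier
	}
	cents := float64(u.InputTokens)/1000*r[0] + float64(u.OutputTokens)/1000*r[1]
	return int(math.Ceil(cents)) // round up so budget checks stay conservative
}
```

Rounding up matters: a budget comparison against a truncated estimate will quietly let calls slip past the ceiling.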
At one company, we added a daily cost ceiling per endpoint. If the endpoint hits 80% of its daily budget by noon, it automatically downgrades all calls to the fast model for the rest of the day. Crude but effective.
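The downgrade check itself is a few lines. A sketch, assuming something external (a midnight cron, say) resets the counters daily; the names and the 80% threshold match the description above but the shape is illustrative:

```go
package main

import "sync"

// DailyCeiling tracks spend per endpoint and signals a downgrade to the
// fast model once 80% of the daily cap is consumed. The daily reset is
// left to an external scheduler; names here are illustrative.
type DailyCeiling struct {
	mu       sync.Mutex
	spent    map[string]int // cents spent today, per endpoint
	capCents int
}

func NewDailyCeiling(capCents int) *DailyCeiling {
	return &DailyCeiling{spent: make(map[string]int), capCents: capCents}
}

// Record adds a completed call's cost to the endpoint's running total.
func (d *DailyCeiling) Record(endpoint string, cents int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.spent[endpoint] += cents
}

// ForceFastModel reports whether the endpoint has hit 80% of its cap
// and should route everything to the fast model for the rest of the day.
func (d *DailyCeiling) ForceFastModel(endpoint string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.spent[endpoint]*5 >= d.capCents*4 // integer form of spent >= 0.8*cap
}
```

The router consults ForceFastModel before choosing a model, so the downgrade takes effect on the very next request.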
Caching reasoning results
Reasoning model outputs are expensive to produce but often reusable. Same contract clause reviewed twice? Same code pattern analyzed in different PRs? Cache it.
type ResultCache struct {
	store *redis.Client
	ttl   time.Duration
}

func (c *ResultCache) GetOrCompute(ctx context.Context, key string, compute func() (Response, error)) (Response, error) {
	cached, err := c.store.Get(ctx, key).Result()
	if err == nil {
		var resp Response
		if json.Unmarshal([]byte(cached), &resp) == nil {
			resp.FromCache = true
			return resp, nil
		}
	}

	resp, err := compute()
	if err != nil {
		return resp, err
	}

	// Best-effort write: a failed cache set just means we recompute later.
	if data, err := json.Marshal(resp); err == nil {
		c.store.Set(ctx, key, data, c.ttl)
	}
	return resp, nil
}
The cache key is a hash of the input and model version. When the model changes, the cache invalidates naturally. We use a 24-hour TTL for most analysis tasks and a 1-hour TTL for anything time-sensitive.
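Key derivation is cheap and deterministic. A sketch of the hashing; the key prefix and the trim-only normalization are illustrative choices, and you may want heavier normalization for your inputs:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// cacheKey hashes the model version together with the (lightly
// normalized) input, so bumping the model version naturally
// invalidates every old entry without an explicit flush.
func cacheKey(modelVersion, input string) string {
	h := sha256.New()
	h.Write([]byte(modelVersion))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
	h.Write([]byte(strings.TrimSpace(input)))
	return "reasoncache:" + hex.EncodeToString(h.Sum(nil))
}
```

The zero-byte separator is a small but real detail: without it, two different (version, input) pairs can concatenate to the same bytes and collide.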
This alone cut our reasoning model costs by about 40% on the code review pipeline, because many PRs touch similar patterns.
What I got wrong the first time
I initially tried to hide latency entirely. Bad idea. Users thought the system was broken. The moment we switched to explicit “this needs deeper analysis, checking now…” messaging, complaints dropped to zero. People understand that some questions take longer to answer well. Respect that.
I also over-routed to reasoning models early on. The classifier was too generous with “high complexity” ratings. We added a feedback loop: if a reasoning model call produces essentially the same output as a fast model would have (measured by comparing on a sample), downgrade the classification for that pattern. Within two weeks, our routing accuracy improved significantly.
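The feedback loop doesn’t need ML either. A hypothetical shape for the per-pattern agreement counter, with illustrative sample-size and agreement thresholds rather than our exact numbers:

```go
package main

// patternFeedback tracks, for one routing pattern, how often the
// reasoning model's output matched what the fast model produced on a
// sampled comparison. High agreement means reasoning was overkill.
type patternFeedback struct {
	agree, total int
}

// Record logs one sampled comparison between the two models' outputs.
func (f *patternFeedback) Record(outputsMatched bool) {
	f.total++
	if outputsMatched {
		f.agree++
	}
}

// ShouldDowngrade is true once there are enough samples and the models
// agreed at least 90% of the time. 20 and 0.9 are illustrative knobs.
func (f *patternFeedback) ShouldDowngrade() bool {
	return f.total >= 20 && float64(f.agree) >= 0.9*float64(f.total)
}
```

When ShouldDowngrade fires for a pattern, the classifier's rating for that pattern drops a level, and the sampling continues in case the traffic shifts back.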
The architecture, summarized
Request → Complexity Classifier → Router
            ├── Low → Fast Model (sync)
            ├── Medium → Fast Model → check confidence → maybe Reasoning Model
            └── High → Async Executor → Reasoning Model → Notify

All paths → Budget Enforcement → Cache Check → Model Call → Response
Treat reasoning models as a premium tier. Route intelligently. Execute async when latency matters. Budget every call. Cache reusable results. The model does the thinking. Your job is to make sure it only thinks when it needs to.