Running AI Locally: A Practical Guide for Teams Who Care About Control


Local AI is no longer a hobby project. Here's how to set it up properly: provider abstraction, versioned models, evaluation harnesses, and cloud fallback for when local isn't enough.

Quick take

Local AI development is a legitimate option for teams that need data control, predictable costs, or offline capability. The tradeoff is operational work. Keep the stack small, abstract the provider behind an interface, version your models like you version your code, maintain an eval set, and always keep a cloud fallback for quality-critical paths.


I run local models daily: in production projects, for prototypes, and for anything involving sensitive data that shouldn’t leave my machine. The tooling has matured enough that this is no longer a novelty; it’s a practical engineering choice with clear tradeoffs.

I’ve also seen teams go all-in on local AI without understanding what they’re signing up for. Running your own models means owning the full lifecycle: model selection, quantization, runtime management, version pinning, quality monitoring, and fallback strategies. If you aren’t prepared for that operational load, use a managed API.

This post is for teams who have decided local makes sense and want to do it properly.

When Local Is the Right Call

Local AI makes sense in specific scenarios:

  • Sensitive data. Proprietary code, financial records – anything you don’t want leaving your network. I frequently work with data under NDA, and local inference means the data never touches a third-party API.
  • Predictable costs. API costs scale with usage; local costs scale with hardware. For high-volume routine tasks – classification, extraction, summarization – local can be dramatically cheaper once you amortize the hardware.
  • Offline or air-gapped environments. Some deployments don’t have reliable internet. Some shouldn’t have it. My NATO background drilled this in – there are environments where external API calls aren’t just inconvenient; they aren’t allowed.
  • Deterministic CI testing. When your tests depend on model output, you need a pinned model version that doesn’t change between runs. Local gives you that control.

Local is the wrong call when you need frontier-level quality on every request or your team can’t absorb the operational overhead.

The Provider Abstraction

First rule: never hard-code your provider. Whether you’re using Ollama, llama.cpp, vLLM, or a cloud API, the rest of your code shouldn’t care. Hide it behind an interface.

In Go, this is clean:

// Provider defines the contract for any AI backend.
type Provider interface {
    Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error)
    Embed(ctx context.Context, input string) ([]float64, error)
    Health(ctx context.Context) error
}

type CompletionRequest struct {
    Model       string
    Messages    []Message
    MaxTokens   int
    Temperature float64
}

type CompletionResponse struct {
    Content      string
    TokensUsed   int
    Model        string
    FinishReason string
}

Now your local and cloud providers implement the same interface. Switching between them is a config change, not a code rewrite. Testing is trivial: mock the interface and move on.

type OllamaProvider struct {
    endpoint string
    client   *http.Client
}

func NewOllamaProvider(endpoint string) *OllamaProvider {
    return &OllamaProvider{
        endpoint: endpoint,
        client: &http.Client{
            Timeout: 120 * time.Second,
        },
    }
}

func (o *OllamaProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    body := ollamaRequest{
        Model:    req.Model,
        Messages: toOllamaMessages(req.Messages),
        Stream:   false,
        Options: ollamaOptions{
            Temperature: req.Temperature,
            NumPredict:  req.MaxTokens,
        },
    }

    resp, err := o.post(ctx, "/api/chat", body)
    if err != nil {
        return CompletionResponse{}, fmt.Errorf("ollama completion: %w", err)
    }

    return CompletionResponse{
        Content:      resp.Message.Content,
        TokensUsed:   resp.EvalCount,
        Model:        resp.Model,
        FinishReason: resp.DoneReason,
    }, nil
}

The Fallback Chain

Local models are good. They aren’t always good enough. For quality-critical paths – user-facing content generation, complex reasoning tasks, anything where a wrong answer costs real money – you need a fallback to a stronger model.

type FallbackProvider struct {
    primary   Provider
    fallback  Provider
    threshold float64 // confidence threshold for fallback
}

func (f *FallbackProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    resp, err := f.primary.Complete(ctx, req)
    if err != nil {
        // Primary failed, try fallback
        slog.Warn("primary provider failed, using fallback", "error", err)
        return f.fallback.Complete(ctx, req)
    }
    return resp, nil
}

In practice, I extend this with confidence scoring – if the local model returns a low-confidence response, automatically retry with the cloud provider. The core pattern is simple: try local first, fall back to cloud when needed, and log fallbacks so you know how often they happen.

Configuration That Travels

Keep your AI configuration in a structured file in source control. Everything – model names, endpoints, fallback rules, temperature settings – should be declarative and version-controlled.

ai:
  default_provider: local

  providers:
    local:
      type: ollama
      endpoint: http://127.0.0.1:11434
      models:
        completion: "mistral:7b-instruct-v0.3-q5_K_M"
        embedding: "nomic-embed-text:latest"
      timeout: 120s

    cloud:
      type: openai
      # API key from environment: AI_CLOUD_API_KEY
      models:
        completion: "gpt-4o"
        embedding: "text-embedding-3-small"
      timeout: 30s

  fallback:
    enabled: true
    primary: local
    secondary: cloud
    on_error: true
    on_low_confidence: true
    confidence_threshold: 0.7

  evaluation:
    eval_set_path: "./eval/fixtures"
    run_on_model_change: true

The model name includes the quantization level. This is deliberate. mistral:7b-instruct-v0.3-q5_K_M is not the same as mistral:7b-instruct-v0.3-q4_0. Different quantization levels produce different outputs. Pin it.
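In Go, that file maps onto plain structs. The yaml tags below assume gopkg.in/yaml.v3 for loading (with a top-level wrapper like struct{ AI AIConfig `yaml:"ai"` }), and the Validate method is a sketch of the startup checks worth doing:

```go
package main

import "fmt"

// AIConfig mirrors the YAML file above.
type AIConfig struct {
	DefaultProvider string                    `yaml:"default_provider"`
	Providers       map[string]ProviderConfig `yaml:"providers"`
	Fallback        FallbackConfig            `yaml:"fallback"`
}

type ProviderConfig struct {
	Type     string            `yaml:"type"`
	Endpoint string            `yaml:"endpoint"`
	Models   map[string]string `yaml:"models"`
	Timeout  string            `yaml:"timeout"`
}

type FallbackConfig struct {
	Enabled             bool    `yaml:"enabled"`
	Primary             string  `yaml:"primary"`
	Secondary           string  `yaml:"secondary"`
	OnError             bool    `yaml:"on_error"`
	OnLowConfidence     bool    `yaml:"on_low_confidence"`
	ConfidenceThreshold float64 `yaml:"confidence_threshold"`
}

// Validate catches config mistakes at startup instead of at request time.
func (c AIConfig) Validate() error {
	if _, ok := c.Providers[c.DefaultProvider]; !ok {
		return fmt.Errorf("default_provider %q is not defined", c.DefaultProvider)
	}
	if c.Fallback.Enabled {
		for _, name := range []string{c.Fallback.Primary, c.Fallback.Secondary} {
			if _, ok := c.Providers[name]; !ok {
				return fmt.Errorf("fallback references undefined provider %q", name)
			}
		}
	}
	return nil
}

func main() {
	cfg := AIConfig{
		DefaultProvider: "local",
		Providers: map[string]ProviderConfig{
			"local": {Type: "ollama", Endpoint: "http://127.0.0.1:11434"},
		},
		Fallback: FallbackConfig{Enabled: true, Primary: "local", Secondary: "cloud"},
	}
	// "cloud" is referenced by the fallback but never defined, so this fails fast.
	fmt.Println(cfg.Validate())
}
```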

Versioning and Reproducibility

This is where most local setups fall apart. Someone updates the model, doesn’t tell the team, and suddenly outputs are different. Tests still pass because nobody wrote quality assertions – they just check that the model returned something.

Version these things:

  • Model file hash. SHA256 the model binary. Store the hash in your lockfile or config. If the hash changes, the model changed.
  • Runtime version. Pin your Ollama or llama.cpp version in your Dockerfile or setup script.
  • Prompt templates. Keep them in source control alongside the code that uses them. Prompt drift is real and insidious.

A Dockerfile makes both pins explicit:

FROM ollama/ollama:0.3.12

# Pull and pin the exact model version. ollama pull needs a running
# server, so start one temporarily inside this layer.
RUN ollama serve & sleep 5 && ollama pull mistral:7b-instruct-v0.3-q5_K_M

# Copy eval fixtures for smoke test
COPY eval/fixtures /eval/fixtures

The Evaluation Harness

You need an eval set. Not optional. It should be a small collection of representative inputs with expected outputs that you run every time you change a model, update a prompt, or modify provider configuration.

func TestModelQuality(t *testing.T) {
    provider := setupLocalProvider(t)

    fixtures := loadEvalFixtures(t, "./eval/fixtures")
    var passed, failed int

    for _, fix := range fixtures {
        resp, err := provider.Complete(context.Background(), fix.Request)
        if err != nil {
            t.Errorf("fixture %s: %v", fix.Name, err)
            failed++
            continue
        }

        if !fix.Validate(resp.Content) {
            t.Errorf("fixture %s: expected pattern %q, got %q",
                fix.Name, fix.ExpectedPattern, resp.Content)
            failed++
            continue
        }
        passed++
    }

    if passed+failed == 0 {
        t.Fatal("no eval fixtures found")
    }
    passRate := float64(passed) / float64(passed+failed)
    if passRate < 0.85 {
        t.Fatalf("pass rate %.1f%% below threshold 85%%", passRate*100)
    }
}

Run this in CI. Run it before every model swap. Run it when you change prompts. The eval harness is what keeps you from shipping a regression you don’t notice for two weeks.
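The fixture type the test leans on isn't shown; here's one plausible shape, matching a pattern rather than an exact string, since even a pinned model varies slightly in wording:

```go
package main

import (
	"fmt"
	"regexp"
)

// Trimmed request type so the snippet stands alone.
type CompletionRequest struct {
	Model       string
	MaxTokens   int
	Temperature float64
}

// EvalFixture pairs a canned request with a regex the output must match.
type EvalFixture struct {
	Name            string
	Request         CompletionRequest
	ExpectedPattern string
}

// Validate reports whether the model output matches the expected pattern.
func (f EvalFixture) Validate(content string) bool {
	matched, err := regexp.MatchString(f.ExpectedPattern, content)
	return err == nil && matched
}

func main() {
	fix := EvalFixture{
		Name:            "capital-france",
		ExpectedPattern: `(?i)\bparis\b`,
	}
	fmt.Println(fix.Validate("The capital of France is Paris.")) // true
	fmt.Println(fix.Validate("I am not sure."))                  // false
}
```

Loose assertions like this are what let the harness survive harmless wording changes while still catching genuine regressions.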

Performance Tuning Order

If local inference is too slow, fix it in this order:

  1. Smaller model. For routine tasks – classification, extraction, simple summarization – a 7B parameter model is often sufficient. Don’t run a 70B model for ticket triage.
  2. Quantization. Q5_K_M is usually the sweet spot between quality and speed. Q4_0 is faster but you’ll notice quality degradation on complex tasks. Measure with your eval set before committing.
  3. Batching. If you have throughput-heavy workloads, batch requests. Most local runtimes support this. The latency per request goes up slightly but throughput goes up dramatically.
  4. Hardware. GPU inference is 10-50x faster than CPU for most model sizes. If you’re serious about local AI, budget for a decent GPU. An RTX 4090 handles a 7B model comfortably.
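Step 3 can be sketched client-side as a size-triggered batcher. BatchEmbedder and its submit callback are illustrative names – a real implementation would also flush on a timer so a half-full batch never waits indefinitely:

```go
package main

import "fmt"

// BatchEmbedder buffers inputs and hands them off in groups, trading a
// little per-request latency for much better throughput.
type BatchEmbedder struct {
	max    int
	buf    []string
	submit func(batch []string) // e.g. one runtime call embedding many inputs
}

// Add buffers one input and flushes when the batch is full.
func (b *BatchEmbedder) Add(input string) {
	b.buf = append(b.buf, input)
	if len(b.buf) >= b.max {
		b.Flush()
	}
}

// Flush submits whatever is buffered; call it on a timer too.
func (b *BatchEmbedder) Flush() {
	if len(b.buf) == 0 {
		return
	}
	batch := make([]string, len(b.buf))
	copy(batch, b.buf)
	b.buf = b.buf[:0]
	b.submit(batch)
}

func main() {
	var batches [][]string
	b := &BatchEmbedder{max: 2, submit: func(batch []string) {
		batches = append(batches, batch)
	}}
	for _, s := range []string{"a", "b", "c"} {
		b.Add(s)
	}
	b.Flush() // drain the leftover "c"
	fmt.Println(len(batches), batches) // 2 [[a b] [c]]
}
```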

The Honest Tradeoff

Local AI gives you control, privacy, and predictable costs. In exchange, you take on operational responsibility for model management, quality monitoring, and infrastructure maintenance. That’s a fair trade for the right workloads.

Keep the stack small. Abstract the provider. Version everything. Measure quality continuously. Keep a cloud fallback for the moments when local isn’t enough.

The teams that do this well treat local AI like any other infrastructure dependency – with discipline, not enthusiasm.