My First Week Building with GPT-4

ai gpt-4 openai llm

GPT-4 landed and everything changed. What I learned in the first week of building with it, and the architecture decisions that followed.

I got GPT-4 access on launch day. Dropped what I was doing at a financial infrastructure company for the afternoon and fed it one of our hairier code review problems – a ledger reconciliation edge case that had been living in our backlog because nobody wanted to untangle it.

GPT-3.5 would have given me a plausible-looking answer full of subtle errors. GPT-4 gave me something I could actually use. Not perfect. But the reasoning was there. It followed multi-step logic without losing the thread. It handled long context without the usual degradation.

I sat there for a minute thinking, “okay, this is different.”

Then I looked at the API pricing and thought, “okay, this is also expensive.”

The Capability Jump Is Real

Let me be specific about what changed. GPT-4 isn’t just GPT-3.5 but better. It’s qualitatively different in three ways:

Multi-step reasoning actually works. You can give it a complex problem with dependencies between steps, and it holds the chain together. With 3.5, long reasoning chains would quietly go off the rails by step three. GPT-4 stays coherent.

Instruction following is reliable. When I say “return JSON with these exact fields,” it returns JSON with those exact fields. Not most of the time – basically every time. This sounds minor but it’s a massive improvement for programmatic use.
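Even with that reliability, I still validate before trusting the output. A minimal sketch of the guardrail I keep around structured replies (the schema and field names here are hypothetical):

```python
import json

# Hypothetical schema for a code-review style response.
REQUIRED_FIELDS = {"summary", "risk_level", "needs_human_review"}

def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply that was asked to return JSON with exact fields.

    Raises ValueError on malformed JSON or missing fields, so the caller
    can retry or fall back instead of propagating bad data downstream.
    """
    data = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"reply missing fields: {sorted(missing)}")
    return data
```

"Basically every time" is not "every time," and a three-line check is cheaper than debugging a silent schema drift in production.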

Long prompts don’t break it. You can stuff a lot of context in there and the model still performs. This opens up use cases that weren’t practical before – real document analysis, code review over large files, complex multi-document synthesis.

The catch is that all of these improvements come with tradeoffs: higher cost per request, slower response times, and rate limits that make you think carefully about when to use it.

The Architecture Decision: Route, Don’t Default

My first instinct was to swap GPT-3.5 for GPT-4 everywhere. That instinct lasted about an hour, until I estimated the monthly cost.

The right approach is a routing layer. Most requests don’t need GPT-4. Classification, simple summarization, format conversion – 3.5 handles those fine and costs a fraction. Reserve GPT-4 for the tasks that actually benefit: complex analysis, multi-step reasoning, anything where quality matters more than speed.

A simple routing strategy:

  • Classify the request complexity (this can be rule-based, it doesn’t need to be clever)
  • Route simple tasks to GPT-3.5-turbo
  • Route complex or high-value tasks to GPT-4
  • Cache aggressively at both levels
  • Fall back to 3.5 if GPT-4 is slow or unavailable
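
The strategy above can be sketched in a few lines. The task names, length threshold, and the `call_model(model, prompt)` wrapper are all illustrative, not a real API:

```python
# Rule-based router with caching and fallback. Task names, the length
# threshold, and call_model(model, prompt) are illustrative assumptions.

CHEAP_MODEL = "gpt-3.5-turbo"
EXPENSIVE_MODEL = "gpt-4"

SIMPLE_TASKS = {"classify", "summarize_short", "convert_format"}

_cache = {}  # (model, prompt) -> completion

def pick_model(task: str, prompt: str, high_value: bool = False) -> str:
    """Rule-based complexity check; it doesn't need to be clever."""
    if high_value:
        return EXPENSIVE_MODEL
    if task in SIMPLE_TASKS and len(prompt) < 4_000:
        return CHEAP_MODEL
    return EXPENSIVE_MODEL

def complete(task: str, prompt: str, call_model, high_value: bool = False) -> str:
    model = pick_model(task, prompt, high_value)
    if (model, prompt) in _cache:
        return _cache[(model, prompt)]
    try:
        out = call_model(model, prompt)
    except Exception:
        if model != EXPENSIVE_MODEL:
            raise
        model = CHEAP_MODEL  # GPT-4 slow or unavailable: degrade to 3.5
        out = call_model(model, prompt)
    _cache[(model, prompt)] = out
    return out
```

The point is the shape, not the thresholds: the routing decision lives in one place, so when pricing or model quality changes, you change one function instead of hunting call sites.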

This isn’t premature optimization. At GPT-4 pricing, running everything through it is a budget decision disguised as an engineering decision. Make it explicit.
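To make the budget point concrete, here is the back-of-envelope math, using the list prices I saw at launch ($0.03/1K prompt and $0.06/1K completion tokens for GPT-4 8k, $0.002/1K for GPT-3.5-turbo). Verify against current pricing before trusting these numbers:

```python
# Back-of-envelope monthly cost. Prices are launch-day list prices as an
# assumption -- check current pricing, these change.
PRICE_PER_1K = {
    "gpt-4":         {"prompt": 0.03,  "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.002, "completion": 0.002},
}

def monthly_cost(model: str, requests: int, prompt_toks: int, completion_toks: int) -> float:
    p = PRICE_PER_1K[model]
    per_request = (prompt_toks / 1000) * p["prompt"] + (completion_toks / 1000) * p["completion"]
    return requests * per_request

# 300k requests/month at ~1,000 prompt + ~300 completion tokens each:
# gpt-4:          $0.048/request * 300,000 = $14,400/month
# gpt-3.5-turbo:  $0.0026/request * 300,000 = $780/month
```

At that spread, routing 80% of traffic to 3.5 isn't an optimization, it's the difference between a line item and a budget meeting.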

Context Windows: More Isn’t Better

The larger context window is genuinely useful, but “I can fit more in” isn’t the same as “I should put more in.” Larger prompts are slower and more expensive. Just because the model can handle 8k or 32k tokens doesn’t mean every request should use that much.

The principle stays the same: use retrieval and summarization so the model sees the right context, not all context. If anything, the temptation to dump everything into the prompt is stronger with GPT-4 because it handles it gracefully. Resist that temptation. Your bill will thank you.
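A sketch of what “the right context, not all context” can look like. The scoring here is plain word overlap as a stand-in for embedding similarity (which is what you would actually use), and the budget numbers are arbitrary:

```python
def select_context(query: str, chunks: list, budget_chars: int = 6_000, top_k: int = 5) -> str:
    """Keep only the chunks most relevant to the query, within a size budget.

    Word overlap is a crude stand-in for embedding similarity; the point is
    the shape -- score, rank, cap -- not the scoring function itself.
    """
    q_words = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    picked, used = [], 0
    for chunk in scored[:top_k]:
        if used + len(chunk) > budget_chars:
            break
        picked.append(chunk)
        used += len(chunk)
    return "\n\n".join(picked)
```

Even this crude filter keeps the irrelevant material out of the prompt, which is most of the cost savings; swapping in real retrieval improves quality, not the economics.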

Multimodal Is Coming, But Not Yet

GPT-4’s vision capabilities are impressive in demos. For production use, they need the same treatment as text: normalization, size limits, content policy checks, and explicit handling of failure cases. I’m not rushing to ship image features. The text capabilities alone are enough to keep me busy for months.

How I’m Rolling This Out

I’m not flipping a switch. Staged rollout:

  1. Internal eval first. Run the existing test suite against GPT-4 to establish a quality baseline. This took me a day and was worth every minute.
  2. Shadow traffic. Send a percentage of production requests to GPT-4 in parallel, compare outputs, don’t serve them. This surfaces quality differences without risk.
  3. Selective routing. Move high-value tasks to GPT-4 first – the ones where quality improvements directly affect the user experience.

The eval step is non-negotiable. If you can’t measure whether GPT-4 is actually better for your use case, you’re just spending more money on vibes.
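The shadow step can be sketched in a few lines. The helper names are hypothetical, and in a real service the GPT-4 call would run off the request path (async or queued), not inline as shown here:

```python
import random

def handle_with_shadow(prompt, call_model, record, shadow_rate=0.05, rng=random.random):
    """Serve the GPT-3.5 answer; shadow a sample of traffic to GPT-4.

    `record` persists (prompt, served, shadow) for offline comparison.
    The shadow output is never returned to the user, so quality
    differences surface without production risk.
    """
    served = call_model("gpt-3.5-turbo", prompt)
    if rng() < shadow_rate:
        shadow = call_model("gpt-4", prompt)  # inline here; async in practice
        record(prompt, served, shadow)
    return served
```

Once the recorded pairs accumulate, the comparison is an offline job: diff the outputs, score them against your eval suite, and let the numbers decide which tasks graduate to selective routing.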

What I’m Actually Excited About

Code understanding is the immediate win for me. I’ve been using GPT-4 to review Go code, explain complex financial logic, and draft test cases. The quality is high enough that it saves real time. Not a replacement for human review, but a strong first pass that catches things I might skim over at 6pm on a Friday.

The deeper opportunity is in workflows where errors are expensive. Financial reconciliation, compliance checks, anything where a confident wrong answer causes real damage. GPT-4’s improved reliability makes these use cases worth exploring in a way they weren’t before.

What Matters

GPT-4 is a meaningful upgrade. Not incremental. The reasoning quality, instruction following, and context handling are genuinely better. But it’s also more expensive and slower, which means the architecture around it matters more, not less.

Route by complexity. Cache aggressively. Stage your rollout. Measure everything. The model got better, but the engineering discipline stays the same.