Quick take
Everyone is building agents. Almost nobody is building agents that work reliably. The gap isn’t prompting – it’s systems design. Loops, memory, tool boundaries, and evaluation are the boring parts that decide whether your agent ships or stays a demo.
I’ve been building automation systems since before “AI agent” became a LinkedIn headline. Control loops, state machines, orchestration – this is distributed systems work wearing a new hat. The concepts aren’t new. The failure modes, however, are spectacular in ways I didn’t anticipate.
The gap between a demo agent and a production agent is enormous. I’ve seen this firsthand at a fintech company, where we started exploring how agents could assist with ledger operations. Impressive in a notebook. Terrifying when you imagine it touching real money.
What Actually Makes Something an “Agent”
Strip away the hype and an agent is just a loop with tools and state. A single model call is a function: input in, output out. An agent wraps that in a cycle – decide, act, observe, update – and gives the model access to external tools between iterations.
That’s the entire magic. And that loop is where everything goes wrong.
task -> decide next action -> call tool -> observe result -> update state -> done?
        ^                                                                      |
        +--------------------------------- no ---------------------------------+
The discipline isn’t in making the loop clever. It’s in making each step explicit, logged, and bounded. Every decision should be traceable. Every tool call should be recorded. And there must be a clear termination condition, because an agent without a stop condition is just a very expensive infinite loop.
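That bounded, logged loop can be sketched in a few lines of Go. The `AgentLoop` type and its fields are illustrative assumptions, not any framework's API:

```go
package main

import (
	"errors"
	"fmt"
)

// AgentLoop is an illustrative sketch of a decide-act-observe cycle.
// The type and field names are assumptions, not a framework API.
type AgentLoop struct {
	MaxIterations int
	Decide        func(state map[string]any) (action string, done bool)
	Act           func(action string) (any, error)
}

// Run executes the loop with an explicit budget and logs every step,
// so the agent can never become a very expensive infinite loop.
func (a *AgentLoop) Run(state map[string]any) error {
	for i := 0; i < a.MaxIterations; i++ {
		action, done := a.Decide(state)
		if done {
			return nil
		}
		result, err := a.Act(action)
		fmt.Printf("iter=%d action=%q err=%v\n", i, action, err) // trace every decision
		if err != nil {
			return fmt.Errorf("iteration %d: %w", i, err)
		}
		state[action] = result // observe -> update state
	}
	return errors.New("iteration budget exhausted without completion")
}

func main() {
	count := 0
	loop := &AgentLoop{
		MaxIterations: 10,
		Decide:        func(map[string]any) (string, bool) { return "inc", count >= 3 },
		Act:           func(string) (any, error) { count++; return count, nil },
	}
	if err := loop.Run(map[string]any{}); err != nil {
		fmt.Println("loop failed:", err)
	}
}
```

The important design choice is that the budget check and the logging live in the loop itself, not in the model's prompt.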
Planning: Two Patterns Worth Knowing
Plan-Then-Execute
The agent drafts a plan before touching any tools. Then an executor walks the plan step by step, revising as new information arrives.
I like this for bounded tasks. “Gather these three pieces of data and produce a summary.” It falls apart when the first step invalidates the plan and the agent can’t recover gracefully. In Go terms, think of it like writing your entire execution path before handling any errors – you would never do that in real code, so be cautious doing it with agents.
// ToolRegistry is assumed to look up and invoke tools by name.
type ToolRegistry interface {
    Call(ctx context.Context, tool string, input any) (any, error)
}

type Plan struct {
    Steps   []Step
    Current int
    Context map[string]any
}

type Step struct {
    Action    string
    Tool      string
    Input     any
    Completed bool
    Result    any
}

func (p *Plan) Execute(ctx context.Context, tools ToolRegistry) error {
    for p.Current < len(p.Steps) {
        step := &p.Steps[p.Current]
        result, err := tools.Call(ctx, step.Tool, step.Input)
        if err != nil {
            // This is where most agent frameworks fall over.
            // They retry blindly instead of re-planning.
            return fmt.Errorf("step %d failed: %w", p.Current, err)
        }
        step.Result = result
        step.Completed = true
        p.Context[step.Action] = result
        p.Current++
    }
    return nil
}
The real question is what happens when Execute returns an error. Most frameworks retry. Better systems re-plan. The best systems ask a human.
Hierarchical Agents
A manager decomposes the task and delegates to specialists. This sounds elegant. In practice, it’s coordination overhead that would make any microservices architect wince.
Without tight interfaces between the manager and specialists, you get inconsistent outputs, duplicated work, and blame loops where the manager blames the specialist and the specialist blames the prompt. I’ve seen this pattern work exactly once – when the specialist agents had extremely narrow, well-tested tool access and the manager was basically a router.
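That one working shape looks roughly like this in Go: the "manager" classifies and delegates, nothing more. The `Specialist` interface and keyword routing are illustrative assumptions, with keyword matching standing in for whatever classifier you actually use:

```go
package main

import (
	"fmt"
	"strings"
)

// Specialist handles exactly one category of task. Keeping this
// interface narrow is the entire point.
type Specialist interface {
	Handle(task string) (string, error)
}

// SpecialistFunc adapts a plain function to the interface.
type SpecialistFunc func(task string) (string, error)

func (f SpecialistFunc) Handle(task string) (string, error) { return f(task) }

// Router is the "manager": it classifies and delegates, nothing more.
type Router struct {
	Routes map[string]Specialist
}

func (r *Router) Dispatch(task string) (string, error) {
	for keyword, s := range r.Routes {
		if strings.Contains(strings.ToLower(task), keyword) {
			return s.Handle(task)
		}
	}
	return "", fmt.Errorf("no specialist for task: %q", task)
}

func main() {
	r := &Router{Routes: map[string]Specialist{
		"refund": SpecialistFunc(func(t string) (string, error) { return "refunds: " + t, nil }),
	}}
	out, _ := r.Dispatch("process refund for order 42")
	fmt.Println(out)
}
```

Note that an unrouteable task is an error, not an invitation for the manager to improvise.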
Memory: The Part Everyone Gets Wrong
Short-Term Memory
This is your rolling context. Recent actions, tool results, intermediate conclusions. It keeps the agent grounded and prevents it from repeating itself.
The trap is letting this grow unbounded. Every token in context costs money and attention. I’ve seen agents degrade because their context window filled with irrelevant observations from twenty steps ago. Summarize aggressively. Prune ruthlessly.
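One way to sketch that policy, assuming each observation already carries a cheap one-line summary (the `Observation` type and `PruneContext` are illustrative, not a library API):

```go
package main

import "fmt"

// Observation is one entry in the rolling context. The Summary field
// stands in for a model-generated one-liner.
type Observation struct {
	Step    int
	Text    string // full tool output or reasoning
	Summary string // aggressive one-line summary
}

// PruneContext keeps the most recent `keep` observations verbatim and
// collapses everything older into its summary, bounding context growth.
func PruneContext(history []Observation, keep int) []string {
	cutoff := len(history) - keep
	if cutoff < 0 {
		cutoff = 0
	}
	out := make([]string, 0, len(history))
	for i, obs := range history {
		if i < cutoff {
			out = append(out, obs.Summary) // old: summarized
		} else {
			out = append(out, obs.Text) // recent: verbatim
		}
	}
	return out
}

func main() {
	h := []Observation{
		{1, "4 KB of raw API response...", "fetched account list"},
		{2, "another long tool trace...", "validated schema"},
		{3, "latest result in full", "produced summary"},
	}
	fmt.Println(PruneContext(h, 1))
}
```

The policy is deliberately dumb; the win comes from applying it on every iteration, not from cleverness.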
Long-Term Memory
This is retrieval. Vector stores, knowledge bases, past conversation history. Everyone wants it. Few implement it well.
The critical mistake is treating retrieved content as ground truth. It isn’t. It’s a noisy signal. Summaries drift. Embeddings match on vibes, not facts. Old data can be actively wrong in a new context.
type MemoryStore interface {
    // Store should include metadata about when and why
    Store(ctx context.Context, key string, value any, meta Metadata) error
    // Retrieve should return confidence, not just results
    Retrieve(ctx context.Context, query string) ([]MemoryResult, error)
}

type MemoryResult struct {
    Value      any
    Confidence float64 // Be honest about this number
    Age        time.Duration
    Source     string
}
If your Confidence field is always 1.0, you’re lying to yourself.
Tool Access: Least Privilege Isn’t Optional
Tool access is the most dangerous part of any agent system. An agent with unrestricted tool access is a remote code execution vulnerability with a friendly chat interface.
At a fintech company, the idea of an agent with write access to a financial ledger kept me up at night. The answer was layered controls:
- Allowlists over denylists. The agent can only call tools you explicitly permit. Everything else is denied.
- Schema validation on every call. The tool input must match a strict schema. No free-form execution.
- Human approval for destructive actions. Anything that mutates state gets a confirmation step. Non-negotiable.
- Comprehensive logging. Every tool call, every result, every decision. If the agent does something wrong, the logs are the only way to understand why.
My NATO background makes me paranoid about this stuff, and I think that paranoia is appropriate. Defense in depth isn’t overkill when you’re giving an LLM the ability to take actions in the real world.
Evaluation: Build It Before You Build the Agent
This is my strongest opinion on agents: if you don’t have an evaluation harness before you write your first prompt, you aren’t engineering. You’re hoping.
Evaluation for agents is harder than for single model calls because the output space is wider and the failure modes are sequential. An agent can make the right decision seven times and then do something insane on step eight.
What to measure:
- Task completion: Did it actually solve the problem? Not “did it produce output” – did the output solve the real problem?
- Tool correctness: Were tools called with valid inputs? Were results interpreted correctly?
- Efficiency: How many steps did it take? How many tokens did it burn? An agent that solves the problem in 47 steps when 5 would suffice is a cost center, not a feature.
- Safety: Did guardrails fire appropriately? Were there near-misses?
Build a small eval set of 20-30 representative tasks. Run it on every change. This is your regression suite. It doesn’t need to be fancy. It needs to exist.
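A minimal harness along those lines fits in a page. This is a sketch with illustrative type names, scoring three of the measures above (completion, correctness, efficiency); safety checks would plug in the same way:

```go
package main

import "fmt"

// EvalCase is one regression task; the field names are assumptions,
// not a framework API.
type EvalCase struct {
	Name     string
	Run      func() (steps int, output string, err error)
	Check    func(output string) bool // did it solve the real problem?
	MaxSteps int                      // efficiency budget
}

type EvalResult struct {
	Name   string
	Passed bool
	Reason string
}

// RunEvals scores completion, correctness, and efficiency per case.
func RunEvals(cases []EvalCase) (results []EvalResult, failures int) {
	for _, c := range cases {
		r := EvalResult{Name: c.Name, Passed: true}
		steps, out, err := c.Run()
		switch {
		case err != nil:
			r.Passed, r.Reason = false, fmt.Sprintf("error: %v", err)
		case !c.Check(out):
			r.Passed, r.Reason = false, "output did not solve the task"
		case steps > c.MaxSteps:
			r.Passed, r.Reason = false, fmt.Sprintf("took %d steps, budget was %d", steps, c.MaxSteps)
		}
		if !r.Passed {
			failures++
		}
		results = append(results, r)
	}
	return results, failures
}

func main() {
	cases := []EvalCase{{
		Name:     "summarize-three-sources",
		Run:      func() (int, string, error) { return 4, "summary: ...", nil },
		Check:    func(o string) bool { return len(o) > 0 },
		MaxSteps: 5,
	}}
	_, failures := RunEvals(cases)
	fmt.Println("failures:", failures)
}
```

Wire `RunEvals` into CI so a nonzero failure count blocks the merge, and the harness becomes the regression suite.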
The Simplicity Principle
Every agent system I’ve seen fail had one thing in common: it was more complex than it needed to be. The ones that worked started simple and earned their complexity.
- Start with a single loop. No hierarchy, no multi-agent orchestration.
- Add one tool at a time. Test it thoroughly before adding the next.
- Keep the context window lean. Summarize old observations.
- Make every action traceable in logs.
- Set hard limits on iterations, token spend, and wall-clock time.
- Design your fallback path (usually “ask a human”) before your happy path.
Agent architecture is systems design. The best patterns make behavior predictable, failures obvious, and improvements measurable. The rest is hype.