Quick take
An agent that can read data and change state isn’t a chatbot with extra steps. It’s a system with real blast radius. Constrain it with explicit policies, prefer structured workflows over free-form loops, and invest in observability before you invest in capabilities. The boring stuff is what makes agents safe to ship.
There’s a moment in every agentic AI demo that makes the audience gasp. The agent reads a database, reasons about the results, drafts an email, and sends it. Autonomously. It feels like magic.
Then someone asks: “What happens if it sends the wrong email?” And the room gets quiet.
I’ve been building agentic systems for several months now. The demo-to-production gap here is wider than almost anywhere else in AI engineering. A chatbot that hallucinates is annoying. An agent that hallucinates and then acts on the hallucination is a liability.
The difference between teams that ship agents successfully and teams that revert after a week comes down to three things: boundaries, structure, and boring reliability work.
Boundaries first, capabilities second
Almost every team starts with capabilities. “What tools should the agent have? What actions can it take?” Wrong starting point.
Start with constraints. What is the agent not allowed to do? What’s the maximum blast radius of a single run? What happens when it goes wrong?
A policy config is the simplest way to make these constraints explicit and auditable:
```yaml
agent_policy:
  allowed_tools: [read_db, write_ticket, send_email_draft]
  max_steps: 8
  max_runtime_seconds: 120
  max_cost_usd: 0.50
  approval_required: [send_email, issue_refund, modify_production]
```
This isn’t a suggestion. It’s the foundation. The allowed tools list is an allowlist, not a blocklist – the agent can only use what’s explicitly permitted. Step and time limits prevent runaway loops. Cost caps prevent a single request from draining your budget. The approval list separates actions that are safe to automate from actions that need a human in the loop.
At one delivery company I worked with, a team skipped the approval step for “low-risk” actions. One of those low-risk actions turned out to be updating customer records. An agent misinterpreted a support request and bulk-updated addresses for a batch of orders. The fix took two days. The approval gate would have taken two seconds.
If the policy feels too restrictive, relax it intentionally and document why. If you can’t explain why a tool is on the allowed list, it shouldn’t be there.
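A policy like this only matters if it's enforced in code, not just documented. Here's a minimal sketch of a policy gate, assuming an in-process check before every tool call; the class and exception names are illustrative, not from any particular framework:

```python
# Sketch of a policy gate checked before every agent tool call.
# AgentPolicy and PolicyViolation are illustrative names, not a real API.
from dataclasses import dataclass


class PolicyViolation(Exception):
    pass


@dataclass
class AgentPolicy:
    allowed_tools: set
    approval_required: set
    max_steps: int = 8
    max_cost_usd: float = 0.50

    def check_tool_call(self, tool: str, step: int, cost_so_far: float,
                        approved: bool = False) -> None:
        # Allowlist: anything not explicitly permitted is rejected.
        if tool not in self.allowed_tools and tool not in self.approval_required:
            raise PolicyViolation(f"tool {tool!r} is not on the allowlist")
        # Hard caps stop runaway loops before they spend real money.
        if step >= self.max_steps:
            raise PolicyViolation(f"step limit {self.max_steps} exceeded")
        if cost_so_far >= self.max_cost_usd:
            raise PolicyViolation(f"cost cap ${self.max_cost_usd} exceeded")
        # High-impact actions require an explicit human sign-off.
        if tool in self.approval_required and not approved:
            raise PolicyViolation(f"tool {tool!r} requires approval")
```

The important design choice is that the check raises rather than warns: a violated policy halts the run instead of being logged and ignored.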
Structured workflows beat free-form loops
The temptation with agents is to give them a goal and let them figure out the steps. This works beautifully in demos. In production, it creates systems that are impossible to debug, test, or audit.
I prefer structured workflows with a small number of decision points. The model chooses among defined paths. Deterministic logic handles state transitions. The result is a system you can trace, test, and explain.
Think of it as a state machine where the model influences transitions but doesn’t control them entirely. The model might decide whether a customer inquiry needs escalation or can be handled automatically. But the escalation path itself – what happens, in what order, and with what approvals – is defined in code, not improvised by the model.
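One way to make "the model influences transitions but doesn't control them" concrete is an explicit transition table. This sketch assumes a support-inquiry workflow; the state and transition names are illustrative:

```python
# Sketch of an agent workflow as an explicit state machine. The model
# picks among a fixed set of labeled transitions; the code owns what
# each transition means. State names here are illustrative.
TRANSITIONS = {
    "triage":         {"escalate": "awaiting_human", "auto_handle": "drafting_reply"},
    "drafting_reply": {"done": "closed"},
    "awaiting_human": {"resolved": "closed"},
}


def advance(state: str, model_choice: str) -> str:
    """Apply a model decision, but only along an allowed edge."""
    allowed = TRANSITIONS.get(state, {})
    if model_choice not in allowed:
        # The model can't invent transitions; an illegal choice is an
        # error to handle, not an instruction to follow.
        raise ValueError(f"illegal transition {model_choice!r} from {state!r}")
    return allowed[model_choice]
```

Because the table is plain data, you can test every path, diff it in code review, and audit exactly which transitions the model is allowed to trigger.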
When a task genuinely doesn’t fit a clean workflow, isolate it. Put the free-form reasoning in a narrow, heavily instrumented sandbox with tight constraints. Don’t make it the default path for everything.
The boring reliability checklist
I know this section won’t go viral. That’s fine. It’s the section that keeps your agent from becoming an incident.
Idempotent steps. If a step fails and retries, it shouldn’t duplicate work. The agent shouldn’t send two emails because the first one timed out after actually sending. Design every action to be safe to retry.
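A common way to get this is a deduplication key per action. A minimal sketch, assuming an in-memory store (in production this would be a durable table keyed by something like run ID plus step ID):

```python
# Sketch of idempotent action execution via a deduplication key.
# The dict store is a stand-in for a durable table.
_completed: dict = {}


def run_once(key: str, action, *args):
    """Execute action at most once per key; retries return the cached result."""
    if key in _completed:
        return _completed[key]   # retry after a timeout: no duplicate send
    result = action(*args)
    _completed[key] = result
    return result
```

The retry path becomes harmless: if the first attempt actually succeeded before timing out, the second attempt returns the recorded result instead of repeating the side effect.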
Checkpointing. Long-running workflows should save their state at each step. If the process crashes or the model call times out, the workflow should resume from the last checkpoint, not start over.
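As a sketch of the pattern, here's a step-level checkpoint loop that persists completed step names after each step and skips them on resume. The JSON-file store is an assumption for illustration; any durable store works:

```python
# Sketch of step-level checkpointing: persist progress after each step,
# skip completed steps on restart. JSON-file storage is illustrative.
import json
from pathlib import Path


def run_workflow(steps, checkpoint_path: Path):
    """steps is a list of (name, callable) pairs, executed in order."""
    done = []
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())["done"]
    for name, fn in steps:
        if name in done:
            continue                 # resume: skip already-completed work
        fn()
        done.append(name)
        # Persist immediately so a crash mid-workflow loses at most one step.
        checkpoint_path.write_text(json.dumps({"done": done}))
```

Note that checkpointing and idempotency work together: the checkpoint limits how much is redone, and idempotent steps make any redone work safe.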
Time and step caps. Hard limits. Non-negotiable. An agent stuck in a reasoning loop should hit a wall after N steps or M seconds, return whatever partial results it has, and report the failure. I set these conservatively and loosen them only after seeing production data.
Retry discipline. Retry on clearly transient failures – rate limits, network timeouts. Don’t retry on semantic failures – the model misunderstood the task, or the tool returned an error because the input was wrong. Retrying bad logic just wastes money and time.
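The distinction is easiest to enforce with separate exception types, so the retry loop can tell transient from semantic failures. A sketch, with illustrative exception names standing in for whatever your client library raises:

```python
# Sketch of retry classification: back off and retry transient failures,
# surface semantic failures immediately. Exception names are illustrative.
import time


class TransientError(Exception):
    """Rate limit, network timeout: worth retrying."""


class SemanticError(Exception):
    """Bad input, misunderstood task: retrying just wastes money."""


def call_with_retries(fn, max_attempts: int = 3, base_delay: float = 0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise                # budget exhausted: report the failure
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        # SemanticError is deliberately not caught: it propagates immediately.
```

The key move is what's absent: there's no `except Exception` catch-all, so anything not explicitly classified as transient fails loudly on the first attempt.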
Observability isn’t optional
If you can’t trace what an agent did – every tool call, every model response, every decision point – you can’t debug it. And you will need to debug it.
Structured logging for every step:
- What tool was called and with what inputs
- What the model returned and what confidence signal it provided
- Whether an approval was required and who approved it
- How long each step took and how many tokens it consumed
- The final outcome and whether it matched the intent
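In practice this usually means one JSON line per step. A minimal sketch of such a record; the field names are illustrative and should be adapted to your log pipeline:

```python
# Sketch of a structured log record for one agent step, emitted as a
# JSON line. Field names are illustrative, not a fixed schema.
import json
import time
from typing import Optional


def log_step(tool: str, inputs: dict, output: str, approved_by: Optional[str],
             duration_ms: int, tokens: int, outcome: str) -> str:
    record = {
        "ts": time.time(),
        "tool": tool,
        "inputs": inputs,           # redact sensitive fields before logging
        "output": output,
        "approved_by": approved_by, # None when no approval was required
        "duration_ms": duration_ms,
        "tokens": tokens,
        "outcome": outcome,
    }
    return json.dumps(record)
```

Because each line is self-contained JSON, the same records feed both ad hoc debugging and aggregate queries over cost, latency, and approval rates.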
This log isn’t just for debugging. It’s your feedback loop. It tells you which prompts need refinement, which tools are unreliable, which workflows cost too much, and where the model consistently makes bad decisions.
One caution: be disciplined about what you log. Inputs and outputs may contain sensitive data. Define retention policies and access controls before you ship, not after an auditor asks.
Rolling out without regret
The teams that succeed with agentic workflows share a rollout pattern:
- Shadow mode first. The agent runs alongside the existing process but doesn’t take any actions. Log what it would have done. Compare to what the human actually did. This gives you real quality data without any risk.
- Low-risk tasks with clear success criteria. Start with internal tasks where a mistake is inconvenient, not catastrophic. Ticket triage. Data enrichment. Report drafting.
- Expand only after stability. Once reliability, cost, and quality are stable for the initial scope, add more tools or more complex workflows. One step at a time.
This pacing is unglamorous. It’s also the only approach I’ve seen work consistently.
The uncomfortable truth
Agents are powerful. They’re also the highest-risk AI feature you can ship. Every other AI feature is advisory – the model suggests, the user decides. An agent acts. That means every bug, every hallucination, every misunderstanding has real consequences.
Treat agents as systems engineering, not prompt engineering. Define the blast radius. Build the constraints. Invest in the observability. Ship slow.
The teams that move carefully are the ones still running agents in production six months later. The teams that rush are the ones writing postmortems.