Quick take
Most AI agent failures aren’t model failures. They’re infrastructure failures wearing a model mask. Legacy networking assumptions, flat trust boundaries, and missing circuit breakers create brittle agent behavior that looks like “the AI is unreliable” but is actually “the network can’t support autonomous execution patterns.” Fix the infrastructure and the agents get dramatically more reliable overnight.
The Execution Path Nobody Drew on a Whiteboard
Agent tasks fan out across DNS resolution, TLS handshakes, token exchanges, service mesh routing, and backend queries. The multi-hop latency problem is well-understood (I covered the general case in the cloud-heavy architecture post), but the networking-specific failure modes deserve their own treatment: stale DNS caches that route agents to decommissioned endpoints, TLS renegotiation overhead that compounds across 40 tool calls, service mesh sidecars that add 5-15ms per hop invisibly, and queue depth limits that silently drop requests during agent-scale bursts. These aren’t model problems. They’re networking problems that surface as agent unreliability.
The Hidden Cost of 20th-Century Network Assumptions
Most enterprise networks were designed around two assumptions: traffic flows north-south through a perimeter, and anything inside the perimeter is trusted. AI agents violate both assumptions simultaneously.
Agent traffic is east-west by default. A single task might call an internal knowledge base, a code execution sandbox, an external search API, and a database, all in a single reasoning loop. The traffic pattern looks like a mesh, not a pipeline. Networks designed for request-response patterns between a frontend and a backend choke on this.
The trusted-network assumption is worse. When an agent has a service account with broad permissions, every tool call inherits those permissions. If the agent can read from a document store, it can read from all of it. If it can write to a database, the blast radius of a prompt injection extends to every table the service account can touch. This isn’t a theoretical risk. It’s the default configuration in most deployments I’ve seen.
Latency compounds differently for agents than for traditional services. A human user tolerates 200ms of added latency on a page load. An agent making 40 tool calls in a single task turns 200ms of unnecessary overhead per call into 8 seconds of total delay. At scale, this means the difference between an agent that completes tasks in seconds and one that takes minutes. Users notice. They lose trust. They stop using the feature.
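The arithmetic above is worth making explicit. A quick sketch, using the hypothetical numbers from the text (40 tool calls, 200ms of avoidable overhead per call):

```python
# Illustrative numbers from the text: 40 tool calls per task, 200 ms of
# avoidable per-call overhead (TLS renegotiation, sidecar hops, token checks).
calls_per_task = 40
overhead_per_call_ms = 200

total_overhead_s = calls_per_task * overhead_per_call_ms / 1000
print(total_overhead_s)  # 8.0 seconds of added delay per task
```

Per-call overhead that is invisible on a single request dominates total task time once the fan-out multiplies it.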
Zero-Trust Identity for Autonomous Systems
The fix isn’t a network redesign. It’s an identity redesign at the network layer.
Every agent tool call should carry a scoped identity that specifies what the agent can reach, for how long, and on behalf of which user or task. This is standard zero-trust thinking applied to agent traffic patterns. (For the broader tool permission and output validation side of this, see my earlier post on AI security.)
In practice, the networking-specific concerns are:
Per-task credentials with network scope. Instead of a long-lived service account, mint a short-lived token for each agent task. The token carries the minimum permissions needed for that specific workflow, and critically, it limits which network endpoints the agent can reach. When the task ends, the token expires. If the agent is compromised mid-task, the blast radius is one task’s worth of permissions and one task’s set of reachable services.
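A minimal sketch of what a per-task credential might look like. All names (`TaskToken`, `mint_task_token`, the endpoint strings) are hypothetical, not a real library API:

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class TaskToken:
    """Short-lived credential scoped to one agent task (illustrative)."""
    task_id: str
    user: str
    allowed_endpoints: frozenset  # network scope: which services are reachable
    expires_at: float

    def is_valid(self, endpoint: str) -> bool:
        # Token must be unexpired AND the endpoint must be in scope.
        return time.time() < self.expires_at and endpoint in self.allowed_endpoints


def mint_task_token(user: str, endpoints: set, ttl_s: int = 300) -> TaskToken:
    # Minimum network scope for this one workflow; expires with the task.
    return TaskToken(
        task_id=str(uuid.uuid4()),
        user=user,
        allowed_endpoints=frozenset(endpoints),
        expires_at=time.time() + ttl_s,
    )
```

The key property: a compromised token can only reach the endpoints it was minted for, and only until the TTL runs out.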
Per-call authentication overhead. Every tool call crossing a network boundary needs auth, and that auth has a cost. TLS mutual authentication, token validation, and policy lookup all add latency. The design tradeoff is between granular identity (every call authenticated independently) and performance (connection pooling, session tokens, cached auth decisions). Get this wrong and your zero-trust layer becomes the latency bottleneck it was meant to protect against.
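One common way to manage that tradeoff is a short-TTL cache in front of the policy lookup, so repeated calls to the same endpoint within one task don't each pay the full check. A sketch under assumed names (`PolicyCache` and the lookup callback are illustrative):

```python
import time


class PolicyCache:
    """Caches allow/deny decisions for a short TTL so every tool call
    doesn't pay a full policy lookup. Illustrative tradeoff sketch:
    longer TTL = less latency, but slower revocation."""

    def __init__(self, lookup, ttl_s: float = 5.0):
        self._lookup = lookup  # the expensive policy check (callable)
        self._ttl = ttl_s
        self._cache = {}       # (identity, endpoint) -> (decision, expiry)

    def allowed(self, identity: str, endpoint: str) -> bool:
        key = (identity, endpoint)
        hit = self._cache.get(key)
        now = time.time()
        if hit and now < hit[1]:
            return hit[0]  # cached decision still fresh
        decision = self._lookup(identity, endpoint)
        self._cache[key] = (decision, now + self._ttl)
        return decision
```

The TTL is the dial: it bounds both the saved latency and the window during which a revoked permission is still honored.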
Network segmentation per agent class. Not all agents need the same network access. An agent that summarizes documents has no business reaching your billing API. Segment your network so each agent class can only route to the services it needs. This is basic network segmentation, but most teams skip it because their agents all share one service account with broad network access.
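At its simplest, this is a routing table keyed by agent class. The class and service names below are invented for illustration:

```python
# Hypothetical segmentation table: each agent class can only route to the
# services it needs. Names are illustrative, not from any real deployment.
SEGMENTS = {
    "doc-summarizer": {"knowledge-base", "doc-store"},
    "code-assistant": {"sandbox", "git-proxy"},
    "support-agent":  {"knowledge-base", "ticketing-api"},
}


def can_route(agent_class: str, service: str) -> bool:
    # Default-deny: an unknown agent class reaches nothing.
    return service in SEGMENTS.get(agent_class, set())
```

In production this table would live in your mesh or policy engine, but the shape is the same: default-deny, with an explicit allowlist per agent class.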
Reliability Engineering for Agent Workflows
Traditional reliability patterns need adjustment for agentic workloads. The standard toolkit (retries, timeouts, circuit breakers) still applies, but the parameters and placement change.
Timeouts need to be per-step, not per-request. An agent task might legitimately run for 30 seconds across 20 tool calls, so a global 30-second timeout will kill valid workflows that run even slightly long. A per-step timeout of 3 seconds catches a hung dependency without killing the whole task.
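A minimal sketch of a per-step timeout wrapper, using only the standard library (the function name is illustrative):

```python
import concurrent.futures


def call_with_step_timeout(fn, *args, step_timeout_s: float = 3.0):
    """Run one tool call with its own timeout, so a hung dependency
    fails the step, not the whole task (illustrative sketch)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=step_timeout_s)
    except concurrent.futures.TimeoutError:
        # Surface a structured, per-step failure instead of hanging the task.
        raise TimeoutError(f"tool step exceeded {step_timeout_s}s")
    finally:
        pool.shutdown(wait=False)
```

The task-level loop can catch the per-step `TimeoutError`, report it to the model as a structured failure, and continue with the remaining steps.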
Retry logic needs backpressure awareness. An agent that retries a failed tool call immediately, while 50 other agent instances are doing the same thing, creates a retry storm that takes down the dependency. Exponential backoff with jitter is the minimum. Better: a circuit breaker that trips after a threshold and fails fast for all agent instances, with a clear error message the model can reason about.
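Both pieces fit in a few lines. A sketch of full-jitter backoff plus a consecutive-failure circuit breaker (thresholds and names are illustrative, not tuned recommendations):

```python
import random


def backoff_with_jitter(attempt: int, base_s: float = 0.1, cap_s: float = 10.0) -> float:
    """Full jitter: sleep a random amount up to the capped exponential delay,
    so 50 agent instances don't retry in lockstep."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))


class CircuitBreaker:
    """Trips after `threshold` consecutive failures; fails fast with an
    error message the model can reason about (illustrative sketch)."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            # Fail fast for every agent instance instead of hammering
            # an already-down dependency.
            raise RuntimeError("circuit open: dependency unavailable, try an alternative approach")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result
```

A production breaker would also add a half-open state to probe for recovery; the point here is the fail-fast behavior shared across all agent instances.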
Queue depth matters more than you think. Agent workloads are bursty. A user action that triggers 10 agent tasks, each making 15 tool calls, puts 150 requests into your service mesh in seconds. If the target service has a queue depth of 50, you’re dropping requests before the agent even knows there’s a problem. Size your queues for agent-scale fan-out, not human-scale request rates.
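The burst math from the paragraph above, as a worst-case sketch (ignoring drain rate, so real numbers would be somewhat better):

```python
# One user action fans out to 10 agent tasks, each making 15 tool calls,
# all arriving within seconds. Worst case: nothing drains before the burst lands.
tasks = 10
calls_per_task = 15
burst = tasks * calls_per_task   # 150 near-simultaneous requests
queue_depth = 50                 # a typical human-scale default

dropped = max(0, burst - queue_depth)
print(burst, dropped)  # 150 requests, 100 dropped before the agent sees an error
```

Even with generous drain-rate assumptions, a queue sized for human request rates sheds a large fraction of an agent-scale burst.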
Graceful degradation over hard failure. When a tool call fails, the agent should get a structured error it can reason about, not a 500 or a timeout. “Knowledge base unavailable, try alternative approach” is actionable. A raw HTTP error is not. Design your tool contracts to return machine-readable failure modes.
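A sketch of what such a contract might look like. The field names are illustrative, not a standard:

```python
import json


def tool_error(code: str, hint: str) -> str:
    """Machine-readable failure the agent can reason about, instead of a
    raw 500 or a timeout. Field names are illustrative, not a standard."""
    return json.dumps({
        "ok": False,
        "error": code,
        "hint": hint,  # actionable next step, phrased for the model
        "retryable": code in {"kb_unavailable", "timeout"},
    })
```

An example response: `tool_error("kb_unavailable", "Knowledge base unavailable, try alternative approach")` gives the model both the failure mode and a path forward, which a bare HTTP status never does.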
Observability for Agent Decision Traces
Standard APM tools show you request latency and error rates. For agent workflows, you need something more: a trace that follows the agent’s reasoning across tool calls, captures the decision points, and shows why the agent chose one path over another.
This means correlating model inputs, outputs, and tool calls into a single trace. Each agent task gets a trace ID. Each tool call within that task gets a span. The spans include the tool arguments, the response, the latency, and the policy decision. When you look at a slow or failed agent task, you can see exactly which step took too long, which dependency failed, and whether the agent’s retry behavior made things better or worse.
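The structure described above is small enough to sketch directly. This loosely follows the trace/span model from distributed tracing (all class and field names here are illustrative):

```python
import uuid


class TaskTrace:
    """One trace per agent task; one span per tool call. Illustrative
    sketch of the trace/span model, not a real tracing SDK."""

    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.spans = []

    def record(self, tool: str, args: dict, latency_ms: float,
               ok: bool, policy_decision: str) -> None:
        # Each span captures what the agent called, with what, how long
        # it took, whether it succeeded, and what the policy layer decided.
        self.spans.append({
            "tool": tool,
            "args": args,
            "latency_ms": latency_ms,
            "ok": ok,
            "policy_decision": policy_decision,
        })

    def slowest(self) -> dict:
        # The first question for any slow task: which step took too long?
        return max(self.spans, key=lambda s: s["latency_ms"])
```

In practice you would use an off-the-shelf tracing SDK rather than rolling your own, but the payload per span is the part most teams get wrong: arguments, outcome, and policy decision, not just latency.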
The teams doing this well treat agent traces like they treat database query plans. They review them regularly, look for patterns, and optimize the hot paths. A tool call that takes 500ms and gets called 20 times per task is a bigger problem than a tool call that takes 2 seconds but only gets called once.
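The comparison in that last sentence is aggregate cost, not per-call cost. Spelled out with the numbers from the text:

```python
# Hot-path math: what matters is (latency x call count) per task.
frequent = 0.5 * 20  # 500 ms tool call, invoked 20 times per task -> 10.0 s
rare     = 2.0 * 1   # 2 s tool call, invoked once per task        ->  2.0 s
print(frequent, rare)  # 10.0 2.0: the "fast" call is the 5x bigger problem
```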
Migration Path
You don’t need to rebuild your infrastructure to start.
- Instrument first. Add trace IDs to agent tool calls. Log latency, errors, and retry counts per step. You can’t fix what you can’t see.
- Add identity boundaries. Replace long-lived service accounts with per-task tokens, starting with agents that have write access.
- Circuit-break external calls. Add circuit breakers and per-step timeouts for every external dependency. Size queues for agent-scale fan-out.
- Migrate to mesh. Deploy a service mesh or policy layer for tool call routing. Start in audit mode, then shift to enforcement.
Each step is small and reversible. Together they compound into a fundamentally more reliable agent platform.
Checklist: Risk Reduction in 90 Days
- Map every tool an agent can call, its permissions, and its failure modes
- Add per-task trace IDs to all agent tool calls
- Replace at least one long-lived service account with scoped, short-lived tokens
- Set per-step timeouts on all agent tool calls
- Add circuit breakers for external API dependencies
- Deploy a policy layer in audit mode for tool call authorization
- Review agent decision traces weekly for latency outliers and retry storms
- Load test agent workflows at 10x expected concurrency
- Document failure modes and give agents structured error responses
- Establish an error budget for agent reliability separate from service reliability
Key Takeaways
Agent reliability is infrastructure reliability. The model is usually fine. The network, the auth layer, the retry logic, and the observability stack are where agent workflows actually break.
Treat agent tool calls like an API surface that needs zero-trust security, per-step reliability engineering, and end-to-end tracing. The teams that figure this out early will ship reliable agent products. The teams that keep tuning prompts to work around infrastructure problems will keep wondering why their agents are “flaky.”
Network and identity design is core agent product work, not background platform plumbing. Budget for it accordingly.