Writing / 2026

The Anti-Fragile AI Organization

The best AI organizations do not merely survive model churn and vendor shocks; they convert each one into a capability they keep.

Almost nobody runs a working fallback, and the reason is never ignorance. Strip away Taleb’s vocabulary and the prescription underneath is thirty years old: invert your dependencies, hide the vendor behind an interface, keep a fallback ready. Hexagonal architecture had a name before anti-fragility did. The engineering is not the hard part, and pretending it is wastes your time. What actually defeats the fallback is that a warm second provider is a tax somebody has to keep paying, and taxes with no owner do not get paid.

Keeping a provider genuinely warm is a standing cost, not a build. Your eval suite has to pass green against both providers on every prompt change. You maintain two sets of golden outputs. When one provider alters its tool-call encoding or its structured-output mode, you eat the fix to keep the boundary honest. Budget it as a fixed slice of the boundary team’s capacity, indefinitely, plus the inference spend of the shadow runs. If that line item lives in no one’s plan, you do not have a fallback. You have an aspiration and a grep result.

The boundary also leaks exactly where the value is. Constrained JSON-schema output on one provider becomes “ask for JSON and parse defensively” on another. Parallel tool calls on one go sequential on the other. Context limits, safety-filter behavior, and per-token economics differ enough that the same prompt yields a different product. A boundary that hides all of this can only promise the intersection, which means the price of portability is the provider-specific capability you surrendered to get it. That is a real trade, and it has a rule rather than a vibe: the critical path stays portable and dumb and carries a green fallback; a differentiating feature is allowed to depend on one provider, and in place of a fallback it carries a written number for what re-homing it would cost. Decide it per feature, on purpose, not by default.

Then the shock arrives. A vendor doubles the price effective next cycle. Three weeks of runway. You will not build a clean adapter across forty call sites in three weeks with a price gun to your head, and any plan that assumes otherwise is the fragile response in disguise. What three weeks actually buys is narrower and worth more: wrap the one critical path, get its fallback eval green, flip a slice of traffic, and write down the leaks you punted and what the other thirty-nine sites would cost. You solved the price problem. You also keep a swap that is now an afternoon on that path and a costed map of the rest, and the next path is cheaper because the scaffolding exists. That is the difference between robust and anti-fragile: surviving versus walking away holding a capability the shock paid for.

That capability never accumulates if the work sits in one team, and the reason is structural, not a lack of will. One group owns the model integration, the eval harness, the prompt registry, and the pager . A shock arrives, the same five people absorb it, restore service, close the ticket. That is not an organization. That is a single point of failure the org chart happens to call a department. Restoring service and changing the architecture so the next shock is cheaper are two different jobs, and the second needs the slack the on-call team just spent surviving the first. The conversion never happens, because the people who could do it are holding the pager.

So split the jobs. On-call restores service. The conversion is a tracked deliverable owned by someone off the pager that week, with the fallback’s carrying cost funded in their plan, not scavenged from spare time. Then push the portability decision down: the boundary team owns the shared interface and harness, but each product team owns its path’s portability budget and pays to keep that path’s fallback green. The work spreads to the people who own the paths instead of bottlenecking on five, and the authority to flip the switch sits with the team that owns the consequence.

The morning after any shock, that machinery has either left something behind or it has not. You are holding a warm path you can split traffic onto, an eval that catches the failure class that just bit you, a renegotiated contract with an exit clause : concrete things that were not there the week before. Or you are holding none of them. “We got through it and went back to normal” is the second case: service restored, full freight paid, not one green eval or warm path added that lowers the price of the next hit.