AI Safety Is Just Production Engineering

5 min read
ai safety · production · security

AI safety in production isn't a research problem. It's defense in depth, the same way cyber defense works: layered controls, assumed breach, observable boundaries.

Quick take

Treat AI safety like you treat security: assume breach, layer your defenses, and make every boundary observable. A single filter will fail. A layered system with clear escalation paths won’t.


My time working with NATO Cyber Defense taught me one lesson that transfers directly to AI safety: if your security model depends on a single control working perfectly, you don’t have a security model. You have hope.

Most AI safety implementations I review look like this: one content filter, one system prompt instruction, maybe a regex check on output. Then everyone is surprised when someone finds a bypass in production.

AI safety isn’t a research frontier. It’s production engineering. The same defense-in-depth thinking that protects networks protects AI systems: the mental model carries over unchanged.

Assume Your Controls Will Be Tested

The moment you deploy an AI system to users, it becomes a target. Not only from malicious actors – though those exist – but from curious users, edge cases you never imagined, and the simple reality that models do unexpected things with novel inputs.

In cyber defense, you plan for this. You assume the perimeter will be breached and design the interior to limit damage. AI safety is the same. Assume:

  • Someone will try prompt injection. They’ll try hard.
  • The model will occasionally produce harmful or inappropriate output. No filter catches everything.
  • Data will leak through outputs or logs if you don’t explicitly prevent it.
  • Users will find ways to use capabilities you didn’t intend to expose.

This isn’t pessimism. It’s operational realism. Plan for it.

Input: Treat It as Untrusted

Every input to your AI system is untrusted. Full stop. This isn’t different from web security – you wouldn’t pass raw user input to a SQL query. Don’t pass raw user input to a model without validation.

Practical input controls:

  • Separate user content from system instructions at the architecture level, not just the prompt level
  • Length and format limits for every input field
  • Explicit allowlists for supported content types and languages
  • PII detection with consent-aware handling
  • Pattern checks for known injection techniques

Keep these simple. Complex input policies are hard to test, hard to maintain, and easy to bypass. A few robust checks beat a hundred brittle ones.
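A few of the controls above can be sketched in one small validator. This is illustrative only: the limits, patterns, and function names are my assumptions, not a spec, and real injection detection needs more than a short pattern list.

```python
import re

# Illustrative limits -- tune these for your own use case.
MAX_INPUT_CHARS = 4000
ALLOWED_CONTENT_TYPES = {"text/plain"}

# A deliberately small set of robust patterns for known injection phrasings.
# A few checks you can test beat a hundred brittle ones.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
]

def validate_input(text: str, content_type: str = "text/plain") -> tuple[bool, str]:
    """Return (ok, reason). Reject rather than sanitize when unsure."""
    if content_type not in ALLOWED_CONTENT_TYPES:
        return False, "unsupported content type"
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return False, "matched known injection pattern"
    return True, "ok"
```

Note that validation runs before the model ever sees the input, and the reason string feeds straight into logging, which matters later for monitoring.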

Output: The Last Boundary

Output is the final safety layer before the user sees a response. In my NATO work, we called this the “last line of defense” principle: design it assuming everything upstream has already failed.

Output controls:

  • Content filtering to block or redact unsafe responses
  • Leakage checks for system prompts, internal data, or PII
  • Schema validation when the response must follow a defined format
  • Safe fallback behavior when a response fails any check

Fallback behavior matters more than people think. A system that returns “I can’t help with that” when unsure is vastly safer than one that guesses and serves a plausible-looking wrong answer. Refusal is a feature.
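The last-line-of-defense idea plus safe fallback fits in a few lines. A sketch under stated assumptions: the leak markers and SSN-shaped pattern are stand-ins, and production PII detection would use a real detector, not one regex.

```python
import re

FALLBACK_MESSAGE = "I can't help with that."

# Illustrative leakage markers: internal tags you never want echoed to users,
# plus US-SSN-shaped strings as a crude PII stand-in.
LEAK_PATTERNS = [
    re.compile(r"BEGIN SYSTEM PROMPT", re.I),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
]

def check_output(response: str) -> str:
    """Last boundary: assume everything upstream already failed.

    Refuse when any check fails -- refusal is a feature.
    """
    for pattern in LEAK_PATTERNS:
        if pattern.search(response):
            return FALLBACK_MESSAGE
    return response
```

The key design choice is the return type: the caller always gets something safe to show, never an exception it might forget to handle.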

System-Level Controls

Safety doesn’t live in the model layer alone. It belongs in the surrounding system. This is where the cyber defense analogy is strongest: you don’t just firewall the endpoint, you design the entire network for containment.

Rate limits and quotas reduce abuse surface and cost spikes. If someone is hammering your system with injection attempts, rate limiting slows them down before any content filter needs to fire.

Scoped tool access with clear permissions limits blast radius. If your agent can call APIs, those APIs should have the minimum permissions required. Not admin. Not read-write when read-only suffices.
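Minimum permissions can be enforced at the tool-call boundary itself. The grant model, scope names, and tool registry here are all hypothetical, just to show the shape: the agent carries only the scopes it was granted, and anything else raises.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolGrant:
    """The scopes a single agent is allowed to exercise -- nothing more."""
    scopes: frozenset[str] = field(default_factory=frozenset)

# Illustrative registry: the scope each tool call requires.
REQUIRED_SCOPE = {
    "read_ticket": "tickets:read",
    "close_ticket": "tickets:write",
    "delete_user": "admin",
}

def call_tool(grant: ToolGrant, tool: str) -> str:
    required = REQUIRED_SCOPE.get(tool)
    if required is None or required not in grant.scopes:
        raise PermissionError(f"{tool} not permitted for this agent")
    return f"called {tool}"
```

A read-only agent gets `frozenset({"tickets:read"})` and physically cannot close tickets or touch admin operations, whatever the model decides to attempt.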

Sandboxed execution for anything that touches external systems. If your agent generates code or makes API calls, run those in a sandbox. No exceptions.

Configurable policy modes so you can tighten safety quickly during an incident. A kill switch isn’t elegant but it’s necessary.
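Policy modes can be as simple as an enum checked at the top of the request handler. A sketch with made-up modes and limits; in production the active mode would come from a config service or feature flag so it can flip without a deploy.

```python
from enum import Enum

class PolicyMode(Enum):
    NORMAL = "normal"      # standard layered checks
    STRICT = "strict"      # tightened thresholds during an incident
    LOCKDOWN = "lockdown"  # kill switch: refuse everything

def handle_request(text: str, mode: PolicyMode) -> str:
    if mode is PolicyMode.LOCKDOWN:
        return "Service temporarily unavailable."
    # Illustrative: STRICT mode tightens the input length limit.
    limit = 500 if mode is PolicyMode.STRICT else 4000
    if len(text) > limit:
        return "I can't help with that."
    return f"processing: {text[:20]}"
```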

Monitoring: Safety Is Operational

In cyber defense, detection matters as much as prevention. You need to know when your controls are failing. The same applies to AI safety.

Treat safety incidents like reliability incidents:

  • Define thresholds for unsafe output rates, injection attempt rates, and escalation volumes
  • Set up clear escalation paths – who gets paged, what gets rolled back, what needs review
  • Feed production signals back into model prompts, filters, and product design
  • Run regular reviews. Not quarterly. Weekly at minimum during early deployment.
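The threshold step above can be as plain as a dict compared against production metrics. The numbers here are placeholders, not recommendations; real values come from your own baselines.

```python
# Illustrative thresholds -- calibrate against your own baseline rates.
THRESHOLDS = {
    "unsafe_output_rate": 0.01,      # fraction of responses failing output checks
    "injection_attempt_rate": 0.05,  # fraction of inputs matching injection patterns
}

def check_safety_metrics(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics that should page someone."""
    return [
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]
```

The returned list is what feeds the escalation path: empty means quiet, anything else means a page and a review.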

The teams that catch problems early treat safety as an operational concern. The teams that catch problems late treat it as a PR crisis.

Defense in Depth

A single safeguard will fail. I can’t say this enough. Every content filter has bypasses. Every system prompt can be manipulated under the right conditions. Every validation check has edge cases.

The defense-in-depth approach layers controls so that any single failure doesn’t become an incident:

  1. Input validation catches obvious abuse
  2. System prompt discipline limits the model’s scope
  3. Output filtering catches problematic responses
  4. System controls (rate limits, permissions, sandboxing) limit blast radius
  5. Monitoring detects when any layer is failing

Each layer is simple. The combination is robust. This isn’t a new idea – it’s how every mature security program works. AI safety should be no different.
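The five layers compose into one short request path. Every function here is a toy stand-in for the real control, but the structure is the point: any single failure short-circuits to a safe refusal and leaves a log entry, instead of becoming an incident.

```python
FALLBACK = "I can't help with that."

def layer_input_ok(text: str) -> bool:
    # Stand-in for real input validation.
    return len(text) <= 4000 and "ignore previous instructions" not in text.lower()

def layer_model(text: str) -> str:
    # Stand-in for the model call, scoped by system prompt discipline.
    return f"echo: {text}"

def layer_output_ok(response: str) -> bool:
    # Stand-in for real output filtering.
    return "BEGIN SYSTEM PROMPT" not in response

def respond(text: str, log: list[str]) -> str:
    if not layer_input_ok(text):
        log.append("input_rejected")  # monitoring sees every layer's decision
        return FALLBACK
    response = layer_model(text)
    if not layer_output_ok(response):
        log.append("output_rejected")
        return FALLBACK
    log.append("ok")
    return response
```

Rate limits, permissions, and sandboxing wrap this path from outside; monitoring reads the log. Each piece stays simple enough to test on its own.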

Where to Start

If you’re deploying AI to production and haven’t built safety controls yet, start small:

  • Define the allowed inputs and outputs for your first use case. Write them down.
  • Implement input validation and output filtering with clear failure behavior
  • Add rate limiting and logging
  • Set up a simple review queue for flagged interactions
  • Iterate based on what you see in production

Don’t try to build a perfect safety system before shipping. Build a functional one, instrument it, and improve it continuously. Teams that wait for perfection ship nothing. Teams that ship with layered, observable safety controls learn fast and get better.

Safe systems and reliable systems are built the same way. Clear boundaries, observable behavior, steady iteration. The discipline transfers.