Quick take
Treat AI safety like you treat security: define what must never happen, layer your defenses, assume the system will be attacked, and keep an audit trail. The threat model is different but the discipline is the same.
I spent time working in NATO cyber defense before moving into startups. That background shapes how I think about AI safety, which is probably why I find most of the current discourse frustrating. The AI safety conversation is dominated by philosophers debating existential risk and executives writing policy memos nobody reads. Meanwhile, engineers are shipping AI features into production with no input validation, no output filtering, and no fallback plan.
AI safety isn’t an ethics seminar. It’s a security engineering problem. And we already know how to do security engineering. We just need to apply it.
Safety Is an Engineering Property
In security, we don’t hope the system behaves correctly. We define what must never happen, build controls to prevent it, and monitor for violations. AI safety should work the same way.
Three properties matter:
Reliability. The system should avoid confident errors and make uncertainty visible. A model that says “I don’t know” is safer than one that confidently fabricates an answer. Design for that.
Security. Prompts, tools, and outputs are untrusted surfaces. Full stop. Every principle you apply to user input in a web application applies here. Input validation, output sanitization, least-privilege access.
Accountability. Keep decisions traceable. Log what went in, what came out, and why the system chose what it chose. When something goes wrong – and it will – you need to reconstruct the chain.
The Threat Model
From my security background, the threats break down into four categories. None of them are hypothetical.
Confidently Wrong Answers
Language models produce fluent, authoritative-sounding text even when they’re completely wrong. This is the AI equivalent of a phishing email that looks legitimate. The fluency is the attack vector. Users trust confident text, and the model is always confident.
Defense: ground outputs in retrieved sources, prefer narrow tasks, and add fallback paths when the system can’t verify its own answer.
Prompt Injection
This is the big one, and I’m surprised how few teams take it seriously. User inputs can contain hostile instructions that override your system prompt, extract sensitive data, or manipulate the model into doing things you didn’t intend.
This is SQL injection for the AI era. We spent twenty years learning to never trust user input in database queries. Now we’re concatenating user text directly into model prompts and hoping for the best.
Defense: separate user content from system instructions. Validate outputs before they reach users. Limit tool access. Assume every user input is adversarial, because eventually one will be.
Data Leakage
AI features touch everything: logs, documents, user text, internal wikis. Every piece of data you send to a model is data that could leak – through the model’s responses, through provider logs, through training data in the next version.
Defense: minimize what you send. Redact sensitive fields. Never feed private content back into prompts where other users might see the output. Apply the same data classification you use for any other third-party service.
Bias and Uneven Performance
Models perform differently across languages, demographics, and domains. This isn’t theoretical – I’ve seen it in production. A summarization feature that works well in English and falls apart in other languages. A classification model that performs inconsistently across user groups.
Defense: test with diverse inputs. Don’t use AI for high-stakes decisions without human review. Monitor performance across segments, not just in aggregate.
Defense in Depth
No single control solves this. Layer your defenses:
- Ground generation in trusted, retrieved data
- Filter and validate both inputs and outputs
- Apply least-privilege to tool and data access
- Build refusal paths for requests that exceed the system’s safe operating range
- Add human review for high-stakes decisions
This is exactly how we design secure systems. The specifics are different but the pattern is identical.
Testing Like You Mean It
Safety is a lifecycle, not a checklist you run once before launch. Build evaluation sets that include adversarial prompts, confusing edge cases, and realistic user flows. Run them continuously. Monitor production for error spikes, unusual output patterns, and refusal rate changes.
At a financial infrastructure company, where we deal with financial data, I apply the same principle to every system: if you can’t detect when it fails, you can’t call it safe. AI features are no different.
Start Here
Four steps that every team should take before shipping an AI feature:
- Define what the system must never do. Write it down. Make it specific. “Must not leak PII” is better than “must be safe.”
- Limit scope to a narrow task. The broader the capability, the larger the attack surface.
- Add guardrails before shipping, not after. Input validation, output filtering, fallback paths. All of it. Before the first user touches it.
- Measure failures and iterate. Track what went wrong, fix the defense, repeat.
This isn’t glamorous work. It’s the same kind of boring, essential engineering that keeps every other system from falling apart. The model is new. The discipline isn’t.