Quick take
Treat every AI endpoint like an exposed API that can be tricked into doing things you didn’t intend. Separate trusted instructions from untrusted content. Constrain tool access. Filter outputs for leakage. Monitor like the system is adversarial, because someone will make it so. Security, stability, performance – in that order.
During a NATO cyber defense exercise a few years back, we ran a scenario where the opposing team compromised an automated decision support system. They didn’t hack the system in the traditional sense. They fed it manipulated data that changed its recommendations. The system worked exactly as designed. It just made the wrong decisions because its inputs were poisoned.
That scenario has stayed in my head this year because it’s exactly what prompt injection does to AI systems. The model works as designed. The inputs are manipulated. The outputs are wrong. And the system has no idea.
The threat model isn’t theoretical
Every AI system I see in production combines three things that should make security engineers nervous:
- Untrusted user input goes directly into the model context.
- Retrieved content from external sources is treated as trusted context rather than as untrusted data.
- Tool access allows the model to take actions with real consequences.
Mix those three together and you get a system where a malicious string in a support ticket can, in the worst case, cause the model to call an internal API, exfiltrate data, or take an action that nobody authorized.
This isn’t hypothetical. I’ve seen prompt injection succeed against production systems. In one case, a user embedded instructions in a document that was retrieved during RAG. The model followed those instructions and included internal system prompt details in its response. The user got a screenshot and posted it on social media. Not a great day for that team.
Where the attacks land
Prompt injection is the big one. Direct injection, where the user types instructions that override the system prompt, is the obvious case. Indirect injection is scarier: malicious instructions embedded in retrieved documents, emails, or web pages that the model processes. The model can’t reliably distinguish “instructions from the developer” from “instructions from an attacker hiding in the data.”
Data leakage is the second big one. Models will echo back their system prompts, retrieved context, or other users’ data if you ask the right way. Output filtering catches some of this. But the model is creative, and attackers are more creative. Assume that anything in the context window can potentially appear in the output.
Tool misuse is the emerging threat. As AI systems gain access to tools – databases, APIs, file systems, deployment pipelines – the blast radius of a successful injection grows dramatically. A chatbot that can only generate text is annoying when compromised. A chatbot that can query your database and call your APIs is dangerous.
Defenses that actually work
I apply the same layered defense approach I learned in the NATO context, adapted for AI systems.
Separate trusted from untrusted
The most important architectural decision is maintaining a clear hierarchy of instructions. System prompts are trusted. User input is untrusted. Retrieved content is untrusted. Tool outputs are semi-trusted. The model should have explicit markers for these boundaries, and the system should be designed so that untrusted content can’t override trusted instructions.
This doesn’t fully prevent injection, but it raises the bar. Label everything. Normalize inputs. Strip or escape known injection patterns before they enter the context.
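A minimal sketch of what that looks like in practice, assuming a chat-style message API. The tag names, injection patterns, and sanitization rules here are illustrative assumptions, not a standard – real systems maintain much larger pattern lists:

```python
# Sketch: assemble model context with explicit trust boundaries.
# Tag names and patterns are illustrative, not exhaustive.
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
]

def sanitize(text: str) -> str:
    """Strip known injection phrases from untrusted text."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text

def build_context(system_prompt: str, user_input: str,
                  retrieved_docs: list[str]) -> list[dict]:
    """Label every piece of context by trust level.

    Untrusted content is sanitized and wrapped in markers so the
    model (and any downstream filter) can see where it came from.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for doc in retrieved_docs:
        messages.append({
            "role": "user",
            "content": f"<untrusted_document>\n{sanitize(doc)}\n</untrusted_document>",
        })
    messages.append({
        "role": "user",
        "content": f"<user_input>\n{sanitize(user_input)}\n</user_input>",
    })
    return messages
```

The markers don't stop a determined attacker, but they give the model a fighting chance and give your output filter something concrete to check against.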
Constrain tool access
Every tool an AI system can access should follow least privilege. Read-only by default. Write operations require explicit confirmation. Destructive operations require human approval. Scope queries to the current user’s data. Rate limit everything.
Our MCP tool servers enforce permission checks at the tool level, not just at the connection level. A user might be allowed to query their own deployment status but not trigger a rollback. The model never gets to make that decision – the permission boundary does.
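A stripped-down sketch of that permission boundary, using the deployment-status and rollback example from above. The tool names, permission levels, and policy table are hypothetical:

```python
# Sketch: least-privilege check that runs BEFORE any tool call the
# model requests. Tool names and levels are hypothetical examples.
TOOL_PERMISSIONS = {
    "get_deployment_status": {"level": "read", "requires_approval": False},
    "trigger_rollback":      {"level": "write", "requires_approval": True},
}

def authorize_tool_call(tool: str, user_perms: set[str],
                        target_owner: str, user_id: str,
                        approved: bool = False) -> bool:
    """Enforce the permission boundary outside the model."""
    spec = TOOL_PERMISSIONS.get(tool)
    if spec is None:
        return False  # deny unknown tools by default
    if spec["level"] not in user_perms:
        return False  # user lacks the read/write capability
    if target_owner != user_id:
        return False  # scope every call to the caller's own data
    if spec["requires_approval"] and not approved:
        return False  # destructive ops require human sign-off
    return True
```

The key property: the model's output is only ever a *request*. Whether the call executes is decided by code the model can't talk its way past.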
Filter outputs aggressively
Output filtering is your last line of defense. Check every response for:
- System prompt fragments or internal instructions
- Personally identifiable information that shouldn’t appear
- Known attack patterns (encoded instructions, suspicious URLs)
- Content that violates your safety policies
This isn’t foolproof. Models are remarkably good at paraphrasing things they shouldn’t say. But filtering catches the low-hanging fruit and raises the cost of attack.
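The checks above can be sketched as a simple post-processing pass. The regexes here are deliberately crude placeholders – real deployments use dedicated PII detectors and maintained pattern sets:

```python
# Sketch: last-line-of-defense output filter. Patterns are crude
# placeholders; production systems use dedicated detectors.
import re

def filter_output(response: str, system_prompt: str) -> tuple[str, list[str]]:
    """Flag responses that leak context or match known attack patterns."""
    findings = []
    # System-prompt leakage: any 40-char fragment of the prompt in the output
    for i in range(0, max(1, len(system_prompt) - 40), 20):
        fragment = system_prompt[i:i + 40]
        if fragment and fragment in response:
            findings.append("system_prompt_leak")
            break
    # PII: crude email / SSN-style patterns (illustrative only)
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response):
        findings.append("possible_email")
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", response):
        findings.append("possible_ssn")
    # Encoded payloads: long base64-looking runs
    if re.search(r"[A-Za-z0-9+/]{80,}={0,2}", response):
        findings.append("encoded_content")
    if findings:
        return "[response withheld pending review]", findings
    return response, findings
```

Note the substring check catches verbatim leakage only – a paraphrased system prompt sails right through, which is exactly the limitation described above.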
Monitor for the weird
Traditional security monitoring looks for known attack patterns. AI security monitoring also needs to detect behavioral anomalies:
- Sudden changes in tool call patterns
- Requests that are unusually long or contain encoded content
- Responses that include fragments of system prompts
- Spikes in refusal rates or cost
- Users who systematically probe the model’s boundaries
On one project, we caught an attacker by noticing a user who submitted 200 requests in an hour, each slightly different, all testing variations of the same injection technique. Traditional rate limiting didn’t flag it because the request volume was below the threshold. Behavioral analysis did.
The architecture matters more than the detection
Here’s the uncomfortable truth: you can’t fully prevent prompt injection with current techniques. The model is a general-purpose text processor that follows instructions, and there’s no reliable way to make it distinguish between legitimate instructions and injected ones.
What you can do is limit the blast radius. Isolate AI services from core systems. Scope permissions narrowly. Put human approval gates on sensitive actions. Log everything. Make the system auditable.
This is the same defense-in-depth approach we apply to every exposed system. The fact that the attack vector is natural language instead of SQL or shellcode doesn’t change the principles. It changes the surface.
What I tell every team
Security, stability, performance – in that order. That’s my priority stack for AI systems, same as any other system I build.
Start by assuming the model will be tricked. Design your system so that a successful trick does as little damage as possible. Then add detection. Then add response playbooks. Then drill them.
The teams that treat their AI systems like exposed APIs with real blast radius will be fine. The teams that treat them like internal tools with trusted inputs will learn an expensive lesson. I’d rather they learned from this post than from their first incident.