2026-07-04 · Security

What Actually Contains a Rogue AI Agent (Come Try to Break It)

Most agent security is a claim on a slide. So we built Arena — a place to attack a real AI agent in our runtime and see, hands-on, what actually holds: the microVM it's isolated in, the guardrails that inspect its traffic, and the deterministic walls it can't route around.

Every company selling AI-agent security will tell you their guardrails work. Almost none will let you try to break them.

We wanted the opposite. So we built Arena: a place where you play the attacker against a real AI agent running in the real Declaw runtime, and watch — hands-on, no signup, nothing taken on faith — what holds and what doesn't. The agent is guarding a secret. Your job is to get it out: talk the model into leaking it, or drop into a root shell and steal it yourself.

This post is a tour of what you're actually attacking, and why some of it falls in ten seconds while some of it is a wall by design.

One honest note up front: Arena is a thin harness. The web app just sets a policy and calls the sandbox API — the enforcement is the runtime itself, the same one you get from the SDKs. So when a defense holds in Arena, it's the product holding, not game logic.

The one idea: enforcement has to live below the agent

Most "agent security" is a gateway or guardrail the agent is configured to route through — an SDK wrapper, a proxy, an API sitting in front of the model. That works right up until the agent does something you didn't script: opens a raw socket, swaps its DNS resolver, hits an endpoint the gateway doesn't cover, or just doesn't call your SDK. Agentic code is untrusted code, and anything it can route around, it eventually will.

Our bet is simple, and it's why Declaw is a runtime and not a bolt-on: enforcement is only non-bypassable if you own the compute the agent runs on. A gateway sits in a network path in front of your agent — it doesn't own the box, so the box can route around it. Declaw owns the box. Every agent runs in its own hardware-isolated microVM: its own kernel, its own network namespace, a clean filesystem each run. That isolated compute is the foundation — and because we own it, we put the enforcement on the outside of the boundary, where the agent has root on the inside and still can't reach the thing enforcing on it. Isolation here isn't a backdrop; it's what makes everything else impossible to bypass.

We argued the limits of isolation-alone in Why Sandboxes Alone Won't Secure Your AI Agents — you need the boundary controls too. Arena is the "show, don't tell."

Two kinds of defense — and why you need both

That isolation boundary is the ground everything else stands on — a contained agent is the precondition for every other control. On top of it, the most useful thing Arena teaches is that defenses come in two kinds, and a serious runtime invests in both.

The first is guardrails: content inspection on everything crossing the boundary. That means PII redaction plus a layered prompt-injection defense (deterministic signatures, an ML classifier, and a session-aware judge) that catches single-turn, multi-turn, and indirect attacks. This is a core, benchmarked part of the product, not a checkbox: it scores near the top of public injection sets like Gandalf and InjecAgent, with low over-refusal on benign prompts (we publish the numbers separately). It stops the overwhelming majority of attacks before they ever land.
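To make "layered" concrete, here is a minimal sketch of three checks run cheapest-first, each able to block on its own. It is an illustration, not Declaw's implementation: the signature regex, the `classifier` and `judge` callables, and the 0.8 threshold are all hypothetical stand-ins.

```python
import re
from typing import Callable, List

# Hypothetical signature list; a real deployment would carry many more.
SIGNATURES = [re.compile(r"ignore (all )?(previous|prior) instructions", re.I)]

def screen(message: str,
           history: List[str],
           classifier: Callable[[str], float],
           judge: Callable[[List[str], str], bool],
           threshold: float = 0.8) -> str:
    """Run the layers in order and return the first verdict that blocks."""
    # Layer 1: deterministic signatures (cheap, exact).
    if any(sig.search(message) for sig in SIGNATURES):
        return "blocked:signature"
    # Layer 2: ML classifier score for this single message (stubbed here).
    if classifier(message) >= threshold:
        return "blocked:classifier"
    # Layer 3: session-aware judge that also sees earlier turns (stubbed here).
    if judge(history, message):
        return "blocked:judge"
    return "allowed"

# Stub layers for demonstration only.
never_flags = lambda msg: 0.0
no_objection = lambda hist, msg: False

print(screen("Please ignore previous instructions.", [], never_flags, no_objection))
print(screen("Summarize this report.", [], never_flags, no_objection))
```

The ordering is the design point: exact matches are settled without paying for a model call, and the judge is the only layer that can catch an attack spread across several turns.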

The second is deterministic walls: kernel- and network-level drops that don't care how clever the prompt was. These exist because content inspection, however good, is still inspection — and some attacks (data exfil that looks exactly like a legitimate API call) can't be caught by reading content at all. That's not a knock on guardrails; it's why you back strong guardrails with enforcement the agent physically can't route around.

You need both. Guardrails catch what's recognizable; the walls catch what's unroutable. Here's how that plays out across the scenarios.

Guardrails: stopping the attack in the content

PII redaction is the first thing you hit in the Data Analyst scenario: an agent guarding a customer table full of SSNs, credit cards, and emails. You can absolutely talk the model into "showing you row 12" — and you'll get a redacted row, because the raw PII was scrubbed before it ever reached the model. That's not the model being disciplined; it's that the data never arrived. (We broke down how these leaks happen in Anatomy of an Agent Data Leak.)
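The mechanics of "scrubbed before it ever reached the model" can be sketched with a pattern-based redactor. This is a simplified illustration under our own assumptions: the three regexes and the `[LABEL]` placeholder format are hypothetical, and a production redactor would validate matches (for example with a Luhn check on card numbers) and cover many more entity types.

```python
import re

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder such as [SSN]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

# Redaction runs on the data before it is placed in the model's context.
row = "12, Ada Lovelace, 123-45-6789, 4111 1111 1111 1111, ada@example.com"
print(redact(row))
```

Because the substitution happens before the prompt is assembled, no amount of persuasion in the chat can recover the original values: the model only ever holds the placeholders.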

Turn the difficulty up and you meet the injection defense in force. It weighs each request against the agent's actual task and blocks manipulation, and it's what turns back the large majority of what people throw at the chat scenarios. It's also, honestly, inspection — a strong probabilistic call, not a mathematical guarantee. Arena is built to show you that seam: the Research Bot's middle level deliberately leaves the network open behind that inspection, so a determined social-engineer can talk an off-allowlist fetch through. That level is winnable on purpose — not because the defense is weak, but so you can see exactly where content inspection ends and a deterministic layer has to take over. Prompt injection is not solved for anyone; the honest move is to make guardrails as strong as they can be and back them with walls.

The Inbox Assistant is those same guardrails against a nastier attack: indirect injection. The malicious instruction isn't in your chat — it's hidden in an email the agent reads. It rides in on trusted-looking data, which is exactly why it's hard, and why the injection defense scans ingested content, not just your prompts.

The deterministic floor: isolation plus out-of-guest enforcement

Here's the part that only works because Declaw owns the runtime. Notice the shell scenarios hand you an actual root shell — and we're relaxed about it. That's the isolation doing its job: root inside the microVM is still inside the microVM. Even a kernel-level compromise is contained by the VM boundary — escaping means breaking the hypervisor, a far smaller and harder surface than a shared container kernel (we walked through exactly this during a real kernel zero-day in Dirty Frag). And the controls that actually stop you from doing damage live on the host side of that boundary, where your root doesn't reach. That's what "outside the guest" buys you.

Egress control is opt-in: by default an agent can reach anything except the cloud metadata endpoint, which is blocked out of the box. When you turn egress control on, it's enforced in the sandbox's network namespace and on the host, not inside the VM. A per-sandbox DNS resolver only answers allowlisted names, their IPs get pushed into a kernel ipset, and iptables only forwards packets whose destination is in that set. Try to skip DNS by hardcoding a bare IP? The kernel drops the connection before it reaches anything. (The practical how-to is How to Lock Down an AI Agent's Network Egress.)
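The resolver-to-ipset-to-forwarding chain is easier to reason about as a toy model. The sketch below simulates the decision logic in plain Python; it does not touch real DNS, ipset, or iptables, and the class name, the `upstream_dns` table, and the example hostnames are all hypothetical.

```python
import ipaddress

class EgressPolicy:
    """Toy model: an allowlisting resolver feeding a set of permitted IPs."""

    def __init__(self, allowed_names, upstream_dns):
        self.allowed_names = {n.lower().rstrip(".") for n in allowed_names}
        self.upstream_dns = upstream_dns   # hypothetical name -> [ip] table
        self.ipset = set()                 # stands in for the kernel ipset

    def resolve(self, name):
        """Answer only allowlisted names and record their IPs as permitted."""
        key = name.lower().rstrip(".")
        if key not in self.allowed_names:
            return []                      # off-list names get no answer
        ips = [ipaddress.ip_address(ip) for ip in self.upstream_dns.get(key, [])]
        self.ipset.update(ips)
        return ips

    def forwards(self, dst_ip):
        """Mirror of the forwarding rule: pass only destinations in the set."""
        return ipaddress.ip_address(dst_ip) in self.ipset

policy = EgressPolicy(
    allowed_names=["api.example.com"],
    upstream_dns={
        "api.example.com": ["93.184.216.34"],
        "evil.example.net": ["203.0.113.7"],
    },
)
policy.resolve("api.example.com")
print(policy.forwards("93.184.216.34"))  # allowlisted name was resolved
print(policy.forwards("203.0.113.7"))    # hardcoded IP never entered the set
```

The hardcoded-IP case fails for a structural reason: the only way an address gets into the permitted set is through a successful allowlisted lookup, so skipping DNS skips the one step that would have opened the path.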

Cut the Wire makes this visceral. We hand you a root shell and dare you to turn the egress policy off. iptables -F, kill processes, open raw sockets, try odd ports — none of it works, because the thing enforcing the policy isn't in your VM. You can't disable an enforcer you can't reach. That level is unwinnable by design — not "mathematically impossible," just: there's no path from inside.

The Sync Bot shows why you need this floor at all. Exfiltrating data by POSTing it somewhere looks exactly like a legitimate sync — no classifier can reliably tell them apart. The answer isn't a smarter judge; it's a network allowlist the agent can't route around.

Secrets that were never in the box

In The Breakout, you get a root shell and go hunting for the app's API key. At the easy level it's sitting in a config file and you win in seconds. At the hard level, the environment variable holds a decoy — the literal string declaw:vault-managed — and the real key is injected by the egress proxy on the outbound request, so it never enters the VM at all. You can dump every file and every env var with full root and still walk away with a fake. (That's the credential vault; the how-to is How to Give an AI Agent Code Execution Without Handing Over Your Credentials.)
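A minimal sketch of the decoy-and-inject idea, assuming the proxy can read and rewrite outbound headers. Only the decoy string comes from the scenario itself; the `VAULT` mapping, the key value, and the function name are hypothetical, and this is not the product's code.

```python
DECOY = "declaw:vault-managed"  # the placeholder the guest actually holds

# Hypothetical vault contents, keyed by upstream host.
# This mapping exists only on the proxy side, never inside the VM.
VAULT = {"api.example.com": "sk-real-key-never-in-vm"}

def inject_credentials(host: str, headers: dict) -> dict:
    """Return outbound headers with the decoy swapped for the real key."""
    real_key = VAULT.get(host)
    if real_key is None:
        return dict(headers)  # unknown host: the decoy passes through unchanged
    return {name: value.replace(DECOY, real_key) for name, value in headers.items()}

# What the agent sends from inside the VM:
guest_headers = {"Authorization": f"Bearer {DECOY}"}

# What leaves the proxy toward the real upstream:
print(inject_credentials("api.example.com", guest_headers))
# What an attacker-controlled host would receive instead:
print(inject_credentials("attacker.example.net", guest_headers))
```

Rewriting headers means the proxy has to see the request in the clear, which is consistent with the caveat that cert-pinned upstreams can't be brokered this way.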

The famous attack that's dead on arrival

The Capital One scenario recreates the 2019 breach: reach the cloud metadata endpoint (169.254.169.254) via SSRF and pull IAM credentials. You get root and try the same move. It's blocked by default, at multiple layers, with zero configuration. The most famous cloud attack of the decade — dead on arrival.
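Why a destination-level drop is robust can be shown in a few lines. This is a toy check, not the runtime's actual rule set: the function name is hypothetical, and we model only a single-address match, whereas the real block is described as multi-layered.

```python
import ipaddress

METADATA_IP = ipaddress.ip_address("169.254.169.254")

def drop_packet(dst) -> bool:
    """True if a packet to dst would be dropped by this toy rule."""
    return ipaddress.ip_address(dst) == METADATA_IP

print(drop_packet("169.254.169.254"))  # the textbook SSRF target
print(drop_packet(2852039166))         # same address written as an integer
print(drop_packet("93.184.216.34"))    # an unrelated public address
```

The check runs on the packet's destination address, not on the URL the application was tricked into fetching, so creative spellings of the same address do not change the outcome.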

A map of what guarantees what

Layer	Kind	What it stops	Can a determined attacker beat it?
microVM isolation	foundation	a compromised agent reaching the host or another tenant	No — the VM boundary contains it; even a kernel-level compromise stays in the VM (escape means breaking the hypervisor, a far smaller surface)
PII redaction	guardrails	the model ever seeing raw PII	Rarely — configured PII is masked before the model sees it; pattern-based, like any redactor
Prompt-injection defense	guardrails	single-, multi-turn & indirect injection	Rarely — layered, top-tier on public benchmarks; still probabilistic, not a guarantee
Egress allowlist	deterministic wall	connections to non-allowlisted hosts	No — the kernel drops it, outside the guest
Credential vault	deterministic wall	key theft from inside the VM	No — the key is never in the VM (caveat: cert-pinned upstreams can't be brokered)
Metadata block	deterministic wall	SSRF to cloud metadata	No — dropped by default, no config

Read that last column top to bottom. The guardrail layers are strong but probabilistic: rarely beaten, not never. The deterministic walls don't make a judgment call at all: they drop the traffic or withhold the key no matter what the content says, which is how they stop what guardrails fundamentally can't. Neither is the "real" layer. The security is the combination.

The honest fine print

The easy levels have no defenses on purpose. They're there so you can feel the difference the moment a layer turns on.
Guardrails are strong, and still probabilistic. Our injection defense benchmarks near the top of public attack sets and turns back the large majority of what people try. But content inspection is inspection, so we back it with deterministic walls instead of betting everything on it. The Research Bot's middle level is left winnable on purpose to show exactly that seam.
Nothing real is at risk. Every secret is a fake canary. And win-detection is deliberately generous — it decodes base64, hex, rot13, and reversed strings, so "I technically smuggled it out" counts. We're not grading in our own favor.
It's the real runtime. Arena is a thin game around the same isolation, egress enforcement, and vault you get through the SDKs. We built on Firecracker microVMs for a reason, and we lay out that reasoning in a separate post.
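On the generous win-detection mentioned in the fine print: one way to approximate it is to search the agent's output for the canary under each encoding. This is a simplified sketch under our own assumptions. It encodes the canary and looks for the result, rather than decoding the output as the real detector is described as doing, so it would miss a canary that was base64-encoded as part of a longer string. The function names and the canary value are hypothetical.

```python
import base64
import codecs

def candidate_encodings(secret: str) -> dict:
    """Map an encoding name to how the secret would look under it."""
    raw = secret.encode()
    return {
        "plain": secret,
        "base64": base64.b64encode(raw).decode(),
        "hex": raw.hex(),
        "rot13": codecs.encode(secret, "rot13"),
        "reversed": secret[::-1],
    }

def detect_leak(output: str, secret: str):
    """Return the name of the first encoding found in output, else None."""
    for name, encoded in candidate_encodings(secret).items():
        # Hex digits may arrive in either case; the other forms are case-sensitive.
        haystack = output.lower() if name == "hex" else output
        if encoded in haystack:
            return name
    return None

canary = "CANARY-7f3a"  # fake secret, as in Arena
print(detect_leak("the key is a3f7-YRANAC", canary))
print(detect_leak("I cannot share that.", canary))
```

Checking the transformed forms is what makes "I technically smuggled it out" count as a win: an attacker who gets the model to reverse or encode the secret has still extracted it.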

Come break it

The point of Arena isn't to claim Declaw is unbreakable. It's to make the difference between content inspection and a deterministic wall something you can feel with your own hands — and to show what it looks like when a runtime does both. Most agent security stops at one layer: guardrails alone, or raw isolation alone. We think you need strong guardrails and enforcement the agent can't route around — and Arena is the place to check whether we mean it.

Go try to break it: declaw.ai/arena. If you get past a hardened level, we genuinely want to know how.