Why Sandboxes Alone Won't Secure Your AI Agents
Sandboxes isolate execution. But isolation doesn't stop your agent from leaking a customer's SSN to a third-party API. Here's the gap nobody talks about.
The sandbox illusion
If you're running AI agents in production, you probably have some form of sandboxing. Maybe you're spinning up containers, maybe microVMs. The agent runs, does its thing, gets torn down. Isolated. Safe. Right?
Not really.
Sandboxes solve one problem well: they stop a rogue agent from touching your host system. But here's what they don't do — they don't look at what's leaving the sandbox.
What actually goes wrong
We've been running agents in sandboxed environments for a while now, and the failure modes are almost never about the sandbox itself. They're about what the agent sends out of it.
A few things we've seen firsthand:
- An agent scraping user data passes a full credit card number in a POST body to an analytics endpoint
- A coding assistant sends an API key found in a config file to a completions endpoint as part of its context
- A research agent follows a prompt injection embedded in a webpage and exfiltrates conversation history to an attacker-controlled domain
None of these are sandbox escapes. The sandbox did its job. The agent stayed inside its little VM. It just happened to send sensitive data straight through the front door.
The missing layer
What's missing is inspection at the network boundary. Not just "can this agent reach the internet?" but "what is this agent saying to the internet?"
This is why we built Declaw's security pipeline directly into the sandbox runtime. Every outbound request from an agent passes through a proxy that runs inside the VM. It's not a separate service you bolt on after the fact — it shares the execution context.
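To make the idea concrete, here's a minimal sketch of what in-VM outbound inspection looks like: a single choke point that every request passes through before it leaves the sandbox. The patterns and the `guarded_post` helper are illustrative, not Declaw's actual implementation.

```python
import re
import urllib.request

# Illustrative patterns only -- real detection is more involved than regex.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # SSN-shaped strings
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),   # credit-card-shaped strings
]

def inspect_outbound(url: str, body: bytes) -> None:
    """Raise before the request leaves the sandbox if the body looks sensitive."""
    text = body.decode("utf-8", errors="replace")
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            raise PermissionError(f"blocked outbound request to {url}")

def guarded_post(url: str, body: bytes):
    # Inspection happens in-process, in the same VM as the agent --
    # no extra network hop, and full context about who is sending what.
    inspect_outbound(url, body)
    req = urllib.request.Request(url, data=body, method="POST")
    return urllib.request.urlopen(req)
```

The point isn't the regexes; it's where the check runs. Because the inspector shares the agent's execution context, it fires before the bytes ever hit the wire.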
That proxy does a few things:
- PII redaction: Detects and strips sensitive data (SSNs, credit cards, emails, API keys) before it leaves the sandbox. Original values get restored in responses so the agent's workflow isn't broken.
- Prompt injection detection: Catches both direct injections and indirect ones embedded in content the agent fetches.
- Code security scanning: Blocks dangerous system calls and code execution patterns.
- Toxicity filtering: Prevents agents from generating or forwarding harmful content.
- Invisible text detection: Strips hidden unicode characters that can carry covert instructions.
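The redact-then-restore idea from the first bullet is worth spelling out, since it's what keeps redaction from breaking the agent's workflow. Here's a hedged sketch: sensitive values get swapped for opaque placeholders on the way out, and the mapping restores them in the response on the way back in. The pattern set and placeholder format are made up for illustration.

```python
import re
import secrets

# Illustrative detectors -- a real pipeline covers far more than two patterns.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with opaque tokens; return the scrubbed text and the mapping."""
    mapping: dict[str, str] = {}
    for name, pattern in PII_PATTERNS.items():
        def swap(match, name=name):
            token = f"<{name}:{secrets.token_hex(4)}>"
            mapping[token] = match.group(0)
            return token
        text = pattern.sub(swap, text)
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into a response so the agent never notices."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

The upstream service only ever sees the tokens; the mapping never leaves the sandbox.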
Why "bolt-on" doesn't work
You could technically run a guardrails service externally and route your sandbox traffic through it. People do this. Here's why it breaks down:
Context loss. An external guardrails service sees HTTP requests. It doesn't know which sandbox sent them, what the agent was doing, or what files it had access to. Declaw's proxy sees everything because it runs in the same VM.
Latency. Every request takes a detour to another service and back. For agents making dozens of API calls per task, this adds up. Our pipeline adds single-digit milliseconds because it's local to the sandbox.
Gaps. Two vendors means two configurations, two sets of policies, two places where something can silently fail. We've seen setups where the sandbox vendor's networking config quietly bypassed the guardrails service entirely.
What we'd actually recommend
If you're building with AI agents today:
- Don't trust your sandbox to be your entire security story
- Inspect outbound traffic, not just inbound
- Run your security layer as close to the agent as possible — ideally in the same execution environment
- Log everything. You'll need the audit trail when something inevitably gets through.
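For the last point, the cheapest version of "log everything" is a structured audit event per outbound request, recorded regardless of verdict. A minimal sketch, with field names that are assumptions rather than any particular product's schema:

```python
import json
import time

def audit_event(sandbox_id: str, url: str, body: bytes, verdict: str) -> str:
    """One structured line per outbound request, whatever the verdict was."""
    return json.dumps({
        "ts": time.time(),
        "sandbox": sandbox_id,       # which sandbox sent this
        "url": url,                  # where it was headed
        "bytes_out": len(body),      # size, not content -- don't log the PII itself
        "verdict": verdict,          # e.g. "allowed", "redacted", "blocked"
    })
```

Note that this logs the size of the body, not the body itself: an audit trail that copies sensitive payloads into your log store just moves the leak.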
We built Declaw because we hit these problems ourselves and got tired of duct-taping three different tools together. If you're in the same spot, join the waitlist or come talk to us on Discord.