Why We Chose Firecracker Over Docker for Agent Sandboxing
Containers are fast and convenient. They're also a terrible isolation boundary for untrusted AI agents. Here's the tradeoff we made and why.
The obvious choice
When we started building Declaw, the first prototype used Docker containers. It's what everyone reaches for. Fast to spin up, great tooling, every CI system on earth knows how to run them. For most workloads, containers are the right call.
AI agent sandboxing is not most workloads.
The container problem
Containers share a kernel with the host. That's the fundamental issue. When you run an untrusted container, you're betting that the Linux kernel's namespace and cgroup isolation is airtight. For your own application code, that bet is reasonable. For an AI agent that can write and execute arbitrary code? Less so.
Here's what we were worried about:
Kernel exploits. Container escapes via kernel vulnerabilities are not theoretical — they happen. CVE-2024-21626, CVE-2022-0185, CVE-2020-15257. An AI agent that can write code can potentially write exploit code. It doesn't need to be sophisticated — it just needs to find the right PoC on the internet and adapt it.
Resource isolation gaps. Cgroups handle CPU and memory, but there are shared kernel resources that containers can't fully isolate. Filesystem cache pressure, network stack state, certain /proc entries. An agent doing something weird in one container can affect others on the same host in subtle ways.
The /proc and /sys surface area. Even with seccomp and AppArmor, the attack surface of a shared kernel is large. Every new kernel feature is a potential escape vector. Keeping up with that treadmill for security-critical workloads isn't sustainable.
We spent a few weeks hardening our Docker setup — custom seccomp profiles, read-only root filesystems, dropped capabilities, gVisor as a runtime. It worked, sort of. But every new edge case felt like we were patching holes in a boat rather than building a better boat.
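To give a feel for what that layering looks like: a hardened setup starts from an allow-list seccomp profile like the sketch below. This is a trimmed illustration, not the profile we actually shipped — a real allow-list needs dozens more syscalls, and getting the list wrong breaks workloads in confusing ways.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "mmap", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

The profile then gets stacked with the other hardening flags — something like `docker run --security-opt seccomp=profile.json --read-only --cap-drop=ALL --runtime=runsc ...` — and every one of those layers is something you have to keep correct as kernels and workloads change.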
The Firecracker bet
Firecracker gives you a real VM — separate kernel, separate userspace — with near-container startup times. AWS built it for Lambda and Fargate, which have roughly the same trust model we do: run untrusted code safely at scale.
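Concretely, a Firecracker microVM is described by a small JSON config — kernel, root filesystem, CPU and memory — in roughly the shape its `--config-file` option expects. The paths and sizes here are illustrative, not our production values:

```json
{
  "boot-source": {
    "kernel_image_path": "/var/lib/declaw/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/var/lib/declaw/rootfs.ext4",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 128
  }
}
```

The same settings can also be applied at runtime over Firecracker's API socket, which is how orchestrators typically drive it.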
The tradeoffs:
What we gained:
- Hard isolation boundary. The agent runs in its own kernel. A kernel exploit inside the VM doesn't compromise the host because the host is running a different kernel behind the Firecracker VMM.
- Clean network namespace. Each VM gets its own network stack. Our security proxy runs inside this stack and sees all traffic. No iptables gymnastics, no sidecar container networking hacks.
- Predictable resource limits. CPU, memory, and disk are hard-partitioned. One agent can't starve another.
- Simpler security model. Instead of layering seccomp + AppArmor + user namespaces + gVisor and hoping we didn't miss a gap, we have a single isolation primitive that's well-understood.
What it costs:
- ~50ms cold start. That's in the same range as a cold Docker container start (usually 50-100ms), though slower than reusing a warm container. Either way, it's fast enough for agent workloads where the agent itself takes seconds to minutes to complete a task.
- Higher memory baseline. Each VM has its own kernel in memory. We minimize this with a minimal kernel config and shared memory where possible, but it's still more than a container.
- Less tooling. No `docker exec` to drop into. No `docker-compose` for local dev (we built our own orchestration). Debugging is harder — but that's partly the point. If it's hard for us to get in, it's hard for an attacker too.
How the security proxy fits in
The real win isn't just isolation — it's what the VM boundary lets us do with network traffic.
Because each sandbox is its own VM with its own network stack, we can run a transparent proxy inside the VM that intercepts all outbound traffic. The agent doesn't know it's there. No proxy configuration, no environment variables, no SDK changes.
Inside the VM, the proxy runs the full security pipeline — PII redaction, prompt injection scanning, code security checks, toxicity filtering, invisible text detection. All before traffic leaves through the VM's virtual network interface.
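To illustrate the interception pattern — not our actual proxy, and with a stand-in "pipeline" that only redacts email addresses — a transparent TCP forwarder that scans outbound bytes before they leave looks roughly like this:

```python
import asyncio
import re

# Stand-in for the real pipeline: redact email addresses from outbound bytes.
# The production pipeline chains PII redaction, injection scanning, etc.
EMAIL_RE = re.compile(rb"[\w.+-]+@[\w-]+\.[\w.]+")

def run_pipeline(payload: bytes) -> bytes:
    # Simplification: a real scanner must handle matches split across chunks.
    return EMAIL_RE.sub(b"[REDACTED]", payload)

async def handle(client_r, client_w, upstream_host, upstream_port):
    """Forward one client connection to the upstream, scanning outbound data."""
    up_r, up_w = await asyncio.open_connection(upstream_host, upstream_port)

    async def outbound():
        while data := await client_r.read(4096):
            up_w.write(run_pipeline(data))  # scan before it leaves the sandbox
            await up_w.drain()
        up_w.write_eof()  # propagate client EOF, keep the read side open

    async def inbound():
        while data := await up_r.read(4096):
            client_w.write(data)
            await client_w.drain()
        client_w.close()

    await asyncio.gather(outbound(), inbound(), return_exceptions=True)
```

In our setup the equivalent of `handle` runs inside the microVM, with the guest's routing pointing all outbound connections at it — which is why the agent needs no proxy configuration at all.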
With containers, achieving this required either a sidecar proxy with iptables REDIRECT rules (fragile, easy to misconfigure) or modifying the agent's HTTP client (invasive, breaks when agents use subprocesses). With Firecracker, it's just how the network works inside the VM.
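For contrast, the sidecar approach means host-side netfilter rules along these lines (ports and proxy user are illustrative), which have to stay in sync with every container's network namespace:

```shell
# Redirect the container's outbound HTTP/HTTPS to a sidecar proxy on port 8080.
iptables -t nat -A OUTPUT -p tcp --dport 80  -j REDIRECT --to-ports 8080
iptables -t nat -A OUTPUT -p tcp --dport 443 -j REDIRECT --to-ports 8080
# Avoid a loop: don't redirect traffic originating from the proxy's own user.
iptables -t nat -I OUTPUT -m owner --uid-owner proxy -j RETURN
```

Miss one rule — or one port — and traffic silently bypasses the proxy, which is exactly the fragility we wanted out of the design.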
The numbers
Some benchmarks from our test suite, for context:
- Cold start (no snapshot): ~50ms from API call to agent-ready
- Security pipeline overhead: 2-7ms per request depending on payload size and which scanners are active
- Memory per sandbox: ~40MB baseline (minimal kernel + init + security proxy)
- Concurrent sandboxes per host: Depends on host size, but we've run 200+ on a single 64GB machine
For comparison, the agent tasks we benchmark against take 5-60 seconds. The sandbox overhead is noise.
When you should still use containers
To be clear — Firecracker isn't the right choice for everything. If you're running your own trusted application code and just want process isolation, containers are fine. If you need sub-10ms startup, containers win. If your team's operational expertise is in Docker/Kubernetes, the learning curve of Firecracker has a real cost.
We chose Firecracker specifically because we're running untrusted, arbitrary code generated by AI. That threat model demands a harder isolation boundary than containers provide.
If you're curious about the implementation details, the packages/orchestrator directory has all the Firecracker lifecycle management. Reach out on Discord if you want to learn more.