2026-05-17 · Security

Dirty Frag: Why Hardware Isolation Matters

A new kernel zero-day gives unprivileged users deterministic root on most Linux distributions shipped since 2017. We tested it against our microVM isolation boundary. Here's what happened.

On May 7, Hyunwoo Kim (V4bel) disclosed Dirty Frag — two Linux kernel vulnerabilities (CVE-2026-43284 and CVE-2026-43500) that give unprivileged users deterministic root on most Linux distributions shipped since 2017. Microsoft confirmed active exploitation the next day.

We build declaw.ai — sandboxing infrastructure for AI agents, built on Firecracker microVMs. Our workloads execute untrusted code that we don't write and can't predict. When Dirty Frag dropped, our first question was: does our isolation boundary actually hold?

We tested it. It does. This post explains what we did, what happened, and why the answer comes down to architecture.

What Dirty Frag does

The short version: the exploit tricks the kernel into performing in-place cryptographic decryption on pages that belong to the page cache — the kernel's shared read cache for files. This lets an attacker overwrite the in-memory contents of any file (e.g., /usr/bin/su or /etc/passwd) and gain root. No race condition, fully deterministic.

There are two independent paths:

CVE-2026-43284 (ESP/IPsec subsystem, in the kernel since 2017): requires creating a user namespace to obtain CAP_NET_ADMIN
CVE-2026-43500 (RxRPC subsystem, since 2023): requires only socket(AF_RXRPC) and add_key()
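
The prerequisites above can be turned into a quick triage check that exploits nothing. The following is a minimal sketch and not part of the PoC: the helper names (module_loaded, userns_allowed, report) are hypothetical, and treating a loaded rxrpc module as a proxy for the second path is an assumption. A built-in module, or one that autoloads on first socket creation, will not appear in /proc/modules.

```shell
# Hypothetical triage helper. It only inspects the environment.

# module_loaded NAME [MODULES_FILE]
# Succeeds if NAME is listed in a /proc/modules-style file.
module_loaded() {
  local name="$1" file="${2:-/proc/modules}"
  [ -r "$file" ] && grep -q "^${name} " "$file"
}

# userns_allowed
# Succeeds if the current user can create a user + network namespace,
# the same check the post runs by hand with `unshare --user --net`.
userns_allowed() {
  command -v unshare >/dev/null 2>&1 && unshare --user --net true 2>/dev/null
}

report() {
  if userns_allowed; then
    echo "ESP path (CVE-2026-43284): user namespace creation allowed"
  else
    echo "ESP path (CVE-2026-43284): user namespace creation blocked"
  fi
  if module_loaded rxrpc; then
    echo "RxRPC path (CVE-2026-43500): rxrpc module loaded"
  else
    echo "RxRPC path (CVE-2026-43500): rxrpc not listed (may be built in or autoloadable)"
  fi
}

report
```

A "blocked" or "not listed" line is a hint, not proof of safety; it only says that one prerequisite was not observed from the current context.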

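The two paths cover each other's blind spots: a distribution that blocks one prerequisite (for example, unprivileged user namespace creation) may still expose the other. Qualys has an excellent deep-dive on the mechanics. Sysdig covers detection and the exploitation chain in detail.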

Why this matters for multi-tenant platforms

Dirty Frag is a page-cache write primitive. The page cache is shared across the entire machine — every container, every process, the host itself. When two containers read the same file, they're reading the same physical memory pages.

For container-based isolation, this is the problem: containers share the host kernel. Namespace isolation, seccomp, dropped capabilities — all of these are enforced by the kernel. A kernel exploit doesn't need to "escape" the container. It operates below the layer where container isolation exists.
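
One way to see the shared kernel directly is to compare the kernel's boot ID from the host and from inside the sandbox. This is a sketch under assumptions: the function names are hypothetical, and it assumes the runtime does not mask /proc/sys/kernel/random/boot_id (some runtimes, such as systemd-nspawn, substitute their own value).

```shell
# kernel_boot_id [FILE]
# Prints the boot ID of the kernel serving this process.
kernel_boot_id() {
  cat "${1:-/proc/sys/kernel/random/boot_id}" 2>/dev/null
}

# same_kernel ID_A ID_B
# Succeeds when both IDs are non-empty and identical, which suggests
# the two environments are served by the same running kernel.
same_kernel() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Print the local value; run this once on the host and once in the sandbox.
kernel_boot_id || true
```

If the two values match, `same_kernel "$host_id" "$sandbox_id"` succeeds and the sandbox is running on the host's kernel. A microVM guest boots its own kernel, so the values should differ.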

This isn't a new observation. It's the same structural issue behind Dirty COW (2016), Dirty Pipe (2022), and now Dirty Frag (2026). Edera documented the pattern: 40+ kernel CVEs from 2020-2025 that could reach through namespace boundaries, roughly 2-4 escape-grade vulnerabilities per year.

The problem isn't that any specific platform failed to patch. The problem is structural: on the day a kernel zero-day drops — before any patch exists — every container-based sandbox sharing that kernel is exposed. Patching closes the window after the fact. It cannot close it in advance. This class of vulnerability will keep appearing (Dirty COW → Dirty Pipe → Dirty Frag → the next one), and container isolation will keep being insufficient on day zero, every time.

What we tested

We ran the public Dirty Frag PoC (ESP path, CVE-2026-43284) in two environments: a container-based sandbox and our Firecracker microVM infrastructure.

Test 1: Container-based sandbox (shared kernel)

Setup: we replicated a typical container sandbox configuration — Docker with seccomp enabled, unprivileged user (uid=1001), host kernel 6.8.0 (in the vulnerable range).

We first verified the exploit prerequisites:

$ id
uid=1001(user) gid=1001(user) groups=1001(user)
$ unshare --user --net echo "namespaces work"
namespaces work

User namespace creation was allowed by the seccomp profile. XFRM netlink sockets inside the new namespace were also permitted. We compiled and ran the PoC:

$ ./dirtyfrag
root ➜ ~ $ id
uid=0(root) gid=0(root) groups=0(root)
root ➜ ~ $ cat /etc/shadow | head -2
root:*:20549:0:99999:7:::
daemon:*:20549:0:99999:7:::
root ➜ ~ $ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-[redacted] root=UUID=[redacted]

Unprivileged user to root in under 2 seconds. Seccomp was active — it didn't prevent the attack because the required syscalls (user namespace creation, XFRM netlink) were permitted by the profile. With root, we could read /etc/shadow, see host kernel boot parameters via /proc/cmdline, and see Docker overlay2 layer paths in /proc/1/mountinfo.
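
To confirm from inside a sandbox whether seccomp is active, the Seccomp field in /proc/self/status can be decoded (0 = disabled, 1 = strict, 2 = filter). A minimal sketch with a hypothetical helper name; it reports only the mode, not which syscalls the profile permits, and that second question is what mattered in this test.

```shell
# seccomp_mode [STATUS_FILE]
# Prints the seccomp mode recorded in a /proc/<pid>/status-style file.
# With the default path, awk reads its own status entry; seccomp state
# is inherited from the calling shell, so the value is the same.
seccomp_mode() {
  local file="${1:-/proc/self/status}" value
  # Match "Seccomp:" exactly so "Seccomp_filters:" is ignored.
  value="$(awk '/^Seccomp:/ {print $2}' "$file" 2>/dev/null || true)"
  case "$value" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

seccomp_mode
```

A result of "filter" says that a profile is loaded. It says nothing about whether that profile blocks user namespace creation or XFRM netlink sockets.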

Test 2: Firecracker microVM (hardware boundary)

Setup: Firecracker microVM with a deliberately unpatched guest kernel (in the vulnerable range). No seccomp restrictions. The user starts as root with full capabilities — intentionally more permissive than the container test, since the VM boundary is the isolation layer.

$ id
uid=0(root) gid=0(root) groups=0(root)
$ cat /proc/self/status | grep Seccomp
Seccomp:  0

We compiled and ran the same PoC. The exploit worked — page-cache corruption succeeded inside the guest. Then we attempted to reach the host:

$ cat /proc/cmdline
console=ttyS0 reboot=k panic=1 pci=off [...] virtio_mmio.device=4K@0xd0000000:5
$ ps aux | head -5
USER  PID %CPU %MEM  COMMAND
root    1  0.0  2.6  /sbin/init
root    2  0.0  0.0  [kthreadd]
root    3  0.0  0.0  [rcu_gp]
root    4  0.0  0.0  [rcu_par_gp]
$ for port in 22 80 443 8080 9090; do
>   echo "Port $port: $(timeout 1 bash -c "echo >/dev/tcp/10.0.0.1/$port" 2>&1 && echo OPEN || echo CLOSED)"
> done
Port 22: CLOSED
Port 80: CLOSED
Port 443: CLOSED
Port 8080: CLOSED
Port 9090: CLOSED

Every attempt to reach the host failed:

Host kernel: not visible. /proc/cmdline shows guest boot params (virtio devices), not the host's real kernel or disk UUIDs.
Host processes: invisible. ps shows only guest kernel threads and the in-VM daemon. The guest has its own kthreadd, kswapd, ksoftirqd — a completely separate kernel.
Host network: all ports on the gateway (10.0.0.1) are closed. No host services are reachable.
Host filesystem: only virtual block devices (/dev/vda, /dev/vdb). No host disks, no overlay paths.
Host hardware info: no DMI/SMBIOS data. Firecracker's minimal device model doesn't expose host hardware identity.

The exploit achieved page-cache corruption inside the guest — but the page cache it corrupted belongs to the guest's own kernel. The host's page cache is in a different kernel, in memory the guest cannot address.
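
The /proc/cmdline observation can be scripted as a rough heuristic. This sketch uses a hypothetical function name and assumes the guest was booted with virtio_mmio.device= entries on its command line, as in the output above. The marker's absence does not prove that the caller is on a host.

```shell
# looks_like_microvm_guest CMDLINE
# Succeeds if the kernel command line carries virtio-mmio device
# entries like the ones in the guest transcript.
looks_like_microvm_guest() {
  case "$1" in
    *virtio_mmio.device=*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify the local kernel command line, if it is readable.
cmdline="$(cat /proc/cmdline 2>/dev/null || true)"
if looks_like_microvm_guest "$cmdline"; then
  echo "cmdline shows virtio-mmio guest devices"
else
  echo "no virtio-mmio marker on cmdline"
fi
```

Run against the two transcripts in this post, the guest command line matches and the container's view of the host command line (BOOT_IMAGE=..., root=UUID=...) does not.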

Why the microVM boundary holds

The difference is architectural. A Firecracker guest runs its own kernel with its own page cache, backed by a virtual block device. When the exploit corrupts page-cache pages, it corrupts the guest's pages, which live in a bounded region of host memory mapped through EPT (Extended Page Tables). The guest cannot address anything outside that region, and the host's page cache sits outside it.

Note the asymmetry: the microVM test started with more privilege (root, full capabilities, no seccomp) than the container test (unprivileged user, seccomp active). Yet the container test exposed host information, while the microVM test exposed none. The boundary that matters is not what permissions the software grants; it's whether the kernel is shared.

To escape a Firecracker microVM, an attacker would need to find a vulnerability in the VMM itself (a minimal device model of five emulated devices, ~50K lines of Rust) or in KVM (~25K lines of kernel code). Google's kvmCTF offers $250,000 for a full KVM guest-to-host escape. To our knowledge, only one has been publicly demonstrated (CVE-2021-29657, in AMD nested SVM; it required nested virtualization and months of exploit development).

The takeaway

This isn't a "we're secure because we patched fast" story. We deliberately tested on an unpatched kernel. The boundary held because it's architectural — it doesn't depend on knowing about the vulnerability in advance.

If you run untrusted code in a multi-tenant environment, the question worth asking any isolation provider: if code inside the sandbox becomes root, can it reach the host or other tenants? If the answer involves "as long as we're patched" — that's the gap.

The Dirty Frag PoC is public. The test is reproducible. You don't need to trust our claims.

The Dirty Frag PoC is at github.com/V4bel/dirtyfrag. Questions welcome — shivam@declaw.ai.