Benchmarks
How Declaw measures up
Sandbox TTI is benchmarked using the computesdk/benchmarks methodology. Guardrails are benchmarked against public datasets (Deepset, Lakera Gandalf, jailbreak-classification, AgentDojo, PII-Masking-300k) and documented attack classes. All results are compared directly to leading industry solutions.
Sandbox Performance
TTI (time to interactive) measures the wall-clock time from the sandbox.create() API call to the first command executing inside the sandbox. All results are from the ComputeSDK public leaderboard: 100 iterations per provider, scored as 60% median / 25% p95 / 15% p99, multiplied by success rate. Benchmarks run daily via automated GitHub Actions.
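For concreteness, here is a minimal sketch of how that composite can be reproduced from raw TTI samples. The 60/25/15 weights and the success-rate multiplier come from the methodology above; the latencyToScore normalization is a placeholder assumption, since the exact ms-to-score mapping is defined in the computesdk/benchmarks repo.

```typescript
// Sketch: reproduce the leaderboard composite from raw TTI samples.
// The 60/25/15 weights and the success-rate multiplier follow the
// methodology above; latencyToScore is an ASSUMED placeholder, since
// the exact ms-to-score mapping is defined in computesdk/benchmarks.

function percentile(sorted: number[], p: number): number {
  // Nearest-rank percentile over an ascending-sorted sample.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function latencyToScore(ms: number): number {
  // Placeholder: lower latency maps to a higher score on [0, 100].
  return Math.max(0, 100 - ms / 10);
}

function compositeScore(ttisMs: number[], successRate: number): number {
  const sorted = [...ttisMs].sort((a, b) => a - b);
  const weighted =
    0.6 * latencyToScore(percentile(sorted, 50)) +
    0.25 * latencyToScore(percentile(sorted, 95)) +
    0.15 * latencyToScore(percentile(sorted, 99));
  return weighted * successRate; // successRate in [0, 1]
}
```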
Declaw ranks #1 across all three modes: sequential (99.6), staggered (99.6), and burst (97.4). Median TTI is 40 ms in sequential and staggered modes and 240 ms under a 100-sandbox concurrent burst, with a 100% success rate across all modes. Fourteen providers were benchmarked.
Full results
| Category | Benchmark | Declaw | Best public competitor | Delta |
|---|---|---|---|---|
| Sequential TTI | Composite score | 99.6 / 100 | Tensorlake: 98.0 | +1.6 |
| Sequential TTI | Median | 40 ms | Tensorlake: 190 ms | −150 ms |
| Sequential TTI | p95 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Sequential TTI | p99 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Staggered TTI (200ms apart) | Composite score | 99.6 / 100 | Tensorlake: 98.0 | +1.6 |
| Staggered TTI | Median | 40 ms | Tensorlake: 190 ms | −150 ms |
| Staggered TTI | p95 | 40 ms | Tensorlake: 210 ms | −170 ms |
| Staggered TTI | p99 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Burst TTI (100 simultaneous) | Composite score | 97.4 / 100 | Tensorlake: 95.1 | +2.3 |
| Burst TTI | Median | 240 ms | Tensorlake: 430 ms | −190 ms |
| Burst TTI | p95 | 290 ms | Tensorlake: 560 ms | −270 ms |
| Burst TTI | p99 | 290 ms | Tensorlake: 580 ms | −290 ms |
| Burst TTI | Wall clock (100 sandboxes) | 0.31 s | Tensorlake: 0.60 s | −0.29 s |
Source: ComputeSDK Sandbox Leaderboard, May 9, 2026. 100 iterations per provider, automated daily runs via GitHub Actions. Methodology: github.com/computesdk/benchmarks.
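The burst rows above correspond to 100 creates issued simultaneously. A measurement harness for that mode might look like the sketch below; createSandbox and runCommand are hypothetical stand-ins for whatever provider SDK is under test.

```typescript
// Sketch: burst-mode TTI measurement (100 simultaneous creates).
// createSandbox / runCommand are hypothetical stand-ins for the
// provider SDK under test; TTI is wall-clock from the create call
// to the first command executing inside the sandbox.

declare function createSandbox(): Promise<{
  runCommand(cmd: string): Promise<void>;
}>;

async function measureOneTti(): Promise<number> {
  const start = performance.now();
  const sandbox = await createSandbox();
  await sandbox.runCommand("echo ready"); // first command = interactive
  return performance.now() - start;
}

async function burst(n = 100): Promise<number[]> {
  // All n creates are issued in the same tick, matching the
  // "100 simultaneous" burst mode above; failures count against
  // the success rate rather than the latency sample.
  const settled = await Promise.allSettled(
    Array.from({ length: n }, measureOneTti),
  );
  return settled
    .filter((r): r is PromiseFulfilledResult<number> => r.status === "fulfilled")
    .map((r) => r.value);
}
```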
Guardrails
1. Prompt Injection
Declaw's prompt-injection detection was evaluated on 2,965 prompts across three public datasets: the canonical Deepset dataset, Lakera's Gandalf user-attack corpus, and the balanced jailbreak-classification set. Lakera's PINT benchmark is proprietary; these three datasets are the closest public equivalent.
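Each per-dataset row below reduces to a standard confusion-matrix tally over that dataset. A minimal sketch, assuming a hypothetical detect(prompt) call that returns true when Declaw flags an injection at production settings:

```typescript
// Sketch: per-dataset evaluation as a confusion-matrix tally.
// detect() is a hypothetical stand-in for Declaw's prompt-injection
// check at production settings.

declare function detect(prompt: string): Promise<boolean>;

interface Sample {
  text: string;
  isInjection: boolean; // dataset label
}

async function evaluate(samples: Sample[]) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const s of samples) {
    const flagged = await detect(s.text);
    if (flagged && s.isInjection) tp++;
    else if (flagged && !s.isInjection) fp++;
    else if (!flagged && !s.isInjection) tn++;
    else fn++;
  }
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return {
    accuracy: (tp + tn) / samples.length,
    precision,
    recall,
    f1: (2 * precision * recall) / (precision + recall),
  };
}
```

Plugging the combined row's precision (0.9963) and recall (0.9803) into F1 = 2PR / (P + R) reproduces the 0.988 in the table below.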
Results at production settings
| Dataset | n | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| deepset/prompt-injections | 662 | 95.92% | 99.58% | 90.08% | 0.946 |
| Lakera gandalf_ignore_instructions | 1,000 | 99.90% | 100.00% | 99.90% | 0.9995 |
| jackhhao/jailbreak-classification | 1,306 | 98.70% | 99.09% | 98.35% | 0.987 |
| Combined | 2,965 | 98.48% | 99.63% | 98.03% | 0.988 |
Industry comparison — Deepset accuracy
| System | Accuracy | F1 |
|---|---|---|
| Declaw | 95.92% | 0.946 |
| Lakera Guard | 87.91% | 0.823 |
| Vigil | 77.49% | 0.615 |
| Azure Prompt Shield | 77.28% | 0.560 |
| ProtectAI LLM Guard | 76.00% | 0.579 |
| Calypso AI Moderator | 73.56% | 0.572 |
| Rebuff | 72.96% | 0.689 |
| LangKit Similarity | 70.20% | 0.588 |
| LangKit Canary | 61.02% | 0.044 |
Competitor numbers from Palit & Woods (2025), arXiv:2505.13028, Table 5. Declaw evaluated 2026-04-20 on the same Deepset dataset (662 prompts, train+test splits) at production settings.
2. Agentic Indirect Injection (AgentDojo)
The numbers above measure detection on static datasets. The harder question is whether a guardrail actually stops an attack inside a live agent loop, where injected instructions hide inside tool outputs (emails, Slack messages, documents) and the LLM must be prevented from acting on them. AgentDojo (ETH Zürich SPY Lab, NeurIPS 2024) is the canonical benchmark for this: it runs real tool-using agents through curated user tasks with adversarial instructions planted in returned tool data.
Declaw was integrated into AgentDojo's pipeline and evaluated on the full slack suite (105 trials, gpt-4o-mini) under the important_instructions attack — AgentDojo's strongest attack class.
| Variant | Attack success rate | Injections blocked | Task utility |
|---|---|---|---|
| Baseline (no defense) | 62.9% | 39 / 105 | 51.4% |
| Declaw | 0.0% | 105 / 105 | 17.1% |
Declaw blocks 100% of injections (0 of 105 succeed) versus a 62.9% attack success rate with no defense: a 62.9-percentage-point reduction on a benchmark explicitly designed to probe live agentic pipelines.
On the utility tradeoff: AgentDojo's evaluator treats an entire tool-output message as tainted once Declaw flags an injection. Because the slack suite intentionally concentrates attacks inside the same messages that carry legitimate data (the attacker's goal is to ride real content), blocking the injection also redacts some useful context, which lowers measured utility. This is a property of the benchmark (adversarial-maximal placement plus whole-message redaction), not of an ordinary workload. We consider the attack-success-rate reduction the primary signal.
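In pipeline terms, the integration point is the tool-output boundary: every tool result is scanned before it reaches the model, and a flagged message is withheld wholesale, which is exactly the whole-message behavior described above. A minimal sketch, with scanForInjection as a hypothetical stand-in for Declaw's check:

```typescript
// Sketch: guarding the tool-output boundary of an agent loop.
// scanForInjection is a hypothetical stand-in for Declaw's check;
// a flagged tool result never reaches the model.

declare function scanForInjection(text: string): Promise<{ flagged: boolean }>;

async function guardedToolCall(callTool: () => Promise<string>): Promise<string> {
  const output = await callTool();
  const { flagged } = await scanForInjection(output);
  if (flagged) {
    // Whole-message redaction: AgentDojo scores a flagged message as
    // fully tainted, which is what trades utility for the 0% attack
    // success rate in the table above.
    return "[tool output withheld: injection detected]";
  }
  return output;
}
```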
3. Invisible-Text Attacks
Invisible-text attacks use Unicode encoding tricks to hide adversarial instructions from human reviewers and LLM interfaces while remaining fully legible to the model itself. Recent disclosures include Sourcegraph Amp Code and the Google Jules coding agent (both patched in 2025), plus documented research from AWS, Cisco Talos, Embrace the Red, and Keysight.
No public benchmark exists for this category. Declaw's coverage is evaluated against a curated corpus of 78 attack payloads drawn from the above disclosures plus 35 benign controls (CJK, Arabic, Hebrew, emoji, URLs, code). Overall F1 = 0.961 at production threshold.
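The attack classes in the table below map directly onto Unicode codepoint ranges, so the core of a detector is a handful of range checks. A minimal sketch (class names mirror the table; the shipping scanner layers context handling on top):

```typescript
// Sketch: codepoint-range checks for the attack classes below.
// Production detection is context-aware (e.g. ZWJ inside legitimate
// emoji sequences); this only shows the Unicode ranges involved.

const INVISIBLE_CLASSES: Record<string, RegExp> = {
  tagsBlock: /[\u{E0000}-\u{E007F}]/u, // Unicode Tags block
  zeroWidth: /[\u200B\u200C\u200D\uFEFF]/u, // ZWSP, ZWNJ, ZWJ, BOM
  bidiControls: /[\u202A-\u202E\u2066-\u2069]/u, // overrides + isolates
  privateUse: /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/u,
  softHyphen: /\u00AD/,
};

function scanInvisible(text: string): string[] {
  return Object.entries(INVISIBLE_CLASSES)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}

// Tag characters render as nothing but are plainly present at the
// codepoint level:
scanInvisible("hello\u{E0041}\u{E0042}"); // => ["tagsBlock"]
```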
Coverage by documented attack class
| Attack class | Coverage | Representative |
|---|---|---|
| Unicode Tags block (U+E0000–U+E007F) | 100% (20/20) | Sourcegraph Amp, Google Jules, AWS 2025 |
| Zero-width characters | 100% (20/20) | ZWSP, ZWNJ, ZWJ, BOM |
| Bidi control overrides | 100% (10/10) | Trojan-source-style RLO |
| Private Use Area | 100% (8/8) | Custom-glyph smuggling |
| Combined / in-the-wild payloads | 100% (10/10) | Embrace the Red "ASCII Smuggler" |
| Soft-hyphen / format controls | 75% (3/4) | Soft hyphen caught; U+FFFC missed |
| Unassigned codepoints | 67% (2/3) | — |
| Variation selectors | 0% (by design) | Legitimate emoji use |
Declaw covers 100% of the documented 2024–2026 in-the-wild invisible-text attack classes (Tags block, zero-width, bidi, PUA). Lakera Guard, ProtectAI LLM Guard, and AWS Bedrock Guardrails do not publish comparable coverage numbers for this category.
4. PII Detection
Evaluation uses the ai4privacy/pii-masking-300k validation set (1,000 samples), with entity-level overlap scoring on the aligned-type intersection (PERSON, EMAIL, PHONE, SSN, IP, CREDIT_CARD, LOCATION, DATE). Declaw targets US-format PII — US SSN (XXX-XX-XXXX), North American phone, plus format-deterministic global types (email, IP, credit card).
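The format-deterministic scope means these types reduce to strict patterns plus a checksum where one exists. A simplified sketch of the US-format detectors described above (illustrative patterns only; production detection adds validation and context rules):

```typescript
// Sketch: format-deterministic US-scope PII patterns.
// Patterns are simplified for illustration; production detection
// adds validation (e.g. SSN area-number rules) and context checks.

const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/, // US SSN, XXX-XX-XXXX
  email: /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/,
  ipv4: /\b(?:\d{1,3}\.){3}\d{1,3}\b/,
  phone: /(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b/, // North American
  creditCard: /\b(?:\d[ -]?){13,16}\b/, // candidate; confirm with Luhn
};

// Luhn checksum filters credit-card candidates to valid numbers.
function luhn(candidate: string): boolean {
  const digits = candidate.replace(/\D/g, "");
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) d = d > 4 ? d * 2 - 9 : d * 2; // double every 2nd digit
    sum += d;
  }
  return digits.length >= 13 && sum % 10 === 0;
}
```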
Strong categories (format-deterministic, Declaw's product scope)
| Type | Precision | Recall | F1 |
|---|---|---|---|
| Email | 0.957 | 0.997 | 0.977 |
| IP address | 0.972 | 1.000 | 0.986 |
| US SSN (US-format subset) | 1.000 | 1.000 | 1.000 |
For the types enterprise compliance teams most frequently audit (email, IP, and US SSN), Declaw's detection matches or exceeds leading PII-specific solutions. Declaw pairs detection with redaction and a full audit log for SOC2 / HIPAA workflows.
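For reference, the entity-level overlap scoring used above counts a predicted entity as correct when it overlaps a gold entity of the same aligned type. A minimal sketch of that matcher, assuming character-offset spans:

```typescript
// Sketch: entity-level overlap scoring on the aligned-type set.
// A prediction is a true positive if it overlaps any gold span of
// the same type; leftover predictions are false positives and
// unmatched gold spans are false negatives.

interface Span {
  type: string; // e.g. "EMAIL", "US_SSN"
  start: number; // character offsets
  end: number;
}

function overlaps(a: Span, b: Span): boolean {
  return a.type === b.type && a.start < b.end && b.start < a.end;
}

function entityScores(pred: Span[], gold: Span[]) {
  const tp = pred.filter((p) => gold.some((g) => overlaps(p, g))).length;
  const matchedGold = gold.filter((g) => pred.some((p) => overlaps(p, g))).length;
  const precision = pred.length ? tp / pred.length : 0;
  const recall = gold.length ? matchedGold / gold.length : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```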