Benchmarks
How Declaw measures up
Sandbox TTI is benchmarked using the computesdk/benchmarks methodology. Guardrails are benchmarked against public datasets (Deepset, Lakera Gandalf, jailbreak-classification, AgentDojo, PII-Masking-300k) and documented attack classes. All results are compared directly to leading industry solutions.
Sandbox Performance
TTI (time to interactive) measures the wall-clock time from the sandbox.create() API call to the first command executing inside the sandbox. All results are from the ComputeSDK public leaderboard: 100 iterations per provider, scored as 60% median / 25% p95 / 15% p99, multiplied by success rate. Benchmarks run daily via automated GitHub Actions.
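For concreteness, here is a minimal sketch of how that composite can be reproduced from raw TTI samples. The 60/25/15 weights and the success-rate multiplier come from the methodology above; the latencyToScore normalization is a placeholder assumption, since the exact ms-to-score mapping is defined in the computesdk/benchmarks repo.

```typescript
// Sketch: reproduce the leaderboard composite from raw TTI samples.
// The 60/25/15 weights and the success-rate multiplier follow the
// methodology above; latencyToScore is an ASSUMED placeholder, since
// the exact ms-to-score mapping is defined in computesdk/benchmarks.

function percentile(sorted: number[], p: number): number {
  // Nearest-rank percentile over an ascending-sorted sample.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function latencyToScore(ms: number): number {
  // Placeholder: lower latency maps to a higher score on [0, 100].
  return Math.max(0, 100 - ms / 10);
}

function compositeScore(ttisMs: number[], successRate: number): number {
  const sorted = [...ttisMs].sort((a, b) => a - b);
  const weighted =
    0.6 * latencyToScore(percentile(sorted, 50)) +
    0.25 * latencyToScore(percentile(sorted, 95)) +
    0.15 * latencyToScore(percentile(sorted, 99));
  return weighted * successRate; // successRate in [0, 1]
}
```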
Declaw ranks #1 across all three modes: sequential (99.6), staggered (99.6), and burst (97.4). Median TTI is 40 ms in sequential and staggered modes and 240 ms under a 100-sandbox concurrent burst, with a 100% success rate across all modes. Fourteen providers were benchmarked.
Full results
| Category | Benchmark | Declaw | Best public competitor | Delta |
|---|---|---|---|---|
| Sequential TTI | Composite score | 99.6 / 100 | Tensorlake: 98.0 | +1.6 |
| Sequential TTI | Median | 40 ms | Tensorlake: 190 ms | −150 ms |
| Sequential TTI | p95 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Sequential TTI | p99 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Staggered TTI (200ms apart) | Composite score | 99.6 / 100 | Tensorlake: 98.0 | +1.6 |
| Staggered TTI | Median | 40 ms | Tensorlake: 190 ms | −150 ms |
| Staggered TTI | p95 | 40 ms | Tensorlake: 210 ms | −170 ms |
| Staggered TTI | p99 | 40 ms | Tensorlake: 220 ms | −180 ms |
| Burst TTI (100 simultaneous) | Composite score | 97.4 / 100 | Tensorlake: 95.1 | +2.3 |
| Burst TTI | Median | 240 ms | Tensorlake: 430 ms | −190 ms |
| Burst TTI | p95 | 290 ms | Tensorlake: 560 ms | −270 ms |
| Burst TTI | p99 | 290 ms | Tensorlake: 580 ms | −290 ms |
| Burst TTI | Wall clock (100 sandboxes) | 0.31 s | Tensorlake: 0.60 s | −0.29 s |
Source: ComputeSDK Sandbox Leaderboard, May 9, 2026. 100 iterations per provider, automated daily runs via GitHub Actions. Methodology: github.com/computesdk/benchmarks.
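The burst rows above correspond to 100 creates issued simultaneously. A measurement harness for that mode might look like the sketch below; createSandbox and runCommand are hypothetical stand-ins for whatever provider SDK is under test.

```typescript
// Sketch: burst-mode TTI measurement (100 simultaneous creates).
// createSandbox / runCommand are hypothetical stand-ins for the
// provider SDK under test; TTI is wall-clock from the create call
// to the first command executing inside the sandbox.

declare function createSandbox(): Promise<{
  runCommand(cmd: string): Promise<void>;
}>;

async function measureOneTti(): Promise<number> {
  const start = performance.now();
  const sandbox = await createSandbox();
  await sandbox.runCommand("echo ready"); // first command = interactive
  return performance.now() - start;
}

async function burst(n = 100): Promise<number[]> {
  // All n creates are issued in the same tick, matching the
  // "100 simultaneous" burst mode above; failures count against
  // the success rate rather than the latency sample.
  const settled = await Promise.allSettled(
    Array.from({ length: n }, measureOneTti),
  );
  return settled
    .filter((r): r is PromiseFulfilledResult<number> => r.status === "fulfilled")
    .map((r) => r.value);
}
```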
Guardrails
1. Prompt Injection
Declaw's prompt-injection detection was evaluated on 2,965 prompts across three public datasets: the canonical Deepset dataset, Lakera's Gandalf user-attack corpus, and the balanced jailbreak-classification set. Lakera's PINT benchmark is proprietary; these three datasets are the closest public equivalent.
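Each per-dataset row below reduces to a standard confusion-matrix tally over that dataset. A minimal sketch, assuming a hypothetical detect(prompt) call that returns true when Declaw flags an injection at production settings:

```typescript
// Sketch: per-dataset evaluation as a confusion-matrix tally.
// detect() is a hypothetical stand-in for Declaw's prompt-injection
// check at production settings.

declare function detect(prompt: string): Promise<boolean>;

interface Sample {
  text: string;
  isInjection: boolean; // dataset label
}

async function evaluate(samples: Sample[]) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const s of samples) {
    const flagged = await detect(s.text);
    if (flagged && s.isInjection) tp++;
    else if (flagged && !s.isInjection) fp++;
    else if (!flagged && !s.isInjection) tn++;
    else fn++;
  }
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return {
    accuracy: (tp + tn) / samples.length,
    precision,
    recall,
    f1: (2 * precision * recall) / (precision + recall),
  };
}
```

Plugging the combined row's precision (0.9963) and recall (0.9803) into F1 = 2PR / (P + R) reproduces the 0.988 in the table below.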
Results at production settings
| Dataset | n | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| deepset/prompt-injections | 662 | 95.92% | 99.58% | 90.08% | 0.946 |
| Lakera gandalf_ignore_instructions | 1,000 | 99.90% | 100.00% | 99.90% | 0.9995 |
| jackhhao/jailbreak-classification | 1,306 | 98.70% | 99.09% | 98.35% | 0.987 |
| Combined | 2,965 | 98.48% | 99.63% | 98.03% | 0.988 |
Industry comparison — Deepset accuracy
| System | Accuracy | F1 |
|---|---|---|
| Declaw | 95.92% | 0.946 |
| Lakera Guard | 87.91% | 0.823 |
| Vigil | 77.49% | 0.615 |
| Azure Prompt Shield | 77.28% | 0.560 |
| ProtectAI LLM Guard | 76.00% | 0.579 |
| Calypso AI Moderator | 73.56% | 0.572 |
| Rebuff | 72.96% | 0.689 |
| LangKit Similarity | 70.20% | 0.588 |
| LangKit Canary | 61.02% | 0.044 |
Competitor numbers from Palit & Woods (2025), arXiv:2505.13028, Table 5. Declaw evaluated 2026-04-20 on the same Deepset dataset (662 prompts, train+test splits) at production settings.
2. Agentic Indirect Injection (AgentDojo)
The numbers above measure detection on static datasets. The harder question is whether a guardrail actually stops an attack inside a live agent loop, where injected instructions hide inside tool outputs (emails, Slack messages, documents) and the LLM must be prevented from acting on them. AgentDojo (ETH Zürich SPY Lab, NeurIPS 2024) is the canonical benchmark for this: it runs real tool-using agents through curated user tasks with adversarial instructions planted in returned tool data.
Declaw was integrated into AgentDojo's pipeline and evaluated on the full slack suite (105 trials, gpt-4o-mini) under the important_instructions attack — AgentDojo's strongest attack class.
| Variant | Attack success rate | Injections blocked | Task utility |
|---|---|---|---|
| Baseline (no defense) | 62.9% | 39 / 105 | 51.4% |
| Declaw | 0.0% | 105 / 105 | 17.1% |
Declaw blocks 100% of injections (0 of 105 succeed) versus a 62.9% attack success rate with no defense: a 62.9-percentage-point reduction on a benchmark explicitly designed to probe live agentic pipelines.
On the utility tradeoff: AgentDojo's evaluator treats an entire tool-output message as tainted once Declaw flags an injection. Because the slack suite intentionally concentrates attacks inside the same messages that carry legitimate data (the attacker's goal is to ride real content), blocking the injection also redacts some useful context, which lowers measured utility. This is a property of the benchmark (adversarial-maximal placement plus whole-message redaction), not of an ordinary workload. We consider the attack-success-rate reduction the primary signal.
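In pipeline terms, the integration point is the tool-output boundary: every tool result is scanned before it reaches the model, and a flagged message is withheld wholesale, which is exactly the whole-message behavior described above. A minimal sketch, with scanForInjection as a hypothetical stand-in for Declaw's check:

```typescript
// Sketch: guarding the tool-output boundary of an agent loop.
// scanForInjection is a hypothetical stand-in for Declaw's check;
// a flagged tool result never reaches the model.

declare function scanForInjection(text: string): Promise<{ flagged: boolean }>;

async function guardedToolCall(callTool: () => Promise<string>): Promise<string> {
  const output = await callTool();
  const { flagged } = await scanForInjection(output);
  if (flagged) {
    // Whole-message redaction: AgentDojo scores a flagged message as
    // fully tainted, which is what trades utility for the 0% attack
    // success rate in the table above.
    return "[tool output withheld: injection detected]";
  }
  return output;
}
```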
3. Invisible-Text Attacks
Invisible-text attacks use Unicode encoding tricks to hide adversarial instructions from human reviewers and LLM interfaces while remaining fully legible to the model itself. Recent disclosures include Sourcegraph Amp Code and the Google Jules coding agent (both patched in 2025), plus documented research from AWS, Cisco Talos, Embrace the Red, and Keysight.
No public benchmark exists for this category. Declaw's coverage is evaluated against a curated corpus of 78 attack payloads drawn from the above disclosures plus 35 benign controls (CJK, Arabic, Hebrew, emoji, URLs, code). Overall F1 = 0.961 at production threshold.
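The attack classes in the table below map directly onto Unicode codepoint ranges, so the core of a detector is a handful of range checks. A minimal sketch (class names mirror the table; the shipping scanner layers context handling on top):

```typescript
// Sketch: codepoint-range checks for the attack classes below.
// Production detection is context-aware (e.g. ZWJ inside legitimate
// emoji sequences); this only shows the Unicode ranges involved.

const INVISIBLE_CLASSES: Record<string, RegExp> = {
  tagsBlock: /[\u{E0000}-\u{E007F}]/u, // Unicode Tags block
  zeroWidth: /[\u200B\u200C\u200D\uFEFF]/u, // ZWSP, ZWNJ, ZWJ, BOM
  bidiControls: /[\u202A-\u202E\u2066-\u2069]/u, // overrides + isolates
  privateUse: /[\uE000-\uF8FF\u{F0000}-\u{FFFFD}\u{100000}-\u{10FFFD}]/u,
  softHyphen: /\u00AD/,
};

function scanInvisible(text: string): string[] {
  return Object.entries(INVISIBLE_CLASSES)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}

// Tag characters render as nothing but are plainly present at the
// codepoint level:
scanInvisible("hello\u{E0041}\u{E0042}"); // => ["tagsBlock"]
```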
Coverage by documented attack class
| Attack class | Coverage | Representative |
|---|---|---|
| Unicode Tags block (U+E0000–U+E007F) | 100% (20/20) | Sourcegraph Amp, Google Jules, AWS 2025 |
| Zero-width characters | 100% (20/20) | ZWSP, ZWNJ, ZWJ, BOM |
| Bidi control overrides | 100% (10/10) | Trojan-source-style RLO |
| Private Use Area | 100% (8/8) | Custom-glyph smuggling |
| Combined / in-the-wild payloads | 100% (10/10) | Embrace the Red "ASCII Smuggler" |
| Soft-hyphen / format controls | 75% (3/4) | Soft hyphen caught; U+FFFC missed |
| Unassigned codepoints | 67% (2/3) | — |
| Variation selectors | 0% (by design) | Legitimate emoji use |
Declaw covers 100% of the documented 2024–2026 in-the-wild invisible-text attack classes (Tags block, zero-width, bidi, PUA). Lakera Guard, ProtectAI LLM Guard, and AWS Bedrock Guardrails do not publish comparable coverage numbers for this category.
4. PII Detection
Evaluation uses the ai4privacy/pii-masking-300k validation set (1,000 samples), with entity-level overlap scoring on the aligned-type intersection (PERSON, EMAIL, PHONE, SSN, IP, CREDIT_CARD, LOCATION, DATE). Declaw targets US-format PII — US SSN (XXX-XX-XXXX), North American phone, plus format-deterministic global types (email, IP, credit card).
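The format-deterministic scope means these types reduce to strict patterns plus a checksum where one exists. A simplified sketch of the US-format detectors described above (illustrative patterns only; production detection adds validation and context rules):

```typescript
// Sketch: format-deterministic US-scope PII patterns.
// Patterns are simplified for illustration; production detection
// adds validation (e.g. SSN area-number rules) and context checks.

const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/, // US SSN, XXX-XX-XXXX
  email: /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/,
  ipv4: /\b(?:\d{1,3}\.){3}\d{1,3}\b/,
  phone: /(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b/, // North American
  creditCard: /\b(?:\d[ -]?){13,16}\b/, // candidate; confirm with Luhn
};

// Luhn checksum filters credit-card candidates to valid numbers.
function luhn(candidate: string): boolean {
  const digits = candidate.replace(/\D/g, "");
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) d = d > 4 ? d * 2 - 9 : d * 2; // double every 2nd digit
    sum += d;
  }
  return digits.length >= 13 && sum % 10 === 0;
}
```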
Strong categories (format-deterministic, Declaw's product scope)
| Type | Precision | Recall | F1 |
|---|---|---|---|
| Email | 0.957 | 0.997 | 0.977 |
| IP address | 0.972 | 1.000 | 0.986 |
| US SSN (US-format subset) | 1.000 | 1.000 | 1.000 |
For the types enterprise compliance teams most frequently audit (email, IP, and US SSN), Declaw's detection matches or exceeds leading PII-specific solutions. Declaw pairs detection with redaction and a full audit log for SOC2 / HIPAA workflows.
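For reference, the entity-level overlap scoring used above counts a predicted entity as correct when it overlaps a gold entity of the same aligned type. A minimal sketch of that matcher, assuming character-offset spans:

```typescript
// Sketch: entity-level overlap scoring on the aligned-type set.
// A prediction is a true positive if it overlaps any gold span of
// the same type; leftover predictions are false positives and
// unmatched gold spans are false negatives.

interface Span {
  type: string; // e.g. "EMAIL", "US_SSN"
  start: number; // character offsets
  end: number;
}

function overlaps(a: Span, b: Span): boolean {
  return a.type === b.type && a.start < b.end && b.start < a.end;
}

function entityScores(pred: Span[], gold: Span[]) {
  const tp = pred.filter((p) => gold.some((g) => overlaps(p, g))).length;
  const matchedGold = gold.filter((g) => pred.some((p) => overlaps(p, g))).length;
  const precision = pred.length ? tp / pred.length : 0;
  const recall = gold.length ? matchedGold / gold.length : 0;
  const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```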