Open Benchmarks

Real numbers.
No fabrication.

Two experiments: context quality across long agent sessions, and tool schema token savings. Both reproducible.

Experiment 1 · May 2026
Context quality across long agent sessions
How well does each system preserve facts across a 50-turn conversation? Probed with 5 factual questions per session after compression. Scored on Factory.ai's 6-dimension framework (0–5 scale).
System Quality score Compression Approach
Promptolian 4.26 / 5 21.8% Extractive · KV-sandwich
Anthropic built-in 3.44 / 5 98.7% LLM summarization
OpenAI built-in 3.35 / 5 99.3% LLM summarization
4.26
out of 5
Quality score
+0.82
vs Anthropic
Quality gap
31%
fewer failures
vs Anthropic built-in

Each session was probed with 5 factual questions after compression. A judge model scored every answer across six dimensions — each on a 0–5 scale. Here's what each one means:

DimensionWhat it measuresScore
Accuracy Did the agent recall exact facts — numbers, names, URLs — from earlier in the conversation? 4.30
Context Did the answer make sense given the conversation history, or did it feel disconnected? 4.26
Artifact Were code snippets, config values, and structured data preserved intact — not paraphrased or lost? 4.20
Completeness Did the answer cover everything asked, or were parts missing because the context was compressed away? 4.26
Continuity Did the agent remember decisions and facts from early turns, not just the most recent messages? 4.30
Instruction Did the agent still follow the original system prompt constraints after 50 turns of compression? 4.20

Artifact is the hardest dimension for LLM summarizers — they scored 2.19–2.45/5 on this. Promptolian's rule-based encoding preserves exact values verbatim.

Assumptions & methodology

Experiment 2 · May 2026
Tool schema token savings via prompt cache
Every API call re-sends the full tool schema. The proxy injects Anthropic cache_control blocks automatically — cached tokens are billed at 10%.
~90%
session avg savings
Tool schema tokens
$24
saved / month
at 500 calls/day · 5 tools
<10ms
proxy overhead
per request

Assumptions

Why high compression hurts overall: see the U-curve — total cost vs compression rate →