Benchmarks — Promptolian

Experiment 1 · May 2026

Context quality across long agent sessions

How well does each system preserve facts across a 50-turn conversation? Probed with 5 factual questions per session after compression. Scored on Factory.ai's 6-dimension framework (0–5 scale).

System	Quality score	Compression	Approach
Promptolian	4.26 / 5	21.8%	Extractive · KV-sandwich
Anthropic built-in	3.44 / 5	98.7%	LLM summarization
OpenAI built-in	3.35 / 5	99.3%	LLM summarization

4.26

out of 5

Quality score

+0.82

vs Anthropic

Quality gap

31%

fewer failures

vs Anthropic built-in

6-Dimension Breakdown (Promptolian)

Each session was probed with 5 factual questions after compression. A judge model scored every answer across six dimensions — each on a 0–5 scale. Here's what each one means:

Dimension	What it measures	Score
Accuracy	Did the agent recall exact facts — numbers, names, URLs — from earlier in the conversation?	4.30
Context	Did the answer make sense given the conversation history, or did it feel disconnected?	4.26
Artifact	Were code snippets, config values, and structured data preserved intact — not paraphrased or lost?	4.20
Completeness	Did the answer cover everything asked, or were parts missing because the context was compressed away?	4.26
Continuity	Did the agent remember decisions and facts from early turns, not just the most recent messages?	4.30
Instruction	Did the agent still follow the original system prompt constraints after 50 turns of compression?	4.20

Artifact is the hardest dimension for LLM summarizers — they scored 2.19–2.45/5 on this. Promptolian's rule-based encoding preserves exact values verbatim.

Assumptions & methodology

25 sessions × 5 task domains (coding, deployment, data, research, ops)
50 turns per session, then 5 factual probe questions per session
Answer model: llama-3.1-8b-instant · Judge model: gpt-oss-120b via OpenRouter
Scoring: Factory.ai 6-dimension framework (Accuracy, Context, Artifact, Completeness, Continuity, Instruction)
Anthropic / OpenAI baselines: Factory.ai May 2026 study — same scoring methodology, independent test sessions
Validation run (second 25 sessions after entity-encoding fix): 4.19 / 5 — range 4.19–4.26 across two runs
Fact-loss rate = 1 − quality / 5 → Promptolian 14.8% · Anthropic 31.2% · OpenAI 33.0%

Experiment 2 · May 2026

Tool schema token savings via prompt cache

Every API call re-sends the full tool schema. The proxy injects Anthropic cache_control blocks automatically — cached tokens are billed at 10%.

~90%

session avg savings

Tool schema tokens

$24

saved / month

at 500 calls/day · 5 tools

<10ms

proxy overhead

per request

Assumptions

5 tools · ~120 tokens each = 600 tool tokens per call
500 calls/day · 30 days = 9M tool tokens/month → $27.00 without caching
With Anthropic prompt cache (10% on hits, 5-min TTL) → $2.70
Saving: $24.30/month at Claude Sonnet 4 pricing ($3/MTok input)
Cache hit rate assumes tool schema unchanged across session turns — typical for agent workloads

Why high compression hurts overall: see the U-curve — total cost vs compression rate →

Real numbers.No fabrication.

Assumptions & methodology

Assumptions

Real numbers.
No fabrication.