Experiment 1 · May 2026
Context quality across long agent sessions
How well does each system preserve facts across a 50-turn conversation? Each session was probed with 5 factual questions after compression, and the answers were scored on Factory.ai's 6-dimension framework (0–5 scale).
| System | Quality score | Compression | Approach |
|---|---|---|---|
| Promptolian | 4.26 / 5 | 21.8% | Extractive · KV-sandwich |
| Anthropic built-in | 3.44 / 5 | 98.7% | LLM summarization |
| OpenAI built-in | 3.35 / 5 | 99.3% | LLM summarization |
Quality score: 4.26 out of 5
Quality gap: +0.82 vs Anthropic
31% fewer failures vs Anthropic built-in
6-Dimension Breakdown (Promptolian)
Each session was probed with 5 factual questions after compression. A judge model scored every answer across six dimensions — each on a 0–5 scale. Here's what each one means:
| Dimension | What it measures | Score |
|---|---|---|
| Accuracy | Did the agent recall exact facts (numbers, names, URLs) from earlier in the conversation? | 4.30 |
| Context | Did the answer make sense given the conversation history, or did it feel disconnected? | 4.26 |
| Artifact | Were code snippets, config values, and structured data preserved intact, not paraphrased or lost? | 4.20 |
| Completeness | Did the answer cover everything asked, or were parts missing because the context was compressed away? | 4.26 |
| Continuity | Did the agent remember decisions and facts from early turns, not just the most recent messages? | 4.30 |
| Instruction | Did the agent still follow the original system prompt constraints after 50 turns of compression? | 4.20 |
Artifact is the hardest dimension for LLM summarizers — they scored 2.19–2.45/5 on this. Promptolian's rule-based encoding preserves exact values verbatim.
Assumptions & methodology
- 25 sessions × 5 task domains (coding, deployment, data, research, ops)
- 50 turns per session, then 5 factual probe questions per session
- Answer model: llama-3.1-8b-instant · Judge model: gpt-oss-120b via OpenRouter
- Scoring: Factory.ai 6-dimension framework (Accuracy, Context, Artifact, Completeness, Continuity, Instruction)
- Anthropic / OpenAI baselines: Factory.ai May 2026 study — same scoring methodology, independent test sessions
- Validation run (second 25 sessions after entity-encoding fix): 4.19 / 5 — range 4.19–4.26 across two runs
- Fact-loss rate = 1 − quality / 5 → Promptolian 14.8% · Anthropic 31.2% · OpenAI 33.0%
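The fact-loss arithmetic in the last bullet can be reproduced with a short sketch. The function and variable names here are illustrative, not part of the benchmark harness; only the formula (1 − quality / 5) and the three quality scores come from the figures above.

```python
# Quality scores (0-5 scale) as reported in the results table.
QUALITY_SCORES = {
    "Promptolian": 4.26,
    "Anthropic built-in": 3.44,
    "OpenAI built-in": 3.35,
}

MAX_SCORE = 5.0  # top of the 0-5 scoring scale


def fact_loss_rate(quality: float, max_score: float = MAX_SCORE) -> float:
    """Return 1 - quality / max_score, the fact-loss rate as a fraction."""
    return 1.0 - quality / max_score


def fact_loss_table(scores: dict[str, float]) -> dict[str, float]:
    """Map each system to its fact-loss rate in percent, rounded to 0.1."""
    return {name: round(fact_loss_rate(q) * 100, 1) for name, q in scores.items()}


if __name__ == "__main__":
    for system, pct in fact_loss_table(QUALITY_SCORES).items():
        print(f"{system}: {pct}%")
```

Because the rate is a linear rescaling of the quality score, it carries the same information as the score itself, expressed as the share of the scale that was lost.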
Experiment 2 · May 2026
Tool schema token savings via prompt cache
Every API call re-sends the full tool schema. The proxy automatically injects Anthropic cache_control blocks, so schema tokens served from the cache are billed at 10% of the normal input price.
Tool schema tokens: ~90% session avg savings
$24 saved / month at 500 calls/day · 5 tools
<10ms proxy overhead per request
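As a rough illustration of what injecting a cache_control block can look like, the sketch below marks the last tool definition with an ephemeral cache breakpoint, which is how Anthropic's prompt caching is asked to cache the tool definitions up to that point. This is an assumption about the general technique, not the proxy's actual code: the function name `add_tool_cache_breakpoint` and the two example tools are hypothetical.

```python
import copy
from typing import Any


def add_tool_cache_breakpoint(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of `tools` with a cache breakpoint on the last definition.

    Assumption: one breakpoint on the final tool is enough to cache the
    whole tool list. The caller's list is never mutated.
    """
    if not tools:
        return []
    patched = copy.deepcopy(tools)
    patched[-1]["cache_control"] = {"type": "ephemeral"}
    return patched


# Hypothetical tool schemas, for illustration only.
EXAMPLE_TOOLS: list[dict[str, Any]] = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

if __name__ == "__main__":
    for tool in add_tool_cache_breakpoint(EXAMPLE_TOOLS):
        print(tool["name"], tool.get("cache_control"))
```

Copying the list before patching keeps the caller's request body untouched, which matters if the same tool list is reused for requests that should not be cached.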
Assumptions
- 5 tools · ~120 tokens each = 600 tool tokens per call
- 600 tokens × 500 calls/day × 30 days = 9M tool tokens/month → $27.00 without caching
- With Anthropic prompt cache (10% on hits, 5-min TTL) → $2.70
- Saving: $24.30/month at Claude Sonnet 4 pricing ($3/MTok input)
- Cache hit rate assumes tool schema unchanged across session turns — typical for agent workloads
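The cost figures in these assumptions can be checked with a small calculation. This is a sketch under the stated assumptions only: the constant and function names are illustrative, the `cache_hit_rate` parameter is added here to show sensitivity, and cache-write surcharges are ignored, as they are in the figures above.

```python
TOOLS = 5                    # tools sent with every call
TOKENS_PER_TOOL = 120        # approximate schema size per tool
CALLS_PER_DAY = 500
DAYS_PER_MONTH = 30
INPUT_PRICE_PER_MTOK = 3.00  # USD per million input tokens (Claude Sonnet 4)
CACHE_HIT_MULTIPLIER = 0.10  # cached tokens billed at 10% of the input price


def monthly_tool_tokens() -> int:
    """Total tool schema tokens sent per month."""
    return TOOLS * TOKENS_PER_TOOL * CALLS_PER_DAY * DAYS_PER_MONTH


def monthly_tool_schema_cost(cache_hit_rate: float = 0.0) -> float:
    """Monthly USD cost of tool schema tokens at a given cache hit rate (0-1)."""
    full_price = monthly_tool_tokens() / 1_000_000 * INPUT_PRICE_PER_MTOK
    # Misses pay full price; hits pay the discounted cache-read price.
    blended = (1.0 - cache_hit_rate) + cache_hit_rate * CACHE_HIT_MULTIPLIER
    return full_price * blended


if __name__ == "__main__":
    uncached = monthly_tool_schema_cost(0.0)
    cached = monthly_tool_schema_cost(1.0)
    print(f"tokens/month: {monthly_tool_tokens():,}")
    print(f"without caching: ${uncached:.2f}")
    print(f"with caching:    ${cached:.2f}")
    print(f"saving:          ${uncached - cached:.2f}")
```

The $24.30 saving corresponds to a 100% hit rate. Passing a lower `cache_hit_rate` shows how the saving shrinks when the tool schema changes between turns.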
Why high compression hurts overall: see the U-curve of total cost vs compression rate.