When a local LLM (8B–14B parameters, 8K–16K context) acts as a tool-calling agent, it exhausts its context window in 3–5 rounds. Each round appends both a model response and a tool result. Quality degrades before context fills — the failure mode is confabulation, not a crash. The model starts generating plausible-sounding responses that ignore the tool results, because the relevant data has been pushed too far back in the context.
Measured forgetting treats context as a finite resource to be managed during inference, not before it. The method monitors context utilisation in real time, partitions messages into three zones, and invokes the model itself to compress the compressible zone, all within the same generation loop.
Three-zone partitioning
Zone 1 (Pinned): System prompt + user question. Never compressed. Never discarded.
Zone 2 (Compressible): Intermediate tool exchanges. Compressed into a summary.
Zone 3 (Recent): Last 2 messages. Preserved verbatim — the model needs fresh context to continue.
Compaction fires when context utilisation crosses 75% of the usable window. The compressible zone is fed to the model in a dedicated summarisation pass (low temperature, 256-token cap). The result replaces the entire compressible zone, restoring the context budget.
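The partitioning and trigger can be sketched in a few lines of Rust. This is a minimal illustration, not the project's code: the `Message` type, the `pinned_len` argument, the `summarise` callback, and the choice to fire at or above exactly 75% are all assumptions made for the example. Token counts are assumed to come from the caller's tokenizer.

```rust
// Hypothetical message type; the real implementation may differ.
#[derive(Debug, Clone, PartialEq)]
enum Role { System, User, Assistant, Tool }

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    text: String,
    tokens: usize, // measured by the caller's tokenizer (assumption)
}

struct Zones<'a> {
    pinned: &'a [Message],
    compressible: &'a [Message],
    recent: &'a [Message],
}

const COMPACTION_THRESHOLD: f64 = 0.75; // 75% of the usable window
const RECENT_KEEP: usize = 2;           // last 2 messages stay verbatim

// Splits the history into the three zones. Assumes the first `pinned_len`
// messages are the system prompt and the user question.
fn partition(messages: &[Message], pinned_len: usize) -> Zones<'_> {
    let pinned_end = pinned_len.min(messages.len());
    let recent_start = messages.len().saturating_sub(RECENT_KEEP).max(pinned_end);
    Zones {
        pinned: &messages[..pinned_end],
        compressible: &messages[pinned_end..recent_start],
        recent: &messages[recent_start..],
    }
}

// True once utilisation reaches the threshold (">=" is an assumption).
fn should_compact(messages: &[Message], usable_window: usize) -> bool {
    let used: usize = messages.iter().map(|m| m.tokens).sum();
    used as f64 >= COMPACTION_THRESHOLD * usable_window as f64
}

// Replaces the whole compressible zone with one summary message.
// `summarise` stands in for the low-temperature, 256-token model pass.
fn compact(
    messages: &[Message],
    pinned_len: usize,
    summarise: impl Fn(&[Message]) -> Message,
) -> Vec<Message> {
    let zones = partition(messages, pinned_len);
    if zones.compressible.is_empty() {
        return messages.to_vec();
    }
    let mut out = zones.pinned.to_vec();
    out.push(summarise(zones.compressible));
    out.extend_from_slice(zones.recent);
    out
}
```

In a generation loop, a caller would check `should_compact` after each tool round and, when it returns true, swap the history for the result of `compact`.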
Version 2: Dimension-aware compression
V1 applies uniform compression: all messages are treated equally by the summariser. This works but loses information non-uniformly — causal chains and high-value data points get compressed at the same rate as error messages.
V2 adds three instruments:
1. Influence Equation
Each message is scored across six dimensions: data, causal, task-reference, entity, temporal, structural. The key insight: a message that converges on 4 dimensions is not 4x more valuable than a single-dimension message — it is 8x more valuable, because multi-dimensional convergence is harder to reconstruct from a summary.
The resonance multiplier Φ(B) = B^1.5 was selected empirically from three candidate exponents (γ = 1.0 failed to separate tiers, γ = 2.0 over-promoted). TF-IDF scoring and fixed-tier allocation were also tried and abandoned. We report this honestly rather than presenting it as principled. See the paper for the full failure analysis.
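To make the arithmetic concrete, the sketch below computes the multiplier for a given exponent. It assumes B is the number of dimensions on which a message registers; the name `breadth` and the activation `cutoff` are hypothetical, and how the multiplier is combined with the per-dimension scores is not shown because the text does not specify it.

```rust
// The six scoring dimensions named in the text, in this order:
// data, causal, task-reference, entity, temporal, structural.
const DIMENSIONS: usize = 6;

type DimensionScores = [f64; DIMENSIONS];

// B: how many dimensions a message registers on.
// The cutoff is a hypothetical parameter, not taken from the paper.
fn breadth(scores: &DimensionScores, cutoff: f64) -> usize {
    scores.iter().filter(|&&s| s > cutoff).count()
}

// Resonance multiplier: B raised to the exponent gamma.
fn resonance(breadth: usize, gamma: f64) -> f64 {
    (breadth as f64).powf(gamma)
}
```

With the exponent 1.5, a message registering on 4 dimensions gets 4^1.5 = 8 and a single-dimension message gets 1, which is the 8x figure quoted above. The rejected candidates give 4 (exponent 1.0) and 16 (exponent 2.0) for the same message.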
2. Trace Topology
Messages form causal chains: tool call → tool result → interpretation. These are binding units — they get preserved or compressed together, never split.
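One way to picture binding units is as index ranges over the message list. The sketch below is an illustration under stated assumptions: the `Kind` labels are hypothetical and assumed to be known for each message, and only strictly adjacent call, result, interpretation sequences are grouped.

```rust
use std::ops::Range;

// Hypothetical per-message label; assumed to be available upstream.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { ToolCall, ToolResult, Interpretation, Other }

// Groups message indices into binding units. A tool call absorbs an
// immediately following result and then an immediately following
// interpretation; every other message forms a unit of its own.
fn binding_units(kinds: &[Kind]) -> Vec<Range<usize>> {
    let mut units = Vec::new();
    let mut i = 0;
    while i < kinds.len() {
        let start = i;
        i += 1;
        if kinds[start] == Kind::ToolCall
            && i < kinds.len()
            && kinds[i] == Kind::ToolResult
        {
            i += 1;
            if i < kinds.len() && kinds[i] == Kind::Interpretation {
                i += 1;
            }
        }
        units.push(start..i);
    }
    units
}
```

A compressor working on these ranges would keep or summarise each range as a whole, so a tool result is never separated from the call that produced it.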
3. Problem Taxonomy
Results
18 synthetic scenarios (6 problem classes × 3 complexity levels), 47 probes per model, 7 models from 5 independent families. Scoring: 0–3 via keyword matching (deterministic, reproducible, biased toward factual recall). The same scorer applies to all conditions, so any bias cancels in the V2−V1 comparison.
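For readers unfamiliar with keyword scoring, a toy scorer is sketched below. The mapping from keyword hits to the 0–3 scale is an assumption (a simple proportional floor); the benchmark's actual rule may differ, and the function name and example keywords are invented for illustration.

```rust
// Scores an answer from 0 to 3 by the share of expected keywords it
// contains, ignoring case. The proportional mapping is hypothetical.
fn keyword_score(answer: &str, expected: &[&str]) -> u8 {
    if expected.is_empty() {
        return 0;
    }
    let haystack = answer.to_lowercase();
    let hits = expected
        .iter()
        .filter(|k| haystack.contains(&k.to_lowercase()))
        .count();
    ((3 * hits) / expected.len()) as u8
}
```

A scorer of this kind is deterministic, but an answer that spells out "forty-two" when the expected keyword is "42" earns nothing for that keyword even though it is correct.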
The largest gains appear on the weakest summarisers. Qwen3 8B (a thinking-mode model that spends tokens on <think> blocks) gains +0.46. Models already near ceiling gain +0.15. Directional evidence suggests an inverse capability relationship (r = −0.77, p = 0.043, n=7), though the sample is too small for robust correlation claims.
KV cache persistence
Measured forgetting reduces what needs to be in context. KV cache persistence reduces the cost of what remains. In a multi-round tool-calling loop, each round's prompt shares a long common prefix with the previous round. A session worker on a dedicated thread persists the llama.cpp KV cache across rounds, finding the longest common prefix and decoding only the new tokens.
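The prefix-reuse step reduces to a longest-common-prefix computation over token ids. The sketch below shows only that planning step; the `DecodePlan` type and function names are hypothetical, and trimming and feeding the actual llama.cpp cache is left to the runtime. Token ids are `i32` here, matching llama.cpp's `llama_token`.

```rust
// Number of leading tokens shared by the cached sequence and the new prompt.
fn common_prefix_len(cached: &[i32], prompt: &[i32]) -> usize {
    cached
        .iter()
        .zip(prompt)
        .take_while(|(a, b)| a == b)
        .count()
}

// What one round needs to do: keep `keep` cached positions and decode
// only the remaining prompt tokens. Hypothetical type for illustration.
struct DecodePlan<'a> {
    keep: usize,
    decode: &'a [i32],
}

fn plan_round<'a>(cached: &[i32], prompt: &'a [i32]) -> DecodePlan<'a> {
    let keep = common_prefix_len(cached, prompt);
    DecodePlan { keep, decode: &prompt[keep..] }
}
```

When a round only appends to the previous prompt, `keep` equals the full cached length and only the new tokens are decoded. After a compaction, the prompt diverges where the summary begins, so `keep` shrinks to the shared prefix before that point.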
Together, the two methods form a virtuous cycle: the model investigates while KV persistence avoids redundant decoding; when context fills, measured forgetting compresses the middle zone; the KV cache is partially invalidated, but the pinned prefix stays cached; and the model continues with a fresh budget.
Production deployment
Both techniques run in production in B.app, a Tauri desktop application using Qwen3 8B (Q4_K_M, 4.9 GB) on Apple M4 with 16K context and Metal acceleration via llama.cpp. The application performs multi-round tool-calling investigations — querying APIs, reading files, executing code — with measured forgetting enabling 12+ tool rounds on a 16K window that would otherwise fill in 3–4.
Limitations
- Summarisation ceiling: An 8B model's summary is only as good as its summarisation ability. Information loss is inherent.
- Φ blind spot: The super-linear resonance multiplier disadvantages messages informative on a single dimension. Aggregation tasks suffer (V2: 2.84 vs V1: 2.90).
- Causal fidelity gap: Chain detection works for structural dependencies but misses semantic dependencies spanning non-adjacent messages.
- Synthetic benchmark: 18 scenarios with known ground-truth facts. Real conversations may differ. Naturalistic evaluation is pending.
- Keyword scoring: Rewards factual recall over nuanced understanding. Will undercount correct paraphrases using synonyms.
Reproduce it
The benchmark harness, all 7 model results (JSON), and the full measured forgetting algorithm are open-source.
# Requires Ollama with models pulled locally
cargo run --release --bin bench-forgetting -- --model mistral:7b
# JSON output
cargo run --release --bin bench-forgetting -- --model mistral:7b --json > results.json
# All default models
cargo run --release --bin bench-forgetting -- --all
The repository includes the paper source (LaTeX), benchmark results for all 7 models, and the complete Rust implementation of both V1 and V2 compression.
Help us get this on arXiv
We're first-time submitters and need an endorsement from someone who has published in cs.CL or cs.AI on arXiv. If you've read the paper and find it credible, endorsing takes about two minutes.
Endorse on arXiv →
Citation
@article{asidi2026measured,
title={Measured Forgetting: Context Management for Agentic Local LLMs},
author={Asidi, Barak Achillah},
journal={arXiv preprint},
year={2026}
}