7 / 7: V2 outperforms V1 on every model tested (sign test, p = 0.008)

When a local LLM (8B–14B parameters, 8K–16K context) acts as a tool-calling agent, it exhausts its context window in 3–5 rounds. Each round appends both a model response and a tool result. Quality degrades before context fills — the failure mode is confabulation, not a crash. The model starts generating plausible-sounding responses that ignore the tool results, because the relevant data has been pushed too far back in the context.

Measured forgetting treats context as a finite resource to be managed during inference, not before it. The method monitors context utilisation in real time, partitions messages into three zones, and invokes the model itself to summarise the compressible zone, all within the same generation loop.

Three-zone partitioning

Before compaction (~12,000 tokens): Pinned | Compressible (tool rounds 1…n) | Recent
After compaction (~3,200 tokens): Pinned | Summary (≤256 tok) | Recent

Zone 1 (Pinned): System prompt + user question. Never compressed. Never discarded.
Zone 2 (Compressible): Intermediate tool exchanges. Compressed into a summary.
Zone 3 (Recent): Last 2 messages. Preserved verbatim — the model needs fresh context to continue.

Compaction fires when context utilisation crosses 75% of the usable window. The compressible zone is fed to the model in a dedicated summarisation pass (low temperature, 256-token cap). The result replaces the entire compressible zone, restoring the context budget.
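The partition and trigger described above can be sketched as follows. This is a minimal illustration in Python, not the project's Rust implementation. The `Message` shape, the whitespace-based `count_tokens` stand-in, the assumption that the first two messages are the system prompt and user question, the `summarise` callable, and the choice of the "system" role for the summary are all hypothetical.

```python
from dataclasses import dataclass

COMPACTION_THRESHOLD = 0.75  # from the text: fire at 75% of the usable window
RECENT_COUNT = 2             # from the text: Zone 3 is the last 2 messages
PINNED_COUNT = 2             # assumption: messages[0] = system prompt, messages[1] = user question


@dataclass
class Message:
    role: str
    content: str


def count_tokens(msg: Message) -> int:
    # Stand-in tokenizer (whitespace split); a real system would use the model's tokenizer.
    return len(msg.content.split())


def partition(messages):
    """Split a conversation into (pinned, compressible, recent)."""
    pinned = messages[:PINNED_COUNT]
    rest = messages[PINNED_COUNT:]
    if len(rest) <= RECENT_COUNT:
        return pinned, [], rest
    return pinned, rest[:-RECENT_COUNT], rest[-RECENT_COUNT:]


def needs_compaction(messages, usable_window: int) -> bool:
    """True once utilisation crosses the 75% threshold."""
    used = sum(count_tokens(m) for m in messages)
    return used > COMPACTION_THRESHOLD * usable_window


def compact(messages, summarise):
    """Replace the whole compressible zone with a single summary message."""
    pinned, compressible, recent = partition(messages)
    if not compressible:
        return messages
    # `summarise` stands for the dedicated low-temperature, 256-token-cap model pass.
    summary = Message("system", summarise(compressible))
    return pinned + [summary] + recent
```

In a generation loop, `needs_compaction` would be checked after each tool round and `compact` called when it returns true, leaving the pinned and recent zones untouched.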

Version 2: Dimension-aware compression

V1 applies uniform compression: all messages are treated equally by the summariser. This works but loses information non-uniformly — causal chains and high-value data points get compressed at the same rate as error messages.

V2 adds three instruments:

1. Influence Equation

$$I(m) = \Phi(B_m) \times \sum_{d} w_d(t)\, C_d(m)$$

Each message is scored across six dimensions: data, causal, task-reference, entity, temporal, structural. The key insight: a message that converges on 4 dimensions is not weighted 4x relative to a single-dimension message; its multiplier is 8x (4^1.5 = 8), because multi-dimensional convergence is harder to reconstruct from a summary.

The resonance multiplier $\Phi(B) = B^{\gamma}$ with $\gamma = 1.5$ was selected empirically from three candidates ($\gamma = 1.0$ failed to separate tiers, $\gamma = 2.0$ over-promoted). TF-IDF scoring and fixed-tier allocation were also tried and abandoned. We report this as an empirical choice rather than a principled one; see the paper for the full failure analysis.
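As a rough illustration of how the influence score behaves, the sketch below computes it in Python. It assumes that B_m is the number of dimensions on which a message has a non-zero score, that each C_d(m) lies in [0, 1], and that the dependence of w_d on t can be collapsed into a fixed weight table; none of these details are confirmed by the text, and the function names are hypothetical.

```python
DIMENSIONS = ("data", "causal", "task_reference", "entity", "temporal", "structural")
GAMMA = 1.5  # exponent reported in the text


def resonance(breadth: int, gamma: float = GAMMA) -> float:
    """Phi(B) = B ** gamma."""
    return breadth ** gamma


def influence(scores: dict, weights: dict) -> float:
    """I(m) = Phi(B_m) * sum over d of w_d * C_d(m).

    Assumption: B_m counts the dimensions with a non-zero score.
    """
    breadth = sum(1 for d in DIMENSIONS if scores.get(d, 0.0) > 0.0)
    weighted = sum(weights.get(d, 0.0) * scores.get(d, 0.0) for d in DIMENSIONS)
    return resonance(breadth) * weighted


uniform = {d: 1.0 for d in DIMENSIONS}
single = {"data": 1.0}
four = {"data": 1.0, "causal": 1.0, "entity": 1.0, "temporal": 1.0}
```

Under these assumptions `resonance(4)` is 8 and `resonance(1)` is 1, which is the 8x multiplier quoted above. With uniform weights the weighted sum also grows with the number of dimensions, so the overall score of `four` relative to `single` is larger than the multiplier alone.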

2. Trace Topology

Messages form causal chains: tool call → tool result → interpretation. These are binding units — they get preserved or compressed together, never split.
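One way to picture binding units is a pass that starts a new group at every tool call, so that the call, its result, and the interpretation that follows travel together. The Python sketch below assumes each message is a dict with a hypothetical "kind" field; the real message format is not given in the text.

```python
def binding_units(messages):
    """Group messages into units, starting a new unit at each tool call.

    Assumption: msg["kind"] is "tool_call", "tool_result", "interpretation",
    or some other label; anything after a tool call joins that call's unit.
    """
    units, current = [], []
    for msg in messages:
        if msg["kind"] == "tool_call" and current:
            units.append(current)
            current = []
        current.append(msg)
    if current:
        units.append(current)
    return units


trace = [
    {"kind": "tool_call", "content": "search(A)"},
    {"kind": "tool_result", "content": "A -> B"},
    {"kind": "interpretation", "content": "A points to B"},
    {"kind": "tool_call", "content": "search(B)"},
    {"kind": "tool_result", "content": "B -> C"},
    {"kind": "interpretation", "content": "B points to C"},
]
units = binding_units(trace)
```

A compressor working on `units` rather than on individual messages would then keep or summarise each unit as a whole, which is what prevents a result from being separated from the call that produced it.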

3. Problem Taxonomy

Lookup (sufficient statistic: the answer): Preserve highest-influence result; compress everything else.
Multi-hop (sufficient statistic: the chain): Preserve causal chain topology (A → B → C).
Exploratory (sufficient statistic: the success): Compress failures to one line; preserve successes in full.
Aggregation (sufficient statistic: the totals): Preserve counts and sums; compress individual data points.
Contradiction (sufficient statistic: both paths): Never heavily compress conflicting data; both sides are the kernel.
Temporal (sufficient statistic: the sequence): Preserve chronological order; include all dates/timestamps.
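The taxonomy above lends itself to a lookup table that selects a compression rule per problem class. The Python sketch below is one possible encoding; the enum, the table, and `summariser_instruction` are hypothetical names, and how the resulting rule is passed to the summariser is an assumption.

```python
from enum import Enum


class ProblemClass(Enum):
    LOOKUP = "lookup"
    MULTI_HOP = "multi-hop"
    EXPLORATORY = "exploratory"
    AGGREGATION = "aggregation"
    CONTRADICTION = "contradiction"
    TEMPORAL = "temporal"


# (sufficient statistic, preservation rule), copied from the taxonomy in the text.
STRATEGY = {
    ProblemClass.LOOKUP: ("the answer", "Preserve highest-influence result; compress everything else"),
    ProblemClass.MULTI_HOP: ("the chain", "Preserve causal chain topology (A -> B -> C)"),
    ProblemClass.EXPLORATORY: ("the success", "Compress failures to one line; preserve successes in full"),
    ProblemClass.AGGREGATION: ("the totals", "Preserve counts and sums; compress individual data points"),
    ProblemClass.CONTRADICTION: ("both paths", "Never heavily compress conflicting data"),
    ProblemClass.TEMPORAL: ("the sequence", "Preserve chronological order; include all dates/timestamps"),
}


def summariser_instruction(problem_class: ProblemClass) -> str:
    """Build the rule text that could be prepended to the summarisation prompt."""
    statistic, rule = STRATEGY[problem_class]
    return f"Sufficient statistic: {statistic}. {rule}."
```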

Results

18 synthetic scenarios (6 problem classes × 3 complexity levels), 47 probes per model, 7 models from 5 independent families. Scoring: 0–3 via keyword matching (deterministic and reproducible, but biased toward factual recall). The same scorer is applied to every condition, so this bias affects V1 and V2 alike and should largely cancel in the V2−V1 comparison.

Models: 7 (5 independent families)
Scenarios: 18 (6 classes × 3 levels)
Probes: 47 (per model per condition)
Sign test: p = .008 (7/7 positive)
[Figure: Retention scores by model. Score out of 3.0; higher is better. Series: Baseline (truncation), V1 (uniform), V2 (dimension-aware).]

The largest gains appear on the weakest summarisers. Qwen3 8B (a thinking-mode model that spends tokens on <think> blocks) gains +0.46, while models already near ceiling gain +0.15. The data suggest that the gain shrinks as baseline capability rises (r = −0.77, p = 0.043, n = 7), though the sample is too small for robust correlation claims.
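The sign-test figure reported for the 7/7 result can be checked with a few lines of arithmetic. The Python sketch below computes the exact binomial tail under the null hypothesis that V2 is equally likely to score above or below V1 on any given model; the function name is illustrative.

```python
from math import comb


def sign_test_one_sided(positives: int, n: int) -> float:
    """P(X >= positives) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(positives, n + 1)) / 2 ** n


p_one_sided = sign_test_one_sided(7, 7)  # 1/128 = 0.0078125
p_two_sided = min(1.0, 2 * p_one_sided)  # 0.015625
```

The reported p = 0.008 matches the one-sided value of 1/128; a two-sided test on the same data would give roughly 0.016.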

KV cache persistence

Measured forgetting reduces what needs to be in context. KV cache persistence reduces the cost of what remains. In a multi-round tool-calling loop, each round's prompt shares a long common prefix with the previous round. A session worker on a dedicated thread persists the llama.cpp KV cache across rounds, finding the longest common prefix and decoding only the new tokens.
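The prefix-matching step can be sketched independently of the inference engine. The Python example below works on plain lists of token IDs and only decides how many cached tokens can be reused; trimming the actual llama.cpp cache and decoding the remainder are not shown, and the function names are hypothetical.

```python
def common_prefix_len(cached, prompt) -> int:
    """Length of the longest common prefix of two token-ID sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_decode(cached, prompt):
    """Return (number of cached tokens to keep, new tokens that still need decoding)."""
    keep = common_prefix_len(cached, prompt)
    return keep, prompt[keep:]


# Round n+1 extends round n: only the appended tokens need prompt processing.
keep, new = tokens_to_decode([1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7])

# After a compaction, the match ends where the summary replaced the compressible zone.
keep_after, new_after = tokens_to_decode([1, 2, 3, 4, 5], [1, 2, 3, 9, 10, 11])
```

In the second call only the first three tokens are reusable, which mirrors the case where the pinned prefix survives a compaction and everything after it is decoded again.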

[Figure: Decode savings across 5 tool rounds. Prompt-processing tokens with vs. without KV cache persistence.]

Together, the two methods form a virtuous cycle. The model investigates while KV persistence avoids redundant decoding. When context fills, measured forgetting compresses the middle zone; the KV cache is partially invalidated, but the pinned prefix stays cached, and the model continues with a fresh budget.

Production deployment

Both techniques run in production in B.app, a Tauri desktop application using Qwen3 8B (Q4_K_M, 4.9 GB) on Apple M4 with 16K context and Metal acceleration via llama.cpp. The application performs multi-round tool-calling investigations — querying APIs, reading files, executing code — with measured forgetting enabling 12+ tool rounds on a 16K window that would otherwise fill in 3–4.

Limitations

Reproduce it

The benchmark harness, all 7 model results (JSON), and the full measured forgetting algorithm are open-source.

# Requires Ollama with models pulled locally
cargo run --release --bin bench-forgetting -- --model mistral:7b

# JSON output
cargo run --release --bin bench-forgetting -- --model mistral:7b --json > results.json

# All default models
cargo run --release --bin bench-forgetting -- --all

The repository includes the paper source (LaTeX), benchmark results for all 7 models, and the complete Rust implementation of both V1 and V2 compression.

Help us get this on arXiv

We're first-time submitters and need an endorsement from someone who has published in cs.CL or cs.AI on arXiv. If you've read the paper and find it credible, endorsing takes about two minutes.

Endorse on arXiv →
The link goes directly to arXiv's endorsement page. We welcome adversarial benchmarks, alternative Φ formulations, and replications on different hardware.

Citation

@article{asidi2026measured,
  title={Measured Forgetting: Context Management for Agentic Local LLMs},
  author={Asidi, Barak Achillah},
  journal={arXiv preprint},
  year={2026}
}