Measured Forgetting: Context Management for Agentic Local LLMs

7 / 7

V2 outperforms V1 on every model tested — sign test p = 0.008

When a local LLM (8B–14B parameters, 8K–16K context) acts as a tool-calling agent, it exhausts its context window in 3–5 rounds. Each round appends both a model response and a tool result. Quality degrades before context fills — the failure mode is confabulation, not a crash. The model starts generating plausible-sounding responses that ignore the tool results, because the relevant data has been pushed too far back in the context.

Measured forgetting treats context as a finite resource to be managed during inference, not before it. The method monitors context utilisation in real time, partitions messages into three zones, and invokes the model itself to summarise the compressible zone, all within the same generation loop.

Three-zone partitioning

Before compaction (~12,000 tokens): Pinned, Compressible (tool rounds 1…n), Recent

↓

After compaction (~3,200 tokens): Pinned, Summary (≤256 tok), Recent

Zone 1 (Pinned): System prompt + user question. Never compressed. Never discarded.
Zone 2 (Compressible): Intermediate tool exchanges. Compressed into a summary.
Zone 3 (Recent): Last 2 messages. Preserved verbatim — the model needs fresh context to continue.

Compaction fires when context utilisation crosses 75% of the usable window. The compressible zone is fed to the model in a dedicated summarisation pass (low temperature, 256-token cap). The result replaces the entire compressible zone, restoring the context budget.
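
The partitioning and trigger described above can be sketched in a few lines of Rust. This is a minimal illustration, not the project's implementation: the `Message` and `Zones` types, the function names, and the closure standing in for the summarisation pass are all hypothetical. Only the 75% threshold and the zone layout come from the text.

```rust
use std::ops::Range;

/// Illustrative message type (hypothetical, not the project's own).
#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub text: String,
    pub tokens: usize,
}

/// Index ranges of the three zones over a message list.
#[derive(Debug, PartialEq)]
pub struct Zones {
    pub pinned: Range<usize>,
    pub compressible: Range<usize>,
    pub recent: Range<usize>,
}

/// Splits `n` messages into pinned / compressible / recent.
/// Assumes the first `pinned_len` messages are the system prompt and user
/// question, and the last `recent_len` messages are kept verbatim.
pub fn partition(n: usize, pinned_len: usize, recent_len: usize) -> Zones {
    let pinned_end = pinned_len.min(n);
    let recent_start = n.saturating_sub(recent_len).max(pinned_end);
    Zones {
        pinned: 0..pinned_end,
        compressible: pinned_end..recent_start,
        recent: recent_start..n,
    }
}

/// True once utilisation reaches 75% of the usable window.
/// Integer arithmetic avoids floating-point rounding at the boundary.
pub fn should_compact(used_tokens: usize, usable_window: usize) -> bool {
    used_tokens * 4 >= usable_window * 3
}

/// Replaces the compressible zone with a single summary message.
/// `summarise` stands in for the model's summarisation pass.
pub fn compact<F>(messages: &[Message], zones: &Zones, summarise: F) -> Vec<Message>
where
    F: Fn(&[Message]) -> Message,
{
    let mut out = messages[zones.pinned.clone()].to_vec();
    if !zones.compressible.is_empty() {
        out.push(summarise(&messages[zones.compressible.clone()]));
    }
    out.extend_from_slice(&messages[zones.recent.clone()]);
    out
}
```

With 8 messages, 2 pinned and 2 recent, `partition(8, 2, 2)` yields a compressible zone of indices 2..6, and `compact` returns 5 messages: the 2 pinned, 1 summary, and the 2 recent.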

Version 2: Dimension-aware compression

V1 applies uniform compression: all messages are treated equally by the summariser. This works but loses information non-uniformly — causal chains and high-value data points get compressed at the same rate as error messages.

V2 adds three instruments:

1. Influence Equation

$$I(m) = \Phi(B_m) \times \sum_{d} w_d(t) \cdot C_d(m)$$

Each message is scored across six dimensions: data, causal, task-reference, entity, temporal, structural. The key insight: a message that converges on 4 dimensions is not 4x more valuable than a single-dimension message — it is 8x more valuable, because multi-dimensional convergence is harder to reconstruct from a summary.

The resonance multiplier Φ(B) = B^1.5 was selected empirically from three candidates (γ=1.0 failed to separate tiers, γ=2.0 over-promoted). TF-IDF scoring and fixed-tier allocation were also tried and abandoned. We report this honestly rather than presenting it as principled — see the paper for the full failure analysis.
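
As a rough sketch of how the score could be computed, the Rust below evaluates the influence equation for one message. It rests on assumptions the text does not confirm: B_m is taken to be the number of dimensions with a non-zero contribution, and the weights w_d(t) are passed in already evaluated for the current t. The function names are hypothetical. How the contributions C_d(m) are normalised is not stated here, so they are left as caller-supplied values.

```rust
/// Number of scoring dimensions named in the text:
/// data, causal, task-reference, entity, temporal, structural.
pub const DIMENSIONS: usize = 6;

/// Resonance multiplier Φ(B) = B^1.5, as reported in the text.
pub fn resonance(active_dims: usize) -> f64 {
    (active_dims as f64).powf(1.5)
}

/// I(m) = Φ(B_m) × Σ_d w_d · C_d(m).
/// Assumption: B_m counts the dimensions whose contribution is non-zero.
pub fn influence(weights: &[f64; DIMENSIONS], contributions: &[f64; DIMENSIONS]) -> f64 {
    let active = contributions.iter().filter(|&&c| c > 0.0).count();
    let weighted_sum: f64 = weights
        .iter()
        .zip(contributions.iter())
        .map(|(w, c)| w * c)
        .sum();
    resonance(active) * weighted_sum
}
```

Under this reading, the multiplier alone gives `resonance(4) = 8` against `resonance(1) = 1`, which is where the 8x figure for four-dimension convergence comes from.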

2. Trace Topology

Messages form causal chains: tool call → tool result → interpretation. These are binding units — they get preserved or compressed together, never split.
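
One way to form such binding units is to group adjacent messages by kind. The sketch below is an assumption about how grouping might work, not the project's detector: the `Kind` enum and `binding_units` function are hypothetical, and only directly adjacent call → result → interpretation sequences are linked.

```rust
/// Illustrative message kinds (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    ToolCall,
    ToolResult,
    Interpretation,
    Other,
}

/// Groups message indices into binding units. A unit starts at any message
/// and is extended only when a ToolResult directly follows a ToolCall, or an
/// Interpretation directly follows a ToolResult.
pub fn binding_units(kinds: &[Kind]) -> Vec<Vec<usize>> {
    let mut units: Vec<Vec<usize>> = Vec::new();
    for (i, kind) in kinds.iter().enumerate() {
        // Kind of the last message in the most recent unit, if any.
        let prev = units.last().and_then(|u| u.last()).map(|&j| kinds[j]);
        let extends_chain = matches!(
            (*kind, prev),
            (Kind::ToolResult, Some(Kind::ToolCall))
                | (Kind::Interpretation, Some(Kind::ToolResult))
        );
        if extends_chain {
            units.last_mut().unwrap().push(i);
        } else {
            units.push(vec![i]);
        }
    }
    units
}
```

A compressor would then keep or summarise each inner `Vec<usize>` as a whole, so a tool result is never separated from the call that produced it.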

3. Problem Taxonomy

Lookup (sufficient statistic: the answer): Preserve the highest-influence result; compress everything else.
Multi-hop (sufficient statistic: the chain): Preserve causal chain topology (A → B → C).
Exploratory (sufficient statistic: the success): Compress failures to one line; preserve successes in full.
Aggregation (sufficient statistic: the totals): Preserve counts and sums; compress individual data points.
Contradiction (sufficient statistic: both paths): Never heavily compress conflicting data; both sides are the kernel.
Temporal (sufficient statistic: the sequence): Preserve chronological order; include all dates/timestamps.
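
The taxonomy maps naturally onto an enum with one strategy per class. The following is a sketch under that assumption. The type and function names, including the `summary_instruction` helper, are hypothetical, and the strings simply restate the taxonomy above.

```rust
/// The six problem classes from the taxonomy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProblemClass {
    Lookup,
    MultiHop,
    Exploratory,
    Aggregation,
    Contradiction,
    Temporal,
}

impl ProblemClass {
    /// What must survive compaction for this class.
    pub fn sufficient_statistic(self) -> &'static str {
        match self {
            ProblemClass::Lookup => "the answer",
            ProblemClass::MultiHop => "the chain",
            ProblemClass::Exploratory => "the success",
            ProblemClass::Aggregation => "the totals",
            ProblemClass::Contradiction => "both paths",
            ProblemClass::Temporal => "the sequence",
        }
    }

    /// Compression strategy for this class.
    pub fn strategy(self) -> &'static str {
        match self {
            ProblemClass::Lookup => "Preserve the highest-influence result; compress everything else",
            ProblemClass::MultiHop => "Preserve causal chain topology",
            ProblemClass::Exploratory => "Compress failures to one line; preserve successes in full",
            ProblemClass::Aggregation => "Preserve counts and sums; compress individual data points",
            ProblemClass::Contradiction => "Never heavily compress conflicting data",
            ProblemClass::Temporal => "Preserve chronological order; include all dates/timestamps",
        }
    }
}

/// Hypothetical helper: one instruction line for the summarisation pass.
pub fn summary_instruction(class: ProblemClass) -> String {
    format!(
        "Sufficient statistic: {}. {}.",
        class.sufficient_statistic(),
        class.strategy()
    )
}
```

Using an exhaustive `match` means the compiler flags any class that is added later without a strategy.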

Results

18 synthetic scenarios (6 problem classes × 3 complexity levels), 47 probes per model, 7 models from 5 independent families. Scoring: 0–3 via keyword matching (deterministic, reproducible, biased toward factual recall). The same scorer applies to all conditions, so any bias cancels in the V2−V1 comparison.
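
The sign-test figure can be checked with a short calculation. The sketch below computes a one-sided sign test, P(X ≥ k) for X ~ Binomial(n, 0.5). Treating the reported test as one-sided is an assumption: 7 wins out of 7 gives 1/128 ≈ 0.0078, which rounds to the reported 0.008, whereas a two-sided test would give 0.0156. The function names are illustrative.

```rust
/// Binomial coefficient C(n, k) as f64, via the multiplicative formula.
fn binomial(n: u32, k: u32) -> f64 {
    let k = k.min(n - k);
    (0..k).fold(1.0, |acc, i| acc * (n - i) as f64 / (i + 1) as f64)
}

/// One-sided sign test: probability of at least `wins` positive differences
/// out of `n` under the null hypothesis that each sign is a fair coin flip.
pub fn sign_test_one_sided(wins: u32, n: u32) -> f64 {
    let mut tail = 0.0;
    for k in wins..=n {
        tail += binomial(n, k);
    }
    tail / 2f64.powi(n as i32)
}
```

Calling `sign_test_one_sided(7, 7)` returns 0.0078125.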

Models: 7, from 5 independent families
Scenarios: 18 (6 classes × 3 levels)
Probes: 47 per model per condition
Sign test: p = .008 (7/7 positive)

Figure: Retention scores by model. Score out of 3.0; higher is better. Conditions: Baseline (truncation), V1 (uniform), V2 (dimension-aware).

The largest gains appear on the weakest summarisers. Qwen3 8B (a thinking-mode model that spends tokens on <think> blocks) gains +0.46. Models already near ceiling gain +0.15. Directional evidence suggests an inverse capability relationship (r = −0.77, p = 0.043, n=7), though the sample is too small for robust correlation claims.

KV cache persistence

Measured forgetting reduces what needs to be in context. KV cache persistence reduces the cost of what remains. In a multi-round tool-calling loop, each round's prompt shares a long common prefix with the previous round. A session worker on a dedicated thread persists the llama.cpp KV cache across rounds, finding the longest common prefix and decoding only the new tokens.
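
The prefix-matching step can be illustrated without touching llama.cpp itself. The sketch below only compares token sequences. The function names are hypothetical, tokens are modelled as `i32`, and the actual cache trimming and decode calls are left out.

```rust
/// Length of the longest common prefix of the cached tokens and the new prompt.
pub fn common_prefix_len(cached: &[i32], prompt: &[i32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}

/// The suffix of `prompt` that still has to be decoded; everything before it
/// is assumed to be reusable from the persisted KV cache.
pub fn tokens_to_decode<'a>(cached: &[i32], prompt: &'a [i32]) -> &'a [i32] {
    &prompt[common_prefix_len(cached, prompt)..]
}
```

If the cache holds tokens `[1, 2, 3, 4, 5]` and the next prompt is `[1, 2, 3, 9, 9, 9]`, the shared prefix has length 3 and only the last three tokens need decoding. After a compaction, the prompt diverges from the cache at the start of the summary, so the shared prefix shrinks to roughly the pinned zone.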

Figure: Decode savings across 5 tool rounds. Prompt-processing tokens with vs without KV cache persistence.

Together, the two methods form a virtuous cycle. The model investigates while KV persistence avoids redundant decoding. When context fills, measured forgetting compresses the middle zone. The KV cache is partially invalidated, but the pinned prefix stays cached, and the model continues with a fresh budget.

Production deployment

Both techniques run in production in B.app, a Tauri desktop application using Qwen3 8B (Q4_K_M, 4.9 GB) on Apple M4 with 16K context and Metal acceleration via llama.cpp. The application performs multi-round tool-calling investigations — querying APIs, reading files, executing code — with measured forgetting enabling 12+ tool rounds on a 16K window that would otherwise fill in 3–4.

Limitations

Summarisation ceiling: An 8B model's summary is only as good as its summarisation ability. Information loss is inherent.
Φ blind spot: The super-linear resonance multiplier disadvantages messages informative on a single dimension. Aggregation tasks suffer (V2: 2.84 vs V1: 2.90).
Causal fidelity gap: Chain detection works for structural dependencies but misses semantic dependencies spanning non-adjacent messages.
Synthetic benchmark: 18 scenarios with known ground-truth facts. Real conversations may differ. Naturalistic evaluation is pending.
Keyword scoring: Rewards factual recall over nuanced understanding. Will undercount correct paraphrases using synonyms.

Reproduce it

The benchmark harness, all 7 model results (JSON), and the full measured forgetting algorithm are open-source.

# Requires Ollama with models pulled locally
cargo run --release --bin bench-forgetting -- --model mistral:7b

# JSON output
cargo run --release --bin bench-forgetting -- --model mistral:7b --json > results.json

# All default models
cargo run --release --bin bench-forgetting -- --all

The repository includes the paper source (LaTeX), benchmark results for all 7 models, and the complete Rust implementation of both V1 and V2 compression.

View on GitHub Read the full paper (PDF)

Help us get this on arXiv

We're first-time submitters and need an endorsement from someone who has published in cs.CL or cs.AI on arXiv. If you've read the paper and find it credible, endorsing takes about two minutes.

Endorse on arXiv →

The link goes directly to arXiv's endorsement page. We welcome adversarial benchmarks, alternative Φ formulations, and replications on different hardware.

Citation

@article{asidi2026measured,
  title={Measured Forgetting: Context Management for Agentic Local LLMs},
  author={Asidi, Barak Achillah},
  journal={arXiv preprint},
  year={2026}
}