Three Ways Agent Systems Accidentally Kill Prompt Caching

Estimated reading time: 9-11 minutes | ~1,725 words

TL;DR

  • Prompt caching is an economics problem, not a binary on/off feature: the right model is exclusive read savings minus write and lifecycle overhead.
  • Cache hits fail at the provider’s prefix-equivalence boundary. If your cache-critical prefix changes in tiny ways (tool order, old-history rewrites, early image replacement), your hit rate drops.
  • The three most common cache killers I keep seeing are nondeterministic tool ordering, aggressive mutation of older conversation history, and pruning multimodal blocks too early.
  • The most practical architecture I’ve found is freeze prefix, mutate tail, with explicit quality guardrails so cache wins don’t degrade answers.
  • If you don’t track cache_attempt_coverage, eligibility_realization_rate, and token-attribution conservation, you can convince yourself caching is working when it isn’t.

Most prompt-caching failures I debug are self-inflicted. Agent systems mutate prompt content they think is harmless, then wonder why cache hit rate collapses.

I think this is easy to miss because “prompt caching” sounds like a provider feature you switch on. In practice, it behaves more like a strict systems contract between your request assembly logic and the provider’s cache semantics. If your side is unstable, provider-side caching can’t save you.

This post walks through why that happens, plus three concrete failure modes from OpenClaw fixes by Boris Cherny: #58036, #58037, and #58038.


Prompt caching fails at the provider matching boundary, not the intent boundary

The core thing to internalize is simple: for prefix-match caching, providers compare provider-defined prefix equivalence (usually token-prefix or content-block equivalence), not your intent.

Per the current docs:

  • OpenAI prompt caching: automatic reuse once prompts are large enough (for example, 1,024+ input tokens), with provider-managed behavior beyond that.
  • Anthropic prompt caching: two paths, top-level cache_control (automatic breakpoint at the last cacheable block) and explicit block-level cache_control breakpoints you place manually.
  • Gemini/Vertex context caching: both implicit caching (enabled by default on supported models/projects) and explicit cache resources with TTL lifecycle controls.

Different knobs, same reality: if cache-critical content drifts, reuse drops.

The simplest way I model it is this equation:

E[net_savings/request]
≈ E[read_savings_exclusive/request]
 - E[cache_write_creation_cost/request]
 - E[cache_lifecycle_overhead/request]

And the read side has to avoid double counting:

E[read_savings_exclusive/request]
≈ E[prefix_tokens_excl × Δcost_prefix]
 + E[handle_tokens_excl × Δcost_handle]

attribution precedence (for accounting): handle > prefix > none

That precedence rule matters. If you count the same tokens as both “prefix saved” and “handle saved,” your dashboard lies.
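A minimal TypeScript sketch of that accounting. The type and field names are illustrative, not any provider's API; the point is that each token span carries exactly one attribution (assigned by the handle > prefix > none precedence upstream), so nothing is double counted.

```typescript
// Hypothetical accounting types; names are illustrative, not a provider API.
type Attribution = "handle" | "prefix" | "none"; // precedence: handle > prefix > none

interface TokenSpan {
  tokens: number;
  attribution: Attribution; // assigned exactly once, by precedence, upstream
}

// net_savings ≈ exclusive read savings - write cost - lifecycle overhead
function netSavings(
  spans: TokenSpan[],
  costDelta: { handle: number; prefix: number }, // savings per token by attribution
  writeCost: number,
  lifecycleOverhead: number,
): number {
  let readSavings = 0;
  for (const s of spans) {
    if (s.attribution === "handle") readSavings += s.tokens * costDelta.handle;
    else if (s.attribution === "prefix") readSavings += s.tokens * costDelta.prefix;
    // "none" tokens contribute no savings by definition
  }
  return readSavings - writeCost - lifecycleOverhead;
}
```

If the same tokens showed up in two spans with different attributions, this function would happily double count them, which is why the precedence assignment has to happen once, before accounting.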

I wrote about this style of “determinism first, optimization second” in a different context in Optimizing PyTorch Inference Without Sacrificing Determinism. Prompt caching has the same shape: small nondeterminism bugs dominate the outcome.


Cache killer #1: nondeterministic tool ordering

This is the easiest one to explain and the easiest one to miss.

Many agent runtimes discover tools from MCP servers, plugin registries, or dynamic service catalogs. Those collections are often unordered. If you materialize tools in returned order, the same logical toolset can produce different serialized prompts between turns.

Boris Cherny fixed exactly this in OpenClaw PR #58037: sort tools deterministically before request assembly.

Before:

// Nondeterministic: depends on upstream list order
const tools = await listTools();
request.tools = tools;

After:

// Deterministic: stable key ordering
const tools = await listTools();
tools.sort((a, b) => a.name.localeCompare(b.name));
request.tools = tools;

I think this is a pure win in most systems. It rarely has semantic downside, and it often gives immediate cache stability.

The subtle version of this bug is map/object serialization drift. If your serializer doesn’t guarantee key order, you can still churn cache-critical prompt content even after sorting arrays. Use canonical JSON serialization for cache-critical regions.
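One way to get that guarantee, assuming plain JSON-serializable values (no cycles, no custom toJSON behavior), is a recursive stringify that sorts object keys:

```typescript
// Canonical JSON: recursively sort object keys so logically equal
// payloads always serialize to byte-identical strings.
function canonicalStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalStringify).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return (
      "{" +
      keys.map((k) => JSON.stringify(k) + ":" + canonicalStringify(obj[k])).join(",") +
      "}"
    );
  }
  return JSON.stringify(value); // primitives: string, number, boolean, null
}
```

Apply this only to cache-critical regions; canonicalizing the whole request is usually unnecessary and costs serialization time.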


Cache killer #2: rewriting old history during compaction

The second failure mode is maintenance logic mutating the prefix.

Many agent systems compact tool results when context grows too large. That’s good in principle. The problem is where they compact. If you mutate older turns first, you rewrite exactly the part of history that should stay stable for cache reuse.

Boris Cherny fixed this in OpenClaw PR #58036: compact newest-first, not oldest-first.

Before:

// Oldest-first compaction mutates cache-critical prefix first
for (let i = 0; i < messages.length; i++) {
  maybeCompact(messages[i]);
}

After:

// Newest-first compaction absorbs pressure in mutable tail
for (let i = messages.length - 1; i >= 0; i--) {
  maybeCompact(messages[i]);
}

This is the core architecture pattern I keep coming back to:

[FROZEN PREFIX ZONE] [MUTABLE TAIL ZONE]

I use it because it gives me a concrete invariant: maintenance transforms are allowed in the tail, blocked in the prefix.

There is a tradeoff. Newest-first compaction can impact recency quality if you overdo it. That’s why I treat this as a policy with guardrails, not a doctrine.
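The zone split itself can be made explicit in code. This is a sketch under an assumed boundary policy (the last N messages are mutable); the real boundary choice is workload-dependent.

```typescript
interface Message { role: string; content: string; }

// Partition history so maintenance transforms can be allowed in the tail
// and rejected in the prefix by construction, not by convention.
function partitionHistory(messages: Message[], mutableTailSize: number) {
  const split = Math.max(0, messages.length - mutableTailSize);
  return {
    frozenPrefix: messages.slice(0, split), // cache-critical: never mutate
    mutableTail: messages.slice(split),     // compaction/pruning allowed here
  };
}
```

Once the split is a real value in the code, tests can assert that compaction output leaves frozenPrefix byte-identical to its input.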


Cache killer #3: pruning multimodal blocks too early

Multimodal history is expensive and fragile. If you replace image/audio blocks with placeholders too early, you may preserve “meaning” for humans while destroying token/block-level prefix continuity for the cache.

Boris Cherny handled this in OpenClaw PR #58038: delay pruning and keep a recent retention window.

The key change was basically this:

const PRESERVE_RECENT_ASSISTANT_TURNS = 3;

// only prune image blocks older than retention window

I like this pattern because it acknowledges two truths at once:

  1. You need eventual compaction to control context growth.
  2. You need short-term prefix stability for cache continuity.

A fixed window (3 assistant turns here) is a pragmatic starting point. Then tune by workload quality metrics.
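A sketch of the retention-window idea, counting assistant turns from the end of history; the counting scheme and placeholder text here are my assumptions, not the PR's actual code.

```typescript
const PRESERVE_RECENT_ASSISTANT_TURNS = 3;

interface ContentBlock { type: "text" | "image"; data: string; }
interface Turn { role: "user" | "assistant"; blocks: ContentBlock[]; }

// Replace image blocks with placeholders only outside the retention window,
// so the recent history stays block-identical for cache matching.
function pruneOldImages(turns: Turn[]): Turn[] {
  let assistantSeen = 0;
  const out: Turn[] = [];
  for (let i = turns.length - 1; i >= 0; i--) {
    const t = turns[i];
    if (t.role === "assistant") assistantSeen++;
    const inWindow = assistantSeen <= PRESERVE_RECENT_ASSISTANT_TURNS;
    out.unshift(
      inWindow
        ? t
        : {
            ...t,
            blocks: t.blocks.map((b) =>
              b.type === "image" ? { type: "text" as const, data: "[image pruned]" } : b,
            ),
          },
    );
  }
  return out;
}
```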


The architecture I use now: freeze prefix, mutate tail

When people ask me for one operational rule, this is it.

I build requests as two explicit zones and enforce mutation boundaries in code review and tests.

logical state
  -> canonicalize tools + maps + JSON
  -> choose policy mode (cache_optimized | balanced | quality_optimized)
  -> partition history into [frozen prefix | mutable tail]
  -> run compaction/pruning only in mutable tail
  -> call provider cache path

Then I set quality guardrails up front. If guardrails regress, I automatically move to a less aggressive mode.

For example:

  • cache_optimized: maximize prefix stability, aggressive tail mutation.
  • balanced: moderate mutation.
  • quality_optimized: preserve recency/context even if hit rate drops.
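The fallback can be a simple monotone step-down. The mode names follow the list above; the one-step-per-regression policy and the regression trigger are assumptions to tune.

```typescript
type PolicyMode = "cache_optimized" | "balanced" | "quality_optimized";

// One step less aggressive per guardrail regression; quality_optimized is the floor.
const LESS_AGGRESSIVE: Record<PolicyMode, PolicyMode> = {
  cache_optimized: "balanced",
  balanced: "quality_optimized",
  quality_optimized: "quality_optimized",
};

function nextMode(current: PolicyMode, guardrailRegressed: boolean): PolicyMode {
  return guardrailRegressed ? LESS_AGGRESSIVE[current] : current;
}
```

Stepping back up toward cache_optimized should be slower and deliberate, ideally gated on a clean guardrail window, not automatic.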

I think teams get into trouble when they optimize cache metrics in isolation. A lower bill is not a win if tool success or answer correctness drops.


The measurement traps that waste the most time

I’ve seen teams burn weeks because their caching metrics were not decision-grade.

These are the minimum contracts I now require.

1) Coverage and realization are separate

You need both:

  • cache_attempt_coverage = cache_attempt_request_count / expected_eligible_request_count
  • eligibility_realization_rate = eligible_request_count / expected_eligible_request_count

This only works if expected_eligible_request_count is well-defined and measured consistently. If eligibility is estimated badly, both KPIs can mislead. I track eligibility-projection error by provider/model bucket and treat high error as a degraded-metrics condition.

If attempt coverage drops, you probably have a runtime regression (for example, expected handle not attached). If realization drops with stable attempts, you likely have TTL, compatibility, or threshold problems.

2) Denominator-zero must be NA, not 0

When a stratum has denominator 0, emit NA and a denominator_zero=true flag.

Coercing to 0 silently poisons rollups. I still see this constantly.
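A sketch of the contract in TypeScript: the return type forces callers to handle the NA case explicitly instead of silently reading 0.

```typescript
// NA is a distinct case, not a zero; rollups must skip or flag it explicitly.
type Ratio =
  | { value: number; denominatorZero: false }
  | { value: null; denominatorZero: true };

function safeRatio(numerator: number, denominator: number): Ratio {
  if (denominator === 0) return { value: null, denominatorZero: true };
  return { value: numerator / denominator, denominatorZero: false };
}
```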

3) Token conservation must hold

If you use mixed attribution (prefix, handle, none), enforce conservation:

candidate_total_tokens
= prefix_tokens_excl + handle_tokens_excl + none_tokens

If this drifts, your savings math is untrustworthy.
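A conservation check this simple is worth running on every accounting record; exact equality here assumes integer token counts.

```typescript
// Every candidate token must land in exactly one attribution bucket.
function attributionConserves(
  candidateTotalTokens: number,
  prefixTokensExcl: number,
  handleTokensExcl: number,
  noneTokens: number,
): boolean {
  return candidateTotalTokens === prefixTokensExcl + handleTokensExcl + noneTokens;
}
```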

4) Miss reasons must be explicit

For handle/resource flows, at minimum:

  • expected_handle_missing
  • handle_lookup_failed
  • handle_scope_mismatch
  • handle_compatibility_mismatch
  • handle_expired_ttl
  • cache_lifecycle_api_error

Without these, “miss rate increased” is just noise.
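Encoding the reasons as a closed union makes an unclassified miss impossible to record silently; a minimal tally sketch:

```typescript
type MissReason =
  | "expected_handle_missing"
  | "handle_lookup_failed"
  | "handle_scope_mismatch"
  | "handle_compatibility_mismatch"
  | "handle_expired_ttl"
  | "cache_lifecycle_api_error";

// Count misses by reason so "miss rate increased" points at a specific cause.
function tallyMisses(reasons: MissReason[]): Map<MissReason, number> {
  const counts = new Map<MissReason, number>();
  for (const r of reasons) counts.set(r, (counts.get(r) ?? 0) + 1);
  return counts;
}
```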


Provider differences change the operating loop

I think one of the biggest mistakes is shipping one generic caching strategy.

The control loops are different:

| Provider | Primary mechanism | What I optimize first | Most common failure pattern |
| --- | --- | --- | --- |
| OpenAI | Automatic prefix reuse on eligible prompts | Stable shared prefix and threshold coverage | Short prompts below threshold, hidden prefix drift |
| Anthropic | Top-level cache_control (automatic last-cacheable-block breakpoint) + explicit block-level breakpoints | Stable top-level blocks plus breakpoint placement for hot spans | Breakpoint drift, expiry misses, hidden prompt drift |
| Gemini / Vertex | Implicit caching (default on supported models/projects) + explicit cache resources (cachedContents) | Maximize implicit reuse first, then tighten cache-resource lifecycle in explicit mode | Missing/expired/invalid resource IDs, scope/TTL mismatches |

This table is a point-in-time snapshot. Provider cache semantics change fast, so I re-check docs before major rollout decisions and after model/SDK upgrades.

The good news is the three killers in this post still matter across all three. Deterministic assembly and strict mutation boundaries are universal.


What I would implement first on Monday morning

If I inherit an agent system with poor cache performance, this is my first pass:

  1. Canonicalize request assembly

    • Sort tools and any unordered collections.
    • Use deterministic JSON serialization for cache-critical regions.
  2. Enforce boundary invariants

    • Introduce explicit frozen prefix and mutable tail zones.
    • Block compaction/pruning transforms in frozen prefix.
  3. Delay destructive multimodal transforms

    • Preserve recent image/audio turns with a bounded window.
    • Prune only beyond that window.
  4. Ship real miss telemetry

    • Add handle/resource miss codes.
    • Track coverage, realization, and denominator-zero strata.
  5. Validate net value, not gross hit rate

    • Include write and lifecycle overhead in economics.
    • Enforce token-attribution conservation before trusting savings.

This usually gets you from “mysterious misses” to a stable baseline quickly.


The uncomfortable truth

Prompt caching is less about clever algorithms and more about boring engineering discipline.

It rewards stable ordering, canonical serialization, explicit boundaries, and honest telemetry. It punishes every “harmless” mutation in the cache-critical prefix.

I think that’s why these bugs keep recurring. They look tiny in code review, but they compound across every turn in an agent loop.

If you fix only one thing, fix determinism in request assembly. In my experience, that single change often unlocks more than any provider-specific optimization trick.