spotinference Sign in with GitHub

The KV cache stores every prior token's attention tensors so decoding stays linear, and its memory footprint caps how many requests share a GPU. For API users that produces three rules: first-token latency grows with prompt length, decode speed is roughly constant per token, and long-lived contexts are the expensive resource.

How the KV cache sets LLM API latency and cost

Nearly every latency and pricing property an LLM API user observes traces back to one data structure. The KV cache makes decoding linear instead of quadratic, its memory footprint caps how many requests share a GPU, and that cap, more than arithmetic speed, sets the cost of a token. Three user-visible consequences follow: first-token latency grows with prompt length, decode speed is roughly constant per token, and long contexts are expensive to hold open.

One structure, two phases

The glossary entry defines the cache; the mechanics live at How engines work: the KV cache. The short version: attention needs the Key and Value tensors of every prior token, and recomputing them per step would make generation quadratic, so the engine stores them once and reads them every step. That split creates the two phases every request passes through. Prefill ingests the prompt and builds the cache in one compute-bound pass whose duration grows with prompt length: this is most of time to first token. Decode then produces one token per step in a memory-bound loop that reads the whole cache each iteration: per-token time stays roughly flat as the answer grows.

Why decode speed looks like memory bandwidth

The fleet's own published measurements illustrate the memory-bound claim. On the same short-context benchmark through the same gateway, a dual-H100 tier decoded at 118.9 tokens per second and a dual-A100 tier at 106.6, a gap of roughly 12 percent between hardware generations whose datasheet compute differs by integer multiples. Decode does not buy FLOPs; it buys bytes per second between high-bandwidth memory and the compute units, and the measured-throughput article walks the full comparison. The user-facing consequence: model choice and quantisation move decode speed more than GPU model does, and a provider quoting decode rates should say which hardware and context length produced them.

The arithmetic that caps concurrency

Cache size per token is fixed by model architecture: two tensors, times layer count, times KV-head count, times head dimension, times bytes per element. For a concrete public example, Llama 3 8B (32 layers, 8 KV heads under grouped-query attention, head dimension 128) at 16-bit precision costs 2 x 32 x 8 x 128 x 2 bytes, exactly 128 KiB of cache per token. An 8,192-token conversation therefore holds one gigabyte of GPU memory open, per sequence, before counting the model's own weights. Divide what remains of an 80 GB card after roughly 16 GB of weights by one gigabyte per long sequence and the concurrency ceiling appears two orders of magnitude before arithmetic throughput would have imposed one.

That ceiling is why PagedAttention mattered enough to name an engine after: paging the cache eliminated the fragmentation that wasted most of this budget, and continuous batching keeps the freed memory full of active work. Together they multiplied tokens per GPU-hour, which is the denominator of every per-token price.

Where it lands in the bill

A GPU bills by the hour, so cost per token is rent divided by tokens produced, and tokens produced is bounded by how many sequences the cache admits. Every long-context request occupies cache for its entire lifetime, crowding out neighbors; idle-but-open contexts are the expensive kind of idle. This is the mechanical reason long-context pricing tiers exist at hosted APIs, and the reason a fleet's realised cost band, like the $0.30 to $0.70 per million output tokens this fleet publishes in the invoice-true article, depends on utilisation: the band assumes the cache spends its hours full of paying work rather than holding open space.

What an API caller can do about it

  • Trim the standing prompt. Every system-prompt token is paid in prefill time on every request and in cache space for the request's lifetime. A 2,000-token preamble on a 50-token question is a 40x overhead the user experiences as TTFT.
  • Cap max_tokens realistically. The cap bounds how long a sequence can keep growing its cache allocation, and it is the input to honest read-timeout arithmetic, per the retries guide.
  • Stream. Streaming does not change cache behavior, but it hides prefill latency behind the first paint, per the streaming guide.
  • Keep conversations bounded. Summarise or truncate history instead of replaying entire transcripts; replayed history is re-prefilled every turn and cached for every turn's duration.
  • Prefer structured brevity. Schema-constrained answers, per the structured-output guide, cut output length and therefore decode time and cache growth at once.

Methodology

The decode rates cited (118.9 tokens per second dual-H100, 106.6 dual-A100) are the fleet's published 2026-04-19 short-context benchmark numbers. The per-token cache arithmetic uses public architecture parameters from the cited Llama 3 report and is plain multiplication, reproducible from the text. No concurrency figure for this fleet's tiers is claimed.

Part of Performance and latency on the learn hub.

See also
References

The techniques in these pages run in production behind spotinference's OpenAI-compatible endpoint. Get a key and try it: swap the base URL and the key in an existing SDK, and the first request streams back tokens.