Plain-English entries for the inference jargon on this site. Each term links to a deep page with definition, motivation, and references.
- Cold start: Time from a fresh process initialising to its first served request.
- Continuous batching: Iteration-level scheduling where requests join and leave every decode step.
- KV cache: Per-sequence attention tensors retained between decode steps.
- PagedAttention: Page-based KV cache allocator modeled on OS virtual memory.
- Pipeline parallelism (PP): Layer-group stages assigned one per GPU, micro-batches streaming forward.
- Quantisation: Storing weights at lower precision than the training format.
- Structured output: Server-enforced response shape matching a caller-supplied JSON Schema.
- Tensor parallelism (TP): Intra-layer sharding of weight matrices across N GPUs.
- Tool calls (function calling): Model returns structured function invocations instead of free text.
- TTFT, time to first token: End-to-end latency from request submit to first streamed response byte.
- FP8 quantization: 8-bit floating-point weights executed natively by Hopper tensor cores.
- Speculative decoding: Draft-model token proposals verified in one target-model pass.
- Mixture of experts (MoE): Sparse expert routing: large total capacity, small per-token compute.
- Spot vs on-demand GPU: Discounted reclaimable GPU capacity versus paid-for-certainty allocation.
- OpenAI-compatible API: The drop-in /v1/chat/completions contract for swapping inference providers.
- Cold-start wake latency: The budgeted SLO for waking a hibernated tier into service.