Question 1

What is TTFT in LLM inference?

Accepted Answer

Time to first token is the delay between sending a request and seeing the first byte back. In practice it's dominated by prompt tokenization, KV-cache prefill, and the engine's batching cycle. For interactive apps, this is the latency users actually feel.

Question 2

What is the KV cache?

Accepted Answer

The per-sequence Key/Value attention tensors that every decoding step produces and reuses. Without caching, every step would recompute the full attention over all prior tokens. With it, the per-step cost is bounded by the model size and the one new token.

Question 3

What is PagedAttention?

Accepted Answer

Kwon et al.'s allocation algorithm for the KV cache. Split the cache into fixed-size blocks (pages) and let multiple sequences share unused space. vLLM builds on it. Effective KV utilisation went from roughly 20-40% to nearly 100% in their paper.

Question 4

What is continuous batching in LLM serving?

Accepted Answer

Iteration-level scheduling where new requests can join an in-flight batch mid-decode, and finished requests free their slot for the next step. It's the opposite of static batching. Yu et al.'s Orca paper (OSDI 2022) introduced it; vLLM, TGI, and TensorRT-LLM all use it now.

Question 5

What is quantisation in LLM inference?

Accepted Answer

Storing model weights in a lower-precision number format (FP8, INT8, or INT4) to fit more activations in GPU memory and lean on tensor-core throughput. The two dominant post-training schemes are GPTQ (layer-wise) and AWQ (activation-aware).

Question 6

What is tensor parallelism?

Accepted Answer

A model-parallel strategy. Shard each layer's weights across N GPUs and synchronise activations after every layer via NVLink or PCIe. It works well within a single node over NVLink. Across nodes it bleeds bandwidth fast.

Question 7

What is pipeline parallelism?

Accepted Answer

The other model-parallel strategy. Split the model into sequential layer groups, one group per GPU, and stream activations forward. Less bandwidth-intensive than tensor parallelism, but it introduces pipeline-bubble inefficiency at small batch sizes.

Question 8

What is a cold start in LLM serving?

Accepted Answer

What happens between a fresh inference process starting up and serving its first request: model weight loading from disk, CUDA context initialisation, torch.compile kernel compilation, KV-cache allocation. On large models the cold start can run several minutes. It's the cost driver behind every scale-to-zero serving strategy.

Question 9

What are tool calls (function calling) in the OpenAI API?

Accepted Answer

A request shape where the model is told about callable functions (name plus JSON Schema parameters) and can respond by emitting a function call for the client to execute, instead of returning free text. The OpenAI API standardised this in 2024 and compatible servers picked it up quickly.

Question 10

What is structured output in LLM inference?

Accepted Answer

A request shape that constrains the model's response to a JSON Schema. The server enforces the constraint either at decode time (constrained decoding) or after the fact (retry on schema violation). It removes a class of 'parse the LLM's JSON' code from the client.