Why This Is Asked
Running LLMs at scale is expensive. At Google, Meta, and NVIDIA, inference optimization is a core engineering challenge — not just a research problem. This question separates engineers who understand what's happening inside the serving stack from those who just call API endpoints.
Key Concepts to Cover
- KV cache — what it is, how it works, how to size it
- Quantization — INT8/INT4 precision tradeoffs
- Continuous batching — the key technique for GPU utilization
- Speculative decoding — using a small model to draft tokens for a large model
- Throughput vs. latency — they conflict, and production systems must choose
How to Approach This
1. The Inference Bottleneck
LLM inference has two phases:
- Prefill: process all input tokens in parallel → compute-bound (GPU utilization is high)
- Decode: generate output tokens one at a time → memory-bandwidth-bound (GPU utilization is low per step)
The decode phase dominates total time for long outputs. Most optimization efforts target it.
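The decode bottleneck can be sanity-checked with back-of-envelope arithmetic: each decode step must stream essentially all model weights from HBM, so per-token latency is floored by weight bytes divided by memory bandwidth. A minimal sketch (my simplification; it ignores KV cache reads and kernel overheads, and the ~2 TB/s A100 bandwidth figure is approximate):

```python
def decode_latency_floor_ms(params_billion: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode latency, in milliseconds."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return model_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

# 70B model in FP16 on an A100 (~2.0 TB/s HBM bandwidth):
print(decode_latency_floor_ms(70, 2, 2000))  # 70.0 ms/token, i.e. ~14 tokens/s
```

This is why decode utilization is low: the GPU's compute units mostly wait on memory, which is also why batching more requests per forward pass recovers so much throughput.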
2. KV Cache
During decode, the model would recompute attention over all previous tokens at every step — unless we cache the key and value matrices (KV cache). With caching, each new token computes only its own K and V and attends over the cached keys and values of all prior tokens.
KV cache size formula (per token, per layer):
bytes = 2 (K and V) × num_heads × head_dim × 2 bytes (FP16) × num_layers
For a 70B model (80 layers, 64 heads, 128 head_dim, full multi-head attention):
Per token: 2 × 64 × 128 × 2 bytes × 80 layers ≈ 2.6 MB per token in cache
For a 2048-token sequence: ~5.4 GB just for the KV cache — before model weights. (Grouped-query attention, which shares KV heads across query heads, shrinks this considerably.)
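The sizing formula above translates directly into a few lines of Python (the leading 2 covers both the K and the V matrix):

```python
def kv_cache_bytes_per_token(num_layers: int, num_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache footprint per token: 2 (K and V) x heads x head_dim x bytes x layers."""
    return 2 * num_heads * head_dim * dtype_bytes * num_layers

per_token = kv_cache_bytes_per_token(num_layers=80, num_heads=64, head_dim=128)
print(per_token / 1e6)         # ~2.6 MB per token
print(per_token * 2048 / 1e9)  # ~5.4 GB for a 2048-token sequence
```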
Implications:
- Long context = large KV cache = less GPU memory for batching
- Context window size and batch size are in direct tension
- PagedAttention (used in vLLM) manages KV cache like virtual memory pages, dramatically improving GPU memory utilization
3. Quantization
Reduce model weight precision to use less memory and speed up matrix multiplications:
| Precision | Memory vs FP16 | Quality impact |
|---|---|---|
| FP16 | 1× (baseline) | None |
| INT8 | 0.5× | Minimal (< 1% quality loss) |
| INT4 | 0.25× | Small (1-3% quality loss) |
| GPTQ/AWQ INT4 | 0.25× | Better than naive INT4 |
When to use:
- INT8: default for most production self-hosted deployments; virtually no quality loss
- INT4: when GPU memory is the bottleneck; use GPTQ or AWQ (weight-only quantization) for better quality than naive int4
- Avoid quantizing the KV cache unless necessary — it affects every request
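The memory column of the table follows from simple arithmetic, shown here for a 70B-parameter model:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Model weight footprint in GB at a given per-parameter bit width."""
    return params_billion * 1e9 * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70, bits):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

At INT4, a 70B model fits on a single 40GB+ GPU that could not hold it in FP16 — the practical motivation for weight-only quantization.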
4. Continuous Batching
Naive static batching: wait until a batch is full, process all, return all. Problem: one slow (long) request blocks all others in the batch from completing.
Continuous batching: process tokens from multiple requests in the same forward pass, and let requests join/leave the batch as they complete. Requests with different lengths run concurrently; finished requests free their slot for new ones.
This is the single biggest throughput improvement for production serving:
- vLLM, TGI (text-generation-inference), TensorRT-LLM all implement continuous batching
- Typical throughput improvement: 5-10× over static batching for mixed-length requests
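The static-vs-continuous difference can be seen in a toy simulation. This is a sketch of the scheduling idea only — one "step" is one forward pass generating one token per active request, and real schedulers also account for prefill cost and memory limits:

```python
from collections import deque

def continuous_batching(requests: list[int], max_batch: int) -> int:
    """requests: output length (in tokens) per request. Returns total steps."""
    pending = deque(requests)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or active:
        # Admit new requests into any free slots before each forward pass.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]  # finished requests leave

    return steps

def static_batching(requests: list[int], max_batch: int) -> int:
    """Each full batch runs until its longest request completes."""
    return sum(max(requests[i:i + max_batch])
               for i in range(0, len(requests), max_batch))

reqs = [100, 10, 10, 10, 100, 10, 10, 10]
print(continuous_batching(reqs, 4), static_batching(reqs, 4))  # 110 200
```

With mixed lengths, the short requests stop paying for the long ones: the toy workload finishes in 110 steps instead of 200.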
5. Speculative Decoding
Use a small, fast "draft" model to speculatively generate multiple tokens, then verify them in parallel with the large model:
- Small model generates K tokens (e.g., K=4) very quickly
- Large model evaluates all K tokens in a single forward pass (parallel, fast)
- Accept tokens up to the first one the large model disagrees with
- For long matches, you get K tokens at the cost of ~1-1.5 forward passes
Result: 2-3× latency reduction when the draft model's guesses are often correct (typically on predictable/routine text).
Limitation: the small model must share the same tokenizer and vocabulary as the large model. Works best when output is somewhat predictable (code, structured text); less effective for highly creative generation.
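The draft-then-verify loop can be sketched as follows. This is a simplified greedy variant; `draft_next` and `target_next` are hypothetical stand-ins for the two models, and a real implementation verifies all K positions in one batched forward pass rather than K separate calls:

```python
def speculative_step(draft_next, target_next, prefix: list[int], k: int = 4) -> list[int]:
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target model checks each proposed position (batched in practice).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # substitute first mismatch, then stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all k matched: one bonus token
    return accepted
```

When the draft agrees with the target on all K positions, a single verification pass yields K+1 tokens; at the first disagreement the target's own token is kept, so output is always identical to what the target alone would produce.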
6. Throughput vs. Latency Tradeoff
These are in direct conflict:
- Higher batch size → higher throughput (tokens/sec), higher latency per request
- Lower batch size → lower latency, lower throughput
- Longer sequences → better model outputs, more KV cache memory, fewer concurrent requests
Production serving strategy:
- Interactive chat: optimize for latency (small batches, streaming)
- Batch processing (document summarization, offline evals): optimize for throughput (large batches)
- Mixed workloads: separate queues with different SLA targets; route by request type
7. Additional Techniques
- Flash Attention: memory-efficient attention that avoids materializing the full attention matrix; reduces memory and speeds up prefill
- Prefix caching: if many requests share the same system prompt, cache the KV for that prefix and skip recomputing it (supported in OpenAI API, Anthropic, vLLM)
- Model parallelism: for very large models, split across GPUs (tensor parallelism for within-layer, pipeline parallelism for across-layer)
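Prefix caching reduces to memoizing prefill by exact prefix. A hypothetical sketch — `prefill` here stands in for the real engine call that produces KV state, and real systems key on token IDs and evict under memory pressure:

```python
prefix_cache: dict[str, object] = {}

def get_prefix_state(prefix: str, prefill):
    """Run prefill once per distinct prefix; later requests reuse the KV state."""
    if prefix not in prefix_cache:
        prefix_cache[prefix] = prefill(prefix)  # compute KV state once
    return prefix_cache[prefix]

calls = []
fake_prefill = lambda p: calls.append(p) or "kv-state"
get_prefix_state("You are a helpful assistant.", fake_prefill)
get_prefix_state("You are a helpful assistant.", fake_prefill)
print(len(calls))  # 1 — prefill ran only once for the shared system prompt
```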
Common Follow-ups
- "How do you calculate the maximum batch size for a given model and GPU?" GPU memory = model weights + KV cache per request × batch size + activations. With model weights fixed, remaining memory ÷ KV cache per sequence = max batch size. For a 7B model (14GB FP16) on an A100 80GB: ~66GB free → at 2GB KV per sequence → max batch ~33. In practice, lower due to fragmentation and activations.
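That budget calculation is worth having ready as a one-liner. A sketch with an assumed 10% reservation for activations and fragmentation (the reservation fraction is my illustrative choice, not a standard value):

```python
def max_batch_size(gpu_mem_gb: float, weights_gb: float,
                   kv_per_seq_gb: float, overhead_frac: float = 0.1) -> int:
    """Rough ceiling on concurrent sequences after reserving overhead memory."""
    usable = gpu_mem_gb * (1 - overhead_frac) - weights_gb
    return int(usable // kv_per_seq_gb)

print(max_batch_size(80, 14, 2))  # 29 — close to the ideal ~33, minus overhead
```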
- "When does speculative decoding not help?" When output is highly unpredictable (creative writing, multilingual content, complex reasoning) and the draft model is frequently wrong. If acceptance rate drops below ~50%, speculative decoding may add overhead without latency benefit.
- "How do you serve a model that's larger than a single GPU?" Tensor parallelism: split each weight matrix across N GPUs; each GPU computes a slice and all-reduces. Reduces memory per GPU but requires inter-GPU communication. Pipeline parallelism: assign different layers to different GPUs; a request flows through them sequentially. Less communication overhead but adds pipeline latency. Most production systems use both (tensor parallel within a node, pipeline parallel across nodes).