Why This Is Asked
Running LLMs at scale is expensive. At Google, Meta, and NVIDIA, inference optimization is a core engineering challenge — not just a research problem. This question separates engineers who understand what's happening inside the serving stack from those who just call API endpoints.
Key Concepts to Cover
- KV cache — what it is, how it works, how to size it
- Quantization — INT8/INT4 precision tradeoffs
- Continuous batching — the key technique for GPU utilization
- Speculative decoding — using a small model to draft tokens for a large model
- Throughput vs. latency — they conflict, and production systems must choose
How to Approach This
1. The Inference Bottleneck
LLM inference has two phases:
- Prefill: process all input tokens in parallel → compute-bound (GPU utilization is high)
- Decode: generate output tokens one at a time → memory-bandwidth-bound (GPU utilization is low per step)
The decode phase dominates total time for long outputs. Most optimization efforts target it.
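The decode bottleneck can be sanity-checked with back-of-envelope arithmetic: each decode step must stream essentially all model weights from HBM, so per-token latency is floored by weight bytes divided by memory bandwidth. A minimal sketch (my simplification; it ignores KV cache reads and kernel overheads, and the ~2 TB/s A100 bandwidth figure is approximate):

```python
def decode_latency_floor_ms(params_billion: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode latency, in milliseconds."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return model_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

# 70B model in FP16 on an A100 (~2.0 TB/s HBM bandwidth):
print(decode_latency_floor_ms(70, 2, 2000))  # 70.0 ms/token, i.e. ~14 tokens/s
```

This is why decode utilization is low: the GPU's compute units mostly wait on memory, which is also why batching more requests per forward pass recovers so much throughput.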
2. KV Cache
During decode, the model would recompute attention over all previous tokens at every step — unless we cache the key and value matrices (KV cache). With caching, each new token computes only its own K and V and attends over the cached keys and values of all prior tokens.
KV cache size formula (per token, per layer):
bytes = 2 (K and V) × num_heads × head_dim × 2 bytes (FP16) × num_layers
For a 70B model (80 layers, 64 heads, 128 head_dim, full multi-head attention):
Per token: 2 × 64 × 128 × 2 bytes × 80 layers ≈ 2.6 MB per token in cache
For a 2048-token sequence: ~5.4 GB just for the KV cache — before model weights. (Grouped-query attention, which shares KV heads across query heads, shrinks this considerably.)
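The sizing formula above translates directly into a few lines of Python (the leading 2 covers both the K and the V matrix):

```python
def kv_cache_bytes_per_token(num_layers: int, num_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache footprint per token: 2 (K and V) x heads x head_dim x bytes x layers."""
    return 2 * num_heads * head_dim * dtype_bytes * num_layers

per_token = kv_cache_bytes_per_token(num_layers=80, num_heads=64, head_dim=128)
print(per_token / 1e6)         # ~2.6 MB per token
print(per_token * 2048 / 1e9)  # ~5.4 GB for a 2048-token sequence
```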
Implications:
- Long context = large KV cache = less GPU memory for batching
- Context window size and batch size are in direct tension
- PagedAttention (used in vLLM) manages KV cache like virtual memory pages, dramatically improving GPU memory utilization
3. Quantization
Reduce model weight precision to use less memory and speed up matrix multiplications:
| Precision | Memory vs FP16 | Quality impact |
|---|---|---|
| FP16 | 1× (baseline) | None |
| INT8 | 0.5× | Minimal (< 1% quality loss) |
| INT4 | 0.25× | Small (1-3% quality loss) |
| GPTQ/AWQ INT4 | 0.25× | Better than naive INT4 |
When to use:
- INT8: default for most production self-hosted deployments; virtually no quality loss
- INT4: when GPU memory is the bottleneck; use GPTQ or AWQ (weight-only quantization) for better quality than naive int4
- Avoid quantizing the KV cache unless necessary — it affects every request
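The memory column of the table follows from simple arithmetic, shown here for a 70B-parameter model:

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Model weight footprint in GB at a given per-parameter bit width."""
    return params_billion * 1e9 * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70, bits):.0f} GB")
# FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

At INT4, a 70B model fits on a single 40GB+ GPU that could not hold it in FP16 — the practical motivation for weight-only quantization.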
4. Continuous Batching
Naive static batching: wait until a batch is full, process all, return all. Problem: one slow (long) request blocks all others in the batch from completing.
Continuous batching: process tokens from multiple requests in the same forward pass, and let requests join/leave the batch as they complete. Requests with different lengths run concurrently; finished requests free their slot for new ones.
This is the single biggest throughput improvement for production serving:
- vLLM, TGI (text-generation-inference), TensorRT-LLM all implement continuous batching
- Typical throughput improvement: 5-10× over static batching for mixed-length requests
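The static-vs-continuous difference can be seen in a toy simulation. This is a sketch of the scheduling idea only — one "step" is one forward pass generating one token per active request, and real schedulers also account for prefill cost and memory limits:

```python
from collections import deque

def continuous_batching(requests: list[int], max_batch: int) -> int:
    """requests: output length (in tokens) per request. Returns total steps."""
    pending = deque(requests)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while pending or active:
        # Admit new requests into any free slots before each forward pass.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]  # finished requests leave

    return steps

def static_batching(requests: list[int], max_batch: int) -> int:
    """Each full batch runs until its longest request completes."""
    return sum(max(requests[i:i + max_batch])
               for i in range(0, len(requests), max_batch))

reqs = [100, 10, 10, 10, 100, 10, 10, 10]
print(continuous_batching(reqs, 4), static_batching(reqs, 4))  # 110 200
```

With mixed lengths, the short requests stop paying for the long ones: the toy workload finishes in 110 steps instead of 200.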
5. Speculative Decoding
Use a small, fast "draft" model to speculatively generate multiple tokens, then verify them in parallel with the large model:
- Small model generates K tokens (e.g., K=4) very quickly
- Large model evaluates all K tokens in a single forward pass (parallel, fast)
- Accept tokens up to the first one the large model disagrees with
- For long matches, you get K tokens at the cost of ~1-1.5 forward passes
Result: 2-3× latency reduction when the draft model's guesses are often correct (typically on predictable/routine text).
Limitation: the small model must share the same tokenizer and vocabulary as the large model. Works best when output is somewhat predictable (code, structured text); less effective for highly creative generation.
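The draft-then-verify loop can be sketched as follows. This is a simplified greedy variant; `draft_next` and `target_next` are hypothetical stand-ins for the two models, and a real implementation verifies all K positions in one batched forward pass rather than K separate calls:

```python
def speculative_step(draft_next, target_next, prefix: list[int], k: int = 4) -> list[int]:
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. Target model checks each proposed position (batched in practice).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expected = target_next(ctx)
        if t != expected:
            accepted.append(expected)  # substitute first mismatch, then stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))  # all k matched: one bonus token
    return accepted
```

When the draft agrees with the target on all K positions, a single verification pass yields K+1 tokens; at the first disagreement the target's own token is kept, so output is always identical to what the target alone would produce.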
6. Throughput vs. Latency Tradeoff
These are in direct conflict:
- Higher batch size → higher throughput (tokens/sec), higher latency per request
- Lower batch size → lower latency, lower throughput
- Longer sequences → better model outputs, more KV cache memory, fewer concurrent requests
Production serving strategy:
- Interactive chat: optimize for latency (small batches, streaming)
- Batch processing (document summarization, offline evals): optimize for throughput (large batches)
- Mixed workloads: separate queues with different SLA targets; route by request type
7. Additional Techniques
- Flash Attention: memory-efficient attention that avoids materializing the full attention matrix; reduces memory and speeds up prefill
- Prefix caching: if many requests share the same system prompt, cache the KV for that prefix and skip recomputing it (supported in OpenAI API, Anthropic, vLLM)
- Model parallelism: for very large models, split across GPUs (tensor parallelism for within-layer, pipeline parallelism for across-layer)
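Prefix caching reduces to memoizing prefill by exact prefix. A hypothetical sketch — `prefill` here stands in for the real engine call that produces KV state, and real systems key on token IDs and evict under memory pressure:

```python
prefix_cache: dict[str, object] = {}

def get_prefix_state(prefix: str, prefill):
    """Run prefill once per distinct prefix; later requests reuse the KV state."""
    if prefix not in prefix_cache:
        prefix_cache[prefix] = prefill(prefix)  # compute KV state once
    return prefix_cache[prefix]

calls = []
fake_prefill = lambda p: calls.append(p) or "kv-state"
get_prefix_state("You are a helpful assistant.", fake_prefill)
get_prefix_state("You are a helpful assistant.", fake_prefill)
print(len(calls))  # 1 — prefill ran only once for the shared system prompt
```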
Common Follow-ups
- "How do you calculate the maximum batch size for a given model and GPU?" GPU memory = model weights + KV cache per request × batch size + activations. With model weights fixed, remaining memory ÷ KV cache per sequence = max batch size. For a 7B model (14GB FP16) on an A100 80GB: ~66GB free → at 2GB KV per sequence → max batch ~33. In practice, lower due to fragmentation and activations.
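That budget calculation is worth having ready as a one-liner. A sketch with an assumed 10% reservation for activations and fragmentation (the reservation fraction is my illustrative choice, not a standard value):

```python
def max_batch_size(gpu_mem_gb: float, weights_gb: float,
                   kv_per_seq_gb: float, overhead_frac: float = 0.1) -> int:
    """Rough ceiling on concurrent sequences after reserving overhead memory."""
    usable = gpu_mem_gb * (1 - overhead_frac) - weights_gb
    return int(usable // kv_per_seq_gb)

print(max_batch_size(80, 14, 2))  # 29 — close to the ideal ~33, minus overhead
```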
- "When does speculative decoding not help?" When output is highly unpredictable (creative writing, multilingual content, complex reasoning) and the draft model is frequently wrong. If acceptance rate drops below ~50%, speculative decoding may add overhead without latency benefit.
- "How do you serve a model that's larger than a single GPU?" Tensor parallelism: split each weight matrix across N GPUs; each GPU computes a slice and all-reduces. Reduces memory per GPU but requires inter-GPU communication. Pipeline parallelism: assign different layers to different GPUs; a request flows through them sequentially. Less communication overhead but adds pipeline latency. Most production systems use both (tensor parallel within a node, pipeline parallel across nodes).