Why This Is Asked
Decoding parameters are among the most commonly misunderstood LLM settings. Candidates who understand them can tune output quality for different tasks; those who don't set temperature to 0.7 everywhere and wonder why their structured outputs break.
Key Concepts to Cover
- Greedy decoding — always pick the highest-probability token
- Temperature — scaling the probability distribution
- Top-k sampling — restricting the candidate pool to k tokens
- Top-p (nucleus) sampling — dynamic probability-mass-based pool
- Beam search — keeping multiple candidate sequences
- Stopping criteria — how the model knows when to stop
How to Approach This
1. How the LLM Selects Tokens
At each step, the model outputs a probability distribution over its entire vocabulary (e.g., 100K tokens). The decoding strategy decides which token to pick from that distribution.
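To make this concrete, here is a minimal sketch of turning raw model scores (logits) into that probability distribution. The toy 4-token vocabulary and logit values are illustrative, not from any real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens with raw model scores (logits)
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
# probs sums to 1; the highest-logit token gets the highest probability
```

Every decoding strategy below is just a different rule for picking a token from `probs`.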
2. Greedy Decoding
Always pick the highest-probability token at each step.
- Pros: fast, deterministic, easy to reason about
- Cons: locally optimal choices can lead to globally poor outputs; tends to produce repetitive, flat text
- Use when: structured output tasks where you want the model's "best guess" (classification, extraction, SQL generation) with `temperature=0`
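Greedy decoding reduces to a one-line argmax over the distribution. A minimal sketch (the probabilities are illustrative):

```python
def greedy_pick(probs):
    """Greedy decoding: always choose the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# With a clear winner, greedy always picks it, every time
probs = [0.1, 0.7, 0.15, 0.05]
token_id = greedy_pick(probs)  # deterministic: always index 1
```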
3. Temperature
Temperature scales the logits before the softmax, making the distribution more or less peaked:
- Temperature < 1 (e.g., 0.2): sharpens the distribution → model is more confident, less diverse, more repetitive
- Temperature = 1: no change to the distribution
- Temperature > 1 (e.g., 1.5): flattens the distribution → more random, more creative, higher risk of incoherence
- Temperature = 0: equivalent to greedy decoding
Rules of thumb:
- `0.0` for deterministic structured outputs (JSON, SQL, classification)
- `0.2-0.4` for factual Q&A, summarization, code generation
- `0.7-1.0` for creative writing, brainstorming, diverse outputs
- `>1.0` rarely useful in production; mostly for experimentation
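Temperature scaling is a small tweak to the softmax step: divide the logits by the temperature first. A sketch, with `temperature=0` handled as the greedy special case (the logit values are illustrative):

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    """
    if temperature == 0:
        # Degenerate case: all probability mass on the argmax token (greedy)
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(logits, 0.2)  # near-greedy: top token dominates
flat = apply_temperature(logits, 1.5)   # closer to uniform: more diversity
```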
4. Top-K Sampling
Restrict the candidate pool to the top K highest-probability tokens, then sample from those.
- Prevents the model from picking very unlikely (incoherent) tokens
- Problem: K is a fixed number regardless of the shape of the distribution. Sometimes the top 50 tokens have very similar probabilities (a diverse candidate pool is appropriate); sometimes there's one clear winner and the next 49 are all bad choices.
- Less commonly used as a standalone setting than top-p
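A minimal sketch of top-k sampling: keep the k most probable tokens, renormalize over them, and sample. The helper name and probabilities are illustrative:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k highest-probability tokens, renormalize, and sample one."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    r = rng.random() * mass  # sample a point in the truncated mass
    for i in top:
        r -= probs[i]
        if r <= 0:
            return i
    return top[-1]

# With k=2, only the two most likely tokens (indices 0 and 1) can be sampled
token_id = top_k_sample([0.5, 0.3, 0.1, 0.1], k=2)
```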
5. Top-P (Nucleus) Sampling
Select the smallest set of tokens whose cumulative probability mass ≥ p, then sample from that set.
- If `p=0.9`: include tokens until their cumulative probability reaches 90%
- Dynamically adjusts the pool size to the shape of the distribution
- When the model is confident (one token has 90% probability), the pool is tiny
- When the model is uncertain, the pool is larger and the output is more diverse
- Much better than top-k for most production use cases
Common defaults:
- `top_p=1.0` (no truncation) with `temperature=0.7` for general tasks
- `top_p=0.9` with `temperature=1.0` for creative tasks
- For deterministic outputs, use `temperature=0` and ignore top-p
Top-p and temperature interact: if you lower temperature, the distribution is already peaked, so top-p has less effect. Don't tune both aggressively at the same time.
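The dynamic-pool behavior described above can be sketched in a few lines. The two test distributions are illustrative: one confident, one flat:

```python
def nucleus_pool(probs, p):
    """Return the smallest set of token indices whose cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    pool, cum = [], 0.0
    for i in order:
        pool.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return pool

# Confident distribution: one dominant token, so the pool is tiny
confident = nucleus_pool([0.92, 0.04, 0.03, 0.01], p=0.9)  # just index 0
# Flat distribution: the pool grows to cover 90% of the mass
flat = nucleus_pool([0.25, 0.25, 0.25, 0.25], p=0.9)       # all four tokens
```

Sampling then proceeds over the returned pool (renormalized), exactly as in top-k, but the pool size adapts per step.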
6. Beam Search
Instead of picking one token at each step, keep the top B sequences (beams) in parallel and score each full sequence.
- Produces more globally coherent sequences than greedy
- Common in translation and summarization (older research usage)
- Rarely used with modern instruction-tuned LLMs — sampling-based methods work better for open-ended generation
- Computationally expensive (B× more memory and compute)
- Some LLM APIs don't expose this parameter
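A toy sketch of beam search over a hypothetical next-token function (the `toy_model` transition probabilities are fabricated to show greedy failing while beam search succeeds):

```python
import math

def beam_search(next_logprobs, start, steps, beam_width):
    """Keep the beam_width best partial sequences at each step.

    next_logprobs(seq) -> {token: log_prob} for the next token;
    a stand-in for a real language model head.
    """
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Rank all expansions by total sequence score; keep the top B
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy model where the locally best first token ("a") leads to a dead end,
# so the greedy path scores 0.6 * 0.1 = 0.06 while "b" -> "x" scores 0.36
def toy_model(seq):
    if seq[-1] == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if seq[-1] == "a":
        return {"x": math.log(0.1), "y": math.log(0.1)}
    return {"x": math.log(0.9), "y": math.log(0.1)}

best_seq, best_score = beam_search(toy_model, "<s>", steps=2, beam_width=2)
# beam search recovers the globally better sequence <s> b x
```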
7. Stopping Criteria
The model generates until a stop condition is met:
- EOS token: the model generates its end-of-sequence token (natural completion)
- Stop sequences: caller specifies strings that terminate generation (e.g., `["\n", "###", "</response>"]`)
- Max tokens: hard cutoff on output length
- Custom logic: caller inspects the stream and stops based on application logic
Production tip: always set explicit stop sequences for structured outputs. If you're generating JSON, stop on `}` (or better, validate and stop when the JSON closes). If you're generating one answer per turn, stop on a newline. Don't rely solely on the EOS token.
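Client-side, stop-sequence handling over a streamed response looks roughly like this. The token strings and helper name are illustrative, not any particular API:

```python
def stream_with_stops(token_stream, stop_sequences, max_tokens):
    """Accumulate streamed tokens, halting on a stop sequence or max_tokens."""
    out = ""
    for i, tok in enumerate(token_stream):
        out += tok
        for stop in stop_sequences:
            if stop in out:
                # Trim everything from the stop sequence onward
                return out[:out.index(stop)]
        if i + 1 >= max_tokens:
            return out  # hard cutoff: output may be incomplete
    return out  # stream ended naturally (EOS)

# Generation halts at the newline; the trailing "extra" token is never kept
tokens = ['{"name"', ': "a"}', "\n", "extra"]
result = stream_with_stops(tokens, stop_sequences=["\n"], max_tokens=100)
```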
Common Follow-ups
- "Why does `temperature=0` not guarantee identical outputs every time?" At `temperature=0` the model is greedy, so identical outputs are expected for identical inputs, but GPU floating-point non-determinism and changes in KV cache state can cause minor variations. Most APIs guarantee determinism only via a `seed` parameter.
- "What's the difference between `max_tokens` and context window length?" `max_tokens` limits the output length; the context window limits input + output combined. A 128K context window means you can have 120K input tokens and 8K output tokens, but setting `max_tokens=4096` means the output stops at 4K regardless of the context window.
- "How do you prevent the model from stopping mid-sentence because it hit `max_tokens`?" Set `max_tokens` generously for the task, or use a post-processing step that detects incomplete sentences and re-prompts. For structured outputs, use a JSON schema enforcement library (Outlines, Instructor) that constrains generation to valid syntax throughout.
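The truncation check in the last answer can be sketched with a simple heuristic. This assumes an OpenAI-style `finish_reason` field that reports `"length"` when the hard cutoff was hit; the function name and the punctuation heuristic are illustrative:

```python
def looks_truncated(text, finish_reason):
    """Heuristic check for output cut off by max_tokens.

    finish_reason: "length" means the API hit the token limit;
    "stop" means it ended on EOS or a stop sequence.
    """
    if finish_reason == "length":
        return True  # the API reported the hard cutoff directly
    # Fallback heuristic: completed prose/code usually ends on one of these
    return not text.rstrip().endswith((".", "!", "?", "```", "}"))

complete = looks_truncated("The answer is 42.", "stop")   # False
cut_off = looks_truncated("The answer is", "length")      # True
```

When `looks_truncated` fires, the caller can re-prompt with the partial output as context, or retry with a larger `max_tokens`.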