Why This Is Asked
Decoding parameters are among the most commonly misunderstood LLM settings. Candidates who understand them can tune output quality for different tasks; those who don't set temperature to 0.7 everywhere and wonder why their structured outputs break.
Key Concepts to Cover
- Greedy decoding — always pick the highest-probability token
- Temperature — scaling the probability distribution
- Top-k sampling — restricting the candidate pool to k tokens
- Top-p (nucleus) sampling — dynamic probability-mass-based pool
- Beam search — keeping multiple candidate sequences
- Stopping criteria — how the model knows when to stop
How to Approach This
1. How the LLM Selects Tokens
At each step, the model outputs a probability distribution over its entire vocabulary (e.g., 100K tokens). The decoding strategy decides which token to pick from that distribution.
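To make this concrete, here is a minimal sketch of turning raw model scores (logits) into that probability distribution. The toy 4-token vocabulary and logit values are illustrative, not from any real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens with raw model scores (logits)
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
# probs sums to 1; the highest-logit token gets the highest probability
```

Every decoding strategy below is just a different rule for picking a token from `probs`.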
2. Greedy Decoding
Always pick the highest-probability token at each step.
- Pros: fast, deterministic, easy to reason about
- Cons: locally optimal choices can lead to globally poor outputs; tends to produce repetitive, flat text
- Use when: structured output tasks where you want the model's "best guess" (classification, extraction, SQL generation) with `temperature=0`
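Greedy decoding reduces to a one-line argmax over the distribution. A minimal sketch (the probabilities are illustrative):

```python
def greedy_pick(probs):
    """Greedy decoding: always choose the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

# With a clear winner, greedy always picks it, every time
probs = [0.1, 0.7, 0.15, 0.05]
token_id = greedy_pick(probs)  # deterministic: always index 1
```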
3. Temperature
Temperature scales the logits before the softmax, making the distribution more or less peaked:
- Temperature < 1 (e.g., 0.2): sharpens the distribution → model is more confident, less diverse, more repetitive
- Temperature = 1: no change to the distribution
- Temperature > 1 (e.g., 1.5): flattens the distribution → more random, more creative, higher risk of incoherence
- Temperature = 0: equivalent to greedy decoding
Rules of thumb:
- `0.0` for deterministic structured outputs (JSON, SQL, classification)
- `0.2-0.4` for factual Q&A, summarization, code generation
- `0.7-1.0` for creative writing, brainstorming, diverse outputs
- `>1.0` rarely useful in production; mostly for experimentation
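Temperature scaling is a small tweak to the softmax step: divide the logits by the temperature first. A sketch, with `temperature=0` handled as the greedy special case (the logit values are illustrative):

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    """
    if temperature == 0:
        # Degenerate case: all probability mass on the argmax token (greedy)
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = apply_temperature(logits, 0.2)  # near-greedy: top token dominates
flat = apply_temperature(logits, 1.5)   # closer to uniform: more diversity
```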
4. Top-K Sampling
Restrict the candidate pool to the top K highest-probability tokens, then sample from those.
- Prevents the model from picking very unlikely (incoherent) tokens
- Problem: K is a fixed number regardless of the shape of the distribution. Sometimes the top 50 tokens have very similar probabilities (a diverse candidate pool is appropriate); sometimes there's one clear winner and the next 49 are all bad choices.
- Less commonly used as a standalone setting than top-p
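A minimal sketch of top-k sampling: keep the k most probable tokens, renormalize over them, and sample. The helper name and probabilities are illustrative:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k highest-probability tokens, renormalize, and sample one."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    r = rng.random() * mass  # sample a point in the truncated mass
    for i in top:
        r -= probs[i]
        if r <= 0:
            return i
    return top[-1]

# With k=2, only the two most likely tokens (indices 0 and 1) can be sampled
token_id = top_k_sample([0.5, 0.3, 0.1, 0.1], k=2)
```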
5. Top-P (Nucleus) Sampling
Select the smallest set of tokens whose cumulative probability mass ≥ p, then sample from that set.
- If `p=0.9`: include tokens until their cumulative probability reaches 90%
- Dynamically adjusts the pool size to the shape of the distribution
- When the model is confident (one token has 90% probability), the pool is tiny
- When the model is uncertain, the pool is larger and the output is more diverse
- Much better than top-k for most production use cases
Common defaults:
- `top_p=1.0` (no truncation) with `temperature=0.7` for general tasks
- `top_p=0.9` with `temperature=1.0` for creative tasks
- For deterministic outputs, use `temperature=0` and ignore top-p
Top-p and temperature interact: if you lower temperature, the distribution is already peaked, so top-p has less effect. Don't tune both aggressively at the same time.
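The dynamic-pool behavior described above can be sketched in a few lines. The two test distributions are illustrative: one confident, one flat:

```python
def nucleus_pool(probs, p):
    """Return the smallest set of token indices whose cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    pool, cum = [], 0.0
    for i in order:
        pool.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return pool

# Confident distribution: one dominant token, so the pool is tiny
confident = nucleus_pool([0.92, 0.04, 0.03, 0.01], p=0.9)  # just index 0
# Flat distribution: the pool grows to cover 90% of the mass
flat = nucleus_pool([0.25, 0.25, 0.25, 0.25], p=0.9)       # all four tokens
```

Sampling then proceeds over the returned pool (renormalized), exactly as in top-k, but the pool size adapts per step.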
6. Beam Search
Instead of picking one token at each step, keep the top B sequences (beams) in parallel and score each full sequence.
- Produces more globally coherent sequences than greedy
- Common in translation and summarization (older research usage)
- Rarely used with modern instruction-tuned LLMs — sampling-based methods work better for open-ended generation
- Computationally expensive (B× more memory and compute)
- Some LLM APIs don't expose this parameter
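A toy sketch of beam search over a hypothetical next-token function (the `toy_model` transition probabilities are fabricated to show greedy failing while beam search succeeds):

```python
import math

def beam_search(next_logprobs, start, steps, beam_width):
    """Keep the beam_width best partial sequences at each step.

    next_logprobs(seq) -> {token: log_prob} for the next token;
    a stand-in for a real language model head.
    """
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Rank all expansions by total sequence score; keep the top B
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy model where the locally best first token ("a") leads to a dead end,
# so the greedy path scores 0.6 * 0.1 = 0.06 while "b" -> "x" scores 0.36
def toy_model(seq):
    if seq[-1] == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if seq[-1] == "a":
        return {"x": math.log(0.1), "y": math.log(0.1)}
    return {"x": math.log(0.9), "y": math.log(0.1)}

best_seq, best_score = beam_search(toy_model, "<s>", steps=2, beam_width=2)
# beam search recovers the globally better sequence <s> b x
```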
7. Stopping Criteria
The model generates until a stop condition is met:
- EOS token: the model generates its end-of-sequence token (natural completion)
- Stop sequences: caller specifies strings that terminate generation (e.g., `["\n", "###", "</response>"]`)
- Max tokens: hard cutoff on output length
- Custom logic: caller inspects the stream and stops based on application logic
Production tip: always set explicit stop sequences for structured outputs. If you're generating JSON, stop on `}` (or better, validate and stop when the JSON closes). If you're generating one answer per turn, stop on a newline. Don't rely solely on the EOS token.
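Client-side, stop-sequence handling over a streamed response looks roughly like this. The token strings and helper name are illustrative, not any particular API:

```python
def stream_with_stops(token_stream, stop_sequences, max_tokens):
    """Accumulate streamed tokens, halting on a stop sequence or max_tokens."""
    out = ""
    for i, tok in enumerate(token_stream):
        out += tok
        for stop in stop_sequences:
            if stop in out:
                # Trim everything from the stop sequence onward
                return out[:out.index(stop)]
        if i + 1 >= max_tokens:
            return out  # hard cutoff: output may be incomplete
    return out  # stream ended naturally (EOS)

# Generation halts at the newline; the trailing "extra" token is never kept
tokens = ['{"name"', ': "a"}', "\n", "extra"]
result = stream_with_stops(tokens, stop_sequences=["\n"], max_tokens=100)
```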
Common Follow-ups
- "Why does `temperature=0` not guarantee identical outputs every time?" At `temperature=0` the model is greedy, so identical outputs are expected for identical inputs, but GPU floating-point non-determinism and changes in KV cache state can cause minor variations. Most APIs guarantee determinism only via a `seed` parameter.
- "What's the difference between `max_tokens` and context window length?" `max_tokens` limits the output length; the context window limits input + output combined. A 128K context window means you can have 120K input tokens and 8K output tokens, but setting `max_tokens=4096` means the output stops at 4K regardless of the context window.
- "How do you prevent the model from stopping mid-sentence because it hit `max_tokens`?" Set `max_tokens` generously for the task, or use a post-processing step that detects incomplete sentences and re-prompts. For structured outputs, use a JSON schema enforcement library (Outlines, Instructor) that constrains generation to valid syntax throughout.
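The truncation check in the last answer can be sketched with a simple heuristic. This assumes an OpenAI-style `finish_reason` field that reports `"length"` when the hard cutoff was hit; the function name and the punctuation heuristic are illustrative:

```python
def looks_truncated(text, finish_reason):
    """Heuristic check for output cut off by max_tokens.

    finish_reason: "length" means the API hit the token limit;
    "stop" means it ended on EOS or a stop sequence.
    """
    if finish_reason == "length":
        return True  # the API reported the hard cutoff directly
    # Fallback heuristic: completed prose/code usually ends on one of these
    return not text.rstrip().endswith((".", "!", "?", "```", "}"))

complete = looks_truncated("The answer is 42.", "stop")   # False
cut_off = looks_truncated("The answer is", "length")      # True
```

When `looks_truncated` fires, the caller can re-prompt with the partial output as context, or retry with a larger `max_tokens`.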