Beginner · 5 min read

What Are LLM Decoding Strategies, and When Do You Use Each?

Explain how LLMs select output tokens — covering temperature, top-k, top-p nucleus sampling, greedy decoding, and stopping criteria — and when each strategy is appropriate.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Decoding parameters are among the most commonly misunderstood LLM settings. Candidates who understand them can tune output quality for different tasks; those who don't set temperature to 0.7 everywhere and wonder why their structured outputs break.

Key Concepts to Cover

  • Greedy decoding — always pick the highest-probability token
  • Temperature — scaling the probability distribution
  • Top-k sampling — restricting the candidate pool to k tokens
  • Top-p (nucleus) sampling — dynamic probability-mass-based pool
  • Beam search — keeping multiple candidate sequences
  • Stopping criteria — how the model knows when to stop

How to Approach This

1. How the LLM Selects Tokens

At each step, the model outputs a probability distribution over its entire vocabulary (e.g., 100K tokens). The decoding strategy decides which token to pick from that distribution.
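This step can be sketched in plain Python. The four-token "vocabulary" and logits below are toy stand-ins for a real model's output:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens instead of ~100K
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
# probs sums to 1.0; the decoding strategy then chooses one token from it
```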

2. Greedy Decoding

Always pick the highest-probability token at each step.

  • Pros: fast, deterministic, easy to reason about
  • Cons: locally optimal choices can lead to globally poor outputs; tends to produce repetitive, flat text
  • Use when: structured output tasks where you want the model's "best guess" (classification, extraction, SQL generation) with temperature=0
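As a sketch, greedy decoding is just an argmax over the distribution (toy probabilities for illustration):

```python
def greedy_pick(probs):
    """Greedy decoding: always take the index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])

probs = [0.05, 0.7, 0.2, 0.05]
# Deterministic: the same distribution always yields the same token
assert greedy_pick(probs) == 1
```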

3. Temperature

Temperature scales the logits before the softmax, making the distribution more or less peaked:

  • Temperature < 1 (e.g., 0.2): sharpens the distribution → model is more confident, less diverse, more repetitive
  • Temperature = 1: no change to the distribution
  • Temperature > 1 (e.g., 1.5): flattens the distribution → more random, more creative, higher risk of incoherence
  • Temperature = 0: equivalent to greedy decoding
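A minimal sketch of temperature scaling in plain Python (toy logits; the T=0 branch encodes the greedy convention):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/T before the softmax; T=0 degenerates to greedy."""
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
low = softmax_with_temperature(logits, 0.2)      # sharpened: top token dominates
high = softmax_with_temperature(logits, 1.5)     # flattened: mass spreads out
```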

Rules of thumb:

  • 0.0 for deterministic structured outputs (JSON, SQL, classification)
  • 0.2-0.4 for factual Q&A, summarization, code generation
  • 0.7-1.0 for creative writing, brainstorming, diverse outputs
  • >1.0 rarely useful in production — mostly for experimentation

4. Top-K Sampling

Restrict the candidate pool to the top K highest-probability tokens, then sample from those.

  • Prevents the model from picking very unlikely (incoherent) tokens
  • Problem: K is a fixed number regardless of the shape of the distribution. Sometimes the top 50 tokens have very similar probabilities (so a diverse pool is appropriate); sometimes there is one clear winner and the next 49 are all bad.
  • Less commonly used as a standalone setting than top-p
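Top-k can be sketched in a few lines of plain Python (toy logits, with `random.choices` standing in for the sampling step):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep only the k highest-logit tokens, renormalize, then sample one."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    return rng.choices(top, weights=weights, k=1)[0]

# Only indices 0 and 1 can ever be sampled with k=2
token = top_k_sample([3.0, 2.0, 1.0, -5.0], k=2)
```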

5. Top-P (Nucleus) Sampling

Select the smallest set of tokens whose cumulative probability mass ≥ p, then sample from that set.

  • If p=0.9: include tokens until their cumulative probability reaches 90%
  • Dynamically adjusts the pool size to the shape of the distribution
  • When the model is confident (one token has 90% probability), the pool is tiny
  • When the model is uncertain, the pool is larger and the output is more diverse
  • Much better than top-k for most production use cases
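A sketch of the nucleus-selection step in plain Python (toy probability lists for illustration; a real implementation would sample from the returned pool):

```python
def top_p_pool(probs, p):
    """Return the smallest set of token indices whose cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    pool, cum = [], 0.0
    for i in order:
        pool.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return pool

# Confident distribution: the pool is tiny
assert top_p_pool([0.92, 0.04, 0.02, 0.02], 0.9) == [0]
# Flat distribution: the pool grows to cover 90% of the mass
assert len(top_p_pool([0.25, 0.25, 0.25, 0.25], 0.9)) == 4
```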

Common defaults:

  • top_p=1.0 (no truncation) with temperature=0.7 for general tasks
  • top_p=0.9 with temperature=1.0 for creative tasks
  • For deterministic outputs, use temperature=0 and ignore top-p

Top-p and temperature interact: if you lower temperature, the distribution is already peaked, so top-p has less effect. Don't tune both aggressively at the same time.
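The interaction can be seen numerically: at low temperature the distribution is so peaked that the p=0.9 nucleus is already tiny, leaving top-p almost nothing to truncate (toy logits below, sketch only):

```python
import math

def temp_softmax(logits, t):
    """Temperature-scaled softmax over toy logits."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def nucleus_size(probs, p):
    """Number of tokens in the smallest set with cumulative mass >= p."""
    cum, n = 0.0, 0
    for q in sorted(probs, reverse=True):
        cum += q
        n += 1
        if cum >= p:
            return n

logits = [2.0, 1.5, 1.0, 0.5, 0.0]
small = nucleus_size(temp_softmax(logits, 0.3), 0.9)   # peaked → small nucleus
large = nucleus_size(temp_softmax(logits, 1.5), 0.9)   # flat → large nucleus
assert small < large
```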

6. Beam Search

Instead of picking one token at each step, keep the top B sequences (beams) in parallel and score each full sequence.

  • Produces more globally coherent sequences than greedy
  • Common in translation and summarization (older research usage)
  • Rarely used with modern instruction-tuned LLMs — sampling-based methods work better for open-ended generation
  • Computationally expensive (B× more memory and compute)
  • Some LLM APIs don't expose this parameter
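A toy sketch of the idea, with a hard-coded next-token table standing in for a real model's forward pass. Note how the beam recovers "a cat" (joint probability 0.36) where greedy would commit to the locally best first token "the" and end at 0.30:

```python
import math

# Toy "model": log-probabilities of the next token given the sequence so far.
# In a real LLM this would come from a forward pass; here it is hard-coded.
def next_logprobs(seq):
    table = {
        (): {"the": math.log(0.6), "a": math.log(0.4)},
        ("the",): {"cat": math.log(0.5), "dog": math.log(0.5)},
        ("a",): {"cat": math.log(0.9), "dog": math.log(0.1)},
    }
    return table.get(seq, {"<eos>": 0.0})

def beam_search(beam_width, steps):
    beams = [((), 0.0)]                      # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep the B highest-scoring full sequences, not just the best next token
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

best_seq, best_score = beam_search(beam_width=2, steps=2)[0]
# best_seq is ("a", "cat"), even though "the" was the best first token
```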

7. Stopping Criteria

The model generates until a stop condition is met:

  • EOS token: the model generates its end-of-sequence token (natural completion)
  • Stop sequences: caller specifies strings that terminate generation (["\n", "###", "</response>"])
  • Max tokens: hard cutoff on output length
  • Custom logic: caller inspects the stream and stops based on application logic
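These criteria can be sketched as a simple streaming loop (hypothetical token stream; a real implementation would also handle stop sequences split across token boundaries):

```python
def generate_with_stops(token_stream, stop_sequences, max_tokens):
    """Accumulate streamed tokens, halting on a stop sequence or max_tokens."""
    text = ""
    for i, token in enumerate(token_stream):
        text += token
        for stop in stop_sequences:
            if stop in text:
                return text[: text.index(stop)]   # trim at the stop marker
        if i + 1 >= max_tokens:
            return text                            # hard length cutoff
    return text                                    # stream ended (EOS)

out = generate_with_stops(["Hello", " world", "###", " ignored"], ["###"], 10)
assert out == "Hello world"
```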

Production tip: always set explicit stop sequences for structured outputs. If you're generating JSON, stop on } (or better, validate and stop when the JSON closes). If you're generating one answer per turn, stop on a newline. Don't rely solely on the EOS token.

Common Follow-ups

  1. "Why does temperature=0 not guarantee identical outputs every time?" At temperature=0 the model is greedy, so identical outputs are expected for identical inputs — but GPU floating-point non-determinism and server-side batching can still cause minor variations. Seed parameters help, though most APIs treat them as best-effort rather than a hard guarantee.

  2. "What's the difference between max_tokens and context window length?" max_tokens limits the output length. Context window length limits input + output combined. A 128K context window means you can have 120K input tokens and 8K output tokens — but setting max_tokens=4096 means the output stops at 4K regardless of the context window.

  3. "How do you prevent the model from stopping mid-sentence because it hit max_tokens?" Set max_tokens generously for the task, or use a post-processing step that detects incomplete sentences and re-prompts. For structured outputs, use a JSON schema enforcement library (Outlines, Instructor) that constrains generation to valid syntax throughout.
