Sampling Strategies: From Greedy to Stochastic

The simplest decoding method is Greedy Decoding: at each step, choose the token with the highest probability. This is deterministic and produces consistent but often repetitive, bland outputs.

Stochastic Sampling introduces variability: tokens are drawn at random according to the model's probability distribution. This increases creativity but can produce incoherent, illogical text.
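The contrast can be sketched in a few lines of Python. The toy vocabulary and probabilities below are illustrative values, not real model outputs, and the function names `greedy` and `sample` are chosen for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution over a small vocabulary (illustrative values).
vocab = ["Paris", "Lyon", "Nice", "Rome", "Oslo"]
probs = np.array([0.70, 0.12, 0.08, 0.06, 0.04])

def greedy(probs):
    """Greedy decoding: always return the index of the most probable token."""
    return int(np.argmax(probs))

def sample(probs, rng):
    """Stochastic sampling: draw a token index from the full distribution."""
    return int(rng.choice(len(probs), p=probs))

print(vocab[greedy(probs)])       # always "Paris" -- deterministic
print(vocab[sample(probs, rng)])  # varies run to run (seeded here)
```

Greedy decoding always returns the same token; sampling occasionally picks "Lyon" or "Oslo", which is exactly where both the creativity and the nonsense come from.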

To balance these two extremes, modern LLMs use filtering strategies that limit the set of candidate tokens:

Fig. 1 | Comparison of Top-k (k=5) and Top-p (p=0.90) filtering on the same probability distribution. Top-k keeps the k most probable tokens; Top-p keeps tokens until their cumulative probability exceeds p.

Top-k Sampling

Top-k Sampling keeps the k tokens with the highest probabilities. All other tokens are set to probability 0, and the distribution is renormalized.

Filtered = {i : rank(P(token_i)) ≤ k}

P'(token_i) = P(token_i) / Σ_{j ∈ Filtered} P(token_j)  if i ∈ Filtered, else 0

Advantage: Simple, predictable. With k=5, only 5 tokens are ever considered.

Disadvantage: Rigid. For very confident predictions (e.g., "The capital of France is ___"), k=5 needlessly admits implausible candidates; for genuinely open-ended contexts, k=5 may be too restrictive.
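The renormalization above can be sketched as follows (`top_k_filter` is an illustrative name, not a library function):

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]       # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# With k=2, only the two most probable tokens survive:
# [0.5, 0.3] renormalized to [0.625, 0.375], everything else zero.
print(top_k_filter([0.5, 0.3, 0.1, 0.07, 0.03], 2))
```

Note that k is fixed regardless of how the probability mass is distributed, which is precisely the rigidity described above.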

Top-p (Nucleus) Sampling

Top-p Sampling (also called Nucleus Sampling) keeps the smallest set of tokens whose cumulative probability ≥ p. This is adaptive.

Sorted: P(token_1) ≥ P(token_2) ≥ ... ≥ P(token_n)

Filtered = {token_1, ..., token_m}, where m is the smallest index with Σ_{j≤m} P(token_j) ≥ p

Advantage: Adaptive. For confident predictions (high probability for top token), few tokens are needed. For uncertain ones, p=0.9 might include 20+ tokens.

Disadvantage: Slightly more complex to understand and implement, and the same threshold p=0.9 selects a different number of tokens depending on the context.
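A minimal sketch of the nucleus filter, following the definition above (`top_p_filter` is an illustrative name):

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest prefix of the descending-sorted distribution
    whose cumulative probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]             # token indices, most probable first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1   # first position where cum >= p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

dist = [0.5, 0.3, 0.1, 0.07, 0.03]
print(np.count_nonzero(top_p_filter(dist, 0.85)))  # 3 tokens needed to cover 85%
print(np.count_nonzero(top_p_filter(dist, 0.50)))  # the top token alone covers 50%
```

The two calls at the end show the adaptivity: the same function keeps three tokens for p=0.85 but only one for p=0.50, because the nucleus size depends on how the probability mass is spread.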

Fig. 2 | Side-by-side comparison: Left side shows filtered tokens for Top-k, right side for Top-p. The cumulative probability is shown by the red line.

Practical Recommendations

Use Case                   Top-k Setting   Top-p Setting       Temperature   Rationale
Factual (Search, QA)       k=1 (Greedy)    p=1.0 (No Filter)   T=0.0-0.3     One correct answer preferred
General (Chat)             k=40-50         p=0.9-0.95          T=0.7-0.9     Balances coherence and variability
Creative (Storytelling)    k=100+          p=0.95+             T=1.0-1.5     More creativity, less coherence
Code Generation            k=5-10          p=0.8-0.9           T=0.0-0.5     Syntactic correctness is important

In practice, Top-p is used more often than Top-k because it adapts to the distribution. OpenAI uses p=1.0 (no filtering) by default, but many users set p=0.9 for better quality. Combining it with the temperature T gives finer control.
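The temperature + Top-p combination can be sketched as a single pipeline. This is a simplified sketch of the common ordering (scale logits by temperature, filter, sample); `sample_next_token` is an illustrative name, not a real library API:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature scaling, then Top-p filtering, then sampling."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # 1. Temperature: divide logits before the softmax; T -> 0 approaches greedy.
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())       # numerically stable softmax
    probs /= probs.sum()
    # 2. Top-p: keep the smallest prefix of the sorted distribution covering p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    filtered /= filtered.sum()
    # 3. Sample a token index from the filtered, renormalized distribution.
    return int(rng.choice(len(filtered), p=filtered))
```

With a very low temperature the scaled distribution collapses onto the top logit and the call behaves like greedy decoding; with higher temperatures and a larger top_p, more of the distribution survives and outputs vary.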

Key Insights

⚖️ Quality vs Diversity

Top-k and Top-p trade off deterministic, reliable answers against creative, varied outputs.

📊 Adaptivity

Top-p adapts to uncertainty: Automatically more restrictive for confident predictions.

🔧 Combinations

Temperature + Top-k/p work together. Both need to be tuned for optimal results.

💡 Practical Defaults

Top-p=0.9 + Temperature=0.7 is a good default for most tasks.

🎯 Trade-offs

Too restrictive = boring. Too permissive = incoherent. The balance depends on the task.

🚀 Inference Quality

Good sampling is as important for generation quality as model size, yet it is often underestimated.