Temperature, Top-P, and Top-K Explained: Controlling LLM Randomness
▶ Watch on YouTube & subscribe to The Stack Underflow
Every time you send a prompt to an LLM, the model does not pick one answer — it computes a full probability distribution over every token in its vocabulary and then draws from that distribution. The three parameters covered here — temperature, top-k, and top-p — are the knobs that reshape that distribution before the draw happens. Get them wrong and you either get a model that always says the same thing or one that hallucinates with flair.
This page is the written companion to the video above. It expands the video's visual intuitions into a reference you can come back to when you’re tuning inference parameters in production.
The one-sentence version: Temperature flattens or sharpens the token probability curve, top-k discards every token outside the top K ranks, and top-p keeps only the smallest set of top-ranked tokens whose cumulative probability reaches P; temperature is applied first, truncation second, and the final token is sampled last.
The probability distribution under every token
Prompt: “The cat sat on the ___”
Before any sampling parameter is applied, the model produces a logit for every token in its vocabulary — a raw score representing how well that token fits. Those logits are converted to probabilities via softmax, giving you something like:
| Token | Probability |
|---|---|
| mat | 0.42 |
| sofa | 0.08 |
| floor | 0.06 |
| roof | 0.06 |
| … | ~0.38 (long tail) |
This is the model’s belief about what comes next. The final token is chosen by a weighted random draw — like dropping a dart on a bar chart where each bar’s width is its probability. Usually you get “mat”. Sometimes you get “sofa”. Occasionally, “roof”. That weighted draw is why the same prompt produces different outputs on different runs.
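The softmax step and the weighted draw can be sketched in a few lines of standard-library Python. This is an illustration only: the five-entry vocabulary, the `<other>` bucket standing in for the long tail, and the logits (chosen as log-probabilities so softmax reproduces the table) are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary; "<other>" lumps the long tail into one bucket.
tokens = ["mat", "sofa", "floor", "roof", "<other>"]
# Assumed logits: log-probabilities of the table values, so softmax returns them.
logits = [math.log(p) for p in (0.42, 0.08, 0.06, 0.06, 0.38)]

probs = softmax(logits)

# Weighted random draw: each token is picked in proportion to its probability.
rng = random.Random(0)  # seeded only so the demo is repeatable
draws = rng.choices(tokens, weights=probs, k=10)

print([round(p, 2) for p in probs])
print(draws)
```

Running the draw many times should give "mat" roughly 42% of the time, which is the run-to-run variation described above.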
Temperature: reshaping the curve
Temperature is a scalar divisor applied to the logits before softmax:
adjusted_logit[i] = raw_logit[i] / temperature
- High temperature (e.g. 1.5–2.0): Logits move closer together. After softmax the probabilities flatten out. The long tail rises. Unlikely tokens get a real shot. Output is more varied, more creative — and more chaotic.
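- Low temperature (e.g. 0.1–0.3): Logits spread apart. After softmax the distribution sharpens into a spike on the top token. As temperature → 0 you always get the single most likely token (dividing by exactly 0 is undefined, so implementations typically treat 0 as a plain argmax). Deterministic, boring, and only as correct as the model's top choice.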
- Temperature = 1.0: Logits are passed through unchanged. The raw model distribution, no reshaping.
temp=2.0 → flat curve → lots of variety
temp=1.0 → raw curve → model default
temp=0.1 → sharp spike → near-deterministic
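To make the flattening and sharpening concrete, here is a small sketch that divides a set of logits by the temperature before softmax. The logit values are made up for illustration, and `softmax_with_temperature` is a hypothetical helper name, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        # Division by zero is undefined; handle 0 as greedy argmax elsewhere.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # made-up raw scores for four tokens

sharp = softmax_with_temperature(logits, 0.1)  # spike on the top token
raw = softmax_with_temperature(logits, 1.0)    # unchanged model distribution
flat = softmax_with_temperature(logits, 2.0)   # tail tokens gain probability

for name, probs in (("temp=0.1", sharp), ("temp=1.0", raw), ("temp=2.0", flat)):
    print(name, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as the temperature rises, while the probability of the last token grows.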
Top-K: keeping only the K most likely tokens
After temperature reshapes the curve, top-k truncates it. If K = 3:
- Rank all tokens by their (already temperature-adjusted) probability.
- Keep only the top 3 tokens.
- Zero out the rest and renormalize the survivors so they sum to 1.
- Draw from the surviving 3.
Top-k is blunt: it always keeps exactly K tokens regardless of how the probabilities are actually distributed. If the top 3 tokens account for 99 % of the mass, you’re still keeping 3. If they account for 3 %, you’re still keeping only 3.
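The four steps above can be sketched as a small filter over a probability list. The distribution is invented for the example, `top_k_filter` is a hypothetical name, and breaking ties by original position is an assumption; real engines may order tied tokens differently.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability entries, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices sorted by probability, highest first (ties keep original order).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept_mass = sum(probs[i] for i in keep)
    # Survivors are rescaled so they sum to 1; everything else becomes 0.
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

# Made-up, already temperature-adjusted probabilities for five tokens.
probs = [0.50, 0.20, 0.15, 0.10, 0.05]
filtered = top_k_filter(probs, 3)

print([round(p, 3) for p in filtered])
```

Only the first three entries stay non-zero, and they are rescaled by the 0.85 of mass that survived.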
Top-P: adapting to the shape of the distribution
Top-p (also called nucleus sampling) is smarter. If P = 0.90:
- Sort tokens by probability, highest first.
- Walk down the sorted list, accumulating probability mass.
- Stop the moment the cumulative sum reaches P (0.90). The token that crosses the threshold is kept.
- Drop everything below that cutoff and renormalize.
- Draw from the survivors.
The key insight: top-p adapts to the curve’s shape.
Peaked distribution → only 2-3 tokens survive (it hits 0.90 fast)
Flat distribution → many tokens survive (it takes longer to reach 0.90)
This makes top-p more robust than top-k across varying prompt types. When the model is confident, the nucleus stays small. When the model is uncertain, it stays open.
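A minimal sketch of the nucleus cutoff follows, run on one peaked and one flat distribution. Both distributions are made up, and `top_p_filter` is a hypothetical helper; it keeps the token that crosses the threshold, which is the usual convention but still an assumption about any particular engine.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)           # the token that crosses the threshold is kept
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = set(keep)
    # Renormalize the survivors by the mass that was actually kept.
    return [probs[i] / cumulative if i in kept else 0.0 for i in range(len(probs))]

peaked = [0.85, 0.07, 0.04, 0.03, 0.01]  # confident model (illustrative)
flat = [0.25, 0.22, 0.20, 0.18, 0.15]    # uncertain model (illustrative)

peaked_nucleus = top_p_filter(peaked, 0.90)
flat_nucleus = top_p_filter(flat, 0.90)

print(sum(1 for x in peaked_nucleus if x > 0), "tokens survive (peaked)")
print(sum(1 for x in flat_nucleus if x > 0), "tokens survive (flat)")
```

With the same P, the peaked distribution keeps two tokens while the flat one keeps all five.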
How the three stack together
The order of operations is not arbitrary:
Raw logits
│
▼ ① Temperature scaling (reshape)
Adjusted logits → softmax → probabilities
│
▼ ② Top-K truncation (drop by rank)
or
▼ ② Top-P truncation (drop by cumulative mass)
│
▼ ③ Renormalize survivors
│
▼ ④ Weighted random draw → final token
Temperature changes the shape that top-p will measure. If you crank temperature up and keep the same top-p, you first flatten the curve, so more tokens are needed to reach the cumulative threshold and a wider nucleus gets through. The order matters for reasoning about what you’ll get.
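One way to see this interaction is to hold top-p fixed and count how many tokens the nucleus contains at different temperatures. The logits below are invented, and both helper names are hypothetical; treat this as a sketch under those assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Count how many top-ranked tokens are needed to reach cumulative mass p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # rounding kept the sum just under p: keep everything

logits = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]  # made-up scores
top_p = 0.9

# Same top-p, three temperatures: the nucleus grows as the curve flattens.
sizes = {
    t: nucleus_size(softmax_with_temperature(logits, t), top_p)
    for t in (0.5, 1.0, 2.0)
}
print(sizes)
```

The same top-p value therefore admits a different number of candidates depending on the temperature applied before it.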
Most APIs let you set temperature and one of top-k or top-p. Setting both top-k and top-p simultaneously is allowed but rarely needed — they both truncate, so the stricter one dominates.
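Putting the stages together, a toy sampler might look like the sketch below. It is not any particular engine's implementation: `sample_next_token` and its parameters are hypothetical, treating temperature 0 as argmax is an assumption, and measuring top-p against the pre-truncation probabilities when top-k is also set is one possible choice among several.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token (toy pipeline, see assumptions)."""
    if temperature <= 0:
        # Assumed convention: temperature 0 means greedy argmax, no draw.
        return max(range(len(logits)), key=lambda i: logits[i])
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if top_p is not None and not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")

    # 1. Temperature scaling, then softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Truncation: by rank (top-k) and/or by cumulative mass (top-p).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # 3 and 4. random.choices rescales the weights, so the survivors are
    # effectively renormalized as part of the weighted draw.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # made-up scores for five tokens
rng = random.Random(0)
picks = [sample_next_token(logits, temperature=0.8, top_p=0.9, rng=rng) for _ in range(10)]
print(picks)
```

Passing both `top_k` and `top_p` applies the rank cut first and the mass cut second, so whichever leaves fewer tokens determines the final candidate set.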
When to actually tune these
| Task | Temperature | Top-P / Top-K |
|---|---|---|
| Extraction, structured JSON | Low (0.0–0.3) | Tight top-p (0.8) or low top-k |
| Q&A, factual retrieval | Low–Medium (0.3–0.7) | Default |
| Brainstorming, marketing copy | Medium–High (0.8–1.4) | Higher top-p (0.95+) |
| Creative writing, naming | High (1.0–1.5) | High top-p or no truncation |
| Code generation | Low (0.1–0.4) | Tight |
One practical rule of thumb from the video: if you need a guaranteed output shape (valid JSON, a specific format), do not try to force it with low temperature alone. Use tool-calling / structured output mode instead. Temperature controls variety, not schema conformance.
Also worth noting: if your model keeps producing vague or unhelpful output, the culprit is usually missing acceptance criteria in the prompt, not a temperature problem. Tightening the temperature on a bad prompt gives you a consistently bad answer faster.
Common misconceptions
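- “Temperature = 0 means the model is deterministic.” Mostly true in practice, but floating-point arithmetic and batching can introduce tiny non-determinism even at temperature 0. A fixed seed, where the API offers one, makes the random draw repeatable at temperatures above 0, but it does not remove those numerical differences, so do not count on bit-for-bit reproducibility from a hosted API.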
- “High temperature makes the model smarter.” It makes the model less constrained, not more capable. The knowledge is fixed; temperature only affects how much of the probability tail gets sampled. You can easily get more-varied wrong answers.
- “Top-p replaces top-k; they do the same thing.” They both truncate the distribution but by different criteria. Top-k is rank-based (always K tokens). Top-p is mass-based (adapts to distribution shape). Top-p is usually the better default because it handles both confident and uncertain model states gracefully.
- “Lowering temperature reduces hallucinations.” Only sometimes. If the model’s top-1 token is an incorrect fact, temperature 0 will confidently return that wrong fact every time. Hallucination is a training and prompting problem more than a sampling one.
Frequently asked questions
What temperature should I use for Claude by default? Claude’s API default is 1.0 (raw model distribution), which is also the maximum it accepts: the valid range is 0.0 to 1.0. For most production tasks (structured data, factual Q&A, code), start at 0.3–0.7 and lower from there if you need consistency. The above-1.0 settings mentioned elsewhere on this page apply only to engines that allow them.
Can I set both top-k and top-p at the same time? Yes, most inference engines support it. The two truncations are applied in sequence; whichever removes more tokens wins. In practice, just pick one — mixing them rarely helps and makes it harder to reason about the effective nucleus size.
Why does the same temperature feel “wilder” for some prompts than others? Because the base distribution varies by prompt. A prompt with a very peaked distribution (the model is highly confident) stays controlled even at moderate temperature. A prompt where the model is genuinely uncertain starts with a flat distribution; temperature just amplifies that flatness. Same knob, different baseline curves.
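Shannon entropy gives a rough number for how "wild" a distribution is, so it can illustrate the point about baselines. In the sketch below both sets of logits are invented, the temperature of 1.3 is arbitrary, and the helper names are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: higher means a more spread-out distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident_logits = [8.0, 2.0, 1.0, 0.5, 0.0]  # one clear winner (illustrative)
uncertain_logits = [1.2, 1.0, 0.9, 0.8, 0.7]  # near tie (illustrative)
temperature = 1.3                             # same knob for both prompts

confident_entropy = entropy_bits(softmax_with_temperature(confident_logits, temperature))
uncertain_entropy = entropy_bits(softmax_with_temperature(uncertain_logits, temperature))

print(round(confident_entropy, 2), round(uncertain_entropy, 2))
```

At the same temperature the confident prompt stays close to zero bits of entropy, while the uncertain one sits near the five-token maximum of about 2.32 bits.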
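If I need deterministic output, is temperature enough? No. Set temperature to 0 (or as close as the API allows), use a fixed seed if the API exposes one, and if format matters, use structured output / tool-calling mode. Temperature controls sampling randomness; structured output constrains the format instead. Where it is implemented with constrained decoding, tokens that would violate the schema are masked out before the draw, which is a different mechanism from turning the randomness down.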
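What a seed does and does not fix can be shown with a local toy sampler. The sketch below assumes you control the random number generator yourself; whether a hosted API exposes a seed is provider-specific, and a seed does nothing about floating-point differences in the probabilities themselves. The function name and the distribution are hypothetical.

```python
import random

def sample_sequence(probs, steps, seed):
    """Draw `steps` token indices from a fixed distribution using a seeded RNG."""
    rng = random.Random(seed)  # private generator, independent of global state
    indices = list(range(len(probs)))
    return [rng.choices(indices, weights=probs, k=1)[0] for _ in range(steps)]

probs = [0.42, 0.08, 0.06, 0.06, 0.38]  # illustrative distribution

first_run = sample_sequence(probs, steps=20, seed=42)
second_run = sample_sequence(probs, steps=20, seed=42)

# Same seed and identical probabilities give the same draws.
print(first_run == second_run)
```

The guarantee holds only while the input probabilities are identical on every run, which is exactly the part that batching and floating-point effects can disturb.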
Where this fits in the series
This episode is part of How Claude Actually Works, a course that starts from raw transformer outputs and builds up to production-ready Claude integrations. Earlier episodes cover tokenization and the attention mechanism; later ones cover tool-calling, context window management, and why prompt clarity matters more than parameter tuning for most quality problems. Browse all tutorials to find the episodes on either side of this one.
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.