
Master the technology behind ChatGPT and Gemini. This comprehensive course takes you from understanding simple text prediction to building sophisticated, autonomous AI agents capable of solving real-world enterprise challenges.
How AI "guesses" the next word — understanding softmax, temperature, sampling strategies, and the full generation pipeline that turns raw numbers into coherent language.
At the most fundamental level, an LLM does not "know" things the way a database does, and it does not "think" the way humans do. Every single word it produces is the result of one operation, repeated thousands of times:
Given a sequence of tokens as input, compute a probability distribution over the entire vocabulary — then sample the next token from that distribution. Repeat until the response is complete.
This sounds almost too simple to explain the richness of GPT-4's outputs. But this one mechanism, applied at massive scale with billions of parameters trained on trillions of tokens, is genuinely all that's happening. This is the "magic" that, seen from the outside, we perceive as intelligence.
The model assigns a probability to every single token in its vocabulary (~100,000 tokens) simultaneously. The probabilities always sum to exactly 1.0. Notice: the model isn't "looking up" Paris — it learned this association purely from statistical patterns across trillions of training examples.
A deterministic model (always pick the single highest-probability token) produces robotic, repetitive text. Real language has natural variation — "The weather today is lovely / nice / wonderful / great" are all correct. Probabilistic sampling preserves this richness. Determinism is useful for code generation; randomness is useful for creative writing. You control this tradeoff with temperature.
Before the model produces a probability distribution, it first produces logits — raw, unnormalized scores. There's one logit per vocabulary token. These numbers can be anything: negative, positive, very large, very small. They're meaningless on their own until we apply Softmax.
Softmax takes a vector of logits [z₁, z₂, ..., zₙ] and converts them into a valid probability distribution: each output is between 0 and 1, and all outputs sum to exactly 1.0.
The exponential function eˣ has two critical properties: it's always positive (no negative probabilities), and it's extremely sensitive to differences — a logit of 5.2 becomes 181, while 1.8 becomes only 6. This "sharpening" effect means the highest logit gets disproportionately more probability mass, which is the right behavior for next-token prediction.
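The sharpening effect is easy to verify in a few lines of plain Python; the logit values below are illustrative, not from any real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtract the max for numerical stability (doesn't change the result).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A logit gap of 3.4 (5.2 vs 1.8) becomes roughly a 30x probability gap.
probs = softmax([5.2, 1.8, 0.5])
print(probs)       # highest logit takes the overwhelming share
print(sum(probs))  # always 1.0
```

Note the max-subtraction trick: since softmax is invariant to adding a constant to all logits, subtracting the maximum prevents `exp` from overflowing on large values.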
The model doesn't generate an entire sentence at once — it generates one token per forward pass, then feeds that token back as input for the next step. This loop is called autoregressive generation.
Generation stops when the model emits a special end-of-sequence token such as <|endoftext|>, or when it hits a user-specified max_tokens limit. Why can the same prompt produce different outputs? Because generation is probabilistic — each token is sampled from the distribution, not deterministically chosen. If "blue" has 61% probability, it still loses to other tokens 39% of the time. Set temperature=0 (always pick the highest-probability token) for deterministic outputs — useful for code, extraction, and factual Q&A.
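The loop itself can be sketched with a toy stand-in for the model. Everything here is invented for illustration — `fake_model`, its tiny vocabulary, and the probabilities are not from any real LLM — but the structure (forward pass, sample, feed back, stop on EOS or max_tokens) is the real one:

```python
import random

def fake_model(tokens):
    """Stand-in for a real LLM forward pass: returns a next-token
    probability distribution over a tiny, made-up vocabulary."""
    if tokens[-1] == "is":
        return {"blue": 0.61, "clear": 0.25, "<|endoftext|>": 0.14}
    return {"is": 0.9, "<|endoftext|>": 0.1}

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = fake_model(tokens)                       # one forward pass per token
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights)[0]  # sample, don't argmax
        if tok == "<|endoftext|>":                      # stop at the EOS token
            break
        tokens.append(tok)                              # feed output back as input
    return tokens

print(generate(["the", "sky"]))
```

With a fixed seed the run is reproducible; change the seed and the sampled continuation can differ, which is exactly the probabilistic behavior described above.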
Temperature is the most important generation parameter you'll set as an engineer. It controls how "sharp" or "flat" the probability distribution is — effectively controlling how creative vs. focused the model's output is.
Before applying Softmax, divide all logits by the temperature T:
Softmax(logits / T)
When T < 1: distribution sharpens (more confident). When T > 1: distribution flattens (more random). T = 1 means use the raw distribution.
Low temperature (T < 1): Almost always picks "Paris". Ideal for: code generation, factual Q&A, data extraction, structured output.
Default temperature (T = 1): Raw model distribution. Occasionally picks alternatives. Default for general-purpose assistants and chatbots.
High temperature (T > 1): Much flatter — unusual tokens get real chances. Ideal for: brainstorming, poetry, creative writing. Risk: incoherence at T > 2.
Dividing by T=0 is mathematically undefined, so in practice temperature=0 is implemented as greedy decoding — always pick the token with the highest logit, no sampling at all. This gives fully deterministic, reproducible outputs. Use this in production whenever consistency matters more than variety.
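Temperature scaling, including the greedy fallback at T=0, can be sketched like this (logit values are illustrative):

```python
import math

def sample_distribution(logits, temperature):
    """Temperature-scaled softmax; temperature=0 falls back to greedy."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax, fully deterministic.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 1.8, 0.5]
print(sample_distribution(logits, 0.2))  # sharp: near-certain top token
print(sample_distribution(logits, 1.0))  # raw model distribution
print(sample_distribution(logits, 1.5))  # flatter: alternatives get real mass
print(sample_distribution(logits, 0))    # greedy: [1.0, 0.0, 0.0]
```

Dividing the logits by T < 1 widens the gaps between them before softmax exponentiates, which is why low temperature sharpens the distribution rather than merely rescaling it.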
Temperature shapes the distribution, but you still need to decide how to pick a token from it. There are several strategies, each with different quality/diversity tradeoffs.
| Strategy | How it works | Best used for | Risk |
|---|---|---|---|
| Greedy decoding | Always pick the single highest-probability token. | Code, structured data, factual extraction | Repetitive, boring output. Can get stuck in loops. |
| Pure sampling | Sample randomly according to the full probability distribution. | Creative writing at low temperature | Very low-probability tokens can be selected — incoherence. |
| Top-k sampling | Keep only the k highest-probability tokens, redistribute their probabilities, then sample. | General text generation (k = 40–100) | k is a fixed number — sometimes too narrow, sometimes too wide depending on context. |
| Top-p (nucleus) | Keep the smallest set of tokens whose cumulative probability ≥ p. Sample from that set. | Most modern LLM APIs (p = 0.9–0.95 default) | Slightly complex to tune. Very low p approaches greedy. |
| Beam search | Maintain the top-B partial sequences at each step. Return the sequence with the highest overall probability. | Machine translation, summarization, where correctness matters | Expensive (B× compute). Often produces generic, "safe" outputs lacking diversity. |
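A minimal top-k sampler makes the "fixed nucleus" idea concrete. The token names and probabilities below are made up for illustration:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k highest-probability tokens, renormalize, then sample."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)                    # renormalize over the kept set
    tokens, weights = zip(*[(t, p / total) for t, p in top])
    return rng.choices(tokens, weights=weights)[0]

probs = {"door": 0.48, "window": 0.22, "letter": 0.12,
         "box": 0.09, "meeting": 0.05, "car": 0.04}

# With k=2, only the two highest-probability tokens can ever be chosen.
samples = {top_k_sample(probs, k=2) for _ in range(200)}
print(samples)
```

The weakness the table mentions is visible here: k=2 is the right width when the model is confident, but if ten tokens each had ~10% probability, the same fixed k would cut off eight perfectly reasonable continuations.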
Top-p is adaptive: when the model is very confident (one token has 95% probability), only that one token is in the nucleus. When the model is uncertain (10 tokens each with ~10%), the nucleus includes all 10. This flexibility makes it better than top-k at handling varying confidence levels across different positions.
With top-p=0.90, the nucleus is {"door", "window", "letter", "box"} — their cumulative probability (48+22+12+9=91%) just exceeds 90%. Only these 4 tokens are eligible for sampling. "Meeting" and everything below are excluded.
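The nucleus computation for this example can be checked directly; the probabilities below mirror the numbers above:

```python
def nucleus(probs, p):
    """Smallest set of tokens whose cumulative probability >= p,
    taken in descending probability order."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cum = [], 0.0
    for token, prob in ranked:
        chosen.append(token)
        cum += prob
        if cum >= p:  # stop as soon as the cumulative mass reaches p
            break
    return chosen

probs = {"door": 0.48, "window": 0.22, "letter": 0.12,
         "box": 0.09, "meeting": 0.05, "car": 0.04}
print(nucleus(probs, 0.90))  # ['door', 'window', 'letter', 'box']
```

The adaptive behavior falls out for free: `nucleus({"paris": 0.95, "london": 0.05}, 0.90)` returns just `["paris"]`, because a single confident token already covers the threshold.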
Now let's put it all together. Here's what actually happens inside the model from the moment you send a prompt to when you read the first word of the response:
Every token requires a full Transformer forward pass — processing the entire context through all attention layers. A 500-token response means 500 separate forward passes. This is why latency scales with output length, not just input length. Techniques like KV-caching (reusing computed attention values) reduce this cost significantly, but the fundamental sequential nature of autoregressive generation remains the key bottleneck in production LLM systems.
Perplexity (PPL) is the primary metric used to evaluate how well a language model predicts text. Intuitively: how many tokens is the model "effectively choosing between" at each step on average? Lower perplexity = better model.
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the correct next token across a test set. A perplexity of N means the model is as "surprised" as if it were choosing uniformly among N tokens at each step.
During training, the model is given the correct next token and asked: "what probability did you assign to this token?" The loss is -log(p) — negative log of the assigned probability. If p=0.94 (very confident, correct), loss = 0.06 (low). If p=0.01 (wrong and confident), loss = 4.6 (high). Perplexity = e^(average loss). So minimizing training loss = the model learning to assign high probability to correct next tokens.
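The loss-to-perplexity relationship fits in a few lines (the probability values are illustrative):

```python
import math

def perplexity(probs_of_correct_tokens):
    """Exponentiated average negative log-likelihood."""
    losses = [-math.log(p) for p in probs_of_correct_tokens]
    return math.exp(sum(losses) / len(losses))

# Confident and correct -> low per-token loss -> perplexity close to 1.
print(perplexity([0.94, 0.90, 0.85]))

# Assigning 1% to every correct token is exactly what uniform guessing
# over 100 tokens would do -> perplexity of 100.
print(perplexity([0.01] * 5))
```

This also shows why perplexity is a confidence metric, not an accuracy metric: it only asks how much probability went to the reference token, not whether a sampled output would have been factually right.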
Understanding the theory is step one. Here's how to translate it into production decisions when calling an LLM API.
Code generation: Use temperature=0 or 0.1. Deterministic output. No top-p needed. You want the most likely correct syntax, not creative variation.
General-purpose chat: Use temperature=0.7, top_p=0.9. Balanced: natural variation without going off the rails. This is the API default for most providers.
Creative writing and brainstorming: Use temperature=1.2–1.5, top_p=0.95. Wider exploration. Monitor for coherence — add a system prompt to ground the topic.
Structured/JSON output: Use temperature=0. Combine with structured output mode or function calling to constrain the vocabulary to valid JSON tokens only.
Fighting repetition: Use frequency_penalty=0.3–0.8 to reduce the logits of tokens that have already appeared. Prevents "the the the the" loops in long generations.
Managing cost and latency: Use max_tokens to cap output length. Remember: you're billed for both input and output tokens. Short, precise prompts reduce both latency and cost.
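These recommendations can be collected into explicit per-task presets. The sketch below is hypothetical: the parameter names mirror common OpenAI-style API fields, and the exact values are the ones suggested in this section, not any provider's defaults:

```python
# Hypothetical generation presets keyed by use case. Parameter names follow
# common OpenAI-style APIs (temperature, top_p, frequency_penalty, max_tokens).
PRESETS = {
    "code":       {"temperature": 0.0, "max_tokens": 1024},
    "chat":       {"temperature": 0.7, "top_p": 0.9, "max_tokens": 512},
    "creative":   {"temperature": 1.3, "top_p": 0.95,
                   "frequency_penalty": 0.5, "max_tokens": 800},
    "extraction": {"temperature": 0.0, "max_tokens": 256},
}

def params_for(task):
    """Return explicit generation parameters for a task.

    Setting these explicitly (rather than relying on API defaults)
    keeps behavior stable across provider-side default changes."""
    return dict(PRESETS[task])  # copy so callers can override safely

print(params_for("chat"))
```

A dictionary like this would typically be spread into the request, e.g. `client.chat.completions.create(..., **params_for("code"))` in an OpenAI-style SDK.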
| Parameter | Range | Effect |
|---|---|---|
| temperature | 0.0 – 2.0 | Controls distribution sharpness. 0 = greedy. 1 = raw model. >1 = more random. |
| top_p | 0.0 – 1.0 | Nucleus size. 0.9 = sample from tokens covering 90% of probability mass. |
| top_k | 1 – vocab size | Fixed nucleus. Only sample from the k highest-probability tokens. |
| frequency_penalty | 0.0 – 2.0 | Reduces logits of already-generated tokens. Fights repetition. |
| presence_penalty | 0.0 – 2.0 | Binary version of frequency_penalty — penalizes any token seen at least once. |
| max_tokens | 1 – context limit | Hard cap on output length. Generation stops at the cap even if no EOS token has been emitted. |
| seed | any integer | Fixed seed produces deterministic outputs (when temperature=0). Useful for evals. |
Key takeaways:

- An LLM is a probability machine — every token is the result of sampling from a distribution over the entire vocabulary, not "knowing" the answer.
- Logits → Temperature → Softmax → Sampling is the generation pipeline for every single token.
- Temperature controls the creativity/focus tradeoff — low for code and facts, high for creativity.
- Top-p (nucleus sampling) is the modern default — adaptive, handling confident and uncertain positions differently.
- Perplexity measures model confidence — the lower, the better, but it doesn't measure accuracy or safety.
- In production: always set temperature, top_p, and max_tokens explicitly — never rely on defaults for important applications.