
Master the technology behind ChatGPT and Gemini. This comprehensive course takes you from understanding simple text prediction to building sophisticated, autonomous AI agents capable of solving real-world enterprise challenges.
How AI "guesses" the next word — understanding softmax, temperature, sampling strategies, and the full generation pipeline that turns raw numbers into coherent language.
At the most fundamental level, an LLM does not "know" things the way a database does, and it does not "think" the way humans do. Every single word it produces is the result of one operation, repeated thousands of times:
Given a sequence of tokens as input, compute a probability distribution over the entire vocabulary — then sample the next token from that distribution. Repeat until the response is complete.
This sounds almost too simple to explain the richness of GPT-4's outputs. But this one mechanism, applied at massive scale with billions of parameters trained on trillions of tokens, is genuinely all that's happening. This is the "magic" that, seen from the outside, we perceive as intelligence.
The model assigns a probability to every single token in its vocabulary (~100,000 tokens) simultaneously. The probabilities always sum to exactly 1.0. Notice: the model isn't "looking up" Paris — it learned this association purely from statistical patterns across trillions of training examples.
A deterministic model (always pick the single highest-probability token) produces robotic, repetitive text. Real language has natural variation — "The weather today is lovely / nice / wonderful / great" are all correct. Probabilistic sampling preserves this richness. Determinism is useful for code generation; randomness is useful for creative writing. You control this tradeoff with temperature.
Before the model produces a probability distribution, it first produces logits — raw, unnormalized scores. There's one logit per vocabulary token. These numbers can be anything: negative, positive, very large, very small. They're meaningless on their own until we apply Softmax.
Softmax takes a vector of logits [z₁, z₂, ..., zₙ] and converts them into a valid probability distribution: each output is between 0 and 1, and all outputs sum to exactly 1.0.
The exponential function eˣ has two critical properties: it's always positive (no negative probabilities), and it's extremely sensitive to differences — a logit of 5.2 becomes 181, while 1.8 becomes only 6. This "sharpening" effect means the highest logit gets disproportionately more probability mass, which is the right behavior for next-token prediction.
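The sharpening effect is easy to verify in a few lines of plain Python; the logit values below are illustrative, not from any real model:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtract the max for numerical stability (doesn't change the result).
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A logit gap of 3.4 (5.2 vs 1.8) becomes roughly a 30x probability gap.
probs = softmax([5.2, 1.8, 0.5])
print(probs)       # highest logit takes the overwhelming share
print(sum(probs))  # always 1.0
```

Note the max-subtraction trick: since softmax is invariant to adding a constant to all logits, subtracting the maximum prevents `exp` from overflowing on large values.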
The model doesn't generate an entire sentence at once — it generates one token per forward pass, then feeds that token back as input for the next step. This loop is called autoregressive generation.
Generation stops when the model emits a special end-of-sequence token such as <|endoftext|>, or when it hits a user-specified max_tokens limit. Why can the same prompt produce different outputs? Because generation is probabilistic — each token is sampled from the distribution, not deterministically chosen. If "blue" has 61% probability, it still loses to other tokens 39% of the time. Set temperature=0 (always pick the highest-probability token) for deterministic outputs — useful for code, extraction, and factual Q&A.
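The loop itself can be sketched with a toy stand-in for the model. Everything here is invented for illustration — `fake_model`, its tiny vocabulary, and the probabilities are not from any real LLM — but the structure (forward pass, sample, feed back, stop on EOS or max_tokens) is the real one:

```python
import random

def fake_model(tokens):
    """Stand-in for a real LLM forward pass: returns a next-token
    probability distribution over a tiny, made-up vocabulary."""
    if tokens[-1] == "is":
        return {"blue": 0.61, "clear": 0.25, "<|endoftext|>": 0.14}
    return {"is": 0.9, "<|endoftext|>": 0.1}

def generate(prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = fake_model(tokens)                       # one forward pass per token
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights)[0]  # sample, don't argmax
        if tok == "<|endoftext|>":                      # stop at the EOS token
            break
        tokens.append(tok)                              # feed output back as input
    return tokens

print(generate(["the", "sky"]))
```

With a fixed seed the run is reproducible; change the seed and the sampled continuation can differ, which is exactly the probabilistic behavior described above.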
Temperature is the most important generation parameter you'll set as an engineer. It controls how "sharp" or "flat" the probability distribution is — effectively controlling how creative vs. focused the model's output is.
Before applying Softmax, divide all logits by the temperature T:
Softmax(logits / T)
When T < 1: distribution sharpens (more confident). When T > 1: distribution flattens (more random). T = 1 means use the raw distribution.
Low temperature (T < 1): Almost always picks "Paris". Ideal for: code generation, factual Q&A, data extraction, structured output.
Default temperature (T = 1): Raw model distribution. Occasionally picks alternatives. Default for general-purpose assistants and chatbots.
High temperature (T > 1): Much flatter — unusual tokens get real chances. Ideal for: brainstorming, poetry, creative writing. Risk: incoherence at T > 2.
Dividing by T=0 is mathematically undefined, so in practice temperature=0 is implemented as greedy decoding — always pick the token with the highest logit, no sampling at all. This gives fully deterministic, reproducible outputs. Use this in production whenever consistency matters more than variety.
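Temperature scaling, including the greedy fallback at T=0, can be sketched like this (logit values are illustrative):

```python
import math

def sample_distribution(logits, temperature):
    """Temperature-scaled softmax; temperature=0 falls back to greedy."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax, fully deterministic.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.2, 1.8, 0.5]
print(sample_distribution(logits, 0.2))  # sharp: near-certain top token
print(sample_distribution(logits, 1.0))  # raw model distribution
print(sample_distribution(logits, 1.5))  # flatter: alternatives get real mass
print(sample_distribution(logits, 0))    # greedy: [1.0, 0.0, 0.0]
```

Dividing the logits by T < 1 widens the gaps between them before softmax exponentiates, which is why low temperature sharpens the distribution rather than merely rescaling it.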
Temperature shapes the distribution, but you still need to decide how to pick a token from it. There are several strategies, each with different quality/diversity tradeoffs.
| Strategy | How it works | Best used for | Risk |
|---|---|---|---|
| Greedy decoding | Always pick the single highest-probability token. | Code, structured data, factual extraction | Repetitive, boring output. Can get stuck in loops. |
| Pure sampling | Sample randomly according to the full probability distribution. | Creative writing at low temperature | Very low-probability tokens can be selected — incoherence. |
| Top-k sampling | Keep only the k highest-probability tokens, redistribute their probabilities, then sample. | General text generation (k = 40–100) | k is a fixed number — sometimes too narrow, sometimes too wide depending on context. |
| Top-p (nucleus) | Keep the smallest set of tokens whose cumulative probability ≥ p. Sample from that set. | Most modern LLM APIs (p = 0.9–0.95 default) | Slightly complex to tune. Very low p approaches greedy. |
| Beam search | Maintain the top-B partial sequences at each step. Return the sequence with the highest overall probability. | Machine translation, summarization, where correctness matters | Expensive (B× compute). Often produces generic, "safe" outputs lacking diversity. |
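A minimal top-k sampler makes the "fixed nucleus" idea concrete. The token names and probabilities below are made up for illustration:

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k highest-probability tokens, renormalize, then sample."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)                    # renormalize over the kept set
    tokens, weights = zip(*[(t, p / total) for t, p in top])
    return rng.choices(tokens, weights=weights)[0]

probs = {"door": 0.48, "window": 0.22, "letter": 0.12,
         "box": 0.09, "meeting": 0.05, "car": 0.04}

# With k=2, only the two highest-probability tokens can ever be chosen.
samples = {top_k_sample(probs, k=2) for _ in range(200)}
print(samples)
```

The weakness the table mentions is visible here: k=2 is the right width when the model is confident, but if ten tokens each had ~10% probability, the same fixed k would cut off eight perfectly reasonable continuations.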
Top-p is adaptive: when the model is very confident (one token has 95% probability), only that one token is in the nucleus. When the model is uncertain (10 tokens each with ~10%), the nucleus includes all 10. This flexibility makes it better than top-k at handling varying confidence levels across different positions.
With top-p=0.90, the nucleus is {"door", "window", "letter", "box"} — their cumulative probability (48+22+12+9=91%) just exceeds 90%. Only these 4 tokens are eligible for sampling. "Meeting" and everything below are excluded.
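The nucleus computation for this example can be checked directly; the probabilities below mirror the numbers above:

```python
def nucleus(probs, p):
    """Smallest set of tokens whose cumulative probability >= p,
    taken in descending probability order."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cum = [], 0.0
    for token, prob in ranked:
        chosen.append(token)
        cum += prob
        if cum >= p:  # stop as soon as the cumulative mass reaches p
            break
    return chosen

probs = {"door": 0.48, "window": 0.22, "letter": 0.12,
         "box": 0.09, "meeting": 0.05, "car": 0.04}
print(nucleus(probs, 0.90))  # ['door', 'window', 'letter', 'box']
```

The adaptive behavior falls out for free: `nucleus({"paris": 0.95, "london": 0.05}, 0.90)` returns just `["paris"]`, because a single confident token already covers the threshold.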
Now let's put it all together. Here's what actually happens inside the model from the moment you send a prompt to when you read the first word of the response:
Every token requires a full Transformer forward pass — processing the entire context through all attention layers. A 500-token response means 500 separate forward passes. This is why latency scales with output length, not just input length. Techniques like KV-caching (reusing computed attention values) reduce this cost significantly, but the fundamental sequential nature of autoregressive generation remains the key bottleneck in production LLM systems.
Perplexity (PPL) is the primary metric used to evaluate how well a language model predicts text. Intuitively: how many tokens is the model "effectively choosing between" at each step on average? Lower perplexity = better model.
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the correct next token across a test set. A perplexity of N means the model is as "surprised" as if it were choosing uniformly among N tokens at each step.
During training, the model is given the correct next token and asked: "what probability did you assign to this token?" The loss is -log(p) — negative log of the assigned probability. If p=0.94 (very confident, correct), loss = 0.06 (low). If p=0.01 (wrong and confident), loss = 4.6 (high). Perplexity = e^(average loss). So minimizing training loss = the model learning to assign high probability to correct next tokens.
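The loss-to-perplexity relationship fits in a few lines (the probability values are illustrative):

```python
import math

def perplexity(probs_of_correct_tokens):
    """Exponentiated average negative log-likelihood."""
    losses = [-math.log(p) for p in probs_of_correct_tokens]
    return math.exp(sum(losses) / len(losses))

# Confident and correct -> low per-token loss -> perplexity close to 1.
print(perplexity([0.94, 0.90, 0.85]))

# Assigning 1% to every correct token is exactly what uniform guessing
# over 100 tokens would do -> perplexity of 100.
print(perplexity([0.01] * 5))
```

This also shows why perplexity is a confidence metric, not an accuracy metric: it only asks how much probability went to the reference token, not whether a sampled output would have been factually right.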
Understanding the theory is step one. Here's how to translate it into production decisions when calling an LLM API.
Code generation: Use temperature=0 or 0.1. Deterministic output. No top-p needed. You want the most likely correct syntax, not creative variation.
General-purpose chat: Use temperature=0.7, top_p=0.9. Balanced: natural variation without going off the rails. This is the API default for most providers.
Creative writing and brainstorming: Use temperature=1.2–1.5, top_p=0.95. Wider exploration. Monitor for coherence — add a system prompt to ground the topic.
Structured/JSON output: Use temperature=0. Combine with structured output mode or function calling to constrain the vocabulary to valid JSON tokens only.
Fighting repetition: Use frequency_penalty=0.3–0.8 to reduce the logits of tokens that have already appeared. Prevents "the the the the" loops in long generations.
Managing cost and latency: Use max_tokens to cap output length. Remember: you're billed for both input and output tokens. Short, precise prompts reduce both latency and cost.
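These recommendations can be collected into explicit per-task presets. The sketch below is hypothetical: the parameter names mirror common OpenAI-style API fields, and the exact values are the ones suggested in this section, not any provider's defaults:

```python
# Hypothetical generation presets keyed by use case. Parameter names follow
# common OpenAI-style APIs (temperature, top_p, frequency_penalty, max_tokens).
PRESETS = {
    "code":       {"temperature": 0.0, "max_tokens": 1024},
    "chat":       {"temperature": 0.7, "top_p": 0.9, "max_tokens": 512},
    "creative":   {"temperature": 1.3, "top_p": 0.95,
                   "frequency_penalty": 0.5, "max_tokens": 800},
    "extraction": {"temperature": 0.0, "max_tokens": 256},
}

def params_for(task):
    """Return explicit generation parameters for a task.

    Setting these explicitly (rather than relying on API defaults)
    keeps behavior stable across provider-side default changes."""
    return dict(PRESETS[task])  # copy so callers can override safely

print(params_for("chat"))
```

A dictionary like this would typically be spread into the request, e.g. `client.chat.completions.create(..., **params_for("code"))` in an OpenAI-style SDK.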
| Parameter | Range | Effect |
|---|---|---|
| temperature | 0.0 – 2.0 | Controls distribution sharpness. 0 = greedy. 1 = raw model. >1 = more random. |
| top_p | 0.0 – 1.0 | Nucleus size. 0.9 = sample from tokens covering 90% of probability mass. |
| top_k | 1 – vocab size | Fixed nucleus. Only sample from the k highest-probability tokens. |
| frequency_penalty | 0.0 – 2.0 | Reduces logits of already-generated tokens. Fights repetition. |
| presence_penalty | 0.0 – 2.0 | Binary version of frequency_penalty — penalizes any token seen at least once. |
| max_tokens | 1 – context limit | Hard cap on output length. Generation stops at the cap even if no EOS token has been emitted. |
| seed | any integer | Fixed seed produces deterministic outputs (when temperature=0). Useful for evals. |
Key takeaways:

- An LLM is a probability machine — every token is the result of sampling from a distribution over the entire vocabulary, not "knowing" the answer.
- Logits → Temperature → Softmax → Sampling is the generation pipeline for every single token.
- Temperature controls the creativity/focus tradeoff — low for code and facts, high for creativity.
- Top-p (nucleus sampling) is the modern default — adaptive, handling confident and uncertain positions differently.
- Perplexity measures model confidence — the lower, the better, but it doesn't measure accuracy or safety.
- In production: always set temperature, top_p, and max_tokens explicitly — never rely on defaults for important applications.