
Master the technology behind ChatGPT and Gemini. This comprehensive course takes you from understanding simple text prediction to building sophisticated, autonomous AI agents capable of solving real-world enterprise challenges.
How machines turn words into numbers — understanding the fundamental building blocks that power every modern LLM, from tokenization to vector space geometry.
LLMs don't read text the way humans do — letter by letter or word by word. Instead, they break text into chunks called tokens. Everything an LLM processes — input and output — is a stream of tokens. Tokens are the atoms of language for an AI.
A token is the smallest unit of text that an LLM processes. It can be a full word, a part of a word (sub-word), a punctuation mark, a number, or even a single character — depending on the tokenizer algorithm.
"sky", "blue", "the" — common short words usually become a single token each.
"unhappiness" splits into "un" + "happiness" — rare or long words are broken at meaningful boundaries.
"2024" may split into "20" + "24" — the tokenizer has no semantic understanding of numbers.
"नमस्ते" may split into 4–6 tokens — Hindi and other scripts are often over-tokenized by English-first vocabularies.
On average, 1 token ≈ 4 characters of English text, or roughly ¾ of a word. So 100 words ≈ 130–140 tokens. This ratio changes dramatically for code, math, and non-Latin scripts — always verify with the actual tokenizer.
The process of converting raw text into tokens is called tokenization. Modern LLMs use an algorithm called Byte Pair Encoding (BPE) or one of its variants (WordPiece, SentencePiece). BPE starts from a base alphabet of bytes or characters and repeatedly merges the most frequent adjacent pair into a new vocabulary entry, until the vocabulary reaches a target size.
A pure word vocabulary would need millions of entries — every name, technical term, conjugation, and typo would need its own slot. BPE solves this: unknown words like "GPT-4o" can always be represented as a sequence of known sub-tokens, so the model never sees a truly unknown input.
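The merge loop behind BPE can be sketched in a few lines of pure Python. This is a toy, character-level version over a tiny hand-picked corpus — real tokenizers such as tiktoken operate on bytes, apply pre-tokenization, and learn tens of thousands of merges — but the core idea is the same: count adjacent pairs, fuse the most frequent one, repeat.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly fuse the most frequent
    adjacent symbol pair into a new vocabulary entry."""
    # Start with every word as a sequence of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)   # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(corpus)   # [['low'], ['lowe', 'r'], ['lowe', 's', 't'], ['low']]
```

Notice how "low" collapses to a single token after two merges, while the rarer "lowest" stays split into sub-tokens — exactly the behavior described above for common versus rare words.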
The model doesn't actually see text strings — it sees integers. Every token in the vocabulary is assigned a unique integer ID. The model receives a list of these integers as input and outputs integers that are then decoded back to text.
| Token (text) | Token ID (integer) | Notes |
|---|---|---|
| hello | 15339 | Common word → single token |
| Transform | 27313 | Start of a longer word |
| ers | 364 | Sub-word suffix for "Transformers" |
| 20 | 508 | Part of "2017" — split by tokenizer |
| 17 | 1558 | Second part of "2017" |
| <|endoftext|> | 50256 | Special token — signals end of document |
Modern tokenizers include special tokens beyond normal text: <|im_start|> marks the start of a message, <|im_end|> marks the end, [PAD] fills batches to equal length, and [MASK] is used during training. These are essential for structuring conversations and training signals — they're invisible to users but critical to the model.
Token IDs are just arbitrary integers — the number 15339 for "hello" tells the model nothing about what "hello" means. To give words meaning, we convert each token ID into an embedding: a dense vector of floating-point numbers.
An embedding is a fixed-length list of numbers (a vector) that represents a token's meaning in a multi-dimensional mathematical space. Similar words end up with similar vectors — "king" and "queen" are close together, "king" and "pizza" are far apart.
Each token ID maps to a row in a giant lookup table called the Embedding Matrix. If the vocabulary is 50,000 tokens and the embedding dimension is 4,096 (as in LLaMA 7B), this matrix is 50,000 × 4,096 — about 200 million numbers, all learned during training.
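The lookup itself is nothing more than row indexing, which a minimal sketch makes concrete. The sizes and random values here are toy stand-ins — a real model's matrix is learned during training, not sampled:

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 8, 4   # toy sizes; LLaMA-scale is ~50,000 x 4,096

# The embedding matrix: one (learned) row of floats per token ID.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Embedding lookup is just row indexing: token ID -> vector."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))   # 3 tokens in, each mapped to a 4-dim vector
```

The same token ID always retrieves the same row — which is exactly why static embeddings can't distinguish "bank" in "river bank" from "bank account", a limitation contextual models fix later in this section.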
In such a matrix, the rows for "king" and "queen" have very similar values across most dimensions — they occupy nearby points in the vector space, while the row for "pizza" is completely different. This geometric closeness is semantic similarity, encoded mathematically.
The most mind-bending property of embeddings is that mathematical operations on vectors correspond to semantic operations on meaning. The famous example:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
The model learns that the direction from "man" to "woman" represents gender, and it can apply that direction to any word. This arithmetic works because the embedding space has structure — directions encode relationships, not just positions.
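A hand-crafted toy example shows the arithmetic at work. Here dimension 1 encodes "gender" and dimension 2 encodes "royalty" by construction — real embeddings learn such directions implicitly, across thousands of dimensions:

```python
# Toy 3-D vectors: dim 1 = "gender" direction, dim 2 = "royalty" direction.
vecs = {
    "man":   [1.0, 0.0, 0.5],
    "woman": [1.0, 1.0, 0.5],
    "king":  [1.0, 0.0, 2.0],
    "queen": [1.0, 1.0, 2.0],
    "pizza": [0.0, 0.5, -1.0],   # unrelated distractor
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# vec("king") - vec("man") + vec("woman")
result = add(sub(vecs["king"], vecs["man"]), vecs["woman"])

# Nearest word to the result, excluding the three input words:
nearest = min((w for w in vecs if w not in {"king", "man", "woman"}),
              key=lambda w: dist(vecs[w], result))
print(nearest)   # queen
```

Subtracting "man" removes the male component, adding "woman" restores the female one, and the royalty dimension carries through untouched — landing exactly on "queen".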
Real embeddings are hundreds or thousands of dimensions. We use techniques like t-SNE or PCA to project them down to 2D for visualization, where words with similar meanings cluster together.
This isn't hand-crafted — the model learns this clustering entirely from co-occurrence patterns in training text. Words that appear in similar contexts end up geometrically close.
The primary way to measure how similar two embeddings are is cosine similarity — it measures the angle between two vectors. A score of 1.0 means identical direction (same meaning), 0 means unrelated (orthogonal), and −1 means opposite direction.
similarity(A, B) = (A · B) / (|A| × |B|)
The dot product A · B measures how much two vectors "point in the same direction." Dividing by their magnitudes normalizes for length, so only direction matters — not how large the numbers are.
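The formula translates directly into a few lines of Python — a minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """similarity(A, B) = (A . B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [1, 2, 3]))   # ~1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))         # 0.0   (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))        # -1.0  (opposite direction)
print(cosine_similarity([1, 2], [10, 20]))       # ~1.0  (magnitude is ignored)
```

The last call is the key property: [1, 2] and [10, 20] point the same way, so their similarity is 1.0 despite very different magnitudes — only direction matters.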
Static embeddings (Word2Vec, GloVe) give "bank" the same vector regardless of context. Transformer-based models generate contextual embeddings — the vector for "bank" in "river bank" is genuinely different from the vector in "bank account" because the entire surrounding sentence influences the representation. This is a key reason Transformers outperform older embedding methods.
Embeddings are not just an internal mechanism — they are the foundation of many real-world AI applications you'll build as an engineer.
Convert documents and queries into embeddings. Return the documents with highest cosine similarity to the query — even if no exact keywords match. Google, Notion, and Linear all use this.
Store your knowledge base as embeddings in a vector database. At query time, retrieve the most similar chunks and inject them into the LLM's context. This gives LLMs access to live, private data.
Embed users and items in the same space. Users are "close" to items they'd like. Netflix and Spotify use embedding-based recommendation at scale.
Group customer support tickets, tag documents, or detect duplicate content — all by comparing embedding distances rather than keyword matching.
Multilingual models embed "cat" (English) and "बिल्ली" (Hindi) close together in the same space — enabling zero-shot translation and cross-language search.
Flag documents or user inputs that are far from any known cluster — useful for detecting prompt injection, fraud, or off-topic inputs in production LLM systems.
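The semantic-search and RAG-retrieval patterns above share one core step: rank stored vectors by cosine similarity to the query vector. A minimal sketch, using hand-assigned toy 3-D vectors in place of real embedding-model output (a production system would call an embeddings API and a vector database instead):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Toy vectors standing in for real embedding output.
docs = {
    "How to reset your password":     [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.1, 0.9, 0.1],
    "Troubleshooting login failures": [0.8, 0.2, 0.1],
}
query_vec = [0.75, 0.2, 0.15]   # embedding of "I can't sign in"

# Rank documents by similarity to the query -- no keyword overlap needed.
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
print(ranked[0])   # Troubleshooting login failures
```

Note that the top result shares no keywords with "I can't sign in" — the match is purely geometric, which is exactly what keyword search cannot do.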
| Model | Dimensions | Best for |
|---|---|---|
| OpenAI text-embedding-3-large | 3,072 | High-accuracy semantic search, RAG pipelines |
| OpenAI text-embedding-3-small | 1,536 | Cost-efficient production workloads |
| Google text-embedding-004 | 768 | GCP-native apps, multilingual tasks |
| Cohere embed-v3 | 1,024 | Enterprise search, document retrieval |
| BAAI/bge-m3 (open source) | 1,024 | Self-hosted, multilingual, free to use |
Understanding tokens and embeddings at a theoretical level isn't enough. Here are the mistakes that trip up engineers in production.
Emojis, code blocks, and non-Latin scripts consume far more tokens than expected. A single emoji can be 2–4 tokens. Always run your actual prompts through a tokenizer (tiktoken for OpenAI models) before estimating API costs at scale.
You cannot compare embeddings generated by different models. If you create your vector database with text-embedding-3-small and later switch to bge-m3, you must re-embed your entire corpus. Treat your embedding model as a schema — changing it is a migration.
Embedding a 10-page document as one vector loses detail — the embedding averages out everything. Embedding individual sentences loses context. The sweet spot for RAG is usually 256–512 token chunks with 50-token overlap. Chunk size is a hyperparameter you must tune for your domain.
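A sliding-window chunker with overlap is only a few lines. This sketch operates on an already-tokenized sequence; the placeholder string "tokens" stand in for real tokenizer output:

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token sequence into overlapping chunks for RAG.
    Each chunk shares `overlap` tokens with the previous one so
    that content at chunk boundaries keeps surrounding context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Placeholder "tokens" standing in for real tokenizer output.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens, chunk_size=256, overlap=50)
print(len(chunks), [len(c) for c in chunks])   # 3 [256, 256, 188]
```

Both `chunk_size` and `overlap` are the hyperparameters mentioned above — benchmark them against your own retrieval task rather than trusting the defaults.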
Cosine similarity measures angle, not magnitude. For some tasks (e.g., detecting exact duplicates), L2 distance or dot product similarity may work better. Many vector databases support multiple distance metrics — pick based on how your embedding model was trained.
When building prompts, remember: "128K context window" means 128,000 tokens — roughly 96,000 words in English, but far fewer in code-heavy or Hindi text. Always estimate token counts programmatically, never by word count alone.
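For a quick programmatic ballpark, the ~4-characters-per-token heuristic can be wrapped in a helper. This is an estimate only — for real billing or context-limit checks, run the text through the model's actual tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from the ~4-chars-per-token heuristic.
    Only valid for ballparking English prose -- code, math, and
    non-Latin scripts tokenize far more densely, so always confirm
    with the model's real tokenizer before relying on the number."""
    return int(len(text) / chars_per_token)

prompt = "The Transformer architecture was introduced in 2017."
print(estimate_tokens(prompt))   # 13 (52 characters / 4)
```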
✓ Count tokens with the actual tokenizer before shipping prompts to production.
✓ Store your embedding model version alongside your vector database.
✓ Benchmark chunk sizes for your specific document type and retrieval task.
✓ Monitor token usage per request to avoid bill shock at scale.
✓ Test similarity thresholds on your data — don't assume 0.8 is a good cutoff.
Tokens are the atoms of LLM input/output — sub-words learned by BPE, mapped to integer IDs, never raw text or characters. Embeddings convert token IDs into dense vectors — giving mathematical meaning to language. Similar words live geometrically close. Cosine similarity measures semantic closeness — the foundation of semantic search, RAG, and recommendation. Contextual embeddings (from Transformers) are far more powerful than static ones — "bank" means something different in every sentence, and the vector reflects that. In production — always count tokens, version your embedding model, and tune your chunk size.
What does vec("king") − vec("man") + vec("woman") ≈ vec("queen") tell us about how embeddings encode meaning?