
Master the technology behind ChatGPT and Gemini. This comprehensive course takes you from understanding simple text prediction to building sophisticated, autonomous AI agents capable of solving real-world enterprise challenges.
How machines turn words into numbers — understanding the fundamental building blocks that power every modern LLM, from tokenization to vector space geometry.
LLMs don't read text the way humans do — letter by letter or word by word. Instead, they break text into chunks called tokens. Everything an LLM processes — input and output — is a stream of tokens. Tokens are the atoms of language for an AI.
A token is the smallest unit of text that an LLM processes. It can be a full word, a part of a word (sub-word), a punctuation mark, a number, or even a single character — depending on the tokenizer algorithm.
"sky", "blue", "the" — common short words usually become a single token each.
"unhappiness" splits into "un" + "happiness" — rare or long words are broken at meaningful boundaries.
"2024" may split into "20" + "24" — the tokenizer has no semantic understanding of numbers.
"नमस्ते" may split into 4–6 tokens — Hindi and other scripts are often over-tokenized by English-first vocabularies.
On average, 1 token ≈ 4 characters of English text, or roughly ¾ of a word. So 100 words ≈ 130–140 tokens. This ratio changes dramatically for code, math, and non-Latin scripts — always verify with the actual tokenizer.
The process of converting raw text into tokens is called tokenization. Modern LLMs use an algorithm called Byte Pair Encoding (BPE) or one of its variants (WordPiece, SentencePiece). BPE starts from a base alphabet of bytes or characters and repeatedly merges the most frequent adjacent pair into a new vocabulary entry, until the vocabulary reaches a target size.
A pure word vocabulary would need millions of entries — every name, technical term, conjugation, and typo would need its own slot. BPE solves this: unknown words like "GPT-4o" can always be represented as a sequence of known sub-tokens, so the model never sees a truly unknown input.
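The merge loop behind BPE can be sketched in a few lines of pure Python. This is a toy, character-level version over a tiny hand-picked corpus — real tokenizers such as tiktoken operate on bytes, apply pre-tokenization, and learn tens of thousands of merges — but the core idea is the same: count adjacent pairs, fuse the most frequent one, repeat.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: repeatedly fuse the most frequent
    adjacent symbol pair into a new vocabulary entry."""
    # Start with every word as a sequence of single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)   # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(corpus)   # [['low'], ['lowe', 'r'], ['lowe', 's', 't'], ['low']]
```

Notice how "low" collapses to a single token after two merges, while the rarer "lowest" stays split into sub-tokens — exactly the behavior described above for common versus rare words.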
The model doesn't actually see text strings — it sees integers. Every token in the vocabulary is assigned a unique integer ID. The model receives a list of these integers as input and outputs integers that are then decoded back to text.
| Token (text) | Token ID (integer) | Notes |
|---|---|---|
| hello | 15339 | Common word → single token |
| Transform | 27313 | Start of a longer word |
| ers | 364 | Sub-word suffix for "Transformers" |
| 20 | 508 | Part of "2017" — split by tokenizer |
| 17 | 1558 | Second part of "2017" |
| <|endoftext|> | 50256 | Special token — signals end of document |
Modern tokenizers include special tokens beyond normal text: <|im_start|> marks the start of a message, <|im_end|> marks the end, [PAD] fills batches to equal length, and [MASK] is used during training. These are essential for structuring conversations and training signals — they're invisible to users but critical to the model.
Token IDs are just arbitrary integers — the number 15339 for "hello" tells the model nothing about what "hello" means. To give words meaning, we convert each token ID into an embedding: a dense vector of floating-point numbers.
An embedding is a fixed-length list of numbers (a vector) that represents a token's meaning in a multi-dimensional mathematical space. Similar words end up with similar vectors — "king" and "queen" are close together, "king" and "pizza" are far apart.
Each token ID maps to a row in a giant lookup table called the Embedding Matrix. If the vocabulary is 50,000 tokens and the embedding dimension is 4,096 (as in LLaMA 7B), this matrix is 50,000 × 4,096 — about 200 million numbers, all learned during training.
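The lookup itself is nothing more than row indexing, which a minimal sketch makes concrete. The sizes and random values here are toy stand-ins — a real model's matrix is learned during training, not sampled:

```python
import random

random.seed(0)
VOCAB_SIZE, EMBED_DIM = 8, 4   # toy sizes; LLaMA-scale is ~50,000 x 4,096

# The embedding matrix: one (learned) row of floats per token ID.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Embedding lookup is just row indexing: token ID -> vector."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))   # 3 tokens in, each mapped to a 4-dim vector
```

The same token ID always retrieves the same row — which is exactly why static embeddings can't distinguish "bank" in "river bank" from "bank account", a limitation contextual models fix later in this section.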
In such a matrix, the rows for "king" and "queen" have very similar values across most dimensions — they occupy nearby points in the vector space, while the row for "pizza" is completely different. This geometric closeness is semantic similarity, encoded mathematically.
The most mind-bending property of embeddings is that mathematical operations on vectors correspond to semantic operations on meaning. The famous example:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")
The model learns that the direction from "man" to "woman" represents gender, and it can apply that direction to any word. This arithmetic works because the embedding space has structure — directions encode relationships, not just positions.
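A hand-crafted toy example shows the arithmetic at work. Here dimension 1 encodes "gender" and dimension 2 encodes "royalty" by construction — real embeddings learn such directions implicitly, across thousands of dimensions:

```python
# Toy 3-D vectors: dim 1 = "gender" direction, dim 2 = "royalty" direction.
vecs = {
    "man":   [1.0, 0.0, 0.5],
    "woman": [1.0, 1.0, 0.5],
    "king":  [1.0, 0.0, 2.0],
    "queen": [1.0, 1.0, 2.0],
    "pizza": [0.0, 0.5, -1.0],   # unrelated distractor
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# vec("king") - vec("man") + vec("woman")
result = add(sub(vecs["king"], vecs["man"]), vecs["woman"])

# Nearest word to the result, excluding the three input words:
nearest = min((w for w in vecs if w not in {"king", "man", "woman"}),
              key=lambda w: dist(vecs[w], result))
print(nearest)   # queen
```

Subtracting "man" removes the male component, adding "woman" restores the female one, and the royalty dimension carries through untouched — landing exactly on "queen".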
Real embeddings are hundreds or thousands of dimensions. We use techniques like t-SNE or PCA to project them down to 2D for visualization, where words with similar meanings cluster together.
This isn't hand-crafted — the model learns this clustering entirely from co-occurrence patterns in training text. Words that appear in similar contexts end up geometrically close.
The primary way to measure how similar two embeddings are is cosine similarity — it measures the angle between two vectors. A score of 1.0 means identical direction (same meaning), 0 means unrelated (orthogonal), and −1 means opposite direction.
similarity(A, B) = (A · B) / (|A| × |B|)
The dot product A · B measures how much two vectors "point in the same direction." Dividing by their magnitudes normalizes for length, so only direction matters — not how large the numbers are.
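The formula translates directly into a few lines of Python — a minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """similarity(A, B) = (A . B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [1, 2, 3]))   # ~1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))         # 0.0   (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))        # -1.0  (opposite direction)
print(cosine_similarity([1, 2], [10, 20]))       # ~1.0  (magnitude is ignored)
```

The last call is the key property: [1, 2] and [10, 20] point the same way, so their similarity is 1.0 despite very different magnitudes — only direction matters.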
Static embeddings (Word2Vec, GloVe) give "bank" the same vector regardless of context. Transformer-based models generate contextual embeddings — the vector for "bank" in "river bank" is genuinely different from the vector in "bank account" because the entire surrounding sentence influences the representation. This is a key reason Transformers outperform older embedding methods.
Embeddings are not just an internal mechanism — they are the foundation of many real-world AI applications you'll build as an engineer.
Convert documents and queries into embeddings. Return the documents with highest cosine similarity to the query — even if no exact keywords match. Google, Notion, and Linear all use this.
Store your knowledge base as embeddings in a vector database. At query time, retrieve the most similar chunks and inject them into the LLM's context. This gives LLMs access to live, private data.
Embed users and items in the same space. Users are "close" to items they'd like. Netflix and Spotify use embedding-based recommendation at scale.
Group customer support tickets, tag documents, or detect duplicate content — all by comparing embedding distances rather than keyword matching.
Multilingual models embed "cat" (English) and "बिल्ली" (Hindi) close together in the same space — enabling zero-shot translation and cross-language search.
Flag documents or user inputs that are far from any known cluster — useful for detecting prompt injection, fraud, or off-topic inputs in production LLM systems.
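The semantic-search and RAG-retrieval patterns above share one core step: rank stored vectors by cosine similarity to the query vector. A minimal sketch, using hand-assigned toy 3-D vectors in place of real embedding-model output (a production system would call an embeddings API and a vector database instead):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Toy vectors standing in for real embedding output.
docs = {
    "How to reset your password":     [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.1, 0.9, 0.1],
    "Troubleshooting login failures": [0.8, 0.2, 0.1],
}
query_vec = [0.75, 0.2, 0.15]   # embedding of "I can't sign in"

# Rank documents by similarity to the query -- no keyword overlap needed.
ranked = sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)
print(ranked[0])   # Troubleshooting login failures
```

Note that the top result shares no keywords with "I can't sign in" — the match is purely geometric, which is exactly what keyword search cannot do.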
| Model | Dimensions | Best for |
|---|---|---|
| OpenAI text-embedding-3-large | 3,072 | High-accuracy semantic search, RAG pipelines |
| OpenAI text-embedding-3-small | 1,536 | Cost-efficient production workloads |
| Google text-embedding-004 | 768 | GCP-native apps, multilingual tasks |
| Cohere embed-v3 | 1,024 | Enterprise search, document retrieval |
| BAAI/bge-m3 (open source) | 1,024 | Self-hosted, multilingual, free to use |
Understanding tokens and embeddings at a theoretical level isn't enough. Here are the mistakes that trip up engineers in production.
Emojis, code blocks, and non-Latin scripts consume far more tokens than expected. A single emoji can be 2–4 tokens. Always run your actual prompts through a tokenizer (tiktoken for OpenAI models) before estimating API costs at scale.
You cannot compare embeddings generated by different models. If you create your vector database with text-embedding-3-small and later switch to bge-m3, you must re-embed your entire corpus. Treat your embedding model as a schema — changing it is a migration.
Embedding a 10-page document as one vector loses detail — the embedding averages out everything. Embedding individual sentences loses context. The sweet spot for RAG is usually 256–512 token chunks with 50-token overlap. Chunk size is a hyperparameter you must tune for your domain.
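A sliding-window chunker with overlap is only a few lines. This sketch operates on an already-tokenized sequence; the placeholder string "tokens" stand in for real tokenizer output:

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token sequence into overlapping chunks for RAG.
    Each chunk shares `overlap` tokens with the previous one so
    that content at chunk boundaries keeps surrounding context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Placeholder "tokens" standing in for real tokenizer output.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens, chunk_size=256, overlap=50)
print(len(chunks), [len(c) for c in chunks])   # 3 [256, 256, 188]
```

Both `chunk_size` and `overlap` are the hyperparameters mentioned above — benchmark them against your own retrieval task rather than trusting the defaults.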
Cosine similarity measures angle, not magnitude. For some tasks (e.g., detecting exact duplicates), L2 distance or dot product similarity may work better. Many vector databases support multiple distance metrics — pick based on how your embedding model was trained.
When building prompts, remember: "128K context window" means 128,000 tokens — roughly 96,000 words in English, but far fewer in code-heavy or Hindi text. Always estimate token counts programmatically, never by word count alone.
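For a quick programmatic ballpark, the ~4-characters-per-token heuristic can be wrapped in a helper. This is an estimate only — for real billing or context-limit checks, run the text through the model's actual tokenizer (e.g. tiktoken for OpenAI models):

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from the ~4-chars-per-token heuristic.
    Only valid for ballparking English prose -- code, math, and
    non-Latin scripts tokenize far more densely, so always confirm
    with the model's real tokenizer before relying on the number."""
    return int(len(text) / chars_per_token)

prompt = "The Transformer architecture was introduced in 2017."
print(estimate_tokens(prompt))   # 13 (52 characters / 4)
```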
✓ Count tokens with the actual tokenizer before shipping prompts to production.
✓ Store your embedding model version alongside your vector database.
✓ Benchmark chunk sizes for your specific document type and retrieval task.
✓ Monitor token usage per request to avoid bill shock at scale.
✓ Test similarity thresholds on your data — don't assume 0.8 is a good cutoff.
Tokens are the atoms of LLM input/output — sub-words learned by BPE, mapped to integer IDs, never raw text or characters. Embeddings convert token IDs into dense vectors — giving mathematical meaning to language. Similar words live geometrically close. Cosine similarity measures semantic closeness — the foundation of semantic search, RAG, and recommendation. Contextual embeddings (from Transformers) are far more powerful than static ones — "bank" means something different in every sentence, and the vector reflects that. In production — always count tokens, version your embedding model, and tune your chunk size.
What does vec("king") − vec("man") + vec("woman") ≈ vec("queen") tell us about how embeddings encode meaning?