
Master the technology behind ChatGPT and Gemini. This comprehensive course takes you from understanding simple text prediction to building sophisticated, autonomous AI agents capable of solving real-world enterprise challenges.
A beginner-friendly deep dive into neurons, weights, layers, and backpropagation — the engine underneath every LLM you'll ever build with.
Before you can understand GPT-4, Claude, or any modern LLM, you need to understand what's actually running underneath — a neural network. Everything else in this course builds on top of this foundation. Don't skip it.
A neural network is a mathematical function made of many interconnected layers of simple computational units called neurons. It takes numbers as input, performs a series of weighted calculations, and produces numbers as output. During training, it adjusts its internal numbers (weights) to get better at a task.
The word "neural" refers to a loose biological inspiration — the human brain's neurons fire signals to each other. But don't take the analogy too far. Artificial neural networks are just matrices of numbers being multiplied together at enormous scale. This is not a brain; it is linear algebra.
When you tune an LLM's temperature, write a system prompt, or add RAG — you're working at the application layer. But when something breaks in production (hallucinations, repetition, token cutoff, unexpected behavior), understanding the neural network underneath is what separates an engineer who can debug from one who can only guess. This chapter is that foundation.
A single neuron is the atomic unit of a neural network. It does exactly three things: receive inputs, compute a weighted sum, and apply a non-linear function to the result.
Every neuron in a neural network performs this exact same operation — just with different weights and bias values. A network with millions of neurons is just this operation repeated, layered, and connected in a specific pattern.
output = f( w₁x₁ + w₂x₂ + w₃x₃ + b )
Where x = inputs, w = weights (learnable), b = bias (learnable), and f() = activation function. The entire complexity of a 175-billion-parameter model is just this formula, instantiated billions of times and arranged in layers.
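The formula above can be sketched in a few lines of code. This is a minimal illustration with made-up input, weight, and bias values (not taken from any real model), using ReLU as the activation:

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weighted sum of inputs plus bias, then a non-linearity (ReLU here)."""
    z = np.dot(w, x) + b          # w1*x1 + w2*x2 + w3*x3 + b
    return max(0.0, z)            # f() = ReLU activation

# Illustrative values only:
x = np.array([1.0, 2.0, 3.0])    # inputs
w = np.array([0.5, -0.2, 0.1])   # learnable weights
b = 0.05                          # learnable bias

print(neuron(x, w, b))  # 0.5*1 - 0.2*2 + 0.1*3 + 0.05 = 0.45
```

Every neuron in every layer of every model in this course is some variation of this function.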
If the neuron is the building block, weights and biases are where all the learned knowledge actually lives. After training a model on terabytes of text, the only thing that changes is the values of these numbers. The architecture stays fixed — only the weights change.
Controls how much influence each input has on this neuron's output. A large positive weight means "this input strongly increases my output." A large negative weight means "this input strongly suppresses my output." Zero means "I ignore this input."
An offset added to the weighted sum before the activation. It shifts the entire activation curve left or right — letting the neuron fire (activate) even when all inputs are zero, or requiring a stronger signal before firing.
This is the entire calculation for one neuron, one forward pass. A network with 1,000 neurons per layer just does this 1,000 times in parallel — using matrix multiplication for speed. In GPT-3, every one of the 175 billion weights participates in the forward pass, so a single prediction involves hundreds of billions of multiply-add operations.
The weights of a trained LLM are not just random numbers — they encode everything the model learned from terabytes of training text. The difference between a randomly initialized model (which produces garbage) and a trained model (which writes essays) is entirely in the values of these weights. Training = finding the weight values that minimize the model's prediction error.
Without an activation function, a neural network — no matter how many layers deep — is mathematically equivalent to a single linear transformation. It could never learn anything more complex than a straight line. Activation functions introduce non-linearity, which is what allows networks to learn complex patterns like language, vision, and reasoning.
Most common in hidden layers of early deep nets. Fast to compute. Sets all negative values to 0, which can cause the "dead ReLU" problem: a neuron whose input is always negative outputs zero forever and stops learning.
Used in GPT, BERT, and most modern LLMs. Smoother than ReLU — allows a small negative output for negative inputs. Better gradient flow during training.
Squeezes any value to between 0 and 1. Used in binary classification output layers and gates in LSTMs. Suffers from "vanishing gradients" in deep networks.
Converts a vector of logits into a probability distribution. Used exclusively in the final output layer of LLMs to produce token probabilities. Covered deeply in Chapter 3.
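The four activation functions above are each a few lines of numpy. The GELU shown here is the tanh approximation (the variant used in GPT-2's reference code); the exact values are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))      # negatives killed: [0. 0. 2.]
print(gelu(z))      # small negative output survives for z = -2
print(sigmoid(z))   # each value squeezed into (0, 1)
print(softmax(z))   # a probability distribution that sums to 1.0
```

Comparing `relu(z)` and `gelu(z)` at `z = -2` shows the key difference: ReLU outputs exactly 0, while GELU lets a small negative value through, which keeps gradients flowing.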
Imagine stacking 96 layers of purely linear transformations (matrix multiplications with no activation). Mathematically, all 96 layers collapse into a single matrix multiplication — equivalent to having just one layer. The activation function after each layer is what makes the composition of layers genuinely "deep" and capable of learning hierarchical, non-linear patterns like the structure of language.
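The collapse argument above can be verified directly. With small made-up matrices, two linear layers are exactly one combined matrix, and a single ReLU in between breaks that equivalence:

```python
import numpy as np

x  = np.array([1.0, -1.0])
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[1.0, 1.0], [0.0, 1.0]])

# Two linear layers with no activation...
two_layers = W2 @ (W1 @ x)
# ...equal one layer whose weight matrix is the product W2 @ W1:
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# With a non-linearity in between, the collapse no longer holds:
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(np.allclose(two_layers, with_relu))  # False
```

The same algebra applies to 96 layers: without activations, the entire stack is one matrix multiplication.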
A single neuron can only draw one linear boundary. Stack layers of neurons, and the network can learn arbitrarily complex decision boundaries. This is what "deep learning" means — many layers stacked, each learning progressively more abstract representations.
Receives raw data. In an LLM, this is the token embedding vectors. No computation — just passes data in. Width = embedding dimension (e.g., 4096).
Every neuron connects to every neuron in the previous layer. The "Feed Forward Network" (FFN) in each Transformer block is two dense layers. Most parameters live here.
The special sauce of Transformers. Not a traditional dense layer — instead it learns which parts of the input context to "attend to" when computing each position's output.
Normalizes activations within each layer to have mean ~0 and variance ~1. Prevents training instability in deep networks. Applied before or after attention in Transformers.
Projects the final hidden state to a vector of size equal to the vocabulary (~100K). Each value is a logit. Softmax converts these to token probabilities.
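Of the layer types above, layer normalization is simple enough to sketch fully. This is a minimal single-vector version (real implementations normalize per token across the embedding dimension; `gain` and `bias` are the learnable scale and shift):

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    """Normalize a hidden vector to mean ~0 and variance ~1, then rescale."""
    mean = h.mean()
    var = h.var()
    return gain * (h - mean) / np.sqrt(var + eps) + bias

h = np.array([2.0, 4.0, 6.0, 8.0])
normed = layer_norm(h, gain=np.ones(4), bias=np.zeros(4))
print(normed.mean())  # ~0
print(normed.var())   # ~1
```

Whatever scale the activations drift to during training, each layer sees inputs in a predictable range, which is what keeps very deep networks stable.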
The forward pass is the process of data flowing from the input layer, through every hidden layer, to the output layer — producing a prediction. During inference (when you use a chatbot), only the forward pass happens. During training, it's followed by the backward pass.
Token embeddings enter Layer 1. Shape: [seq_len × embed_dim]. Each token is a vector of numbers.
Each layer applies its weights + activation. Output becomes input to the next layer. Shape preserved or transformed.
After the last Transformer block, we have a rich contextual representation of every token in the sequence.
LM Head projects the last token's hidden state to 100K logits. Softmax converts to probabilities. Sample next token.
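Step 4 above can be sketched end to end. The vocabulary and logits here are toy values (a real LM head produces ~100K logits); greedy decoding stands in for the sampling strategies covered later:

```python
import numpy as np

# Toy vocabulary and made-up logits for the last position:
vocab  = ["cat", "sat", "mat", "the"]
logits = np.array([1.2, 3.5, 0.3, 2.0])

# Softmax turns logits into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

# Greedy decoding: pick the highest-probability token.
next_token = vocab[int(np.argmax(probs))]
print(next_token)             # "sat"
print(round(probs.sum(), 6))  # 1.0
```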
At its core, the forward pass is a sequence of matrix multiplications. The input vector gets multiplied by the weight matrix of Layer 1, producing the Layer 1 output vector. That gets multiplied by Layer 2's weight matrix. And so on. GPUs are specialized hardware for executing these matrix multiplications in parallel at massive scale — which is why GPUs are essential for training and running LLMs.
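That chain of matrix multiplications looks like this in miniature. The layer dimensions are made up for illustration; each loop iteration is "multiply by this layer's weight matrix, then apply the activation":

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy 3-layer forward pass with random (untrained) weights:
layers = [rng.normal(size=(8, 16)),   # layer 1: 16 -> 8
          rng.normal(size=(8, 8)),    # layer 2: 8 -> 8
          rng.normal(size=(4, 8))]    # layer 3: 8 -> 4

h = rng.normal(size=16)               # stand-in for a token embedding
for W in layers:
    h = np.maximum(0.0, W @ h)        # matrix multiply, then ReLU
print(h.shape)  # (4,)
```

A GPU runs each `W @ h` across thousands of cores at once, which is the entire reason GPUs dominate LLM workloads.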
Training a neural network is the process of finding the weight values that make the model good at its task. This is done by showing the network many examples, measuring its error, and nudging the weights in the direction that reduces that error. The algorithm that computes these nudges is backpropagation.
Feed a batch of training examples through the network. Get predictions from the output layer.
Compare predictions to correct answers using a loss function (e.g., cross-entropy). Get a single number: how wrong the model is.
Use backpropagation to compute the gradient of the loss with respect to every weight. Gradients = "which direction should I change each weight?"
Gradient descent: adjust every weight slightly in the direction that reduces loss. Repeat from step 1 with the next batch.
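The four steps above can be run end to end on the smallest possible model: one linear neuron fitting y = 2x + 1. The gradients here are derived by hand with the chain rule; backpropagation automates exactly this bookkeeping for deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=100)
ys = 2.0 * xs + 1.0          # training data: the target function

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    pred = w * xs + b                       # 1. forward pass
    loss = np.mean((pred - ys) ** 2)        # 2. loss (mean squared error)
    grad_w = np.mean(2 * (pred - ys) * xs)  # 3. gradient of loss w.r.t. w
    grad_b = np.mean(2 * (pred - ys))       #    ...and w.r.t. b
    w -= lr * grad_w                        # 4. gradient descent update
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # 2.0 1.0
```

Training an LLM is this same loop with billions of parameters, cross-entropy instead of squared error, and backpropagation computing the gradients automatically.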
| Parameter | Before Update | Gradient (∂Loss/∂w) | Learning Rate | After Update | Effect |
|---|---|---|---|---|---|
| w₁ | 0.850 | +0.120 (increase w₁ increases loss) | 0.01 | 0.8488 (−0.0012) | Loss decreased ↓ |
| w₂ | −0.430 | −0.080 (decrease w₂ increases loss) | 0.01 | −0.4292 (+0.0008) | Loss decreased ↓ |
| w₃ | 0.210 | +0.005 (tiny gradient — nearly optimal) | 0.01 | 0.20995 (−0.00005) | Minimal change |
| bias b | 0.300 | +0.350 (large gradient — far from optimal) | 0.01 | 0.2965 (−0.0035) | Still converging |
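Every row in the table above comes from the same one-line update rule, which you can check directly:

```python
# The gradient descent update rule: new_weight = old_weight - learning_rate * gradient
params    = [0.850, -0.430, 0.210, 0.300]   # w1, w2, w3, b (before update)
gradients = [0.120, -0.080, 0.005, 0.350]   # dLoss/dw for each parameter
lr = 0.01

updated = [round(p - lr * g, 5) for p, g in zip(params, gradients)]
print(updated)  # [0.8488, -0.4292, 0.20995, 0.2965]
```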
The learning rate (0.01 in the example above) controls how large each weight update step is. Too large: the model overshoots the optimal weights and diverges — training becomes unstable. Too small: training takes forever and may get stuck in poor local minima. Modern LLM training uses adaptive learning rate schedulers (AdamW) that automatically adjust the rate during training. This is why the learning rate schedule is so critical in LLM training.
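Both failure modes are easy to demonstrate on a toy loss. Minimizing loss(w) = w², whose gradient is 2w, shows overshoot, healthy convergence, and crawling all from the same update rule:

```python
def descend(lr, steps=50, w=1.0):
    """Run gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(descend(lr=0.1)))    # converges toward 0
print(abs(descend(lr=1.1)))    # overshoots: |w| grows every step (diverges)
print(abs(descend(lr=0.001)))  # converges, but barely moves in 50 steps
```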
Everything you've learned in this chapter is the engine inside every LLM. The Transformer architecture is just a specific way of arranging these components — neurons, weights, activations, layers — that happens to work extraordinarily well for language.
When you hear "GPT-3 has 175B parameters" — those are 175 billion individual weight values across all neurons in all layers. Each is a floating-point number updated during training.
GPT-3 has 96 "layers" — each one is a Transformer block containing an Attention layer and a Feed-Forward (dense) layer. Each block refines the representation of every token.
Modern LLMs use GELU (not ReLU) as their activation function inside the FFN sublayer of each Transformer block. GELU's smooth gradient helps training stability.
The LLM's training loss is cross-entropy between the predicted token distribution and the actual next token. Minimizing this = teaching the model to predict text well.
LLMs use the AdamW optimizer — an improved gradient descent that adapts learning rates per parameter and adds weight decay. It's what makes training 175B parameters stable.
Every layer is a matrix multiplication. GPUs have thousands of cores specialized for parallel matrix math — a single A100 GPU can sustain up to 312 teraFLOPS (trillions of floating-point operations per second) of matrix multiplication.
The Transformer (Chapter 5) doesn't invent new types of computation — it's still neurons, weights, activations, and backprop. What's novel is its architecture: how layers are arranged, how the Attention mechanism allows every token to "look at" every other token, and how this architecture scales exceptionally well with more data and more parameters. This chapter was your foundation. Chapter 5 builds the cathedral on top of it.
A neural network is a mathematical function — layers of neurons each computing a weighted sum + activation. Nothing magical.
Weights and biases are where knowledge lives — the only thing training changes. 175B parameters = 175B learnable numbers.
Activation functions add non-linearity — without them, deep networks collapse to single-layer equivalents. LLMs use GELU.
The forward pass = prediction. The backward pass = learning. Together, repeated millions of times = training.
Backpropagation computes gradients — which direction to nudge each weight. Gradient descent applies those nudges. Learning rate controls step size.
LLMs are neural networks with a specific architecture (Transformer), a specific loss (cross-entropy), and a specific optimizer (AdamW), trained at unprecedented scale.