Building Backpropagation from Scratch: The Magic of Micrograd

May 24, 2025

The Foundation of All Neural Networks

Andrej Karpathy's micrograd tutorial is the perfect entry point to understanding neural networks. Instead of using massive tensor libraries, micrograd implements automatic differentiation using only scalar values - individual numbers like 1.0, -2.5, etc. This constraint forces you to understand what's actually happening in backpropagation.

The Core Insight: Everything is Just Addition and Multiplication

Neural networks seem complex, but they're just chains of basic operations. A neuron performing w*x + b is literally just:

Input: x = 2.0
Weight: w = -3.0
Bias: b = 1.0

Step 1: multiply  w*x = -6.0
Step 2: add  -6.0 + 1.0 = -5.0
Step 3: activation  tanh(-5.0) = -0.9999
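
You can verify those three steps with nothing but plain Python:

    import math

    x, w, b = 2.0, -3.0, 1.0
    pre_activation = w * x + b          # -6.0 + 1.0 = -5.0
    output = math.tanh(pre_activation)  # ≈ -0.99991, which rounds to -0.9999
    print(pre_activation, output)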

The Value Class: Building Blocks of Computation

The entire micrograd engine centers on a Value class that wraps scalars and tracks their computational history:

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data                 # the actual scalar value
        self.grad = 0                    # derivative of the final output w.r.t. this value
        self._backward = lambda: None    # closure that pushes the gradient to the inputs
        self._prev = set(_children)      # the Values this one was computed from
        self._op = _op                   # the operation that produced it ('+', '*', ...)

Each Value remembers:

  • Its actual numeric value (data)
  • Its gradient (grad)
  • What operation created it (_op)
  • What inputs created it (_prev)

Forward Pass: Building the Computation Graph

When you write c = a + b, micrograd creates a new Value that remembers it came from adding a and b:

    a = Value(-4.0)
    b = Value(2.0)
    c = a + b  # c.data = -2.0, c._prev = {a, b}, c._op = '+'
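
That bookkeeping happens through operator overloading. Here is a minimal sketch of __add__, leaving out the gradient closure that the backward-pass section adds later:

    class Value:
        # ... __init__ as above ...

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)  # allow a + 2.0
            out = Value(self.data + other.data, (self, other), '+')
            # out records its parents in _prev and the op that created it in _op
            return out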

Complex expressions build directed acyclic graphs (DAGs):

    Expression: f = (a * b + b**3).relu()

Graph Structure:

a(-4.0) ──┐
          ├─→ mult ────┐
b(2.0) ───┤            ├─→ add ──→ relu ──→ f
          └─→ b**3 ────┘
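
Building that expression with Value objects looks like this (assuming the class also overloads __mul__ and __pow__ and provides relu(), as the reference implementation does):

    a = Value(-4.0)
    b = Value(2.0)
    f = (a * b + b**3).relu()
    # forward values: a*b = -8.0, b**3 = 8.0, add = 0.0, relu(0.0) = 0.0
    print(f.data)  # 0.0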

The Backward Pass: Chain Rule in Action

The magic happens in backpropagation. Starting from the output, gradients flow backward following the chain rule:

def backward(self):
    # Topological sort to get correct order
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Backward pass
    self.grad = 1
    for v in reversed(topo):
        v._backward()
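
Once the operators are in place, a full round trip is only a few lines (the expression and values here are just illustrative):

    a = Value(2.0)
    b = Value(-3.0)
    c = a * b + b      # c.data = -9.0
    c.backward()
    print(a.grad)      # dc/da = b     = -3.0
    print(b.grad)      # dc/db = a + 1 =  3.0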

Each operation implements its own backward function:

    # Addition: derivative is 1 for both inputs
    def _backward():
        self.grad += out.grad
        other.grad += out.grad
    # Multiplication: derivative uses chain rule
    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
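
For context, here is how the multiplication fragment sits inside the full operator, with the closure attached to the freshly created Value (this mirrors the reference implementation; __add__ is analogous):

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # local derivative of each input, scaled by the upstream gradient
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out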

Visualizing the Computation Graph

One of micrograd's coolest features is graph visualization. Here's what a simple neuron looks like:

    Input Layer:
    x₁ = 2.0  ──→ [w₁ = -3.0] ──→ mult ──┐
    x₂ = 0.0  ──→ [w₂ = 1.0]  ──→ mult ──┤
                                         ├──→ sum ──→ tanh ──→ output
    bias = 6.88 ─────────────────────────┘

Each node shows both the forward value and the gradient:

    Node Format: [forward_value | gradient]
    Example: [2.0 | -3.0] means value=2.0, gradient=-3.0
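
The tutorial renders these graphs with graphviz. A condensed sketch of a draw_dot-style helper along those lines (assuming the graphviz Python package is installed; labels follow the [forward_value | gradient] format above):

    from graphviz import Digraph

    def trace(root):
        # walk backwards from the output, collecting every node and edge
        nodes, edges = set(), set()
        def build(v):
            if v not in nodes:
                nodes.add(v)
                for child in v._prev:
                    edges.add((child, v))
                    build(child)
        build(root)
        return nodes, edges

    def draw_dot(root):
        dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # left-to-right layout
        nodes, edges = trace(root)
        for n in nodes:
            uid = str(id(n))
            # one record node per Value: forward value | gradient
            dot.node(name=uid, label="{ %.4f | %.4f }" % (n.data, n.grad), shape='record')
            if n._op:
                # a small extra node for the operation that produced this Value
                dot.node(name=uid + n._op, label=n._op)
                dot.edge(uid + n._op, uid)
        for n1, n2 in edges:
            dot.edge(str(id(n1)), str(id(n2)) + n2._op)
        return dot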

Building Neural Networks

With automatic differentiation working, building neural networks becomes straightforward:

import random

class Neuron:
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
        self.nonlin = nonlin

    def __call__(self, x):
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.tanh() if self.nonlin else act

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
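
A Layer is just a list of neurons, and the MLP chains layers. Here is a minimal sketch in the same style, including the parameters() helpers that the training loop below relies on (it assumes Neuron also gains a one-line parameters() returning self.w + [self.b]):

    class Layer:
        def __init__(self, nin, nout):
            # nout independent neurons, each reading all nin inputs
            self.neurons = [Neuron(nin) for _ in range(nout)]

        def __call__(self, x):
            outs = [n(x) for n in self.neurons]
            return outs[0] if len(outs) == 1 else outs

        def parameters(self):
            return [p for n in self.neurons for p in n.parameters()]

    class MLP:
        def __init__(self, nin, nouts):
            sz = [nin] + nouts
            self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]

        def __call__(self, x):
            for layer in self.layers:
                x = layer(x)   # the output of one layer is the input to the next
            return x

        def parameters(self):
            return [p for layer in self.layers for p in layer.parameters()]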

Training Loop: Gradient Descent

Training follows the standard pattern:

    # Forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

    # Zero gradients
    for p in n.parameters():
        p.grad = 0

    # Backward pass
    loss.backward()

    # Update parameters
    for p in n.parameters():
        p.data += -0.01 * p.grad
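
Wrapped in an epoch loop, this becomes a complete toy training script. A sketch, assuming the MLP pieces above and a small made-up dataset:

    n = MLP(3, [4, 4, 1])                        # 3 inputs, two hidden layers, 1 output
    xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5],    # hypothetical toy inputs
          [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
    ys = [1.0, -1.0, -1.0, 1.0]                  # desired outputs

    for epoch in range(100):
        # forward pass
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))

        # zero gradients, then backpropagate
        for p in n.parameters():
            p.grad = 0
        loss.backward()

        # gradient descent step
        for p in n.parameters():
            p.data += -0.01 * p.grad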

It's crazy to think that you can explain modern machine learning concepts with high school calculus and some mental gymnastics.

Comparison with PyTorch

The beauty of micrograd is its simplicity compared to PyTorch:

    # Micrograd (explicit)
    a = Value(2.0)
    b = Value(-3.0)
    c = a * b
    c.backward()
    print(a.grad)  # -3.0

    # PyTorch (tensor-based)
    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(-3.0, requires_grad=True)
    c = a * b
    c.backward()
    print(a.grad)  # tensor(-3.0)

Same mathematical principles, different abstractions.

The Gradient Flow Visualization

Understanding how gradients flow is crucial. Consider this expression:

    f = (x * y) + sin(x)

Gradient flow, step by step:

    df/df    = 1.0                        (starting point)
    df/d(xy) = 1.0                        (addition passes the gradient through)
    df/dx    = y * 1.0 + cos(x) * 1.0     (chain rule: x reaches f via both paths)
    df/dy    = x * 1.0                    (multiplication rule)
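
sin isn't one of the built-in operations, but this is exactly the kind of custom op you can add once you know the local derivative (d/dx sin(x) = cos(x)). A sketch of what a sin method on Value could look like:

    import math

    def sin(self):
        out = Value(math.sin(self.data), (self,), 'sin')

        def _backward():
            # local derivative of sin is cos, scaled by the upstream gradient
            self.grad += math.cos(self.data) * out.grad
        out._backward = _backward

        return out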

Key Implementation Details

Handling Repeated Variables

When a variable appears multiple times, gradients accumulate:

    a = Value(3.0)
    b = a + a  # a appears twice
    # Backward pass must accumulate: a.grad += 1.0 + 1.0 = 2.0
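
A quick check that the += accumulation does the right thing:

    a = Value(3.0)
    b = a + a
    b.backward()
    print(a.grad)  # 2.0, one unit of gradient from each use of a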

Topological Sorting

Critical for correct gradient flow:

# Incorrect: might compute gradients before their dependencies
for node in nodes:
    node._backward()

# Correct: topological order ensures dependencies computed first
for node in topologically_sorted(nodes):
    node._backward()

Activation Functions

Each activation needs its derivative:

def tanh(self):
    # uses math.exp, so the module needs `import math`
    t = (math.exp(2*self.data) - 1)/(math.exp(2*self.data) + 1)
    out = Value(t, (self,), 'tanh')

    def _backward():
        self.grad += (1 - t**2) * out.grad  # derivative of tanh
    out._backward = _backward
    return out
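
The same pattern covers other activations. micrograd also ships a relu, which in this style looks roughly like this; its local derivative is just 0 or 1:

    def relu(self):
        out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            # gradient flows through only where the input was positive
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out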

The Learning Journey

Following this tutorial transformed my understanding:

  1. Neural networks aren't magic - they're just computational graphs
  2. I felt the gradients flowing through the network
  3. I can implement custom operations confidently

Modern Connections

While micrograd uses scalars, modern frameworks use tensors for efficiency:

# Micrograd: each number is a separate Value, accumulated one at a time
result = Value(0.0)
for i in range(1000):
    result += values[i] * weights[i]

# PyTorch: Vectorized operations
result = torch.dot(values, weights)  # All at once

Same math, different scale.

The 100-Line Revolution

The entire autograd engine is about 100 lines. This proves that the core ideas behind billion-parameter models are fundamentally simple. The complexity comes from scale, not from the underlying mathematics.


The complete micrograd implementation is available on GitHub. The tutorial video remains one of the clearest explanations of backpropagation ever created.