The Foundation of All Neural Networks
Andrej Karpathy's micrograd tutorial is the perfect entry point to understanding neural networks. Instead of using massive tensor libraries, micrograd implements automatic differentiation using only scalar values - individual numbers like 1.0, -2.5, etc. This constraint forces you to understand what's actually happening in backpropagation.
The Core Insight: Everything is Just Addition and Multiplication
Neural networks seem complex, but they're just chains of basic operations. A single neuron computing tanh(w*x + b) is literally just:
Input: x = 2.0
Weight: w = -3.0
Bias: b = 1.0
Step 1: multiply → w*x = -6.0
Step 2: add → -6.0 + 1.0 = -5.0
Step 3: activation → tanh(-5.0) = -0.9999
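You can check those three steps with nothing but plain Python and the math module:

    import math

    x, w, b = 2.0, -3.0, 1.0
    print(w * x)                 # -6.0
    print(w * x + b)             # -5.0
    print(math.tanh(w * x + b))  # -0.9999...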
The Value Class: Building Blocks of Computation

The entire micrograd engine centers on a Value class that wraps scalars and tracks their computational history:
class Value:
  def __init__(self, data, _children=(), _op=''):
    self.data = data
    self.grad = 0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op

Each Value remembers:
- Its actual numeric value (data)
- Its gradient (grad)
- What operation created it (_op)
- What inputs created it (_prev)
Forward Pass: Building the Computation Graph
When you write c = a + b, micrograd creates a new Value that remembers it came from adding a and b:
    a = Value(-4.0)
    b = Value(2.0)
    c = a + b  # c.data = -2.0, c._prev = {a, b}, c._op = '+'

Complex expressions build directed acyclic graphs (DAGs):
    Expression: f = (a * b + b**3).relu()

Graph Structure:
a(-4.0) ──┐
          ├─→ mult ──┐
b(2.0) ───┘          ├─→ add ──→ relu ──→ f
          ┌──────────┘
b**3 ─────┘
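Under the hood, the arithmetic operators on Value are what record this structure. Here is a stripped-down sketch of __add__, with the gradient wiring left out until the backward pass below:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    # the new node stores its result, its parents, and the op that produced it
    out = Value(self.data + other.data, (self, other), '+')
    return out  # the full version also attaches a _backward closure here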
The Backward Pass: Chain Rule in Action

The magic happens in backpropagation. Starting from the output, gradients flow backward following the chain rule:
def backward(self):
    # Topological sort to get correct order
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    # Backward pass
    self.grad = 1
    for v in reversed(topo):
        v._backward()

Each operation implements its own backward function:
    # Addition: derivative is 1 for both inputs
    def _backward():
        self.grad += out.grad
        other.grad += out.grad
    # Multiplication: derivative uses chain rule
    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
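Putting the forward construction and the backward closure together, a complete operator looks roughly like this (multiplication shown here):

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        # local derivatives: d(out)/d(self) = other.data, d(out)/d(other) = self.data
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out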
Visualizing the Computation Graph

One of micrograd's coolest features is graph visualization. Here's what a simple neuron looks like:
    Input Layer:
    x₁ = 2.0  ──→ [w₁ = -3.0] ──→ mult ──┐
    x₂ = 0.0  ──→ [w₂ = 1.0]  ──→ mult ──┤
                                        ├──→ sum ──→ tanh ──→ output
    bias = 6.88 ────────────────────────┘

Each node shows both the forward value and the gradient:
    Node Format: [forward_value | gradient]
    Example: [2.0 | -3.0] means value=2.0, gradient=-3.0
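The video builds these diagrams with graphviz. Even without it, a small hypothetical helper that walks _prev recursively gives the same picture in text, printing each node in that [forward_value | gradient] format:

def trace(node, depth=0):
    # hypothetical helper, not part of micrograd: print a node, then the nodes that produced it
    label = node._op if node._op else 'input'
    print('  ' * depth + f"[{node.data:.4f} | {node.grad:.4f}]  ({label})")
    for child in node._prev:
        trace(child, depth + 1)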
Building Neural Networks

With automatic differentiation working, building neural networks becomes straightforward:
import random

class Neuron:
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
        self.nonlin = nonlin
    def __call__(self, x):
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.tanh() if self.nonlin else act
class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
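The Layer class used by MLP isn't shown in the excerpt above; it is essentially a list of Neurons that all read the same inputs (the parameters() bookkeeping is omitted here):

class Layer:
    def __init__(self, nin, nout):
        # nout independent neurons, each taking nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out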
Training Loop: Gradient Descent

Training follows the standard pattern:
    # Forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    # Zero gradients
    for p in parameters():
        p.grad = 0
    # Backward pass
    loss.backward()
    # Update parameters
    for p in parameters():
        p.data += -0.01 * p.grad
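Wrapped in a loop over a toy dataset, the whole procedure fits in about a dozen lines. A sketch, assuming the model exposes a parameters() method that collects every weight and bias (present in the full micrograd code, not shown in the excerpt above):

    n = MLP(3, [4, 4, 1])
    xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0]]  # toy inputs
    ys = [1.0, -1.0, -1.0]                                      # toy targets

    for step in range(100):
        ypred = [n(x) for x in xs]
        loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
        for p in n.parameters():      # assumes a parameters() helper
            p.grad = 0                # zero gradients
        loss.backward()               # backward pass
        for p in n.parameters():
            p.data += -0.01 * p.grad  # gradient descent step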
It's crazy to think that you can explain modern machine learning concepts with high school calculus and some mental gymnastics.

Comparison with PyTorch
The beauty of micrograd is its simplicity compared to PyTorch:
    # Micrograd (explicit)
    a = Value(2.0)
    b = Value(-3.0)
    c = a * b
    c.backward()
    print(a.grad)  # -3.0
    # PyTorch (tensor-based)
    a = torch.tensor(2.0, requires_grad=True)
    b = torch.tensor(-3.0, requires_grad=True)
    c = a * b
    c.backward()
    print(a.grad)  # tensor(-3.0)

Same mathematical principles, different abstractions.
The Gradient Flow Visualization
Understanding how gradients flow is crucial. Consider this expression:
    f = (x * y) + sin(x)

Gradient Flow:
- df/df = 1.0 (starting point)
- df/d(xy) = 1.0 (addition passes gradient through)
- df/dx = y * 1.0 + cos(x) * 1.0 (chain rule: x reaches f through both paths)
- df/dy = x * 1.0 (multiplication rule)
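A quick numerical check in plain Python (no autograd needed) confirms these formulas at a sample point:

    import math

    def f(x, y):
        return x * y + math.sin(x)

    x, y, h = 2.0, 3.0, 1e-6
    print((f(x + h, y) - f(x, y)) / h)  # ~2.5839
    print(y + math.cos(x))              # 2.5838..., matches df/dx = y + cos(x)
    print((f(x, y + h) - f(x, y)) / h)  # ~2.0, matches df/dy = x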
Key Implementation Details
Handling Repeated Variables
When a variable appears multiple times, gradients accumulate:
    a = Value(3.0)
    b = a + a  # a appears twice
    # Backward pass must accumulate: a.grad ends up as 1.0 + 1.0 = 2.0
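This is exactly why every _backward uses += rather than =, and it is easy to confirm with the engine itself:

    a = Value(3.0)
    b = a + a
    b.backward()
    print(a.grad)  # 2.0: both uses of a contribute, rather than one overwriting the other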
Topological Sorting

Getting the visiting order right is critical for correct gradient flow:
# Incorrect: might compute gradients before their dependencies
for node in nodes:
    node._backward()

# Correct: topological order ensures dependencies computed first
for node in topologically_sorted(nodes):
    node._backward()

Activation Functions
Each activation needs its derivative:
def tanh(self):  # a method on Value; needs import math
    t = (math.exp(2*self.data) - 1)/(math.exp(2*self.data) + 1)
    out = Value(t, (self,), 'tanh')
    def _backward():
        self.grad += (1 - t**2) * out.grad  # derivative of tanh
    out._backward = _backward
    return out
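relu follows the same recipe: compute the forward value, record the parent, and attach a _backward that applies the local derivative:

def relu(self):
    out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

    def _backward():
        # gradient flows through only where the output was positive
        self.grad += (out.data > 0) * out.grad
    out._backward = _backward

    return out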
The Learning Journey

Following this tutorial transformed my understanding:
- Neural networks aren't magic - they're just computational graphs
- I felt the gradients flowing through the network
- I can implement custom operations confidently
Modern Connections
While micrograd uses scalars, modern frameworks use tensors for efficiency:
# Micrograd: each number is a separate Value, one graph node per multiply and add
result = Value(0.0)
for i in range(1000):
    result += values[i] * weights[i]
# PyTorch: Vectorized operations
result = torch.dot(values, weights)  # All at once

Same math, different scale.
The 100-Line Revolution
The entire autograd engine is about 100 lines. This proves that the core ideas behind billion-parameter models are fundamentally simple. The complexity comes from scale, not from the underlying mathematics.
The complete micrograd implementation is available on GitHub. The tutorial video remains one of the clearest explanations of backpropagation ever created.