--- title: "How Automatic Differentiation Works" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How Automatic Differentiation Works} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(femtograd) ``` ## Why Automatic Differentiation? In statistics, we constantly need derivatives: - **MLE**: Find parameters that maximize the log-likelihood (gradient = 0) - **Standard errors**: Derived from the Hessian (matrix of second derivatives) - **Newton's method**: Uses both gradient and Hessian for optimization - **Fisher information**: The expected negative Hessian There are three ways to compute derivatives: | Method | Pros | Cons | |--------|------|------| | Symbolic | Exact, closed-form | Explosion of terms, hard to automate | | Numerical (finite differences) | Easy to implement | Approximation errors, slow for Hessians | | **Automatic differentiation** | **Exact, efficient, automated** | Requires computational graph | Femtograd uses automatic differentiation (AD), specifically reverse-mode AD (backpropagation) for gradients and forward-over-reverse AD for Hessians. ## The Computational Graph Every computation in femtograd builds a directed acyclic graph (DAG). Each node is a `value` object storing both a number and its gradient. Consider $f(x, y) = x^2 + x \cdot y$: ``` [z = v3 + v4] <- output / \ [v3 = x^2] [v4 = x*y] <- intermediate | / \ [x] [x] [y] <- inputs (leaves) ``` In femtograd: ```{r} x <- val(3, label = "x") y <- val(4, label = "y") v3 <- x^2 # v3 = 9 v4 <- x * y # v4 = 12 z <- v3 + v4 # z = 21 get_data(z) ``` ## The Forward Pass When you write `x^2 + x * y`, femtograd executes each operation and records the computation graph. This is the **forward pass**: compute the function value while building the graph structure. Each operation creates a new `value` node that stores: - **data**: The computed result (a matrix) - **grad**: Gradient, initially zero - **prev**: Pointers to parent nodes - **backward_fn**: A closure that knows how to propagate gradients backward ## Reverse-Mode AD (Backpropagation) To compute gradients, we traverse the graph in reverse topological order. This is the **backward pass**. The key insight is the **chain rule**. For a composition $f(g(x))$: $$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$ In reverse mode, we start from the output and propagate $\frac{\partial \text{output}}{\partial \text{node}}$ backward through each operation. ```{r} # After building the graph above, compute gradients: backward(z) # dz/dx = 2x + y = 2(3) + 4 = 10 grad(x) # dz/dy = x = 3 grad(y) ``` ### Step by Step Here is what `backward(z)` does internally: 1. **Initialize**: Set $\frac{\partial z}{\partial z} = 1$ 2. **Topological sort**: Order nodes so each node comes after its children 3. **Reverse traverse**: For each node, call its `backward_fn` For our example $z = x^2 + x \cdot y$: | Step | Node | Local derivative | Accumulated gradient | |------|------|-----------------|---------------------| | Init | z | $\partial z / \partial z = 1$ | `grad(z) = 1` | | 1 | z = v3 + v4 | $\partial z / \partial v_3 = 1$, $\partial z / \partial v_4 = 1$ | `grad(v3) += 1`, `grad(v4) += 1` | | 2 | v4 = x * y | $\partial v_4 / \partial x = y$, $\partial v_4 / \partial y = x$ | `grad(x) += 1 * 4 = 4`, `grad(y) += 1 * 3 = 3` | | 3 | v3 = x^2 | $\partial v_3 / \partial x = 2x$ | `grad(x) += 1 * 6 = 6` | | **Result** | | | `grad(x) = 10`, `grad(y) = 3` | ### Gradient Accumulation Notice that `grad(x)` receives contributions from **both** paths through the graph (via $v_3$ and $v_4$). This is gradient accumulation---it naturally handles the multivariate chain rule when a variable appears in multiple sub-expressions. ## Why Reverse Mode? For a function $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output, many inputs), reverse mode computes **all** $n$ partial derivatives in a single backward pass. This is exactly the setting for MLE: the log-likelihood is a scalar function of $p$ parameters. One backward pass gives us the entire gradient vector. Forward mode, by contrast, computes one directional derivative per pass---it would need $p$ passes for the full gradient. | Mode | Cost | Best for | |------|------|----------| | Forward | $O(n)$ passes for $n$ inputs | Few inputs, many outputs | | Reverse | $O(1)$ passes for any $n$ | **Many inputs, scalar output (MLE!)** | ## Second Derivatives: Forward-over-Reverse For Hessians (second-derivative matrices), femtograd uses **forward-over-reverse** AD. This nests forward-mode AD inside reverse-mode: 1. **Forward mode** (using `dual` numbers): Propagate tangent vectors through the computation 2. **Reverse mode**: Differentiate the tangent computation to get second derivatives ```{r} # Compute Hessian of f(x,y) = x^2 + x*y f <- function(params) { x <- params[[1]] y <- params[[2]] x^2 + x * y } H <- hessian(f, list(val(3), val(4))) H ``` The Hessian is: $$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$$ ### Dual Numbers Forward-mode AD uses **dual numbers**: pairs $(v, \dot{v})$ where $v$ is the primal value and $\dot{v}$ is the tangent (directional derivative). Arithmetic on dual numbers automatically propagates derivatives: - $(a, \dot{a}) + (b, \dot{b}) = (a + b, \dot{a} + \dot{b})$ - $(a, \dot{a}) \times (b, \dot{b}) = (a \cdot b, a \cdot \dot{b} + \dot{a} \cdot b)$ In femtograd, the primal is a `value` object, so the tangent computation builds its own computational graph. Running `backward()` on the tangent then gives the second derivative. ## Connection to Statistics ### Fisher Information and Standard Errors At the MLE $\hat{\theta}$, the observed Fisher information matrix is: $$\mathcal{I}(\hat{\theta}) = -H(\hat{\theta})$$ where $H$ is the Hessian of the log-likelihood. The variance-covariance matrix of the MLE is approximately: $$\text{Var}(\hat{\theta}) \approx \mathcal{I}(\hat{\theta})^{-1} = (-H)^{-1}$$ Standard errors are the square roots of the diagonal of this matrix. ```{r} # Generate data set.seed(42) x <- rnorm(200, mean = 5, sd = 2) # Fit model - femtograd computes the Hessian automatically result <- fit( function(mu, log_sigma) { sigma <- exp(log_sigma) loglik_normal(mu, sigma, x) }, params = c(mu = 0, log_sigma = 0) ) # Standard errors from inverse Fisher information se(result) # For comparison: theoretical SE for mu is sigma/sqrt(n) 2 / sqrt(200) ``` ### Newton-Raphson Optimization Newton's method uses the Hessian to take optimal steps: $$\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)$$ This converges quadratically near the optimum, while gradient-only methods converge linearly. Femtograd's `method = "newton"` option in `fit()` uses this approach. ## Summary | Concept | In femtograd | Purpose | |---------|-------------|---------| | Forward pass | `z <- x^2 + x*y` | Build graph, compute value | | Backward pass | `backward(z)` | Compute all gradients | | Gradient access | `grad(x)` | Get $\partial z / \partial x$ | | Hessian | `hessian(f, params)` | Second derivatives for SEs | | Dual numbers | Internal to `hessian()` | Forward-over-reverse AD | | Fisher information | `observed_info(result)` | $-H$ at MLE | The computational graph approach gives us exact derivatives (no numerical approximation) efficiently (one backward pass for all parameters), which is precisely what statistical inference requires.