---
title: "How Automatic Differentiation Works"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How Automatic Differentiation Works}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(femtograd)
```

## Why Automatic Differentiation?

In statistics, we constantly need derivatives:

- **MLE**: Find parameters that maximize the log-likelihood (gradient = 0)
- **Standard errors**: Derived from the Hessian (matrix of second derivatives)
- **Newton's method**: Uses both gradient and Hessian for optimization
- **Fisher information**: The expected negative Hessian

There are three ways to compute derivatives:

| Method | Pros | Cons |
|--------|------|------|
| Symbolic | Exact, closed-form | Explosion of terms, hard to automate |
| Numerical (finite differences) | Easy to implement | Approximation errors, slow for Hessians |
| **Automatic differentiation** | **Exact, efficient, automated** | Requires computational graph |

Femtograd uses automatic differentiation (AD), specifically reverse-mode AD
(backpropagation) for gradients and forward-over-reverse AD for Hessians.

## The Computational Graph

Every computation in femtograd builds a directed acyclic graph (DAG).
Each node is a `value` object storing both a number and its gradient.

Consider $f(x, y) = x^2 + x \cdot y$:

```
      [z = v3 + v4]        <- output
       /          \
  [v3 = x^2]   [v4 = x*y]  <- intermediate
      |          /    \
     [x]       [x]   [y]    <- inputs (leaves)
```

In femtograd:

```{r}
x <- val(3, label = "x")
y <- val(4, label = "y")

v3 <- x^2         # v3 = 9
v4 <- x * y       # v4 = 12
z  <- v3 + v4     # z  = 21

get_data(z)
```

## The Forward Pass

When you write `x^2 + x * y`, femtograd executes each operation and
records the computation graph. This is the **forward pass**: compute the
function value while building the graph structure.

Each operation creates a new `value` node that stores:

- **data**: The computed result (a matrix)
- **grad**: Gradient, initially zero
- **prev**: Pointers to parent nodes
- **backward_fn**: A closure that knows how to propagate gradients backward

## Reverse-Mode AD (Backpropagation)

To compute gradients, we traverse the graph in reverse topological order.
This is the **backward pass**.

The key insight is the **chain rule**. For a composition $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In reverse mode, we start from the output and propagate
$\frac{\partial \text{output}}{\partial \text{node}}$ backward through
each operation.

```{r}
# After building the graph above, compute gradients:
backward(z)

# dz/dx = 2x + y = 2(3) + 4 = 10
grad(x)

# dz/dy = x = 3
grad(y)
```

### Step by Step

Here is what `backward(z)` does internally:

1. **Initialize**: Set $\frac{\partial z}{\partial z} = 1$
2. **Topological sort**: Order nodes so each node comes after its children
3. **Reverse traverse**: For each node, call its `backward_fn`

For our example $z = x^2 + x \cdot y$:

| Step | Node | Local derivative | Accumulated gradient |
|------|------|-----------------|---------------------|
| Init | z | $\partial z / \partial z = 1$ | `grad(z) = 1` |
| 1 | z = v3 + v4 | $\partial z / \partial v_3 = 1$, $\partial z / \partial v_4 = 1$ | `grad(v3) += 1`, `grad(v4) += 1` |
| 2 | v4 = x * y | $\partial v_4 / \partial x = y$, $\partial v_4 / \partial y = x$ | `grad(x) += 1 * 4 = 4`, `grad(y) += 1 * 3 = 3` |
| 3 | v3 = x^2 | $\partial v_3 / \partial x = 2x$ | `grad(x) += 1 * 6 = 6` |
| **Result** | | | `grad(x) = 10`, `grad(y) = 3` |

### Gradient Accumulation

Notice that `grad(x)` receives contributions from **both** paths through the
graph (via $v_3$ and $v_4$). This is gradient accumulation---it naturally
handles the multivariate chain rule when a variable appears in multiple
sub-expressions.

## Why Reverse Mode?

For a function $f : \mathbb{R}^n \to \mathbb{R}$ (scalar output, many inputs),
reverse mode computes **all** $n$ partial derivatives in a single backward pass.

This is exactly the setting for MLE: the log-likelihood is a scalar function
of $p$ parameters. One backward pass gives us the entire gradient vector.

Forward mode, by contrast, computes one directional derivative per pass---it
would need $p$ passes for the full gradient.

| Mode | Cost | Best for |
|------|------|----------|
| Forward | $O(n)$ passes for $n$ inputs | Few inputs, many outputs |
| Reverse | $O(1)$ passes for any $n$ | **Many inputs, scalar output (MLE!)** |

## Second Derivatives: Forward-over-Reverse

For Hessians (second-derivative matrices), femtograd uses **forward-over-reverse**
AD. This nests forward-mode AD inside reverse-mode:

1. **Forward mode** (using `dual` numbers): Propagate tangent vectors through
   the computation
2. **Reverse mode**: Differentiate the tangent computation to get second
   derivatives

```{r}
# Compute Hessian of f(x,y) = x^2 + x*y
f <- function(params) {
  x <- params[[1]]
  y <- params[[2]]
  x^2 + x * y
}

H <- hessian(f, list(val(3), val(4)))
H
```

The Hessian is:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$$

### Dual Numbers

Forward-mode AD uses **dual numbers**: pairs $(v, \dot{v})$ where $v$ is
the primal value and $\dot{v}$ is the tangent (directional derivative).

Arithmetic on dual numbers automatically propagates derivatives:

- $(a, \dot{a}) + (b, \dot{b}) = (a + b, \dot{a} + \dot{b})$
- $(a, \dot{a}) \times (b, \dot{b}) = (a \cdot b, a \cdot \dot{b} + \dot{a} \cdot b)$

In femtograd, the primal is a `value` object, so the tangent computation
builds its own computational graph. Running `backward()` on the tangent then
gives the second derivative.

## Connection to Statistics

### Fisher Information and Standard Errors

At the MLE $\hat{\theta}$, the observed Fisher information matrix is:

$$\mathcal{I}(\hat{\theta}) = -H(\hat{\theta})$$

where $H$ is the Hessian of the log-likelihood. The variance-covariance
matrix of the MLE is approximately:

$$\text{Var}(\hat{\theta}) \approx \mathcal{I}(\hat{\theta})^{-1} = (-H)^{-1}$$

Standard errors are the square roots of the diagonal of this matrix.

```{r}
# Generate data
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)

# Fit model - femtograd computes the Hessian automatically
result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, x)
  },
  params = c(mu = 0, log_sigma = 0)
)

# Standard errors from inverse Fisher information
se(result)

# For comparison: theoretical SE for mu is sigma/sqrt(n)
2 / sqrt(200)
```

### Newton-Raphson Optimization

Newton's method uses the Hessian to take optimal steps:

$$\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)$$

This converges quadratically near the optimum, while gradient-only methods
converge linearly. Femtograd's `method = "newton"` option in `fit()` uses
this approach.

## Summary

| Concept | In femtograd | Purpose |
|---------|-------------|---------|
| Forward pass | `z <- x^2 + x*y` | Build graph, compute value |
| Backward pass | `backward(z)` | Compute all gradients |
| Gradient access | `grad(x)` | Get $\partial z / \partial x$ |
| Hessian | `hessian(f, params)` | Second derivatives for SEs |
| Dual numbers | Internal to `hessian()` | Forward-over-reverse AD |
| Fisher information | `observed_info(result)` | $-H$ at MLE |

The computational graph approach gives us exact derivatives (no numerical
approximation) efficiently (one backward pass for all parameters), which
is precisely what statistical inference requires.