---
title: "Architecture of an AD Engine"
subtitle: "From femtograd to micrograd"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Architecture of an AD Engine}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(femtograd)
```

## Introduction

femtograd is deliberately simple. Every operation creates an R6 object,
every backward pass walks a DAG of R closures, and every optimization
step rebuilds the entire computational graph from scratch. This makes
the code readable and debuggable---you can inspect any node, print any
gradient, and trace any computation in plain R.

But simplicity has costs. For large datasets or many parameters,
femtograd will be slower than production tools like TMB or torch.
This is by design: femtograd exists to teach, not to compete.

This vignette explains *why* femtograd is structured the way it is, and
what architectural changes would be needed to make it faster. We present
three tiers that any AD system progresses through, from pure-interpreter
implementations (where femtograd lives) to compiled backends (where
production tools live). Understanding these tiers makes production AD
tools less opaque.

## Tier 0: Graph-per-Call (femtograd today)

femtograd's current architecture rebuilds the computational graph on
every function evaluation. Consider what happens during one step of
gradient ascent:

```{r}
# Generate some data
set.seed(42)
x <- rexp(50, rate = 2)

# One step of the "inner loop":
# 1. Create fresh value objects from numeric parameters
params <- list(val(1.5))

# 2. Run the forward pass (builds the graph)
obj <- loglik_exponential(params[[1]], x)

# 3. Extract the scalar result
get_data(obj)

# 4. Run the backward pass (walks the graph)
backward(obj)

# 5. Extract the gradient
grad(params[[1]])
```

Steps 1--5 are the fundamental unit of work for any gradient-based
optimizer. femtograd's `gradient_ascent()`, `bfgs()`, and `lbfgs()` all
perform this sequence on every iteration.

### What happens under the hood

Each arithmetic operation (`+`, `*`, `log`, etc.) creates a new R6
`value` object on the R heap. The object stores:

- **data**: The numeric result (a matrix)
- **grad**: The gradient accumulator (initially zero)
- **prev**: Pointers to parent nodes (forming the DAG)
- **backward_fn**: An R closure encoding the local derivative

For a log-likelihood summing over $n$ observations, this creates
$O(n)$ nodes per evaluation. Each node involves R6 object allocation,
S3 method dispatch, and closure creation---all interpreted R operations.

### The cost profile

| Operation | Mechanism | Cost |
|-----------|-----------|------|
| Node creation | R6$new() per operation | Heap allocation + GC pressure |
| Forward pass | S3 dispatch per operation | Interpreter overhead |
| Backward pass | Walk DAG, call closures | Interpreted loop + closure calls |
| Graph structure | Rebuilt every iteration | Redundant work |

For a likelihood with 100 observations and 2 parameters, each optimizer
step creates hundreds of R6 objects, dispatches hundreds of S3 methods,
and discards the entire graph before the next step.

### Why this is the right choice for pedagogy

Despite the overhead, Tier 0 has critical advantages for learning:

1. **Inspectability**: You can `print()` any node and see its value,
   gradient, and parents
2. **Debuggability**: R's debugger works on every operation---set
   breakpoints in `backward_fn` closures
3. **Extensibility**: Adding a new operation means writing one R
   function with a closure
4. **No build step**: Pure R, no compilation, installs anywhere R runs

The code *is* the documentation. Reading `R/ops.R` teaches you how
multiplication's backward pass works. Reading `R/value.R` teaches you
how graphs are built. This transparency is the whole point.

## Tier 1: Tape-Based Tracing

The first major optimization is to separate **graph construction** from
**graph evaluation**. Instead of rebuilding the graph on every call,
trace it once and replay it.

### The idea

A *tape* is a linear record of operations. Instead of creating R6
objects during the forward pass, you record each operation as a tuple:

```
{operation, input_indices, output_index}
```

For example, the log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum x_i$
might produce a tape like:

```
[1] LOG    input=0   output=1      # log(lambda)
[2] MUL    input=1,2 output=3      # n * log(lambda)
[3] MUL    input=0,4 output=5      # lambda * sum_x
[4] SUB    input=3,5 output=6      # n*log(lambda) - lambda*sum_x
```

The numeric values live in a flat vector (the *value buffer*), not in
individual R6 objects. To evaluate at new parameter values, you update
slot 0 in the buffer and replay the tape---no object creation, no S3
dispatch, no garbage collection.

### What this buys you

| Aspect | Tier 0 | Tier 1 |
|--------|--------|--------|
| Graph construction | Every call | Once |
| Memory pattern | R6 objects on heap | Flat numeric vectors |
| Dispatch overhead | S3 per operation | Switch/table per operation |
| Backward pass | Walk DAG of closures | Reverse-iterate the tape |

The key insight is that the *structure* of the computation doesn't
change between optimizer iterations---only the *values* do. Taping
exploits this by caching the structure.

### The trade-off

Taping requires that the computation's **control flow is static**: no
`if`/`else` branches that depend on parameter values. If the graph
structure can change (e.g., different code paths for positive vs.
negative parameters), the tape must be re-recorded.

This is exactly the distinction between TensorFlow 1.x's static graphs
(taped) and PyTorch's eager mode (graph-per-call, like femtograd).
JAX's `jit` is a hybrid: it traces once and re-traces only when the
control flow changes.

For MLE problems, the log-likelihood almost always has static structure,
so taping works well.

## Tier 2: Compiled Backends

The second optimization is to move the tape interpreter out of R
entirely.

### C++ via Rcpp

Even with taping, replaying the tape in R's interpreter has per-operation
overhead: function call setup, type checking, vector recycling rules.
Moving the tape interpreter to C++ via Rcpp eliminates this:
node data becomes contiguous C++ arrays, and the backward pass becomes
a tight loop with no R dispatch.

This is the approach taken by **TMB** (Template Model Builder), one of
the most widely used AD tools in statistics. TMB uses CppAD (a C++
automatic differentiation library) behind an R interface. Users write
their log-likelihood in C++ templates, and TMB compiles it into an
efficient tape that can be evaluated and differentiated without
returning to R.

The performance difference is substantial: TMB can handle models with
thousands of parameters and millions of observations, while femtograd
is practical for modest-sized problems.

### GPU acceleration

For problems where the bottleneck is arithmetic throughput rather than
per-node overhead---large matrix operations, batch likelihoods over
massive datasets---the tape can be evaluated on a GPU.

This is what **torch** (the R interface to PyTorch's libtorch) provides.
The R API stays the same; only the backend changes. For femtograd's
typical use case (small parameter spaces, moderate data), GPU
acceleration is rarely necessary. The bottleneck is per-node interpreter
overhead, not floating-point throughput, and GPUs don't help with that
until the tape is already compiled.

### The abstraction boundary

The key architectural insight is that Tier 2 doesn't change the *API*---it
changes the *backend*. In principle, you could keep femtograd's
`val()`, `backward()`, and `grad()` interface and swap the
implementation from R6 objects to a C++ tape engine. The user-facing
code would be identical; only the performance characteristics would
change.

This is why understanding Tier 0 is valuable even if you'll eventually
use a Tier 2 tool: the concepts (computational graphs, chain rule,
reverse-mode traversal) are the same at every tier. The tiers differ
only in how much interpreter overhead sits between the math and the
metal.

## Where femtograd Chooses to Live

femtograd stays at Tier 0 because its goal is not speed---it's
**understanding**. Moving to Tier 1 would hide the graph structure
behind a tape abstraction. Moving to Tier 2 would hide the operations
behind a C++ wall. Each tier trades transparency for performance.

For real-world statistical modeling:

- **Small problems** (< 10 parameters, < 1000 observations): femtograd
  is perfectly adequate
- **Medium problems**: Consider TMB for compiled AD with an R interface
- **Large problems** or deep learning: Use torch for GPU-accelerated
  tensor operations

But for *learning how automatic differentiation works*---for tracing a
gradient from output back to input, for understanding why the Hessian
gives you standard errors, for seeing the chain rule operate on real
code---Tier 0 is exactly right.

```{r}
# The whole pipeline, readable end to end:
set.seed(42)
data <- rnorm(100, mean = 5, sd = 2)

result <- fit(
  function(mu, log_sigma) {
    sigma <- exp(log_sigma)
    loglik_normal(mu, sigma, data)
  },
  params = c(mu = 0, log_sigma = 0)
)

summary(result)
```

The code above builds a graph, differentiates it, inverts the Hessian,
and reports standard errors---all in pure R, all inspectable, all
debuggable. That's the point.

## Summary

| Tier | Strategy | Example tools | Transparency | Speed |
|------|----------|--------------|--------------|-------|
| 0 | Graph-per-call (R6 + closures) | **femtograd** | Fully inspectable | Modest |
| 1 | Tape-based tracing | TensorFlow 1.x, JAX | Graph cached, less visible | Faster |
| 2 | Compiled backend (C++, GPU) | TMB, torch | Opaque backend | Production-grade |

Every tier implements the same mathematical ideas: computational graphs,
the chain rule, and reverse-mode traversal. The difference is how much
machinery sits between you and those ideas. femtograd minimizes that
machinery so you can see the math clearly.