--- title: "Architecture of an AD Engine" subtitle: "From femtograd to micrograd" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Architecture of an AD Engine} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(femtograd) ``` ## Introduction femtograd is deliberately simple. Every operation creates an R6 object, every backward pass walks a DAG of R closures, and every optimization step rebuilds the entire computational graph from scratch. This makes the code readable and debuggable---you can inspect any node, print any gradient, and trace any computation in plain R. But simplicity has costs. For large datasets or many parameters, femtograd will be slower than production tools like TMB or torch. This is by design: femtograd exists to teach, not to compete. This vignette explains *why* femtograd is structured the way it is, and what architectural changes would be needed to make it faster. We present three tiers that any AD system progresses through, from pure-interpreter implementations (where femtograd lives) to compiled backends (where production tools live). Understanding these tiers makes production AD tools less opaque. ## Tier 0: Graph-per-Call (femtograd today) femtograd's current architecture rebuilds the computational graph on every function evaluation. Consider what happens during one step of gradient ascent: ```{r} # Generate some data set.seed(42) x <- rexp(50, rate = 2) # One step of the "inner loop": # 1. Create fresh value objects from numeric parameters params <- list(val(1.5)) # 2. Run the forward pass (builds the graph) obj <- loglik_exponential(params[[1]], x) # 3. Extract the scalar result get_data(obj) # 4. Run the backward pass (walks the graph) backward(obj) # 5. Extract the gradient grad(params[[1]]) ``` Steps 1--5 are the fundamental unit of work for any gradient-based optimizer. femtograd's `gradient_ascent()`, `bfgs()`, and `lbfgs()` all perform this sequence on every iteration. ### What happens under the hood Each arithmetic operation (`+`, `*`, `log`, etc.) creates a new R6 `value` object on the R heap. The object stores: - **data**: The numeric result (a matrix) - **grad**: The gradient accumulator (initially zero) - **prev**: Pointers to parent nodes (forming the DAG) - **backward_fn**: An R closure encoding the local derivative For a log-likelihood summing over $n$ observations, this creates $O(n)$ nodes per evaluation. Each node involves R6 object allocation, S3 method dispatch, and closure creation---all interpreted R operations. ### The cost profile | Operation | Mechanism | Cost | |-----------|-----------|------| | Node creation | R6$new() per operation | Heap allocation + GC pressure | | Forward pass | S3 dispatch per operation | Interpreter overhead | | Backward pass | Walk DAG, call closures | Interpreted loop + closure calls | | Graph structure | Rebuilt every iteration | Redundant work | For a likelihood with 100 observations and 2 parameters, each optimizer step creates hundreds of R6 objects, dispatches hundreds of S3 methods, and discards the entire graph before the next step. ### Why this is the right choice for pedagogy Despite the overhead, Tier 0 has critical advantages for learning: 1. **Inspectability**: You can `print()` any node and see its value, gradient, and parents 2. **Debuggability**: R's debugger works on every operation---set breakpoints in `backward_fn` closures 3. **Extensibility**: Adding a new operation means writing one R function with a closure 4. **No build step**: Pure R, no compilation, installs anywhere R runs The code *is* the documentation. Reading `R/ops.R` teaches you how multiplication's backward pass works. Reading `R/value.R` teaches you how graphs are built. This transparency is the whole point. ## Tier 1: Tape-Based Tracing The first major optimization is to separate **graph construction** from **graph evaluation**. Instead of rebuilding the graph on every call, trace it once and replay it. ### The idea A *tape* is a linear record of operations. Instead of creating R6 objects during the forward pass, you record each operation as a tuple: ``` {operation, input_indices, output_index} ``` For example, the log-likelihood $\ell(\lambda) = n \log \lambda - \lambda \sum x_i$ might produce a tape like: ``` [1] LOG input=0 output=1 # log(lambda) [2] MUL input=1,2 output=3 # n * log(lambda) [3] MUL input=0,4 output=5 # lambda * sum_x [4] SUB input=3,5 output=6 # n*log(lambda) - lambda*sum_x ``` The numeric values live in a flat vector (the *value buffer*), not in individual R6 objects. To evaluate at new parameter values, you update slot 0 in the buffer and replay the tape---no object creation, no S3 dispatch, no garbage collection. ### What this buys you | Aspect | Tier 0 | Tier 1 | |--------|--------|--------| | Graph construction | Every call | Once | | Memory pattern | R6 objects on heap | Flat numeric vectors | | Dispatch overhead | S3 per operation | Switch/table per operation | | Backward pass | Walk DAG of closures | Reverse-iterate the tape | The key insight is that the *structure* of the computation doesn't change between optimizer iterations---only the *values* do. Taping exploits this by caching the structure. ### The trade-off Taping requires that the computation's **control flow is static**: no `if`/`else` branches that depend on parameter values. If the graph structure can change (e.g., different code paths for positive vs. negative parameters), the tape must be re-recorded. This is exactly the distinction between TensorFlow 1.x's static graphs (taped) and PyTorch's eager mode (graph-per-call, like femtograd). JAX's `jit` is a hybrid: it traces once and re-traces only when the control flow changes. For MLE problems, the log-likelihood almost always has static structure, so taping works well. ## Tier 2: Compiled Backends The second optimization is to move the tape interpreter out of R entirely. ### C++ via Rcpp Even with taping, replaying the tape in R's interpreter has per-operation overhead: function call setup, type checking, vector recycling rules. Moving the tape interpreter to C++ via Rcpp eliminates this: node data becomes contiguous C++ arrays, and the backward pass becomes a tight loop with no R dispatch. This is the approach taken by **TMB** (Template Model Builder), one of the most widely used AD tools in statistics. TMB uses CppAD (a C++ automatic differentiation library) behind an R interface. Users write their log-likelihood in C++ templates, and TMB compiles it into an efficient tape that can be evaluated and differentiated without returning to R. The performance difference is substantial: TMB can handle models with thousands of parameters and millions of observations, while femtograd is practical for modest-sized problems. ### GPU acceleration For problems where the bottleneck is arithmetic throughput rather than per-node overhead---large matrix operations, batch likelihoods over massive datasets---the tape can be evaluated on a GPU. This is what **torch** (the R interface to PyTorch's libtorch) provides. The R API stays the same; only the backend changes. For femtograd's typical use case (small parameter spaces, moderate data), GPU acceleration is rarely necessary. The bottleneck is per-node interpreter overhead, not floating-point throughput, and GPUs don't help with that until the tape is already compiled. ### The abstraction boundary The key architectural insight is that Tier 2 doesn't change the *API*---it changes the *backend*. In principle, you could keep femtograd's `val()`, `backward()`, and `grad()` interface and swap the implementation from R6 objects to a C++ tape engine. The user-facing code would be identical; only the performance characteristics would change. This is why understanding Tier 0 is valuable even if you'll eventually use a Tier 2 tool: the concepts (computational graphs, chain rule, reverse-mode traversal) are the same at every tier. The tiers differ only in how much interpreter overhead sits between the math and the metal. ## Where femtograd Chooses to Live femtograd stays at Tier 0 because its goal is not speed---it's **understanding**. Moving to Tier 1 would hide the graph structure behind a tape abstraction. Moving to Tier 2 would hide the operations behind a C++ wall. Each tier trades transparency for performance. For real-world statistical modeling: - **Small problems** (< 10 parameters, < 1000 observations): femtograd is perfectly adequate - **Medium problems**: Consider TMB for compiled AD with an R interface - **Large problems** or deep learning: Use torch for GPU-accelerated tensor operations But for *learning how automatic differentiation works*---for tracing a gradient from output back to input, for understanding why the Hessian gives you standard errors, for seeing the chain rule operate on real code---Tier 0 is exactly right. ```{r} # The whole pipeline, readable end to end: set.seed(42) data <- rnorm(100, mean = 5, sd = 2) result <- fit( function(mu, log_sigma) { sigma <- exp(log_sigma) loglik_normal(mu, sigma, data) }, params = c(mu = 0, log_sigma = 0) ) summary(result) ``` The code above builds a graph, differentiates it, inverts the Hessian, and reports standard errors---all in pure R, all inspectable, all debuggable. That's the point. ## Summary | Tier | Strategy | Example tools | Transparency | Speed | |------|----------|--------------|--------------|-------| | 0 | Graph-per-call (R6 + closures) | **femtograd** | Fully inspectable | Modest | | 1 | Tape-based tracing | TensorFlow 1.x, JAX | Graph cached, less visible | Faster | | 2 | Compiled backend (C++, GPU) | TMB, torch | Opaque backend | Production-grade | Every tier implements the same mathematical ideas: computational graphs, the chain rule, and reverse-mode traversal. The difference is how much machinery sits between you and those ideas. femtograd minimizes that machinery so you can see the math clearly.