Automatic differentiation

A tale of two languages

Guillaume Dalle

École nationale des Ponts et Chaussées

2025-03-18

Introduction

Slides

https://gdalle.github.io/JuliaMeetup2025-AutoDiff/

Motivation

What is differentiation?

Finding a linear approximation of a function around a point.

Why do we care?

Derivatives of computer programs are essential in optimization and machine learning.

What do we need to do?

Not much: automatic differentiation (AD) computes derivatives for us!

Flavors of differentiation

Derivatives: formal definition

Derivative of \(f: \mathbb{R}^n \to \mathbb{R}^m\) at point \(x\): the linear map \(\partial f(x)\) such that \[f(x + \varepsilon) = f(x) + \partial f(x)[\varepsilon] + o(\|\varepsilon\|)\]

  • For \(n = 1, m = 1\), derivative represented by a number \(f'(x)\)
  • For \(n > 1, m = 1\), derivative represented by a gradient vector \(\nabla f(x)\)
  • For \(n > 1, m > 1\), derivative represented by a Jacobian matrix

\[\partial f(x) = \left(\frac{\partial f_i}{\partial x_j} (x)\right)_{1 \leq i \leq m, 1 \leq j \leq n}\]
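A quick sanity check (not from the original slides): for a small illustrative function \(F: \mathbb{R}^2 \to \mathbb{R}^2\) with a hand-written Jacobian, the remainder of the linear approximation is indeed negligible compared to the perturbation.

F(x) = [x[1] * x[2], sin(x[1])]     # illustrative example
JF(x) = [x[2]       x[1];
         cos(x[1])  0.0]            # rows = outputs f_i, columns = inputs x_j

x0, ε = [1.0, 2.0], [1e-6, -1e-6]
F(x0 + ε) - (F(x0) + JF(x0) * ε)    # remainder is o(ε): around 1e-12, far below ‖ε‖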

Manual differentiation

Write down formulas like you’re in high school.

f(x) = √x  # computes sqrt

function g(x)  # computes approximate sqrt
  y = x
  for i in 1:3
    y = 0.5 * (y + x/y)
  end
  return y
end
df(x) = 1 / 2√x
dg(x) = @info "I'm too lazy"
x = 2.0
f(x), df(x)
(1.4142135623730951, 0.35355339059327373)
g(x), dg(x)
[ Info: I'm too lazy
(1.4142156862745097, nothing)

Drawback

Labor-intensive, error-prone.

Symbolic differentiation

Ask Mathematica / Wolfram Alpha to work out formulas for you.

using Symbolics, Latexify; xvar = Symbolics.variable(:x)
latexify(Symbolics.derivative(f(xvar), xvar))

\[\begin{equation} \frac{1}{2 \sqrt{x}} \end{equation}\]

latexify(Symbolics.derivative(g(xvar), xvar))

\[\begin{equation} 0.5 \left( 0.5 \left( 0.5 + \frac{1}{0.5 \left( 1 + x \right)} + \frac{ - 0.5 x}{0.25 \left( 1 + x \right)^{2}} \right) + \frac{1}{0.5 \left( 0.5 \left( 1 + x \right) + \frac{x}{0.5 \left( 1 + x \right)} \right)} - 0.5 \left( 0.5 + \frac{1}{0.5 \left( 1 + x \right)} + \frac{ - 0.5 x}{0.25 \left( 1 + x \right)^{2}} \right) \frac{x}{0.25 \left( 0.5 \left( 1 + x \right) + \frac{x}{0.5 \left( 1 + x \right)} \right)^{2}} \right) \end{equation}\]

Drawback

Does not scale to more complex functions.

Numeric differentiation

Rely on finite differences with a small perturbation.

\[\partial f(x)[v] \approx \frac{f(x + \varepsilon v) - f(x)}{\varepsilon}\]

ε1 = 1e-1  # too large
(f(x + ε1) - f(x)) / ε1
0.34924112245848793
ε2 = 1e-5  # just right
(f(x + ε2) - f(x)) / ε2
0.35355294865091474
ε3 = 1e-15  # too small
(f(x + ε3) - f(x)) / ε3
0.22204460492503128

Drawback

Truncation or floating point errors depending on \(\varepsilon\).

Automatic (or algorithmic) differentiation

Reinterpret the program computing \(f\) to obtain \(\partial f(x)\) instead.

import Base: *, +, /, sqrt

struct Dual  # carries a value together with its derivative
  val::Float64
  der::Float64
end

*(a, x::Dual) = Dual(a*x.val, a*x.der)                                      # scaling by a constant
+(x::Dual, y::Dual) = Dual(x.val+y.val, x.der+y.der)                        # sum rule
/(x::Dual, y::Dual) = Dual(x.val/y.val, (x.der*y.val-y.der*x.val)/y.val^2)  # quotient rule
sqrt(x::Dual) = Dual(√x.val, x.der/2√x.val);                                # derivative of √x is 1/(2√x)
f(Dual(2.0, 1.0))
Dual(1.4142135623730951, 0.35355339059327373)
g(Dual(2.0, 1.0))
Dual(1.4142156862745097, 0.35356593617839294)

Drawback

Hard to reinterpret arbitrary code efficiently.

AD under the hood

How it works

  1. Hardcoded derivatives of basic functions: \(+, \times, \exp, \log, \sin, \cos\)
  2. Composition with the chain rule:

\[ f = g \circ h \qquad \implies \qquad \partial f(x) = \partial g(h(x)) \circ \partial h(x)\]
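A minimal sketch of these two ingredients (function names are illustrative, not from the slides): two hardcoded derivatives combined by the chain rule.

inner(x) = sin(x);   dinner(x) = cos(x)    # hardcoded derivative of sin
outer(y) = exp(y);   douter(y) = exp(y)    # hardcoded derivative of exp

composed(x) = outer(inner(x))                  # f = g ∘ h
dcomposed(x) = douter(inner(x)) * dinner(x)    # ∂f(x) = ∂g(h(x)) ∘ ∂h(x)

dcomposed(1.0) ≈ exp(sin(1.0)) * cos(1.0)      # true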

Main implementation paradigms:

Operator overloading

Define new types augmenting runtime operations.

Source transformation

Preprocess the source code at compile time.
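To make the two paradigms concrete (a hedged illustration; Zygote.jl is not used elsewhere in these slides), ForwardDiff.jl overloads operations on dual numbers, much like the Dual type above, while Zygote.jl rewrites Julia's intermediate representation:

import ForwardDiff  # operator overloading: dual numbers, like the Dual type above
import Zygote       # source transformation: rewrites Julia's intermediate representation

ForwardDiff.derivative(sqrt, 2.0)   # 0.35355339059327373
Zygote.gradient(sqrt, 2.0)[1]       # same value, different machinery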

Two different modes

Consider \(f : x \in \mathbb{R}^n \longmapsto y \in \mathbb{R}^m\), and let \(T(f)\) denote the time of one evaluation of \(f\).

Forward mode

At cost \(\propto T(f)\), get all \(m\) partial derivatives \(\partial y_j / \partial x_i\) with respect to a single input \(x_i\).

Propagate an input perturbation onto the outputs.

Reverse mode

At cost \(\propto T(f)\), get all \(n\) partial sensitivities \(\partial y_j / \partial x_i\) of a single output \(y_j\).

Backpropagate an output sensitivity onto the inputs.
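As a hedged sketch (the function G and the choice of backends are illustrative): forward mode fills the Jacobian one column per pass, reverse mode one row per pass, which is why a scalar loss needs a single reverse pass regardless of the input dimension.

import ForwardDiff, ReverseDiff

G(x) = [sum(abs2, x), prod(x)]    # illustrative f: ℝ³ → ℝ², n = 3, m = 2
xs = [1.0, 2.0, 3.0]

ForwardDiff.jacobian(G, xs)                    # forward mode: one pass per input direction (column)
ReverseDiff.gradient(x -> sum(abs2, x), xs)    # reverse mode: one pass per output (row)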

Why is deep learning possible?

Because gradients in reverse mode are fast.

AD in Python and Julia

A flurry of options

In Python, three main AD frameworks: TensorFlow, PyTorch and JAX.

In Julia, a dozen or so AD backends, including ForwardDiff.jl, ReverseDiff.jl, Zygote.jl and Enzyme.jl.

Each backend has its use cases, especially for scientific ML.

Python & Julia: users

Image: courtesy of Adrian Hill

Python & Julia: developers

Image: courtesy of Adrian Hill

Why the difference?

                      Python              Julia
Math & tensors        Framework-specific  Part of the core language
AD development        Centralized (x3)    Decentralized
Limits of AD          Well-defined        Fuzzy
Scientific libraries  Split effort        Shared effort

Does it have to be this way?

AD could be a language feature instead of a post-hoc addition.

DifferentiationInterface

Switching backends at low cost

Keras now supports TensorFlow, PyTorch and JAX.

Image: from Keras 3.0 blog post

DifferentiationInterface.jl talks to all Julia AD backends.

  • Downloads per month: 20k
  • Indirect dependents: 427
  • Famous clients: Optimization.jl, Turing.jl, NonlinearSolve.jl

Benefit 1: Standardization

Only one repo containing AD bindings: easier to maintain and improve.

Switching backends is now instantaneous.

using DifferentiationInterface
import ForwardDiff  # this changes
backend = AutoForwardDiff()  # this changes

f(x) = sum(abs2, x)
x = float.(1:4)
p = prepare_gradient(f, backend, similar(x))
gradient(f, p, backend, x)
4-element Vector{Float64}:
 2.0
 4.0
 6.0
 8.0
using DifferentiationInterface
import ReverseDiff  # this changes
backend = AutoReverseDiff()  # this changes

f(x) = sum(abs2, x)
x = float.(1:4)
p = prepare_gradient(f, backend, similar(x))
gradient(f, p, backend, x)
4-element Vector{Float64}:
 2.0
 4.0
 6.0
 8.0

Preliminary work is abstracted away into a preparation step.
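A small usage note (a sketch reusing f, p and backend from the snippet above): the preparation object remains valid for any input of the same size and type, so repeated gradient calls skip the setup work.

for xnew in (float.(1:4), float.(5:8))
  @show gradient(f, p, backend, xnew)
end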

Benefit 2: Superpowers

Having a common interface lets us do things we couldn’t do before:

  • Second-order differentiation
  • Sparsity handling

using DifferentiationInterface, SparseConnectivityTracer, SparseMatrixColorings
import ForwardDiff, ReverseDiff

backend = AutoSparse(
  SecondOrder(AutoForwardDiff(), AutoReverseDiff());
  sparsity_detector=TracerSparsityDetector(), coloring_algorithm=GreedyColoringAlgorithm()
)
f(x) = sum(abs2, x)
x = float.(1:4)
p = prepare_hessian(f, backend, similar(x))
hessian(f, p, backend, x)
4×4 SparseArrays.SparseMatrixCSC{Float64, Int64} with 4 stored entries:
 2.0   ⋅    ⋅    ⋅ 
  ⋅   2.0   ⋅    ⋅ 
  ⋅    ⋅   2.0   ⋅ 
  ⋅    ⋅    ⋅   2.0

Conclusion

Take-home message

Computing derivatives is automatic and efficient.

Each AD system comes with limitations, learn to recognize them.

Julia is a great ecosystem to play around with AD.

Do you have a tricky AD problem?

Reach out to me, let’s figure it out! My website: gdalle.github.io

Bibliography

  • Blondel and Roulet (2024): the most recent book
  • Griewank and Walther (2008): the bible of the field
  • Baydin et al. (2018), Margossian (2019): concise surveys

Going further

References

Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2018. “Automatic Differentiation in Machine Learning: A Survey.” Journal of Machine Learning Research 18 (153): 1–43. http://jmlr.org/papers/v18/17-468.html.
Blondel, Mathieu, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. 2022. “Efficient and Modular Implicit Differentiation.” In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=Q-HOv_zn6G.
Blondel, Mathieu, and Vincent Roulet. 2024. “The Elements of Differentiable Programming.” arXiv. https://doi.org/10.48550/arXiv.2403.14606.
Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics. https://epubs.siam.org/doi/book/10.1137/1.9780898717761.
Mandi, Jayanta, James Kotary, Senne Berden, Maxime Mulamba, Victor Bucarey, Tias Guns, and Ferdinando Fioretto. 2024. “Decision-Focused Learning: Foundations, State of the Art, Benchmark and Future Opportunities.” Journal of Artificial Intelligence Research 80 (August): 1623–1701. https://doi.org/10.1613/jair.1.15320.
Margossian, Charles C. 2019. “A Review of Automatic Differentiation and Its Efficient Implementation.” WIREs Data Mining and Knowledge Discovery 9 (4): e1305. https://doi.org/10.1002/widm.1305.
Mohamed, Shakir, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. 2020. “Monte Carlo Gradient Estimation in Machine Learning.” Journal of Machine Learning Research 21 (132): 1–62. http://jmlr.org/papers/v21/19-346.html.
Sapienza, Facundo, Jordi Bolibar, Frank Schäfer, Brian Groenke, Avik Pal, Victor Boussange, Patrick Heimbach, et al. 2024. “Differentiable Programming for Differential Equations: A Review.” arXiv. https://doi.org/10.48550/arXiv.2406.09699.