# Tutorial
We present a typical workflow with DifferentiationInterface.jl and showcase its potential performance benefits.
```julia
using DifferentiationInterface
```
## Computing a gradient
A common use case of automatic differentiation (AD) is optimizing real-valued functions with first- or second-order methods. Let's define a simple objective function and an input vector:
```julia
f(x) = sum(abs2, x)
x = collect(1.0:5.0)
```

```
5-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0
 5.0
```
To compute its gradient, we need to choose a "backend", i.e. an AD package to call under the hood. Most backend types are defined by ADTypes.jl and re-exported by DifferentiationInterface.jl.
ForwardDiff.jl is very generic and efficient for low-dimensional inputs, so it's a good starting point:
```julia
import ForwardDiff

backend = AutoForwardDiff()
```
To avoid name conflicts, load AD packages with `import` instead of `using`. Indeed, most AD packages also export operators like `gradient` and `jacobian`, but you only want to use the ones from DifferentiationInterface.jl.
Now you can use the following syntax to compute the gradient:
```julia
gradient(f, backend, x)
```

```
5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0
```
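If you also need the objective value (as most optimizers do), DifferentiationInterface.jl provides `value_and_gradient` to get both in a single call, which is usually cheaper than two separate calls. A minimal sketch reusing the definitions above:

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
x = collect(1.0:5.0)
backend = AutoForwardDiff()

# Compute f(x) and its gradient in one call
y, grad = value_and_gradient(f, backend, x)
```

Here `y` is `f(x) = 55.0` and `grad` is the same vector as before.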
Was that fast? BenchmarkTools.jl helps you answer that question.
```julia
using BenchmarkTools

@benchmark gradient($f, $backend, $x)
```

```
BenchmarkTools.Trial: 10000 samples with 187 evaluations.
 Range (min … max):  550.872 ns … 32.409 μs  ┊ GC (min … max): 0.00% … 97.11%
 Time  (median):     569.353 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   601.610 ns ± 635.165 ns ┊ GC (mean ± σ):  3.92% ± 4.57%

   ▂▅███▇▇▆▅▄▃▂▁                                              ▃
  ▄▆███████████████▇▆▅▅▁▁▆▆▇█▇█▇█▇█▇▇▇▆▇▇▇▆▇▆▅▅▆▅▅▆▅▁▅▄▅▆▃▅▆▅▆▅ █
  551 ns        Histogram: log(frequency) by time       723 ns <

 Memory estimate: 848 bytes, allocs estimate: 4.
```
Not bad, but you can do better.
## Overwriting a gradient
Since you know how much space your gradient will occupy (the same as your input `x`), you can pre-allocate that memory and offer it to AD. Some backends get a speed boost from this trick.
```julia
grad = similar(x)
gradient!(f, grad, backend, x)
grad  # has been mutated
```

```
5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0
```
The bang indicates that one of the arguments of `gradient!` might be mutated. More precisely, our convention is that every positional argument between the function and the backend is mutated (and the `extras` too, see below).
```julia
@benchmark gradient!($f, _grad, $backend, $x) evals=1 setup=(_grad=similar($x))
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  570.000 ns … 1.563 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     611.000 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   637.268 ns ± 87.177 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇ ██▆▃▅▃▁ ▂▂ ▂▂▂ ▂▂▂ ▁▁ ▁ ▁▁ ▁                             ▂
  ███▁███████▇██▁███▁███▁███████▇██▁███▁▇▇█▁██▇▇██▁███▁▇▇▆▁▇▆▆ █
  570 ns       Histogram: log(frequency) by time      1.01 μs <

 Memory estimate: 752 bytes, allocs estimate: 3.
```
For some reason the in-place version is not much better than your first attempt. However, it makes fewer allocations, thanks to the gradient vector you provided. Don't worry, you can get even more performance.
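To make the buffer reuse concrete, here is a sketch (the helper `total_gradient` and the batch `inputs` are illustrative, not part of the API) that sums gradients over several inputs with a single pre-allocated vector:

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)

# Sum the gradients of f over a batch of inputs, reusing one buffer.
function total_gradient(f, backend, inputs)
    grad = similar(first(inputs))        # pre-allocated once
    total = zero(grad)
    for x in inputs
        gradient!(f, grad, backend, x)   # overwrites grad in place
        total .+= grad
    end
    return total
end

inputs = [rand(5) for _ in 1:3]          # illustrative batch of input vectors
total = total_gradient(f, AutoForwardDiff(), inputs)
```

Since the gradient of `sum(abs2, x)` is `2x`, the result equals twice the sum of the inputs.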
## Preparing for multiple gradients
Internally, ForwardDiff.jl creates some data structures to keep track of things. These objects can be reused between gradient computations, even on different input values. We abstract away the preparation step behind a backend-agnostic syntax:
```julia
extras = prepare_gradient(f, backend, x)
```
You don't need to know what this object is, you just need to pass it to the gradient operator.
```julia
grad = similar(x)
gradient!(f, grad, backend, x, extras)
grad  # has been mutated
```

```
5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0
```
Preparation makes the gradient computation much faster, and (in this case) allocation-free.
```julia
@benchmark gradient!($f, _grad, $backend, $x, _extras) evals=1 setup=(
    _grad=similar($x);
    _extras=prepare_gradient($f, $backend, $x)
)
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  69.000 ns … 1.252 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     80.000 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   83.796 ns ± 32.144 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    █    ▄    ▁   ▂         ▃    ▂    ▂
  █▁▁▁▁█▁▁▁▆█▁▁▁▁█▁▁▁▇▅▁▁▁▅▁▁▁▁▄▁▁▁▁▆▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▆ █
  69 ns       Histogram: log(frequency) by time       190 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
Beware that the `extras` object is nearly always mutated by differentiation operators, even though it is given as the last positional argument.
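For instance, preparation pays off when you differentiate the same function at many points of the same size. A sketch reusing the definitions above (the scaling loop is illustrative):

```julia
using DifferentiationInterface
import ForwardDiff

f(x) = sum(abs2, x)
backend = AutoForwardDiff()
x = collect(1.0:5.0)

# Prepare once for this input size...
extras = prepare_gradient(f, backend, x)
grad = similar(x)

# ...then reuse it for many same-sized inputs. Since `extras` is mutated,
# don't share one object between concurrent tasks.
for scale in (1.0, 2.0, 3.0)
    gradient!(f, grad, backend, scale .* x, extras)
end
```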
## Switching backends
The whole point of DifferentiationInterface.jl is that you can easily experiment with different AD solutions. For gradients, reverse-mode AD is typically a better fit, so let's try ReverseDiff.jl!
For this one, the backend definition is slightly more involved, because you can specify whether the tape needs to be compiled:
```julia
import ReverseDiff

backend2 = AutoReverseDiff(; compile=true)
```
But once it is done, things run smoothly with exactly the same syntax:
```julia
gradient(f, backend2, x)
```

```
5-element Vector{Float64}:
  2.0
  4.0
  6.0
  8.0
 10.0
```
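Because the syntax is backend-agnostic, cross-checking two backends against each other takes only a couple of lines; a sketch assuming both ForwardDiff.jl and ReverseDiff.jl are installed:

```julia
using DifferentiationInterface
import ForwardDiff, ReverseDiff

f(x) = sum(abs2, x)
x = collect(1.0:5.0)

g1 = gradient(f, AutoForwardDiff(), x)
g2 = gradient(f, AutoReverseDiff(; compile=true), x)
agree = g1 ≈ g2  # both backends should match up to floating-point error
```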
And you can run the same benchmarks to see what you gained (although such a small input may not be realistic):
```julia
@benchmark gradient!($f, _grad, $backend2, $x, _extras) evals=1 setup=(
    _grad=similar($x);
    _extras=prepare_gradient($f, $backend2, $x)
)
```

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  280.000 ns … 11.041 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     300.000 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   302.812 ns ± 109.556 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃     █     █     ▆    ▃▃    ▁▂    ▁                         ▂
  █▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁██▁▁▁▁██▁▁▁▁██▁▁▁▁▇█▁▁▁▁▇█▁▁▁▁▇▇▁▁▁▁▃ █
  280 ns        Histogram: log(frequency) by time       380 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
In short, DifferentiationInterface.jl allows for easy testing and comparison of AD backends. If you want to go further, check out the DifferentiationInterfaceTest.jl tutorial. It provides benchmarking utilities to compare backends and help you select the one that is best suited for your problem.
## Handling sparsity
To compute sparse Jacobians or Hessians, you need three ingredients (read this survey to understand why):

- An underlying (dense) AD backend
- A sparsity pattern detector, like:
    - `TracerSparsityDetector`, implemented by SparseConnectivityTracer.jl (our default recommendation)
    - `SymbolicsSparsityDetector`, implemented by DifferentiationInterface.jl with Symbolics.jl but not exported nor part of the public API (it will soon be transferred)
- A coloring algorithm, like:
    - `GreedyColoringAlgorithm`, implemented by DifferentiationInterface.jl
ADTypes.jl v1.0 provides the `AutoSparse` wrapper to combine these three ingredients, and DifferentiationInterface.jl re-exports it. Here's an example:
```julia
using SparseConnectivityTracer

dense_backend = AutoForwardDiff()
sparse_backend = AutoSparse(
    dense_backend;
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)
```

```
AutoSparse{AutoForwardDiff{nothing, Nothing}, SparseConnectivityTracer.TracerSparsityDetector{BitSet}, GreedyColoringAlgorithm}(AutoForwardDiff{nothing, Nothing}(nothing), SparseConnectivityTracer.TracerSparsityDetector{BitSet}(), GreedyColoringAlgorithm())
```
See how the computed Hessian is sparse, whereas the underlying backend alone would give us a dense matrix:
```julia
hessian(f, sparse_backend, x)
```

```
5×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 2.0   ⋅    ⋅    ⋅    ⋅
  ⋅   2.0   ⋅    ⋅    ⋅
  ⋅    ⋅   2.0   ⋅    ⋅
  ⋅    ⋅    ⋅   2.0   ⋅
  ⋅    ⋅    ⋅    ⋅   2.0
```

```julia
hessian(f, dense_backend, x)
```

```
5×5 Matrix{Float64}:
 2.0  0.0  0.0  0.0  0.0
 0.0  2.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  2.0  0.0
 0.0  0.0  0.0  0.0  2.0
```
The sparsity detector and coloring algorithm are called during the preparation step, which can be fairly expensive. If you plan to compute several Jacobians or Hessians with the same pattern but different input vectors, you should reuse the `extras` object created by `prepare_jacobian` or `prepare_hessian`. After preparation, the sparse computation itself will be much faster than the dense one, and require fewer calls to the function.
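To illustrate pattern reuse, here is a sketch (the componentwise function `g` is illustrative, not from the tutorial) that prepares a sparse Jacobian once and then evaluates it at two different inputs:

```julia
using DifferentiationInterface, SparseConnectivityTracer
import ForwardDiff

g(x) = x .^ 2  # componentwise, so the Jacobian is diagonal

sparse_backend = AutoSparse(
    AutoForwardDiff();
    sparsity_detector=TracerSparsityDetector(),
    coloring_algorithm=GreedyColoringAlgorithm(),
)

x = collect(1.0:5.0)
extras = prepare_jacobian(g, sparse_backend, x)  # detection + coloring happen here

J1 = jacobian(g, sparse_backend, x, extras)       # sparse, diagonal pattern
J2 = jacobian(g, sparse_backend, 2 .* x, extras)  # same pattern, new input
```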
The symbolic backends have built-in sparsity handling, so `AutoSparse(AutoSymbolics())` and `AutoSparse(AutoFastDifferentiation())` do not need additional configuration for detection or coloring. However, they still benefit from preparation.