Sparse linear regression

In this example, we show how to differentiate through the solution of the following constrained optimization problem:

\[\hat{y}(x) = \arg\min_{y \in \mathcal{C}} f(x, y)\]

where $\mathcal{C}$ is a closed convex set. The optimal solution is a fixed point of the projected gradient step for any step size $\eta > 0$, which yields the following optimality conditions:

\[F(x, \hat{y}(x)) = 0 \quad \text{with} \quad F(x,y) = \mathrm{proj}_{\mathcal{C}}(y - \eta \nabla_2 f(x, y)) - y\]
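As a toy illustration (added here, not part of the original example; the names proj_C, grad_f and F_toy are ours), take $f(x, y) = \frac{1}{2} \lVert y - x \rVert_2^2$ and $\mathcal{C} = \mathbb{R}_+^d$: the projection is a componentwise clamp, the constrained minimizer is the clamped $x$, and $F$ vanishes exactly there.

proj_C(v) = max.(v, 0)                        # projection onto the nonnegative orthant
grad_f(x, y) = y .- x                         # ∇₂f for f(x, y) = ½‖y - x‖²
F_toy(x, y; η=0.1) = proj_C(y .- η .* grad_f(x, y)) .- y
x_toy = [1.0, -2.0]
ŷ_toy = proj_C(x_toy)                         # constrained minimizer of f(x_toy, ⋅)
F_toy(x_toy, ŷ_toy)                           # ≈ [0.0, 0.0]: the fixed-point residual vanishes at the optimum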

using ComponentArrays
using Convex
using FiniteDifferences
using ImplicitDifferentiation
using MathOptInterface
using MathOptSetDistances
using Random
using SCS
using Zygote


Random.seed!(63)
Random.TaskLocalRNG()

Introduction

We have a matrix of features $X \in \mathbb{R}^{n \times p}$ and a vector of targets $y \in \mathbb{R}^n$.

In a linear regression setting $y \approx X \beta$, one way to ensure sparsity of the parameter $\beta \in \mathbb{R}^p$ is to select it within the $\ell_1$ ball $\mathcal{B}_1$:

\[\hat{\beta}(X, y) = \arg\min_{\beta \in \mathbb{R}^p} ~ \lVert y - X \beta \rVert_2^2 \quad \text{s.t.} \quad \lVert \beta \rVert_1 \leq 1 \tag{QP}\]

We want to compute the derivatives of the optimal parameter with respect to the data: $\partial \hat{\beta} / \partial X$ and $\partial \hat{\beta} / \partial y$.

Possible application: sensitivity analysis of $\hat{\beta}(X, y)$.

Forward solver

The function $\hat{\beta}$ is computed with the disciplined convex programming layer Convex.jl, backed by the SCS.jl solver.

function lasso(X::AbstractMatrix, y::AbstractVector)
	n, p = size(X)
	β = Variable(p)                               # decision variable
	objective = sumsquares(X * β - y)             # least-squares objective
	constraints = [norm(β, 1) <= 1.0]             # unit ℓ1 ball constraint
	problem = minimize(objective, constraints)
	solve!(problem, SCS.Optimizer; silent_solver=true)
	return Convex.evaluate(β)
end
lasso (generic function with 1 method)

To comply with the requirements of ImplicitDifferentiation.jl, we need to gather the input arguments into a single array. We rely on ComponentArrays.jl for that purpose.

lasso(data::ComponentVector) = lasso(data.X, data.y)
lasso (generic function with 2 methods)
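As a small illustration (added here; X_demo and y_demo are throwaway names), a ComponentVector stores its components in one flat vector while keeping named, shaped access to each block:

X_demo, y_demo = rand(2, 3), rand(2)
data_demo = ComponentVector(X=X_demo, y=y_demo)
length(data_demo) == prod(size(X_demo)) + length(y_demo)  # true: a single flat vector of 8 entries
data_demo.X == X_demo                                     # true: named access recovers the original matrix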

Optimality conditions
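For problem (QP), the gradient of the objective is $\nabla_\beta \lVert y - X \beta \rVert_2^2 = 2 X^\top (X \beta - y)$, so the generic conditions above specialize (we spell out this step for clarity) to:

\[F(X, y, \beta) = \mathrm{proj}_{\mathcal{B}_1}\big(\beta - 2 \eta X^\top (X \beta - y)\big) - \beta\]

The conditions function below implements this with step size $\eta = 1$, up to an overall sign that does not change the zero set.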

We use MathOptSetDistances.jl to compute the projection onto the unit $\ell_1$ ball.

function proj_l1_ball(v::AbstractVector{R}) where {R<:Real}
	distance = MathOptSetDistances.DefaultDistance()
	cone = MathOptInterface.NormOneCone(length(v))
	ball = MathOptSetDistances.NormOneBall{R}(one(R), cone)
	return projection_on_set(distance, v, ball)
end
proj_l1_ball (generic function with 1 method)
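As a quick sanity check (added here, not in the original script), we can project an arbitrary vector and verify that the result lies inside the unit $\ell_1$ ball:

v = [0.5, 2.0, -1.0]
sum(abs, proj_l1_ball(v))  # ≤ 1, up to numerical tolerance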

Since this projection uses mutation internally, it is not compatible with Zygote.jl. Thus, we specify that it should be differentiated with ForwardDiff.jl by wrapping it in Zygote.forwarddiff.

function proj_grad_fixed_point(data, β)
	grad = 2 * data.X' * (data.X * β - data.y)  # ∇_β ‖y - Xβ‖²
	return β - Zygote.forwarddiff(proj_l1_ball, β - grad)  # -F(X, y, β) with step size η = 1
end
proj_grad_fixed_point (generic function with 1 method)

This is the last ingredient we needed to build a differentiable sparse linear regression.

implicit = ImplicitFunction(lasso, proj_grad_fixed_point);

Testing

n, p = 5, 7;
X, y = rand(n, p), rand(n);
data = ComponentVector(X=X, y=y);
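Before differentiating, we can check (an extra verification added here) that the implicit wrapper reproduces the forward solver:

implicit(data) ≈ lasso(data)  # should hold: the forward pass of implicit simply calls lasso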

As expected, the forward pass returns a sparse solution:

round.(implicit(data); digits=4)
7-element Vector{Float64}:
 0.3643
 0.0506
 0.0
 0.377
 0.2081
 0.0
 0.0
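We can also verify (another check added here, not in the original script; β̂ is an illustrative name) that the optimality conditions are approximately satisfied at this solution:

β̂ = implicit(data)
maximum(abs, proj_grad_fixed_point(data, β̂))  # close to zero, up to the tolerance of SCS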

Note that implicit differentiation is necessary here because the convex solver breaks autodiff.

try; Zygote.jacobian(lasso, data); catch e; @error e; end
┌ Error: MethodError(zero, (+ (affine; real)
│ ├─ * (affine; real)
│ │  ├─ 5×7 reshape(view(::Vector{Float64}, 1:35), 5, 7) with eltype Float64
│ │  └─ 7-element real variable (id: 431…703)
│ └─ 5-element Vector{Float64},), 0x0000000000007b42)
└ @ Main 2_sparse_linear_regression.md:113

Meanwhile, our implicit wrapper makes autodiff work seamlessly.

J = Zygote.jacobian(implicit, data)[1]
7×40 Matrix{Float64}:
 -0.501362  -0.745416   0.804222  …   1.28638     0.101484  -0.906656
  0.426837   1.13999   -1.00262      -1.50601     0.985644   0.388479
  0.0        0.0        0.0           0.0         0.0        0.0
  0.193122   0.193906  -0.313604      0.130775   -0.902003   0.355226
 -0.118597  -0.588484   0.512001      0.0888583  -0.185125   0.16295
  0.0        0.0        0.0       …   0.0         0.0        0.0
  0.0        0.0        0.0           0.0         0.0        0.0

The number of columns of the Jacobian matches the total number of entries in $X$ and $y$:

prod(size(X)) + prod(size(y))
40
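Since data lays out X before y, the two Jacobian blocks we were after can be sliced out as follows (a convenience added here; J_X and J_y are illustrative names):

J_X = J[:, 1:prod(size(X))]        # ∂β̂ / ∂X (X flattened column-major), a 7×35 block
J_y = J[:, prod(size(X))+1:end]    # ∂β̂ / ∂y, a 7×5 block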

We can validate the result using finite differences.

J_ref = FiniteDifferences.jacobian(central_fdm(5, 1), lasso, data)[1]
sum(abs, J - J_ref) / prod(size(J))  # mean absolute error per Jacobian entry
0.0005758208246394782

This page was generated using Literate.jl.