NeurIPS 2021 digest
This is my first attempt at a blog post, so please be kind!
NeurIPS is one of the leading conferences in machine learning, and due to an ongoing pandemic, the 2021 edition took place online. On the one hand, this means that AI researchers from around the world didn’t get to socialize in person and enjoy the free food. On the other hand, it also means that I was able to follow most of the talks from my living room, instead of spending my yearly carbon budget on a round trip to Sydney.
Now that I have finally caught up on everything I wanted to watch, it’s time for a quick summary of my personal highlights! All of the papers I mention are available on the pre-proceedings website. This is a long post, but in my defense, there were several thousand accepted papers at NeurIPS 2021, so I think my compression rate is quite impressive already.
Machine learning theory
The papers in this section provide a critical perspective on some standard assumptions that most ML practitioners would accept without batting an eye.
Impossibility results

“Markov rewards are enough”: it depends on what you want to do with them. On the Expressivity of Markov Reward questions this fundamental paradigm of reinforcement learning by rethinking the notion of task. Indeed, maximizing an agent’s reward is only a proxy to steer it towards a set of acceptable behaviors or trajectories. And as it turns out, the standard reward framework of MDPs is not sufficient to express every possible task: for instance, the policies achieving the task “always move in the same direction” cannot be given a strictly higher value than all others when using only Markov rewards. The key intuition is that Markov rewards are quick to forgive past offenses, whereas such “lifelong” tasks will never forget even one mistake!
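To get a feel for the impossibility, here is a toy numerical check (my own, not the paper’s construction): in a single-state MDP with two actions and discount \(0.9\), no Markov reward \((r_A, r_B)\) can rank both “steady” policies strictly above the two policies that switch actions once.

```python
import itertools
import numpy as np

# Single-state MDP, two actions A and B, discount gamma. The task is
# "always take the same action". We search exhaustively for a Markov
# reward (r_A, r_B) whose discounted value ranks BOTH constant policies
# strictly above the two policies that switch actions after one step.
gamma = 0.9

def value(first, rest, r):
    # Discounted return of "play `first` once, then `rest` forever".
    return r[first] + gamma * r[rest] / (1 - gamma)

found = False
for r_A, r_B in itertools.product(np.linspace(-1, 1, 201), repeat=2):
    r = {"A": r_A, "B": r_B}
    steady = min(value("A", "A", r), value("B", "B", r))
    switch = max(value("A", "B", r), value("B", "A", r))
    if steady > switch + 1e-9:
        found = True

print(found)  # → False: no Markov reward expresses this task here
```

The algebra behind the failure is simple: if \(r_A \ge r_B\), then “play B once, then A forever” already earns at least as much as “always play B”, so the worst steady policy can never strictly dominate every switching policy.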

“There is a classification measure that does everything I want”: sorry, but you cannot have your cake and eat it too. There are many properties we might wish for in a measure, for instance monotonicity or distance-like behavior. Another important one is the constant baseline property, which ensures that random labelings get similar scores regardless of class sizes. Unfortunately, as was shown by Good Classification Measures and How to Find Them, no binary classification measure satisfies all three criteria at once. What a bummer…
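As a quick illustration (my own toy experiment, with the \(F_1\) score standing in for the many measures the paper analyzes), here is how a coin-flip classifier’s score drifts with class balance, violating the constant baseline property:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def f1(y_true, y_pred):
    # Standard binary F1 score, computed from scratch.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

# Coin-flip predictions, scored against datasets of varying class balance.
scores = {}
for pos_rate in (0.5, 0.1):
    y_true = (rng.random(n) < pos_rate).astype(int)
    y_pred = (rng.random(n) < 0.5).astype(int)
    scores[pos_rate] = f1(y_true, y_pred)

print(scores)  # the random baseline moves with the class sizes
```

With balanced classes the random baseline sits near \(0.5\), but with 10% positives it drops to roughly \(1/6\): the same meaningless classifier gets very different scores.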

“Stochastic gradient descent can learn anything”: well, it actually depends on the minibatch size and gradient precision. On the Power of Differentiable Learning versus PAC and SQ Learning shows that there are some settings where the SGD framework is strictly weaker than the PAC learning framework in terms of expressivity. Disappointing, right?
Uncertainty and privacy

“Numerical algorithms are reliable”: not fully, because they often rely on a small parameter \(h\) that cannot be shrunk all the way to \(0\). A good example is the step size when solving differential equations. Black Box Probabilistic Numerics proposes to bridge this gap by using a sequence of increasingly precise computations to interpolate the \(h \to 0\) limit with a Gaussian process. The resulting quantity of interest is no longer deterministic but stochastic: its variance can then be interpreted as an epistemic uncertainty related to the finite precision of the underlying solver.
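Here is a minimal caricature of that idea (my own sketch, not the paper’s algorithm): run an explicit Euler solver at a few step sizes, then infer the \(h \to 0\) limit probabilistically. Since Euler’s global error is \(O(h)\), I use Bayesian linear regression in \(h\) (equivalent to a GP with a linear kernel), whose intercept posterior gives both an estimate of the limit and an uncertainty around it.

```python
import numpy as np

# Solve y' = -y, y(0) = 1 with explicit Euler; the exact value at t = 1
# is exp(-1) ~ 0.3679, but any finite step size h leaves a bias.
def euler(h):
    y = 1.0
    for _ in range(round(1 / h)):
        y += h * (-y)
    return y

hs = np.array([0.2, 0.1, 0.05, 0.025])
ys = np.array([euler(h) for h in hs])

# Bayesian linear regression y(h) ~ a + b * h: since Euler's global error
# is O(h), the posterior over the intercept `a` models the h -> 0 limit.
Phi = np.stack([np.ones_like(hs), hs], axis=1)
alpha, beta = 1e-6, 1e8                  # weak prior, small assumed noise
S = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ ys

limit_mean, limit_std = m[0], np.sqrt(S[0, 0])
print(limit_mean, "+/-", limit_std)      # close to exp(-1), with a variance
```

The posterior mean lands much closer to \(e^{-1}\) than any single solver run, and the posterior variance is exactly the kind of solver-induced epistemic uncertainty the paper turns into a first-class citizen.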

“Differential privacy is the only way to go”: that may be true if you want to be robust against attacks regarding all members of the dataset. However, if you only care about the privacy of some data points (for instance, a group of users asking to be removed from your database), then you can do better by following Master Yoda’s advice: “You must unleaaaarn what you have leaaarned”. More specifically, you must Remember What You Want to Forget: Algorithms for Machine Unlearning. The authors of this paper show that by exploiting the Hessian of the final training loss, we can delete more samples (compared to DP) without giving up on prediction performance.
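To make this concrete, here is a minimal sketch of the Newton-step flavor of unlearning (my own toy version on ridge regression, where a single Hessian-based update recovers the retrained model exactly; the paper’s algorithms handle general convex losses and add noise for formal deletion guarantees):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, lam = 50, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge regression: theta minimizes ||X theta - y||^2 / 2 + lam ||theta||^2 / 2.
H = X.T @ X + lam * np.eye(d)
theta = np.linalg.solve(H, X.T @ y)

# "Unlearn" sample k with one Newton step on the remaining loss: its
# gradient at theta is minus the deleted sample's gradient, and the
# Hessian simply loses the rank-one term x_k x_k^T.
k = 7
x_k, y_k = X[k], y[k]
H_minus = H - np.outer(x_k, x_k)
grad_k = x_k * (x_k @ theta - y_k)       # gradient of the deleted sample's loss
theta_unlearned = theta + np.linalg.solve(H_minus, grad_k)

# Reference: retrain from scratch without sample k.
mask = np.arange(n) != k
theta_retrained = np.linalg.solve(
    X[mask].T @ X[mask] + lam * np.eye(d), X[mask].T @ y[mask]
)
print(np.allclose(theta_unlearned, theta_retrained))  # exact for quadratic losses
```

For quadratic losses the Newton step is exact; for general convex losses it is only approximate, which is precisely why such methods track how much deletion error they can afford.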
Discrete and continuous time

“Discrete time makes things easier to analyze”: not always. In fact, a growing number of works encourage us to interpret well-known algorithms from a continuous-time point of view. In Framing RNN as a kernel method: A neural ODE approach, one discovers that taking the continuous limit of recurrent networks and studying the resulting differential equation can yield useful generalization guarantees. Throw in some kernel methods for good measure, and you’ve got yourself a nice little theory-packed paper!
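The continuous limit is easy to visualize with a toy residual RNN (hypothetical weights, my own sketch): the update \(h_{t+1} = h_t + \frac{1}{T} f(h_t, x_t)\) is exactly an explicit Euler step for the ODE \(\mathrm{d}h/\mathrm{d}s = f(h, x(s))\), so refining the time grid barely changes the final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # hidden dimension (hypothetical)
W = rng.normal(size=(d, d)) / (2 * np.sqrt(d))
U = rng.normal(size=d) / 2

def hidden_state(T):
    # Residual RNN with T steps: h_{t+1} = h_t + (1/T) tanh(W h_t + U x_t),
    # i.e. an explicit Euler scheme for dh/ds = tanh(W h + U x(s)) on [0, 1].
    h = np.zeros(d)
    for t in range(T):
        x = t / T                        # a continuous input signal, sampled
        h = h + np.tanh(W @ h + U * x) / T
    return h

coarse, fine = hidden_state(100), hidden_state(1000)
print(np.linalg.norm(coarse - fine))     # small: both approximate the same ODE
```

Once the network is pinned to an ODE like this, the analysis can borrow the whole differential-equations toolbox, which is where the paper’s generalization bounds come from.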

“Well, continuous-time analysis may be simpler, but the resulting algorithm cannot be better”: wrong again! Continuized Accelerations of Deterministic and Stochastic Gradient Descents, and of Gossip Algorithms introduces a version of Nesterov acceleration for gradient descent that replaces the mixing steps with a differential equation and chooses discrete update times from a Poisson process. Not only does the convergence proof become much more intuitive, but this continuization approach also brings other benefits, most notably an unexpected potential for parallelization in certain settings. Watching the talk made me feel like this was sorcery; maybe you can convince me otherwise!
Machine learning & combinatorial optimization
A particular focus of my research, which I share with my advisor Axel Parmentier and my former student Louis Bouvier, is the use of learning algorithms to enhance optimization procedures.
Indeed, we can often leverage the sequential nature of real-life applications, in which we solve many similar problems in a sequence, to enhance these algorithms by learning from past decisions. This field of research is particularly active, as you can see from the following list of contributions. I will try to edit this post and add some comments within the next few days.
Speeding up general linear, quadratic and integer programs
- Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient
- Accelerating Quadratic Optimization with Reinforcement Learning
- Learning Large Neighborhood Search Policy for Integer Programming
- Learning to Schedule Heuristics in Branch and Bound
- Sample Complexity of Tree Search Configuration: Cutting Planes and Beyond
Learning from previous solutions of a specific problem
- Faster Matchings via Learned Duals
- USCO-Solver: Solving Undetermined Stochastic Combinatorial Optimization Problems
- Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks
- Learning Hard Optimization Problems: A Data Generation Perspective
- A Bi-Level Framework for Learning to Solve Combinatorial Optimization on Graphs
Bridging the gap between discrete structures and differentiable learning
- Learning with Algorithmic Supervision via Continuous Relaxations
- Combining Latent Space and Structured Kernels for Bayesian Optimization over Combinatorial Spaces
- DiBS: Differentiable Bayesian Structure Learning
- Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction
- PiRank: Scalable Learning To Rank via Differentiable Sorting
- Leveraging Recursive Gumbel-Max Trick for Approximate Inference in Combinatorial Spaces
Routing problems
- Learning to delegate for large-scale vehicle routing
- Learning Collaborative Policies to Solve NP-hard Routing Problems
- NeuroLKH: Combining Deep Learning Model with Lin-Kernighan-Helsgaun Heuristic for Solving the Traveling Salesman Problem
- Learning Space Partitions for Path Planning
Combinatorial optimization for machine learning
- Differentiable Optimization of Generalized Non-decomposable Functions using Linear Programs
- Partition-Based Formulations for Mixed-Integer Optimization of Trained ReLU Neural Networks