This is my first attempt at a blog post, so please be kind!
NeurIPS is one of the leading conferences in machine learning, and due to an ongoing pandemic, the 2021 edition took place online. On the one hand, this means that AI researchers from around the world didn’t get to socialize in person and enjoy the free food. On the other hand, it also means that I was able to follow most of the talks from my living room, instead of spending my yearly carbon budget on a round-trip to Sydney.
Now that I have finally caught up on everything I wanted to watch, it’s time for a quick summary of my personal highlights! All of the papers I mention are available on the pre-proceedings website. This is a long post, but in my defense, there were several thousand accepted papers at NeurIPS 2021, so I think my compression rate is quite impressive already.
Machine learning theory
The papers in this section provide a critical perspective on some standard assumptions that most ML practitioners would accept without batting an eye.
“Markov rewards are enough”: it depends on what you want to do with them. On the Expressivity of Markov Reward questions this fundamental paradigm of reinforcement learning by rethinking the notion of task. Indeed, maximizing an agent’s reward is only a proxy to steer it towards a set of acceptable behaviors or trajectories. And as it turns out, the standard reward framework of MDPs is not sufficient to express every possible task: for instance, the policies achieving the task “always move in the same direction” cannot be given a strictly higher value than all others when using only Markov rewards. The key intuition is that Markov rewards are quick to forgive past offenses, whereas such “lifelong” tasks will never forget even one mistake!
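To make that intuition concrete, here is a toy sketch (my own construction with made-up numbers, not the paper’s proof): if a Markov reward depends only on the action taken, the return of a trajectory that mixes both directions is an average of the two “pure” returns, so it can never be ranked strictly below both acceptable behaviors.

```python
# Toy sketch (hypothetical example, not from the paper): on a line, the agent
# picks actions L or R, and a Markov reward depends only on the action taken.
def trajectory_return(actions, r):
    return sum(r[a] for a in actions)

T = 10
r = {"L": 0.3, "R": -0.7}  # any candidate Markov reward

pure_left = trajectory_return("L" * T, r)        # always-left: acceptable
pure_right = trajectory_return("R" * T, r)       # always-right: acceptable
mixed = trajectory_return("L" * 5 + "R" * 5, r)  # switches direction: not acceptable

# The mixed return is an average of the two pure returns, so no choice of r
# can rank BOTH acceptable behaviors strictly above the unacceptable one.
assert min(pure_left, pure_right) <= mixed <= max(pure_left, pure_right)
```

Swapping in any other values for `r` leaves the sandwich inequality intact, which is exactly why tasks like “always move in the same direction” escape Markov rewards.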
“There is a classification measure that does everything I want”: sorry, but you cannot have your cake and eat it too. There are many properties we might wish for in a measure, for instance monotonicity or distance-like behavior. Another important one is the constant baseline property, which ensures that random labelings get similar scores regardless of class sizes. Unfortunately, as was shown by Good Classification Measures and How to Find Them, no binary classification measure satisfies all three criteria at once. What a bummer…
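To see why the constant baseline property matters, here is a quick sketch (my own numbers, not from the paper): plain accuracy violates it, because the expected score of a random labeling depends on the class proportions.

```python
# Sketch (hypothetical numbers): expected accuracy of a random labeling that
# predicts "positive" with probability q when the true positive rate is p.
def expected_accuracy(p, q):
    return p * q + (1 - p) * (1 - q)

# A random labeling matched to the class proportions (q = p) scores very
# differently on balanced vs imbalanced data: no constant baseline here.
balanced = expected_accuracy(0.5, 0.5)    # = 0.5
imbalanced = expected_accuracy(0.9, 0.9)  # ≈ 0.82
```

This is why heavily imbalanced benchmarks can make a do-nothing classifier look impressive on accuracy alone.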
“Stochastic gradient descent can learn anything”: well, it actually depends on the minibatch size and gradient precision. On the Power of Differentiable Learning versus PAC and SQ Learning shows that there are some settings where the SGD framework is strictly weaker than the PAC learning framework in terms of expressivity. Disappointing, right?
Uncertainty and privacy
“Numerical algorithms are reliable”: not fully, because they often rely on a small parameter \(h\) that cannot be shrunk all the way to \(0\). A good example is the step size when solving differential equations. Black Box Probabilistic Numerics proposes to bridge this gap by using a sequence of increasingly precise computations to interpolate the \(h \to 0\) limit using a Gaussian process. The resulting quantity of interest is no longer deterministic, but stochastic: its variance can then be interpreted as an epistemic uncertainty related to the finite precision of the underlying solver.
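The paper interpolates the limit with a Gaussian process; the sketch below illustrates the same underlying idea with a much cruder tool (my own toy setup): linear, Richardson-style extrapolation in \(h\) for an explicit Euler solver.

```python
import math

# Toy sketch (not the paper's GP machinery): run explicit Euler on
# dy/dt = y, y(0) = 1 at two step sizes and extrapolate h -> 0 linearly,
# since Euler's global error is O(h). The exact answer is y(1) = e.
def euler(h):
    y = 1.0
    for _ in range(round(1.0 / h)):
        y += h * y
    return y

h1, h2 = 0.1, 0.05
y1, y2 = euler(h1), euler(h2)
y_extrap = (h1 * y2 - h2 * y1) / (h1 - h2)  # linear fit in h, evaluated at h = 0

# The extrapolated value is closer to e than either raw solver output.
assert abs(y_extrap - math.e) < abs(y2 - math.e) < abs(y1 - math.e)
```

The spread between such estimates at different step sizes is the raw material that the paper’s Gaussian process turns into a principled epistemic uncertainty.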
“Differential privacy is the only way to go”: that may be true if you want to be robust against attacks regarding all members of the dataset. However, if you only care about the privacy of some data points (for instance, a group of users asking to be removed from your database), then you can do better by following Master Yoda’s advice: “You must unleaaaarn what you have leaaarned”. More specifically, you must Remember What You Want to Forget: Algorithms for Machine Unlearning. The authors of this paper show that by exploiting the Hessian of the final training loss, we can delete more samples (when compared to DP) without giving up on prediction performance.
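Here is a minimal sketch of the Hessian idea in the friendliest possible setting (my own simplification, not the paper’s algorithm: ridge regression, where the loss is quadratic and a single Newton step is exact; the paper handles much more general losses).

```python
import numpy as np

# Minimal unlearning sketch (my simplification): for ridge regression, one
# Newton step with the Hessian of the *remaining* data exactly reproduces
# retraining without the deleted sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1  # ridge regularization strength

def fit(X, y):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta = fit(X, y)

# Delete sample 0: Newton step using the Hessian without that sample.
x0, y0 = X[0], y[0]
H_minus = X.T @ X - np.outer(x0, x0) + lam * np.eye(3)
theta_unlearned = theta + np.linalg.solve(H_minus, x0 * (x0 @ theta - y0))

assert np.allclose(theta_unlearned, fit(X[1:], y[1:]))
```

Because the loss is quadratic here, the Newton step lands exactly on the retrained solution; for general losses this only holds approximately, which is where the paper’s analysis comes in.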
Discrete and continuous time
“Discrete time makes things easier to analyze”: not always. In fact, a growing number of works encourage us to interpret well-known algorithms from a continuous-time point of view. In Framing RNN as a kernel method: A neural ODE approach, one discovers that taking the continuous limit of recurrent networks and studying the resulting differential equation can yield useful generalization guarantees. Throw in some kernel methods for good measure, and you got yourself a nice little theory-packed paper!
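The continuous limit in question can be pictured with a one-dimensional toy model (my own, far simpler than the paper’s setting): a residual RNN update is exactly the explicit Euler scheme for an ODE, so refining the number of steps drives the hidden state toward the ODE solution.

```python
import math

# Toy sketch (my own scalar model, not the paper's construction): the residual
# update h <- h + (1/T) * tanh(h + x_t) is explicit Euler for the ODE
# dh/ds = tanh(h + x(s)) on [0, 1], with inputs sampled from a smooth signal.
def residual_rnn(T):
    h = 0.0
    for t in range(T):
        x = math.sin(2 * math.pi * t / T)  # continuous input, sampled T times
        h += (1.0 / T) * math.tanh(h + x)
    return h

coarse, fine, reference = residual_rnn(50), residual_rnn(200), residual_rnn(6400)

# Euler's error is O(1/T): finer discretizations land closer to the limit.
assert abs(fine - reference) < abs(coarse - reference)
```

The paper studies this limiting ODE (plus the kernel machinery) to obtain its generalization guarantees.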
“Well, continuous-time analysis may be simpler but the resulting algorithm cannot be better”: wrong again! Continuized Accelerations of Deterministic and Stochastic Gradient Descents, and of Gossip Algorithms introduces a version of Nesterov acceleration for gradient descent that replaces the mixing steps with a differential equation and chooses discrete update times from a Poisson process. Not only does the convergence proof become much more intuitive, but this continuization approach also brings other benefits, most notably an unexpected potential for parallelization in certain settings. Watching the talk made me feel like this was sorcery, maybe you can convince me otherwise!
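If it helps demystify the sorcery a little, here is a drastically simplified sketch (my own, with plain gradient descent and none of the paper’s Nesterov mixing): updates happen at random times drawn from a Poisson process, with each step scaled by the elapsed time.

```python
import random

# Heavily simplified sketch (my own, not the paper's continuized Nesterov):
# gradient steps on f(x) = x^2 happen at Poisson arrival times, each scaled
# by the time elapsed since the previous update (capped for stability).
random.seed(0)

def continuized_gd(x0=10.0, rate=5.0, horizon=20.0, step_cap=0.3):
    x, t = x0, 0.0
    while True:
        dt = random.expovariate(rate)  # exponential inter-arrival time
        t += dt
        if t > horizon:
            return x
        x -= min(dt, step_cap) * 2 * x  # gradient of x^2 is 2x

x_final = continuized_gd()
assert abs(x_final) < 1e-3  # converges to the minimizer x = 0
```

The actual contribution couples two variables with Nesterov-style mixing between jumps; the payoff is that the expected dynamics become a clean ODE that is much easier to analyze than the discrete recursion.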
Machine learning & combinatorial optimization
Combinatorial optimization algorithms are usually designed to solve each instance from scratch. Yet in real-life applications we often solve many similar problems in a sequence, which means we can enhance these algorithms by learning from past decisions. This field of research is particularly active, as you can see from the following list of contributions. I will try to edit this post and add some comments within the next few days.
Speeding up general linear, quadratic and integer programs
- Practical Large-Scale Linear Programming using Primal-Dual Hybrid Gradient
- Accelerating Quadratic Optimization with Reinforcement Learning
- Learning Large Neighborhood Search Policy for Integer Programming
- Learning to Schedule Heuristics in Branch and Bound
- Sample Complexity of Tree Search Configuration: Cutting Planes and Beyond
Learning from previous solutions of a specific problem
- Faster Matchings via Learned Duals
- USCO-Solver: Solving Undetermined Stochastic Combinatorial Optimization Problems
- Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks
- Learning Hard Optimization Problems: A Data Generation Perspective
- A Bi-Level Framework for Learning to Solve Combinatorial Optimization on Graphs
Bridging the gap between discrete structures and differentiable learning
- Learning with Algorithmic Supervision via Continuous Relaxations
- Combining Latent Space and Structured Kernels for Bayesian Optimization over Combinatorial Spaces
- DiBS: Differentiable Bayesian Structure Learning
- Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction
- PiRank: Scalable Learning To Rank via Differentiable Sorting
- Leveraging Recursive Gumbel-Max Trick for Approximate Inference in Combinatorial Spaces
- Learning to delegate for large-scale vehicle routing
- Learning Collaborative Policies to Solve NP-hard Routing Problems
- NeuroLKH: Combining Deep Learning Model with Lin-Kernighan-Helsgaun Heuristic for Solving the Traveling Salesman Problem
- Learning Space Partitions for Path Planning
Combinatorial optimization for machine learning
- Differentiable Optimization of Generalized Nondecomposable Functions using Linear Programs
- Partition-Based Formulations for Mixed-Integer Optimization of Trained ReLU Neural Networks