publications
2024
- ProceedingsAnalysis of Bootstrap and Subsampling in High-dimensional Regularized RegressionLucas Clarté, Adrien Vandenbroucque, Guillaume Dalle, and 3 more authorsIn The 40th Conference on Uncertainty in Artificial Intelligence, Jun 2024
We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples \n and dimension \d of the covariates grow at a comparable rate: {}alpha=n/d fixed. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when {}alpha is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime {}alpha<1 relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization.
- JournalHiddenMarkovModels.Jl: Generic, Fast and Reliable State Space ModelingGuillaume DalleJournal of Open Source Software, Apr 2024
Dalle, G., (2024). HiddenMarkovModels.jl: generic, fast and reliable state space modeling. Journal of Open Source Software, 9(96), 6436, https://doi.org/10.21105/joss.06436
2023
- JournalSolving a Continent-Scale Inventory Routing Problem at RenaultLouis Bouvier, Guillaume Dalle, Axel Parmentier, and 1 more authorTransportation Science, Oct 2023
This paper is the fruit of a partnership with Renault. Their reverse logistic requires solving a continent-scale multiattribute inventory routing problem (IRP). With an average of 30 commodities, 16 depots, and 600 customers spread across a continent, our instances are orders of magnitude larger than those in the literature. Existing algorithms do not scale, so we propose a large neighborhood search (LNS). To make it work, (1) we generalize existing split delivery vehicle routing problems and IRP neighborhoods to this context, (2) we turn a state-of-the-art matheuristic for medium-scale IRP into a large neighborhood, and (3) we introduce two novel perturbations: the reinsertion of a customer and that of a commodity into the IRP solution. We also derive a new lower bound based on a flow relaxation. In order to stimulate the research on large-scale IRP, we introduce a library of industrial instances. We benchmark our algorithms on these instances and make our code open source. Extensive numerical experiments highlight the relevance of each component of our LNS. Funding: This work was supported by Renault Group. Supplemental Material: The online appendix is available at https://doi.org/10.1287/trsc.2022.0342.
2022
- PreprintLearning with Combinatorial Optimization Layers: A Probabilistic ApproachGuillaume Dalle, Léo Baty, Louis Bouvier, and 1 more authorDec 2022
Combinatorial optimization (CO) layers in machine learning (ML) pipelines are a powerful tool to tackle data-driven decision tasks, but they come with two main challenges. First, the solution of a CO problem often behaves as a piecewise constant function of its objective parameters. Given that ML pipelines are typically trained using stochastic gradient descent, the absence of slope information is very detrimental. Second, standard ML losses do not work well in combinatorial settings. A growing body of research addresses these challenges through diverse methods. Unfortunately, the lack of well-maintained implementations slows down the adoption of CO layers. In this paper, building upon previous works, we introduce a probabilistic perspective on CO layers, which lends itself naturally to approximate differentiation and the construction of structured losses. We recover many approaches from the literature as special cases, and we also derive new ones. Based on this unifying perspective, we present InferOpt.jl, an open-source Julia package that 1) allows turning any CO oracle with a linear objective into a differentiable layer, and 2) defines adequate losses to train pipelines containing such layers. Our library works with arbitrary optimization algorithms, and it is fully compatible with Julia’s ML ecosystem. We demonstrate its abilities using a pathfinding problem on video game maps as guiding example, as well as three other applications from operations research.
- ThesisMachine Learning and Combinatorial Optimization Algorithms, with Applications to Railway PlanningGuillaume DalleÉcole des Ponts ParisTech, Dec 2022
This thesis investigates the frontier between machine learning and combinatorial optimization, two active areas of applied mathematics research. We combine theoretical insights with efficient algorithms, and develop several open source Julia libraries. Inspired by a collaboration with the Société nationale des chemins de fer français (SNCF), we study high-impact use cases from the railway world: train failure prediction, delay propagation, and track allocation. In Part I, we provide mathematical background and describe software implementations for various tools that will be needed later on: implicit differentiation, temporal point processes, Hidden Markov Models and Multi-Agent Path Finding. Our publicly available code fills a void in the Julia package ecosystem, aiming at ease of use without compromising on performance. In Part II, we highlight theoretical contributions related to both statistics and decision-making. We consider a Vector AutoRegressive process with partial observations, and prove matching upper and lower bounds on the estimation error. We unify and extend the state of the art for combinatorial optimization layers in deep learning, gathering various approaches in a Julia library called InferOpt.jl. We also seek to differentiate through multi-objective optimization layers, which leads to a novel theory of lexicographic convex analysis. In Part III, these mathematical and algorithmic foundations come together to tackle railway problems. We design a hierarchical model of train failures, propose a graph-based framework for delay propagation, and suggest new avenues for track allocation, with the Flatland challenge as a testing ground.
- PreprintMinimax Estimation of Partially-Observed Vector AutoRegressionsGuillaume Dalle, and Yohann de CastroMay 2022
High-dimensional time series are a core ingredient of the statistical modeling toolkit, for which numerous estimation methods are known.But when observations are scarce or corrupted, the learning task becomes much harder.The question is: how much harder? In this paper, we study the properties of a partially-observed Vector AutoRegressive process, which is a state-space model endowed with a stochastic observation mechanism.Our goal is to estimate its sparse transition matrix, but we only have access to a small and noisy subsample of the state components.Interestingly, the sampling process itself is random and can exhibit temporal correlations, a feature shared by many realistic data acquisition scenarios.We start by describing an estimator based on the Yule-Walker equation and the Dantzig selector, and we give an upper bound on its non-asymptotic error.Then, we provide a matching minimax lower bound, thus proving near-optimality of our estimator.The convergence rate we obtain sheds light on the role of several key parameters such as the sampling ratio, the amount of noise and the number of non-zero coefficients in the transition matrix.These theoretical findings are commented and illustrated by numerical experiments on simulated data.
2020
- PatentMethod for Determining a Soiling Speed of a Photovoltaic Generation UnitPierre Stephan, and Guillaume DalleJun 2020
The invention relates to a method for determining a soiling speed of a photovoltaic generation unit, in which, on the basis of values of an electrical variable generated by the photovoltaic generation unit at a plurality of moments in a time series and corresponding values of meteorological parameters, and on the basis of a relationship between an electrical variable generated by said generation unit at one moment, the values taken with the meteorological parameters at the same moment, and the occurrence of cleaning events, wherein said relationship comprises multiple relational parameters including the speed of soiling and the occurrence of a cleaning event is modelled by a probability law involving a relational parameter, the soiling speed is determined in iterations in which vectors that are representative of the occurrence of cleaning events are simulated.