This is a list of wonderful papers in machine learning, reflecting my own tastes and interests.
Compact finite difference schemes by Lele (1992). The paper that introduced the concept of compact finite difference schemes.
Generalized Procrustes analysis by Gower (1975)
Least Squares Quantization in PCM by Stuart P. Lloyd (published in 1982, although he found the method more than twenty years earlier). Definition of Lloyd's algorithm for k-means clustering.
Robbins-Monro, the original paper by Robbins and Monro (1951). The first paper on stochastic optimization.
The James-Stein paradox in estimation by James and Stein (1961). Sometimes, maximum likelihood is not the best estimator, even in an L2 world.
Compressed sensing paper by Candès, Romberg, and Tao (2006).
Scale Mixtures of Gaussians and the Statistics of Natural Images by Wainwright and Simoncelli (1999).
Probabilistic PCA by Tipping and Bishop (1999). Lightweight generative model.
Annealed importance sampling by Radford Neal (2001).
A computational approach to edge detection by John Canny (1986)
The Sparse PCA paper by Zou, Hastie, Tibshirani (2006).
The BBP transition by Baik, Ben Arous, and Péché (2004). Probably the most important paper in random matrix theory.
On spectral clustering by Ng, Jordan and Weiss (2001).
Hyvärinen's Score Matching paper in 2005.
Exact Matrix Completion via Convex Optimization by Candès and Recht (2009). Matrix completion via optimization.
Matrix completion from a few entries by Keshavan, Montanari and Oh (2009). Matrix completion from SVD thresholding is (was?) the go-to method for sparse matrix completion.
Adaptive mixtures of experts by Jacobs et al. Introduces the famous MoE method that was popularized recently by Mistral.
Emergence of scaling in random networks, the original paper by Barabasi and Albert (1999)
Error in high-dimensional GLMs by Barbier et al. (2018)
Spectral algorithms for clustering by Nadakuditi and Newman, from an RMT perspective.
Spectral redemption in clustering sparse networks by Krzakala et al. (2013): classical versions of spectral clustering fail for sparse graphs, but the authors show that a simple modification of the Laplacian matrix can lead to a successful clustering.
On Estimation of a Probability Density Function and Mode, the famous kernel density estimation paper by Parzen (1962)
Power laws, Pareto distributions and Zipf’s law, the survey by Newman on heavy-tails
ISOMAP, nonlinear dimensionality reduction for manifold learning.
t-SNE, the paper introducing the t-SNE dimension reduction technique, by van der Maaten and Hinton (2008)
Best subset or Lasso, by Hastie, Tibshirani and Friedman (2017)
Smoothing by spline functions, one of the seminal papers on spline smoothing, by Reinsch (1967)
Spline smoothing is almost kernel smoothing, a striking paper by Silverman (1984), and its generalization by Ong, Milanfar and Getreuer (2019). Global optimization problems (such as interpolation) can be approximated by local operations (kernel smoothing).
Tweedie's formula and selection bias, a landmark paper by Bradley Efron. Tweedie's formula is key to many techniques in statistics, including diffusion-based generative models.
Local Learning algorithms by Bottou and Vapnik (1992). The paper that introduced the concept of local learning algorithms.
The ADAM optimizer by Kingma and Ba (2014).
The BatchNorm paper by Ioffe and Szegedy (2015).
The LayerNorm paper by Ba et al. (2016).
The Dropout paper by Srivastava et al. (2014).
The AlexNet paper by Krizhevsky, Sutskever, and Hinton (2012).
Density estimation by dual ascent of the log-likelihood by Tabak and Vanden-Eijnden (2010), first definition of coupling layers for normalizing flows.
Normalizing flows by Rezende and Mohamed (2015). They're not so popular now, but the paper is really a gem.
Invariant and equivariant graph networks by Maron et al. (2019). They compute the dimension of invariant and equivariant linear layers and study GNN expressivity.
The original paper introducing generative diffusion models, by Sohl-Dickstein et al (2015)
The second paper on diffusions by Song et al (2020), where they notably detailed the SDE formulation.
The Stable Diffusion paper by Rombach et al (2021)
The Neural ODE paper by Chen et al. (2018)
Attention is all you need, 2017. No comment.
Flow matching by Lipman et al, 2022, the most elegant generalization of diffusion models. Flow matching models are now SOTA and it is clear that diffusions will, at some point, disappear.
The data-driven Schrödinger bridge by Pavon, Tabak and Trigila (2021)
The Wasserstein GAN paper by Arjovsky, Chintala and Bottou (2017)
The ConvMixer paper: fitting a big convolutional network in a tweet.
YOLO, now at its 11th version and still improving!
Deep learning for symbolic mathematics by Lample and Charton (2019)
An image is worth 16x16 words, the original Vision Transformer paper by Dosovitskiy et al. (2020). The paper that started the revolution of transformers in computer vision.
The DINOv2 paper from FAIR explains how they scaled a fully self-supervised feature extractor. That is, they learned the feature extractor (a ViT) using a dataset containing ONLY images: no labels, no text. The teacher/student setting is set up in a particularly smart way.
Per-Pixel Classification is Not All You Need for Semantic Segmentation: the paper that convinced me that segmentation is actually really hard.
Segment Anything, the original paper on segmentation by Kirillov et al. (2023) which really pushed the field forward.
Universal approximation theorem by Cybenko (1989). Roughly speaking, classes of neural networks are dense.
Language models are few-shot learners, on LLM scaling laws.
Implicit regularization in deep networks by Martin and Mahoney (2021). On the training dynamics of the Hessian spectrum of DNNs.
Edge of Stability paper by Cohen et al.
A U-turn on double descent by Curth et al.
The NTK paper by Jacot, Gabriel and Hongler (2018).
ViTs need Registers, by T. Darcet et al. A strikingly simple observation: ViTs store some internal information inside the tokens, because they need to store it somewhere. Adding a small "memory register" (i.e., additional tokens) solves it. A very nice scientific paper.
Chain-of-Thought, a landmark technique for getting better reasoning at inference time. Partly responsible for the huge gap in reasoning quality of LLMs between 2023 and 2024.
Rotary Positional Encoding: previously, positional encoding did not retain relative position information. By working in the complex plane with rotations, this is no longer a problem. This solution is now practically implemented everywhere.
LatRon, the very recent Latent Reasoning paper: instead of asking an LLM to generate an answer to a question, you can fine-tune it to generate "the" question which, given to the LLM, maximizes the probability of the answer, and it's often not the raw question! A very clever and simple trick.
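
As a small illustration of the James-Stein entry in this list, here is a minimal simulation sketch, assuming NumPy, unit-variance Gaussian noise and shrinkage toward the origin. The helper name `james_stein`, the dimension, the true mean and the number of trials are illustrative choices, not taken from the paper. It compares the average squared error of the raw observation (the maximum likelihood estimate) with that of the shrunk estimate.

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Shrink each row of x toward 0 with the classical James-Stein factor.

    Hypothetical helper for illustration; assumes known noise variance
    sigma2 and dimension d >= 3.
    """
    d = x.shape[-1]
    norm2 = np.sum(x**2, axis=-1, keepdims=True)
    return (1.0 - (d - 2) * sigma2 / norm2) * x

rng = np.random.default_rng(0)
d, n_trials = 10, 20000           # illustrative values
theta = np.ones(d)                # arbitrary true mean
x = theta + rng.standard_normal((n_trials, d))  # one noisy observation per trial

# Average squared L2 error of each estimator over the trials
mse_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
print("MLE risk:", mse_mle, "James-Stein risk:", mse_js)
```

In this setting the maximum likelihood risk should be close to the dimension `d`, and the James-Stein risk should come out smaller; the gap shrinks as the true mean moves away from the shrinkage target.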
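
The Rotary Positional Encoding entry in this list says that working in the complex plane with rotations retains relative position information. A minimal sketch of that property, assuming NumPy, a single unbatched vector of even dimension, and the usual geometric frequency schedule with base 10000 (the function name `rope` and the test positions are illustrative, not from any particular library):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector x by angles pos * theta_k.

    Hypothetical helper for illustration; assumes len(x) is even.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    z = x[0::2] + 1j * x[1::2]                  # view each pair as a complex number
    z = z * np.exp(1j * pos * theta)            # rotation in the complex plane
    out = np.empty_like(x)
    out[0::2], out[1::2] = z.real, z.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset (4) between query and key positions, very different absolute positions
s_near = rope(q, 3) @ rope(k, 7)
s_far = rope(q, 103) @ rope(k, 107)
print(s_near, s_far)
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors only depends on the difference of their rotation angles, that is, on the relative position.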