This is a list of wonderful papers in machine learning, reflecting my own tastes and interests.
Least Squares Quantization in PCM by Stuart P. Lloyd (published in 1982, though the method dates back to his 1957 Bell Labs memo). Definition of Lloyd's algorithm for k-means clustering (a minimal sketch appears after this list).
The James-Stein paradox in estimation, by James and Stein (1961). Sometimes, maximum likelihood is not the best estimator, even in an L2 world (the estimator is written out after this list).
Generalized Procrustes analysis by Gower (1975)
Universal approximation theorem by Cybenko (1989)
Compressed sensing paper by Candès, Romberg, and Tao (2006).
Scale Mixtures of Gaussians and the Statistics of Natural Images by Wainwright and Simoncelli (1999).
Probabilistic PCA by Tipping and Bishop (1999). A lightweight generative model (the model is spelled out after this list).
Annealed importance sampling by Radford Neal (2001).
A computational approach to edge detection by John Canny (1986)
The Sparse PCA paper by Zou, Hastie, Tibshirani (2006).
The BBP transition by Baik, Ben Arous, and Péché (2004). Probably the most important paper in random matrix theory.
On spectral clustering by Ng, Jordan and Weiss (2001).
Hyvärinen's score matching paper (2005); the objective is recalled after this list.
Exact Matrix Completion via Convex Optimization by Candès and Recht (2009). Matrix completion cast as convex (nuclear-norm) optimization.
Matrix completion from a few entries by Keshavan, Montanari and Oh (2009). Matrix completion by SVD thresholding is (or at least was) the go-to method for sparse matrix completion (a soft-thresholding sketch appears after this list).
Adaptive Mixtures of Local Experts by Jacobs et al. (1991). Introduces the now-famous mixture-of-experts (MoE) method.
The NTK paper by Jacot, Gabriel and Hongler (2018).
Density estimation by dual ascent of the log-likelihood by Tabak and Vanden-Eijnden (2010), an early formulation of the normalizing-flow idea (building a density by composing simple invertible maps).
Implicit self-regularization in deep neural networks by Martin and Mahoney (2021). On the evolution of the spectra of DNN weight matrices during training, from a random matrix theory perspective.
Edge of Stability paper by Cohen et al. (2021).
A U-turn on double descent by Curth et al. (2023).
Emergence of scaling in random networks, the original preferential attachment paper by Barabási and Albert (1999)
Error in high-dimensional GLMs by Barbier et al. (2018)
Spectral algorithms for clustering by Nadakuditi and Newman (2012), from a random matrix theory perspective.
Spectral redemption in clustering sparse networks by Krzakala et al. (2013): classical spectral clustering fails on sparse graphs, but the authors show that clustering with the spectrum of the non-backtracking operator, rather than the usual adjacency or Laplacian matrices, succeeds.
On Estimation of a Probability Density Function and Mode, the famous kernel density estimation paper by Parzen (1962)
Power laws, Pareto distributions and Zipf's law, the survey by Newman (2005) on heavy-tailed distributions
Best subset or Lasso, by Hastie, Tibshirani and Friedman (2017)
Smoothing by spline functions, one of the seminal papers on spline smoothing, by Reinsch (1967)
Spline smoothing is almost kernel smoothing, a striking paper by Silverman (1984), and its generalization by Ong, Milanfar and Getreuer (2019). Global optimization problems (such as interpolation) can be approximated by local operations (kernel smoothing); a small numerical comparison appears after this list.
Tweedie's formula and selection bias, a landmark paper by Bradley Efron (2011). Tweedie's formula is key to many techniques in statistics, including diffusion-based generative models (the formula is written out after this list).
The Adam optimizer by Kingma and Ba (2014); the update rule is sketched after this list.
The BatchNorm paper by Ioffe and Szegedy (2015).
The LayerNorm paper by Ba et al. (2016).
The Dropout paper by Srivastava et al. (2014).
The AlexNet paper by Krizhevsky, Sutskever, and Hinton (2012).
Normalizing flows by Rezende and Mohamed (2015). They're not so popular now, but the paper is really a gem.
Invariant and equivariant graph networks by Maron et al. (2019). They compute the dimension of invariant and equivariant linear layers and study GNN expressivity.
The original paper introducing generative diffusion models, by Sohl-Dickstein et al. (2015)
The second foundational diffusion paper, by Song et al. (2020)
The Stable Diffusion paper by Rombach et al. (2021)
The Neural ODE paper by Chen et al. (2018)
Attention Is All You Need by Vaswani et al. (2017). This paper changed the world.
RoFormer: rotary position embeddings (RoPE) by Su et al. (2021), https://arxiv.org/abs/2104.09864, a killer method (a minimal sketch appears after this list).
Flow matching by Lipman et al. (2022), the most elegant generalization of diffusion models (the training objective is recalled after this list).
The data-driven Schrödinger bridge by Pavon, Tabak and Trigila (2021)
Language Models are Few-Shot Learners, the GPT-3 paper by Brown et al. (2020): few-shot, in-context learning emerges as language models are scaled up.
The Wasserstein GAN paper by Arjovsky, Chintala and Bottou (2017)
YOLO (You Only Look Once) by Redmon et al. (2016), now at its 11th version!
Deep learning for symbolic mathematics by Lample and Charton (2019)
The ConvMixer paper (Patches Are All You Need? by Trockman and Kolter, 2022): a competitive convolutional network whose implementation fits in a tweet.
An image is worth 16x16 words, the original Vision Transformer paper by Dosovitskiy et al. (2020). The paper that started the revolution of transformers in computer vision.
Per-Pixel Classification is Not All You Need for Semantic Segmentation by Cheng et al. (2021), the MaskFormer paper
Segment Anything, the SAM paper by Kirillov et al. (2023), which really pushed the field forward.
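
Below are a few minimal sketches and formulas for some of the entries above. They are illustrative reconstructions under stated assumptions, not the papers' exact algorithms or notation.

Lloyd's algorithm for k-means, from the Lloyd (1982) entry: a plain NumPy sketch; the random initialization and stopping rule here are arbitrary choices, not the paper's.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points picked at random (an arbitrary choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Example: two Gaussian blobs in 2D.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
labels, centroids = kmeans(X, k=2)
```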
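The James-Stein estimator, for the James and Stein (1961) entry: for a single observation $x \sim \mathcal{N}(\mu, I_p)$ with $p \ge 3$, the shrinkage estimator below dominates the maximum-likelihood estimator $\hat\mu_{\mathrm{ML}} = x$ in squared-error risk.

$$
\hat\mu_{\mathrm{JS}} = \left(1 - \frac{p-2}{\|x\|^2}\right) x .
$$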
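The probabilistic PCA model, for the Tipping and Bishop (1999) entry: a latent linear-Gaussian model whose marginal is a low-rank-plus-noise Gaussian.

$$
z \sim \mathcal{N}(0, I_q), \qquad
x \mid z \sim \mathcal{N}(Wz + \mu, \sigma^2 I_d), \qquad
x \sim \mathcal{N}(\mu,\; WW^\top + \sigma^2 I_d).
$$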
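The score matching objective, for the Hyvärinen (2005) entry: with model score $s_\theta(x) = \nabla_x \log p_\theta(x)$, integration by parts turns the intractable comparison with the data score into an expectation over the data alone,

$$
J(\theta) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[ \operatorname{tr}\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\, \big\| s_\theta(x) \big\|^2 \right] + \text{const}.
$$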
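An SVD-thresholding sketch for matrix completion, for the Keshavan, Montanari and Oh entry: this is a generic soft-impute-style iteration (iterative soft singular-value thresholding), not the exact algorithm of that paper; the threshold tau and iteration count are arbitrary.

```python
import numpy as np

def svt_complete(M, mask, tau=1.0, n_iters=200):
    """Fill missing entries of M (where mask is False) by iterative
    soft singular-value thresholding (soft-impute-style sketch)."""
    Z = np.where(mask, M, 0.0)
    X = Z.copy()
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - tau, 0.0)   # soft-threshold the singular values
        X = (U * s) @ Vt               # current low-rank estimate
        Z = np.where(mask, M, X)       # keep observed entries, fill the rest
    return X

# Example: a rank-2 matrix with roughly half of its entries observed.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
mask = rng.random(M.shape) < 0.5
M_hat = svt_complete(M, mask)
```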
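A small numerical illustration for the Silverman (1984) entry: a smoothing spline (a global penalized fit) and a simple Nadaraya-Watson kernel smoother (a purely local average) produce very similar curves on noisy data. The bandwidth and smoothing parameter below are picked by hand, not via the paper's asymptotic correspondence.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(200)

# Global fit: smoothing spline (penalized least squares over the whole data set).
spline = UnivariateSpline(x, y, s=len(x) * 0.3**2)

# Local fit: Nadaraya-Watson kernel smoother with a Gaussian kernel.
def kernel_smooth(x_query, x, y, h=0.05):
    w = np.exp(-0.5 * ((x_query[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

grid = np.linspace(0.1, 0.9, 100)  # interior grid, away from boundary effects
gap = np.max(np.abs(spline(grid) - kernel_smooth(grid, x, y)))
print(gap)  # the two smooths stay close on the interior
```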
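Tweedie's formula, for the Efron (2011) entry: if $y \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ and $\mu$ has some prior, the posterior mean only requires the marginal density $p(y)$,

$$
\mathbb{E}[\mu \mid y] = y + \sigma^2 \, \frac{d}{dy} \log p(y),
$$

which is exactly the denoising/score relation exploited by diffusion-based generative models.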
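The Adam update, for the Kingma and Ba (2014) entry: exponential moving averages of the gradient and squared gradient with bias correction, using the paper's default hyperparameters.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1**t)             # bias correction for the zero initialization
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```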
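Rotary position embeddings, for the RoFormer entry: each pair of feature dimensions of a query/key vector is rotated by an angle proportional to the token position, so attention dot products depend only on relative positions. A minimal NumPy sketch; the interleaved pairing below is one common convention.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, d), with d even."""
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1) token positions
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair rotation frequencies
    angles = pos * freqs                         # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]              # pair up consecutive dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin           # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q)[i], rope(k)[j]> depends on i and j only through i - j.
```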
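The conditional flow matching objective, for the Lipman et al. (2022) entry, written here in its simplest linear-path form (the paper's Gaussian conditional paths are slightly more general): sample a pair from the source and data distributions, interpolate, and regress a velocity field onto the straight-line velocity,

$$
x_t = (1 - t)\, x_0 + t\, x_1, \qquad
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_1}
\big\| v_\theta(t, x_t) - (x_1 - x_0) \big\|^2 .
$$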