This is a list of wonderful papers in machine learning, reflecting my own tastes and interests.

Least Squares Quantization in PCM by Stuart P. Lloyd (published in 1982, though the method dates back to a 1957 Bell Labs note). Defines Lloyd's algorithm for k-means clustering.

The James-Stein paradox in estimation by James and Stein (1961)

Generalized Procrustes analysis by Gower (1975)

Universal approximation theorem by Cybenko (1989)

Compressed sensing paper by Candès, Romberg, and Tao (2006).

Scale Mixtures of Gaussians and the Statistics of Natural Images by Wainwright and Simoncelli (1999).

Probabilistic PCA by Tipping and Bishop (1999). Lightweight generative model.

Annealed importance sampling by Radford Neal (2001).

A computational approach to edge detection by John Canny (1986)

The Sparse PCA paper by Zou, Hastie, Tibshirani (2006).

The BBP transition by Baik, Ben Arous, and Péché (2004).

On spectral clustering by Ng, Jordan and Weiss (2001).

Hyvärinen's Score Matching paper in 2005.

The AlexNet paper by Krizhevsky, Sutskever, and Hinton (2012).

Exact Matrix Completion via Convex Optimization by Candès and Recht (2009). Matrix completion via nuclear-norm minimization.

Matrix completion from a few entries by Keshavan, Montanari and Oh (2009). Matrix completion from SVD thresholding.

Adaptive mixtures of experts by Jacobs et al. Introduces the famous MoE method.

Normalizing flows by Rezende and Mohamed (2015).

The Adam optimizer by Kingma and Ba (2014).

Invariant and equivariant graph networks by Maron et al. (2019). They compute the dimension of invariant and equivariant linear layers and study GNN expressivity.

The original paper introducing generative diffusion models, by Sohl-Dickstein et al. (2015)

The second paper on diffusion models, by Song et al. (2020)

The Stable Diffusion paper by Rombach et al. (2021)

The Neural ODE paper by Chen et al. (2018)

Attention is all you need by Vaswani et al. (2017). This paper changed the world.

Flow matching by Lipman et al. (2022)

The NTK paper by Jacot, Gabriel and Hongler (2018).

The data-driven Schrödinger bridge by Pavon, Tabak and Trigila (2021)

Density estimation by dual ascent of the log-likelihood by Tabak and Vanden-Eijnden (2010), first definition of coupling layers for normalizing flows.

Implicit self-regularization in deep networks by Martin and Mahoney (2021). On the spectra of DNN weight matrices during training, from a random matrix theory perspective.

Language models are few-shot learners by Brown et al. (2020), the GPT-3 paper, on scaling up LLMs

Edge of Stability paper by Cohen et al.

A U-turn on double descent by Curth et al.

The Wasserstein GAN paper by Arjovsky, Chintala and Bottou (2017)

Deep learning for symbolic mathematics by Lample and Charton (2019)

Emergence of scaling in random networks, the original paper by Barabási and Albert (1999)

Error in high-dimensional GLMs by Barbier et al. (2018)

Spectral algorithms for clustering by Nadakuditi and Newman, from an RMT perspective

An image is worth 16x16 words, the Vision Transformer paper by Dosovitskiy et al. (2020)

On Estimation of a Probability Density Function and Mode, the famous kernel density estimation paper by Parzen (1962)

Power laws, Pareto distributions and Zipf's law, the survey by Newman on heavy tails

Best subset or Lasso, by Hastie, Tibshirani and Tibshirani (2017)