Learning Theory from First Principles
Authors: Francis Bach
Year: 2024
Tags: statistical-learning-theory, generalization-bounds, empirical-risk-minimization, kernel-methods, neural-networks, optimization
TL;DR
A graduate-level textbook deriving mathematical guarantees for standard supervised learning algorithms (linear models, kernels, neural networks, ensembles) from first principles, organized around the approximation/estimation/optimization error trichotomy. It deliberately avoids VC dimension in favor of Rademacher complexities and prioritizes the simplest correct proofs over the tightest known results.
First pass — the five C's
Category. Textbook / pedagogical reference. Not a research paper; no novel theorems are claimed.
Context. Statistical learning theory. Explicitly situates itself relative to, and defers to: Mohri, Rostamizadeh, and Talwalkar (2018) for broader coverage; Shalev-Shwartz and Ben-David (2014) as an alternative text; Koltchinskii (2011) and Christmann and Steinwart (2008) for depth; Boucheron, Lugosi, and Massart (2013) for concentration inequalities. The Rakhlin and Sridharan lecture notes are cited as complementary.
Correctness. The central load-bearing assumption is i.i.d. data throughout; the preface states this explicitly and does not claim the theory extends beyond it. The choice to gloss over measurability issues is acknowledged. The pedagogical claim — that simplest proofs capture essential concepts — is a framing choice, not a falsifiable assumption.
Contributions.
- Unified first-principles treatment of learning theory structured around the approximation/estimation/optimization error decomposition, applied consistently across linear models, kernels, neural networks, and ensembles.
- Rademacher complexity as the sole generic complexity measure, replacing VC dimension throughout, motivated by the focus on real-valued predictors.
- Coverage of modern topics (overparameterized models, double descent, implicit bias of gradient descent, neural tangent kernels, structured prediction) within the same notational framework.
- Public MATLAB and Python code for all synthetic experiments at https://www.di.ens.fr/~fbach/ltfp/.
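Since the notes single out Rademacher complexity as the book's one generic complexity measure, a minimal sketch of the empirical quantity for a norm-bounded linear class is included below. This is illustrative only; the function name, sample, and the norm bound B are my assumptions, not taken from the book's code.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the class
    {x -> <w, x> : ||w||_2 <= B} on the fixed sample X (shape n x d).

    For this class the supremum over w has a closed form:
        sup_{||w||<=B} (1/n) sum_i sigma_i <w, x_i> = (B/n) * ||sum_i sigma_i x_i||_2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # i.i.d. Rademacher signs
    sups = (B / n) * np.linalg.norm(sigma @ X, axis=1)   # closed-form supremum, one per sign draw
    return sups.mean()                                   # average over sign draws

# Compare against the standard upper bound (B/n) * sqrt(sum_i ||x_i||^2).
X = np.random.default_rng(1).normal(size=(200, 5))
estimate = empirical_rademacher_linear(X, B=1.0)
upper = np.sqrt((X ** 2).sum()) / X.shape[0]
print(f"Monte Carlo estimate: {estimate:.4f}   upper bound: {upper:.4f}")
```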
Clarity. The preface is unusually transparent about scope and omissions; Chapter 1 proofs are self-contained and complete (Hoeffding, McDiarmid proofs written out in full). Diamond notation (♦, ♦♦) for advanced sections is a useful pedagogical device.
Second pass — content
Main thrust: for every algorithm analyzed in the book, the excess risk is decomposed into approximation error (model expressiveness), estimation error (finite-sample deviation, bounded via Rademacher complexity), and optimization error (convergence of gradient-based solvers); the book then derives non-asymptotic bounds on each component for each major algorithm class.
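The three-term split behind this trichotomy is worth writing out. The notation below is my shorthand (risk R, Bayes predictor f*, model class F), not necessarily the book's symbols.

```latex
% \mathcal{R} is the expected risk, f^* the Bayes-optimal predictor, \mathcal{F} the model
% class, f_{\mathcal{F}} the best predictor in \mathcal{F}, \hat{f}_{\mathcal{F}} the
% empirical risk minimizer over \mathcal{F}, and \hat{f} the predictor actually returned
% by the (possibly inexact) optimizer.
\mathcal{R}(\hat{f}) - \mathcal{R}(f^*)
  = \underbrace{\mathcal{R}(\hat{f}) - \mathcal{R}(\hat{f}_{\mathcal{F}})}_{\text{optimization error}}
  + \underbrace{\mathcal{R}(\hat{f}_{\mathcal{F}}) - \mathcal{R}(f_{\mathcal{F}})}_{\text{estimation error}}
  + \underbrace{\mathcal{R}(f_{\mathcal{F}}) - \mathcal{R}(f^*)}_{\text{approximation error}}
```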
Supporting evidence:
- Hoeffding's inequality (Prop. 1.2): for Zᵢ ∈ [0,1] i.i.d., P(|empirical mean − true mean| ≥ t) ≤ 2exp(−2nt²); proof given in full via the cumulant generating function bound ϕ(s) ≤ s²/8.
- McDiarmid's inequality (Prop. 1.3): for functions with bounded differences c, P(|f − E[f]| ≥ t) ≤ 2exp(−2t²/(nc²)); proof given in full via a martingale difference decomposition.
- Sub-Gaussian characterization: [a,b]-bounded variables are sub-Gaussian with constant τ² = (b−a)²/4; Gaussian variables with variance σ² are sub-Gaussian with constant σ².
- Bernstein's inequality (§1.2.3), matrix concentration (§1.2.6), and expectation-of-maximum bounds (§1.2.4) are stated for use in later chapters.
- From the table of contents (not provided in the body text): double descent analyzed for linear regression with Gaussian inputs (§12.2.3–12.2.4); minimax lower bounds derived via hypothesis-testing reductions and information theory (§15.1); kernel ridge regression bias-variance balance (§7.6.6).
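A quick Monte Carlo sanity check of the Hoeffding bound is easy to run; the distribution, sample size, and threshold below are arbitrary illustrative choices, not the book's experiments.

```python
import numpy as np

# Monte Carlo check of Hoeffding's inequality (Prop. 1.2) for Z_i ~ Uniform[0, 1], so that
# P(|empirical mean - 1/2| >= t) <= 2 * exp(-2 * n * t^2). Parameters are illustrative.
rng = np.random.default_rng(0)
n, t, n_trials = 100, 0.1, 100_000

Z = rng.uniform(0.0, 1.0, size=(n_trials, n))
deviations = np.abs(Z.mean(axis=1) - 0.5)

empirical_tail = (deviations >= t).mean()
hoeffding_bound = 2 * np.exp(-2 * n * t ** 2)
print(f"empirical tail: {empirical_tail:.5f}   Hoeffding bound: {hoeffding_bound:.5f}")
# The empirical tail should sit well below the bound, which is loose but distribution-free.
```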
Figures & tables: The provided text (preface + Chapter 1) contains no figures or tables. The preface states synthetic experiment figures exist in online code repositories and within chapter experiment sections (§3.5.2, §7.7, §8.4, §9.6, §10.3.7, §13.7). Axis labeling, error bars, and statistical significance reporting in those figures cannot be assessed from the provided text.
Follow-up references:
- Mohri, Rostamizadeh, and Talwalkar (2018): broader and deeper competing textbook, recommended by the author for topics not covered here.
- Boucheron, Lugosi, and Massart (2013): deeper treatment of the concentration inequalities used throughout Chapter 1.
- Christmann and Steinwart (2008): rigorous measurability treatment and depth on kernel methods.
- Vershynin (2018): high-dimensional probability, covering random matrix concentration (the topic of §1.2.6).
Third pass — critique
Implicit assumptions:
- I.i.d. data is assumed throughout and never relaxed; the results do not cover time series, Markov chains, or distribution shift, any of which would break nearly all of the bounds.
- "First principles" presupposes linear algebra, probability, and differential calculus; the bar is not low for undergraduates without prior exposure.
- The focus on real-valued predictors is a design choice that sidelines discrete combinatorial output spaces and compresses structured prediction into one chapter.
- The simplest-proof preference means stated constants are often suboptimal; the book does not systematically quantify how loose its bounds are relative to the state of the art.
Missing context or citations:
- VC dimension is deliberately omitted with no quantitative comparison of when VC bounds are tighter than Rademacher bounds (a non-trivial question).
- Information-theoretic generalization bounds (mutual-information-based, e.g., Russo and Zou, 2016) are not mentioned in the preface, though PAC-Bayes appears in Chapter 14.
- No coverage of distribution shift, domain adaptation, or robustness; these gaps are not acknowledged in the preface.
- Online learning and bandits are explicitly described as one-chapter introductions, but no pointer is given to the canonical bandit textbook literature (e.g., Lattimore and Szepesvári).
Possible experimental / analytical issues:
- All experiments are stated to be synthetic; no real-world benchmarks appear, so the gap between theoretical rates and practical performance is never demonstrated.
- Exercises are embedded without solutions, making self-study verification difficult.
- Measurability issues are explicitly glossed over; in kernel Hilbert space arguments and infinite-dimensional settings this is non-trivial and could hide subtle errors.
- The double descent and overparameterized-model analysis (Chapter 12) is presented in the ToC as covering Gaussian input models and mean-field limits; both are idealized settings whose applicability to realistic deep learning is not established within the book's framework.
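For what the idealized Gaussian-input analysis looks like in practice, a minimal double-descent simulation for minimum-norm least squares is sketched below; the dimensions, noise level, and misspecification setup are my own illustrative choices, not the book's experiments.

```python
import numpy as np

# Minimal double-descent simulation in an idealized Gaussian-input setting:
# y = <x, w*> + noise with x ~ N(0, I_D), but the learner only uses the first d coordinates
# and fits the minimum-l2-norm least-squares solution. The test error typically peaks near
# the interpolation threshold d = n and decreases again for d > n.
rng = np.random.default_rng(0)
n, D, noise = 40, 400, 0.5
w_star = rng.normal(size=D) / np.sqrt(D)   # true coefficients spread over all D features

def min_norm_test_error(d, n_test=2000):
    X = rng.normal(size=(n, D))
    y = X @ w_star + noise * rng.normal(size=n)
    w_hat = np.linalg.pinv(X[:, :d]) @ y              # min-norm solution on the first d features
    X_te = rng.normal(size=(n_test, D))
    y_te = X_te @ w_star + noise * rng.normal(size=n_test)
    return np.mean((X_te[:, :d] @ w_hat - y_te) ** 2)

for d in [5, 20, 35, 40, 45, 80, 200, 400]:
    mse = np.mean([min_norm_test_error(d) for _ in range(20)])
    print(f"d = {d:3d}   average test MSE ~ {mse:.3f}")
```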
Ideas for future work:
- Extend the approximation/estimation/optimization decomposition to non-i.i.d. settings (mixing processes, federated data) to test whether the trichotomy remains informative.
- Add a systematic table comparing the tightness of the book's simplified bounds against minimax-optimal rates for each algorithm class, so readers can calibrate how much is lost by choosing simple proofs.
- Incorporate information-theoretic generalization bounds (mutual information, CMI) as a fourth framework alongside Rademacher complexity, PAC-Bayes, and algorithmic stability to unify the modern landscape.
- Provide exercise solutions (even partial) or autograded problem sets to support the stated goal of self-study use.
Methods
- empirical risk minimization
- Rademacher complexity
- concentration inequalities
- gradient descent
- stochastic gradient descent
- ridge regression
- kernel ridge regression
- PAC-Bayesian analysis
- Lasso / l1-regularization
- boosting
- online mirror descent
- neural tangent kernels
Claims
- Learning theory can be developed from first principles using estimation, approximation, and optimization error decompositions to explain overfitting and underfitting phenomena.
- Rademacher complexities provide generic generalization bounds for real-valued prediction functions as a preferred alternative to VC dimension.
- Gradient-based optimization methods are central to modern machine learning and their convergence properties are tightly linked to generalization guarantees.
- Overparameterized models exhibit double-descent behavior and implicit bias of gradient descent toward minimum-norm solutions.
- Kernel methods and single-hidden-layer neural networks can be analyzed within a unified framework relating approximation error to function-space norms.
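To make the last claim concrete, a minimal kernel ridge regression sketch with a Gaussian kernel is included below; the kernel bandwidth, regularization strength, and toy data are illustrative assumptions, not the book's code or recommended values.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def krr_fit_predict(X, y, X_test, lam, bandwidth):
    """Kernel ridge regression: solve (K + lam * n * I) alpha = y, then predict via the
    representer theorem, f(x) = sum_i alpha_i k(x, x_i)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return gaussian_kernel(X_test, X, bandwidth) @ alpha

# Toy one-dimensional regression problem (illustrative values only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=100)
X_test = np.linspace(-1.0, 1.0, 200)[:, None]

pred = krr_fit_predict(X, y, X_test, lam=1e-3, bandwidth=0.3)
print("test MSE against the noiseless target:",
      np.mean((pred - np.sin(3 * X_test[:, 0])) ** 2))
```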