Probabilistic Machine Learning: An Introduction

Kevin P. Murphy · MIT Press · 2022

Authors: Kevin P. Murphy (with chapter contributions from Zico Kolter, Frederik Kunstner, Si Yi Meng, Aaron Mishkin, Sharan Vaswani, Mark Schmidt, Mathieu Blondel, Krzysztof Choromanski, Colin Raffel, Bryan Perozzi, Sami Abu-El-Haija, Ines Chami)
Year: 2022 (first printing March 2022; online version updated April 2025)
Tags: probabilistic-machine-learning, bayesian-inference, deep-learning, supervised-learning, unsupervised-learning, textbook

TL;DR

A graduate-level textbook unifying classical probabilistic/Bayesian ML with modern deep learning under a single lens of probabilistic modeling and Bayesian decision theory, spanning 23 chapters from probability and optimization foundations through transformers, VAEs, and graph embeddings. It is the introductory volume of a two-volume series that supersedes Murphy's 2012 Machine Learning: A Probabilistic Perspective, incorporating the deep learning advances made since 2012. All code is in Python with Jupyter notebooks runnable in Google Colab.

First pass — the five C's

Category. Graduate textbook / comprehensive survey of the ML field as of ~2021.

Context. Direct successor to Murphy (2012) Machine Learning: A Probabilistic Perspective (DeGroot Prize 2013). Situated within the MIT Press Adaptive Computation and Machine Learning series alongside Koller & Friedman Probabilistic Graphical Models, Rasmussen & Williams Gaussian Processes for Machine Learning, and Schölkopf & Smola Learning with Kernels. Pairs with a sequel volume [Mur23] covering variational inference, generative models, and reinforcement learning. Cites the 2012 ImageNet result [KSH12] (Krizhevsky, Sutskever, Hinton) as the triggering event motivating the revision.

Correctness. Load-bearing assumption: probabilistic modeling is a sufficient unifying lens for nearly all of ML (attributed to Shakir Mohamed). This is a pedagogical choice, not a theorem; frequentist and algorithmic perspectives receive secondary treatment. Additional assumption: readers are implicitly expected to have calculus and linear algebra background, since formal prerequisites are not stated in the provided text. Both assumptions are broadly defensible for the intended audience but limit the book's universality.

Contributions.
  • Unified probabilistic/Bayesian treatment spanning foundations through modern deep learning (transformers, self-supervised learning, VAEs) in a single coherent framework.
  • Integration of background material (linear algebra Ch. 7, optimization Ch. 8) omitted from the 2012 book, making the volume more self-contained.
  • Full migration of code from MATLAB to Python (NumPy, Scikit-learn, JAX, PyTorch, TensorFlow, PyMC) with reproducible Jupyter notebooks linked from figure captions.
  • Advanced sections marked with * throughout, enabling instructors to scope an introductory course without restructuring.

Clarity. Writing is clear and pedagogically careful throughout Chapter 1, with running examples (Iris, polynomial regression) that reappear across topics; multi-author sections (e.g., optimization, graph embeddings) may vary in style, but this cannot be assessed from the provided excerpt alone.

Second pass — content

Main thrust: Treat all unknown quantities as random variables with probability distributions; use Bayesian decision theory to unify model fitting, regularization, uncertainty quantification, and evaluation across classification, regression, unsupervised learning, and beyond. The 23-chapter arc moves from mathematical foundations → linear models → deep networks → nonparametric models → semi-supervised/unsupervised methods.
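
This recipe is easy to make concrete. Below is a minimal sketch (not from the book's codebase) of the Bayes decision rule, choosing the action that minimizes posterior expected loss ρ(a|x) = Σ_y ℓ(y,a)·p(y|x); the loss values follow the asymmetric Virginica example listed under Supporting evidence below, while the posterior values and the orientation of the loss matrix are invented here for illustration.

```python
import numpy as np

# Hypothetical posterior over the three Iris classes for one test flower.
classes = ["setosa", "versicolor", "virginica"]
posterior = np.array([0.05, 0.55, 0.40])  # p(y | x); illustrative values only

# Loss matrix in the spirit of the book's example (costs 10 vs. 1): rows are
# true classes, columns are predicted classes. Interpretation assumed here:
# failing to flag (poisonous) virginica costs 10, any other error costs 1.
loss = np.array([
    [0,  1,  1],   # true setosa
    [1,  0,  1],   # true versicolor
    [10, 10, 0],   # true virginica
])

# Posterior expected loss of each action: rho(a | x) = sum_y loss[y, a] * p(y | x).
risk = posterior @ loss
print(dict(zip(classes, np.round(risk, 2))))  # {'setosa': 4.55, 'versicolor': 4.05, 'virginica': 0.6}
print("MAP label:   ", classes[int(np.argmax(posterior))])  # versicolor
print("Bayes action:", classes[int(np.argmin(risk))])       # virginica
```

Under zero-one loss the Bayes action reduces to the MAP label; the asymmetric loss flips the decision, which is exactly the point of the book's example.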

Supporting evidence:
  • Iris dataset (N=150, D=4 features, C=3 classes, 50 examples per class) used as the primary running classification example; a depth-2 decision tree on petal length/width is shown to separate the classes.
  • Polynomial regression demonstration: the degree-2 fit is visually adequate, degree-14 overfits, and degree-20 achieves 0 training MSE on N=21 points; the MSE-vs-degree curve shows train/test divergence (Figure 1.7d).
  • NLL ∝ MSE derivation for Gaussian regression: NLL(θ) = (1/2σ²)·MSE(θ) + const, establishing MLE ≡ least squares under the Gaussian noise assumption.
  • Softmax formulation p(y=c|x;θ) = softmax_c(f(x;θ)) given explicitly, connecting logits to probability distributions for multi-class classification.
  • Asymmetric loss matrix example (Virginica is poisonous: misclassification cost 10 vs. 1) motivates empirical risk minimization beyond zero-one loss.
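
The overfitting demonstration and the NLL-MSE identity are both easy to reproduce. A minimal sketch follows, assuming synthetic stand-in data (quadratic signal plus Gaussian noise, N=21), since the book's exact dataset is not in the excerpt; `f_true`, `sigma`, and the input rescaling are choices made here, not the book's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the book's N=21 demo: quadratic signal + Gaussian noise.
N, sigma = 21, 2.0
f_true = lambda x: 0.1 * (x - 10.0) ** 2
x_train = np.linspace(0.0, 20.0, N)
y_train = f_true(x_train) + rng.normal(0.0, sigma, N)
x_test = np.linspace(0.5, 19.5, 200)
y_test = f_true(x_test) + rng.normal(0.0, sigma, 200)

def train_test_mse(degree):
    # Rescale inputs to [-1, 1] so the high-degree Vandermonde fit stays well-conditioned.
    s = lambda x: (x - 10.0) / 10.0
    coeffs = np.polyfit(s(x_train), y_train, degree)  # least squares == Gaussian MLE
    err = lambda x, y: float(np.mean((np.polyval(coeffs, s(x)) - y) ** 2))
    return err(x_train, y_train), err(x_test, y_test)

for degree in (1, 2, 14, 20):
    tr, te = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE = {tr:10.4f}, test MSE = {te:10.4f}")
# Expected pattern (cf. Fig. 1.7d): degree 2 is adequate on both sets, while
# degree 20 interpolates all 21 training points (train MSE ~ 0) and test MSE diverges.

# NLL-MSE link: with fixed noise variance, the per-example Gaussian NLL is an
# affine function of the training MSE, so minimizing one minimizes the other.
tr2, _ = train_test_mse(2)
print("per-example NLL at degree 2:",
      tr2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2))
```

Because the NLL differs from the MSE only by a positive scale and an additive constant, the degree that minimizes MSE also minimizes NLL for fixed σ².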

Figures & tables: Chapter 1 figures include:
  • Fig. 1.3: pairwise scatter plot of Iris data color-coded by class (axes labeled with feature names and units in cm).
  • Fig. 1.4: decision tree plus 2D decision surface (node counts shown).
  • Fig. 1.5: linear regression with residuals (axes unlabeled for units).
  • Fig. 1.6: 2D temperature surface fits (vertical axis is temperature; horizontal axes are room coordinates in unspecified units).
  • Fig. 1.7d: polynomial degree vs. MSE (train and test curves shown; no error bars or confidence intervals).
No statistical significance reporting is present in Chapter 1; this is expected for illustrative textbook figures, but it means readers cannot assess the sampling variability of the demonstrations. Figures link to named Jupyter notebooks (.ipynb) for reproduction.

Follow-up references:
  • Murphy [Mur23], Probabilistic Machine Learning: Advanced Topics. The sequel, covering variational inference, generative models, and RL.
  • Krizhevsky, Sutskever, and Hinton [KSH12]. The ImageNet result cited as the empirical starting point of the deep learning era.
  • Géron [Gér19]. Cited as the source for the decision-tree figure adaptation; a practical complement to this book.
  • Koller & Friedman, Probabilistic Graphical Models. Listed in the same MIT Press series; a deeper treatment of the PGMs covered lightly in Ch. 3.

Third pass — critique

Implicit assumptions:
  • The probabilistic lens is claimed to be (nearly) universal; this marginalizes PAC-learning / computational learning theory perspectives that do not reduce neatly to probabilistic models. If a reader's domain relies on worst-case guarantees, this framing breaks.
  • No formal prerequisite list is stated in the provided text; the book implicitly assumes multivariable calculus, introductory linear algebra, and programming fluency. Readers lacking these will be lost before Chapter 7 (Linear Algebra) despite its inclusion.
  • The Python library ecosystem (JAX, PyTorch, TensorFlow, PyMC) is assumed stable; rapid API churn in these libraries is a real reproducibility risk for the code-linked figures.
  • The "deep learning revolution started in 2012" framing (based on ImageNet [KSH12]) is a simplification that omits parallel advances in speech (cited as [Cir+10; Cir+11; Hin+12]) and may misattribute the tipping point.

Missing context or citations:
  • Causal inference receives only a dismissive note ("correlation does not imply causation," §3.1.4) with no substantive treatment, despite being closely related to the probabilistic framework; Pearl's do-calculus and Spirtes et al. (listed in the series) are not engaged with in the provided content.
  • Reinforcement learning is introduced in §1.4 but has no dedicated chapter in this volume (it is deferred to the advanced volume); readers expecting RL coverage based on the introduction will be disappointed.
  • Fairness, accountability, and the societal impact of ML are mentioned only under "Caveats" (§1.6.3), which is not provided in the excerpt; given the book's 2022 publication date, this coverage appears minimal based on the TOC.
  • There is no chapter on causal generative models or structural causal models, despite the probabilistic graphical models material (§3.6).

Possible experimental / analytical issues:
  • As a textbook, there are no novel experimental results to critique; the illustrative figures use small toy datasets (N=21 for polynomial regression, N=150 for Iris) whose conclusions are pedagogically appropriate but cannot be generalized.
  • The MSE-vs-degree plot (Fig. 1.7d) shows no confidence intervals or multiple random seeds; it is a single illustrative run, which is standard for textbooks but could mislead readers about the reliability of the bias-variance tradeoff visualization.
  • Multi-author sections (optimization, graph embeddings, transformers) are acknowledged in the preface, but no mechanism is described for ensuring notational or conceptual consistency across contributors; inconsistencies may exist but cannot be verified from the provided excerpt.
  • The online version (April 2025) post-dates the first printing (March 2022) by three years; the changelog is not included in the excerpt, so it is unknown which sections have been substantively revised versus merely corrected for typos.

Ideas for future work:
  • Add a dedicated chapter on causal inference (do-calculus, counterfactuals) to make the probabilistic framework applicable to intervention, not just prediction.
  • Include a formal prerequisites section and a self-assessment diagnostic so readers can gauge readiness before Chapter 1.
  • Given the April 2025 online update, a systematic comparison of the LLM coverage in §15.7 against post-GPT-4 developments (instruction tuning, RLHF, chain-of-thought) would update the most rapidly changing section of the book.
  • A companion chapter on ML fairness and uncertainty quantification for high-stakes decisions would address the gap hinted at in §1.6.3 ("Caveats") without requiring restructuring of the existing content.

Methods

  • maximum likelihood estimation
  • Bayesian inference
  • logistic regression
  • linear regression
  • neural networks
  • convolutional neural networks
  • recurrent neural networks
  • transformers
  • support vector machines
  • Gaussian processes
  • variational autoencoders
  • principal component analysis
  • stochastic gradient descent
  • EM algorithm
  • decision trees
  • random forests
  • boosting
  • k-means clustering

Datasets

  • Iris dataset
  • ImageNet
  • MNIST

Claims

  • Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking a unifying framework connecting ML to other computational sciences.
  • A probabilistic approach to ML is optimal for decision making under uncertainty and provides a principled way to represent epistemic and aleatoric uncertainty (see the sketch after this list).
  • Deep neural networks achieved breakthrough performance on tasks such as image classification starting in 2012, catalyzing a revolution in ML.
  • Empirical risk minimization on the training set must be balanced with generalization to unseen data to avoid overfitting.
  • Bayesian methods provide a coherent framework for model selection, regularization, and uncertainty quantification beyond point estimation.
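
To ground the epistemic/aleatoric claim above, here is a minimal sketch assuming the standard conjugate Bayesian linear regression model, not code from the book; `alpha`, `sigma2`, and the toy data are illustrative choices. The posterior predictive variance splits into a parameter-uncertainty term, which shrinks with more data, and the irreducible noise variance, which does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Conjugate Bayesian linear regression with known noise variance sigma2 and
# prior w ~ N(0, alpha^-1 I). Posterior: N(m_N, S_N) with
#   S_N = (alpha*I + X^T X / sigma2)^-1,   m_N = S_N X^T y / sigma2.
alpha, sigma2 = 1.0, 0.25
X = np.column_stack([np.ones(10), rng.uniform(-1.0, 1.0, 10)])  # bias + one input
w_true = np.array([0.3, 1.5])
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), 10)

S_N = np.linalg.inv(alpha * np.eye(2) + X.T @ X / sigma2)
m_N = (S_N @ X.T @ y) / sigma2

# Predictive variance at a query point x*: var = x*^T S_N x* + sigma2.
x_star = np.array([1.0, 2.0])             # extrapolation point, outside the data range
epistemic = float(x_star @ S_N @ x_star)  # parameter uncertainty: shrinks with N
aleatoric = sigma2                        # observation noise: irreducible
print(f"predictive mean = {x_star @ m_N:.3f}, "
      f"epistemic var = {epistemic:.3f}, aleatoric var = {aleatoric:.3f}")
```

Doubling the dataset shrinks only the epistemic term; reducing the aleatoric term requires a less noisy measurement process, which is exactly the distinction the claim draws.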