Probabilistic Machine Learning: An Introduction

Kevin P. Murphy · MIT Press · 2022

Authors: Kevin P. Murphy (with chapter contributions from Zico Kolter, Frederik Kunstner, Si Yi Meng, Aaron Mishkin, Sharan Vaswani, Mark Schmidt, Mathieu Blondel, Krzysztof Choromanski, Colin Raffel, Bryan Perozzi, Sami Abu-El-Haija, Ines Chami)
Year: 2022 (first printing March 2022; online version updated April 2025)
Tags: probabilistic-machine-learning, bayesian-inference, deep-learning, supervised-learning, unsupervised-learning, textbook

TL;DR

A graduate-level textbook unifying classical probabilistic/Bayesian ML with modern deep learning under a single lens of probabilistic modeling and Bayesian decision theory, spanning 23 chapters from probability and optimization foundations through transformers, VAEs, and graph embeddings. It is the introductory volume of a two-volume series that supersedes Murphy's 2012 Machine Learning: A Probabilistic Perspective, incorporating the deep learning advances made since 2012. All code is in Python with Jupyter notebooks runnable in Google Colab.

First pass — the five C's

Category. Graduate textbook / comprehensive survey of the ML field as of ~2021.

Context. Direct successor to Murphy (2012) Machine Learning: A Probabilistic Perspective (DeGroot Prize 2013). Situated within the MIT Press Adaptive Computation and Machine Learning series alongside Koller & Friedman Probabilistic Graphical Models, Rasmussen & Williams Gaussian Processes for Machine Learning, and Schölkopf & Smola Learning with Kernels. Pairs with a sequel volume [Mur23] covering variational inference, generative models, and reinforcement learning. Cites the 2012 ImageNet result [KSH12] (Krizhevsky, Sutskever, Hinton) as the triggering event motivating the revision.

Correctness. Load-bearing assumption: probabilistic modeling is a sufficient unifying lens for nearly all of ML (attributed to Shakir Mohamed). This is a pedagogical choice, not a theorem; frequentist and algorithmic perspectives receive secondary treatment. Additional assumption: readers are implicitly expected to have calculus and linear algebra background, since formal prerequisites are not stated in the provided text. Both assumptions are broadly defensible for the intended audience but limit the book's universality.

Contributions.
  • Unified probabilistic/Bayesian treatment spanning foundations through modern deep learning (transformers, self-supervised learning, VAEs) in a single coherent framework.
  • Integration of background material (linear algebra Ch. 7, optimization Ch. 8) omitted from the 2012 book, making the volume more self-contained.
  • Full migration of code from MATLAB to Python (NumPy, Scikit-learn, JAX, PyTorch, TensorFlow, PyMC) with reproducible Jupyter notebooks linked from figure captions.
  • Advanced sections marked with * throughout, enabling instructors to scope an introductory course without restructuring.

Clarity. Writing is clear and pedagogically careful throughout Chapter 1, with running examples (Iris, polynomial regression) that reappear across topics; multi-author sections (e.g., optimization, graph embeddings) may vary in style, but this cannot be assessed from the provided excerpt alone.

Second pass — content

Main thrust: Treat all unknown quantities as random variables with probability distributions; use Bayesian decision theory to unify model fitting, regularization, uncertainty quantification, and evaluation across classification, regression, unsupervised learning, and beyond. The 23-chapter arc moves from mathematical foundations → linear models → deep networks → nonparametric models → semi-supervised/unsupervised methods.
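
This recipe is easy to make concrete. Below is a minimal sketch (not from the book's codebase) of the Bayes decision rule, choosing the action that minimizes posterior expected loss ρ(a|x) = Σ_y ℓ(y,a)·p(y|x); the loss values follow the asymmetric Virginica example listed under Supporting evidence below, while the posterior values and the orientation of the loss matrix are invented here for illustration.

```python
import numpy as np

# Hypothetical posterior over the three Iris classes for one test flower.
classes = ["setosa", "versicolor", "virginica"]
posterior = np.array([0.05, 0.55, 0.40])  # p(y | x); illustrative values only

# Loss matrix in the spirit of the book's example (costs 10 vs. 1): rows are
# true classes, columns are predicted classes. Interpretation assumed here:
# failing to flag (poisonous) virginica costs 10, any other error costs 1.
loss = np.array([
    [0,  1,  1],   # true setosa
    [1,  0,  1],   # true versicolor
    [10, 10, 0],   # true virginica
])

# Posterior expected loss of each action: rho(a | x) = sum_y loss[y, a] * p(y | x).
risk = posterior @ loss
print(dict(zip(classes, np.round(risk, 2))))  # {'setosa': 4.55, 'versicolor': 4.05, 'virginica': 0.6}
print("MAP label:   ", classes[int(np.argmax(posterior))])  # versicolor
print("Bayes action:", classes[int(np.argmin(risk))])       # virginica
```

Under zero-one loss the Bayes action reduces to the MAP label; the asymmetric loss flips the decision, which is exactly the point of the book's example.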

Supporting evidence:
  • Iris dataset (N=150, D=4 features, C=3 classes, 50 examples per class) used as the primary running classification example; a depth-2 decision tree on petal length/width is shown to separate the classes.
  • Polynomial regression demonstration: the degree-2 fit is visually adequate, degree-14 overfits, and degree-20 achieves 0 training MSE on N=21 points; the MSE-vs-degree curve shows train/test divergence (Figure 1.7d).
  • NLL ∝ MSE derivation for Gaussian regression: NLL(θ) = (1/2σ²)·MSE(θ) + const, establishing MLE ≡ least squares under the Gaussian noise assumption.
  • Softmax formulation p(y=c|x;θ) = softmax_c(f(x;θ)) given explicitly, connecting logits to probability distributions for multi-class classification.
  • Asymmetric loss matrix example (Virginica is poisonous: misclassification cost 10 vs. 1) motivates empirical risk minimization beyond zero-one loss.
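
The overfitting demonstration and the NLL-MSE identity are both easy to reproduce. A minimal sketch follows, assuming synthetic stand-in data (quadratic signal plus Gaussian noise, N=21), since the book's exact dataset is not in the excerpt; `f_true`, `sigma`, and the input rescaling are choices made here, not the book's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the book's N=21 demo: quadratic signal + Gaussian noise.
N, sigma = 21, 2.0
f_true = lambda x: 0.1 * (x - 10.0) ** 2
x_train = np.linspace(0.0, 20.0, N)
y_train = f_true(x_train) + rng.normal(0.0, sigma, N)
x_test = np.linspace(0.5, 19.5, 200)
y_test = f_true(x_test) + rng.normal(0.0, sigma, 200)

def train_test_mse(degree):
    # Rescale inputs to [-1, 1] so the high-degree Vandermonde fit stays well-conditioned.
    s = lambda x: (x - 10.0) / 10.0
    coeffs = np.polyfit(s(x_train), y_train, degree)  # least squares == Gaussian MLE
    err = lambda x, y: float(np.mean((np.polyval(coeffs, s(x)) - y) ** 2))
    return err(x_train, y_train), err(x_test, y_test)

for degree in (1, 2, 14, 20):
    tr, te = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE = {tr:10.4f}, test MSE = {te:10.4f}")
# Expected pattern (cf. Fig. 1.7d): degree 2 is adequate on both sets, while
# degree 20 interpolates all 21 training points (train MSE ~ 0) and test MSE diverges.

# NLL-MSE link: with fixed noise variance, the per-example Gaussian NLL is an
# affine function of the training MSE, so minimizing one minimizes the other.
tr2, _ = train_test_mse(2)
print("per-example NLL at degree 2:",
      tr2 / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2))
```

Because the NLL differs from the MSE only by a positive scale and an additive constant, the degree that minimizes MSE also minimizes NLL for fixed σ².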

Figures & tables: Chapter 1 figures include:
  • Fig. 1.3: pairwise scatter plot of Iris data color-coded by class (axes labeled with feature names and units in cm).
  • Fig. 1.4: decision tree plus 2D decision surface (node counts shown).
  • Fig. 1.5: linear regression with residuals (axes unlabeled for units).
  • Fig. 1.6: 2D temperature surface fits (vertical axis is temperature; horizontal axes are room coordinates in unspecified units).
  • Fig. 1.7d: polynomial degree vs. MSE (train and test curves shown; no error bars or confidence intervals).
No statistical significance reporting is present in Chapter 1; this is expected for illustrative textbook figures, but it means readers cannot assess the sampling variability of the demonstrations. Figures link to named Jupyter notebooks (.ipynb) for reproduction.

Follow-up references:
  • Murphy [Mur23], Probabilistic Machine Learning: Advanced Topics. The sequel, covering variational inference, generative models, and RL.
  • Krizhevsky, Sutskever, and Hinton [KSH12]. The ImageNet result cited as the empirical starting point of the deep learning era.
  • Géron [Gér19]. Cited as the source for the decision-tree figure adaptation; a practical complement to this book.
  • Koller & Friedman, Probabilistic Graphical Models. Listed in the same MIT Press series; a deeper treatment of the PGMs covered lightly in Ch. 3.

Third pass — critique

Implicit assumptions:
  • The probabilistic lens is claimed to be (nearly) universal; this marginalizes PAC-learning / computational learning theory perspectives that do not reduce neatly to probabilistic models. If a reader's domain relies on worst-case guarantees, this framing breaks.
  • No formal prerequisite list is stated in the provided text; the book implicitly assumes multivariable calculus, introductory linear algebra, and programming fluency. Readers lacking these will be lost before Chapter 7 (Linear Algebra) despite its inclusion.
  • The Python library ecosystem (JAX, PyTorch, TensorFlow, PyMC) is assumed stable; rapid API churn in these libraries is a real reproducibility risk for the code-linked figures.
  • The "deep learning revolution started in 2012" framing (based on ImageNet [KSH12]) is a simplification that omits parallel advances in speech (cited as [Cir+10; Cir+11; Hin+12]) and may misattribute the tipping point.

Missing context or citations:
  • Causal inference receives only a dismissive note ("correlation does not imply causation," §3.1.4) with no substantive treatment, despite being closely related to the probabilistic framework; Pearl's do-calculus and Spirtes et al. (listed in the series) are not engaged with in the provided content.
  • Reinforcement learning is introduced in §1.4 but has no dedicated chapter in this volume (it is deferred to the advanced volume); readers expecting RL coverage based on the introduction will be disappointed.
  • Fairness, accountability, and the societal impact of ML are mentioned only under "Caveats" (§1.6.3), which is not provided in the excerpt; given the book's 2022 publication date, this coverage appears minimal based on the TOC.
  • There is no chapter on causal generative models or structural causal models, despite the probabilistic graphical models material (§3.6).

Possible experimental / analytical issues:
  • As a textbook, there are no novel experimental results to critique; the illustrative figures use small toy datasets (N=21 for polynomial regression, N=150 for Iris) whose conclusions are pedagogically appropriate but cannot be generalized.
  • The MSE-vs-degree plot (Fig. 1.7d) shows no confidence intervals or multiple random seeds; it is a single illustrative run, which is standard for textbooks but could mislead readers about the reliability of the bias-variance tradeoff visualization.
  • Multi-author sections (optimization, graph embeddings, transformers) are acknowledged in the preface, but no mechanism is described for ensuring notational or conceptual consistency across contributors; inconsistencies may exist but cannot be verified from the provided excerpt.
  • The online version (April 2025) post-dates the first printing (March 2022) by three years; the changelog is not included in the excerpt, so it is unknown which sections have been substantively revised versus merely corrected for typos.

Ideas for future work:
  • Add a dedicated chapter on causal inference (do-calculus, counterfactuals) to make the probabilistic framework applicable to intervention, not just prediction.
  • Include a formal prerequisites section and a self-assessment diagnostic so readers can gauge readiness before Chapter 1.
  • Given the April 2025 online update, a systematic comparison of the LLM coverage in §15.7 against post-GPT-4 developments (instruction tuning, RLHF, chain-of-thought) would update the most rapidly changing section of the book.
  • A companion chapter on ML fairness and uncertainty quantification for high-stakes decisions would address the gap hinted at in §1.6.3 ("Caveats") without requiring restructuring of the existing content.

Methods

  • maximum likelihood estimation
  • Bayesian inference
  • logistic regression
  • linear regression
  • neural networks
  • convolutional neural networks
  • recurrent neural networks
  • transformers
  • support vector machines
  • Gaussian processes
  • variational autoencoders
  • principal component analysis
  • stochastic gradient descent
  • EM algorithm
  • decision trees
  • random forests
  • boosting
  • k-means clustering

Datasets

  • Iris dataset
  • ImageNet
  • MNIST

Claims

  • Almost all of machine learning can be viewed in probabilistic terms, making probabilistic thinking a unifying framework connecting ML to other computational sciences.
  • A probabilistic approach to ML is optimal for decision making under uncertainty and provides a principled way to represent epistemic and aleatoric uncertainty (see the sketch after this list).
  • Deep neural networks achieved breakthrough performance on tasks such as image classification starting in 2012, catalyzing a revolution in ML.
  • Empirical risk minimization on the training set must be balanced with generalization to unseen data to avoid overfitting.
  • Bayesian methods provide a coherent framework for model selection, regularization, and uncertainty quantification beyond point estimation.
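
To ground the epistemic/aleatoric claim above, here is a minimal sketch assuming the standard conjugate Bayesian linear regression model, not code from the book; `alpha`, `sigma2`, and the toy data are illustrative choices. The posterior predictive variance splits into a parameter-uncertainty term, which shrinks with more data, and the irreducible noise variance, which does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Conjugate Bayesian linear regression with known noise variance sigma2 and
# prior w ~ N(0, alpha^-1 I). Posterior: N(m_N, S_N) with
#   S_N = (alpha*I + X^T X / sigma2)^-1,   m_N = S_N X^T y / sigma2.
alpha, sigma2 = 1.0, 0.25
X = np.column_stack([np.ones(10), rng.uniform(-1.0, 1.0, 10)])  # bias + one input
w_true = np.array([0.3, 1.5])
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), 10)

S_N = np.linalg.inv(alpha * np.eye(2) + X.T @ X / sigma2)
m_N = (S_N @ X.T @ y) / sigma2

# Predictive variance at a query point x*: var = x*^T S_N x* + sigma2.
x_star = np.array([1.0, 2.0])             # extrapolation point, outside the data range
epistemic = float(x_star @ S_N @ x_star)  # parameter uncertainty: shrinks with N
aleatoric = sigma2                        # observation noise: irreducible
print(f"predictive mean = {x_star @ m_N:.3f}, "
      f"epistemic var = {epistemic:.3f}, aleatoric var = {aleatoric:.3f}")
```

Doubling the dataset shrinks only the epistemic term; reducing the aleatoric term requires a less noisy measurement process, which is exactly the distinction the claim draws.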