Probabilistic Machine Learning: Advanced Topics (Supplementary Material)

Kevin Murphy · 2025


Authors: Kevin Murphy · Year: 2025 · Tags: probabilistic-machine-learning, bayesian-inference, graphical-models, variational-inference, concept-learning, textbook-reference

TL;DR

Extended appendix material for Murphy's Probabilistic Machine Learning: Advanced Topics [Mur23], providing worked derivations, additional examples, and deeper treatments of topics omitted from the main text due to space. Spans fundamentals through decision-making across 36 chapters. Author explicitly flags reduced quality control relative to the main book.

First pass — the five C's

Category. Supplementary textbook material (worked examples, extended derivations, and application vignettes); not a research paper.

Context. Companion to Murphy [Mur23]. The portion provided builds on Tenenbaum [Ten99] for Bayesian concept learning, Griffiths & Tenenbaum [GT06] for cognitive prediction experiments with domain-appropriate priors, and Bryan & Leise [BL06] for PageRank as a Markov-chain application.

Correctness. Load-bearing assumptions in the visible content: (1) strong sampling assumption — examples drawn uniformly at random from concept extension; (2) Jeffreys uninformative priors for sensor-precision integration; (3) independence of x and y sensor measurements given z. All three are explicitly named. Assumption (1) is the most contestable in realistic settings.

Contributions.

  • Closed-form derivation of a bimodal posterior for sensor fusion when measurement precisions are unknown and integrated out (rather than estimated by MLE plug-in).
  • Numerical worked example of the number game illustrating that the likelihood ratio of h_two vs. h_even reaches ~5000:1 after four examples, formalizing Occam's razor via the size principle.
  • Closed-form posterior predictive p(y|D) for the healthy-levels rectangle game and analytic posterior median T_M = 2^(1/γ)·t under a power-law prior.
  • Compact unified treatment of PageRank as a stationary-distribution computation, including the rank-1 sparse decomposition M = pGD + 1z^T and convergence in 13 power-method iterations on a 500-node harvard.edu subgraph.
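
The rank-1 decomposition and the power method can be sketched in a few lines. The 6-node adjacency matrix and the damping factor p = 0.85 below are assumptions for illustration, not the book's example graph:

```python
import numpy as np

# Hypothetical 6-node graph: A[i, j] = 1 if there is a link j -> i.
A = np.array([
    [0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
], dtype=float)

p = 0.85                           # assumed damping factor
n = A.shape[0]
c = A.sum(axis=0)                  # out-degree of each source node j

# Rank-1 sparse decomposition M = p*G*D + 1*z^T:
# D rescales each column by 1/c_j; z absorbs teleportation and dangling nodes.
D = np.diag(np.where(c > 0, 1.0 / np.maximum(c, 1.0), 0.0))
z = np.where(c > 0, (1 - p) / n, 1.0 / n)
M = p * A @ D + np.outer(np.ones(n), z)   # column-stochastic transition matrix

# Power method: iterate pi <- M pi from a uniform start until convergence.
pi = np.full(n, 1.0 / n)
for it in range(1000):
    pi_new = M @ pi
    err = np.abs(pi_new - pi).sum()
    pi = pi_new
    if err < 1e-12:
        break

print(it + 1, pi)   # iteration count and stationary distribution
```

Since z > 0 everywhere, M is a positive matrix, so the power iteration converges to the unique stationary distribution regardless of initialization.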

Clarity. Generally clear and pedagogically sequenced; the author's own caveat ("caveat lector — some sections have not been checked as carefully as the main book") is the most important quality warning.

Second pass — content

Main thrust: In each worked example, a fully Bayesian treatment that integrates over uncertainty produces qualitatively different (less overconfident, sometimes multimodal) results compared to plug-in MLE/MAP approximations, particularly in small-data regimes; the supplement provides the derivations and numerics that the main text omits.

Supporting evidence:

  • Sensor fusion (N=2 per sensor): the plug-in posterior is unimodal, N(z|1.5788, 0.0798); the exact Bayesian posterior is bimodal with modes near x̄ = 1.5 and ȳ = 3.5, reflecting irreducible ambiguity about which sensor is reliable.
  • Number game: p(D|h_two) = (1/6)^4 ≈ 7.7×10⁻⁴ vs. p(D|h_even) = (1/50)^4 ≈ 1.6×10⁻⁷, a likelihood ratio of ~5000:1 favoring powers of two after D = {16, 8, 2, 64}.
  • PageRank on a 6-node toy graph: π = (0.3209, 0.1706, 0.1065, 0.1368, 0.0643, 0.2008); the power method converges in 13 iterations on the 500-node harvard.edu subgraph from a uniform initialization.
  • Power-law prior: the posterior median is T_M = 2^(1/γ)·t analytically; at the empirically fitted γ for movie grosses, this predicts a total gross ≈ 1.5× the observed-to-date figure.
  • Healthy-levels predictive: p(y|D) = [1/((1+d(y₁)/r₁)(1+d(y₂)/r₂))]^(N−1); undefined at N = 1 without a stronger prior.
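
As a sanity check on the cited numbers, a minimal sketch of the size-principle likelihoods (assuming the standard hypothesis extensions over the integers 1–100, so |h_two| = 6 and |h_even| = 50) and of the power-law posterior median, with γ = 1 and t = 100 chosen purely for illustration:

```python
# Size principle: under strong sampling, p(D|h) = (1/|h|)^N for any
# hypothesis h whose extension contains all of D.
D = [16, 8, 2, 64]                      # observed examples
N = len(D)
h_two = {2, 4, 8, 16, 32, 64}           # powers of two up to 100
h_even = set(range(2, 101, 2))          # even numbers up to 100

assert all(x in h_two and x in h_even for x in D)
lik_two = (1 / len(h_two)) ** N         # (1/6)^4  ≈ 7.7e-4
lik_even = (1 / len(h_even)) ** N       # (1/50)^4 ≈ 1.6e-7
print(lik_two / lik_even)               # (50/6)^4 ≈ 4823, i.e. ~5000:1

# Power-law (Pareto) prior p(T) ∝ T^-(gamma+1), conditioned on T > t:
# the posterior survival function is (t/T)^gamma, so setting it to 1/2
# gives the median T_M = 2^(1/gamma) * t.
gamma, t = 1.0, 100.0
T_M = 2 ** (1 / gamma) * t              # = 200.0 for these illustrative values
```

The ~5000:1 figure is just (50/6)^4 ≈ 4823, so the narrower hypothesis wins despite both being consistent with every example.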

Figures & tables: Figures 2.1–3.8 carry the argument in the visible content. Axis values are quoted in the text, but axis labels on the figure images themselves cannot be confirmed from the provided text alone. No error bars or confidence intervals appear in any figure in the excerpt (the results are either analytical or exact posteriors), and no statistical significance testing is reported, nor is any needed given the analytic/pedagogical nature. Figure 3.7 reproduces empirical human-subject median predictions as dots against Bayesian model predictions as solid lines; no uncertainty on the human data is shown.

Follow-up references:

  • [Ten99] Tenenbaum's thesis — foundational source for the number game and healthy-levels game models.
  • [GT06] Griffiths & Tenenbaum — the cognitive-science experiments underlying the domain-prior section; source of Figures 3.7 and 3.8.
  • [BL06] Bryan & Leise — detailed PageRank exposition.
  • [Mur23] Murphy's main book — required context; this supplement is not self-contained.

Third pass — critique

Implicit assumptions:

  • The strong sampling assumption (the teacher samples uniformly at random from the concept's extension) is central to the size principle and the number-game results; it breaks if teachers are informative or adversarial, which is the norm in real pedagogy.
  • The Jeffreys prior p(λ|z) ∝ 1/λ for the sensor precision is presented without justification of why it is appropriate here; the result (a bimodal posterior) is sensitive to this choice.
  • Independence of the x and y sensors given z is assumed without discussion; correlated sensor errors would change the posterior shape.
  • The rectangle hypothesis space in the healthy-levels game is asserted to be "reasonable from prior domain knowledge" without validation.
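
The bimodality is easy to reproduce numerically: under the Jeffreys prior, integrating out the precision gives p(D|z) ∝ [Σᵢ(dᵢ − z)²]^(−N/2) per sensor, and a grid evaluation of the product recovers the two modes. The individual data values below are assumptions chosen only to match the cited sample means x̄ = 1.5 and ȳ = 3.5:

```python
import numpy as np

# Hypothetical measurements with sample means 1.5 and 3.5.
x = np.array([1.1, 1.9])
y = np.array([2.9, 4.1])

z = np.linspace(-2.0, 7.0, 2001)        # grid over the unknown quantity z

def marg_lik(data, z):
    # Jeffreys prior p(lambda) ∝ 1/lambda integrated out analytically:
    # p(data|z) ∝ [sum_i (data_i - z)^2]^(-N/2).
    s = ((data[:, None] - z[None, :]) ** 2).sum(axis=0)
    return s ** (-len(data) / 2)

post = marg_lik(x, z) * marg_lik(y, z)  # flat prior on z
post /= post.sum()                      # normalize on the grid

# Local maxima of the gridded posterior: two, near the two sample means.
interior = (post[1:-1] > post[:-2]) & (post[1:-1] > post[2:])
modes = z[1:-1][interior]
print(modes)
```

Each marginal likelihood has Student-t-like polynomial tails, so neither sensor can veto the other, and the exact posterior keeps a mode near each sample mean instead of averaging them as the plug-in Gaussian does.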

Missing context or citations:

  • The supplement does not engage with the frequentist critique of Jeffreys priors in the sensor-fusion section.
  • The number game is attributed entirely to Tenenbaum [Ten99]; there is no mention of subsequent critiques or replications of the human-subject data.
  • The PageRank section omits discussion of HITS (Kleinberg) as an alternative authority-ranking algorithm and does not cite work on convergence rates or spectral gaps.
  • The discussion of MLE overfitting in concept learning does not cite the PAC-learning or VC-dimension literature, which formalizes the same intuition.

Possible experimental / analytical issues:

  • The sensor-fusion example uses only N = 2 measurements per sensor; the multimodal behavior may resolve quickly with slightly larger N, but this is not analyzed.
  • The human behavioral data in Figures 3.1 and 3.7 are reproduced from the 1999 and 2006 studies; sample sizes, experimental controls, and demographic details are not stated in this text.
  • The posterior predictive formula (Eq. 3.12) for the healthy-levels game is stated without derivation; "one can show" is the only justification given.
  • Figure 3.2 notes that predictions are "only plotted for those values for which human data is available," which limits direct visual comparison to the model's full predictive distribution.

Ideas for future work:

  • Relax the strong sampling assumption to a cooperative or pedagogically optimal sampling model (e.g., a teacher who picks maximally informative examples) and examine how the posterior and the size principle change.
  • Extend the sensor-fusion bimodal-posterior analysis to N > 2 to characterize how quickly the bimodality collapses and at what sample size the plug-in approximation becomes adequate.
  • Apply the domain-prior framework from Section 3.2 to modern LLM calibration: do fine-tuned models implicitly encode power-law vs. Gaussian priors for different prediction domains?
  • For the healthy-levels game, derive the posterior predictive for the 2D case without the conditional independence assumption (Eq. 3.10) and assess sensitivity to feature correlation.
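
The second item can be probed with a small sketch. The data construction below is an assumption: deterministic samples whose mean and standard deviation are pinned at (1.5, 0.4) and (3.5, 0.6) as N varies, so that only the sample size changes:

```python
import numpy as np

def marg_lik(data, z):
    # Jeffreys prior on the precision integrated out:
    # p(data|z) ∝ [sum_i (data_i - z)^2]^(-N/2).
    s = ((data[:, None] - z[None, :]) ** 2).sum(axis=0)
    return s ** (-len(data) / 2)

def n_modes(post):
    # Count strict interior local maxima on the grid.
    return int(((post[1:-1] > post[:-2]) & (post[1:-1] > post[2:])).sum())

z = np.linspace(-2.0, 7.0, 2001)
counts = {}
for N in (2, 3, 5, 10, 20):
    # Deterministic zero-mean, unit-sd pattern, rescaled per sensor.
    base = np.linspace(-1.0, 1.0, N)
    base = (base - base.mean()) / base.std()
    x, y = 1.5 + 0.4 * base, 3.5 + 0.6 * base
    post = marg_lik(x, z) * marg_lik(y, z)
    counts[N] = n_modes(post / post.max())
print(counts)
```

One caveat this construction exposes: with the sample variances held fixed, the posterior is a monotone transform of the same quartic for every N, so the mode count never changes, even though the ratio of the two mode heights grows with N; randomly sampled data would be needed to study genuine collapse.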

Methods

  • Bayesian inference
  • variational inference
  • MCMC
  • sequential Monte Carlo
  • belief propagation
  • junction tree algorithm
  • Kalman filtering
  • extended Kalman filtering
  • PageRank
  • proximal gradient methods
  • ADMM
  • Gaussian processes
  • normalizing flows
  • denoising diffusion models
  • Dirichlet process mixture models
  • collapsed Gibbs sampling
  • expectation propagation
  • natural gradient descent

Claims

  • Bayesian model averaging over hypotheses provides better-calibrated predictions than plug-in (MLE/MAP) approximations, especially in the small-data regime.
  • The size principle (Occam's razor) naturally emerges from the strong sampling assumption in Bayesian concept learning, favoring the smallest hypothesis consistent with the data.
  • Sensor fusion with unknown measurement noise can yield a multimodal posterior, unlike the unimodal Gaussian posterior obtained when noise is known.
  • Informative, domain-appropriate priors allow Bayesian models to match human predictive judgments across diverse durational quantities such as life spans, movie runtimes, and poem lengths.
  • The exponential-family EKF performs online natural gradient descent, connecting filtering algorithms to optimization methods.