A Tutorial on Principal Component Analysis

Jonathon Shlens · 2014

Authors: Jonathon Shlens
Year: 2014
Tags: principal-component-analysis, dimensionality-reduction, linear-algebra, singular-value-decomposition, tutorial, data-analysis

TL;DR

Pedagogical derivation of PCA from first principles — change of basis, variance maximization, and covariance diagonalization — culminating in two algebraic solutions via eigenvector decomposition and SVD. Targets readers with linear algebra backgrounds who want to understand why PCA works, not just how to call a function.

First pass — the five C's

Category. Tutorial / expository paper; no novel results or experiments.

Context. General applied linear algebra / data analysis. Builds on standard matrix-decomposition theory (the spectral theorem for symmetric matrices, SVD); mentions ICA and kernel PCA as alternatives in the Discussion but cites no specific prior works by name; all framing is self-contained.

Correctness. Central assumptions are stated explicitly: (1) linearity of the change of basis, (2) high-variance directions ≈ high SNR / interesting structure, (3) orthonormality of the principal components. All three are load-bearing; the paper itself flags that assumptions 1 and 2 can fail (the Ferris-wheel example, non-Gaussian data). The mathematical derivations within those assumptions appear internally consistent.

Contributions.
  • Unified intuitive and algebraic derivation of PCA from the goal of covariance diagonalization.
  • Explicit proof that the eigenvectors of the covariance matrix CX = (1/n)XX^T are the principal components.
  • Demonstration that PCA via eigenvector decomposition and PCA via SVD are equivalent, with the SVD approach recovering the PCs as the columns of V from the decomposition of Y = (1/√n)X^T (checked numerically in the sketch below).
  • Two reference MATLAB implementations (covariance-based and SVD-based) in Appendix B.
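
The eigenvector/SVD equivalence is easy to verify numerically. Below is a minimal Python sketch (not the paper's MATLAB code; the toy data and variable names are mine) that runs PCA both ways on a random m × n data matrix, following the paper's convention of variables in rows and observations in columns.

```python
# Hypothetical illustration, not the paper's code: PCA two ways.
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 500                        # e.g. 3 cameras x 2D positions, n frames
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)   # subtract the per-dimension mean

# Route 1: eigendecomposition of the covariance C_X = (1/n) X X^T
C = X @ X.T / n
evals, E = np.linalg.eigh(C)         # returned in ascending order
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]

# Route 2: SVD of Y = (1/sqrt(n)) X^T; the principal components are the columns of V
Y = X.T / np.sqrt(n)
_, s, Vt = np.linalg.svd(Y, full_matrices=False)

# The bases agree up to sign, and the eigenvalues equal the squared singular values
assert np.allclose(np.abs(E), np.abs(Vt.T))
assert np.allclose(evals, s**2)
```

The sign ambiguity of eigenvectors is the only difference between the two routes, hence the comparison of absolute values.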

Clarity. The writing is consistently clear and well-paced for a tutorial; the mid-paper footnote that reverses the row/column conventions (Section 6.1) is a genuine readability stumble, which the author flags but does not fully resolve before it causes confusion.

Second pass — content

Main thrust: PCA finds the orthonormal basis that diagonalizes the data covariance matrix, equivalently the basis of rank-ordered variance-maximizing directions; this is solved analytically either by eigendecomposition of CX or by SVD of the mean-centered data matrix.

Supporting evidence:
  • Toy spring example: 6-dimensional recordings (3 cameras × 2D position each, 120 Hz) used to motivate recovery of a single underlying degree of freedom.
  • SNR formalized as σ²_signal / σ²_noise; high-variance directions are assumed to carry signal.
  • Covariance matrix CX = (1/n)XX^T shown to be square, symmetric, and m×m; zero off-diagonal terms in CY encode full decorrelation.
  • Algebraic proof that choosing P = E^T (the eigenvectors of CX as rows) yields CY = D (diagonal), i.e., PCA diagonalizes the covariance exactly (key steps restated below).
  • SVD equivalence: the principal components are the columns of V in the SVD of Y ≡ (1/√n)X^T, since Y^T Y = CX by construction.
  • PCA shown to optimally minimize mean squared error (L2) among linear dimensionality-reduction methods (stated without proof).
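
For reference, the two key algebraic steps can be restated compactly (note that the paper reuses the symbol Y for both the transformed data PX and the scaled transpose (1/√n)X^T; both usages appear below):

```latex
% Diagonalization: write C_X = E D E^T and choose P = E^T, so that for Y = PX
C_Y = \tfrac{1}{n} Y Y^{\mathsf{T}}
    = P \left( \tfrac{1}{n} X X^{\mathsf{T}} \right) P^{\mathsf{T}}
    = E^{\mathsf{T}} \left( E D E^{\mathsf{T}} \right) E
    = D .

% SVD route: define Y \equiv \tfrac{1}{\sqrt{n}} X^{\mathsf{T}} with SVD Y = U \Sigma V^{\mathsf{T}}; then
Y^{\mathsf{T}} Y = \tfrac{1}{n} X X^{\mathsf{T}} = C_X
                 = V \left( \Sigma^{\mathsf{T}} \Sigma \right) V^{\mathsf{T}} ,
% so the columns of V are the eigenvectors of C_X (the principal components)
% and the squared singular values are the variances along them.
```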

Figures & tables: Figure 1 (spring + camera schematic) and Figure 2 (2D data cloud with σ²_signal / σ²_noise annotated) carry the intuition; Figure 3 (redundancy spectrum) motivates covariance; Figure 4 (SVD matrix construction) is the key technical diagram; Figure 5 (3-step PCA recipe) and Figure 6 (failure modes) round out the paper. Axes in Figures 2–3 are labeled. No error bars, confidence intervals, or statistical significance — none are relevant to a purely expository paper. Figure 4's checkerboard notation for the sparse Σ matrix is non-standard and may confuse readers unfamiliar with the convention.

Follow-up references: Not stated (no bibliography is provided in the paper; ICA and kernel PCA are mentioned by name only, with no citations).

Third pass — critique

Implicit assumptions:
  • A Gaussian (or at least second-order-sufficient) data distribution is implicitly required for decorrelation to imply independence; the paper mentions this only in a footnote.
  • Zero-mean data after per-dimension mean subtraction is assumed throughout; unequal measurement scales (i.e., whether to standardize to unit variance before PCA) are never discussed, which can materially change results when variables have different units (see the sketch after this list).
  • The assumption that "large variance = interesting structure" is stated as an assumption but presented as nearly axiomatic; if it is violated (e.g., artifact-dominated dimensions), PCA silently produces wrong results.
  • Orthonormality of the principal components is assumed from the outset as the "easiest" choice; the paper does not prove it is the only valid choice for diagonalization.
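
The scaling point is easy to demonstrate; a minimal sketch (my own toy example, not from the paper) in which changing the units of one variable redirects the first principal component:

```python
# Hypothetical illustration of the unstated scaling assumption.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 1000))        # two uncorrelated, unit-variance variables

def first_pc(X):
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return E[:, np.argmax(evals)]

pc_raw = first_pc(X)                      # direction is essentially arbitrary here

X_scaled = X.copy()
X_scaled[0] *= 100.0                      # same data, variable 0 now in different units
pc_scaled = first_pc(X_scaled)            # first PC snaps onto variable 0

X_std = X_scaled / X_scaled.std(axis=1, keepdims=True)
pc_std = first_pc(X_std)                  # standardizing removes the units dependence
print(pc_raw, pc_scaled, pc_std)
```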

Missing context or citations:
  • No bibliography whatsoever; ICA and kernel PCA are invoked without citations, making follow-up impossible from the paper alone.
  • No discussion of the numerical-stability differences between the eigenvector and SVD approaches (SVD is generally preferred in practice for rank-deficient or nearly singular covariance matrices).
  • Whitening / sphering as a preprocessing step, and its relationship to PCA, is not mentioned (sketched below).
  • Probabilistic PCA (Tipping & Bishop) is absent despite being a direct statistical extension of the framework discussed.
  • No mention of how computational cost scales (O(m³) for eigendecomposition, O(min(m,n)·mn) for SVD), which matters to practitioners.
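
Whitening is a one-line extension of the machinery the paper already develops: rotate into the eigenbasis of CX, then rescale each coordinate by the inverse square root of its eigenvalue so the transformed covariance is the identity. A minimal sketch (my own function name and example, using the paper's variables-in-rows convention):

```python
# Hypothetical PCA-whitening sketch, not from the paper.
import numpy as np

def pca_whiten(X, eps=1e-10):
    """X: m x n data matrix (variables in rows); returns whitened data."""
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])    # C_X = E D E^T
    W = np.diag(1.0 / np.sqrt(evals + eps)) @ E.T         # W = D^{-1/2} E^T
    return W @ Xc                                         # covariance of result ≈ I

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2000)) * np.array([[1.0], [5.0], [0.2]])
Z = pca_whiten(X)
print(np.round(Z @ Z.T / Z.shape[1], 3))                  # ≈ identity matrix
```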

Possible experimental / analytical issues:
  • The L2-optimality claim ("PCA provides the optimal reduced representation under MSE") is stated without proof or citation; a reader cannot verify it from this paper.
  • The reverse direction of Theorem 3 (a matrix is symmetric iff it is orthogonally diagonalizable) is explicitly left unproven and handed to the reader, leaving a gap in the mathematical completeness the paper advertises.
  • The MATLAB code uses 1/(N-1) normalization in the covariance computation, while the main text derives everything with 1/n; the inconsistency is unacknowledged, and the two conventions give numerically different variances for small n (see the check below).
  • The toy spring example is never demonstrated numerically; it is purely illustrative, so the claim that PCA "recovers x" from the 6D recordings is asserted but not shown.
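
A quick numerical check of the normalization point (my own sketch): the two conventions give identical principal components, while the associated variances differ by a factor of (n-1)/n, which is most noticeable for small n.

```python
# Hypothetical check of the 1/n vs 1/(n-1) normalization discrepancy.
import numpy as np

rng = np.random.default_rng(3)
n = 10                                     # deliberately small sample
X = rng.standard_normal((4, n))
X -= X.mean(axis=1, keepdims=True)

C_n   = X @ X.T / n                        # main text's convention
C_nm1 = X @ X.T / (n - 1)                  # appendix MATLAB / np.cov convention

ev_n,   E_n   = np.linalg.eigh(C_n)
ev_nm1, E_nm1 = np.linalg.eigh(C_nm1)

assert np.allclose(np.abs(E_n), np.abs(E_nm1))     # identical directions (up to sign)
assert np.allclose(ev_nm1 * (n - 1) / n, ev_n)     # eigenvalues scaled by (n-1)/n
```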

Ideas for future work:
  • Extend the tutorial with a worked numerical example on the spring dataset (simulated or real), showing the recovered principal component quantitatively aligning with the true axis of motion.
  • Add a section on choosing the number of components to retain (scree plots, explained-variance thresholds, cross-validation); this is the most common practical question PCA users face but is entirely absent (a minimal criterion is sketched below).
  • Derive or cite the L2-optimality result formally, connecting it to the Eckart–Young theorem for low-rank matrix approximation.
  • Contrast the covariance-based and SVD-based implementations numerically on ill-conditioned data to demonstrate when the SVD route is preferable for numerical stability.
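
As a concrete starting point for the component-selection suggestion, here is a minimal explained-variance criterion (the threshold, toy data, and function name are mine and purely illustrative):

```python
# Hypothetical sketch: retain enough components to explain a target fraction of variance.
import numpy as np

def n_components_for(X, threshold=0.95):
    Xc = X - X.mean(axis=1, keepdims=True)
    evals = np.linalg.eigvalsh(Xc @ Xc.T / Xc.shape[1])[::-1]   # variances, descending
    cumulative = np.cumsum(evals) / np.sum(evals)               # explained-variance ratio
    return int(np.searchsorted(cumulative, threshold)) + 1

# 6-D data that is really 1-D plus small noise, echoing the spring toy problem
rng = np.random.default_rng(4)
t = rng.standard_normal(500)
X = np.outer(rng.standard_normal(6), t) + 0.05 * rng.standard_normal((6, 500))
print(n_components_for(X))                                      # typically 1
```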

Methods

  • principal component analysis
  • singular value decomposition
  • eigenvector decomposition
  • covariance matrix diagonalization
  • change of basis

Claims

  • PCA finds the orthonormal basis that maximizes variance and minimizes redundancy in a dataset by diagonalizing the covariance matrix.
  • The principal components of a data matrix X are the eigenvectors of the covariance matrix CX = (1/n) XX^T.
  • PCA and SVD are intimately related: the principal components of X correspond to the columns of V in the SVD of the normalized transpose of X.
  • Under mean squared error loss, PCA provides the optimal reduced representation of data, but it only removes second-order dependencies and fails when higher-order structure is present.
  • PCA assumes linearity, that large variances correspond to important structure, and that principal components are orthogonal.