A Tutorial on Principal Component Analysis

Jonathon Shlens · 2014

Authors: Jonathon Shlens
Year: 2014
Tags: principal-component-analysis, dimensionality-reduction, linear-algebra, singular-value-decomposition, tutorial, data-analysis

TL;DR

Pedagogical derivation of PCA from first principles — change of basis, variance maximization, and covariance diagonalization — culminating in two algebraic solutions via eigenvector decomposition and SVD. Targets readers with linear algebra backgrounds who want to understand why PCA works, not just how to call a function.

First pass — the five C's

Category. Tutorial / expository paper; no novel results or experiments.

Context. General applied linear algebra / data analysis. Builds on standard matrix-decomposition theory (the spectral theorem for symmetric matrices, SVD); mentions ICA and kernel PCA as alternatives in the Discussion but cites no specific prior works by name; all framing is self-contained.

Correctness. Central assumptions are stated explicitly: (1) linearity of the change of basis, (2) high-variance directions ≈ high SNR / interesting structure, (3) orthonormality of the principal components. All three are load-bearing; the paper itself flags that assumptions 1 and 2 can fail (the Ferris-wheel example, non-Gaussian data). The mathematical derivations within those assumptions appear internally consistent.

Contributions.
  • Unified intuitive and algebraic derivation of PCA from the goal of covariance diagonalization.
  • Explicit proof that the eigenvectors of the covariance matrix CX = (1/n)XX^T are the principal components.
  • Demonstration that PCA via eigenvector decomposition and PCA via SVD are equivalent, with the SVD approach recovering the PCs as the columns of V from the decomposition of Y = (1/√n)X^T (checked numerically in the sketch below).
  • Two reference MATLAB implementations (covariance-based and SVD-based) in Appendix B.
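
The eigenvector/SVD equivalence is easy to verify numerically. Below is a minimal Python sketch (not the paper's MATLAB code; the toy data and variable names are mine) that runs PCA both ways on a random m × n data matrix, following the paper's convention of variables in rows and observations in columns.

```python
# Hypothetical illustration, not the paper's code: PCA two ways.
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 500                        # e.g. 3 cameras x 2D positions, n frames
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)   # subtract the per-dimension mean

# Route 1: eigendecomposition of the covariance C_X = (1/n) X X^T
C = X @ X.T / n
evals, E = np.linalg.eigh(C)         # returned in ascending order
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]

# Route 2: SVD of Y = (1/sqrt(n)) X^T; the principal components are the columns of V
Y = X.T / np.sqrt(n)
_, s, Vt = np.linalg.svd(Y, full_matrices=False)

# The bases agree up to sign, and the eigenvalues equal the squared singular values
assert np.allclose(np.abs(E), np.abs(Vt.T))
assert np.allclose(evals, s**2)
```

The sign ambiguity of eigenvectors is the only difference between the two routes, hence the comparison of absolute values.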

Clarity. The writing is consistently clear and well-paced for a tutorial; the mid-paper footnote that reverses the row/column conventions (Section 6.1) is a genuine readability stumble, which the author flags but does not fully resolve before it causes confusion.

Second pass — content

Main thrust: PCA finds the orthonormal basis that diagonalizes the data covariance matrix, equivalently the basis of rank-ordered variance-maximizing directions; this is solved analytically either by eigendecomposition of CX or by SVD of the mean-centered data matrix.

Supporting evidence:
  • Toy spring example: 6-dimensional recordings (3 cameras × 2D position each, 120 Hz) used to motivate recovery of a single underlying degree of freedom.
  • SNR formalized as σ²_signal / σ²_noise; high-variance directions are assumed to carry signal.
  • Covariance matrix CX = (1/n)XX^T shown to be square, symmetric, and m×m; zero off-diagonal terms in CY encode full decorrelation.
  • Algebraic proof that choosing P = E^T (the eigenvectors of CX as rows) yields CY = D (diagonal), i.e., PCA diagonalizes the covariance exactly (key steps restated below).
  • SVD equivalence: the principal components are the columns of V in the SVD of Y ≡ (1/√n)X^T, since Y^T Y = CX by construction.
  • PCA shown to optimally minimize mean squared error (L2) among linear dimensionality-reduction methods (stated without proof).
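
For reference, the two key algebraic steps can be restated compactly (note that the paper reuses the symbol Y for both the transformed data PX and the scaled transpose (1/√n)X^T; both usages appear below):

```latex
% Diagonalization: write C_X = E D E^T and choose P = E^T, so that for Y = PX
C_Y = \tfrac{1}{n} Y Y^{\mathsf{T}}
    = P \left( \tfrac{1}{n} X X^{\mathsf{T}} \right) P^{\mathsf{T}}
    = E^{\mathsf{T}} \left( E D E^{\mathsf{T}} \right) E
    = D .

% SVD route: define Y \equiv \tfrac{1}{\sqrt{n}} X^{\mathsf{T}} with SVD Y = U \Sigma V^{\mathsf{T}}; then
Y^{\mathsf{T}} Y = \tfrac{1}{n} X X^{\mathsf{T}} = C_X
                 = V \left( \Sigma^{\mathsf{T}} \Sigma \right) V^{\mathsf{T}} ,
% so the columns of V are the eigenvectors of C_X (the principal components)
% and the squared singular values are the variances along them.
```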

Figures & tables: Figure 1 (spring + camera schematic) and Figure 2 (2D data cloud with σ²_signal / σ²_noise annotated) carry the intuition; Figure 3 (redundancy spectrum) motivates covariance; Figure 4 (SVD matrix construction) is the key technical diagram; Figure 5 (3-step PCA recipe) and Figure 6 (failure modes) round out the paper. Axes in Figures 2–3 are labeled. No error bars, confidence intervals, or statistical significance — none are relevant to a purely expository paper. Figure 4's checkerboard notation for the sparse Σ matrix is non-standard and may confuse readers unfamiliar with the convention.

Follow-up references: Not stated (no bibliography is provided in the paper; ICA and kernel PCA are mentioned by name only, with no citations).

Third pass — critique

Implicit assumptions:
  • A Gaussian (or at least second-order-sufficient) data distribution is implicitly required for decorrelation to imply independence; the paper mentions this only in a footnote.
  • Zero-mean data after per-dimension mean subtraction is assumed throughout; unequal measurement scales (i.e., whether to standardize to unit variance before PCA) are never discussed, which can materially change results when variables have different units (see the sketch after this list).
  • The assumption that "large variance = interesting structure" is stated as an assumption but presented as nearly axiomatic; if it is violated (e.g., artifact-dominated dimensions), PCA silently produces wrong results.
  • Orthonormality of the principal components is assumed from the outset as the "easiest" choice; the paper does not prove it is the only valid choice for diagonalization.
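
The scaling point is easy to demonstrate; a minimal sketch (my own toy example, not from the paper) in which changing the units of one variable redirects the first principal component:

```python
# Hypothetical illustration of the unstated scaling assumption.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 1000))        # two uncorrelated, unit-variance variables

def first_pc(X):
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])
    return E[:, np.argmax(evals)]

pc_raw = first_pc(X)                      # direction is essentially arbitrary here

X_scaled = X.copy()
X_scaled[0] *= 100.0                      # same data, variable 0 now in different units
pc_scaled = first_pc(X_scaled)            # first PC snaps onto variable 0

X_std = X_scaled / X_scaled.std(axis=1, keepdims=True)
pc_std = first_pc(X_std)                  # standardizing removes the units dependence
print(pc_raw, pc_scaled, pc_std)
```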

Missing context or citations:
  • No bibliography whatsoever; ICA and kernel PCA are invoked without citations, making follow-up impossible from the paper alone.
  • No discussion of the numerical-stability differences between the eigenvector and SVD approaches (SVD is generally preferred in practice for rank-deficient or nearly singular covariance matrices).
  • Whitening / sphering as a preprocessing step, and its relationship to PCA, is not mentioned (sketched below).
  • Probabilistic PCA (Tipping & Bishop) is absent despite being a direct statistical extension of the framework discussed.
  • No mention of how computational cost scales (O(m³) for eigendecomposition, O(min(m,n)·mn) for SVD), which matters to practitioners.
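
Whitening is a one-line extension of the machinery the paper already develops: rotate into the eigenbasis of CX, then rescale each coordinate by the inverse square root of its eigenvalue so the transformed covariance is the identity. A minimal sketch (my own function name and example, using the paper's variables-in-rows convention):

```python
# Hypothetical PCA-whitening sketch, not from the paper.
import numpy as np

def pca_whiten(X, eps=1e-10):
    """X: m x n data matrix (variables in rows); returns whitened data."""
    Xc = X - X.mean(axis=1, keepdims=True)
    evals, E = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])    # C_X = E D E^T
    W = np.diag(1.0 / np.sqrt(evals + eps)) @ E.T         # W = D^{-1/2} E^T
    return W @ Xc                                         # covariance of result ≈ I

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 2000)) * np.array([[1.0], [5.0], [0.2]])
Z = pca_whiten(X)
print(np.round(Z @ Z.T / Z.shape[1], 3))                  # ≈ identity matrix
```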

Possible experimental / analytical issues:
  • The L2-optimality claim ("PCA provides the optimal reduced representation under MSE") is stated without proof or citation; a reader cannot verify it from this paper.
  • The reverse direction of Theorem 3 (a matrix is symmetric iff it is orthogonally diagonalizable) is explicitly left unproven and handed to the reader, leaving a gap in the mathematical completeness the paper advertises.
  • The MATLAB code uses 1/(N-1) normalization in the covariance computation, while the main text derives everything with 1/n; the inconsistency is unacknowledged, and the two conventions give numerically different variances for small n (see the check below).
  • The toy spring example is never demonstrated numerically; it is purely illustrative, so the claim that PCA "recovers x" from the 6D recordings is asserted but not shown.
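
A quick numerical check of the normalization point (my own sketch): the two conventions give identical principal components, while the associated variances differ by a factor of (n-1)/n, which is most noticeable for small n.

```python
# Hypothetical check of the 1/n vs 1/(n-1) normalization discrepancy.
import numpy as np

rng = np.random.default_rng(3)
n = 10                                     # deliberately small sample
X = rng.standard_normal((4, n))
X -= X.mean(axis=1, keepdims=True)

C_n   = X @ X.T / n                        # main text's convention
C_nm1 = X @ X.T / (n - 1)                  # appendix MATLAB / np.cov convention

ev_n,   E_n   = np.linalg.eigh(C_n)
ev_nm1, E_nm1 = np.linalg.eigh(C_nm1)

assert np.allclose(np.abs(E_n), np.abs(E_nm1))     # identical directions (up to sign)
assert np.allclose(ev_nm1 * (n - 1) / n, ev_n)     # eigenvalues scaled by (n-1)/n
```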

Ideas for future work:
  • Extend the tutorial with a worked numerical example on the spring dataset (simulated or real), showing the recovered principal component quantitatively aligning with the true axis of motion.
  • Add a section on choosing the number of components to retain (scree plots, explained-variance thresholds, cross-validation); this is the most common practical question PCA users face but is entirely absent (a minimal criterion is sketched below).
  • Derive or cite the L2-optimality result formally, connecting it to the Eckart–Young theorem for low-rank matrix approximation.
  • Contrast the covariance-based and SVD-based implementations numerically on ill-conditioned data to demonstrate when the SVD route is preferable for numerical stability.
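
As a concrete starting point for the component-selection suggestion, here is a minimal explained-variance criterion (the threshold, toy data, and function name are mine and purely illustrative):

```python
# Hypothetical sketch: retain enough components to explain a target fraction of variance.
import numpy as np

def n_components_for(X, threshold=0.95):
    Xc = X - X.mean(axis=1, keepdims=True)
    evals = np.linalg.eigvalsh(Xc @ Xc.T / Xc.shape[1])[::-1]   # variances, descending
    cumulative = np.cumsum(evals) / np.sum(evals)               # explained-variance ratio
    return int(np.searchsorted(cumulative, threshold)) + 1

# 6-D data that is really 1-D plus small noise, echoing the spring toy problem
rng = np.random.default_rng(4)
t = rng.standard_normal(500)
X = np.outer(rng.standard_normal(6), t) + 0.05 * rng.standard_normal((6, 500))
print(n_components_for(X))                                      # typically 1
```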

Methods

  • principal component analysis
  • singular value decomposition
  • eigenvector decomposition
  • covariance matrix diagonalization
  • change of basis

Claims

  • PCA finds the orthonormal basis that maximizes variance and minimizes redundancy in a dataset by diagonalizing the covariance matrix.
  • The principal components of a data matrix X are the eigenvectors of the covariance matrix CX = (1/n) XX^T.
  • PCA and SVD are intimately related: the principal components of X correspond to the columns of V in the SVD of the normalized transpose of X.
  • Under mean squared error loss, PCA provides the optimal reduced representation of data, but it only removes second-order dependencies and fails when higher-order structure is present.
  • PCA assumes linearity, that large variances correspond to important structure, and that principal components are orthogonal.