A Tutorial on Principal Component Analysis
Authors: Jonathon Shlens
Year: 2014
Tags: principal-component-analysis, dimensionality-reduction, linear-algebra, singular-value-decomposition, tutorial, data-analysis
TL;DR
Pedagogical derivation of PCA from first principles — change of basis, variance maximization, and covariance diagonalization — culminating in two algebraic solutions via eigenvector decomposition and SVD. Targets readers with linear algebra backgrounds who want to understand why PCA works, not just how to call a function.
First pass — the five C's
Category. Tutorial / expository paper; no novel results or experiments.
Context. Sits in the applied linear algebra / data-analysis tutorial literature. Builds on standard matrix decomposition theory (the spectral theorem for symmetric matrices, SVD); mentions ICA and kernel PCA as alternatives in the Discussion but cites no specific prior works by name; all framing is self-contained.
Correctness. Central assumptions are stated explicitly: (1) linearity of the change of basis, (2) high-variance directions ≈ high SNR / interesting structure, (3) orthonormality of principal components. All three are load-bearing; the paper itself flags that assumptions 1 and 2 can fail (ferris-wheel example, non-Gaussian data). The mathematical derivations within those assumptions appear internally consistent.
Contributions.
- Unified intuitive and algebraic derivation of PCA from the goal of covariance diagonalization.
- Explicit proof that the eigenvectors of the covariance matrix CX = (1/n)XX^T are the principal components.
- Demonstration that PCA via eigenvector decomposition and PCA via SVD are equivalent, with the SVD approach recovering the principal components as the columns of V in the decomposition of Y = (1/√n)X^T.
- Two reference MATLAB implementations (covariance-based and SVD-based) in Appendix B.
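The claimed eigendecomposition/SVD equivalence is easy to check numerically. A minimal NumPy sketch (synthetic data of my own construction, not from the paper; shapes follow the paper's rows-as-measurement-types convention):

```python
import numpy as np

# Hypothetical data matrix X with m=3 measurement types, n=500 samples
# (my own synthetic example, following the paper's layout convention).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
X -= X.mean(axis=1, keepdims=True)  # subtract per-dimension means
n = X.shape[1]

# Route 1: eigenvectors of the covariance matrix C_X = (1/n) X X^T
C_X = (X @ X.T) / n
eigvals, E = np.linalg.eigh(C_X)  # eigh returns ascending eigenvalues
E = E[:, ::-1]                    # re-order columns by descending variance

# Route 2: SVD of Y = (1/sqrt(n)) X^T; principal components are columns of V
Y = X.T / np.sqrt(n)
_, s, Vt = np.linalg.svd(Y, full_matrices=False)
V = Vt.T

# The two bases agree column-by-column, up to the sign of each column
for k in range(3):
    assert np.allclose(np.abs(E[:, k]), np.abs(V[:, k]))
```

The per-column absolute values are compared because eigenvectors and singular vectors are each determined only up to sign; the squared singular values of Y equal the eigenvalues of C_X, which is the Y^T Y = C_X construction at work.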
Clarity. Writing is consistently clear and well-paced for a tutorial; the mid-paper footnote reversing the row/column conventions (Section 6.1) is a genuine readability stumble that the author flags but never fully resolves, so the switch still invites confusion.
Second pass — content
Main thrust: PCA finds the orthonormal basis that diagonalizes the data covariance matrix, equivalently the basis of rank-ordered variance-maximizing directions; this is solved analytically either by eigendecomposition of CX or by SVD of the mean-centered data matrix.
Supporting evidence:
- Toy spring example: 6-dimensional recordings (3 cameras × 2D position each, 120 Hz) used to motivate recovery of a single underlying degree of freedom.
- SNR formalized as σ²_signal / σ²_noise; high-variance directions are assumed to carry signal.
- Covariance matrix CX = (1/n)XX^T shown to be square, symmetric, m×m; off-diagonal zeros in CY encode full decorrelation.
- Algebraic proof that choosing P = E^T (eigenvectors of CX as rows) yields CY = D (diagonal), i.e., PCA diagonalizes the covariance exactly.
- SVD equivalence: the principal components are the columns of V in the SVD of Y ≡ (1/√n)X^T, since Y^T Y = CX by construction.
- PCA stated (without proof) to optimally minimize mean squared error (L2) among linear dimensionality-reduction methods.
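The central diagonalization step (choosing P = E^T) can be verified directly. A minimal sketch on synthetic data of my own construction, not the paper's spring dataset:

```python
import numpy as np

# Hypothetical 4-dimensional, 1000-sample data matrix (my own example).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 1000))
X -= X.mean(axis=1, keepdims=True)  # zero-mean each measurement type
n = X.shape[1]

C_X = (X @ X.T) / n
_, E = np.linalg.eigh(C_X)  # columns of E are eigenvectors of C_X
P = E.T                     # principal components as rows, per the paper
Y = P @ X                   # change of basis
C_Y = (Y @ Y.T) / n         # covariance in the new basis

# Off-diagonal entries of C_Y vanish: the new basis fully decorrelates the data
assert np.allclose(C_Y, np.diag(np.diag(C_Y)))
```

This is exactly the paper's algebraic argument run numerically: since C_X is symmetric, E is orthogonal, so P C_X P^T = E^T (E D E^T) E = D.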
Figures & tables: Figure 1 (spring + camera schematic) and Figure 2 (2D data cloud with σ²_signal / σ²_noise annotated) carry the intuition; Figure 3 (redundancy spectrum) motivates covariance; Figure 4 (SVD matrix construction) is the key technical diagram; Figure 5 (3-step PCA recipe) and Figure 6 (failure modes) round out the paper. Axes in Figures 2–3 are labeled. No error bars, confidence intervals, or statistical significance — none are relevant to a purely expository paper. Figure 4's checkerboard notation for the sparse Σ matrix is non-standard and may confuse readers unfamiliar with the convention.
Follow-up references: Not stated (no bibliography is provided in the paper; ICA and kernel PCA are mentioned by name only, with no citations).
Third pass — critique
Implicit assumptions:
- A Gaussian (or at least second-order-sufficient) data distribution is implicitly required for decorrelation to imply independence; the paper mentions this only in a footnote.
- Zero-mean data after per-dimension mean subtraction is assumed throughout; unequal measurement scales (i.e., whether to standardize to unit variance before PCA) are never discussed, although they can materially change results when variables have different units.
- The assumption that "large variance = interesting structure" is stated as an assumption but presented as nearly axiomatic; if it is violated (e.g., artifact-dominated dimensions), PCA silently produces wrong results.
- Orthonormality of the principal components is assumed from the outset as the "easiest" choice; the paper does not prove it is the only valid choice for diagonalization.
Missing context or citations:
- No bibliography whatsoever; ICA and kernel PCA are invoked without citations, making follow-up impossible from the paper alone.
- No discussion of numerical-stability differences between the eigenvector and SVD approaches (SVD is generally preferred in practice for rank-deficient or nearly singular covariance matrices).
- Whitening/sphering as a preprocessing step, and its relationship to PCA, is not mentioned.
- Probabilistic PCA (Tipping & Bishop) is absent despite being a direct statistical extension of the framework discussed.
- No mention of computational-cost scaling (O(m³) for eigendecomposition, O(min(m,n)·mn) for SVD) relevant to practitioners.
Possible experimental / analytical issues:
- The L2-optimality claim ("PCA provides the optimal reduced representation under MSE") is stated without proof or citation; a reader cannot verify it from this paper.
- The reverse direction of Theorem 3 (symmetric iff orthogonally diagonalizable) is explicitly left unproven and handed to the reader, leaving a gap in the mathematical completeness the paper advertises.
- MATLAB code uses 1/(N-1) normalization in the covariance computation but the main text derives everything with 1/n; this inconsistency is unacknowledged and will produce numerically different results for small n.
- The toy spring example is never numerically demonstrated; it is purely illustrative, so the claim that PCA "recovers x" from the 6D recordings is asserted but not shown.
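The 1/n vs. 1/(N-1) normalization mismatch flagged above affects the reported variances but not the directions, since the two covariance estimates differ only by a scalar factor. A small sketch (synthetic data, my own construction):

```python
import numpy as np

# Deliberately tiny sample count to make the normalization gap visible.
rng = np.random.default_rng(2)
X = rng.normal(size=(2, 5))
X -= X.mean(axis=1, keepdims=True)
n = X.shape[1]

C_text = (X @ X.T) / n        # 1/n, as derived in the main text
C_code = (X @ X.T) / (n - 1)  # 1/(N-1), as in the MATLAB listings (and np.cov)

# The matrices differ by the scalar factor n/(n-1) = 1.25 here, so their
# eigenvectors (the principal components) are identical; only the
# eigenvalues (the per-component variances) disagree.
assert np.allclose(C_code, C_text * n / (n - 1))
```

So the inconsistency is silent in the recovered basis but shows up in any downstream use of the variance magnitudes (e.g., explained-variance thresholds) at small n.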
Ideas for future work:
- Extend the tutorial with a worked numerical example on the spring dataset (simulated or real), showing the recovered principal component quantitatively aligning with the true axis of motion.
- Add a section on choosing the number of components to retain (scree plots, explained-variance thresholds, cross-validation), the most common practical question PCA users face but one that is entirely absent.
- Derive or cite the L2-optimality result formally, connecting it to the Eckart–Young theorem for low-rank matrix approximation.
- Contrast the covariance-based and SVD-based implementations numerically on ill-conditioned data to demonstrate when the SVD route is preferable for numerical stability.
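As a concrete version of the component-selection suggestion, a sketch of the cumulative explained-variance rule (the 95% threshold and the variance spectrum below are hypothetical choices of mine, not from the paper):

```python
import numpy as np

# Hypothetical rank-ordered variances (eigenvalues of C_X), descending.
eigvals = np.array([5.0, 3.0, 0.3, 0.15, 0.05])

# Cumulative fraction of total variance explained by the first k components.
explained = np.cumsum(eigvals) / eigvals.sum()

# Smallest k whose cumulative explained variance reaches the 95% threshold.
k = int(np.searchsorted(explained, 0.95)) + 1  # -> 3 for this spectrum
```

A scree plot is the same information read visually: one looks for the "elbow" where the eigenvalue spectrum flattens instead of fixing a threshold in advance.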
Methods
- principal component analysis
- singular value decomposition
- eigenvector decomposition
- covariance matrix diagonalization
- change of basis
Claims
- PCA finds the orthonormal basis that maximizes variance and minimizes redundancy in a dataset by diagonalizing the covariance matrix.
- The principal components of a data matrix X are the eigenvectors of the covariance matrix CX = (1/n) XX^T.
- PCA and SVD are intimately related: the principal components of X correspond to the columns of V in the SVD of the normalized transpose of X.
- Under mean squared error loss, PCA provides the optimal reduced representation of data, but it only removes second-order dependencies and fails when higher-order structure is present.
- PCA assumes linearity, that large variances correspond to important structure, and that principal components are orthogonal.
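The last two claims can be illustrated together by constructing data that is uncorrelated (so PCA's second-order machinery sees no shared structure) yet fully dependent. A hedged NumPy sketch, my own example rather than the paper's:

```python
import numpy as np

# y is a deterministic function of x, yet their covariance vanishes,
# because E[x * (x^2 - E[x^2])] = E[x^3] = 0 for a symmetric distribution.
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=100_000)
y = x**2 - np.mean(x**2)

cov_xy = np.mean(x * y)    # sample covariance (both series are ~zero-mean)
assert abs(cov_xy) < 0.01  # second-order view: no relationship at all
```

PCA applied to the stacked (x, y) data would report two already-decorrelated axes and miss that y is exactly x²; this higher-order dependence is the gap that ICA and kernel PCA, the alternatives the paper names, are designed to address.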