On the Slow Death of Scaling

Authors: Sara Hooker
Year: Not stated (latest citations are from 2025; the essay is consistent with a 2025 composition date)
Tags: scaling-laws, compute-efficiency, large-language-models, ai-governance, algorithmic-progress, position-paper

TL;DR

The essay argues that raw compute scaling is a weakening driver of AI progress: smaller models now routinely outperform much larger ones, scaling laws fail to predict downstream task performance, and the most consequential near-term gains will come from inference-time compute, data quality, and algorithmic innovation rather than additional pre-training FLOPs. The argument carries policy implications because compute thresholds underpin major AI governance frameworks.

First pass — the five C's

Category. Position essay / opinion piece. No original experiments; synthesizes published results and leaderboard observations.

Context. AI scaling subfield. Builds on: Sutton's "Bitter Lesson" (compute beats domain knowledge as the dominant historical driver); Kaplan et al. (2020) scaling laws for LMs; Hooker's own prior work on the hardware lottery (Hooker 2020, 2021b); Schaeffer et al. (2023/2024) on emergent abilities as artifact of evaluation metrics.

Correctness. Load-bearing assumptions: (1) Open LLM Leaderboard rankings are a valid proxy for general model capability; (2) the observed small-beats-large trend will persist rather than revert; (3) inference-time compute gains (estimated at 5×–20×, citing Davidson et al. 2023) generalize broadly. All three are asserted rather than rigorously established.

Contributions.
- Synthesizes leaderboard evidence that models with roughly 4.5% of the parameters of larger contemporaries (about 22× smaller) can outperform them (e.g., Aya Expanse 8B vs. BLOOM 176B, Llama-3 8B vs. Falcon 180B).
- Identifies four internal contradictions in the scaling thesis: weight redundancy (illustrated after this list), long-tail learning inefficiency, data-quality substitutability, and architecture-dependence of any scaling law.
- Argues that inference-time, gradient-free optimization constitutes a qualitatively new paradigm displacing training-compute scaling.
- Critiques compute-threshold governance (EU AI Act, US executive orders) as built on an assumption of monotonically growing model size that the empirical record does not support.
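
To make the weight-redundancy point concrete, here is a minimal numpy sketch in the spirit of Denil et al. (2014): if a weight matrix is approximately low-rank, observing a small subset of its rows and columns suffices to predict the rest. The matrix sizes, rank, and the CUR-style reconstruction are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# An exactly low-rank "weight matrix" (sizes and rank are illustrative).
d, k, r = 512, 256, 8
W = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))

# Observe a small subset of rows and columns (~9% of all entries).
rows = rng.choice(d, size=16, replace=False)
cols = rng.choice(k, size=16, replace=False)
C = W[:, cols]             # observed columns
R = W[rows, :]             # observed rows
A = W[np.ix_(rows, cols)]  # their intersection

# CUR-style reconstruction: predict every unobserved weight.
W_hat = C @ np.linalg.pinv(A) @ R

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.1e}")  # ~0 when rank(A) = rank(W)
```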

Clarity. Readable and well-structured for a broad technical audience; leans on analogy (magnetron/microwave, silver bullet, barber anecdote) in places where quantitative precision would be more convincing.

Second pass — content

Main thrust: The compute–performance relationship is structurally weakening because of architectural redundancy, data-quality substitution effects, and a growing suite of algorithmic improvements that add little or no training compute; governance and investment frameworks that treat scaling as inevitable are therefore mispricing both risk and opportunity.

Supporting evidence:
- Aya 23 8B and Aya Expanse 8B each outperform BLOOM 176B (≈22× the parameters) on the Open LLM Leaderboard (Beeching et al., 2023).
- Llama-3 8B outperforms Falcon 180B (≈22× the parameters) on the same leaderboard.
- Denil et al. (2014): a small subset of weights can predict 95% of all weights in a network, implying high redundancy.
- Inference-time compute improvements are estimated to yield 5×–20× gains over base post-training performance, at minimal footprint relative to pre-training cost (Davidson et al., 2023).
- Scaling laws for downstream task performance are inconsistent or murky (Ganguli et al., 2022; Schaeffer et al., 2023, 2024), and scaling-law datasets frequently contain fewer than 100 model data points (Ruan et al., 2024); see the extrapolation sketch after this list.
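
To see why the Ruan et al. (2024) point matters, a hedged sketch: fitting a saturating power law L(N) = aN^(-b) + c to a small sample of models leaves the extrapolated loss poorly constrained. All constants, the 20-model sample, and the noise level below are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def power_law(n, a, b, c):
    # Saturating scaling law: loss = a * N^(-b) + c
    return a * n ** (-b) + c

# Synthetic "observed" models: 20 sizes spanning 3 orders of magnitude,
# generated from a ground-truth law plus mild measurement noise.
n_obs = np.logspace(7, 10, 20)
loss_obs = power_law(n_obs, a=1e3, b=0.28, c=1.8) * rng.normal(1.0, 0.02, 20)

# Refit the law on bootstrap resamples and extrapolate 100x beyond the data.
target = 1e12
preds = []
for _ in range(200):
    idx = rng.choice(20, size=20, replace=True)
    try:
        popt, _ = curve_fit(power_law, n_obs[idx], loss_obs[idx],
                            p0=(1e3, 0.3, 1.0), maxfev=20_000)
        preds.append(power_law(target, *popt))
    except RuntimeError:
        continue  # fit failed to converge on this resample

lo, hi = np.percentile(preds, [5, 95])
print(f"extrapolated loss at N=1e12: 90% interval [{lo:.2f}, {hi:.2f}]")
```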

Figures & tables:
- Figure 1 (training cost, log scale, 2016–23): axes labeled, sourced from Epoch/AI Index 2024; observational, no error bars.
- Figures 3a/3b (Open LLM Leaderboard trends for small vs. large models): axes exist but are not described textually in enough detail to audit; no error bars and no statistical significance reported (a paired-bootstrap check of the kind sketched after this list would supply that); selection methodology for "notable" models not specified.
- Figures 4 and 5: illustrative/conceptual, no data axes.
- No formal tables; statistical rigor in the figures is absent throughout.
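
Where per-example model outputs are available, the missing significance check is straightforward. A minimal paired-bootstrap sketch, with accuracies fabricated for illustration rather than taken from the leaderboard:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-example correctness for two models on the same 1,000
# benchmark items (values invented; a real audit would use actual outputs).
n = 1000
model_small = rng.random(n) < 0.71   # e.g., an 8B model
model_large = rng.random(n) < 0.68   # e.g., a 176B model

observed_gap = model_small.mean() - model_large.mean()

# Paired bootstrap: resample items, recompute the accuracy gap each time.
gaps = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)
    gaps.append(model_small[idx].mean() - model_large[idx].mean())
gaps = np.asarray(gaps)

lo, hi = np.percentile(gaps, [2.5, 97.5])
print(f"gap = {observed_gap:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
print("significant" if lo > 0 else "not significant at the 5% level")
```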

Follow-up references:
- Kaplan et al. (2020): foundational scaling laws this essay directly contests.
- Schaeffer et al. (2023/2024): empirical challenge to emergent abilities and downstream scaling predictability.
- Ho et al. (2024a/b): quantitative treatment of algorithmic progress in LMs.
- Davidson et al. (2023): the primary source for the 5×–20× inference-time compute gain claim; essential to evaluate that figure independently.

Third pass — critique

Implicit assumptions:
- Open LLM Leaderboard rankings are an unbiased measure of general capability; if the leaderboard is gamed or narrowly scoped, the central small-beats-large trend is undermined.
- The observed parameter-efficiency trend will continue; the essay does not address whether it could reverse with a new architecture class.
- Inference-time compute is economically cheaper than training-time compute in relevant deployment settings; this is not demonstrated, and inference at scale can dominate cost.
- A "new architecture" will emerge that enables continual learning; asserted in §5.2 without specifying a candidate.

Missing context or citations:
- No engagement with Hoffmann et al. (2022) "Chinchilla" (compute-optimal training), arguably the most direct prior work on the compute–data tradeoff and central to any serious treatment of scaling-law limitations.
- No discussion of OpenAI o1/o3-class reasoning models as concrete existence proofs of inference-time scaling; conspicuously absent given the essay's thesis.
- The claim that frontier labs have "stopped publishing" (§1) is made without citation and is contested by available evidence.
- No engagement with literature on domains where power-law scaling has been robust; the code-generation figure of 10 orders of magnitude is mentioned but not weighed against the essay's broader skepticism.

Possible experimental / analytical issues:
- The Open LLM Leaderboard data is presented as a trend without controlling for submission selection bias (authors choose when to submit), benchmark saturation, or data contamination, all known confounds on this leaderboard.
- The 5×–20× inference-time compute gain comes from one source (Davidson et al., 2023) studying "a subset" of techniques; the range, conditions, and task domains are unspecified in the essay.
- Denil et al. (2014) obtained their weight-predictability results on architectures far smaller than modern LLMs; direct applicability is not argued.
- The claim that scaling laws rest on <100 data points (Ruan et al., 2024) is stated without nuance about which specific laws or papers it applies to.
- As a position essay, no methodology section exists; the claims are not falsifiable from the text alone.

Ideas for future work:
- Controlled comparison of inference-time compute vs. equivalent training-time compute budgets across matched task types and model families, with cost measured in wall-clock time and dollars (a rough FLOP-parity framing is sketched after this list).
- Systematic replication of the Open LLM Leaderboard small-beats-large trend on held-out, contamination-audited benchmarks, to verify it is not an artifact of benchmark saturation or submission timing.
- Empirical study of which governance compute thresholds would have correctly flagged historically dangerous capability jumps, to test whether threshold-based policy is as flawed as argued.
- Architecture-search experiment explicitly targeting continual-learning desiderata (minimal catastrophic forgetting, local weight updates) to operationalize the essay's claim that a successor to Transformers is needed.
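
For the first idea, a rough FLOP-parity framing, assuming the common rules of thumb of ~6ND training FLOPs and ~2N inference FLOPs per generated token (assumptions of this sketch, not figures from the essay):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

def infer_flops_per_token(n_params: float) -> float:
    """Rule of thumb: ~2 FLOPs per parameter per generated token."""
    return 2.0 * n_params

# Illustrative question: instead of pre-training an 8B model on 2T extra
# tokens, how many inference tokens could an inference-time method (e.g.
# best-of-N sampling) spend for the same total FLOP budget?
N, extra_tokens = 8e9, 2e12
budget = train_flops(N, extra_tokens)
tokens_at_inference = budget / infer_flops_per_token(N)
print(f"budget: {budget:.2e} FLOPs "
      f"≈ {tokens_at_inference:.2e} generated tokens")  # 3x the token count
```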

Methods

  • data pruning
  • instruction fine-tuning
  • model distillation
  • retrieval-augmented generation
  • preference training
  • chain-of-thought reasoning
  • inference-time compute scaling (sketched below)
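
A minimal sketch of the last item: self-consistency-style majority voting, where extra samples substitute for extra parameters. The `generate` stub and its 40% base accuracy are invented stand-ins for a real sampling LLM call, not anything from the essay.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Stub for a stochastic LLM call; replace with a real sampling API.
    Returns the right answer 40% of the time (invented numbers)."""
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "7"])

def self_consistency(prompt: str, n_samples: int) -> str:
    """Sample n answers and majority-vote: more inference compute, same model."""
    votes = Counter(generate(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
for n in (1, 5, 25, 125):
    acc = sum(self_consistency("Q", n) == "42" for _ in range(400)) / 400
    print(f"{n:>3} samples -> accuracy {acc:.2f}")  # accuracy climbs with n
```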

Datasets

  • Open LLM Leaderboard
  • ImageNet
  • MNIST
  • SQuAD

Claims

  • The relationship between training compute and performance is highly uncertain and rapidly changing, undermining the 'bigger is always better' assumption.
  • Smaller models increasingly outperform far larger ones due to algorithmic improvements, better data quality, and new optimization techniques.
  • Scaling laws have only been reliably shown to hold for pre-training test loss and fail when extrapolated to downstream task performance or over medium time horizons.
  • Data quality improvements, instruction fine-tuning, and inference-time compute strategies can compensate for reduced model size and training compute.
  • Policies and responsible scaling frameworks that use compute thresholds as proxies for capability and risk are built on an increasingly unreliable assumption.