Optimizer-Induced Spectral Scaling Laws

New York University
TL;DR
Same Architecture, Different Optimizer, Different Capacity

Realized representation capacity is not architecture-only; it emerges from the architecture–optimizer interaction. Optimizer geometry changes the scaling exponents that govern how FFN width becomes usable capacity. In controlled comparisons, optimizer-induced shifts in spectral scaling typically exceed those from architectural interventions.

Main result: At fixed architecture and width schedule, different optimizers produce different spectral scaling laws.

Spectral scaling exponents depend on optimizer choice

Takeaway: Hard-rank scaling separates optimizers much more sharply than soft-rank scaling. Soft rank grows with FFN width across optimizers, but hard-rank scaling is strongly optimizer-dependent: AdamW exhibits weak hard-rank scaling, while Muon achieves much stronger scaling under identical architecture and training data.

Key Insights

Five takeaways on optimizer geometry, representation capacity, and scaling laws.

01
Optimizer geometry changes representation scaling laws

The same Transformer architecture realizes different spectral-capacity scaling laws under different optimizers.

02
Added width is not automatically usable capacity

Soft rank grows across optimizers, but hard-rank scaling reveals whether added FFN width becomes dominant usable capacity or diffuse spectral mass.

03
Matched loss does not imply matched geometry

Runs with similar validation loss can exhibit sharply different hard-rank scaling and representation structure.

04
Capacity allocation is token-frequency dependent

MID and TAIL tokens expose the strongest optimizer-induced capacity shifts, showing that optimizer geometry changes how capacity is allocated across the token distribution.

05
Optimizer effects can exceed architectural interventions

In controlled comparisons, switching optimizers shifts spectral-scaling exponents more than attention-rank or positional-encoding interventions.

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard–soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer–architecture co-design.

How Optimizers Reallocate Capacity Across Token Regimes

Why stratify by token frequency? Aggregate spectra can be dominated by frequent token occurrences, potentially obscuring how lower-frequency regimes use capacity. We therefore split representations into HEAD, MID, and TAIL regimes, motivated by the long-tailed structure of language and prior work showing that LLMs struggle with long-tail knowledge (Kandpal et al., 2023), as well as token-frequency/Zipfian scaling analyses (Kunstner & Bach, 2025).

HEAD: frequent tokens · MID: middle-frequency tokens · TAIL: lower-frequency tokens
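The stratification described above can be sketched as a frequency-ranked split of the vocabulary. This is only an illustration: the function name `split_by_frequency` and the cutoffs `head_frac` and `tail_frac` are hypothetical, and the thresholds actually used to define HEAD, MID, and TAIL are not stated here.

```python
from collections import Counter


def split_by_frequency(token_ids, head_frac=0.1, tail_frac=0.5):
    """Assign each distinct token to HEAD, MID, or TAIL by frequency rank.

    head_frac and tail_frac are hypothetical cutoffs on the
    frequency-ranked vocabulary, not the thresholds used in the paper.
    """
    counts = Counter(token_ids)
    ranked = [tok for tok, _ in counts.most_common()]  # most frequent first
    n = len(ranked)
    n_head = max(1, int(n * head_frac))
    n_tail = int(n * tail_frac)

    regimes = {}
    for i, tok in enumerate(ranked):
        if i < n_head:
            regimes[tok] = "HEAD"
        elif i >= n - n_tail:
            regimes[tok] = "TAIL"
        else:
            regimes[tok] = "MID"
    return regimes


# Toy corpus: token t occurs 20 - 2*t times, so token 0 is the most frequent.
toy_corpus = [t for t in range(10) for _ in range(20 - 2 * t)]
toy_regimes = split_by_frequency(toy_corpus)
```

Spectra would then be computed separately on the representations of tokens in each regime, so that frequent tokens cannot dominate the MID and TAIL measurements.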

Main result: MID and TAIL tokens expose the clearest optimizer-dependent scaling differences. Muon/NorMuon nearly eliminate hard–soft asymmetry in these regimes, while AdamW maintains positive asymmetry across all token-frequency regimes.

Optimizer-dependent spectral scaling across token-frequency regimes

Takeaway: MID and TAIL regimes show the clearest optimizer separation. AdamW maintains positive hard–soft asymmetry, while Muon/NorMuon drive this asymmetry near zero for MID and TAIL tokens.

Figure 2: Soft and hard spectral ranks as functions of FFN width for HEAD, MID, and TAIL tokens in GPT-2 160M.

              HEAD                   MID                    TAIL
Optimizer     βhard  βsoft  Δ1,2     βhard  βsoft  Δ1,2     βhard  βsoft  Δ1,2
AdamW         0.26   0.44   +0.18    0.24   0.45   +0.21    0.44   0.62   +0.18
Muon          0.59   0.88   +0.29    0.93   0.88   −0.04    1.02   1.03   +0.01
NorMuon       0.43   0.90   +0.47    0.95   0.93   −0.02    1.04   1.04   +0.00
Dion (1/2)    0.52   0.89   +0.37    0.67   0.82   +0.15    0.88   0.95   +0.07
Dion (1/16)   0.35   0.70   +0.35    0.46   0.68   +0.22    0.40   0.72   +0.31

Table 2: β values for soft and hard ranks for GPT-2 160M. Positive Δ1,2 indicates concentrated eigenspectra; lower values indicate better utilization of FFN width.
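The exponents β in the table are slopes of power-law fits of spectral rank against FFN width, and the reported Δ1,2 values agree with βsoft − βhard up to rounding. A minimal sketch of such a fit follows, assuming an ordinary least-squares line in log-log space (the exact fitting procedure is not specified here). The helper `fit_power_law` is hypothetical, and the rank values are synthetic, chosen only to show the mechanics.

```python
import numpy as np


def fit_power_law(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space.

    Returns (beta, r_squared), with R^2 computed on the log-log line.
    """
    x = np.log(np.asarray(widths, dtype=float))
    y = np.log(np.asarray(ranks, dtype=float))
    beta, intercept = np.polyfit(x, y, 1)  # slope first, then intercept
    residuals = y - (beta * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return float(beta), 1.0 - ss_res / ss_tot


# Widths in units of the base FFN width d (the sweep runs from d to 8d).
widths = np.arange(1, 9, dtype=float)

# Synthetic ranks for illustration only; these are NOT measurements.
soft_ranks = 3.0 * widths ** 0.9
hard_ranks = 5.0 * widths ** 0.5

beta_soft, r2_soft = fit_power_law(widths, soft_ranks)
beta_hard, r2_hard = fit_power_law(widths, hard_ranks)
asymmetry = beta_soft - beta_hard  # hard-soft asymmetry, soft minus hard
```

Because the slope is taken in log space, the base width d only shifts the intercept, so expressing widths as multiples of d leaves β unchanged.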

Numbers to notice

TAIL hard-rank scaling: AdamW 0.44 → Muon 1.02, a 2.3× larger exponent.
MID hard-rank scaling: AdamW 0.24 → Muon 0.93, the largest absolute gain (+0.69).
Muon/NorMuon reduce MID/TAIL hard–soft asymmetry to approximately zero.

Matched Loss ≠ Matched Geometry

Extended AdamW training improves validation perplexity and matches the low-rank Dion control across the FFN-width sweep, but it does not recover representation scaling. Aggregate hard-rank scaling nearly vanishes, dropping from βhard = 0.29 at 6K steps to βhard = 0.03 at 12K steps, while hard–soft asymmetry grows from +0.37 to +0.55. Thus, loss can improve while the width-to-capacity scaling law breaks.

Geometry summary: hard-rank scaling collapses
                  Aggregate               HEAD            MID             TAIL
Configuration     βhard  βsoft  Δ1,2      βhard  Δ1,2     βhard  Δ1,2     βhard  Δ1,2
AdamW 6K          0.29   0.66   +0.37     0.26   +0.18    0.24   +0.21    0.44   +0.18
AdamW 12K         0.03   0.58   +0.55     0.13   +0.28    0.17   +0.30    0.18   +0.35
Dion (1/16) 6K    0.50   0.74   +0.24     0.35   +0.35    0.46   +0.22    0.40   +0.31

Spectral geometry summary. AdamW 12K matches the low-rank Dion control in validation perplexity, but its hard-rank scaling nearly vanishes (βhard=0.03, R²=0.01), whereas Dion maintains reliable power-law scaling.

Loss control: AdamW 12K matches Dion r = 1/16
Optimizer      d      2d     3d     4d     5d     6d     7d     8d
AdamW 6K       38.15  36.79  34.12  34.34  33.85  31.68  32.17  32.43
AdamW 12K      34.25  32.15  30.40  30.47  29.43  29.04  28.45  28.29
Muon           31.79  29.83  28.65  27.90  27.27  26.85  26.43  26.15
NorMuon        31.77  29.82  28.63  27.82  27.23  26.74  26.29  25.94
Dion (1/2)     32.06  30.19  28.96  28.28  27.67  27.16  26.83  26.51
Dion (1/4)     32.61  30.69  29.58  28.79  28.23  27.68  27.41  27.13
Dion (1/8)     33.21  31.34  30.19  29.51  28.94  28.53  28.22  27.89
Dion (1/16)    34.18  32.41  31.33  30.66  30.08  29.71  29.40  29.23

Perplexity control. AdamW 12K matches or improves over Dion r=1/16 across the FFN-width sweep, yet their spectral scaling laws differ sharply. Matched loss does not imply matched representation geometry.
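The loss match can be checked directly from the perplexity table. The short sketch below uses only the AdamW 12K and Dion (1/16) rows reported above; the variable names are illustrative.

```python
# Validation perplexities from the table, for FFN widths d through 8d.
width_labels = ["d", "2d", "3d", "4d", "5d", "6d", "7d", "8d"]
adamw_12k = [34.25, 32.15, 30.40, 30.47, 29.43, 29.04, 28.45, 28.29]
dion_r16 = [34.18, 32.41, 31.33, 30.66, 30.08, 29.71, 29.40, 29.23]

# Negative gap: AdamW 12K has the lower (better) perplexity at that width.
gaps = [round(a - b, 2) for a, b in zip(adamw_12k, dion_r16)]
adamw_better = [w for w, g in zip(width_labels, gaps) if g < 0]
largest_gap = max(abs(g) for g in gaps)

for label, gap in zip(width_labels, gaps):
    print(f"{label:>2}: AdamW 12K - Dion (1/16) = {gap:+.2f}")
```

AdamW 12K is lower at every width except d, where it trails by 0.07, and the two rows never differ by a full perplexity point. The loss control is therefore tight even though the hard-rank exponents differ (0.03 versus 0.50 in aggregate).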

Why the scaling law breaks
Panel A (top): Scaling exponents over training. Extended AdamW training weakens hard-rank scaling.
Panel B (bottom): Width–capacity ordering breaks. Hard-rank dynamics under extended AdamW training.

Extended AdamW training improves loss but breaks hard-capacity scaling. Top: βhard weakens over training while βsoft remains comparatively stable, increasing hard–soft asymmetry. Bottom: Wider AdamW models peak early and then lose hard-rank capacity faster than narrower models, breaking the monotonic width–capacity ordering needed for a power-law fit.

Takeaway

Standard training logs would report healthy progress: perplexity keeps improving. Spectral telemetry reveals the hidden failure mode: realized hard capacity stops scaling with width.

Update Rank Sets a Hard-Capacity Ceiling

Why Dion? Dion provides a controlled intervention: it varies the rank of the projected optimizer update while preserving orthonormalized-update structure. This lets us separate the effect of update geometry from update rank, and ask whether orthonormalization alone is sufficient.

Main result: Update rank acts as a hard-capacity bottleneck. As the Dion rank fraction decreases, TAIL hard-rank scaling drops sharply, while soft-rank scaling degrades more gradually.

Dion rank sweep

Figure 4: TAIL-token spectral scaling under Dion rank sweeps. As the Dion rank fraction decreases from r=1/2 to r=1/16, hard-rank scaling drops from β=0.88 to β=0.40, approaching the AdamW regime. Soft-rank scaling degrades more gradually (0.95 → 0.72), indicating that low update rank primarily limits dominant-mode hard capacity rather than all spectral growth.

Takeaway

Orthonormalization alone is not enough. The optimizer must have sufficient update rank to convert added FFN width into dominant-mode capacity.
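To make the distinction between update geometry and update rank concrete, here is a deliberately simplified sketch. It is not Dion's or Muon's actual algorithm: it uses an exact SVD for clarity, whereas practical optimizers rely on cheaper iterative approximations, and the function `orthonormalized_update` and its `rank_fraction` argument are hypothetical names introduced for illustration.

```python
import numpy as np


def orthonormalized_update(grad, rank_fraction=1.0):
    """Return U_r @ V_r^T built from the top-r singular directions of grad.

    Simplified stand-in for a rank-limited orthonormalized update;
    NOT the actual Dion or Muon update rule.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    r = max(1, int(round(rank_fraction * len(s))))
    # All retained directions get unit scale, whatever their singular value.
    return u[:, :r] @ vt[:r, :]


rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 32))  # toy gradient matrix

full_update = orthonormalized_update(grad, rank_fraction=1.0)
low_update = orthonormalized_update(grad, rank_fraction=1 / 16)

full_rank = int(np.linalg.matrix_rank(full_update))
low_rank = int(np.linalg.matrix_rank(low_update))
low_singular_values = np.linalg.svd(low_update, compute_uv=False)
```

Both updates are orthonormalized in the sense that every nonzero singular value equals one, but the low-rank version can only move the weights along two directions per step in this toy case. That is the kind of constraint the rank sweep isolates.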

Scale Persistence at 350M

The optimizer-dependent structure persists at 350M scale. Muon achieves near-linear hard-rank scaling (βhard = 1.13, R² = 0.94), while AdamW remains clearly sublinear (βhard = 0.39, R² = 0.82).

350M scale persistence

Figure 5: Optimizer-dependent TAIL spectral scaling persists at 350M scale. Muon and NorMuon maintain stronger hard-rank scaling than AdamW, while low-rank Dion(1/16) remains in a lower hard-capacity regime.

Optimizer Effects Can Dominate Architectural Interventions

We compare optimizer-induced spectral scaling shifts against the effect of increasing per-head attention rank at fixed parameter count. Optimizer-induced gains exceed attention-rank shifts in 28 of 30 comparisons.

Optimizer effects exceed architectural effects

Figure 6: Optimizer-induced shifts in spectral scaling exceed attention-rank shifts in GPT-2 160M. Optimizer-induced gains (Δβopt, red dashed lines) exceed attention-rank shifts (bars) in 28 of 30 comparisons (marked with ★).

Rényi Analysis: Where Optimizer-Induced Capacity Forms

The Rényi effective-rank family reveals that AdamW’s FFN nonlinearity must perform 10–12× larger compensatory reinjection than Muon’s, yet AdamW still achieves the lowest post-activation rank. Large reinjection is compensatory, not beneficial: it indicates that the optimizer has created a more collapsed pre-activation geometry.

Rényi family analysis

Figure 7: Rényi-family view of optimizer-shaped spectral capacity in GPT-2 350M. Pre-activation rank (left), post-activation rank (middle), and nonlinear reinjection ratio (right). The strongest optimizer separation appears before the FFN nonlinearity.
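For readers unfamiliar with the Rényi family, the sketch below computes the standard Rényi effective rank of an eigenspectrum, exp(H_α), where H_α is the Rényi entropy of the normalized eigenvalues. Treating α = 1 as the soft rank and α = 2 as the hard rank is an assumption suggested by the Δ1,2 notation, not something stated on this page, and the function name is hypothetical.

```python
import numpy as np


def renyi_effective_rank(eigenvalues, alpha):
    """Rényi effective rank exp(H_alpha) of a nonnegative eigenspectrum.

    alpha = 1 gives the Shannon case (exp of Shannon entropy);
    alpha = 2 gives the participation ratio 1 / sum(p_i ** 2).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    lam = lam[lam > 0]           # zero modes carry no spectral mass
    p = lam / lam.sum()          # normalize to a probability vector
    if np.isclose(alpha, 1.0):
        entropy = -np.sum(p * np.log(p))
    else:
        entropy = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
    return float(np.exp(entropy))


# Toy spectrum with one dominant mode (illustrative values only).
spectrum = [4.0, 1.0, 1.0]
rank_alpha1 = renyi_effective_rank(spectrum, alpha=1.0)
rank_alpha2 = renyi_effective_rank(spectrum, alpha=2.0)

# A flat spectrum attains the maximum: every order returns the dimension.
flat_rank = renyi_effective_rank(np.ones(8), alpha=2.0)
```

Higher orders weight dominant eigenmodes more heavily, so the α = 2 rank never exceeds the α = 1 rank. A gap between the two signals spectral mass concentrated in a few modes, which is the concentration that positive hard–soft asymmetry reports.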

Optimizer Geometry Expands Trainable Architecture Space

Beyond capacity scaling, optimizer choice determines which architectures are trainable in the first place. Muon-family optimizers train partial PostLN configurations at useful perplexity where AdamW diverges or collapses.

Optimizer            PostLN-25  PostLN-50  PostLN-75  Full PostLN
AdamW (lr=10⁻⁴)      64.6       65.6       106.7      ×
AdamW (lr=3×10⁻⁴)    41.9       ×          ×          ×
Muon                 28.7       30.1       40.9       ×
NorMuon              28.7       29.9       32.8       ×
Dion (r=1/2)         29.1       31.7       ×          ×
Dion (r=1/16)        30.7       34.7       ×          ×

Table: Validation perplexity for partial PostLN configurations in GPT-2 160M. × denotes diverged runs (PPL > 1000). NorMuon reaches PPL=32.8 at PostLN-75 where AdamW collapses to PPL=106.7.

Research Lineage

This paper extends a line of work on FFN representation geometry in language models: from nonlinear eigenspectrum dynamics, to spectral scaling laws under a fixed optimizer, to optimizer-induced scaling laws of realized capacity.

BibTeX

@article{jha2026optimizer,
  title={Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws},
  author={Nandan Kumar Jha and Brandon Reagen},
  year={2026},
  url={https://optimizer-scaling-laws.github.io}
}