Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

TL;DR

Same Architecture, Different Optimizer, Different Capacity

Realized representation capacity is not architecture-only; it emerges from the architecture–optimizer interaction. Optimizer geometry changes the scaling exponents that govern how FFN width becomes usable capacity. In controlled comparisons, optimizer-induced shifts in spectral scaling exceed architectural interventions.

Key Insights

Five takeaways on optimizer geometry, representation capacity, and scaling laws.

01

Optimizer geometry changes representation scaling laws

The same Transformer architecture realizes different spectral-capacity scaling laws under different optimizers.

02

Added width is not automatically usable capacity

Soft rank grows across optimizers, but hard-rank scaling reveals whether added FFN width becomes dominant usable capacity or diffuse spectral mass.

03

Matched loss does not imply matched geometry

Runs with similar validation loss can exhibit sharply different hard-rank scaling and representation structure.

04

Capacity allocation is token-frequency dependent

MID and TAIL tokens expose the strongest optimizer-induced capacity shifts, showing that optimizer geometry changes how capacity is allocated across the token distribution.

05

Optimizer effects can exceed architectural interventions

In controlled comparisons, switching optimizers shifts spectral-scaling exponents more than attention-rank or positional-encoding interventions.

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard–soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest that optimization is a first-class axis of representation scaling, motivating optimizer–architecture co-design.

How Do Optimizers Reallocate Capacity Across Token Regimes?

Why stratify by token frequency? Aggregate spectra can be dominated by frequent token occurrences, potentially obscuring how lower-frequency regimes use capacity. We therefore split representations into HEAD, MID, and TAIL regimes, motivated by the long-tailed structure of language and prior work showing that LLMs struggle with long-tail knowledge (Kandpal et al., 2023), as well as token-frequency/Zipfian scaling analyses (Kunstner & Bach, 2025).

HEAD: frequent tokens
MID: middle-frequency tokens
TAIL: lower-frequency tokens
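One way to realize such a split is to rank token types by corpus frequency and cut the ranking by cumulative occurrence mass. The sketch below is only an illustration: the function name `split_by_frequency` and the 50%/90% cumulative-mass cutoffs are hypothetical choices, not the thresholds used in the paper.

```python
from collections import Counter

def split_by_frequency(token_ids, head_mass=0.5, mid_mass=0.9):
    """Assign each token type to HEAD, MID, or TAIL by cumulative frequency mass.

    head_mass and mid_mass are hypothetical cutoffs chosen for illustration.
    """
    counts = Counter(token_ids)
    total = sum(counts.values())
    regimes = {}
    covered = 0  # occurrences already covered by more frequent token types
    for token, count in counts.most_common():
        frac_before = covered / total
        if frac_before < head_mass:
            regimes[token] = "HEAD"
        elif frac_before < mid_mass:
            regimes[token] = "MID"
        else:
            regimes[token] = "TAIL"
        covered += count
    return regimes

# Toy corpus: token 0 dominates, tokens 3 and 4 are rare.
corpus = [0] * 60 + [1] * 20 + [2] * 12 + [3] * 5 + [4] * 3
regimes = split_by_frequency(corpus)
```

Representations would then be grouped by the regime of the token they belong to, and a separate eigenspectrum computed per group, so that frequent tokens cannot dominate the MID and TAIL measurements.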

Main result: MID and TAIL tokens expose the clearest optimizer-dependent scaling differences. Muon/NorMuon nearly eliminate hard–soft asymmetry in these regimes, while AdamW maintains positive asymmetry across all token-frequency regimes.

Optimizer-dependent spectral scaling across token-frequency regimes

Takeaway: MID and TAIL regimes show the clearest optimizer separation. AdamW maintains positive hard–soft asymmetry, while Muon/NorMuon drive this asymmetry near zero for MID and TAIL tokens.

Figure 2: Soft and hard spectral ranks as functions of FFN width for HEAD, MID, and TAIL tokens in GPT-2 160M.

	HEAD			MID			TAIL
Optimizer	β_hard	β_soft	Δ_1,2	β_hard	β_soft	Δ_1,2	β_hard	β_soft	Δ_1,2
AdamW	0.26	0.44	+0.18	0.24	0.45	+0.21	0.44	0.62	+0.18
Muon	0.59	0.88	+0.29	0.93	0.88	−0.04	1.02	1.03	+0.01
NorMuon	0.43	0.90	+0.47	0.95	0.93	−0.02	1.04	1.04	+0.00
Dion (1/2)	0.52	0.89	+0.37	0.67	0.82	+0.15	0.88	0.95	+0.07
Dion (1/16)	0.35	0.70	+0.35	0.46	0.68	+0.22	0.40	0.72	+0.31

Table 2: β values for soft and hard ranks for GPT-2 160M. Δ_1,2 = β_soft − β_hard is the hard–soft asymmetry. Positive Δ_1,2 indicates concentrated eigenspectra; lower values indicate better utilization of FFN width.
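The exponents in the table are power-law slopes: if a spectral rank grows as width^β, then β is the slope of log(rank) against log(width). The sketch below assumes an ordinary least-squares fit in log-log space, which may differ from the paper's exact fitting procedure. The function name `fit_power_law`, the base width of 768, and the rank values are hypothetical; the ranks are synthetic, not measurements.

```python
import math

def fit_power_law(widths, ranks):
    """Least-squares fit of log(rank) = beta * log(width) + log(c).

    Returns (beta, r_squared).
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    beta = sxy / sxx
    # For a one-variable linear fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return beta, r_squared

base = 768  # hypothetical base width d
widths = [base * k for k in range(1, 9)]  # d, 2d, ..., 8d

# Synthetic ranks generated from exact power laws (not data from the paper).
soft = [3.0 * w ** 0.62 for w in widths]
hard = [2.0 * w ** 0.44 for w in widths]

beta_soft, r2_soft = fit_power_law(widths, soft)
beta_hard, r2_hard = fit_power_law(widths, hard)
asymmetry = beta_soft - beta_hard  # hard-soft asymmetry, Delta_{1,2}
```

Returning R² alongside β is a deliberate choice: a small β with a low R² means the power law itself has stopped describing the data, which is different from a well-fit but shallow slope.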

Numbers to notice

TAIL hard-rank scaling: AdamW 0.44 → Muon 1.02, a 2.3× larger exponent.
MID hard-rank scaling: AdamW 0.24 → Muon 0.93, the largest absolute gain (+0.69).
Muon/NorMuon reduce MID/TAIL hard–soft asymmetry to approximately zero.
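To make these exponents concrete, suppose the fitted power law held exactly across a d to 8d width sweep. This is an idealization, and R_hard is our shorthand for hard rank. The TAIL exponents would then imply the following growth in hard rank from the narrowest to the widest model:

```latex
\frac{R_{\text{hard}}(8d)}{R_{\text{hard}}(d)} = 8^{\beta_{\text{hard}}}
\quad\Rightarrow\quad
\text{AdamW: } 8^{0.44} \approx 2.5,
\qquad
\text{Muon: } 8^{1.02} \approx 8.3,
\qquad
\frac{\beta_{\text{Muon}}}{\beta_{\text{AdamW}}} = \frac{1.02}{0.44} \approx 2.3 .
```

Under that idealization, an 8× increase in FFN width would yield roughly 8× more TAIL hard rank with Muon but only about 2.5× with AdamW.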

Matched Loss ≠ Matched Geometry

Extended AdamW training improves validation perplexity and matches the low-rank Dion control across the FFN-width sweep, but it does not recover representation scaling. Aggregate hard-rank scaling nearly vanishes, dropping from β_hard = 0.29 at 6K steps to β_hard = 0.03 at 12K steps, while hard–soft asymmetry grows from +0.37 to +0.55. Thus, loss can improve while the width-to-capacity scaling law breaks.

Geometry summary: hard-rank scaling collapses

	Aggregate			HEAD		MID		TAIL
Configuration	β_hard	β_soft	Δ_1,2	β_hard	Δ_1,2	β_hard	Δ_1,2	β_hard	Δ_1,2
AdamW 6K	0.29	0.66	+0.37	0.26	+0.18	0.24	+0.21	0.44	+0.18
AdamW 12K	0.03	0.58	+0.55	0.13	+0.28	0.17	+0.30	0.18	+0.35
Dion (1/16) 6K	0.50	0.74	+0.24	0.35	+0.35	0.46	+0.22	0.40	+0.31

Spectral geometry summary. AdamW 12K matches the low-rank Dion control in validation perplexity, but its hard-rank scaling nearly vanishes (β_hard=0.03, R²=0.01), whereas Dion maintains reliable power-law scaling.

Loss control: AdamW 12K matches Dion r = 1/16

Optimizer	d	2d	3d	4d	5d	6d	7d	8d
AdamW 6K	38.15	36.79	34.12	34.34	33.85	31.68	32.17	32.43
AdamW 12K	34.25	32.15	30.40	30.47	29.43	29.04	28.45	28.29
Muon	31.79	29.83	28.65	27.90	27.27	26.85	26.43	26.15
NorMuon	31.77	29.82	28.63	27.82	27.23	26.74	26.29	25.94
Dion (1/2)	32.06	30.19	28.96	28.28	27.67	27.16	26.83	26.51
Dion (1/4)	32.61	30.69	29.58	28.79	28.23	27.68	27.41	27.13
Dion (1/8)	33.21	31.34	30.19	29.51	28.94	28.53	28.22	27.89
Dion (1/16)	34.18	32.41	31.33	30.66	30.08	29.71	29.40	29.23

Perplexity control. AdamW 12K matches or improves over Dion r=1/16 across the FFN-width sweep, yet their spectral scaling laws differ sharply. Matched loss does not imply matched representation geometry.
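The per-width gap between the two matched configurations can be read directly off the table. The short script below only re-uses the numbers printed above; the variable names are ours.

```python
# Validation perplexities copied from the table above (widths d through 8d).
adamw_12k = [34.25, 32.15, 30.40, 30.47, 29.43, 29.04, 28.45, 28.29]
dion_1_16 = [34.18, 32.41, 31.33, 30.66, 30.08, 29.71, 29.40, 29.23]

# Negative gap means AdamW 12K has lower (better) perplexity at that width.
gaps = [round(a - d, 2) for a, d in zip(adamw_12k, dion_1_16)]
widths_where_adamw_better = sum(g < 0 for g in gaps)
worst_gap = max(gaps)
```

AdamW 12K has lower perplexity at seven of the eight widths and trails by only 0.07 at width d, so the spectral differences between the two configurations cannot be attributed to a loss advantage for Dion.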

Why the scaling law breaks

Panel A. Scaling exponents over training: extended AdamW training weakens hard-rank scaling.

Panel B. Width–capacity ordering breaks: hard-rank dynamics under extended AdamW training.

Extended AdamW training improves loss but breaks hard-capacity scaling. Top: β_hard weakens over training while β_soft remains comparatively stable, increasing hard–soft asymmetry. Bottom: Wider AdamW models peak early and then lose hard-rank capacity faster than narrower models, breaking the monotonic width–capacity ordering needed for a power-law fit.
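The ordering condition can be checked mechanically before fitting any exponent: a power law with a positive exponent presupposes that hard rank does not decrease as width grows. The sketch below is a minimal illustration; the function name `ordering_violations` and all rank values are hypothetical, not measurements from the paper.

```python
def ordering_violations(widths, hard_ranks):
    """Return adjacent width pairs where the wider model has lower hard rank."""
    pairs = sorted(zip(widths, hard_ranks))
    return [
        (w_narrow, w_wide)
        for (w_narrow, r_narrow), (w_wide, r_wide) in zip(pairs, pairs[1:])
        if r_wide < r_narrow
    ]

# Hypothetical hard ranks at width multiples 1..4 of a base width d.
early = ordering_violations([1, 2, 3, 4], [40.0, 55.0, 66.0, 75.0])
late = ordering_violations([1, 2, 3, 4], [38.0, 41.0, 36.0, 35.0])
```

In this toy example the early checkpoint has no violations, while the late checkpoint has two, the kind of pattern that would accompany a collapsing β_hard and a low R².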

Takeaway

Standard training logs would report healthy progress: perplexity keeps improving. Spectral telemetry reveals the hidden failure mode: realized hard capacity stops scaling with width.

Update Rank Sets a Hard-Capacity Ceiling

Why Dion? Dion provides a controlled intervention: it varies the rank of the projected optimizer update while preserving orthonormalized-update structure. This lets us separate the effect of update geometry from update rank, and ask whether orthonormalization alone is sufficient.

Main result: Update rank acts as a hard-capacity bottleneck. As the Dion rank fraction decreases, TAIL hard-rank scaling drops sharply, while soft-rank scaling degrades more gradually.

Figure 4: TAIL-token spectral scaling under Dion rank sweeps. As the Dion rank fraction decreases from r=1/2 to r=1/16, hard-rank scaling drops from β=0.88 to β=0.40, approaching the AdamW regime. Soft-rank scaling degrades more gradually (0.95 → 0.72), indicating that low update rank primarily limits dominant-mode hard capacity rather than all spectral growth.

Takeaway

Orthonormalization alone is not enough. The optimizer must have sufficient update rank to convert added FFN width into dominant-mode capacity.

Scale Persistence at 350M

The optimizer-dependent structure persists at 350M scale. Muon achieves near-linear hard-rank scaling (β_hard = 1.13, R² = 0.94), while AdamW remains clearly sublinear (β_hard = 0.39, R² = 0.82).

Figure 5: Optimizer-dependent TAIL spectral scaling persists at 350M scale. Muon and NorMuon maintain stronger hard-rank scaling than AdamW, while low-rank Dion(1/16) remains in a lower hard-capacity regime.

Optimizer Effects Can Dominate Architectural Interventions

We compare optimizer-induced spectral scaling shifts against the effect of increasing per-head attention rank at fixed parameter count. Optimizer-induced gains exceed attention-rank shifts in 28 of 30 comparisons.

Optimizer effects exceed architectural effects

Figure 6: Optimizer-induced shifts in spectral-scaling exceed attention-rank shifts in GPT-2 160M. Optimizer-induced gains (Δβ^★_opt, red dashed lines) exceed attention-rank shifts (bars) in 28 of 30 comparisons (marked with ★).

Rényi Analysis: Where Does Optimizer-Induced Capacity Form?

The Rényi effective-rank family reveals that AdamW’s FFN nonlinearity must perform 10–12× larger compensatory reinjection than Muon’s, yet AdamW still ends up with the lowest post-activation rank. Large reinjection is compensatory, not beneficial: it indicates that the optimizer has created a more collapsed pre-activation geometry.

Figure 7: Rényi-family view of optimizer-shaped spectral capacity in GPT-2 350M. Pre-activation rank (left), post-activation rank (middle), and nonlinear reinjection ratio (right). The strongest optimizer separation appears before the FFN nonlinearity.
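For readers unfamiliar with the family, the sketch below assumes the standard definition: the effective rank of order α is the exponential of the Rényi entropy of the normalized eigenvalue distribution. Under that assumption, order 1 is the Shannon (entropy-based) rank and order 2 is the participation ratio. Reading these two orders as the soft and hard ranks behind Δ_1,2 is our interpretation and is not stated in the text above. The function name and the toy spectra are hypothetical.

```python
import math

def renyi_effective_rank(eigenvalues, alpha):
    """exp of the Renyi entropy of order alpha of the normalized spectrum."""
    total = sum(eigenvalues)
    p = [ev / total for ev in eigenvalues if ev > 0]
    if abs(alpha - 1.0) < 1e-12:
        # Shannon limit (alpha -> 1): exp of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    # exp((1 / (1 - alpha)) * log(sum p^alpha)) == (sum p^alpha)^(1 / (1 - alpha))
    return sum(pi ** alpha for pi in p) ** (1.0 / (1.0 - alpha))

# Toy spectra: one flat, one with a single dominant eigenvalue.
flat = [1.0] * 8
peaked = [8.0] + [1.0] * 7

flat_order1 = renyi_effective_rank(flat, 1.0)
flat_order2 = renyi_effective_rank(flat, 2.0)
peaked_order1 = renyi_effective_rank(peaked, 1.0)
peaked_order2 = renyi_effective_rank(peaked, 2.0)
```

On the flat spectrum both orders return the full dimension, 8. On the peaked spectrum the order-2 rank falls further than the order-1 rank, because higher orders weight dominant eigenvalues more heavily.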

Optimizer Geometry Expands Trainable Architecture Space

Beyond capacity scaling, optimizer choice determines which architectures are trainable in the first place. Muon-family optimizers train partial PostLN configurations at useful perplexity where AdamW diverges or collapses.

Optimizer	PostLN-25	PostLN-50	PostLN-75	Full PostLN
AdamW (lr=10⁻⁴)	64.6	65.6	106.7	×
AdamW (lr=3×10⁻⁴)	41.9	×	×	×
Muon	28.7	30.1	40.9	×
NorMuon	28.7	29.9	32.8	×
Dion (r=1/2)	29.1	31.7	×	×
Dion (r=1/16)	30.7	34.7	×	×

Table: Validation perplexity for partial PostLN configurations in GPT-2 160M. × denotes diverged runs (PPL > 1000). NorMuon reaches PPL=32.8 at PostLN-75 where AdamW collapses to PPL=106.7.

Research Lineage

This paper extends a line of work on FFN representation geometry in language models: from nonlinear eigenspectrum dynamics, to spectral scaling laws under a fixed optimizer, to optimizer-induced scaling laws of realized capacity.

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Published at ICLR 2026. NerVE studied how FFN nonlinearities reshape representation spectra at a fixed model width, showing that nonlinearities can reinject variance across eigendirections and that optimizer geometry modulates this redistribution. The present work lifts this fixed-width analysis into the scaling regime, asking how optimizer geometry changes the law by which added FFN width becomes realized capacity.

BibTeX

@inproceedings{jha2026nerve,
  title={NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks},
  author={Jha, Nandan Kumar and Reagen, Brandon},
  booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026}
}

Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Published at EMNLP 2025. This work established spectral scaling laws for FFN latent-space utilization under a fixed optimizer, showing that soft and hard spectral ranks scale differently with FFN width and revealing a hard–soft asymmetry in how added width is used. The present work shows that this asymmetry is not architecture-invariant: optimizer choice can change the scaling exponents themselves.

BibTeX

@inproceedings{jha2025spectral,
  title={Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?},
  author={Jha, Nandan Kumar and Reagen, Brandon},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}

BibTeX

@article{jha2026optimizer,
  title={Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws},
  author={Jha, Nandan Kumar and Reagen, Brandon},
  year={2026},
  url={https://optimizer-scaling-laws.github.io}
}