Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Abstract

Scaling laws usually describe how loss improves with parameters, data, and compute. This work studies a complementary question: how much of an architecture’s nominal feed-forward width becomes realized representational capacity during training. Holding architecture and data fixed, we show that optimizers can induce sharply different spectral scaling behavior across token-frequency regimes. In particular, optimizer choice changes hard-rank scaling, spectral asymmetry, and the extent to which rare-token and mid-frequency representations use available width. The results suggest that capacity is not only specified by architecture; it is realized through learning dynamics.
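To make "realized capacity" concrete, here is a minimal sketch of one way to measure how much of a feed-forward layer's width is actually used. It rests on an assumption the abstract does not state: that hard rank is the participation ratio of the activation covariance eigenspectrum. The function names (`eigenspectrum`, `hard_rank`, `width_utilization`) and the synthetic data are hypothetical illustrations, not the paper's code.

```python
import numpy as np

def eigenspectrum(activations):
    """Eigenvalues of the covariance of activations shaped (tokens, width)."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(activations.shape[0] - 1, 1)
    eig = np.linalg.eigvalsh(cov)      # symmetric matrix, real eigenvalues
    return np.clip(eig, 0.0, None)     # remove tiny negative round-off

def hard_rank(eig):
    """ASSUMED definition: participation ratio (sum l)^2 / sum(l^2)."""
    return float(eig.sum() ** 2 / (eig ** 2).sum())

def width_utilization(activations):
    """Hypothetical helper: hard rank as a fraction of nominal width."""
    return hard_rank(eigenspectrum(activations)) / activations.shape[1]

# Synthetic example: 16-wide activations that vary along only 2 directions.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 16))
full_rank = rng.normal(size=(500, 16))
print(width_utilization(low_rank), width_utilization(full_rank))
```

Under this assumed definition, a flat spectrum gives a hard rank equal to the width, while a spectrum dominated by a few directions gives a much smaller value. Computing the same quantity separately on rare-token and frequent-token activations would be one way to compare frequency regimes.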

Publication
Under review, 2026
Nandan Kumar Jha
Ph.D., New York University · Representation Learning, Scaling Laws, and High-Dimensional Learning Dynamics

I study nonlinear representation dynamics in large language models, focusing on how nonlinearities, architecture, and optimization jointly shape representational geometry, scaling behavior, and usable computational capacity.
