Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Abstract

Feed-forward networks account for a large fraction of transformer parameters, but parameter count alone does not reveal how effectively their latent width is used. This work studies spectral scaling laws for language-model feed-forward representations, measuring soft and hard spectral ranks as model width, depth, and training conditions vary. The analysis shows that nominal width and realized representational dimension can scale differently, motivating spectral telemetry as a complement to loss-based scaling laws.
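As a rough illustration of what measuring a spectral rank can involve, the sketch below computes two common effective-dimension estimators from a matrix of feed-forward activations: an entropy-based (Shannon) rank and the participation ratio. The abstract does not define its soft and hard ranks, so these two estimators and the function names `covariance_spectrum`, `shannon_rank`, and `participation_ratio` are assumptions chosen for illustration, not the paper's stated method.

```python
import numpy as np


def covariance_spectrum(acts: np.ndarray) -> np.ndarray:
    """Normalized eigenvalue spectrum of the activation covariance.

    acts: array of shape (n_tokens, width), one row per token.
    Returns a probability vector over spectral directions.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional
    # to the eigenvalues of its covariance matrix.
    eig = np.linalg.svd(centered, compute_uv=False) ** 2
    total = eig.sum()
    if total <= 0.0:
        raise ValueError("activations have zero variance")
    return eig / total


def shannon_rank(acts: np.ndarray) -> float:
    """Entropy-based effective rank: exp of the spectrum's Shannon entropy."""
    p = covariance_spectrum(acts)
    p = p[p > 0]  # skip exact zeros to avoid log(0)
    return float(np.exp(-(p * np.log(p)).sum()))


def participation_ratio(acts: np.ndarray) -> float:
    """Participation ratio: (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    p = covariance_spectrum(acts)
    return float(1.0 / (p ** 2).sum())


if __name__ == "__main__":
    # Hypothetical data: width 64, but variance decays quickly across directions.
    rng = np.random.default_rng(0)
    scales = 1.0 / np.arange(1, 65)
    acts = rng.normal(size=(2048, 64)) * scales
    print("nominal width:", acts.shape[1])
    print("Shannon rank:", shannon_rank(acts))
    print("participation ratio:", participation_ratio(acts))
```

Both estimators equal the nominal width only when variance is spread evenly across all directions, and fall below it as the spectrum concentrates; the participation ratio weights dominant directions more heavily, so it is never larger than the entropy-based rank.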

Publication
Conference on Empirical Methods in Natural Language Processing 2025

Earlier version presented at the ICML 2025 Actionable Interpretability Workshop (AIW).

Nandan Kumar Jha
Ph.D., New York University · Representation Learning, Scaling Laws, and High-Dimensional Learning Dynamics

I study nonlinear representation dynamics in large language models, focusing on how nonlinearities, architecture, and optimization jointly shape representational geometry, scaling behavior, and usable computational capacity.
