Linguistic Collapse: Neural Collapse in (Large) Language Models

28 May 2024 | Robert Wu, Vardan Papyan
This paper explores the phenomenon of neural collapse ($\mathcal{NC}$) in the context of large language models (LLMs), particularly in the setting of causal language modeling (CLM). $\mathcal{NC}$ is typically observed in classification tasks, where top-layer representations collapse onto their class means, which become equinorm, equiangular, and aligned with the classifiers. However, $\mathcal{NC}$ does not traditionally apply to language modeling because of the large number of classes, the imbalanced class distribution, ambiguous contexts, and undertraining. The authors investigate how scaling the architecture and training of CLMs affects the development of $\mathcal{NC}$ properties and their relationship with generalization.

Key findings include:

- $\mathcal{NC}$ properties, such as within-class variability collapse ($\mathcal{NC}_1$), hyperspherical uniformity ($\mathcal{GNC}_2$), uniform duality ($\mathcal{UNC}_3$), and classifier agreement ($\mathcal{NC}_4$), emerge with model scaling and training.
- $\mathcal{GNC}_2$ improves more clearly and consistently than $\mathcal{NC}_2$.
- $\mathcal{UNC}_3$ is correlated with model width, training, and performance, whereas $\mathcal{NC}_3$ is not.
- $\mathcal{NC}$ is generally promoted by model size and training and is correlated with generalization, even when controlling for model scale.

The study highlights the generality of $\mathcal{NC}$ and its potential benefits for improving LLMs and understanding their training processes. The findings also suggest that $\mathcal{UNC}_3$ may be a better indicator of generalization than $\mathcal{NC}_3$.
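To make the first of these properties concrete, the sketch below (not the authors' code; function names, shapes, and the random data are illustrative) computes an $\mathcal{NC}_1$-style within-class variability score from last-layer token representations grouped by the next-token class they predict, using the common $\operatorname{tr}(\Sigma_W \Sigma_B^{+})/K$ formulation, where $\Sigma_W$ and $\Sigma_B$ are the within- and between-class covariance matrices.

```python
# Minimal sketch of an NC1-style metric, assuming you already have
# last-layer token representations grouped by their next-token class.
import numpy as np

def nc1_variability(features_by_class):
    """features_by_class: dict mapping class id -> array of shape (n_i, d)."""
    class_means = {c: f.mean(axis=0) for c, f in features_by_class.items()}
    global_mean = np.mean(np.stack(list(class_means.values())), axis=0)

    d = global_mean.shape[0]
    sigma_w = np.zeros((d, d))  # within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance of class means
    for c, f in features_by_class.items():
        centered = f - class_means[c]
        sigma_w += centered.T @ centered / len(f)
        diff = (class_means[c] - global_mean)[:, None]
        sigma_b += diff @ diff.T
    k = len(features_by_class)
    sigma_w /= k
    sigma_b /= k

    # tr(Sigma_W Sigma_B^+) / K: smaller values indicate stronger collapse
    # of token features onto their class means.
    return np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / k

# Toy usage with random features for 3 "classes" (next tokens):
rng = np.random.default_rng(0)
feats = {c: rng.normal(size=(50, 16)) for c in range(3)}
print(nc1_variability(feats))
```

In the paper's setting, each "class" is a vocabulary token and the features are the contextual embeddings feeding the unembedding layer, so the same computation is applied over far more (and far more imbalanced) classes than in standard image classification.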