Critical percolation clusters embedded in high dimensions, combined with taxonomic latent variables, form an analytically tractable synthetic data model whose ground-truth hierarchy can be linearly decoded from network activations.
Understanding llm behaviors via compression: Data generation, knowledge acquisition and scaling laws
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
other 1polarities
unclear 1representative citing papers
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
Controlled experiments show language models extract correct answers from contradictory data only when errors are structurally incoherent, supporting the hypothesis that gradient descent selects the most compressible answer cluster.
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
citing papers explorer
-
Critical Percolation as a Synthetic Data Model for Interpretability
Critical percolation clusters embedded in high dimensions, combined with taxonomic latent variables, form an analytically tractable synthetic data model whose ground-truth hierarchy can be linearly decoded from network activations.
-
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning
Introduces a hierarchical latent selection model showing SFT supplies raw module materials in compound traces while RL decomposes them to identify atomic modules and enable recombination for new reasoning configurations.
-
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
-
Truth as a Compression Artifact in Language Model Training
Controlled experiments show language models extract correct answers from contradictory data only when errors are structurally incoherent, supporting the hypothesis that gradient descent selects the most compressible answer cluster.
-
Deep sequence models tend to memorize geometrically; it is unclear why
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.