
arxiv: 2606.19941 · v1 · pith:D7XJUZKYnew · submitted 2026-06-18 · 💻 cs.LG

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

Pith reviewed 2026-06-26 18:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords: compositionality, neural networks, depth, connectivity, sparsity, generalization, gradient descent, solution manifolds

The pith

Compositionality in neural networks arises only in a narrow depth and specific sparse connectivity regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural networks trained by gradient descent rarely develop internal compositionality, the reuse of meaningful primitives in new combinations that supports generalization. This work shows the property appears only when both network depth and connectivity fall inside a narrow, target-dependent sweet spot. Specific patterns of sparse connections are required; random or different sparsity patterns do not suffice. Shallower or deeper networks, or those outside the right connectivity pattern, converge instead to fractured non-compositional solutions. The authors supply a pruning procedure to locate the right connectivity, a depth heuristic, and a supporting theory based on compositional sparsity, volume ratios, and feature-interference bounds.

Core claim

Compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis it appears only in certain specifically sparse networks and depends on which connections remain rather than on weight sparsity alone. Along the depth axis it emerges inside a narrow, target-dependent regime, peaking at particular depths while both shallower and deeper networks fail. When either condition is violated, gradient descent silently converges to fractured solutions. The findings are supported by similarity-based pruning to recover compositional connectivity, a heuristic depth predictor, and a theoretical framework of compositional sparsity, volume-ratio arguments, and feature-interference bounds.
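The distinction between sparsity level and connectivity pattern can be made concrete with a small sketch. It is illustrative only: the weight values, the helper names, and the two pruning rules (magnitude-based and random) are assumptions chosen for the example, and neither rule is the paper's similarity-based pruning, whose selection criterion is not described on this page. The point is that two masks can share the same sparsity while keeping different connections.

```python
import random


def magnitude_mask(weights, keep):
    """Keep the `keep` largest-magnitude weights (ties broken by lower index)."""
    order = sorted(range(len(weights)), key=lambda i: (-abs(weights[i]), i))
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(weights))]


def random_mask(n, keep, seed=0):
    """Keep `keep` connections chosen uniformly at random."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n), keep))
    return [1 if i in kept else 0 for i in range(n)]


def sparsity(mask):
    """Fraction of connections removed by the mask."""
    return 1.0 - sum(mask) / len(mask)


def overlap(a, b):
    """Number of connections kept by both masks."""
    return sum(x & y for x, y in zip(a, b))


# Hypothetical weights for a single layer with eight connections.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.6]
m_mag = magnitude_mask(weights, keep=3)
m_rand = random_mask(len(weights), keep=3, seed=1)

print("sparsity:", sparsity(m_mag), sparsity(m_rand))
print("shared connections:", overlap(m_mag, m_rand))
```

Both masks report the same sparsity, so any difference in downstream behavior between them would have to come from which connections survive, which is the property the core claim singles out.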

What carries the argument

The narrow depth-connectivity regime that constrains reachable solution manifolds, identified through compositional sparsity, volume-ratio arguments, and feature-interference bounds.

If this is right

  • Gradient descent reaches compositional solutions only when both the depth and the specific connectivity pattern satisfy the narrow regime.
  • Violating either the depth or connectivity condition causes convergence to fractured rather than compositional solutions.
  • Similarity-based pruning can recover the connectivity pattern that permits compositional solutions.
  • A heuristic depth predictor can locate the depths at which compositionality is most likely for a given target.
  • The theoretical framework of compositional sparsity, volume ratios, and feature-interference bounds accounts for the limited reachability of compositional manifolds.
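The Figure 4 caption states that the depth heuristic relies on the PNG compression ratio of the target image as a complexity measure. A rough sketch of that kind of measure follows. Several things here are assumptions: the ratio is defined as compressed size over raw size (the paper's exact definition is not given on this page), `zlib` DEFLATE on raw bytes stands in for a full PNG encoder, and the two test images are synthetic. No mapping from ratio to depth is shown because none is specified here.

```python
import random
import zlib


def compression_ratio(pixels: bytes) -> float:
    """Compressed size divided by raw size; larger means less compressible."""
    if not pixels:
        raise ValueError("empty image")
    return len(zlib.compress(pixels, 9)) / len(pixels)


def flat_image(width, height, value=128):
    """Grayscale image with a single constant intensity."""
    return bytes([value]) * (width * height)


def noisy_image(width, height, seed=0):
    """Grayscale image with independent uniform random intensities."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(width * height))


flat_ratio = compression_ratio(flat_image(64, 64))
noisy_ratio = compression_ratio(noisy_image(64, 64))
print(f"flat:  {flat_ratio:.4f}")
print(f"noisy: {noisy_ratio:.4f}")
```

Under this definition a constant image scores near zero and a noise image scores near one, so the measure orders targets by how much structure a compressor can exploit.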

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The regime may explain why standard architectures trained end-to-end often fail to exhibit strong compositionality even when the task admits it.
  • Task-specific depth selection or connectivity search could be used to steer training toward compositional solutions without changing the optimizer.
  • Different tasks likely possess different optimal depths inside the regime, requiring per-target tuning rather than a universal depth choice.

Load-bearing premise

The observed failure to reach compositional solutions outside the narrow regime is caused by architecture constraints on depth and specific connectivity rather than by optimization dynamics, data distribution, or initialization.

What would settle it

Demonstrating compositional internal structure in networks whose depth or connectivity lies outside the identified narrow regime would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19941 by Dat H. Do, Dianbo Liu, Duc V. Le, Rushi Shah.

Figure 1: Comparing internal structure (Red=-1, White=0, Blue=1): an evolutionary algorithm (EA) [11] setup can yield factorized, reusable intermediate features, whereas an SGD-trained network [10] often exhibits fragmented, entangled ones. We quantify compositionality via weight sweeping: perturb each nonzero parameter by noise δ and query a calibrated VLM judge under a fixed prompt to assess whether the image stil…
Figure 2: Architectural bias shapes compositionality (Red=-1, White=0, Blue=1). The MLP takes pixel coordinate inputs $(x, y, d = \sqrt{x^2 + y^2}, 1)$ and predicts $(h, s, v)$, which are converted into the RGB image. Each square box visualizes the output/activation map of a single neuron over the image grid. Left: Preserving the NEAT sparse wiring and retraining the MLP yields partially compositional intermediate features.…
Figure 3: Compositional score vs depth offset on Picbreeder artifacts. We vary only the network depth around a reference depth while keeping pruning and training settings fixed. Each artifact exhibits a peak at a target-specific depth, while shallower and deeper networks show reduced modularity. The original Picbreeder CPPNs (see Appendix D) already exhibit different depths across artifacts, suggesting that compositio…
Figure 4: Image complexity predicts optimally compositional depth. We compute the PNG compression ratio for each target image and compare it with the empirically optimal depth found by depth sweep. Higher image complexity tends to correlate with larger optimal compositional depth. Section 2.4 shows that compositionality peaks at a target-specific optimal compositional depth, implying there is no single depth that …
Figure 5: Qualitative result of out-of-domain targets beyond Picbreeder. We apply SP and heuristic depth search to images whose underlying compositional structure is unknown. The model exhibits monosemantic intermediate features, meaningful output changes under weight sweeping, and a depth-ablation peak near the predicted depth. More results are included in Appendix H.4. Combining SP with heuristic depth search …
Figure 6: Theoretical versus empirical volume ratio
Figure 7: Three mechanisms biasing SGD towards the compositional basin.
Figure 8: Feature orthogonality comparison across three Picbreeder artifacts, with and without SP
Figure 9: Sensitivity of the predicted compositional score to the number of primitives
Figure 10: Internal representation of the Picbreeder skull CPPN.
Figure 11: Internal representation of the Picbreeder butterfly CPPN
Figure 12: Internal representation of the Picbreeder apple CPPN
Figure 13: Final training loss on Picbreeder’s skull. Using Muon with more Newton-Schulz steps reaches lower loss. While S-Prune can expose more distinctive subnetworks, the optimizer still strongly influences how fractured the learned solution remains
Figure 14: Full visualization of SP on Picbreeder’s butterfly artifact.
Figure 15: Full visualization of SP on Picbreeder’s apple artifact.
Figure 16: SP on MLPs having 11 layers (optimal depth is 12) on Picbreeder’s skull artifact lead to
Figure 17: SP on MLPs having 13 layers (optimal depth is 12) on Picbreeder’s skull artifact lead to
Figure 18: SP (2 rounds) + Adam on Picbreeder’s skull.
Figure 19: SP (2 rounds) + Muon on Picbreeder’s skull.
Figure 20: SP (2 rounds) + Muon (NS step=20) on Picbreeder’s skull.
Figure 21: SP (2 rounds) + Adam on Picbreeder’s butterfly.
Figure 22: SP (2 rounds) + Muon on Picbreeder’s butterfly.
Figure 23: SP (2 rounds) + Muon (NS step=20) on Picbreeder’s butterfly.
Figure 24: Full visualization of SP on a car image with some corresponding images from weights
Figure 25: Full visualization of SP on a cat image with some corresponding images from weights
Figure 26: Full visualization of SP on a butterfly image.
Figure 27: Full visualization of SP on an image illustrating a red cube and a yellow sphere.
Figure 28: Full visualization of SP on an image illustrating a real butterfly with background.
Figure 29: Full visualization of SP on an image illustrating a real butterfly without background.
Figure 30: SP round 1 - 1404 parameters
Figure 31: Lottery Ticket Hypothesis [13] on skull image, 473 weights
Figure 32: Lottery Ticket Hypothesis [13] on skull image, 1452 weights
Figure 33: Wanda [14] on skull image, 1404 weights
Figure 34: LLM-Pruner [15] on skull image, 1397 weights
Figure 35: SP round 1 on butterfly image, 3405 weights
Figure 36: Lottery Ticket Hypothesis [13] on butterfly image, 468 weights
Figure 37: Lottery Ticket Hypothesis [13] on butterfly image, 3392 weights
Figure 38: Training loss with different optimizers using learning rate 5e-3
Figure 39: Training loss with different optimizers using learning rate 1e-3
Figure 40: Training loss with different optimizers using learning rate 5e-4
Figure 41: Training loss with different optimizers using learning rate 1e-4
Figure 42: Multi-CPPN for Picbreeder’s skull, n=-1
Figure 43: Multi-CPPN for Picbreeder’s butterfly, n=0
Figure 44: Multi-CPPN for Picbreeder’s apple, n=1
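
The input and output encoding described in the Figure 2 caption, coordinate inputs (x, y, d, 1) with d the distance from the origin and (h, s, v) outputs converted to RGB, can be sketched as below. This is a minimal illustration under assumptions: the coordinate range [-1, 1], the grid size, the clamping of outputs to [0, 1], and the function names are not taken from the paper, and no network is included.

```python
import colorsys
import math


def coordinate_inputs(size):
    """Return a size x size grid of (x, y, d, 1) tuples with x, y in [-1, 1]."""
    if size < 2:
        raise ValueError("size must be at least 2")
    axis = [-1.0 + 2.0 * i / (size - 1) for i in range(size)]
    # Rows are indexed by y, columns by x; d is the Euclidean distance to (0, 0).
    return [[(x, y, math.hypot(x, y), 1.0) for x in axis] for y in axis]


def hsv_to_rgb_pixel(h, s, v):
    """Clamp (h, s, v) to [0, 1] and convert to an (r, g, b) tuple."""
    def clamp(t):
        return min(1.0, max(0.0, t))
    return colorsys.hsv_to_rgb(clamp(h), clamp(s), clamp(v))


grid = coordinate_inputs(5)
print(grid[2][2])                       # center pixel
print(hsv_to_rgb_pixel(0.0, 1.0, 1.0))  # fully saturated hue 0
```

A model in this setup would map each 4-tuple in `grid` to an (h, s, v) triple, and each neuron's activation over the grid gives the per-neuron image that the figures visualize.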
Original abstract

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality appears only in some specifically sparse networks and depends heavily on which connections remain rather than on weight sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that compositionality emerges in neural networks only within a narrow depth-connectivity regime: specific sparse connectivity patterns (not mere weight sparsity) along the connectivity axis, and a narrow target-dependent depth range (peaking at specific depths, failing for shallower or deeper nets) along the depth axis. Outside this regime, gradient descent converges to fractured non-compositional solutions. The authors introduce similarity-based pruning (SP) to recover compositional connectivity and a heuristic depth predictor, and support the findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds.

Significance. If the central claim holds, the work identifies architecture constraints that control reachability of compositional solutions under gradient descent, offering both an explanation for why such structure is rare and practical methods (SP and depth heuristic) to induce it. The empirical discovery of the narrow regime combined with the theoretical framing could guide architecture design for compositional generalization tasks.

major comments (2)
  1. [Theoretical Framework] Theoretical Framework section: the compositional sparsity, volume-ratio arguments, and feature-interference bounds are invoked to explain why compositional solutions are reachable only inside the narrow regime. However, these primarily bound manifold measure or interference; they do not derive that gradient-descent trajectories have no connecting paths to compositional solutions outside the regime or that the dynamics are forced into fractured attractors. The 'unreachability' direction therefore rests on extrapolation from observed empirical failures rather than a direct consequence of the bounds.
  2. [Experiments] Empirical results on depth axis (abstract and §Experiments): the claim that both shallower and deeper networks fail to reach compositional solutions is central, yet the manuscript does not report controls that isolate architecture constraints from optimization dynamics, data distribution, or initialization effects. Without such isolation, the narrow-regime conclusion remains vulnerable to the alternative that the failures are optimization artifacts rather than manifold unreachability.
minor comments (2)
  1. The description of similarity-based pruning (SP) would benefit from an explicit algorithm box or pseudocode to clarify how 'which connections remain' are selected versus random or magnitude-based sparsity.
  2. Notation for 'fractured solutions' is used without a formal definition; a short paragraph relating it to the volume-ratio or interference quantities would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below, clarifying the scope of our theoretical results and committing to additional experimental controls where appropriate.

Point-by-point responses
  1. Referee: [Theoretical Framework] Theoretical Framework section: the compositional sparsity, volume-ratio arguments, and feature-interference bounds are invoked to explain why compositional solutions are reachable only inside the narrow regime. However, these primarily bound manifold measure or interference; they do not derive that gradient-descent trajectories have no connecting paths to compositional solutions outside the regime or that the dynamics are forced into fractured attractors. The 'unreachability' direction therefore rests on extrapolation from observed empirical failures rather than a direct consequence of the bounds.

    Authors: We agree that the theoretical framework (compositional sparsity, volume-ratio arguments, and feature-interference bounds) establishes that compositional solution manifolds have larger relative measure and lower interference inside the identified regime, thereby making such solutions more accessible under gradient descent. The framework does not, however, derive a rigorous statement that no connecting paths exist in parameter space outside the regime or that the dynamics are provably trapped in fractured attractors. The unreachability claim outside the regime is therefore supported primarily by the empirical evidence of consistent convergence to fractured solutions across multiple depths, connectivities, and tasks. In revision we will explicitly distinguish the theoretical support for preferential reachability inside the regime from the empirical observation of unreachability outside it. revision: partial

  2. Referee: [Experiments] Empirical results on depth axis (abstract and §Experiments): the claim that both shallower and deeper networks fail to reach compositional solutions is central, yet the manuscript does not report controls that isolate architecture constraints from optimization dynamics, data distribution, or initialization effects. Without such isolation, the narrow-regime conclusion remains vulnerable to the alternative that the failures are optimization artifacts rather than manifold unreachability.

    Authors: We acknowledge that the current experiments do not include exhaustive ablations that fully isolate depth and connectivity constraints from optimizer hyperparameters, initialization distributions, or data variations. While we observe the same narrow-regime pattern across multiple random seeds, datasets, and architectures, additional targeted controls would strengthen the architectural interpretation. We will add such controls (varying learning-rate schedules, initialization scales, and data subsampling while fixing depth/connectivity) in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation remains self-contained

Full rationale

The provided abstract and context describe empirical results on compositionality in a narrow depth-connectivity regime, supported by a theoretical framework of compositional sparsity, volume-ratio arguments, and feature-interference bounds. No equations, self-citations, fitted parameters renamed as predictions, or self-definitional steps are exhibited in the text. The theory is presented as explanatory support for the observed regime rather than reducing to the inputs by construction. Without quotable reductions matching the enumerated patterns, the central claim does not collapse into tautology or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or sections from which free parameters, axioms, or invented entities can be extracted; the theoretical framework is referenced at a high level only.



Reference graph

Works this paper leans on

71 extracted references · 22 canonical work pages · 6 internal anchors

  1. Andrew K Lampinen, Arslan Chaudhry, Stephanie CY Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, and James L McClelland. On the generalization of language models from in-context learning and finetuning: a controlled study. arXiv preprint arXiv:2505.00661, 2025

  2. Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "a is b" fail to learn "b is a". arXiv preprint arXiv:2309.12288, 2023

  3. Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, Xudong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, and Angjoo Kanazawa. Visually prompted benchmarks are surprisingly fragile. ArXiv, abs/2512.17875, 2025

  4. Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, et al. Vp-bench: A comprehensive benchmark for visual prompting in multimodal large language models. arXiv preprint arXiv:2511.11438, 2025

  5. Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023

  6. Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  7. Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  8. Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. Geneval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025

  9. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

  10. Akarsh Kumar, Jeff Clune, Joel Lehman, and Kenneth O. Stanley. Questioning representational optimism in deep learning: The fractured entangled representation hypothesis. arXiv preprint arXiv:2505.11581, 2025

  11. Jimmy Secretan, Nicholas Beato, David B D'Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: evolving pictures collaboratively online. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 1759–1768, 2008

  12. Kenneth O Stanley and Risto Miikkulainen. Efficient evolution of neural network topologies. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), volume 2, pages 1757–1762. IEEE, 2002

  13. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

  14. Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  15. Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023

  16. David A. Danhofer, Davide D’Ascenzo, Rafael Dubach, and Tomaso A. Poggio. Position: A theory of deep learning must include compositional sparsity. ArXiv, abs/2507.02550, 2025

  17. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017

  18. Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017

  19. Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations

  20. Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017

  21. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  22. Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT press, 2017

  23. Dinghuai Zhang, Kartik Ahuja, Yilun Xu, Yisen Wang, and Aaron Courville. Can subnetwork structure be the key to out-of-distribution generalization? In International conference on machine learning, pages 12356–12367. PMLR, 2021

  24. Noah D Goodman, Joshua B Tenenbaum, Jacob Feldman, and Thomas L Griffiths. A rational analysis of rule-based concept learning. Cognitive science, 32(1):108–154, 2008

  25. Steven Phillips and William H Wilson. Categorial compositionality: A category theory explanation for the systematicity of human cognition. PLoS computational biology, 6(7):e1000858, 2010

  26. Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Sussman Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In International conference on machine learning, pages 8489–8510. PMLR, 2023

  27. Sam Spilsbury and Alexander Ilin. Compositional generalization in grounded language learning via induced model sparsity. arXiv preprint arXiv:2207.02518, 2022

  28. Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22691–22702, 2023

  29. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024

  30. [30]

    Does clip bind concepts? probing compositionality in large image models

    Martha Lewis, Nihal Nayak, Peilin Yu, Jack Merullo, Qinan Yu, Stephen Bach, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. InFindings of the Association for Computational Linguistics: EACL 2024, pages 1487–1500, 2024

  31. [31]

    Do vision-language pretrained models learn primitive concepts.arXiv preprint arXiv:2203.17271, 3(5):6, 2022

    Tian Yun, Usha Bhalla, Ellie Pavlick, and Chen Sun. Do vision-language pretrained models learn primitive concepts.arXiv preprint arXiv:2203.17271, 3(5):6, 2022

  32. [32]

    Break it down: Evidence for structural compositionality in neural networks.Advances in Neural Information Processing Systems, 36:42623–42660, 2023

    Michael Lepori, Thomas Serre, and Ellie Pavlick. Break it down: Evidence for structural compositionality in neural networks.Advances in Neural Information Processing Systems, 36:42623–42660, 2023

  33. [33]

    Com- positional generalization from first principles.Advances in Neural Information Processing Systems, 36:6941–6960, 2023

    Thaddäus Wiedemer, Prasanna Mayilvahanan, Matthias Bethge, and Wieland Brendel. Com- positional generalization from first principles.Advances in Neural Information Processing Systems, 36:6941–6960, 2023

  34. [34]

    Optimal brain damage

    Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989

  35. [35]

    Optimal brain surgeon and general network pruning

    Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pages 293–299. IEEE, 1993

  36. [36]

    Learning efficient convolutional networks through network slimming

    Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pages 2736–2744, 2017

  37. [37]

    Depgraph: Towards any structural pruning

    Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16091–16101, 2023

  38. [38]

    Gradient-free structured pruning with unlabeled data

    Azade Nova, Hanjun Dai, and Dale Schuurmans. Gradient-free structured pruning with unlabeled data. In International Conference on Machine Learning, pages 26326–26341. PMLR, 2023

  39. [39]

    Learning both weights and connections for efficient neural network

    Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 28, 2015

  40. [40]

    Why random pruning is all we need to start sparse

    Advait Harshal Gadhikar, Sohom Mukherjee, and Rebekka Burkholz. Why random pruning is all we need to start sparse. In International Conference on Machine Learning, pages 10542–10570. PMLR, 2023

  41. [41]

    Sparsity may cry: Let us fail (current) sparse neural networks together!

    Shiwei Liu, Tianlong Chen, Zhenyu Zhang, Xuxi Chen, Tianjin Huang, Ajay Kumar Jaiswal, and Zhangyang Wang. Sparsity may cry: Let us fail (current) sparse neural networks together! In The Eleventh International Conference on Learning Representations

  42. [42]

    Coreset-based neural network compression

    Abhimanyu Dubey, Moitreya Chatterjee, and Narendra Ahuja. Coreset-based neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 454–470, 2018

  43. [43]

    Pruning convolutional neural networks for resource efficient inference

    Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017

  44. [44]

    Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale

    Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11833–11856, 2023

  45. [45]

    Deja vu: Contextual sparsity for efficient llms at inference time

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023

  46. [46]

    Neurons in large language models: Dead, n-gram, positional

    Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1288–1301, 2024

  47. [47]

    Movement pruning: Adaptive sparsity by fine-tuning

    Victor Sanh, Thomas Wolf, and Alexander Rush. Movement pruning: Adaptive sparsity by fine-tuning. Advances in neural information processing systems, 33:20378–20389, 2020

  48. [48]

    Soft threshold weight reparameterization for learnable sparsity

    Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity. In International conference on machine learning, pages 5544–5555. PMLR, 2020

  49. [49]

    Rethinking the value of network pruning

    Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019

  50. [50]

    A three-regime model of network pruning

    Yefan Zhou, Yaoqing Yang, Arin Chang, and Michael W Mahoney. A three-regime model of network pruning. In International Conference on Machine Learning, pages 42790–42809. PMLR, 2023

  51. [51]

    Comparing rewinding and fine-tuning in neural network pruning

    Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020

  52. [52]

    Concepts and compositionality: in search of the brain’s language of thought

    Steven M Frankland and Joshua D Greene. Concepts and compositionality: in search of the brain’s language of thought. Annual review of psychology, 71(1):273–303, 2020

  53. [53]

    Compositional clustering in task structure learning

    Nicholas T Franklin and Michael J Frank. Compositional clustering in task structure learning. PLoS computational biology, 14(4):e1006116, 2018

  54. [54]

    Compositional visual generation with composable diffusion models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European conference on computer vision, pages 423–439. Springer, 2022

  55. [55]

    Unsupervised learning of compositional energy concepts

    Yilun Du, Shuang Li, Yash Sharma, Josh Tenenbaum, and Igor Mordatch. Unsupervised learning of compositional energy concepts. Advances in Neural Information Processing Systems, 34:15608–15620, 2021

  56. [56]

    Prompting large pre-trained vision-language models for compositional concept learning

    Guangyue Xu, Parisa Kordjamshidi, and Joyce Chai. Prompting large pre-trained vision-language models for compositional concept learning. arXiv preprint arXiv:2211.05077, 2022

  57. [57]

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

  58. [58]

    The role of syntactic planning in compositional image captioning

    Emanuele Bugliarello and Desmond Elliott. The role of syntactic planning in compositional image captioning. arXiv preprint arXiv:2101.11911, 2021

  59. [59]

    Testing relational understanding in text-guided image generation

    Colin Conwell and Tomer Ullman. Testing relational understanding in text-guided image generation. arXiv preprint arXiv:2208.00005, 2022

  60. [60]

    Benchmarking spatial relationships in text-to-image generation

    Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022

  61. [61]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022

  62. [62]

    Measuring compositionality in representation learning

    Jacob Andreas, Marco Baroni, Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, Antoine Bordes, Jacob Devlin, Alona Fyshe, Leila Wehbe, et al. Measuring compositionality in representation learning. In International conference on learning representations, volume 375, pages 2227–2237. Association for Computational Linguistics, 2019

  63. [63]

    Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks

    Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018

  64. [64]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  65. [65]

    Visual representation learning does not generalize strongly within the same domain

    Lukas Schott, Julius Von Kügelgen, Frederik Träuble, Peter Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, and Wieland Brendel. Visual representation learning does not generalize strongly within the same domain. arXiv preprint arXiv:2107.08221, 2021

  66. [66]

    Benchmarking compositionality with formal languages

    Josef Valvoda, Naomi Saphra, Jonathan Rawski, Adina Williams, and Ryan Cotterell. Benchmarking compositionality with formal languages. 2022

  67. [67]

    Conceptmix: A compositional image generation benchmark with controllable difficulty

    Xindi Wu, Dingli Yu, Yangsibo Huang, Olga Russakovsky, and Sanjeev Arora. Conceptmix: A compositional image generation benchmark with controllable difficulty. Advances in Neural Information Processing Systems, 37:86004–86047, 2024

  68. [68]

    Importance estimation for neural network pruning

    Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11264–11272, 2019

  69. [69]

    Structured pruning learns compact and accurate models

    Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, 2022

  70. [70]

    Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

    Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015

  71. [71]

    Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask?

    Mansheej Paul, Feng Chen, Brett W Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina Dziugaite. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? arXiv preprint arXiv:2210.03044, 2022

A Proofs for Section 5

[Figure: axes Width W (3, 6, 12, 24, 48, 96, 192) and Depth L (1 to 9); panel title "(W,L) = C(W,P)/Total [P=3]"; annotation "Predicted sweet spot"] ...