The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models
Pith reviewed 2026-05-14 20:44 UTC · model grok-4.3
The pith
Probabilistic circuits match transformer separation rank only on data partitions aligned with their fixed vtree structure and degrade on heterogeneous dependency topologies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured-decomposable probabilistic circuits can match the separation rank of transformers on vtree-aligned partitions, yet this capacity is confined to partitions consistent with the fixed routing structure, causing severe degradation on data with heterogeneous dependency topologies. Decomposable probabilistic circuits are strictly more expressive than structured-decomposable ones. Parameterizing predictions as convex combinations in probability space creates an additional output bottleneck for sharp language distributions, though logit-space parameterization substantially narrows the gap.
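To make the output bottleneck concrete, here is a minimal sketch with toy numbers of our own choosing rather than anything from the paper: a convex combination of normalized component distributions can never peak higher than its sharpest component, while mixing the same components in logit space before a single softmax can.

```python
import torch

# Two components over a 3-token vocabulary (illustrative values, not the paper's).
logits = torch.tensor([[0.0, -10.0, 0.0],
                       [0.0, 0.0, -10.0]])
weights = torch.tensor([0.5, 0.5])  # convex mixture weights

# Probability-space mixture (the PC-style output the paper critiques):
# normalize each component first, then combine. Each component peaks at ~0.5,
# so the mixture's peak is capped at ~0.5 as well.
p_mix = weights @ torch.softmax(logits, dim=-1)

# Logit-space mixture: combine logits first, normalize once at the end.
# The combined logits (0, -5, -5) yield a peak of ~0.99.
p_logit = torch.softmax(weights @ logits, dim=-1)

print("probability-space peak:", p_mix.max().item())  # ~0.50
print("logit-space peak:", p_logit.max().item())      # ~0.99
```

The cap on the probability-space peak is exactly the convexity argument behind the bottleneck: no reweighting of flat components can produce the sharp, near-one-hot conditionals typical of language.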
What carries the argument
Vtree-aligned separation rank in structured-decomposable probabilistic circuits, which fixes the allowable variable partitions and thereby bounds context-encoding capacity.
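For reference, the standard definition behind this machinery, with notation of our choosing (the paper's exact symbols may differ): the separation rank of a function with respect to a partition of its variables.

```latex
% Separation rank of f with respect to a partition (A, B) of the variables:
% the minimal number of separable terms needed to write f exactly.
\[
  \mathrm{sep}_{(A,B)}(f)
    \;=\; \min\Bigl\{\, R \in \mathbb{N} \;:\;
      f(\mathbf{x}) \;=\; \sum_{r=1}^{R} g_r(\mathbf{x}_A)\, h_r(\mathbf{x}_B)
    \Bigr\}.
\]
% A vtree fixes the recursive variable splits of a structured-decomposable PC,
% so only partitions (A, B) induced by vtree nodes can attain high separation
% rank; other partitions are bounded by the fixed routing structure.
```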
If this is right
- Logit-space parameterization of PC outputs reduces the gap with transformers on sharp distributions.
- Structured PCs exhibit severe degradation precisely when data dependency topologies misalign with the fixed vtree.
- Decomposable PCs are strictly more expressive than structured-decomposable PCs, indicating a route to higher capacity once optimization improves.
- The exact inference advantage of PCs cannot offset these expressivity limits in autoregressive language modeling.
Where Pith is reading between the lines
- Dynamic or data-dependent routing structures inside PCs could bypass the fixed-vtree alignment requirement.
- Hybrid models that retain PC exact inference while borrowing flexible routing from transformers may close the remaining gap.
- Optimization advances for general decomposable PCs would test whether the strict expressivity advantage translates to practical gains.
- The results imply that any architecture with rigid decomposition structure will face analogous limits on data with variable dependency patterns.
Load-bearing premise
Language data contains heterogeneous dependency topologies that systematically misalign with any single fixed vtree structure.
What would settle it
An experiment in which a structured-decomposable PC with one fixed vtree maintains transformer-level performance across language datasets whose dependency topologies are known to be heterogeneous.
original abstract
Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Probabilistic Circuits (PCs) lag behind Transformer-based LLMs in autoregressive language modeling due to two main bottlenecks. First, an output bottleneck arises because PCs parameterize predictions as convex combinations in probability space, which struggles with sharp language distributions; switching to logit-space parameterization narrows the gap. Second, a context-encoding bottleneck is identified where structured-decomposable PCs can match Transformer separation ranks only on vtree-aligned partitions, with severe degradation on data exhibiting heterogeneous dependency topologies. The paper proves that decomposable PCs are strictly more expressive than structured-decomposable ones, though optimizing the former remains open.
Significance. If the central claims hold, this work provides a valuable theoretical and empirical dissection of PC expressivity limits relative to Transformers, explaining performance gaps in language modeling while suggesting concrete fixes like logit parameterization and decomposable structures. The separation-rank analysis and proofs offer a unified framework that could guide future exact-inference models toward closing the gap with LLMs, particularly if the heterogeneous-topology assumption is validated.
major comments (3)
- [Abstract and §3] Context-Encoding Bottleneck: The degradation claim rests on language data systematically exhibiting heterogeneous dependency topologies that misalign with any fixed vtree; this assumption is load-bearing but receives only indirect empirical support. Direct analysis of dependency structures (e.g., via mutual information or parse trees) across datasets is needed to confirm that the misalignment, rather than other factors, drives the observed performance gap.
- [§4] Expressivity Proof: The proof that decomposable PCs are strictly more expressive than structured-decomposable ones must explicitly state the measure (e.g., separation rank or function class) establishing the strict inequality and verify that it holds under the autoregressive formulation used elsewhere; without this, the claim that structured-decomposable PCs are fundamentally limited remains incomplete.
- [Empirical section] Empirical results and associated tables: The reported degradation on misaligned partitions is central, yet details on vtree construction, how alignment is quantified, and baseline controls (e.g., random vs. learned vtrees) are insufficient for verification; this weakens the link between theory and the claimed practical expressivity gap.
minor comments (2)
- [Notation] Define vtree, separation rank, and structured decomposability at first use, with a small illustrative example, to aid readers outside the PC community.
- [Figures] In each figure caption, describe explicitly what the panel shows regarding alignment vs. misalignment, to improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and will incorporate revisions to improve clarity, completeness, and empirical support.
point-by-point responses
Referee: [Abstract and §3] Context-Encoding Bottleneck: The degradation claim rests on language data systematically exhibiting heterogeneous dependency topologies that misalign with any fixed vtree; this assumption is load-bearing but receives only indirect empirical support. Direct analysis of dependency structures (e.g., via mutual information or parse trees) across datasets is needed to confirm that the misalignment, rather than other factors, drives the observed performance gap.
Authors: We appreciate the referee's emphasis on strengthening the empirical grounding of the heterogeneous topology assumption. Our current results show substantial performance degradation specifically on misaligned partitions (while controlling for model size and training), which we interpret as evidence that language data exhibits such heterogeneity. To directly address the concern, the revised manuscript will include an additional analysis subsection quantifying dependency structures via mutual information matrices computed on the training data, along with a comparison of alignment scores between learned vtrees and these structures. This will provide more explicit confirmation that topology mismatch, rather than other factors, drives the observed gaps. revision: yes
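For a sense of what such an analysis could look like, the following is a minimal sketch rather than the authors' code: a plug-in pairwise mutual-information estimate over sequence positions, plus a crude alignment score for one vtree split (the estimator and the score are our assumptions).

```python
import numpy as np

def pairwise_mi(data, vocab_size):
    """Plug-in mutual information between every pair of sequence positions.
    data: integer array of shape (num_sequences, seq_len)."""
    n, T = data.shape
    mi = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            joint = np.zeros((vocab_size, vocab_size))
            for a, b in zip(data[:, i], data[:, j]):
                joint[a, b] += 1.0
            joint /= n
            indep = np.outer(joint.sum(1), joint.sum(0))  # product of marginals
            nz = joint > 0
            mi[i, j] = mi[j, i] = np.sum(joint[nz] * np.log(joint[nz] / indep[nz]))
    return mi

def split_alignment(mi, left_vars):
    """Fraction of total pairwise MI kept inside the two sides of a split
    rather than cut across it; 1.0 means the split severs no dependency."""
    left = np.zeros(mi.shape[0], dtype=bool)
    left[list(left_vars)] = True
    cut = mi[left][:, ~left].sum()
    total = mi.sum() / 2.0  # mi is symmetric with zero diagonal
    return 1.0 - cut / total if total > 0 else 1.0
```

Averaging this score over a vtree's internal splits, and comparing learned against random vtrees, would make the promised alignment analysis quantitative.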
Referee: [§4] Expressivity Proof: The proof that decomposable PCs are strictly more expressive than structured-decomposable ones must explicitly state the measure (e.g., separation rank or function class) establishing the strict inequality and verify that it holds under the autoregressive formulation used elsewhere; without this, the claim that structured-decomposable PCs are fundamentally limited remains incomplete.
Authors: We thank the referee for this precise observation. The proof in §4 establishes the strict inequality using separation rank as the expressivity measure: we construct a family of functions whose separation rank exceeds what any structured-decomposable PC can achieve for a given vtree, while decomposable PCs represent them exactly. The construction is carried out in the autoregressive setting by factorizing the joint distribution as a product of conditionals. In the revision, we will state the measure explicitly at the beginning of the proof and add a dedicated paragraph verifying that the strict separation-rank gap persists under the autoregressive formulation used in the rest of the paper. revision: yes
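The gating construction visible in the paper's appendix conveys the flavor of this separation: take two functions A and B over y that admit efficient structured-decomposable circuits only under mutually incompatible vtrees, and switch between them with one extra variable.

```latex
% Restated from the paper's appendix: no single vtree on Y serves both A and B.
\[
  F(\mathbf{y}, s) \;=\; \mathbb{1}[s = 1]\, A(\mathbf{y})
                   \;+\; \mathbb{1}[s = 0]\, B(\mathbf{y}).
\]
% F admits a polynomial-size decomposable PC: one sum node over two product
% branches, each free to decompose Y in the way its own function requires.
% A structured-decomposable PC must commit to a single vtree for both
% branches, so it cannot represent F in polynomial size.
```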
Referee: [Empirical section] Empirical results and associated tables: The reported degradation on misaligned partitions is central, yet details on vtree construction, how alignment is quantified, and baseline controls (e.g., random vs. learned vtrees) are insufficient for verification; this weakens the link between theory and the claimed practical expressivity gap.
Authors: We agree that greater detail on the experimental setup is necessary for reproducibility and to solidify the theory-empirics connection. The revised empirical section will expand the description of vtree construction (including the specific bottom-up clustering heuristic and hyperparameters), introduce an explicit alignment metric (average normalized mutual information between variable pairs and vtree partitions), and report new baseline results using randomly generated vtrees of matching size. These additions will allow readers to verify that the performance differences arise from alignment rather than other implementation choices. revision: yes
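For the random-vtree baseline mentioned above, a generator can be as simple as the sketch below; this is our illustration of the baseline's shape, not the paper's construction heuristic.

```python
import random

def random_vtree(variables, rng=random):
    """Build a random vtree: shuffle the variables, split at a random point,
    and recurse; leaves are single variables, internal nodes are pairs."""
    if len(variables) == 1:
        return variables[0]
    vars_ = list(variables)
    rng.shuffle(vars_)
    k = rng.randint(1, len(vars_) - 1)  # both sides stay non-empty
    return (random_vtree(vars_[:k], rng), random_vtree(vars_[k:], rng))

# Example: one random vtree over eight sequence positions.
print(random_vtree(list(range(8))))
```

Matching the random vtrees' size to the learned ones, as the response proposes, is what isolates alignment as the explanatory variable.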
Circularity Check
No circularity: proofs rely on external separation-rank results and standard PC definitions
full rationale
The paper's derivation chain consists of theoretical proofs establishing that structured-decomposable PCs match Transformer separation rank exactly on vtree-aligned partitions and that decomposable PCs are strictly more expressive than structured-decomposable ones. These steps invoke established separation-rank properties of Transformers and the standard definitions of vtrees and decomposability in probabilistic circuits; they do not reduce any claimed prediction or expressivity bound to a fitted parameter, self-referential equation, or load-bearing self-citation. The empirical degradation result is conditioned on the external modeling assumption of heterogeneous dependency topologies in language data, which is not derived from the paper's own equations. No self-definitional, fitted-input, or ansatz-smuggling patterns appear.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard axioms of probability and convex combinations in output parameterization
- domain assumption: Properties of separation rank for Transformers and of decomposability for PCs, from prior literature