The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models
Pith reviewed 2026-05-14 20:44 UTC · model grok-4.3
The pith
Probabilistic circuits match transformer separation rank only on data partitions aligned with their fixed vtree structure and degrade on heterogeneous dependency topologies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Structured-decomposable probabilistic circuits can match the separation rank of transformers on vtree-aligned partitions, yet this capacity is confined to partitions consistent with the fixed routing structure, causing severe degradation on data with heterogeneous dependency topologies. Decomposable probabilistic circuits are strictly more expressive than structured-decomposable ones. Parameterizing predictions as convex combinations in probability space creates an additional output bottleneck for sharp language distributions, though logit-space parameterization substantially narrows the gap.
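To make the output bottleneck concrete, here is a minimal sketch with toy numbers of our own choosing rather than anything from the paper: a convex combination of normalized component distributions can never peak higher than its sharpest component, while mixing the same components in logit space before a single softmax can.

```python
import torch

# Two components over a 3-token vocabulary (illustrative values, not the paper's).
logits = torch.tensor([[0.0, -10.0, 0.0],
                       [0.0, 0.0, -10.0]])
weights = torch.tensor([0.5, 0.5])  # convex mixture weights

# Probability-space mixture (the PC-style output the paper critiques):
# normalize each component first, then combine. Each component peaks at ~0.5,
# so the mixture's peak is capped at ~0.5 as well.
p_mix = weights @ torch.softmax(logits, dim=-1)

# Logit-space mixture: combine logits first, normalize once at the end.
# The combined logits (0, -5, -5) yield a peak of ~0.99.
p_logit = torch.softmax(weights @ logits, dim=-1)

print("probability-space peak:", p_mix.max().item())  # ~0.50
print("logit-space peak:", p_logit.max().item())      # ~0.99
```

The cap on the probability-space peak is exactly the convexity argument behind the bottleneck: no reweighting of flat components can produce the sharp, near-one-hot conditionals typical of language.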
What carries the argument
Vtree-aligned separation rank in structured-decomposable probabilistic circuits, which fixes the allowable variable partitions and thereby bounds context-encoding capacity.
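For reference, the standard definition behind this machinery, with notation of our choosing (the paper's exact symbols may differ): the separation rank of a function with respect to a partition of its variables.

```latex
% Separation rank of f with respect to a partition (A, B) of the variables:
% the minimal number of separable terms needed to write f exactly.
\[
  \mathrm{sep}_{(A,B)}(f)
    \;=\; \min\Bigl\{\, R \in \mathbb{N} \;:\;
      f(\mathbf{x}) \;=\; \sum_{r=1}^{R} g_r(\mathbf{x}_A)\, h_r(\mathbf{x}_B)
    \Bigr\}.
\]
% A vtree fixes the recursive variable splits of a structured-decomposable PC,
% so only partitions (A, B) induced by vtree nodes can attain high separation
% rank; other partitions are bounded by the fixed routing structure.
```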
If this is right
- Logit-space parameterization of PC outputs reduces the gap with transformers on sharp distributions.
- Structured PCs exhibit severe degradation precisely when data dependency topologies misalign with the fixed vtree.
- Decomposable PCs are strictly more expressive than structured-decomposable PCs, indicating a route to higher capacity once optimization improves.
- The exact inference advantage of PCs cannot offset these expressivity limits in autoregressive language modeling.
Where Pith is reading between the lines
- Dynamic or data-dependent routing structures inside PCs could bypass the fixed-vtree alignment requirement.
- Hybrid models that retain PC exact inference while borrowing flexible routing from transformers may close the remaining gap.
- Optimization advances for general decomposable PCs would test whether the strict expressivity advantage translates to practical gains.
- The results imply that any architecture with rigid decomposition structure will face analogous limits on data with variable dependency patterns.
Load-bearing premise
Language data contains heterogeneous dependency topologies that systematically misalign with any single fixed vtree structure.
What would settle it
An experiment in which a structured-decomposable PC with one fixed vtree maintains transformer-level performance across language datasets whose dependency topologies are known to be heterogeneous.
original abstract
Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Probabilistic Circuits (PCs) lag behind Transformer-based LLMs in autoregressive language modeling due to two main bottlenecks. First, an output bottleneck arises because PCs parameterize predictions as convex combinations in probability space, which struggles with sharp language distributions; switching to logit-space parameterization narrows the gap. Second, a context-encoding bottleneck is identified where structured-decomposable PCs can match Transformer separation ranks only on vtree-aligned partitions, with severe degradation on data exhibiting heterogeneous dependency topologies. The paper proves that decomposable PCs are strictly more expressive than structured-decomposable ones, though optimizing the former remains open.
Significance. If the central claims hold, this work provides a valuable theoretical and empirical dissection of PC expressivity limits relative to Transformers, explaining performance gaps in language modeling while suggesting concrete fixes like logit parameterization and decomposable structures. The separation-rank analysis and proofs offer a unified framework that could guide future exact-inference models toward closing the gap with LLMs, particularly if the heterogeneous-topology assumption is validated.
major comments (3)
- [Abstract and §3] Context-Encoding Bottleneck: The degradation claim rests on language data systematically exhibiting heterogeneous dependency topologies that misalign with any fixed vtree; this assumption is load-bearing but receives only indirect empirical support. Direct analysis of dependency structures (e.g., via mutual information or parse trees) across datasets is needed to confirm that the misalignment, rather than other factors, drives the observed performance gap.
- [§4] Expressivity Proof: The proof that decomposable PCs are strictly more expressive than structured-decomposable ones must explicitly state the measure (e.g., separation rank or function class) establishing the strict inequality and verify that it holds under the autoregressive formulation used elsewhere; without this, the claim that structured-decomposable PCs are fundamentally limited remains incomplete.
- [Empirical section] Empirical results and associated tables: The reported degradation on misaligned partitions is central, yet details on vtree construction, how alignment is quantified, and baseline controls (e.g., random vs. learned vtrees) are insufficient for verification; this weakens the link between theory and the claimed practical expressivity gap.
minor comments (2)
- [Notation] Define vtree, separation rank, and structured decomposability at first use, with a small illustrative example, to aid readers outside the PC community.
- [Figures] In each figure caption, describe explicitly what the panel shows regarding alignment vs. misalignment, to improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and will incorporate revisions to improve clarity, completeness, and empirical support.
point-by-point responses
Referee: [Abstract and §3] Context-Encoding Bottleneck: The degradation claim rests on language data systematically exhibiting heterogeneous dependency topologies that misalign with any fixed vtree; this assumption is load-bearing but receives only indirect empirical support. Direct analysis of dependency structures (e.g., via mutual information or parse trees) across datasets is needed to confirm that the misalignment, rather than other factors, drives the observed performance gap.
Authors: We appreciate the referee's emphasis on strengthening the empirical grounding of the heterogeneous topology assumption. Our current results show substantial performance degradation specifically on misaligned partitions (while controlling for model size and training), which we interpret as evidence that language data exhibits such heterogeneity. To directly address the concern, the revised manuscript will include an additional analysis subsection quantifying dependency structures via mutual information matrices computed on the training data, along with a comparison of alignment scores between learned vtrees and these structures. This will provide more explicit confirmation that topology mismatch, rather than other factors, drives the observed gaps. revision: yes
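For a sense of what such an analysis could look like, the following is a minimal sketch rather than the authors' code: a plug-in pairwise mutual-information estimate over sequence positions, plus a crude alignment score for one vtree split (the estimator and the score are our assumptions).

```python
import numpy as np

def pairwise_mi(data, vocab_size):
    """Plug-in mutual information between every pair of sequence positions.
    data: integer array of shape (num_sequences, seq_len)."""
    n, T = data.shape
    mi = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            joint = np.zeros((vocab_size, vocab_size))
            for a, b in zip(data[:, i], data[:, j]):
                joint[a, b] += 1.0
            joint /= n
            indep = np.outer(joint.sum(1), joint.sum(0))  # product of marginals
            nz = joint > 0
            mi[i, j] = mi[j, i] = np.sum(joint[nz] * np.log(joint[nz] / indep[nz]))
    return mi

def split_alignment(mi, left_vars):
    """Fraction of total pairwise MI kept inside the two sides of a split
    rather than cut across it; 1.0 means the split severs no dependency."""
    left = np.zeros(mi.shape[0], dtype=bool)
    left[list(left_vars)] = True
    cut = mi[left][:, ~left].sum()
    total = mi.sum() / 2.0  # mi is symmetric with zero diagonal
    return 1.0 - cut / total if total > 0 else 1.0
```

Averaging this score over a vtree's internal splits, and comparing learned against random vtrees, would make the promised alignment analysis quantitative.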
Referee: [§4] Expressivity Proof: The proof that decomposable PCs are strictly more expressive than structured-decomposable ones must explicitly state the measure (e.g., separation rank or function class) establishing the strict inequality and verify that it holds under the autoregressive formulation used elsewhere; without this, the claim that structured-decomposable PCs are fundamentally limited remains incomplete.
Authors: We thank the referee for this precise observation. The proof in §4 establishes the strict inequality using separation rank as the expressivity measure: we construct a family of functions whose separation rank exceeds what any structured-decomposable PC can achieve for a given vtree, while decomposable PCs represent them exactly. The construction is carried out in the autoregressive setting by factorizing the joint distribution as a product of conditionals. In the revision, we will state the measure explicitly at the beginning of the proof and add a dedicated paragraph verifying that the strict separation-rank gap persists under the autoregressive formulation used in the rest of the paper. revision: yes
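The gating construction visible in the paper's appendix conveys the flavor of this separation: take two functions A and B over y that admit efficient structured-decomposable circuits only under mutually incompatible vtrees, and switch between them with one extra variable.

```latex
% Restated from the paper's appendix: no single vtree on Y serves both A and B.
\[
  F(\mathbf{y}, s) \;=\; \mathbb{1}[s = 1]\, A(\mathbf{y})
                   \;+\; \mathbb{1}[s = 0]\, B(\mathbf{y}).
\]
% F admits a polynomial-size decomposable PC: one sum node over two product
% branches, each free to decompose Y in the way its own function requires.
% A structured-decomposable PC must commit to a single vtree for both
% branches, so it cannot represent F in polynomial size.
```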
Referee: [Empirical section] Empirical results and associated tables: The reported degradation on misaligned partitions is central, yet details on vtree construction, how alignment is quantified, and baseline controls (e.g., random vs. learned vtrees) are insufficient for verification; this weakens the link between theory and the claimed practical expressivity gap.
Authors: We agree that greater detail on the experimental setup is necessary for reproducibility and to solidify the theory-empirics connection. The revised empirical section will expand the description of vtree construction (including the specific bottom-up clustering heuristic and hyperparameters), introduce an explicit alignment metric (average normalized mutual information between variable pairs and vtree partitions), and report new baseline results using randomly generated vtrees of matching size. These additions will allow readers to verify that the performance differences arise from alignment rather than other implementation choices. revision: yes
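For the random-vtree baseline mentioned above, a generator can be as simple as the sketch below; this is our illustration of the baseline's shape, not the paper's construction heuristic.

```python
import random

def random_vtree(variables, rng=random):
    """Build a random vtree: shuffle the variables, split at a random point,
    and recurse; leaves are single variables, internal nodes are pairs."""
    if len(variables) == 1:
        return variables[0]
    vars_ = list(variables)
    rng.shuffle(vars_)
    k = rng.randint(1, len(vars_) - 1)  # both sides stay non-empty
    return (random_vtree(vars_[:k], rng), random_vtree(vars_[k:], rng))

# Example: one random vtree over eight sequence positions.
print(random_vtree(list(range(8))))
```

Matching the random vtrees' size to the learned ones, as the response proposes, is what isolates alignment as the explanatory variable.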
Circularity Check
No circularity: proofs rely on external separation-rank results and standard PC definitions
full rationale
The paper's derivation chain consists of theoretical proofs establishing that structured-decomposable PCs match Transformer separation rank exactly on vtree-aligned partitions and that decomposable PCs are strictly more expressive than structured-decomposable ones. These steps invoke established separation-rank properties of Transformers and the standard definitions of vtrees and decomposability in probabilistic circuits; they do not reduce any claimed prediction or expressivity bound to a fitted parameter, self-referential equation, or load-bearing self-citation. The empirical degradation result is conditioned on the external modeling assumption of heterogeneous dependency topologies in language data, which is not derived from the paper's own equations. No self-definitional, fitted-input, or ansatz-smuggling patterns appear.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard axioms of probability and convex combinations in output parameterization
- domain assumption: Properties of separation rank for Transformers and of decomposability for PCs, from prior literature