pith. machine review for the scientific record.

arxiv: 2605.06148 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:50 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords discrete image tokenizers · autoregressive priors · Wasserstein gradient flow · prior consistency · CIFAR-10 · ImageNet · image generation · Tripartite Variational Consistency

The pith

Adding a Wasserstein-gradient-flow prior-matching signal during tokenizer training aligns discrete tokens with autoregressive priors, lowering prediction loss and improving generation quality at unchanged reconstruction fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard two-stage training first optimizes a discrete image tokenizer for reconstruction and then fits an autoregressive prior to its frozen outputs. This leaves the tokenizer unaware of how easy or hard its tokens will be for the prior to predict left-to-right. The paper introduces a distribution-level prior-matching term, optimized via Wasserstein gradient flow, that runs concurrently with reconstruction. For hard categorical tokens the flow update collapses to a simple contrast between an auxiliary autoregressive tracker of the current token distribution and the target prior; only forward passes through the two autoregressive models are required. On CIFAR-10 and ImageNet the resulting tokenizer yields lower autoregressive loss and better FID scores while reconstruction metrics remain comparable.
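
To make the shape of the objective concrete, here is a minimal sketch of a concurrent training step. The module names `encoder`, `quantize`, `decoder`, the MSE reconstruction term (work in this line often uses richer VQGAN-style losses), and the weighting coefficient `lam` are all illustrative assumptions, not details from the paper; the forward-only prior-matching term itself is sketched under "What carries the argument" below.

```python
import torch.nn.functional as F

def training_step(x, encoder, quantize, decoder, prior_match, lam=0.1):
    """Hypothetical concurrent step: reconstruction plus prior matching.

    All module names and `lam` are illustrative; `prior_match` stands in
    for the forward-only WGF contrast term, which would internally run
    forward passes of the auxiliary and target AR models.
    """
    z_q, token_logits, tokens = quantize(encoder(x))  # hard tokens + logits
    x_hat = decoder(z_q)

    recon = F.mse_loss(x_hat, x)            # reconstruction objective, unchanged
    pm = prior_match(token_logits, tokens)  # no gradient through either AR model
    return recon + lam * pm
```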

Core claim

Tripartite Variational Consistency decomposes latent-variable learning into conditional-likelihood, prior, and posterior consistency. Two-stage training satisfies the first and third but leaves prior consistency outside the tokenizer objective. Inserting a Wasserstein-gradient-flow update supplies the missing prior consistency by driving the tokenizer-induced token distribution toward the distribution the target autoregressive prior expects, without back-propagating through either autoregressive model and without altering the reconstruction loss.
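
The paper's verbatim equations are not quoted here, but one plausible KL-style reading of the three conditions, assuming \(q_\phi(z \mid x)\) is the tokenizer's encoding distribution, \(q_\phi(z)\) its aggregate over the data, \(p_\psi(x \mid z)\) the decoder, and \(p_\theta(z)\) the AR prior, would be:

```latex
% A plausible rendering under the assumptions above, not the paper's verbatim TVC.
% Conditional-likelihood consistency (the reconstruction side):
\mathcal{L}_{\mathrm{cond}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\,
  \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[-\log p_\psi(x \mid z)\big]
% Prior consistency (the aggregate token distribution matches the AR prior):
\mathcal{L}_{\mathrm{prior}} = \mathrm{KL}\big(q_\phi(z)\,\big\|\,p_\theta(z)\big),
  \qquad q_\phi(z) = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[q_\phi(z \mid x)\big]
% Posterior consistency (the encoder matches the model's posterior):
\mathcal{L}_{\mathrm{post}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\,
  \mathrm{KL}\big(q_\phi(z \mid x)\,\big\|\,p_{\theta,\psi}(z \mid x)\big)
```

Under this reading, two-stage training optimizes the reconstruction and posterior sides while the prior term is minimized only over \(\theta\) with \(q_\phi(z)\) frozen, which is exactly the gap the WGF update targets.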

What carries the argument

Wasserstein-gradient-flow prior-matching update that, for hard categorical tokens, reduces to a token-level contrast between an auxiliary AR model tracking the tokenizer distribution and the target AR prior, using only forward passes.
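
A minimal PyTorch-style sketch of how such a forward-only contrast could look, assuming the distribution-level objective is KL(q ∥ p) so that its first variation at a token is log q − log p, and assuming the tokenizer exposes per-position logits over the codebook (e.g., codebook distances turned into logits). All names are illustrative; both AR log-probabilities come from forward passes and are detached, so no gradient touches either AR model.

```python
import torch
import torch.nn.functional as F

def prior_matching_contrast(token_logits, tokens, aux_logprob, prior_logprob):
    """Forward-only WGF-style contrast for hard categorical tokens (sketch).

    token_logits:  (B, L, K) tokenizer logits over the codebook
    tokens:        (B, L)    hard token indices produced by quantization
    aux_logprob:   (B, L)    log p_aux(token | prefix), auxiliary AR tracker
    prior_logprob: (B, L)    log p_prior(token | prefix), target AR prior
    Both AR log-probs are detached: the update needs only forward passes.
    """
    log_q = F.log_softmax(token_logits, dim=-1)
    log_q_tok = log_q.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    # First variation of KL(q || p) at the selected token: log q - log p,
    # with the tracker standing in for the current token distribution q.
    weight = (aux_logprob - prior_logprob).detach()

    # Score-function-style estimator: reduce tokenizer mass on tokens the
    # tracker rates more likely than the target prior does, and vice versa.
    return (weight * log_q_tok).mean()
```

Whether tokens are sampled or taken by argmax at this step is a design choice the sketch leaves open; sampling gives the standard score-function estimator, argmax a straight-through-style pseudo-gradient.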

If this is right

  • Tokens produced by the tokenizer become easier for a fixed autoregressive prior to model, directly lowering autoregressive cross-entropy.
  • Generation FID improves on both CIFAR-10 and ImageNet while reconstruction quality stays at the level of standard two-stage training.
  • No gradients flow through the autoregressive models, keeping the added prior-matching step computationally lightweight and stable.
  • The same auxiliary-and-target contrast mechanism can be applied whenever a discrete tokenizer is trained for use with an autoregressive prior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on other discrete latent spaces such as audio or text tokens to check whether prior consistency during tokenization generalizes beyond images.
  • Because the update uses only forward passes, it may allow joint training of tokenizer and prior in a single end-to-end loop without extra memory cost.
  • If the auxiliary AR tracker is replaced by a simpler density estimator, the method might become applicable to non-autoregressive priors as well.

Load-bearing premise

The Wasserstein-gradient-flow prior-matching update can be applied to hard categorical tokens using only forward passes through auxiliary and target AR models without degrading the reconstruction objective or introducing instability.

What would settle it

Run the wAR-Tok procedure on CIFAR-10 or ImageNet: a drop in reconstruction PSNR/SSIM below the two-stage baseline, or an increase in generation FID or autoregressive loss, would falsify the claim; matched reconstruction together with lower AR loss and FID would confirm it.
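
On the measurement side, a minimal sketch of how those quantities could be tracked with torchmetrics; the batch shapes, the [0, 1] float convention, and the uint8 cast for FID are assumptions of this sketch, not details from the paper.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
fid = FrechetInceptionDistance(feature=2048)  # expects uint8 images by default

def update_metrics(real, recon, samples):
    """real/recon: float tensors in [0, 1], shape (B, 3, H, W);
    samples: images generated by the AR prior, same shape (assumed)."""
    psnr.update(recon, real)
    ssim.update(recon, real)
    fid.update((real * 255).to(torch.uint8), real=True)
    fid.update((samples * 255).to(torch.uint8), real=False)

# After looping over the eval set:
# psnr.compute() and ssim.compute() check the reconstruction side;
# fid.compute() checks the generation side against the two-stage baseline.
```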

original abstract

Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes wAR-Tok, a method for training discrete image tokenizers that incorporates a distribution-level prior-matching signal derived from Wasserstein gradient flow (WGF) during the tokenizer optimization stage. Motivated by a Tripartite Variational Consistency (TVC) analysis showing that standard two-stage training leaves prior consistency unaddressed, the approach adds this signal while keeping the reconstruction objective unchanged. For hard categorical tokens the WGF update is reduced to a token-level contrast between an auxiliary AR model (tracking the current tokenizer output distribution) and the target AR prior, using only forward passes through both models with no back-propagation. Experiments on CIFAR-10 and ImageNet report that the resulting tokenizer achieves lower AR loss and improved generation FID at comparable reconstruction quality to baselines.

Significance. If the central claims hold, the work provides a principled and computationally lightweight way to make discrete tokenizers aware of the autoregressive prior they will be used with, potentially improving downstream generation quality in AR-based image models without retraining the prior or sacrificing reconstruction. The TVC framing and the forward-only reduction of WGF are conceptually clean contributions that could generalize to other discrete latent-variable settings.

major comments (3)
  1. [§3.2] WGF reduction for hard tokens: The derivation claims the update reduces to a forward-only token-level contrast that leaves the reconstruction objective intact, but the manuscript provides no explicit verification (analytic or empirical) that the effective gradient on the quantization step does not increase reconstruction error or introduce instability once applied to non-differentiable tokens. This is load-bearing for the central claim that reconstruction quality remains comparable.
  2. [§4, Tables 1–2] Experiments: Reported reductions in AR loss and improvements in FID on CIFAR-10 and ImageNet are given without error bars, standard deviations across runs, or ablation controls isolating the contribution of the WGF prior-matching term versus other training choices (e.g., auxiliary AR capacity or update frequency). This prevents assessment of whether the gains are robust or statistically significant.
  3. [§3.3] Auxiliary AR tracker: The auxiliary AR is described only as tracking the tokenizer’s current distribution, yet no analysis or experiments address potential lag, oscillation, or drift as the tokenizer evolves, which directly affects whether the token-level contrast remains a faithful approximation of the WGF prior-matching signal.
minor comments (3)
  1. [§2] The notation for the TVC decomposition (conditional-likelihood, prior, and posterior consistency) is introduced without a compact equation summarizing the three terms; adding such an equation would improve readability.
  2. [§4] Hyper-parameter settings for the auxiliary AR model (architecture, training schedule, learning rate) and the exact frequency of the WGF update are not stated in the main text or appendix, hindering reproducibility.
  3. [§1] Related work on Wasserstein gradient flows applied to discrete or categorical distributions is cited only lightly; a brief discussion of how the present reduction differs from prior discrete WGF approximations would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address each major comment point by point below, providing our responses and indicating the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3.2] WGF reduction for hard tokens: The derivation claims the update reduces to a forward-only token-level contrast that leaves the reconstruction objective intact, but the manuscript provides no explicit verification (analytic or empirical) that the effective gradient on the quantization step does not increase reconstruction error or introduce instability once applied to non-differentiable tokens. This is load-bearing for the central claim that reconstruction quality remains comparable.

    Authors: The WGF update for hard tokens is designed such that it operates solely on forward passes of the auxiliary and target AR models, providing a distributional contrast signal without requiring differentiation through the non-differentiable quantization or the AR models themselves. Consequently, the gradients for the reconstruction objective (which flow through the encoder and decoder) are unaffected by this term. To provide explicit verification, we will include in the revision both a brief analytic note on why the quantization step remains unchanged and empirical plots showing that reconstruction metrics (e.g., rFID) stay stable throughout training with the added term, comparable to the baseline without it. revision: yes

  2. Referee: [§4, Tables 1–2] Experiments: Reported reductions in AR loss and improvements in FID on CIFAR-10 and ImageNet are given without error bars, standard deviations across runs, or ablation controls isolating the contribution of the WGF prior-matching term versus other training choices (e.g., auxiliary AR capacity or update frequency). This prevents assessment of whether the gains are robust or statistically significant.

    Authors: We acknowledge that the lack of error bars and ablations makes it difficult to fully assess the statistical significance and robustness of the reported improvements. In the revised manuscript, we will conduct additional experiments with multiple random seeds and report means with standard deviations for the key metrics in Tables 1 and 2. We will also add an ablation study that isolates the effect of the WGF prior-matching term by training a variant without this component under otherwise identical conditions. This will allow readers to better evaluate the contribution of our proposed method. revision: yes

  3. Referee: [§3.3] Auxiliary AR tracker: The auxiliary AR is described only as tracking the tokenizer’s current distribution, yet no analysis or experiments address potential lag, oscillation, or drift as the tokenizer evolves, which directly affects whether the token-level contrast remains a faithful approximation of the WGF prior-matching signal.

    Authors: The auxiliary AR model is updated at regular intervals during tokenizer training to follow the changing token distribution. While we chose the update frequency to balance accuracy and efficiency, we agree that further analysis would strengthen the claim. In the revision, we will add a discussion of the update schedule and include empirical results, such as the evolution of the auxiliary AR's loss or the token distribution divergence over training epochs, to show that lag and drift are minimal and do not compromise the approximation. If needed, we can also experiment with more frequent updates. revision: partial
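
For concreteness, a sketch of what such an interleaved tracker schedule could look like; `update_every`, the single-gradient-step refresh, and the `aux_ar.log_prob` interface are assumptions of this sketch, not the paper's stated schedule.

```python
def train(loader, step_fn, tokenizer_opt, aux_ar, aux_opt, update_every=5):
    """Interleave tokenizer updates with periodic auxiliary-AR refreshes.

    step_fn performs one tokenizer step (reconstruction + prior matching)
    and returns the current hard token sequences; aux_ar is then refit by
    maximum likelihood on those sequences so it keeps tracking the evolving
    token distribution. All names are hypothetical.
    """
    for step, batch in enumerate(loader):
        tokens = step_fn(batch, tokenizer_opt)  # tokenizer + prior-match step

        if step % update_every == 0:
            # One MLE step keeps the tracker close to the current token
            # distribution; more frequent updates reduce lag at extra cost.
            aux_opt.zero_grad()
            nll = -aux_ar.log_prob(tokens.detach()).mean()
            nll.backward()
            aux_opt.step()
```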

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper's core chain begins with an external analysis via the newly introduced Tripartite Variational Consistency (TVC) decomposition, which isolates the missing prior-consistency term in standard two-stage training. It then augments the tokenizer objective with a Wasserstein-gradient-flow prior-matching signal whose discrete-token implementation is derived as a forward-only contrast between an auxiliary AR tracker (fitted only to the evolving tokenizer output distribution) and a separate target AR prior. Neither the auxiliary tracker nor the target prior is defined in terms of the final AR loss or reconstruction metric; the update is an external distribution-matching mechanism that does not back-propagate through the AR models and leaves the reconstruction loss unchanged by construction. No equation or claim reduces a reported prediction (lower AR loss, better FID) to a quantity that was fitted or renamed inside the same model. The reported empirical gains on CIFAR-10 and ImageNet are therefore independent outcomes rather than tautological restatements of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the newly introduced Tripartite Variational Consistency decomposition and on the assumption that a Wasserstein-gradient-flow update can be realized with only forward passes for hard categorical tokens. No explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Tripartite Variational Consistency decomposes latent-variable learning into conditional-likelihood consistency, prior consistency, and posterior consistency.
    This decomposition is used to diagnose the missing prior-consistency term in two-stage tokenizer training.

pith-pipeline@v0.9.0 · 5540 in / 1388 out tokens · 46116 ms · 2026-05-08T13:50:59.375425+00:00 · methodology

discussion (0)

