pith. sign in

arxiv: 2605.30311 · v1 · pith:ZCNJOV24new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Pith reviewed 2026-06-29 08:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords unified multimodal modeldigital human generationautoregressive modeltalking video generationsemantic video reparameterizationmultimodal tokenizationcross-modal generation
0
0 comments X

The pith

Archon pretrains one autoregressive model on seven modalities and 72 tasks to generate digital humans holistically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Archon as a fully pretrained unified multimodal model for creating digital humans that combine text, audio, motion, and visual content. It uses modality-specific tokenizers inside a single native autoregressive model trained on synchronized data across 72 tasks to learn joint distributions. Special components address high token counts in video by cutting them fourfold with semantic reparameterization and a diffusion decoder, while a stepwise modality-thinking chain handles ambiguous cross-modal requests. The central goal is to show this single framework matches or exceeds specialized systems on diverse avatar generation tasks without separate fine-tuning per modality.

Core claim

Archon unifies seven modalities with modality-specific tokenizers and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. Memory-efficient semantic video reparameterization achieves 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. A Thinking in Modality strategy decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Experiments demonstrate superior or comparable performance across diverse digital human generation tasks.

What carries the argument

Native autoregressive unified multimodal model with modality-specific tokenizers, pretrained on synchronized data and 72 tasks, plus semantic video reparameterization and Thinking in Modality chain.

If this is right

  • Holistic avatar generation becomes possible in one forward pass without switching between separate models for text, audio, motion, or video.
  • High-fidelity talking videos can be processed autoregressively with only one-quarter the tokens while retaining dynamics via the diffusion decoder.
  • Cross-modal tasks gain controllability by routing through an explicit chain of alternative modalities rather than direct ambiguous mapping.
  • A single pretrained checkpoint suffices for the full range of 72 tasks instead of requiring per-task retraining or architectural variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the joint-distribution claim holds, deployment pipelines for virtual humans could collapse from many specialized models to one, simplifying maintenance and inference hardware requirements.
  • The modality-thinking decomposition suggests a general pattern for other ambiguous multimodal problems where direct conditioning fails but sequential rephrasing succeeds.
  • Extending the same pretraining recipe to non-human subjects such as animals or objects would test whether the seven-modality unification is human-specific or broadly applicable.
  • Real-time interactive use cases would require measuring whether the autoregressive nature plus reparameterization still supports low-latency responses.

Load-bearing premise

Pretraining a single autoregressive model on synchronized multimodal data and 72 tasks is sufficient to capture the joint distribution without modality-specific fine-tuning or architectural compromises that degrade any individual task.

What would settle it

A controlled comparison in which a task-specific model trained only on one digital human generation task produces measurably higher fidelity outputs than the unified Archon model on that same task.

Figures

Figures reproduced from arXiv: 2605.30311 by Chong Bao, David Futschik, Feitong Tan, Guofeng Zhang, Lijun Yu, Sean Fanello, Shefali Srivastava, Shichen Liu, Stylianos Moschoglou, Yinda Zhang, Zhaopeng Cui, Ziqian Bai.

Figure 1
Figure 1. Figure 1: Archon. We propose a novel unified multimodal model that performs cross-modal generation among a wide range of modalities, including description, script, speech, animation, semantic video, image, and video. Furthermore, we introduce the concept of Thinking in Modality to reduce ambiguity during cross-modal transitions and enhance generation quality. Our model inherently supports conditional generation acro… view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline. We use modality tokenizers to tokenize description, script, speech, animation, image and semantic video into discrete tokens. These tokens are arranged in a structured format and input to the language model for multimodal reasoning. The synthesized tokens are detokenized into raw modalities. For synthesizing high-quality video, we employ a semantic-driven video diffusion model to synthesize high-… view at source ↗
Figure 3
Figure 3. Figure 3: Thinking in Modality. We show the results of speech￾to-video generation with different thinking strategies. The videos generated from chain 2 contain less distortion (e.g., blurry appear￾ance and undefined textual symbols) and exhibit closer identity alignment with the ground truth. videos for cross-modal reasoning, we introduce a semantic￾driven video diffusion model that decodes these semantic videos int… view at source ↗
Figure 4
Figure 4. Figure 4: Multimodal Generation. We show holistic modality generation and understanding given an arbitrary modality as input. from 3DMM shape and expression, as shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Modality-specific Editing. We show flexible any modality editing to modify an arbitrarily chosen modality while maintain the others untouched. The icons on the top show the modalities used in the example, and the highlighted icons are the ones that are edited. and new video with synchronized lip motion and animation while maintain the identity. Secondly, our model can seam￾lessly change the appearance of t… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison on the Speech-driven Video Generation. We qualitatively compare the video quality and video-audio synchro￾nization against AniPortrait [52], Echomimic [8], Hallo3 [12]. datasets. This notable performance highlights the excep￾tional quality of our lip synchronization and overall realism. We further provide qualitative comparisons in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents Archon, a fully pretrained autoregressive unified multimodal model for holistic digital human generation. It unifies seven modalities via modality-specific tokenizers, pretrains on synchronized multimodal data across 72 tasks to model joint distributions, introduces semantic video reparameterization for 4x token reduction with a diffusion decoder, and proposes 'Thinking in Modality' for stepwise cross-modal decomposition. The central claim is that this yields superior or comparable performance on diverse digital human tasks.

Significance. If the empirical claims hold with rigorous validation, the work could be significant for demonstrating that a single autoregressive model can capture holistic joint distributions across text/audio/motion/visual modalities without modality-specific fine-tuning. The token-reduction technique and cross-modal thinking strategy address practical challenges in high-fidelity generation. However, the provided manuscript contains only the abstract with no methods, results, ablations, or statistics, so significance cannot be assessed.

major comments (2)
  1. [Abstract] Abstract: The central performance claim ('achieves superior or comparable performance across diverse digital human generation tasks') is stated without any quantitative results, tables, figures, training details, or statistical tests. This prevents verification of whether gains are supported by data or affected by task selection, directly undermining evaluation of the unified pretraining assumption.
  2. [Abstract] Abstract: No description is given of the 72 tasks, data composition, synchronization method, or evaluation benchmarks. Without these, it is impossible to assess whether the autoregressive model captures the joint distribution without architectural compromises that degrade individual modalities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We agree that the current abstract is high-level and lacks the quantitative support and task details needed for full evaluation. We will revise the abstract to incorporate key results and brief descriptions while maintaining conciseness; an expanded manuscript with full methods, tables, and ablations will be prepared for resubmission.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim ('achieves superior or comparable performance across diverse digital human generation tasks') is stated without any quantitative results, tables, figures, training details, or statistical tests. This prevents verification of whether gains are supported by data or affected by task selection, directly undermining evaluation of the unified pretraining assumption.

    Authors: We agree that the abstract would be strengthened by including quantitative evidence. In the revision we will add 1-2 specific metrics (e.g., percentage improvements on representative tasks) drawn from our experiments to substantiate the claim without exceeding abstract length limits. revision: yes

  2. Referee: [Abstract] Abstract: No description is given of the 72 tasks, data composition, synchronization method, or evaluation benchmarks. Without these, it is impossible to assess whether the autoregressive model captures the joint distribution without architectural compromises that degrade individual modalities.

    Authors: We acknowledge the absence of these specifics in the abstract. We will add a concise sentence summarizing task categories, data sources, and synchronization approach, while directing readers to the methods section for complete details. Full task lists and benchmark descriptions will appear in the expanded manuscript. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and described structure contain no derivation chain, equations, fitted parameters presented as predictions, or self-citations that reduce claims to inputs by construction. The central claim rests on empirical experiments across tasks, which are independent of any definitional or self-referential loop under the enumerated patterns. This is the standard case for an empirical multimodal modeling paper with no load-bearing mathematical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical multimodal learning work; the central claim depends on the unstated assumption that the chosen tokenizers and autoregressive objective are expressive enough for joint distributions, plus whatever hyperparameters and data curation choices were used during pretraining. No explicit free parameters, axioms, or invented physical entities are named in the abstract.

pith-pipeline@v0.9.1-grok · 5759 in / 1241 out tokens · 23968 ms · 2026-06-29T08:01:08.812100+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 22 canonical work pages · 17 internal anchors

  1. [1]

    Alternating gradi- ent descent and mixture-of-experts for integrated multimodal perception.Advances in Neural Information Processing Sys- tems, 36:79142–79154, 2023

    Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, and Hartwig Adam. Alternating gradi- ent descent and mixture-of-experts for integrated multimodal perception.Advances in Neural Information Processing Sys- tems, 36:79142–79154, 2023. 5

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736,

  3. [3]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023. 2, 5, 6

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhao- hai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Jun- yang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  5. [5]

    Location-relative attention mechanisms for robust long-form speech synthesis

    Eric Battenberg, RJ Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, and Tom Bagby. Location-relative attention mechanisms for robust long-form speech synthesis. InICASSP 2020-2020 IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 6194–6198. IEEE, 2020. 7

  6. [6]

    Audiolm: a language modeling approach to audio genera- tion.IEEE/ACM transactions on audio, speech, and lan- guage processing, 31:2523–2533, 2023

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio genera- tion.IEEE/ACM transactions on audio, speech, and lan- guage processing, 31:2523–2533, 2023. 3

  7. [7]

    Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation, 2023

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation, 2023. 19

  8. [8]

    Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions

    Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. InPro- ceedings of the AAAI Conference on Artificial Intelligence, pages 2403–2410, 2025. 2, 6, 7, 8

  9. [9]

    Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240): 1–113, 2023

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240): 1–113, 2023. 2, 5

  10. [10]

    Out of time: auto- mated lip sync in the wild

    Joon Son Chung and Andrew Zisserman. Out of time: auto- mated lip sync in the wild. InAsian conference on computer vision, pages 251–263. Springer, 2016. 6, 7

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 5, 6, 15

  12. [12]

    Hallo3: Highly dynamic and realistic portrait image an- imation with diffusion transformer networks.arXiv e-prints, pages arXiv–2412, 2024

    Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image an- imation with diffusion transformer networks.arXiv e-prints, pages arXiv–2412, 2024. 2, 6, 7, 8

  13. [13]

    Emoca: Emotion driven monocular face capture and animation

    Radek Dan ˇeˇcek, Michael J Black, and Timo Bolkart. Emoca: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 20311–20322, 2022. 2, 17

  14. [14]

    Gemini 3

    Google Deepmind. Gemini 3. https://deepmind.google/models/gemini/, 2025. 19

  15. [15]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 3

  16. [16]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm- e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023. 3

  17. [17]

    3d morphable face models—past, present, and future.ACM Transactions on Graphics (ToG), 39(5):1–38, 2020

    Bernhard Egger, William AP Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, et al. 3d morphable face models—past, present, and future.ACM Transactions on Graphics (ToG), 39(5):1–38, 2020. 17

  18. [18]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 3, 13

  19. [19]

    Learning an animatable detailed 3d face model from in-the- wild images.ACM Transactions on Graphics (ToG), 40(4): 1–13, 2021

    Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the- wild images.ACM Transactions on Graphics (ToG), 40(4): 1–13, 2021. 2

  20. [20]

    Face2speech: Towards multi- speaker text-to-speech synthesis using an embedding vector predicted from a face image

    Shunsuke Goto, Kotaro Onishi, Yuki Saito, Kentaro Tachibana, and Koichiro Mori. Face2speech: Towards multi- speaker text-to-speech synthesis using an embedding vector predicted from a face image. InInterspeech, pages 1321– 1325, 2020. 2

  21. [21]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, 9 et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 5

  22. [22]

    Vector quantization.IEEE Assp Magazine, 1 (2):4–29, 1984

    Robert Gray. Vector quantization.IEEE Assp Magazine, 1 (2):4–29, 1984. 14

  23. [23]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In European Conference on Computer Vision, pages 393–411. Springer, 2024. 5

  24. [24]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 7

  25. [25]

    Face- stylespeech: Improved face-to-voice latent mapping for nat- ural zero-shot speech synthesis from a face image.arXiv preprint arXiv:2311.05844, 2023

    Minki Kang, Wooseok Han, and Eunho Yang. Face- stylespeech: Improved face-to-voice latent mapping for nat- ural zero-shot speech synthesis from a face image.arXiv preprint arXiv:2311.05844, 2023. 2

  26. [26]

    Sapiens: Foundation for human vision mod- els

    Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision mod- els. InEuropean Conference on Computer Vision, pages 206–228. Springer, 2024. 3

  27. [27]

    A method for stochastic optimization

    Diederik Kinga, Jimmy Ba Adam, et al. A method for stochastic optimization. InInternational conference on learning representations (ICLR). California;, 2015. 6

  28. [28]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vigh- nesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video gen- eration.arXiv preprint arXiv:2312.14125, 2023. 2, 3, 4

  29. [29]

    Imag- inary voice: Face-styled diffusion model for text-to-speech

    Jiyoung Lee, Joon Son Chung, and Soo-Whan Chung. Imag- inary voice: Face-styled diffusion model for text-to-speech. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 2, 7, 8

  30. [30]

    Learning a model of facial shape and expression from 4d scans.ACM Trans

    Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans.ACM Trans. Graph., 36(6):194–1, 2017. 4

  31. [31]

    Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models

    Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang, Yuan Zhang, and Jingtuo Liu. Omnihuman-1: Re- thinking the scaling-up of one-stage conditioned human an- imation models. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 13847–13858,

  32. [32]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

  33. [33]

    World-consistent video-to-video synthesis

    Arun Mallya, Ting-Chun Wang, Karan Sapra, and Ming-Yu Liu. World-consistent video-to-video synthesis. InEuropean Conference on Computer Vision, pages 359–378. Springer,

  34. [34]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 4, 6, 17

  35. [35]

    Dpe: Dis- entanglement of pose and expression for general video por- trait editing

    Youxin Pang, Yong Zhang, Weize Quan, Yanbo Fan, Xi- aodong Cun, Ying Shan, and Dong-ming Yan. Dpe: Dis- entanglement of pose and expression for general video por- trait editing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 427–436,

  36. [36]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.arXiv preprint arXiv:2306.14824, 2023. 2, 3

  37. [37]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInterna- tional conference on machine learning, pages 28492–28518. PMLR, 2023. 6, 17

  38. [38]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learn- ing Research, 21(140):1–67, 2020. 3

  39. [39]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 4

  40. [40]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 5

  41. [41]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. InInternational Conference on Machine Learning, pages 4596–4604. PMLR,

  42. [42]

    X-vectors: Robust dnn em- beddings for speaker recognition

    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust dnn em- beddings for speaker recognition. In2018 IEEE interna- tional conference on acoustics, speech and signal processing (ICASSP), pages 5329–5333. IEEE, 2018. 7

  43. [43]

    Fixtalk: Taming identity leakage for high-quality talking head generation in extreme cases.arXiv preprint arXiv:2507.01390, 2025

    Shuai Tan, Bill Gong, Bin Ji, and Ye Pan. Fixtalk: Taming identity leakage for high-quality talking head generation in extreme cases.arXiv preprint arXiv:2507.01390, 2025. 2

  44. [44]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2, 3

  45. [45]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 7

  46. [46]

    Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 4

  47. [47]

    Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017. 14, 17 10

  48. [48]

    Generalized end-to-end loss for speaker verification

    Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883. IEEE, 2018. 7

  49. [49]

    Visionllm: Large language model is also an open- ended decoder for vision-centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513, 2023

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. Visionllm: Large language model is also an open- ended decoder for vision-centric tasks.Advances in Neural Information Processing Systems, 36:61501–61513, 2023. 4

  50. [50]

    Prompthmr: Promptable human mesh recovery

    Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J Black, and Muhammed Kocabas. Prompthmr: Promptable human mesh recovery. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1148–1159, 2025. 3

  51. [51]

    Mocha: Towards movie-grade talking character synthesis.arXiv preprint arXiv:2503.23307, 2025

    Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, et al. Mocha: Towards movie-grade talking character synthesis.arXiv preprint arXiv:2503.23307, 2025. 2

  52. [52]

    Aniportrait: Audio-driven synthesis of photorealistic portrait animation

    Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv preprint arXiv:2403.17694, 2024. 2, 7, 8

  53. [53]

    Fake it till you make it: face analysis in the wild using synthetic data alone

    Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. Fake it till you make it: face analysis in the wild using synthetic data alone. InProceedings of the IEEE/CVF international conference on computer vision, pages 3681–3691, 2021. 6

  54. [54]

    Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels.arXiv preprint arXiv:2312.17090, 2023. 7

  55. [55]

    Next-gpt: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning, 2024. 2, 3, 18, 19

  56. [56]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show- o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025. 2, 18

  57. [57]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  58. [58]

    Vasa-1: Lifelike audio-driven talking faces generated in real time.Advances in Neural Information Pro- cessing Systems, 37:660–684, 2024

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time.Advances in Neural Information Pro- cessing Systems, 37:660–684, 2024. 2

  59. [59]

    Emotional face-to-speech.arXiv preprint arXiv:2502.01046, 2025

    Jiaxin Ye, Boyuan Cao, and Hongming Shan. Emotional face-to-speech.arXiv preprint arXiv:2502.01046, 2025. 2

  60. [60]

    Magvit: Masked generative video transformer

    Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming- Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023. 3

  61. [61]

    Language model beats diffusion-tokenizer is key to visual generation

    Lijun Yu, José Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion-tokenizer is key to visual generation. InICLR, 2024. 2, 3, 4, 12

  62. [62]

    Soundstream: An end- to-end neural audio codec.IEEE/ACM Transactions on Au- dio, Speech, and Language Processing, 30:495–507, 2021

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end- to-end neural audio codec.IEEE/ACM Transactions on Au- dio, Speech, and Language Processing, 30:495–507, 2021. 3, 4, 17

  63. [63]

    Anygpt: Unified multimodal llm with dis- crete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with dis- crete sequence modeling. InProceedings of the 62nd An- nual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 9637–9662, 2024. 2, 3

  64. [64]

    Speechgpt: Empow- ering large language models with intrinsic cross-modal con- versational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empow- ering large language models with intrinsic cross-modal con- versational abilities. InThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 3

  65. [65]

    Sadtalker: Learning realistic 3d motion coefficients for stylized audio- driven single image talking face animation

    Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio- driven single image talking face animation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8652–8661, 2023. 2

  66. [66]

    Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset

    Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3661–3670, 2021. 6, 7, 8, 17, 19

  67. [67]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Pre- dict the next token and diffuse images with one multi-modal model.arXiv preprint arXiv:2408.11039, 2024. 2

  68. [68]

    Pose-controllable talking face generation by implicitly modularized audio-visual rep- resentation

    Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. Pose-controllable talking face generation by implicitly modularized audio-visual rep- resentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4176–4186,

  69. [69]

    Celebv- hq: A large-scale video facial attributes dataset

    Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. Celebv- hq: A large-scale video facial attributes dataset. InEuropean conference on computer vision, pages 650–667. Springer,

  70. [70]

    Thinking in Modality

    6, 7, 8, 17 11 Supplementary Material This supplementary material provides additional imple- mentation details, experimental results and a brief discus- sion on ethics. In Sec. A, we claim that our work adheres to ethical guidelines. In Sec. C, we elaborate on the architec- tural designs and training protocols for our semantic and an- imation tokenizers, ...