pith. machine review for the scientific record.

arxiv: 2601.03233 · v1 · submitted 2026-01-06 · 💻 cs.CV

Recognition: 1 theorem link

LTX-2: Efficient Joint Audio-Visual Foundation Model

Andrew Kvochko, Avishai Berkowitz, Benny Brazowski, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nisan Chiprut, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaki Bitterman, Yaron Inger, Yoav HaCohen, Yonatan Shiftan, Zeev Farbman, Zeev Melumian

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 06:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-audiovisual · joint audio-video generation · diffusion model · foundation model · open-source · transformer · classifier-free guidance · multimodal generation

The pith

LTX-2 generates high-quality synchronized audiovisual content from text using an asymmetric dual-stream transformer that prioritizes video capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LTX-2 as an open-source foundational model for unified text-to-audiovisual generation. It builds an asymmetric architecture with a 14B-parameter video stream and a 5B-parameter audio stream connected by bidirectional cross-attention layers. This setup enables efficient training and inference while producing audio that aligns with video elements such as characters, environment, style, and emotion, including speech, background sounds, and foley. The model incorporates a multilingual text encoder and modality-aware classifier-free guidance to improve prompt understanding and cross-modal alignment. If the claims hold, the approach would allow open-source systems to approach proprietary audiovisual quality at substantially reduced computational cost and inference time.

Core claim

LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. The model employs a multilingual text encoder and a modality-aware classifier-free guidance mechanism to generate temporally synchronized audiovisual outputs that follow scene characters, environment, style, and emotion, complete with natural background and foley elements.
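
To make the coupling concrete, the following is a minimal sketch of one such dual-stream block, assuming a PyTorch-style implementation. The class name, widths, and wiring are our reading of the abstract, not the released LTX-2 code, and temporal positional embeddings are assumed to be added to both token streams before the block.

```python
# Minimal sketch of one asymmetric dual-stream block (illustrative only;
# names, widths, and wiring are assumptions, not the released LTX-2 code).
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d_video=2048, d_audio=1024, n_heads=16):
        super().__init__()
        # Asymmetric capacity: the video stream is wider than the audio stream.
        self.video_self = nn.MultiheadAttention(d_video, n_heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        # Bidirectional cross-attention: each stream queries the other.
        self.video_from_audio = nn.MultiheadAttention(
            d_video, n_heads, kdim=d_audio, vdim=d_audio, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(
            d_audio, n_heads, kdim=d_video, vdim=d_video, batch_first=True)
        # Cross-modality AdaLN: one shared timestep embedding modulates both streams.
        self.adaln_video = nn.Linear(d_video, 2 * d_video)
        self.adaln_audio = nn.Linear(d_video, 2 * d_audio)
        self.norm_video = nn.LayerNorm(d_video, elementwise_affine=False)
        self.norm_audio = nn.LayerNorm(d_audio, elementwise_affine=False)

    def forward(self, v, a, t_emb):
        # v: (B, T_v, d_video) video tokens; a: (B, T_a, d_audio) audio tokens;
        # t_emb: (B, d_video) shared timestep embedding.
        scale_v, shift_v = self.adaln_video(t_emb).unsqueeze(1).chunk(2, dim=-1)
        scale_a, shift_a = self.adaln_audio(t_emb).unsqueeze(1).chunk(2, dim=-1)
        v = self.norm_video(v) * (1 + scale_v) + shift_v
        a = self.norm_audio(a) * (1 + scale_a) + shift_a
        v = v + self.video_self(v, v, v)[0]
        a = a + self.audio_self(a, a, a)[0]
        # Bidirectional exchange is what would carry audio-video synchronization.
        v = v + self.video_from_audio(v, a, a)[0]
        a = a + self.audio_from_video(a, v, v)[0]
        return v, a
```

Stacking many such blocks, wider and deeper on the video side, is one way the 14B/5B capacity split could be realized.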

What carries the argument

asymmetric dual-stream transformer with bidirectional audio-video cross-attention layers

Load-bearing premise

Bidirectional audio-video cross-attention layers combined with modality-aware classifier-free guidance suffice to produce temporally synchronized and semantically coherent audiovisual output without extra alignment losses or post-processing.

What would settle it

A side-by-side evaluation on complex scenes would settle it: audio that consistently mismatched video timing or semantic content would disprove the mechanism's sufficiency.
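
As a concrete form of that test, here is a hypothetical scoring harness, assuming per-clip visual event times and audio onsets have already been extracted by some detector; the function name, the nearest-onset matching rule, and the 80 ms tolerance are all illustrative assumptions, not a protocol from the paper.

```python
# Hypothetical harness for the falsification test above: given per-clip
# timestamps of visual events and corresponding audio onsets (extraction
# method left open), flag clips whose mean offset exceeds a threshold.
import numpy as np

def sync_mismatch_rate(visual_events, audio_onsets, tol_s=0.08):
    """Fraction of clips whose audio leads/lags video beyond tol_s seconds."""
    mismatched = 0
    for v, a in zip(visual_events, audio_onsets):
        v, a = np.asarray(v), np.asarray(a)
        # Match each visual event to its nearest audio onset.
        offsets = np.abs(v[:, None] - a[None, :]).min(axis=1)
        if offsets.mean() > tol_s:
            mismatched += 1
    return mismatched / len(visual_events)
```

A high mismatch rate on complex scenes would be the disconfirming evidence described above.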

read the original abstract

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LTX-2, an open-source joint audio-visual foundation model based on an asymmetric dual-stream transformer (14B-parameter video stream and 5B-parameter audio stream) whose streams are coupled through bidirectional cross-attention layers, temporal positional embeddings, and cross-modality AdaLN for shared conditioning, with a modality-aware classifier-free guidance mechanism at inference. It claims to generate high-quality, temporally synchronized audiovisual content from multilingual text prompts, producing speech, foley, background audio, and scene-coherent tracks. The paper further claims state-of-the-art audiovisual quality and prompt adherence among open-source systems, results comparable to proprietary models at lower computational cost and inference time, and public release of all weights and code.

Significance. If the empirical claims are substantiated, this would represent a notable advance in multimodal generative modeling by delivering an efficient, publicly available unified architecture for joint audio-video synthesis that addresses the silence limitation of existing text-to-video models. The asymmetric capacity allocation, modality-CFG, and open release could serve as a practical baseline for future work in content creation, VR, and accessibility applications.

major comments (2)
  1. The central claim that bidirectional audio-video cross-attention, temporal positional embeddings, cross-modality AdaLN, and modality-aware CFG alone produce temporally synchronized and semantically coherent output (without additional alignment losses) is load-bearing for the SOTA and comparability assertions, yet the architecture section provides no explicit contrastive, cycle-consistency, or sync-specific training objectives to enforce frame-level correspondence in complex foley or rapid scene-change regimes.
  2. Evaluations section: the assertions of state-of-the-art audiovisual quality, prompt adherence among open-source systems, and comparability to proprietary models at a fraction of the cost lack any quantitative metrics, baseline tables, dataset descriptions, or ablation results, leaving the primary empirical claims without visible supporting evidence.
minor comments (2)
  1. Abstract: the phrase 'in our evaluations' is used without any accompanying numbers or references; adding one or two key metrics (e.g., FID, CLIP score, or human preference rates) would improve immediate readability.
  2. Notation: the distinction between 'modality-CFG' and standard classifier-free guidance is introduced without an explicit equation or pseudocode snippet showing how the modality-aware scaling is implemented (a hedged sketch follows these comments).
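
On minor comment 2, one plausible reading of modality-CFG, assuming it generalizes standard classifier-free guidance (reference [12]) with an independent scale per modality; the function name and default scales are illustrative, not the paper's confirmed formulation:

```python
# Hypothetical modality-CFG: independent guidance scales per modality.
# Names and default scales are illustrative assumptions.
def modality_cfg(pred_uncond, pred_cond, s_video=8.0, s_audio=4.0):
    """Combine unconditional and conditional denoiser outputs per modality.

    pred_uncond / pred_cond: dicts with 'video' and 'audio' tensors.
    """
    return {
        "video": pred_uncond["video"]
        + s_video * (pred_cond["video"] - pred_uncond["video"]),
        "audio": pred_uncond["audio"]
        + s_audio * (pred_cond["audio"] - pred_uncond["audio"]),
    }
```

Decoupling the two scales would let a user trade prompt adherence in one modality against the other, consistent with the controllability claim.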

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [—] The central claim that bidirectional audio-video cross-attention, temporal positional embeddings, cross-modality AdaLN, and modality-aware CFG alone produce temporally synchronized and semantically coherent output (without additional alignment losses) is load-bearing for the SOTA and comparability assertions, yet the architecture section provides no explicit contrastive, cycle-consistency, or sync-specific training objectives to enforce frame-level correspondence in complex foley or rapid scene-change regimes.

    Authors: The model is trained end-to-end with a standard joint diffusion objective on large-scale paired audiovisual data; the bidirectional cross-attention layers, temporal positional embeddings, and modality-aware CFG are the mechanisms through which frame-level and semantic alignment are learned implicitly. No auxiliary contrastive or cycle-consistency losses were added, in order to preserve training efficiency and architectural simplicity. We have expanded the architecture and training sections with a step-by-step description of how temporal correspondence arises from these components and have added qualitative examples demonstrating synchronization under rapid scene changes and complex foley. (A hedged reconstruction of this objective is sketched after the responses below.) revision: partial

  2. Referee: [—] Evaluations section: the assertions of state-of-the-art audiovisual quality, prompt adherence among open-source systems, and comparability to proprietary models at a fraction of the cost lack any quantitative metrics, baseline tables, dataset descriptions, or ablation results, leaving the primary empirical claims without visible supporting evidence.

    Authors: We agree that the initial submission relied primarily on qualitative results. The revised manuscript now includes quantitative metrics (FID, CLIP-score, audio quality measures, and human preference scores), comparison tables against both open-source and proprietary baselines, a description of the evaluation datasets and protocols, and ablation studies isolating the contribution of cross-attention, temporal embeddings, and modality-CFG. These additions directly support the SOTA and efficiency claims. revision: yes
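
The 'standard joint diffusion objective' invoked in response 1 is never written out. One consistent reading, assuming a flow-matching parameterization in the style of reference [15], with an audio weighting λ that is our assumption:

```latex
% Hedged reconstruction, not the paper's stated loss.
% Each modality m \in \{v, a\} is noised along a linear path and the
% joint model predicts both velocities from both noisy streams.
x^m_t = (1 - t)\, x^m_0 + t\, \epsilon^m, \qquad \epsilon^m \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[
    \big\| u^v_\theta(x^v_t, x^a_t, t, c) - (\epsilon^v - x^v_0) \big\|^2
  + \lambda \, \big\| u^a_\theta(x^v_t, x^a_t, t, c) - (\epsilon^a - x^a_0) \big\|^2
\Big]
```

Under this reading, any synchronization must come from both velocity heads conditioning on both noisy streams through the cross-attention layers; nothing in the loss itself rewards alignment, which is exactly the referee's point.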

Circularity Check

0 steps flagged

No circularity: architecture and claims rest on empirical training, not self-referential derivations

full rationale

The paper presents an asymmetric dual-stream transformer (14B video + 5B audio) with bidirectional cross-attention, temporal embeddings, cross-modality AdaLN, and modality-aware CFG as design choices for unified audiovisual generation. No equations, uniqueness theorems, or derivations are shown that reduce by construction to fitted inputs, self-citations, or renamed empirical patterns. Central SOTA and comparability claims are supported by training runs and evaluations rather than any load-bearing self-reference or ansatz smuggled in via prior author work. This is a standard empirical foundation-model paper whose claims are checked against external benchmarks rather than its own derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied machine-learning paper. It relies on standard transformer and diffusion assumptions without introducing new mathematical axioms or postulated entities.

axioms (1)
  • standard math Transformer blocks with cross-attention can model joint temporal sequences across modalities
    Invoked implicitly in the description of the dual-stream architecture and bidirectional cross-attention layers.

pith-pipeline@v0.9.0 · 5650 in / 1159 out tokens · 50282 ms · 2026-05-13T06:59:48.168939+00:00 · methodology


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

    cs.CV 2026-05 unverdicted novelty 7.0

    CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

  2. OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.

  3. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  4. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  5. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  6. Do Joint Audio-Video Generation Models Understand Physics?

    cs.SD 2026-05 unverdicted novelty 7.0

    Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.

  7. TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation

    cs.SD 2026-05 unverdicted novelty 7.0

    TMD-Bench is a multi-level benchmark that measures music-dance co-generation quality including beat-level rhythmic synchronization, supported by a new dataset and Music Captioner, and shows commercial models lag in rh...

  8. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  9. Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement ...

  10. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  11. Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

  12. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  13. Qwen-Image-VAE-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 6.0

    Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

  14. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  15. SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

  16. Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Unison introduces a unified framework using semantic-guided harmonization and bidirectional cross-modal forcing to generate human-centric videos with improved synchronization between motion, speech, and sound effects.

  17. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  18. Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    HSA assigns variable denoising steps to spatiotemporal tokens in DiTs based on velocity dynamics, with KV-cache sync and cached Euler updates, outperforming prior caching methods on quality-runtime tradeoffs for T2V a...

  19. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  20. MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    MMControl adds multi-modal controls for identity, timbre, pose, and layout to unified audio-video diffusion models via dual-stream injection and adjustable guidance scaling.

  21. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  22. OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...

  23. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  24. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

  25. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  26. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 26 Pith papers · 10 internal anchors

  1. [1]

    Cafa: a controllable automatic foley artist. arXiv preprint arXiv:2504.06778, 2025

    Roi Benita, Michael Finkelson, Tavi Halperin, Gleb Sterkin, and Yossi Adi. Cafa: a controllable automatic foley artist. arXiv preprint arXiv:2504.06778, 2025

  2. [2]

    Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis

    Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901–28911, 2025

  3. [3]

    Analyzing transformers in embedding space

    Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170, 2023

  4. [4]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  5. [5]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  6. [6]

    Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers, 2024

    Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, and Hongsheng Li. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transfor...

  7. [7]

    Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621, 2025

    Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, et al. Wan-s2v: Audio-driven cinematic video generation. arXiv preprint arXiv:2508.18621, 2025

  8. [8]

    Veo 3: A diffusion-based audio+video generation system

    Google DeepMind. Veo 3: A diffusion-based audio+video generation system. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  9. [9]

    Taming text-to-sounding video generation via advanced modality condition and interaction, 2025

    Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, and Meng Cao. Taming text-to-sounding video generation via advanced modality condition and interaction, 2025. URL https://arxiv.org/abs/2510.03117

  10. [10]

    Generating an image from 1,000 words: Enhancing text-to-image with structured captions, 2025

    Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, and Ron Mokady. Generating an image from 1,000 words: Enhancing text-to-image with structured captions, 2025. URL https://arxiv.org/abs/2511.06876

  11. [11]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  12. [12]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  13. [13]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems (NeurIPS 2020), 2020. URL https://arxiv.org/abs/2010.05646

  14. [14]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  15. [15]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  16. [16]

    Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

  17. [17]

    AudioLDM: Text-to-audio generation with latent diffusion models

    Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. Proceedings of the International Conference on Machine Learning, pages 21450–21474, 2023

  18. [18]

    Audioldm 2: Learning holistic audio generation with self-supervised pretraining

    Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871–2883, 2024. doi: 10.1109/TASLP.2024.3399607

  19. [19]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023

    Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023

  20. [20]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022. URL https://arxiv.org/abs/2112.10741

  21. [21]

    Sora 2 is here. https://openai.com/index/sora-2/, 2025

    OpenAI. Sora 2 is here. https://openai.com/index/sora-2/, 2025

  22. [22]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  24. [24]

    Ovi: Twin backbone cross-modal fusion for audio-video generation

    Character.AI Research. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284, 2025. URL https://arxiv.org/abs/2510.01284

  25. [25]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. URL https://arxiv.org/abs/2205.11487

  26. [26]

    Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025

  27. [27]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  28. [28]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  29. [29]

    A comprehensive study of decoder-only llms for text-to-image generation

    Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28575–28585, 2025

  30. [30]

    Efficient vision-language models by summarizing visual tokens into compact registers. arXiv preprint arXiv:2410.14072, 2024

    Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, and Mahyar Najibi. Efficient vision-language models by summarizing visual tokens into compact registers. arXiv preprint arXiv:2410.14072, 2024

  31. [31]

    Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformers, 2024. URL https://arxiv.org/abs/2410.10629

  32. [32]

    Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds

    Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024