Recognition: 2 Lean theorem links
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Pith reviewed 2026-05-16 01:28 UTC · model grok-4.3
The pith
Seedance 1.5 pro is a joint audio-visual generation model that achieves high audio-visual synchronization via a dual-branch Diffusion Transformer and post-training optimizations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality.
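The abstract does not specify how the two branches interact, so the claim cannot be checked at the architectural level. For orientation only, below is a minimal sketch of one plausible reading of a dual-branch block with a cross-modal joint module; every module name, dimension, and the fusion scheme are assumptions, not the paper's design.

```python
# Hypothetical sketch of a dual-branch DiT block with a cross-modal joint
# module. All names, shapes, and the fusion scheme are assumptions; the
# abstract does not disclose the actual architecture.
import torch
import torch.nn as nn

class CrossModalJointBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Independent self-attention per modality (the two "branches").
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Joint module: each branch cross-attends to the other modality.
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v: (batch, video_tokens, dim); a: (batch, audio_tokens, dim)
        v = v + self.video_attn(v, v, v)[0]
        a = a + self.audio_attn(a, a, a)[0]
        # Cross-modal exchange: the plausible locus of any learned sync.
        v = self.norm_v(v + self.audio_to_video(v, a, a)[0])
        a = self.norm_a(a + self.video_to_audio(a, v, v)[0])
        return v, a

v, a = torch.randn(2, 64, 512), torch.randn(2, 32, 512)
v_out, a_out = CrossModalJointBlock()(v, a)
print(v_out.shape, a_out.shape)  # (2, 64, 512) (2, 32, 512)
```

If the joint module is anything like this, the cross-attention layers are where synchronization would be learned, which is exactly the component the referee report below asks to see ablated.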
Load-bearing premise
That the cross-modal joint module and multi-stage data pipeline, combined with SFT and RLHF, produce the stated synchronization and quality levels. The provided abstract states this assumption without supporting metrics or comparisons.
read the original abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
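The abstract asserts an over-10X inference speedup without describing the mechanism. The reference list leans on trajectory and consistency distillation (Hyper-SD, Mean Flows, RayFlow), so a generic step-distillation sketch is a reasonable guess at the framework's shape; everything below, including the step counts, is hypothetical.

```python
# Hypothetical step-distillation sketch: a student sampler is trained so
# that its few-step trajectory endpoint matches a many-step teacher's,
# turning 50 denoising steps into 4 (a >10X step reduction). This is a
# generic recipe, not the paper's disclosed acceleration framework.
import torch
import torch.nn as nn

def euler_sample(model, x, steps):
    # Plain Euler integration of a learned velocity field from t=0 to t=1.
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * model(x, t)
    return x

def distill_step(student, teacher, noise, opt,
                 teacher_steps=50, student_steps=4):
    with torch.no_grad():                      # teacher endpoint is the target
        target = euler_sample(teacher, noise, teacher_steps)
    pred = euler_sample(student, noise, student_steps)
    loss = torch.mean((pred - target) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

class ToyVelocity(nn.Module):                  # purely illustrative net
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(),
                                 nn.Linear(64, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

teacher, student = ToyVelocity(), ToyVelocity()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
print(distill_step(student, teacher, torch.randn(16, 8), opt))
```

Whether the actual framework distills steps, caches activations, or does both is not stated, which is part of the referee's complaint below.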
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Seedance 1.5 pro, a foundation model for native joint audio-video generation. It employs a dual-branch Diffusion Transformer architecture with an integrated cross-modal joint module and a multi-stage data pipeline. Post-training includes Supervised Fine-Tuning (SFT) on high-quality data and Reinforcement Learning from Human Feedback (RLHF) using multi-dimensional reward models, plus an acceleration framework claimed to deliver over 10X faster inference. The model is asserted to achieve exceptional audio-visual synchronization, superior generation quality, precise multilingual lip-syncing, dynamic camera control, and enhanced narrative coherence, with availability via Volcano Engine.
Significance. If the architectural choices and training pipeline demonstrably deliver the claimed synchronization and quality levels, the work would represent a meaningful step toward unified audio-visual foundation models suitable for professional content creation. The combination of dual-branch DiT, cross-modal joint module, SFT/RLHF, and inference acceleration could influence downstream applications in video production and multimodal AI. However, the complete absence of any quantitative evaluation prevents assessment of whether these contributions advance the state of the art.
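For concreteness, "RLHF with multi-dimensional reward models" in the diffusion setting usually means folding several reward scores into one advantage that reweights sampled trajectories, as in the cited DanceGRPO and Flow-GRPO. The sketch below shows only the reward-aggregation step; the reward dimensions and weights are invented for illustration.

```python
# Hypothetical sketch of multi-dimensional reward aggregation for a
# GRPO-style update: several per-sample reward scores (e.g., lip-sync,
# visual quality, prompt adherence) are combined into a scalar and then
# normalized within the group sampled for one prompt. The dimensions and
# weights are assumptions, not the paper's reward models.
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size, num_dims) scores for one prompt's samples;
    weights: (num_dims,) importance of each reward dimension."""
    scalar = rewards @ weights                         # (group_size,)
    # Group-relative normalization: advantages are centered near zero.
    return (scalar - scalar.mean()) / (scalar.std() + 1e-6)

rewards = torch.tensor([[0.9, 0.7, 0.8],   # sample 1: sync, quality, adherence
                        [0.4, 0.8, 0.6],   # sample 2
                        [0.7, 0.5, 0.9]])  # sample 3
weights = torch.tensor([0.5, 0.3, 0.2])
print(group_relative_advantages(rewards, weights))
# Positive advantage -> that sample's denoising trajectory is reinforced.
```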
major comments (2)
- [Abstract] The central claims of 'exceptional audio-visual synchronization' and 'superior generation quality' are stated without any supporting quantitative results, such as SyncNet/LSE scores, lip-sync error rates, FVD, audio-visual alignment metrics, or head-to-head comparisons against prior models (e.g., existing diffusion-based video or audio-visual generators). This omission makes the performance assertions unverifiable, even though they are load-bearing for the paper's contribution; a sketch of the kind of measurement that is missing follows these comments.
- [Abstract] The description of the dual-branch Diffusion Transformer, cross-modal joint module, multi-stage data pipeline, SFT, RLHF, and acceleration framework remains a high-level architectural summary with no equations, implementation details, ablation studies, or parameter counts, preventing evaluation of whether these components actually produce the stated outcomes.
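To make the first comment concrete, the missing measurement is easy to state: SyncNet-style LSE-C scores compare per-frame video embeddings with audio embeddings across a range of temporal offsets. The sketch below assumes such embeddings exist; a real evaluation would obtain them from pretrained SyncNet encoders, and random tensors stand in here.

```python
# Hypothetical LSE-C-style synchronization confidence: cosine similarity
# between paired audio and video embeddings, maximized over temporal
# offsets and compared against the median offset. Random tensors stand in
# for real SyncNet embeddings.
import torch
import torch.nn.functional as F

def lse_c(video_emb: torch.Tensor, audio_emb: torch.Tensor,
          max_offset: int = 5) -> float:
    """video_emb, audio_emb: (frames, dim). Higher return value means the
    in-sync offset stands out more from the off-sync ones."""
    sims = []
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:   # pair video[i + off] with audio[i]
            v, a = video_emb[off:], audio_emb[:len(audio_emb) - off]
        else:          # pair video[i] with audio[i - off]
            v, a = video_emb[:off], audio_emb[-off:]
        sims.append(F.cosine_similarity(v, a, dim=-1).mean())
    sims = torch.stack(sims)
    return (sims.max() - sims.median()).item()

v = torch.randn(100, 256)  # stand-in per-frame video embeddings
a = torch.randn(100, 256)  # stand-in per-frame audio embeddings
print(f"LSE-C-style confidence: {lse_c(v, a):.3f}")
```

Reporting a number like this against prior audio-visual generators would be the minimal evidence for the synchronization claim.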
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion model scale and conditioning parameters
axioms (1)
- domain assumption: Diffusion transformers conditioned on cross-modal signals can produce synchronized audio-visual output
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "achieving exceptional audio-visual synchronization and superior generation quality"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
  OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
- TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
  TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
- Efficient Video Diffusion Models: Advancements and Challenges
  A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
- Tracking High-order Evolutions via Cascading Low-rank Fitting
  Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-incre...
- AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
  AVGen-Bench reveals that current text-to-audio-video models produce strong aesthetics but fail at semantic controllability including text rendering, speech coherence, physical reasoning, and musical pitch control.
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
  SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing
  Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
- ViPO: Visual Preference Optimization at Scale
  Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
  OmniHuman is a new large-scale multi-scene dataset with video-, frame-, and individual-level annotations for human-centric video generation, accompanied by the OHBench benchmark that adds metrics aligned with human pe...
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
  OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
- Continuous Adversarial Flow Models
  Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
- ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
  ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
- VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
  VideoGPA distills geometry priors via self-supervised DPO to enhance 3D consistency, temporal stability, and motion coherence in video diffusion models.
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
  Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
- Motif-Video 2B: Technical Report
  Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
- Advancing Open-source World Models
  LingBot-World is presented as an open-source world model that delivers high-fidelity simulation, minute-level contextual consistency, and real-time interactivity under one second latency.
- Seedance 2.0: Advancing Video Generation for World Complexity
  Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.
Reference graph
Works this paper leans on
- [1] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [2] Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025.
- [3] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.
- [4] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.
- [5] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.
- [6] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
- [7] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
- [8] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025.
- [9] Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427, 2025.
- [10] Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, and Xuefeng Xiao. RayFlow: Instance-aware diffusion acceleration via adaptive flow trajectories. arXiv preprint arXiv:2503.07699, 2025.
- [11] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [12] Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025.
- [13] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.
- [14] Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.
- [15] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
- [16] Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, et al. UniFL: Improve Stable Diffusion via unified feedback learning. arXiv preprint arXiv:2404.05595, 2024.