Bernini: Latent Semantic Planning for Video Diffusion
Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3
The pith
Bernini lets an MLLM predict semantic plans in ViT space that a DiT renderer turns into high-quality videos and edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bernini shows that an MLLM-based planner can predict target semantic representations in ViT embedding space and pass them to a DiT-based renderer, which synthesizes pixels from the plan together with text features and, for editing, source VAE features. Because the interface is semantic, the planner and renderer can be trained separately with only light co-training while achieving state-of-the-art video generation and editing.
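The data flow in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `build_renderer_conditioning`, the linear projection `proj`, the token-wise concatenation, and all tensor shapes are hypothetical, and the page does not say how the renderer actually consumes the plan or the source VAE features.

```python
import numpy as np

def build_renderer_conditioning(plan, text_feats, proj, source_vae=None):
    """Assemble renderer conditioning from the planner's output (hypothetical layout).

    plan:       (n_plan, d_vit) semantic plan predicted in ViT embedding space
    text_feats: (n_text, d_model) text features
    proj:       (d_vit, d_model) assumed linear map into the renderer width
    source_vae: optional source VAE latents, supplied only for editing
    """
    plan_tokens = plan @ proj  # (n_plan, d_model)
    # Assumption: plan tokens and text features share one context sequence.
    context = np.concatenate([plan_tokens, text_feats], axis=0)
    return {"context": context, "source_latents": source_vae}

rng = np.random.default_rng(0)
plan = rng.standard_normal((16, 32))   # stand-in for the planner's prediction
text = rng.standard_normal((8, 64))
proj = rng.standard_normal((32, 64))

# Generation: plan plus text only.
gen_cond = build_renderer_conditioning(plan, text, proj)
# Editing: the same call, with source VAE latents added for detail preservation.
edit_cond = build_renderer_conditioning(
    plan, text, proj, source_vae=rng.standard_normal((4, 8, 8, 16))
)
```

The point of the sketch is that the renderer only ever sees arrays of a fixed layout, which is what would let either side be retrained or replaced as long as the plan stays in the same embedding space.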
What carries the argument
Latent semantic planning in ViT embedding space, where the MLLM outputs high-level guidance that the DiT renderer conditions on to produce video pixels.
If this is right
- The planner and renderer can be developed and scaled independently while still producing coherent video output.
- The MLLM's pretrained reasoning improves generalization on video editing tasks that require understanding source content.
- Segment-Aware 3D RoPE allows the model to handle multiple visual inputs without losing spatial-temporal coherence.
- Chain-of-thought steps inside the planner help translate language-model understanding into better generation decisions.
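The page gives no formula for Segment-Aware 3D RoPE, so the following is only one plausible reading, sketched under loud assumptions: each visual input ("segment") gets its own (t, h, w) grid, segments are separated by offsetting the temporal index, and the channel dimension is split into three equal chunks that are rotated by t, h, and w respectively. The function names and the offset rule are hypothetical; only the per-axis rotation follows the standard RoPE construction.

```python
import numpy as np

def segment_position_ids(segment_shapes):
    """Return (N, 3) integer ids (t, h, w).

    Assumption: each segment's temporal axis starts where the previous one ended,
    so tokens from different visual inputs never share a (t, h, w) position.
    """
    ids, t_offset = [], 0
    for (T, H, W) in segment_shapes:
        t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
        ids.append(np.stack([t.ravel() + t_offset, h.ravel(), w.ravel()], axis=1))
        t_offset += T
    return np.concatenate(ids, axis=0)

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding on x of shape (N, d), d even, for 1D positions pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs[None, :]         # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, pos_ids):
    """Split channels into three equal chunks and rotate each by one of (t, h, w)."""
    chunks = np.split(x, 3, axis=-1)            # channel count must be divisible by 6
    return np.concatenate(
        [rope_1d(c, pos_ids[:, axis]) for axis, c in enumerate(chunks)], axis=-1
    )

# Two visual inputs: a 2-frame clip and a single reference frame, each 2x2 tokens.
ids = segment_position_ids([(2, 2, 2), (1, 2, 2)])
x = np.random.default_rng(0).standard_normal((ids.shape[0], 12))
y = rope_3d(x, ids)
```

Because each channel pair is only rotated, token norms are unchanged, and attention scores between two tokens depend on their relative (t, h, w) offset, which here also encodes which segment each token came from.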
Where Pith is reading between the lines
- The same semantic interface could let researchers swap in newer MLLMs or diffusion backbones without retraining the entire system.
- The division of labor might extend to other generative domains such as audio synthesis or 3D scene creation where high-level plans guide low-level rendering.
- If the ViT space proves stable across models, it could become a standard latent protocol for connecting reasoning and synthesis modules in multimodal systems.
Load-bearing premise
Semantic representations in ViT embedding space form a sufficient and stable interface that lets the planner and renderer be trained separately and still produce high-quality output after light co-training.
What would settle it
Train the planner and renderer completely independently with no co-training at all and measure whether generated video quality falls substantially below jointly trained baselines on standard benchmarks.
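A small helper makes the pass/fail criterion of this test explicit. It is a sketch only: the 5% tolerance, the function names, and the assumption that higher scores are better are all hypothetical choices, and the numbers in the usage lines are placeholders, not measurements.

```python
def relative_drop(joint_score, independent_score):
    """Fractional drop of the no-co-training system relative to the joint baseline.

    Assumes higher scores are better and the joint score is positive.
    """
    if joint_score <= 0:
        raise ValueError("joint_score must be positive")
    return (joint_score - independent_score) / joint_score

def premise_holds(scores, tolerance=0.05):
    """scores maps benchmark name -> (joint, independent).

    Returns True only if the drop stays within the tolerance on every benchmark.
    """
    return all(relative_drop(j, i) <= tolerance for j, i in scores.values())

# Placeholder values for illustration only.
small_gap = {"benchmark_a": (1.0, 0.98)}
large_gap = {"benchmark_a": (1.0, 0.98), "benchmark_b": (1.0, 0.5)}
```

Requiring the tolerance to hold on every benchmark, rather than on average, keeps one strong result from hiding a collapse elsewhere.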
Original abstract
Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Bernini, a framework unifying MLLMs for semantic planning directly in ViT embedding space with a DiT-based diffusion renderer for video generation and editing. The planner predicts target semantics, while the renderer synthesizes pixels conditioned on the plan plus text features and (for editing) source VAE features. Components are trained separately with only light co-training; SA-3D RoPE is introduced for multi-input handling and chain-of-thought reasoning is added to the planner. The central claim is state-of-the-art performance on video generation and editing benchmarks with strong generalization from the MLLM's pretrained understanding.
Significance. If the empirical claims hold, the work offers a modular and training-efficient route to combine the reasoning strengths of MLLMs with the synthesis fidelity of diffusion models. The explicit separation of semantic planning from rendering, together with the use of pretrained components and minimal co-training, is a clear strength that could reduce compute while improving controllability and editing generalization. The SA-3D RoPE and chain-of-thought additions are concrete, reusable ideas.
major comments (2)
- [Abstract and §5] Abstract and §5 (Experiments): the manuscript asserts SOTA results across video generation and editing benchmarks yet the abstract supplies no quantitative metrics, baseline comparisons, or ablation tables. Without these data the central performance claim cannot be evaluated and remains load-bearing for acceptance.
- [§3.1 and §4] §3.1 (Planner) and §4 (Training): the claim that ViT semantic representations form a sufficient, stable interface permitting fully separate training plus only light co-training is central but unsupported by ablations that isolate the contribution of the semantic plan versus the VAE detail features or heavier joint optimization. The abstract notes VAE augmentation for detail preservation, which itself suggests the pure ViT plan may be insufficient for photorealistic temporal coherence.
minor comments (2)
- [Figure 2] Figure 2 or the method diagram would benefit from explicit arrows showing the exact conditioning path from planner output (ViT tokens) to the DiT renderer, including how SA-3D RoPE is injected.
- [§3.2] The notation for Segment-Aware 3D RoPE should be formalized with an equation rather than a prose description, so that the method can be reproduced.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments help clarify how to better present the core contributions. We address each major comment below and outline the corresponding revisions.
- Referee: [Abstract and §5] Abstract and §5 (Experiments): the manuscript asserts SOTA results across video generation and editing benchmarks yet the abstract supplies no quantitative metrics, baseline comparisons, or ablation tables. Without these data the central performance claim cannot be evaluated and remains load-bearing for acceptance.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will add specific metrics (e.g., the reported gains on standard video generation and editing benchmarks relative to the strongest baselines) while remaining within length constraints. The full tables, baselines, and ablations already appear in §5; the abstract update will make the central claim immediately verifiable without duplicating the experimental section.
  Revision: yes
- Referee: [§3.1 and §4] §3.1 (Planner) and §4 (Training): the claim that ViT semantic representations form a sufficient, stable interface permitting fully separate training plus only light co-training is central but unsupported by ablations that isolate the contribution of the semantic plan versus the VAE detail features or heavier joint optimization. The abstract notes VAE augmentation for detail preservation, which itself suggests the pure ViT plan may be insufficient for photorealistic temporal coherence.
  Authors: We appreciate the referee’s emphasis on isolating the interface contribution. The semantic plan supplies high-level structure, motion, and editing intent while VAE features are used only for source-detail preservation during editing; generation relies primarily on the plan plus text. To directly address the request, the revised manuscript will include new ablation experiments that (i) remove the semantic plan (text-only conditioning), (ii) compare separate training plus light co-training against heavier joint optimization, and (iii) quantify temporal coherence with and without the plan. These results will be added to §4 and §5.
  Revision: yes
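The three promised ablations can be laid out as a small run grid. This is a sketch under assumptions: the rebuttal does not say whether the conditioning and training axes will be fully crossed, and the labels and the metric name `temporal_coherence` are hypothetical placeholders for whatever the revised §5 reports.

```python
from itertools import product

# (i) with vs. without the semantic plan
CONDITIONING = ("plan+text", "text_only")
# (ii) separate training plus light co-training vs. heavier joint optimization
TRAINING = ("separate+light_cotrain", "heavy_joint")
# (iii) temporal coherence is measured on every run, so it is a metric, not an axis
METRICS = ("benchmark_score", "temporal_coherence")

def ablation_grid():
    """Enumerate every (conditioning, training) pair; assumes a full cross."""
    return [
        {"conditioning": c, "training": t, "metrics": METRICS}
        for c, t in product(CONDITIONING, TRAINING)
    ]

runs = ablation_grid()
```

Crossing the two axes yields four runs, so the effect of removing the plan can be read off under both training regimes rather than only the proposed one.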
Circularity Check
No significant circularity in derivation chain
The paper presents an architectural argument for dividing labor between an MLLM planner operating in ViT embedding space and a DiT renderer, with separate training plus light co-training as an empirical design choice that preserves pretrained capabilities. This is supported by benchmark results rather than any self-referential derivation, fitted parameter renamed as prediction, or load-bearing self-citation. The interface assumption is stated explicitly as a premise enabling the framework, not derived from the outputs themselves, and no equations or uniqueness theorems reduce the central claims to tautologies or prior author work by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan... Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.