pith. machine review for the scientific record.

arxiv: 2601.11194 · v2 · submitted 2026-01-16 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

ATATA: One Algorithm to Align Them All

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords rectified flow · joint transport · structural alignment · paired generation · image generation · video generation · 3D generation · multi-modal inference

The pith

Joint transport of segments in sample space aligns paired outputs from any Rectified Flow model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ATATA as a method for generating structurally aligned sample pairs across modalities by jointly transporting segments through Rectified Flow models. It contrasts this with codependent generation or slow Score Distillation Sampling, claiming faster inference while preserving alignment and visual quality. The approach layers on top of existing models in structured latent spaces and is tested on image, video, and 3D tasks. A sympathetic reader would care because faster aligned generation could simplify consistent outputs in creative pipelines without retraining base models.

Core claim

The central claim is that joint transport of a segment in sample space through a Rectified Flow model produces paired, structurally aligned samples of high visual quality. The method applies to arbitrary Rectified Flow models operating in a structured latent space and demonstrates superior structural alignment and visual quality for image and video generation, with comparable 3D quality at orders-of-magnitude higher speed than prior joint-inference baselines.

What carries the argument

Joint transport of a segment in sample space, which moves paired points together through the flow to enforce alignment.
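
A minimal sketch of this idea, assuming a hypothetical velocity_fn(x, t, prompt) wrapper around a pre-trained Rectified Flow model and plain Euler integration; the pairing scheme shown (two endpoints of a shared segment advanced on the same time grid) illustrates joint transport rather than reproducing the paper's exact algorithm.

    import torch

    def joint_segment_transport(velocity_fn, x_a, x_b, prompts, num_steps=50):
        """Illustrative joint transport: move two endpoints of a segment in
        sample space through the same Rectified Flow velocity field, sharing
        one Euler time grid so the pair stays coupled step by step.

        velocity_fn(x, t, prompt) -> tensor with the same shape as x
        (hypothetical wrapper around a pre-trained RF model).
        """
        ts = torch.linspace(1.0, 0.0, num_steps + 1)  # t = 1 (noise) down to t = 0 (data)
        for i in range(num_steps):
            t, t_next = ts[i].item(), ts[i + 1].item()
            dt = t_next - t  # negative step: integrating toward the data distribution
            # Identical schedule and step size for both endpoints of the segment.
            x_a = x_a + dt * velocity_fn(x_a, t, prompts[0])
            x_b = x_b + dt * velocity_fn(x_b, t, prompts[1])
        return x_a, x_b

    # Hypothetical initialization: endpoints of a short segment around a shared
    # noise sample, so the pair starts from correlated points in sample space.
    # z = torch.randn(1, 4, 64, 64); eps = 0.1 * torch.randn_like(z)
    # out_a, out_b = joint_segment_transport(model_velocity, z - eps, z + eps,
    #                                        ("a horse", "a horse skeleton"))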

If this is right

  • Faster inference than Score Distillation Sampling for aligned sample pairs.
  • High structural alignment across generated image and video pairs.
  • Comparable visual quality for 3D shapes at much greater speed.
  • Works on top of existing Rectified Flow models in structured latent space without retraining.
  • Improves state-of-the-art results for image and video generation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The segment-transport idea might transfer to other flow or diffusion models if their latent spaces admit similar pairing.
  • Speed gains could support interactive tools that require consistent multi-view or temporal outputs.
  • Joint transport may lower mode-collapse risk by constraining the sampling trajectory for paired points.
  • Similar segment mechanisms could address alignment tasks in text-conditioned or multi-modal generation beyond the tested domains.

Load-bearing premise

Joint transport of a segment in sample space on an arbitrary Rectified Flow model will preserve structural alignment and visual quality without additional training or adjustments.

What would settle it

Apply the method to a standard Rectified Flow image model and check whether the output pairs exhibit measurable structural misalignment or visible quality drop relative to independent sampling runs.
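
One way to make that check concrete, using SSIM on grayscale renders as a rough proxy for structural alignment (the paper's own metrics may differ); generate_pair and generate_independent below are hypothetical wrappers for the joint-transport and independent sampling procedures.

    import numpy as np
    from skimage.metrics import structural_similarity

    def alignment_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
        """Rough structural-alignment proxy: SSIM between grayscale versions
        of two images with float values in [0, 1]; higher means more aligned."""
        gray_a = img_a.mean(axis=-1)
        gray_b = img_b.mean(axis=-1)
        return float(structural_similarity(gray_a, gray_b, data_range=1.0))

    # Hypothetical experiment: joint transport vs. two independent sampling runs.
    # joint_a, joint_b = generate_pair(prompt_a, prompt_b)        # joint transport
    # indep_a, indep_b = generate_independent(prompt_a, prompt_b) # separate runs
    # print("joint SSIM:", alignment_score(joint_a, joint_b))
    # print("independent SSIM:", alignment_score(indep_a, indep_b))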

Figures

Figures reproduced from arXiv: 2601.11194 by Aibek Alanov, Boyi Pang, Evgeny Burnaev, Maksim Nakhodnov, Oleg Voynov, Peter Wonka, Ramil Khafizov, Savva Ignatyev, Vladimir Ippolitov, Xiaopeng Fan, Yurii Melnik.

Figure 1
Figure 1. Visualization of generated images, videos, and 3D shapes using our method. The left pair is (horse animal, horse skeleton), the … view at source ↗
Figure 2
Figure 2. Method. 4.1. Joint Inference with Rectified Flow Models: flow-matching models use time discretization to approximate trajectories along the velocity vector field vΘ(xt, t, c), which is parameterized by a neural network. Given a text embedding c and starting with a sample x ∼ N(0, I) taken from a Gaussian noise distribution, the sample xt1 at time step t1 can be used to calculate xt2 (where t1 > t2) with th… (see the sketch after the figure list) view at source ↗
Figure 3
Figure 3. Visualization of geometry preservation between two generated images. For each example, two images are blended into one with … view at source ↗
Figure 6
Figure 6. Visualization of the impact of different components of … view at source ↗
Figure 5
Figure 5. Visualization of geometry preservation between source … view at source ↗
Figure 7
Figure 7. User study image example. …transport. Consequently, using only intermediate points sampling can be less effective and even detrimental. Our primary aim here is not to optimize each intermediate variant, but to demonstrate that the endpoint of this sequential component removal (corresponding to the MatchDiffusion setup) exhibits degraded performance compared to our full method. E. Additional image gene… view at source ↗
Figure 8
Figure 8. Aligned image generation results … view at source ↗
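
The Figure 2 caption above describes the standard discretized update along the learned velocity field. A minimal sketch of that single Euler step, assuming the velocity model vΘ is exposed as a Python callable (illustrative, not the paper's implementation):

    import torch

    def rectified_flow_step(v_theta, x_t1, t1: float, t2: float, c):
        """One Euler step of a Rectified Flow / flow-matching sampler.

        Moves the sample from time t1 to an earlier time t2 (t1 > t2) along
        the velocity field v_theta(x, t, c), where c is a text embedding:
            x_t2 = x_t1 + (t2 - t1) * v_theta(x_t1, t1, c)
        """
        assert t1 > t2, "time must decrease toward the data distribution"
        return x_t1 + (t2 - t1) * v_theta(x_t1, t1, c)

    # Trajectories start from Gaussian noise in the model's latent space,
    # e.g. x = torch.randn(1, 16, 64, 64), and apply this step repeatedly
    # over a decreasing time grid until t = 0.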
read the original abstract

We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ATATA, a multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. The core idea is joint transport of a segment in sample space applied to an arbitrary pre-trained RF model operating in structured latent space, without additional training or post-hoc adjustments. The paper claims faster inference than Score Distillation Sampling, high structural alignment and visual quality on image/video/3D tasks, SOTA improvements for image and video generation, and comparable 3D quality at orders-of-magnitude higher speed.

Significance. If the central claims hold, the work would offer a practically significant, training-free method for efficient paired sample generation across modalities. By avoiding the computational cost and mode-collapse issues of SDS while building directly on existing RF models, it could enable faster pipelines for aligned image-video-3D data synthesis, provided the alignment guarantee is robust.

major comments (2)
  1. [Abstract] Abstract: the central claim that joint segment transport on arbitrary pre-trained RF models 'automatically' yields high structural alignment without explicit coupling (shared noise schedule, cross-attention, or latent correspondence loss) is load-bearing yet unsupported by any derivation or mechanism in the description; RF models are trained only on marginals, so independent trajectories can decouple and the no-additional-training guarantee risks collapse to post-hoc pairing.
  2. [Abstract] Abstract: the assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without any quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'one algorithm to align them all' is informal and should be replaced with a precise description of the scope (image/video/3D).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment point by point below, providing clarifications on the underlying mechanisms and evidence while outlining planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that joint segment transport on arbitrary pre-trained RF models 'automatically' yields high structural alignment without explicit coupling (shared noise schedule, cross-attention, or latent correspondence loss) is load-bearing yet unsupported by any derivation or mechanism in the description; RF models are trained only on marginals, so independent trajectories can decouple and the no-additional-training guarantee risks collapse to post-hoc pairing.

    Authors: The joint segment transport operates by selecting and transporting a shared segment in sample space using the deterministic velocity field of the pre-trained Rectified Flow model. Because RF trajectories are straight-line paths in expectation and the same segment is mapped consistently across paired samples, structural alignment is preserved at inference without requiring additional coupling terms, shared noise schedules, or losses. This follows directly from the marginal training of RF models combined with the joint application of the transport map. We acknowledge that the abstract would benefit from a concise reference to this property and will add a brief explanatory sentence plus a pointer to the methods derivation in the revised version. revision: partial

  2. Referee: [Abstract] Abstract: the assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without any quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.

    Authors: The abstract is intended as a concise summary; the full manuscript contains the supporting quantitative evaluations, including alignment metrics (e.g., structural similarity scores), visual quality measures (FID, CLIP scores), error bars from repeated runs, dataset specifications, and controlled comparisons against editing-based and joint-inference baselines in the Experiments section. To improve readability and address the concern directly, we will incorporate key quantitative highlights (e.g., specific SOTA improvements and alignment scores) into the abstract during revision. revision: yes

Circularity Check

0 steps flagged

No circularity: joint transport presented as independent construction on arbitrary RF models

full rationale

The paper introduces joint transport of a segment in sample space as a new algorithm for paired aligned samples on top of arbitrary pre-trained Rectified Flow models. No equations, derivations, or self-citations are shown that reduce the alignment claim to a fitted parameter, self-definition, or load-bearing prior result by the same authors. The method is described as a direct, training-free inference procedure that preserves structure by construction of the segment transport, without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; no explicit fitted constants or new postulated objects are named.

pith-pipeline@v0.9.0 · 5549 in / 1146 out tokens · 53285 ms · 2026-05-16T13:42:54.758303+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 12 internal anchors
