pith. machine review for the scientific record.

arxiv: 2601.11194 · v2 · submitted 2026-01-16 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

ATATA: One Algorithm to Align Them All

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords rectified flow · joint transport · structural alignment · paired generation · image generation · video generation · 3D generation · multi-modal inference

The pith

Joint transport of segments in sample space aligns paired outputs from any Rectified Flow model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ATATA as a method for generating structurally aligned sample pairs across modalities by jointly transporting segments through Rectified Flow models. It contrasts this with codependent generation or slow Score Distillation Sampling, claiming faster inference while preserving alignment and visual quality. The approach layers on top of existing models in structured latent spaces and is tested on image, video, and 3D tasks. A sympathetic reader would care because faster aligned generation could simplify consistent outputs in creative pipelines without retraining base models.

Core claim

The central claim is that joint transport of a segment in sample space through a Rectified Flow model produces paired, structurally aligned samples of high visual quality. The method applies to arbitrary Rectified Flow models operating in a structured latent space and demonstrates superior structural alignment and visual quality for image and video generation, with comparable 3D quality at orders-of-magnitude higher speed than prior joint-inference baselines.

What carries the argument

Joint transport of a segment in sample space, which moves paired points together through the flow to enforce alignment.
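
A minimal sketch of this idea, assuming a hypothetical velocity_fn(x, t, prompt) wrapper around a pre-trained Rectified Flow model and plain Euler integration; the pairing scheme shown (two endpoints of a shared segment advanced on the same time grid) illustrates joint transport rather than reproducing the paper's exact algorithm.

    import torch

    def joint_segment_transport(velocity_fn, x_a, x_b, prompts, num_steps=50):
        """Illustrative joint transport: move two endpoints of a segment in
        sample space through the same Rectified Flow velocity field, sharing
        one Euler time grid so the pair stays coupled step by step.

        velocity_fn(x, t, prompt) -> tensor with the same shape as x
        (hypothetical wrapper around a pre-trained RF model).
        """
        ts = torch.linspace(1.0, 0.0, num_steps + 1)  # t = 1 (noise) down to t = 0 (data)
        for i in range(num_steps):
            t, t_next = ts[i].item(), ts[i + 1].item()
            dt = t_next - t  # negative step: integrating toward the data distribution
            # Identical schedule and step size for both endpoints of the segment.
            x_a = x_a + dt * velocity_fn(x_a, t, prompts[0])
            x_b = x_b + dt * velocity_fn(x_b, t, prompts[1])
        return x_a, x_b

    # Hypothetical initialization: endpoints of a short segment around a shared
    # noise sample, so the pair starts from correlated points in sample space.
    # z = torch.randn(1, 4, 64, 64); eps = 0.1 * torch.randn_like(z)
    # out_a, out_b = joint_segment_transport(model_velocity, z - eps, z + eps,
    #                                        ("a horse", "a horse skeleton"))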

If this is right

  • Faster inference than Score Distillation Sampling for aligned sample pairs.
  • High structural alignment across generated image and video pairs.
  • Comparable visual quality for 3D shapes at much greater speed.
  • Works on top of existing Rectified Flow models in structured latent space without retraining.
  • Improves state-of-the-art results for image and video generation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The segment-transport idea might transfer to other flow or diffusion models if their latent spaces admit similar pairing.
  • Speed gains could support interactive tools that require consistent multi-view or temporal outputs.
  • Joint transport may lower mode-collapse risk by constraining the sampling trajectory for paired points.
  • Similar segment mechanisms could address alignment tasks in text-conditioned or multi-modal generation beyond the tested domains.

Load-bearing premise

Joint transport of a segment in sample space on an arbitrary Rectified Flow model will preserve structural alignment and visual quality without additional training or adjustments.

What would settle it

Apply the method to a standard Rectified Flow image model and check whether the output pairs exhibit measurable structural misalignment or visible quality drop relative to independent sampling runs.
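
One way to make that check concrete, using SSIM on grayscale renders as a rough proxy for structural alignment (the paper's own metrics may differ); generate_pair and generate_independent below are hypothetical wrappers for the joint-transport and independent sampling procedures.

    import numpy as np
    from skimage.metrics import structural_similarity

    def alignment_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
        """Rough structural-alignment proxy: SSIM between grayscale versions
        of two images with float values in [0, 1]; higher means more aligned."""
        gray_a = img_a.mean(axis=-1)
        gray_b = img_b.mean(axis=-1)
        return float(structural_similarity(gray_a, gray_b, data_range=1.0))

    # Hypothetical experiment: joint transport vs. two independent sampling runs.
    # joint_a, joint_b = generate_pair(prompt_a, prompt_b)        # joint transport
    # indep_a, indep_b = generate_independent(prompt_a, prompt_b) # separate runs
    # print("joint SSIM:", alignment_score(joint_a, joint_b))
    # print("independent SSIM:", alignment_score(indep_a, indep_b))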

Figures

Figures reproduced from arXiv: 2601.11194 by Aibek Alanov, Boyi Pang, Evgeny Burnaev, Maksim Nakhodnov, Oleg Voynov, Peter Wonka, Ramil Khafizov, Savva Ignatyev, Vladimir Ippolitov, Xiaopeng Fan, Yurii Melnik.

Figure 1
Figure 1. Visualization of generated images, videos, and 3D shapes using our method. The left pair is (horse animal, horse skeleton), the … view at source ↗
Figure 2
Figure 2. Method. 4.1. Joint Inference with Rectified Flow Models: flow-matching models use time discretization to approximate trajectories along the velocity vector field vΘ(xt, t, c), which is parameterized by a neural network. Given a text embedding c and starting with a sample x ∼ N(0, I) taken from a Gaussian noise distribution, the sample xt1 at time step t1 can be used to calculate xt2 (where t1 > t2) with th… (see the sketch after the figure list) view at source ↗
Figure 3
Figure 3. Visualization of geometry preservation between two generated images. For each example, two images are blended into one with … view at source ↗
Figure 6
Figure 6. Visualization of the impact of different components of … view at source ↗
Figure 5
Figure 5. Visualization of geometry preservation between source … view at source ↗
Figure 7
Figure 7. User study image example. …transport. Consequently, using only intermediate points sampling can be less effective and even detrimental. Our primary aim here is not to optimize each intermediate variant, but to demonstrate that the endpoint of this sequential component removal (corresponding to the MatchDiffusion setup) exhibits degraded performance compared to our full method. E. Additional image gene… view at source ↗
Figure 8
Figure 8. Aligned image generation results … view at source ↗
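
The Figure 2 caption above describes the standard discretized update along the learned velocity field. A minimal sketch of that single Euler step, assuming the velocity model vΘ is exposed as a Python callable (illustrative, not the paper's implementation):

    import torch

    def rectified_flow_step(v_theta, x_t1, t1: float, t2: float, c):
        """One Euler step of a Rectified Flow / flow-matching sampler.

        Moves the sample from time t1 to an earlier time t2 (t1 > t2) along
        the velocity field v_theta(x, t, c), where c is a text embedding:
            x_t2 = x_t1 + (t2 - t1) * v_theta(x_t1, t1, c)
        """
        assert t1 > t2, "time must decrease toward the data distribution"
        return x_t1 + (t2 - t1) * v_theta(x_t1, t1, c)

    # Trajectories start from Gaussian noise in the model's latent space,
    # e.g. x = torch.randn(1, 16, 64, 64), and apply this step repeatedly
    # over a decreasing time grid until t = 0.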
read the original abstract

We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ATATA, a multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. The core idea is joint transport of a segment in sample space applied to an arbitrary pre-trained RF model operating in structured latent space, without additional training or post-hoc adjustments. The paper claims faster inference than Score Distillation Sampling, high structural alignment and visual quality on image/video/3D tasks, SOTA improvements for image and video generation, and comparable 3D quality at orders-of-magnitude higher speed.

Significance. If the central claims hold, the work would offer a practically significant, training-free method for efficient paired sample generation across modalities. By avoiding the computational cost and mode-collapse issues of SDS while building directly on existing RF models, it could enable faster pipelines for aligned image-video-3D data synthesis, provided the alignment guarantee is robust.

major comments (2)
  1. [Abstract] Abstract: the central claim that joint segment transport on arbitrary pre-trained RF models 'automatically' yields high structural alignment without explicit coupling (shared noise schedule, cross-attention, or latent correspondence loss) is load-bearing yet unsupported by any derivation or mechanism in the description; RF models are trained only on marginals, so independent trajectories can decouple and the no-additional-training guarantee risks collapse to post-hoc pairing.
  2. [Abstract] Abstract: the assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without any quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'one algorithm to align them all' is informal and should be replaced with a precise description of the scope (image/video/3D).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment point by point below, providing clarifications on the underlying mechanisms and evidence while outlining planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that joint segment transport on arbitrary pre-trained RF models 'automatically' yields high structural alignment without explicit coupling (shared noise schedule, cross-attention, or latent correspondence loss) is load-bearing yet unsupported by any derivation or mechanism in the description; RF models are trained only on marginals, so independent trajectories can decouple and the no-additional-training guarantee risks collapse to post-hoc pairing.

    Authors: The joint segment transport operates by selecting and transporting a shared segment in sample space using the deterministic velocity field of the pre-trained Rectified Flow model. Because RF trajectories are straight-line paths in expectation and the same segment is mapped consistently across paired samples, structural alignment is preserved at inference without requiring additional coupling terms, shared noise schedules, or losses. This follows directly from the marginal training of RF models combined with the joint application of the transport map. We acknowledge that the abstract would benefit from a concise reference to this property and will add a brief explanatory sentence plus a pointer to the methods derivation in the revised version. revision: partial

  2. Referee: [Abstract] Abstract: the assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without any quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.

    Authors: The abstract is intended as a concise summary; the full manuscript contains the supporting quantitative evaluations, including alignment metrics (e.g., structural similarity scores), visual quality measures (FID, CLIP scores), error bars from repeated runs, dataset specifications, and controlled comparisons against editing-based and joint-inference baselines in the Experiments section. To improve readability and address the concern directly, we will incorporate key quantitative highlights (e.g., specific SOTA improvements and alignment scores) into the abstract during revision. revision: yes

Circularity Check

0 steps flagged

No circularity: joint transport presented as independent construction on arbitrary RF models

full rationale

The paper introduces joint transport of a segment in sample space as a new algorithm for paired aligned samples on top of arbitrary pre-trained Rectified Flow models. No equations, derivations, or self-citations are shown that reduce the alignment claim to a fitted parameter, self-definition, or load-bearing prior result by the same authors. The method is described as a direct, training-free inference procedure that preserves structure by construction of the segment transport, without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; no explicit fitted constants or new postulated objects are named.

pith-pipeline@v0.9.0 · 5549 in / 1146 out tokens · 53285 ms · 2026-05-16T13:42:54.758303+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 12 internal anchors
