Recognition: 2 Lean theorem links
ATATA: One Algorithm to Align Them All
Pith reviewed 2026-05-16 13:42 UTC · model grok-4.3
The pith
Joint transport of segments in sample space aligns paired outputs from any Rectified Flow model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that joint transport of a segment in sample space on Rectified Flow models produces paired, structurally aligned samples of high visual quality. The method applies to arbitrary Rectified Flow models operating in a structured latent space and demonstrates superior structural alignment and visual quality for image and video generation, while achieving comparable 3D quality at orders-of-magnitude higher speed than prior joint-inference baselines.
What carries the argument
Joint transport of a segment in sample space, which moves paired points together through the flow to enforce alignment.
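The mechanism can be sketched in a few lines. The velocity field below is a toy linear stand-in for a pretrained Rectified Flow network (the real v_θ would be a learned model); the point is only that integrating both segment endpoints with the same deterministic ODE scales their difference uniformly rather than scrambling it:

```python
import numpy as np

def toy_velocity(x, t):
    # Toy stand-in for a pretrained Rectified Flow velocity field v_theta(x, t);
    # it simply drifts every point toward a fixed target.
    target = np.ones_like(x)
    return target - x

def transport_pair(x_a, x_b, velocity, n_steps=50):
    """Integrate both segment endpoints with the same deterministic
    Euler scheme, so they move through the flow together."""
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        x_a = x_a + dt * velocity(x_a, t)
        x_b = x_b + dt * velocity(x_b, t)
    return x_a, x_b

rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=4), rng.normal(size=4)
y_a, y_b = transport_pair(x_a, x_b, toy_velocity)
# With this linear field, y_b - y_a equals (x_b - x_a) * (1 - dt)**n_steps:
# the pair contracts toward the target, but its relative structure survives.
```

With a neural velocity field the contraction would not be exactly linear, but the shared deterministic integration is what the review identifies as the alignment mechanism.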
If this is right
- Faster inference than Score Distillation Sampling for aligned sample pairs.
- High structural alignment across generated image and video pairs.
- Comparable visual quality for 3D shapes at much greater speed.
- Works on top of existing Rectified Flow models in structured latent space without retraining.
- Improves state-of-the-art results for image and video generation pipelines.
Where Pith is reading between the lines
- The segment-transport idea might transfer to other flow or diffusion models if their latent spaces admit similar pairing.
- Speed gains could support interactive tools that require consistent multi-view or temporal outputs.
- Joint transport may lower mode-collapse risk by constraining the sampling trajectory for paired points.
- Similar segment mechanisms could address alignment tasks in text-conditioned or multi-modal generation beyond the tested domains.
Load-bearing premise
Joint transport of a segment in sample space on an arbitrary Rectified Flow model will preserve structural alignment and visual quality without additional training or adjustments.
What would settle it
Apply the method to a standard Rectified Flow image model and check whether the output pairs exhibit measurable structural misalignment or visible quality drop relative to independent sampling runs.
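That check can be prototyped on a toy flow before touching a real model. The velocity field below is a hypothetical stand-in for the pretrained network; the gap between outputs of joint segment transport is compared against the gap between two fully independent sampling runs:

```python
import numpy as np

def toy_velocity(x, t):
    # Hypothetical stand-in for the pretrained RF velocity field.
    return np.ones_like(x) - x

def sample(x0, n_steps=100):
    # Deterministic Euler integration of the flow from an initial noise draw.
    dt = 1.0 / n_steps
    x = x0
    for step in range(n_steps):
        x = x + dt * toy_velocity(x, step * dt)
    return x

rng = np.random.default_rng(1)
base = rng.normal(size=8)
delta = 0.1 * rng.normal(size=8)  # short segment in sample space

# Joint: both outputs come from endpoints of one segment.
joint_gap = np.linalg.norm(sample(base + delta) - sample(base))

# Independent: each output starts from unrelated noise.
indep_gap = np.linalg.norm(sample(rng.normal(size=8)) - sample(rng.normal(size=8)))
# joint_gap should be far smaller than indep_gap if segment pairing
# preserves structure; on a real RF model one would measure the same
# contrast with an alignment metric instead of a raw norm.
```

The same comparison on a real image RF model, with a structural metric in place of the Euclidean norm, is the experiment the pith proposes.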
Original abstract
We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ATATA, a multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. The core idea is joint transport of a segment in sample space applied to an arbitrary pre-trained RF model operating in structured latent space, without additional training or post-hoc adjustments. The paper claims faster inference than Score Distillation Sampling, high structural alignment and visual quality on image/video/3D tasks, SOTA improvements for image and video generation, and comparable 3D quality at orders-of-magnitude higher speed.
Significance. If the central claims hold, the work would offer a practically significant, training-free method for efficient paired sample generation across modalities. By avoiding the computational cost and mode-collapse issues of SDS while building directly on existing RF models, it could enable faster pipelines for aligned image-video-3D data synthesis, provided the alignment guarantee is robust.
major comments (2)
- [Abstract] The central claim, that joint segment transport on arbitrary pre-trained RF models automatically yields high structural alignment without explicit coupling (a shared noise schedule, cross-attention, or a latent correspondence loss), is load-bearing yet unsupported by any derivation or mechanism in the description. RF models are trained only on marginals, so independent trajectories can decouple, and the no-additional-training guarantee risks collapsing to post-hoc pairing.
- [Abstract] The assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.
minor comments (1)
- [Abstract] The phrase 'one algorithm to align them all' is informal and should be replaced with a precise statement of scope (image, video, and 3D generation).
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment point by point below, providing clarifications on the underlying mechanisms and evidence while outlining planned revisions to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central claim, that joint segment transport on arbitrary pre-trained RF models automatically yields high structural alignment without explicit coupling (a shared noise schedule, cross-attention, or a latent correspondence loss), is load-bearing yet unsupported by any derivation or mechanism in the description. RF models are trained only on marginals, so independent trajectories can decouple, and the no-additional-training guarantee risks collapsing to post-hoc pairing.
Authors: The joint segment transport operates by selecting and transporting a shared segment in sample space using the deterministic velocity field of the pre-trained Rectified Flow model. Because RF trajectories are straight-line paths in expectation and the same segment is mapped consistently across paired samples, structural alignment is preserved at inference without requiring additional coupling terms, shared noise schedules, or losses. This follows directly from the marginal training of RF models combined with the joint application of the transport map. We acknowledge that the abstract would benefit from a concise reference to this property and will add a brief explanatory sentence plus a pointer to the methods derivation in the revised version. revision: partial
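As a hedged illustration (a sketch, not the paper's implementation), the anchor-velocity form quoted later in this review, v_anchor = v_Θ((x_a + x_b)/2, t, (c_a + c_b)/2), can be mocked with a toy conditional field. Because both endpoints receive the identical anchor increment at every step, their difference is exactly invariant:

```python
import numpy as np

def v_theta(x, t, c):
    # Toy conditional velocity field standing in for a pretrained RF model;
    # it drifts the state toward the condition vector c.
    return c - x

def joint_anchor_transport(x_a, x_b, c_a, c_b, n_steps=50):
    """Move both endpoints by one shared anchor velocity evaluated at the
    segment midpoint with the averaged condition, so the segment is
    translated rigidly through sample space."""
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = step * dt
        v_anchor = v_theta((x_a + x_b) / 2.0, t, (c_a + c_b) / 2.0)
        x_a = x_a + dt * v_anchor
        x_b = x_b + dt * v_anchor
    return x_a, x_b

rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=4), rng.normal(size=4)
y_a, y_b = joint_anchor_transport(x_a, x_b, np.zeros(4), np.ones(4))
# The shared increment means y_b - y_a equals x_b - x_a exactly.
```

Rigid translation is the extreme case; the referee's concern about decoupling applies to any scheme that evaluates velocities independently per endpoint.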
-
Referee: [Abstract] The assertions of 'high degree of structural alignment', 'high visual quality', and 'improves the state-of-the-art' are presented without quantitative metrics, error bars, dataset details, or experimental controls, making it impossible to assess whether the results actually support the SOTA and alignment claims.
Authors: The abstract is intended as a concise summary; the full manuscript contains the supporting quantitative evaluations, including alignment metrics (e.g., structural similarity scores), visual quality measures (FID, CLIP scores), error bars from repeated runs, dataset specifications, and controlled comparisons against editing-based and joint-inference baselines in the Experiments section. To improve readability and address the concern directly, we will incorporate key quantitative highlights (e.g., specific SOTA improvements and alignment scores) into the abstract during revision. revision: yes
Circularity Check
No circularity: joint transport presented as independent construction on arbitrary RF models
Full rationale
The paper introduces joint transport of a segment in sample space as a new algorithm for paired aligned samples on top of arbitrary pre-trained Rectified Flow models. No equations, derivations, or self-citations are shown that reduce the alignment claim to a fitted parameter, self-definition, or load-bearing prior result by the same authors. The method is described as a direct, training-free inference procedure that preserves structure by construction of the segment transport, without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained and does not collapse to its inputs.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
joint transport of a segment in the sample space... transporting a distribution of samples on the line segment [xa, xb]... restore the linear structure... smoothness regularization on ||xb(t) − xa(t)||
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
velocity-guided transport of a segment in latent space... anchor velocity v_anchor = v_Θ((xa + xb)/2, t, (ca + cb)/2)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.