UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
Pith reviewed 2026-05-12 02:48 UTC · model grok-4.3
The pith
Unifying geometric guidance at representation, architecture, and loss levels lets video models edit images under new camera poses with less drift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that fragmented geometric guidance is the root cause of instability in video-model-based camera-controllable editing, and that injecting unified guidance at the representation, architecture, and loss levels jointly resolves it. At the representation level, a frame-decoupled geometric reference injection supplies cross-view context. At the architecture level, geometric anchor attention aligns multi-view features. At the loss level, a trajectory-endpoint supervision strategy explicitly reinforces the structural fidelity of target views. Experiments across public benchmarks with both extensive and limited camera motion show the resulting outputs exceed prior methods in both visual quality and geometric consistency.
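The summary states the loss-level mechanism only in words. As a reading aid, here is a minimal PyTorch-style sketch of what a trajectory-endpoint geometric supervision term could look like. Everything in it is an assumption: the point-map formulation, the names (`pointmap_head`, `target_pointmap`, `lambda_geo`), and the weighting are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unigeo_style_loss(denoise_pred, denoise_target, frame_feats,
                      pointmap_head, target_pointmap, lambda_geo=0.1):
    """Hypothetical composite loss: a standard denoising objective over all
    frames plus a geometric term applied only at the trajectory endpoint.

    denoise_pred / denoise_target: (B, T, C, H, W) noise or velocity tensors.
    frame_feats: (B, T, D, h, w) intermediate features fed to a point-map head.
    target_pointmap: (B, 3, h, w) reference geometry for the final (target) view.
    """
    # Standard video-diffusion regression loss over the whole trajectory.
    loss_denoise = F.mse_loss(denoise_pred, denoise_target)

    # Endpoint-only geometric supervision: decode a point map from the last
    # frame's features and penalize deviation from the reference geometry.
    endpoint_pm = pointmap_head(frame_feats[:, -1])  # (B, 3, h, w)
    loss_geo = F.l1_loss(endpoint_pm, target_pointmap)

    return loss_denoise + lambda_geo * loss_geo
```

Supervising only the endpoint, rather than every frame, would match the paper's framing: intermediate views ride the video prior, while the target view gets an explicit structural constraint.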
What carries the argument
The three-level unified geometric guidance system that combines frame-decoupled reference injection for context, geometric anchor attention for feature alignment, and trajectory-endpoint supervision for fidelity.
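Likewise for the representation level: the review never specifies how the frame-decoupled injection is wired, but one plausible reading is that each frame receives its own geometric reference latent (e.g., an encoded point-cloud render under that frame's camera pose) rather than one shared global condition. A hedged sketch under that assumption, with all names hypothetical:

```python
import torch

def inject_frame_decoupled_refs(noisy_latents, geo_refs):
    """Hypothetical representation-level injection: concatenate a per-frame
    geometric reference latent onto each frame's noisy latent, so every
    view carries its own cross-view geometry context.

    noisy_latents: (B, T, C, H, W) video latents being denoised.
    geo_refs:      (B, T, Cg, H, W) per-frame geometry encodings, e.g.
                   encoded renders of a point cloud under each camera pose.
    """
    assert noisy_latents.shape[:2] == geo_refs.shape[:2], \
        "one geometric reference per frame (frame-decoupled, not global)"
    # Channel-wise concat keeps references decoupled across frames; the
    # backbone's first conv is assumed widened to accept C + Cg channels.
    return torch.cat([noisy_latents, geo_refs], dim=2)
```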
If this is right
- The unified approach outperforms existing methods on public benchmarks for both large and small camera motions.
- Geometric drift and structural degradation are reduced under continuous camera movement.
- Cross-view consistency is maintained more reliably because guidance acts at every level that shapes the output.
Where Pith is reading between the lines
- The same multi-level unification pattern could be tested on other tasks that require multi-view consistency, such as video prediction or light-field rendering.
- Extending the supervision to longer sequences would check whether the stability scales to extended camera paths not covered in current benchmarks.
- Pairing the framework with real-time pose estimation could enable interactive editing sessions where users freely move the virtual camera.
Load-bearing premise
That fragmented guidance is the main driver of drift and that adding unified injections at precisely these three levels will stabilize output without creating fresh inconsistencies or demanding heavy retuning.
What would settle it
A controlled test on long or rapid camera trajectories where the three-level guidance still produces measurable geometric drift or structural degradation comparable to earlier methods.
Original abstract
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.
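The abstract names geometric anchor attention without defining it. One plausible reading is a cross-attention in which tokens from every view query a shared bank of geometry-derived anchor tokens; the sketch below follows that assumption, and every module and tensor name in it is hypothetical rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GeometricAnchorAttention(nn.Module):
    """Hypothetical architecture-level module: multi-view features query a
    shared bank of geometric anchor tokens, pulling every view toward the
    same 3D evidence instead of letting views drift independently."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: torch.Tensor, anchor_tokens: torch.Tensor):
        # view_tokens:   (B, T * N, D) tokens from all frames, flattened.
        # anchor_tokens: (B, A, D) geometry-derived anchors (e.g. encoded
        #                point-cloud features) shared across all views.
        aligned, _ = self.attn(query=self.norm(view_tokens),
                               key=anchor_tokens, value=anchor_tokens)
        return view_tokens + aligned  # residual keeps the backbone intact
```

Because every view attends to the same anchors, agreement is enforced through a common reference rather than through pairwise view-to-view attention, which is one way a module like this could curb drift.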
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes UniGeo, a camera-controllable image editing framework that leverages video models to address geometric drift under continuous camera motion. It unifies geometric guidance at three levels: representation (via a frame-decoupled geometric reference injection mechanism), architecture (via geometric anchor attention), and loss function (via a trajectory-endpoint geometric supervision strategy). The paper claims this yields superior visual quality and geometric consistency compared to prior methods based on fragmented guidance and image diffusion models, supported by comprehensive experiments on public benchmarks covering extensive and limited camera motion settings.
Significance. If the empirical results hold, the work could advance camera-controllable editing by demonstrating how video priors can be stabilized through explicit multi-level geometric unification rather than relying on fragmented cues. The three concrete mechanisms (frame-decoupled reference injection, geometric anchor attention, and trajectory-endpoint supervision) represent specific, potentially reusable contributions, and the authors deserve credit for targeting the multi-level structure of generative models.
Major comments (2)
- [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.
- [Experiments] No ablation compares a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This comparison is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.
Minor comments (1)
- [Abstract] The distinction between 'extensive and limited camera motion settings' is referenced but not defined with specific thresholds or examples; a definition would aid reader understanding.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed review. We address the major comments point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.
Point-by-point responses
- Referee: [Abstract] The claim that 'UniGeo significantly outperforms existing methods in both visual quality and geometric consistency' is presented without quantitative metrics, baseline details, error bars, or ablation results, leaving the central performance claim without visible supporting evidence.
  Authors: We agree that the abstract would benefit from brief supporting evidence to contextualize the performance claim. In the revised manuscript, we have updated the abstract to include key quantitative metrics (e.g., specific improvements in PSNR, SSIM, and geometric consistency scores) and a concise reference to the main baselines and experimental settings. Detailed tables with error bars, full ablations, and per-scenario breakdowns remain in the Experiments section, as they exceed the length constraints of an abstract while preserving its summary nature. Revision: yes.
- Referee: [Experiments] No ablation compares a video-model baseline using only representation-level injection against the full three-level UniGeo on the reported geometric consistency metrics. This comparison is load-bearing for the claim that fragmented guidance remains the dominant failure mode and that joint injection at all three levels is required to avoid drift or new inconsistencies.
  Authors: This is a fair and substantive point. Our experiments compare UniGeo against prior fragmented-guidance methods (both image- and video-based) and include component-wise ablations, but we did not explicitly report a video-model baseline limited to representation-level injection evaluated on the geometric consistency metrics. To directly address the concern and reinforce the necessity of multi-level unification, we will add this ablation to the revised Experiments section, including quantitative results on the relevant metrics showing that representation-level injection alone is insufficient to prevent drift under continuous camera motion; a variant grid is sketched below. Revision: yes.
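To make the committed ablation concrete, here is a hedged sketch of the variant grid the rebuttal implies: the same video backbone with the three guidance levels toggled independently. The flag layout and the placeholder metric hook are assumptions, not the authors' protocol.

```python
from itertools import compress

# Hypothetical ablation grid: one fixed video backbone, with the three
# guidance levels toggled independently.
LEVELS = ("representation", "architecture", "loss")
VARIANTS = [
    (True,  False, False),  # referee's requested baseline: injection only
    (True,  True,  False),  # + geometric anchor attention
    (True,  False, True),   # + trajectory-endpoint supervision
    (True,  True,  True),   # full three-level UniGeo
]

for flags in VARIANTS:
    enabled = "+".join(compress(LEVELS, flags))
    # A metric hook would run here; "geometric consistency" stands in for
    # whatever suite the paper reports (e.g., cross-view reprojection error).
    print(f"evaluate variant [{enabled}] on geometric-consistency metrics")
```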
Circularity Check
No circularity: new mechanisms validated on external benchmarks
Full rationale
The paper proposes three distinct new components (frame-decoupled geometric reference injection, geometric anchor attention, and trajectory-endpoint supervision) to unify guidance across the representation, architecture, and loss levels in a video model. These are introduced as original contributions rather than derived from or equivalent to prior inputs. Performance is assessed via comparisons on public benchmarks under varied camera-motion settings, with no equations or fitted parameters renamed as predictions and no self-citation chains that reduce the central claims to the paper's own definitions or data subsets. The derivation chain remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Video models provide continuous viewpoint priors that can be leveraged for camera-controllable editing.
Invented entities (3)
- frame-decoupled geometric reference injection mechanism (no independent evidence)
- geometric anchor attention (no independent evidence)
- trajectory-endpoint geometric supervision strategy (no independent evidence)