Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
Pith reviewed 2026-05-18 22:13 UTC · model grok-4.3
The pith
Two single-step diffusion models reinforce each other via cycle consistency to unify forward and inverse rendering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ouroboros is a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. A cycle consistency mechanism ensures coherence between the outputs of the two models. This construction extends intrinsic decomposition to both indoor and outdoor scenes, produces state-of-the-art results, and runs at substantially higher speed than other diffusion-based methods. The same pair of models can be applied to video decomposition in a training-free manner to reduce temporal inconsistency while preserving per-frame quality.
What carries the argument
The cycle consistency mechanism that links the forward-rendering and inverse-rendering single-step diffusion models so their outputs reinforce each other during training and inference.
If this is right
- State-of-the-art performance on intrinsic decomposition across diverse indoor and outdoor scenes.
- Substantially faster inference speed than existing multi-step diffusion approaches.
- Direct transfer to video sequences without additional training, reducing temporal inconsistency.
- Coherent outputs that remain aligned when the forward and inverse tasks are chained.
Where Pith is reading between the lines
- The single-step design could support real-time graphics pipelines where previous diffusion methods were too slow.
- The same mutual-reinforcement pattern might generalize to other paired tasks such as depth estimation and view synthesis.
- Testing the models on scenes with extreme dynamic range would reveal whether cycle consistency preserves fine detail under strong lighting changes.
- Deployment in robotics or AR could become simpler if only one pair of models is needed instead of separate forward and inverse networks.
Load-bearing premise
The cycle consistency mechanism can be enforced during training and inference without introducing new artifacts or breaking coherence in complex real-world scenes.
What would settle it
Visible cycle inconsistencies, such as mismatched lighting or geometry when the forward output is fed back through the inverse model on held-out real scenes, would show the mechanism fails.
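The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: `toy_inverse`, `toy_forward`, and `biased_inverse` are hypothetical stand-ins for the two single-step models, images are flat lists of pixel intensities, and the cycle is run from the image side (image to intrinsics and back). The opposite direction would be measured the same way.

```python
from math import sqrt

def cycle_residual(image, inverse_fn, forward_fn):
    """Mean squared error between an image and its re-rendering."""
    intrinsics = inverse_fn(image)       # image -> (albedo, shading) pairs
    rerendered = forward_fn(intrinsics)  # (albedo, shading) pairs -> image
    return sum((a - b) ** 2 for a, b in zip(image, rerendered)) / len(image)

# Hypothetical stand-ins for the two models (assumptions for illustration only).
def toy_inverse(image):
    # Split each pixel v into albedo * shading with albedo = shading = sqrt(v).
    return [(sqrt(v), sqrt(v)) for v in image]

def toy_forward(intrinsics):
    # Lambertian-style recomposition: pixel = albedo * shading.
    return [albedo * shading for albedo, shading in intrinsics]

def biased_inverse(image):
    # Same split, but the albedo is overestimated by 10%.
    return [(1.1 * sqrt(v), sqrt(v)) for v in image]

image = [0.04, 0.25, 0.49, 0.81]
consistent = cycle_residual(image, toy_inverse, toy_forward)
inconsistent = cycle_residual(image, biased_inverse, toy_forward)
```

A residual near zero on held-out real scenes would support the premise; a residual that grows on scenes with difficult materials or lighting would be the kind of failure described above.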
Original abstract
While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ouroboros, a framework of two single-step diffusion models for forward and inverse rendering trained with a cycle-consistency mechanism for mutual reinforcement. It claims to extend intrinsic decomposition to indoor and outdoor scenes, achieve state-of-the-art performance with substantially faster inference than multi-step diffusion methods, and enable training-free transfer to video decomposition while reducing temporal inconsistency.
Significance. If the central claims hold, the work offers a practical efficiency gain for diffusion-based rendering pipelines by replacing iterative denoising with single-step prediction while using cycle consistency to maintain coherence. The extension beyond indoor scenes and the training-free video application are notable strengths. The manuscript provides reproducible experimental protocols and quantitative comparisons on standard benchmarks, which strengthens the assessment.
major comments (2)
- §4.3, Eq. (12): the cycle-consistency term is implemented as an L2 penalty on the composition of the two single-step mappings. Because each model performs a direct prediction rather than iterative refinement, residual errors on non-Lambertian surfaces or under complex illumination can accumulate without the corrective iterations available in multi-step diffusion. The paper provides neither a quantitative bound nor an ablation showing that the composition remains close to identity on real scenes.
- Table 4, outdoor-scene rows: the reported PSNR and SSIM gains over the strongest diffusion baseline are 1.2 dB and 0.03 respectively, yet the standard deviations across the 50 test scenes are not reported and the improvement is not tested for statistical significance. This weakens the SOTA claim for outdoor scenes, where the single-step approximation is most stressed.
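The reporting gap raised in the Table 4 comment could be closed with a per-scene summary along these lines. This is a sketch under stated assumptions: images are flat lists of intensities in [0, 1], `psnr` and `summarize` are illustrative helper names, and the numbers are made up for the example rather than taken from the paper.

```python
from math import log10
from statistics import mean, stdev

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * log10(peak ** 2 / mse)

def summarize(per_scene_scores):
    """Mean and sample standard deviation across scenes (needs >= 2 scenes)."""
    return mean(per_scene_scores), stdev(per_scene_scores)

# Made-up example values, not results from the paper.
reference = [0.0, 1.0]
estimate = [0.1, 0.9]
score = psnr(reference, estimate)

gains = [1.0, 2.0, 3.0]  # hypothetical per-scene PSNR gains in dB
avg_gain, std_gain = summarize(gains)
```

Reporting the gain as mean plus or minus standard deviation over the test scenes makes it visible whether a 1.2 dB average is consistent across scenes or driven by a few of them.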
minor comments (2)
- Figure 3 caption: the legend labels for the forward and inverse branches are swapped relative to the diagram in §3.1; this should be corrected for clarity.
- §5.2: the video-transfer experiment uses a fixed number of frames (8) but does not report how performance scales with longer sequences or with varying motion magnitude.
Simulated Author's Rebuttal
We are grateful to the referee for the positive evaluation of our work's significance and for the detailed major comments. We respond to each comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
- Referee: §4.3, Eq. (12): the cycle-consistency term is implemented as an L2 penalty on the composition of the two single-step mappings; however, because each model performs a direct prediction rather than iterative refinement, residual errors on non-Lambertian surfaces or complex illumination can accumulate without the corrective iterations available in multi-step diffusion, and the paper does not provide a quantitative bound or ablation showing that the composition remains close to identity on real scenes.
  Authors: We agree that empirical validation of cycle consistency on challenging real scenes is important. In the revised manuscript we will add an ablation that directly measures the cycle reconstruction error (deviation from identity) on both indoor and outdoor test scenes, with explicit examples involving non-Lambertian surfaces and complex illumination. This will provide quantitative evidence that the learned single-step mappings compose close to the identity under our training regime.
  Revision: yes
- Referee: Table 4, outdoor-scene rows: the reported PSNR and SSIM gains over the strongest diffusion baseline are 1.2 dB and 0.03 respectively, yet the standard deviations across the 50 test scenes are not reported and the improvement is not tested for statistical significance; this weakens the SOTA claim for outdoor scenes where the single-step approximation is most stressed.
  Authors: We thank the referee for highlighting this statistical gap. In the revised Table 4 we will report standard deviations across the 50 outdoor test scenes for all metrics. We will also add a paired statistical significance test (e.g., Wilcoxon signed-rank) between Ouroboros and the strongest baseline to confirm that the reported gains are statistically significant.
  Revision: yes
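For readers unfamiliar with the paired test mentioned in the response, here is a minimal sketch of an exact two-sided Wilcoxon signed-rank test in plain Python. It assumes no zero differences and no tied absolute differences, and it enumerates the null distribution exactly, which is practical for illustration but not a substitute for a statistics library. The function name and inputs are illustrative, and the scores are made up rather than taken from the paper.

```python
def wilcoxon_signed_rank(ours, baseline):
    """Return (W_plus, exact two-sided p-value) for paired per-scene scores."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    abs_sorted = sorted(abs(d) for d in diffs)
    if len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("tied |differences| are not handled in this sketch")

    # Rank the absolute differences from 1 (smallest) to n (largest).
    rank = {value: i + 1 for i, value in enumerate(abs_sorted)}
    n = len(diffs)
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    total = n * (n + 1) // 2

    # counts[s] = number of sign assignments whose positive ranks sum to s.
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    # Two-sided p-value: twice the lower tail at min(W_plus, W_minus), capped at 1.
    statistic = min(w_plus, total - w_plus)
    tail = sum(counts[: statistic + 1])
    return w_plus, min(1.0, 2.0 * tail / 2 ** n)

# Made-up per-scene scores, not results from the paper.
ours = [3, 5, 8, 5]
baseline = [1, 6, 4, 0]
w_plus, p_value = wilcoxon_signed_rank(ours, baseline)
```

With only four paired scenes the smallest achievable two-sided p-value is 0.125, so a test of this kind only becomes informative at sample sizes such as the 50 outdoor scenes discussed above.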
Circularity Check
No significant circularity in derivation chain
The paper presents Ouroboros as a framework of two single-step diffusion models trained with an added cycle consistency mechanism for mutual reinforcement between forward and inverse rendering. No equations, derivations, or self-citations are exhibited that reduce the central claims (coherence, SOTA performance, or training-free video transfer) to fitted inputs or self-referential definitions by construction. The cycle consistency is introduced as an independent training objective rather than a renaming or forced prediction of the input data, and the overall approach remains self-contained with external validation on diverse indoor/outdoor scenes.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement... cycle consistency mechanism that ensures coherence"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Paper passage: "single-step diffusion models... 50× acceleration in inference speed"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Let EEG Models Learn EEG
  JET is a conditional flow matching framework that generates EEG as continuous raw sequences with added constraints for spectral and temporal properties, achieving over 40% lower TS-FID than prior discrete denoising me...
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
  UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.