pith. machine review for the scientific record.

arxiv: 2604.19587 · v1 · submitted 2026-04-21 · 💻 cs.CV


SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Bo Li, Chengcheng Liu, Guangyuan Li, Jian Zhang, Jinwei Chen, Linxiao Shi, Miaosen Luo, Peng-Tao Jiang, Qirui Yang, Ruiyang Fan, Siming Zheng, Yang Yang, Ying Zeng


Pith reviewed 2026-05-10 02:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords automatic photographic image editing · image quality reasoning · generative image enhancement · reinforcement learning for editing · photo-realistic retouching · image critic module · reasoning-guided generation

The pith

SmartPhotoCrafter automatically edits photos by reasoning about quality deficiencies and generating targeted enhancements without user instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that photographic image editing can be automated as a single reasoning-to-generation process rather than depending on vague human directions. An Image Critic first analyzes an input photo for aesthetic shortfalls, then a Photographic Artist module produces the actual changes to improve visual appeal. This matters for non-expert users who struggle to express precise retouching goals, since the system handles both restoration and tonal adjustments while keeping results realistic and consistent with color semantics. The authors train the model in three stages that progressively add reasoning supervision and reinforcement learning to align the two modules. Experiments indicate the resulting edits exceed those of prior generative models in realism and sensitivity to tone-related cues.

Core claim

SmartPhotoCrafter formulates automatic photographic image editing as a tightly coupled reasoning-to-generation process: an Image Critic module identifies aesthetic deficiencies, and a Photographic Artist module performs the targeted edits. Trained end-to-end through foundation pretraining, reasoning-guided multi-edit supervision, and coordinated reinforcement learning, the system delivers photo-realistic results on restoration and retouching tasks while adhering to color- and tone-related semantics.

What carries the argument

The unified reasoning-to-generation pipeline that pairs an Image Critic for deficiency identification with a Photographic Artist for edit realization, jointly optimized via multi-stage training that includes reinforcement learning.
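As a sketch of the two-module structure this describes (the module names come from the paper; everything else, including treating an image as a dict of global tone parameters, is our illustration):

```python
from dataclasses import dataclass

@dataclass
class CritiqueReport:
    """Hypothetical output of an Image Critic pass."""
    deficiencies: list[str]      # e.g. ["low exposure", "subdued contrast"]
    edit_plan: dict[str, float]  # target adjustments, e.g. {"exposure": +0.4}

def image_critic(image) -> CritiqueReport:
    """Stand-in for the Image Critic: name the photo's aesthetic shortfalls.

    A real critic would be a vision-language model; here we return a fixed finding.
    """
    return CritiqueReport(
        deficiencies=["low exposure"],
        edit_plan={"exposure": 0.4, "contrast": 0.15},
    )

def photographic_artist(image, report: CritiqueReport):
    """Stand-in for the Photographic Artist: apply the critic's planned edits."""
    edited = dict(image)  # the image stands in as a dict of global parameters
    for param, delta in report.edit_plan.items():
        edited[param] = edited.get(param, 0.0) + delta
    return edited

# End-to-end: reasoning first, generation second, no user instruction in between.
photo = {"exposure": -0.3, "contrast": 0.0}
result = photographic_artist(photo, image_critic(photo))
```

The point of the sketch is the data flow: the artist consumes the critic's structured report rather than a free-form user prompt, which is what makes the pipeline instruction-free.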

If this is right

  • The method supports both image restoration and retouching while maintaining consistent adherence to color- and tone-related semantics.
  • It achieves higher tonal sensitivity to retouching needs than existing generative models.
  • Photo-realistic enhancements become possible without requiring users to supply explicit aesthetic instructions.
  • A stage-specific dataset progressively builds reasoning capability, controllable generation, and cross-module collaboration.
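The staged curriculum the last bullet refers to could be organized as a simple schedule; the stage names follow the abstract, while the data labels and the update-function split are our assumption:

```python
# Hypothetical stage schedule mirroring the paper's three-phase curriculum.
STAGES = [
    {"name": "foundation_pretraining",      "data": "aesthetic_pairs",     "rl": False},
    {"name": "reasoning_guided_adaptation", "data": "multi_edit_traces",   "rl": False},
    {"name": "coordinated_rl",              "data": "preference_rollouts", "rl": True},
]

def run_curriculum(train_step, rl_step):
    """Walk the stages in order, switching to RL updates only in the final phase."""
    log = []
    for stage in STAGES:
        step = rl_step if stage["rl"] else train_step
        log.append(step(stage["name"], stage["data"]))
    return log
```

The design choice worth noting is that supervision and reinforcement learning share one loop: only the update rule changes per stage, so the same critic-and-artist weights accumulate capability across phases.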

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same critic-plus-artist structure with staged reinforcement learning could be adapted to other generative tasks such as video enhancement or style transfer where internal quality assessment is needed.
  • If the critic's judgments prove stable across cultural or stylistic variations, the model might reduce reliance on subjective user prompts in consumer photo apps.
  • Mobile-camera integration could allow real-time automatic corrections during capture by running the reasoning step on-device before final image output.

Load-bearing premise

The Image Critic can reliably detect aesthetic deficiencies and the training data plus reinforcement learning produce edits that match broad human aesthetic preferences without any explicit instructions.

What would settle it

A side-by-side human evaluation study on the same input photographs in which participants rate SmartPhotoCrafter outputs against those from instruction-based editing models or professional retouchers for realism, tonal accuracy, and overall appeal.
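The analysis for such a study is straightforward; a minimal sketch of a paired-preference significance check (all names and counts illustrative, not from the paper):

```python
import random

def paired_preference_test(wins_a: int, wins_b: int, n_boot: int = 10_000, seed: int = 0):
    """Bootstrap sign test on paired A/B preference counts.

    wins_a / wins_b: how often raters preferred each system on the same photo.
    Returns the bootstrap estimate of P(preference rate for A <= 0.5), i.e. how
    often resampling fails to favor A.
    """
    rng = random.Random(seed)
    n = wins_a + wins_b
    outcomes = [1] * wins_a + [0] * wins_b
    worse = 0
    for _ in range(n_boot):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        if sum(resample) / n <= 0.5:
            worse += 1
    return worse / n_boot

# e.g. if 70 of 100 paired ratings favored SmartPhotoCrafter, the estimate is small:
p = paired_preference_test(70, 30)
```

Because ratings are paired on the same input photograph, a sign-style test like this controls for image difficulty in a way that comparing unpaired mean-opinion scores would not.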

read the original abstract

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces SmartPhotoCrafter, a unified model for automatic photographic image editing formulated as a reasoning-to-generation process. It consists of an Image Critic module that performs image quality comprehension and identifies aesthetic deficiencies, followed by a Photographic Artist module that executes targeted edits for enhancement without requiring explicit human instructions. The approach uses a three-stage training pipeline—foundation pretraining for basic aesthetic understanding, adaptation via reasoning-guided multi-edit supervision, and coordinated reinforcement learning to jointly optimize reasoning and generation—along with stage-specific datasets. Experiments are claimed to show outperformance over existing generative models in photo-realistic enhancement, with strong adherence to color- and tone-related semantics for both restoration and retouching tasks.

Significance. If the empirical results hold after controlling for model scale and data, the work could meaningfully advance automatic, instruction-free image editing by making professional-level photographic adjustments accessible to non-experts. The tight coupling of comprehension and generation through RL coordination, combined with explicit emphasis on photo-realism and tonal sensitivity, offers a coherent pipeline that addresses a practical gap in consumer photography tools. The progressive dataset construction for building cross-module collaboration is a constructive element.

minor comments (2)
  1. [Abstract] The abstract states that experiments demonstrate outperformance and higher tonal sensitivity but provides no quantitative metrics, baseline models, or dataset sizes; adding these details (even at a high level) would strengthen the summary for readers.
  2. The description of the Image Critic's deficiency identification and the RL coordination objective remains high-level; a concrete example of a reasoning trace or loss formulation would clarify how the modules interact without explicit instructions.
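To make the second comment concrete, here is one way such a coordination objective could look; this is purely our guess at a plausible reward shape, not the paper's actual loss formulation:

```python
def coordinated_reward(critic_score_before: float, critic_score_after: float,
                       semantic_match: float,
                       w_quality: float = 1.0, w_semantic: float = 0.5) -> float:
    """Illustrative coordinated RL reward (our assumption, not the paper's).

    Rewards an edit for improving critic-judged quality (quality_gain) while
    staying faithful to the critic's color/tone reasoning (semantic_match in [0, 1]).
    """
    quality_gain = critic_score_after - critic_score_before
    return w_quality * quality_gain + w_semantic * semantic_match
```

Even a schematic like this would let readers see how the critic's output enters the artist's training signal, which is the interaction the abstract leaves implicit.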

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of SmartPhotoCrafter, including the recognition of its unified reasoning-to-generation pipeline, multi-stage training, and potential impact on automatic photographic editing. We appreciate the minor-revision recommendation and will incorporate the requested clarifications in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with no derivations

full rationale

The paper describes a procedural multi-stage training pipeline (foundation pretraining, reasoning-guided adaptation, and RL coordination) for an image editing model consisting of an Image Critic and Photographic Artist. All central claims of outperformance and tonal sensitivity are presented as empirical experimental results rather than mathematical derivations or first-principles predictions. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The method is self-contained as a descriptive architecture whose validity rests on external benchmarks and datasets, not on internal reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard deep-learning assumptions; the claim rests on the unstated premise that aesthetic quality can be learned from the constructed stage-specific dataset.

pith-pipeline@v0.9.0 · 5611 in / 1095 out tokens · 37691 ms · 2026-05-10T02:06:55.929591+00:00 · methodology

