Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network
Pith reviewed 2026-05-15 14:52 UTC · model grok-4.3
The pith
Coupling Retinex illumination correction with CLIP language guidance enhances underwater images more effectively.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a Physics-Semantics-Guided Underwater Image Enhancement Network, which combines a Prior-Free Illumination Estimator grounded in Retinex with a Semantics-Guided Image Restorer that uses CLIP-generated textual descriptions, can achieve superior enhancement by enforcing semantic consistency through a dedicated loss on a newly constructed dataset of 6,418 image-reference-text triplets.
What carries the argument
The Semantics-Guided Image Restorer that leverages CLIP textual descriptions to inject high-level semantics for guiding the restoration process.
If this is right
- Enhanced correction of color distortion and low contrast without relying on strict physical assumptions.
- Improved generalization to varied underwater conditions through multimodal training.
- New capability to measure and optimize semantic alignment between images and text in enhancement tasks.
- Comparable or better performance against fifteen existing methods on public datasets.
Where Pith is reading between the lines
- Similar integration of physical models with language guidance could be tested on other image degradation problems like haze or night scenes.
- The constructed dataset may serve as a benchmark for future multimodal underwater vision research.
- Using different vision-language models beyond CLIP might yield further gains if the semantic guidance is the key factor.
Load-bearing premise
That the textual descriptions produced by CLIP provide accurate and useful perceptual guidance for restoring details in underwater images.
What would settle it
Running the enhancement on a set of underwater images paired with deliberately incorrect or mismatched textual descriptions and checking if performance drops significantly compared to correct descriptions.
Figures
read the original abstract
Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast and reduced visibility. Existing Underwater Image Enhancement (UIE) methods can be divided into two categories, i.e., prior-based and learning-based methods. The former rely on rigid physical assumptions that limit the adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples the Retinex-grounded illumination correction with the language-informed guidance. This network comprises a Prior-Free Illumination Estimator and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, namely, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and the multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PSG-UIENet for underwater image enhancement, coupling a Prior-Free Illumination Estimator grounded in Retinex theory with a Semantics-Guided Image Restorer that injects high-level semantics via CLIP-generated textual descriptions. The authors introduce the LUIQD-TD multimodal dataset (6,418 image-reference-text triplets) and an ITSS loss to enforce image-text semantic consistency, claiming this is the first integration of textual guidance and multimodal data into UIE tasks. Extensive experiments reportedly show superior or comparable results to 15 state-of-the-art methods on the new dataset plus four public benchmarks.
Significance. If the performance gains prove robust and the semantic component demonstrably contributes beyond standard Retinex or CNN baselines, the work would advance UIE by bridging physical priors with multimodal language models, directly addressing data scarcity and generalization limitations. The LUIQD-TD dataset is a concrete community resource, and the explicit ITSS loss provides a falsifiable mechanism for semantic alignment; these elements strengthen the contribution even if the absolute gains are incremental.
major comments (2)
- [§3.2] §3.2 (Semantics-Guided Image Restorer): the reliance on CLIP-generated captions for perceptually meaningful guidance is central to the novelty claim, yet the manuscript provides no quantitative validation (e.g., caption accuracy metrics or human study) that CLIP embeddings remain informative under severe underwater color-cast and low-contrast conditions; without this, the ITSS loss may optimize to generic or domain-shifted text rather than task-relevant semantics.
- [§4] §4 (Experiments): the central performance claim against 15 SOTA methods is load-bearing, but the manuscript must include full quantitative tables (PSNR, SSIM, UIQM, etc.) on all five datasets, ablation studies isolating the ITSS loss and CLIP component, and error analysis or failure cases; the abstract alone supplies no such evidence, preventing verification that gains are not due to post-hoc dataset choices or missing baselines.
minor comments (2)
- Notation for the ITSS loss and the precise formulation of how CLIP text embeddings are fused into the restorer should be clarified with an explicit equation or diagram to avoid ambiguity in reproduction.
- The manuscript should add a brief related-work subsection contrasting the proposed Retinex+CLIP approach with prior multimodal or language-guided enhancement methods outside UIE to strengthen the novelty positioning.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to incorporate additional validation and expanded experimental details where needed.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Semantics-Guided Image Restorer): the reliance on CLIP-generated captions for perceptually meaningful guidance is central to the novelty claim, yet the manuscript provides no quantitative validation (e.g., caption accuracy metrics or human study) that CLIP embeddings remain informative under severe underwater color-cast and low-contrast conditions; without this, the ITSS loss may optimize to generic or domain-shifted text rather than task-relevant semantics.
Authors: We agree that direct quantitative validation of CLIP caption quality specifically on underwater images would strengthen the central novelty claim. The current manuscript relies on end-to-end performance gains and the ITSS loss to demonstrate utility, but we will add a new subsection with caption accuracy metrics (e.g., CLIP similarity scores between generated texts and reference descriptions) computed on a held-out subset of LUIQD-TD, along with qualitative examples of captions under varying degradation levels. This revision will directly address whether the embeddings remain informative. revision: yes
-
Referee: [§4] §4 (Experiments): the central performance claim against 15 SOTA methods is load-bearing, but the manuscript must include full quantitative tables (PSNR, SSIM, UIQM, etc.) on all five datasets, ablation studies isolating the ITSS loss and CLIP component, and error analysis or failure cases; the abstract alone supplies no such evidence, preventing verification that gains are not due to post-hoc dataset choices or missing baselines.
Authors: We acknowledge that the experimental section requires fuller documentation to support the claims. In the revised manuscript we will expand Section 4 with complete tables reporting PSNR, SSIM, UIQM and additional metrics across all five datasets (LUIQD-TD plus the four public benchmarks), dedicated ablation tables isolating the ITSS loss and CLIP semantic injection, and a new error-analysis subsection that discusses representative failure cases. These additions will allow independent verification that observed gains arise from the proposed components. revision: yes
Circularity Check
No derivation reduces to fitted input by construction; ITSS loss and CLIP guidance remain external
full rationale
The paper introduces a Prior-Free Illumination Estimator grounded in Retinex and a Semantics-Guided Image Restorer that injects CLIP-generated text via a newly defined ITSS loss. Neither component is shown to be fitted to a subset of the target outputs and then re-predicted; the multimodal dataset LUIQD-TD is constructed externally rather than derived from model parameters. Standard Retinex decomposition is invoked without self-referential redefinition of illumination or reflectance terms. Any self-citations are peripheral and do not carry the central claim of first multimodal UIE integration. The reported performance gains are therefore not forced by internal re-labeling of fitted quantities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Retinex model can be used for illumination correction without rigid physical assumptions
- domain assumption CLIP text embeddings supply perceptually meaningful guidance for image restoration
invented entities (2)
-
LUIQD-TD dataset
no independent evidence
-
ITSS loss
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Retinex theory decomposes an image into reflectance and illumination... Ideg = (R + R̂) ⊙ (L + L̂)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A perception- aware decomposition and fusion framework for underwater image enhancement,
Y . Kang, Q. Jiang, C. Li, W. Ren, H. Liu, and P. Wang, “A perception- aware decomposition and fusion framework for underwater image enhancement,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 988–1002, 2022
work page 2022
-
[2]
Beyond single reference for training: Underwater image enhancement via comparative learning,
K. Li, L. Wu, Q. Qi, W. Liu, X. Gao, L. Zhou, and D. Song, “Beyond single reference for training: Underwater image enhancement via comparative learning,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 6, pp. 2561–2576, 2022
work page 2022
-
[3]
H. Wang, A. C. Frery, M. Li, and P. Ren, “Underwater image enhance- ment via histogram similarity-oriented color compensation comple- mented by multiple attribute adjustment,”Intelligent Marine Technology and Systems, vol. 1, no. 1, p. 12, 2023
work page 2023
-
[4]
Image enhancement in turbid water using multiscale weighted features and attention mechanisms,
H. Zhang, W. Zhang, H. Yuan, S. Bai, Y . Tian, and Z. Liu, “Image enhancement in turbid water using multiscale weighted features and attention mechanisms,”Intelligent Marine Technology and Systems, vol. 3, no. 1, p. 35, 2025
work page 2025
-
[5]
Retinex- former: One-stage retinex-based transformer for low-light image en- hancement,
Y . Cai, H. Bian, J. Lin, H. Wang, R. Timofte, and Y . Zhang, “Retinex- former: One-stage retinex-based transformer for low-light image en- hancement,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 12 504–12 513. 13
work page 2023
-
[6]
Retinexmamba: Retinex- based mamba for low-light image enhancement,
J. Bai, Y . Yin, Q. He, Y . Li, and X. Zhang, “Retinexmamba: Retinex- based mamba for low-light image enhancement,” inInternational Con- ference on Neural Information Processing. Springer, 2025, pp. 427– 442
work page 2025
-
[7]
Image quality assessment: from error visibility to structural similarity,
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004
work page 2004
-
[8]
The unreasonable effectiveness of deep features as a perceptual metric,
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595
work page 2018
-
[9]
E. H. Land and J. J. McCann, “Lightness and retinex theory,”Journal of the Optical society of America, vol. 61, no. 1, pp. 1–11, 1971
work page 1971
-
[10]
Single image haze removal using dark channel prior,
K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, 2010
work page 2010
-
[11]
Transmission estimation in underwater single images,
P. Drews, E. Nascimento, F. Moraes, S. Botelho, and M. Campos, “Transmission estimation in underwater single images,” inProceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 825–830
work page 2013
-
[12]
Generalization of the dark channel prior for single image restoration,
Y .-T. Peng, K. Cao, and P. C. Cosman, “Generalization of the dark channel prior for single image restoration,”IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2856–2868, 2018
work page 2018
-
[13]
Diving into haze-lines: Color restoration of underwater images,
D. Berman, T. Treibitz, and S. Avidan, “Diving into haze-lines: Color restoration of underwater images,” inProc. British Machine Vision Conference (BMVC), vol. 1, no. 2, 2017
work page 2017
-
[14]
The retinex theory of color vision
E. H. Land, “The retinex theory of color vision.”Scientific American, p. 108–128, Feb 2010. [Online]. Available: http://dx.doi.org/10.1038/ scientificamerican1277-108
work page 2010
-
[15]
Underwater image enhancement with hyper-laplacian reflectance priors,
P. Zhuang, J. Wu, F. Porikli, and C. Li, “Underwater image enhancement with hyper-laplacian reflectance priors,”IEEE Transactions on Image Processing, vol. 31, pp. 5442–5455, 2022
work page 2022
-
[16]
Under- water image enhancement via minimal color loss and locally adaptive contrast enhancement,
W. Zhang, P. Zhuang, H.-H. Sun, G. Li, S. Kwong, and C. Li, “Under- water image enhancement via minimal color loss and locally adaptive contrast enhancement,”IEEE Transactions on Image Processing, vol. 31, pp. 3997–4010, 2022
work page 2022
-
[17]
Hfm: A hybrid fusion method for underwater image enhancement,
S. An, L. Xu, I. Senior Member, Z. Deng, and H. Zhang, “Hfm: A hybrid fusion method for underwater image enhancement,”Engineering Applications of Artificial Intelligence, vol. 127, p. 107219, 2024
work page 2024
-
[18]
An underwater image enhancement benchmark dataset and beyond,
C. Li, C. Guo, W. Ren, R. Cong, J. Hou, S. Kwong, and D. Tao, “An underwater image enhancement benchmark dataset and beyond,”IEEE Transactions on Image Processing, vol. 29, pp. 4376–4389, 2019
work page 2019
-
[19]
Underwater image enhancement via medium transmission-guided multi-color space embedding,
C. Li, S. Anwar, J. Hou, R. Cong, C. Guo, and W. Ren, “Underwater image enhancement via medium transmission-guided multi-color space embedding,”IEEE Transactions on Image Processing, vol. 30, pp. 4985– 5000, 2021
work page 2021
-
[20]
A semi-supervised physics-aware triple-stream underwater image enhancement network,
S. Xu, H. Qi, W. Wang, C. Huang, J. Wen, J. Dong, and X. Dong, “A semi-supervised physics-aware triple-stream underwater image enhancement network,” 2025. [Online]. Available: https: //arxiv.org/abs/2307.11470
-
[21]
Uncertainty inspired underwater image enhancement,
Z. Fu, W. Wang, Y . Huang, X. Ding, and K.-K. Ma, “Uncertainty inspired underwater image enhancement,” inEuropean conference on computer vision. Springer, 2022, pp. 465–482
work page 2022
-
[22]
U-shape transformer for underwater image enhancement,
L. Peng, C. Zhu, and L. Bian, “U-shape transformer for underwater image enhancement,”IEEE Transactions on Image Processing, vol. 32, pp. 3066–3079, 2023
work page 2023
-
[23]
Underwater ranker: Learn which is better and how to be better,
C. Guo, R. Wu, X. Jin, L. Han, W. Zhang, Z. Chai, and C. Li, “Underwater ranker: Learn which is better and how to be better,” in Proceedings of the AAAI conference on artificial intelligence, vol. 37, no. 1, 2023, pp. 702–709
work page 2023
-
[24]
Deep color-corrected multi-scale retinex network for underwater image enhancement,
H. Qi, H. Zhou, J. Dong, and X. Dong, “Deep color-corrected multi-scale retinex network for underwater image enhancement,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13, 2024
work page 2024
-
[25]
Uwformer: Underwater image enhancement via a semi-supervised multi-scale trans- former,
W. Chen, Y . Lei, S. Luo, Z. Zhou, M. Li, and C.-M. Pun, “Uwformer: Underwater image enhancement via a semi-supervised multi-scale trans- former,” in2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–8
work page 2024
-
[26]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763
work page 2021
-
[27]
Iterative prompt learning for unsupervised backlit image enhancement,
Z. Liang, C. Li, S. Zhou, R. Feng, and C. C. Loy, “Iterative prompt learning for unsupervised backlit image enhancement,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8094–8103
work page 2023
-
[28]
Hazeclip: Towards language guided real-world image dehazing,
R. Wang, W. Li, X. Liu, C. Li, Z. Zhang, X. Min, and G. Zhai, “Hazeclip: Towards language guided real-world image dehazing,” inICASSP 2025- 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5
work page 2025
-
[29]
Underwater image enhancement by diffusion model with customized clip-classifier,
S. Liu, K. Li, Y . Ding, and Q. Qi, “Underwater image enhancement by diffusion model with customized clip-classifier,”arXiv preprint arXiv:2405.16214, 2024
-
[30]
B. Lin, J. Dong, and X. Dong, “Perception-aware underwater image quality assessment: Dataset, perceptual quality scores and assessment network,”IEEE Transactions on Circuits and Systems for Video Tech- nology, 2025
work page 2025
-
[31]
Underwater single image color restoration using haze-lines and a new quantitative dataset,
D. Berman, D. Levy, S. Avidan, and T. Treibitz, “Underwater single image color restoration using haze-lines and a new quantitative dataset,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 8, pp. 2822–2837, 2020
work page 2020
-
[32]
Zuiderveld,Contrast Limited Adaptive Histogram Equalization
K. Zuiderveld,Contrast Limited Adaptive Histogram Equalization. USA: Academic Press Professional, Inc., 1994, p. 474–485
work page 1994
-
[33]
D. Huang, Y . Wang, W. Song, J. Sequeira, and S. Mavromatis, “Shallow- water image enhancement using relative global histogram stretching based on adaptive parameter acquisition,” inMultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, February 5- 7, 2018, Proceedings, Part I 24. Springer, 2018, pp. 453–465
work page 2018
-
[34]
Underwater image enhance- ment via extended multi-scale retinex,
S. Zhang, T. Wang, J. Dong, and H. Yu, “Underwater image enhance- ment via extended multi-scale retinex,”Neurocomputing, vol. 245, pp. 1–9, 2017
work page 2017
-
[35]
Enhancing underwa- ter images and videos by fusion,
C. Ancuti, C. O. Ancuti, T. Haber, and P. Bekaert, “Enhancing underwa- ter images and videos by fusion,” in2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 81–88
work page 2012
-
[36]
Color balance and fusion for underwater image enhancement,
C. O. Ancuti, C. Ancuti, C. De Vleeschouwer, and P. Bekaert, “Color balance and fusion for underwater image enhancement,”IEEE Transac- tions on Image Processing, vol. 27, no. 1, pp. 379–393, 2017
work page 2017
-
[37]
Sguie-net: Semantic attention guided underwater image enhancement with multi- scale perception,
Q. Qi, K. Li, H. Zheng, X. Gao, G. Hou, and K. Sun, “Sguie-net: Semantic attention guided underwater image enhancement with multi- scale perception,”IEEE Transactions on Image Processing, vol. 31, pp. 6816–6830, 2022
work page 2022
-
[38]
Rave: Residual vector embedding for clip-guided backlit image enhancement,
T. Gaintseva, M. Benning, and G. Slabaugh, “Rave: Residual vector embedding for clip-guided backlit image enhancement,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 412–428
work page 2024
-
[39]
Vqa: Visual question answering,
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” inProceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433
work page 2015
-
[40]
Masked au- toencoders are scalable vision learners,
K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked au- toencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009
work page 2022
-
[41]
Film: Visual reasoning with a general conditioning layer,
E. Perez, F. Strub, H. De Vries, V . Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018
work page 2018
-
[42]
Very Deep Convolutional Networks for Large-Scale Image Recognition
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[43]
Image enhancement by histogram transformation,
R. Hummel, “Image enhancement by histogram transformation,”Com- puter Graphics Image Processing, vol. 6, no. 2, pp. 184–195, 1977
work page 1977
-
[44]
Uif: An objective quality assessment for underwater image enhancement,
Y . Zheng, W. Chen, R. Lin, T. Zhao, and P. Le Callet, “Uif: An objective quality assessment for underwater image enhancement,”IEEE Transactions on Image Processing, vol. 31, pp. 5456–5468, 2022
work page 2022
-
[45]
An underwater color image quality evaluation metric,
M. Yang and A. Sowmya, “An underwater color image quality evaluation metric,”IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 6062–6071, 2015
work page 2015
-
[46]
Human-visual-system-inspired underwater image quality measures,
K. Panetta, C. Gao, and S. Agaian, “Human-visual-system-inspired underwater image quality measures,”IEEE Journal of Oceanic Engi- neering, vol. 41, no. 3, pp. 541–551, 2015. Shixuan Xureceived the bachelor’s degree in En- gineering from Lanzhou University of Finance and Economics (LZUFE), Lanzhou, Gansu, China, in
work page 2015
-
[47]
His research interests include computer vision, deep learning and image enhancement
He is currently pursuing the master’s degree in Artificial Intelligence at Ocean University of China. His research interests include computer vision, deep learning and image enhancement. 14 Yabo Liureceived the Ph.D. degree in computer technology from Harbin Institute of Technology, Shenzhen, China, in 2025. From 2021 to 2025, he was a jointly supervised ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.