
arxiv: 2603.07076 · v2 · submitted 2026-03-07 · 💻 cs.CV

Retinex Meets Language: A Physics-Semantics-Guided Underwater Image Enhancement Network

Pith reviewed 2026-05-15 14:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords underwater image enhancement · Retinex model · CLIP guidance · multimodal dataset · semantic consistency loss · image restoration · physics-informed learning

The pith

Coupling Retinex illumination correction with CLIP language guidance enhances underwater images as well as or better than fifteen existing methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a network that uses a Retinex-grounded estimator to recover illumination without hand-crafted priors, then restores the image under semantic guidance from CLIP text descriptions. It targets two known weaknesses: prior-based methods rest on rigid physical assumptions, and learning-based methods suffer from scarce data and weak generalization. To supply the missing multimodal data, the authors build LUIQD-TD, a dataset of 6,418 image-reference-text triplets. Training optimizes both physical correction and image-text semantic consistency, and the authors report superior or comparable results against fifteen existing methods on their own dataset and four public ones. Readers might care because improved underwater visibility supports applications in oceanography and underwater vehicles. The core idea is that language provides high-level cues that physics alone cannot supply in degraded scenes.
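To make the Retinex step concrete, here is a minimal sketch of the classical decomposition, image = reflectance × illumination, followed by a brightened recomposition. It is not the paper's Prior-Free Illumination Estimator, which is learned. The channel-maximum illumination guess, the `gamma` value and the function name `retinex_correct` are all assumptions chosen for illustration.

```python
import numpy as np

def retinex_correct(image, eps=1e-3, gamma=0.6):
    """Retinex-style correction of a float image in [0, 1], shape (H, W, 3).

    Returns (enhanced, illumination, reflectance).
    """
    # Hand-crafted illumination guess: per-pixel maximum over colour channels.
    # (Illustrative assumption; the paper learns this map instead.)
    illumination = np.clip(image.max(axis=2, keepdims=True), eps, 1.0)
    # Retinex model: image = reflectance * illumination (element-wise).
    reflectance = image / illumination
    # Brighten the illumination with a gamma curve (gamma < 1), then recompose.
    enhanced = np.clip(reflectance * illumination ** gamma, 0.0, 1.0)
    return enhanced, illumination, reflectance

rng = np.random.default_rng(0)
dark = rng.uniform(0.05, 0.3, size=(4, 4, 3))  # synthetic dim image
enhanced, illum, refl = retinex_correct(dark)
print("mean before:", dark.mean(), "mean after:", enhanced.mean())
```

A fixed rule like the channel maximum is itself a prior, which is the kind of rigid assumption the paper says its learned estimator avoids.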

Core claim

The paper claims that a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet) achieves superior or comparable enhancement by combining a Prior-Free Illumination Estimator grounded in Retinex with a Semantics-Guided Image Restorer that uses CLIP-generated textual descriptions. Semantic consistency is enforced through a dedicated Image-Text Semantic Similarity (ITSS) loss, trained on a newly constructed dataset of 6,418 image-reference-text triplets.

What carries the argument

The Semantics-Guided Image Restorer, which uses CLIP textual descriptions to inject high-level semantics that guide the restoration process.
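This page does not say how the text embedding enters the restorer. One common way to condition image features on a text vector is feature-wise linear modulation (FiLM-style scale and shift per channel). The sketch below shows that mechanism only, as an assumption about what the fusion might look like. The names `film_modulate`, `w_gamma` and `w_beta`, the random weights, and the 512-dimensional placeholder embedding are hypothetical.

```python
import numpy as np

def film_modulate(features, text_embedding, w_gamma, w_beta):
    """Scale and shift each feature channel using a text embedding.

    features:       (C, H, W) image feature map
    text_embedding: (D,) sentence-level text vector
    w_gamma, w_beta: (C, D) projection matrices (learned in a real model)
    """
    gamma = w_gamma @ text_embedding  # (C,) per-channel scale
    beta = w_beta @ text_embedding    # (C,) per-channel shift
    # Broadcast the per-channel terms over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 16))
text_embedding = rng.standard_normal(512)   # placeholder for a text vector
w_gamma = rng.standard_normal((8, 512)) * 0.01
w_beta = rng.standard_normal((8, 512)) * 0.01
modulated = film_modulate(features, text_embedding, w_gamma, w_beta)
print(modulated.shape)
```

Under this design the text never touches pixels directly; it only rescales feature channels, so an uninformative caption would translate into an uninformative modulation.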

If this is right

  • Enhanced correction of color distortion and low contrast without relying on strict physical assumptions.
  • Improved generalization to varied underwater conditions through multimodal training.
  • New capability to measure and optimize semantic alignment between images and text in enhancement tasks.
  • Comparable or better performance against fifteen existing methods on public datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar integration of physical models with language guidance could be tested on other image degradation problems like haze or night scenes.
  • The constructed dataset may serve as a benchmark for future multimodal underwater vision research.
  • Using different vision-language models beyond CLIP might yield further gains if the semantic guidance is the key factor.

Load-bearing premise

That the textual descriptions produced by CLIP provide accurate and useful perceptual guidance for restoring details in underwater images.

What would settle it

Running the enhancement on a set of underwater images paired with deliberately incorrect or mismatched textual descriptions and checking if performance drops significantly compared to correct descriptions.
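The control described above can be sketched as a small harness: shuffle the captions so that no image keeps its own text, run the same model on both pairings, and compare mean scores. The `enhance` and `score` callables are hypothetical stand-ins for the trained network and a full-reference metric. Neither is defined by the paper or this page.

```python
import random

def derange(items, seed=0):
    """Return a shuffled copy in which no item stays at its original index."""
    if len(items) < 2:
        raise ValueError("need at least two items to mismatch")
    rng = random.Random(seed)
    idx = list(range(len(items)))
    while True:  # rejection sampling; succeeds quickly on average
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):
            return [items[j] for j in idx]

def caption_sensitivity(enhance, score, images, references, captions, seed=0):
    """Mean score with matched captions versus deliberately mismatched ones."""
    wrong = derange(captions, seed)
    matched = [score(enhance(x, c), r)
               for x, c, r in zip(images, captions, references)]
    mismatched = [score(enhance(x, c), r)
                  for x, c, r in zip(images, wrong, references)]
    n = len(images)
    return sum(matched) / n, sum(mismatched) / n

# Toy stand-ins: the "model" echoes its caption, the "metric" checks equality.
labels = ["coral", "diver", "wreck", "turtle"]
matched_mean, mismatched_mean = caption_sensitivity(
    lambda image, caption: caption,
    lambda output, reference: 1.0 if output == reference else 0.0,
    labels, labels, labels,
)
print(matched_mean, mismatched_mean)
```

The derangement works on indices, so if two images share an identical caption the shuffle can still hand an image its original text; deduplicating captions first would tighten the control. A near-zero gap between the two means would suggest the text branch contributes little.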

Figures

Figures reproduced from arXiv: 2603.07076 by Chao Huang, Junyu Dong, Shixuan Xu, Xinghui Dong, Yabo Liu.

Figure 1: Comparison of three Retinex-based UIE methods, including Retinex …
Figure 2: Sixteen degraded-reference-text triplets contained in the LUIQD-TD. Each triplet consists of three components: a degraded image (top-left) and the …
Figure 3: Statistical analysis of the textual annotations in the LUIQD-TD, including (a) the distribution of word frequencies, (b) the distribution of caption …
Figure 4: The architecture of our PSG-UIENet, which comprises two modules: (a) a Prior-Free Illumination Estimator that generates multi-scale light-enhanced …
Figure 5: The architecture of the Semantics-Guided Encoder-Decoder Network. This network is built on top of a symmetric encoder-decoder network, which …
Figure 6: The results produced by 15 baselines and our method in terms of a degraded image in the Test-L622 test set. Here, the PSNR, SSIM and LPIPS …
Figure 7: The results produced by 15 baselines and our method in terms of a degraded image in the Test-U80 test set. Here, the PSNR, SSIM and LPIPS …
Figure 8: The results produced by 15 baselines and our method in terms of a degraded image in the Test-S110 test set. Here, the PSNR, SSIM and LPIPS …
Figure 9: The results produced by 15 baselines and our method in terms of a degraded image in the Test-C60 test set. Here, both the PAUQA and UIF values …
Figure 10: The results produced by 15 baselines and our method in terms of a degraded image in the Test-R53 test set. Here, both the PAUQA and UIF values …

Original abstract

Underwater images often suffer from severe degradation caused by light absorption and scattering, leading to color distortion, low contrast and reduced visibility. Existing Underwater Image Enhancement (UIE) methods can be divided into two categories, i.e., prior-based and learning-based methods. The former rely on rigid physical assumptions that limit the adaptability, while the latter often face data scarcity and weak generalization. To address these issues, we propose a Physics-Semantics-Guided Underwater Image Enhancement Network (PSG-UIENet), which couples the Retinex-grounded illumination correction with the language-informed guidance. This network comprises a Prior-Free Illumination Estimator and a Semantics-Guided Image Restorer. In particular, the restorer leverages the textual descriptions generated by the Contrastive Language-Image Pre-training (CLIP) model to inject high-level semantics for perceptually meaningful guidance. Since multimodal UIE data sets are not publicly available, we also construct a large-scale image-text UIE data set, namely, LUIQD-TD, which contains 6,418 image-reference-text triplets. To explicitly measure and optimize semantic consistency between textual descriptions and images, we further design an Image-Text Semantic Similarity (ITSS) loss function. To our knowledge, this study makes the first effort to introduce both textual guidance and the multimodal data set into UIE tasks. Extensive experiments on our data set and four publicly available data sets demonstrate that the proposed PSG-UIENet achieves superior or comparable performance against fifteen state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PSG-UIENet for underwater image enhancement, coupling a Prior-Free Illumination Estimator grounded in Retinex theory with a Semantics-Guided Image Restorer that injects high-level semantics via CLIP-generated textual descriptions. The authors introduce the LUIQD-TD multimodal dataset (6,418 image-reference-text triplets) and an ITSS loss to enforce image-text semantic consistency, claiming this is the first integration of textual guidance and multimodal data into UIE tasks. Extensive experiments reportedly show superior or comparable results to 15 state-of-the-art methods on the new dataset plus four public benchmarks.

Significance. If the performance gains prove robust and the semantic component demonstrably contributes beyond standard Retinex or CNN baselines, the work would advance UIE by bridging physical priors with multimodal language models, directly addressing data scarcity and generalization limitations. The LUIQD-TD dataset is a concrete community resource, and the explicit ITSS loss provides a falsifiable mechanism for semantic alignment; these elements strengthen the contribution even if the absolute gains are incremental.

major comments (2)
  1. §3.2 (Semantics-Guided Image Restorer): the reliance on CLIP-generated captions for perceptually meaningful guidance is central to the novelty claim, yet the manuscript provides no quantitative validation (e.g., caption accuracy metrics or human study) that CLIP embeddings remain informative under severe underwater color-cast and low-contrast conditions; without this, the ITSS loss may optimize to generic or domain-shifted text rather than task-relevant semantics.
  2. §4 (Experiments): the central performance claim against 15 SOTA methods is load-bearing, but the manuscript must include full quantitative tables (PSNR, SSIM, UIQM, etc.) on all five datasets, ablation studies isolating the ITSS loss and CLIP component, and error analysis or failure cases; the abstract alone supplies no such evidence, preventing verification that gains are not due to post-hoc dataset choices or missing baselines.
minor comments (2)
  1. Notation for the ITSS loss and the precise formulation of how CLIP text embeddings are fused into the restorer should be clarified with an explicit equation or diagram to avoid ambiguity in reproduction.
  2. The manuscript should add a brief related-work subsection contrasting the proposed Retinex+CLIP approach with prior multimodal or language-guided enhancement methods outside UIE to strengthen the novelty positioning.
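The first minor comment asks for an explicit ITSS formulation, which the abstract does not give. As a reference point only, a common way to score image-text agreement is cosine similarity between the two embeddings, with the loss taken as one minus that similarity. The sketch below illustrates that generic form. It is an assumption, not the authors' definition, and `itss_like_loss` is a hypothetical name.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two 1-D vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def itss_like_loss(image_embedding, text_embedding):
    """One minus cosine similarity: near 0 when aligned, near 2 when opposed.

    Both inputs are assumed to come from the image and text encoders of a
    vision-language model such as CLIP; the encoders are not modelled here.
    """
    return 1.0 - cosine_similarity(image_embedding, text_embedding)

aligned = itss_like_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = itss_like_loss([1.0, 0.0], [0.0, 3.0])
opposed = itss_like_loss([1.0, 0.0], [-1.0, 0.0])
print(aligned, orthogonal, opposed)
```

Whatever form the paper actually uses, stating it at this level of explicitness would resolve the reproduction ambiguity the referee raises.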

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to incorporate additional validation and expanded experimental details where needed.

Point-by-point responses
  1. Referee: §3.2 (Semantics-Guided Image Restorer): the reliance on CLIP-generated captions for perceptually meaningful guidance is central to the novelty claim, yet the manuscript provides no quantitative validation (e.g., caption accuracy metrics or human study) that CLIP embeddings remain informative under severe underwater color-cast and low-contrast conditions; without this, the ITSS loss may optimize to generic or domain-shifted text rather than task-relevant semantics.

    Authors: We agree that direct quantitative validation of CLIP caption quality specifically on underwater images would strengthen the central novelty claim. The current manuscript relies on end-to-end performance gains and the ITSS loss to demonstrate utility, but we will add a new subsection with caption accuracy metrics (e.g., CLIP similarity scores between generated texts and reference descriptions) computed on a held-out subset of LUIQD-TD, along with qualitative examples of captions under varying degradation levels. This revision will directly address whether the embeddings remain informative.
    Revision: yes

  2. Referee: §4 (Experiments): the central performance claim against 15 SOTA methods is load-bearing, but the manuscript must include full quantitative tables (PSNR, SSIM, UIQM, etc.) on all five datasets, ablation studies isolating the ITSS loss and CLIP component, and error analysis or failure cases; the abstract alone supplies no such evidence, preventing verification that gains are not due to post-hoc dataset choices or missing baselines.

    Authors: We acknowledge that the experimental section requires fuller documentation to support the claims. In the revised manuscript we will expand Section 4 with complete tables reporting PSNR, SSIM, UIQM and additional metrics across all five datasets (LUIQD-TD plus the four public benchmarks), dedicated ablation tables isolating the ITSS loss and CLIP semantic injection, and a new error-analysis subsection that discusses representative failure cases. These additions will allow independent verification that observed gains arise from the proposed components.
    Revision: yes
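For readers who want to check numbers in the promised tables, PSNR is the simplest of the listed full-reference metrics to recompute. The sketch below uses the standard definition, 10·log10(MAX² / MSE), and assumes images scaled to [0, 1]. The function name and the default `max_value` are choices made here, not taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if reference.shape != estimate.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((4, 4, 3))
estimate = np.full((4, 4, 3), 0.1)   # uniform error of 0.1, so MSE = 0.01
value = psnr(reference, estimate)
print(round(value, 6))
```

PSNR needs a reference image, so it applies only to test sets that ship references; the no-reference sets need metrics such as UIQM instead.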

Circularity Check

0 steps flagged

No derivation reduces to fitted input by construction; ITSS loss and CLIP guidance remain external

full rationale

The paper introduces a Prior-Free Illumination Estimator grounded in Retinex and a Semantics-Guided Image Restorer that injects CLIP-generated text via a newly defined ITSS loss. Neither component is shown to be fitted to a subset of the target outputs and then re-predicted; the multimodal dataset LUIQD-TD is constructed externally rather than derived from model parameters. Standard Retinex decomposition is invoked without self-referential redefinition of illumination or reflectance terms. Any self-citations are peripheral and do not carry the central claim of first multimodal UIE integration. The reported performance gains are therefore not forced by internal re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Central claim rests on Retinex as a flexible illumination model, CLIP providing useful semantics for underwater scenes, and the new dataset being large enough to train without overfitting; no free parameters are named in the abstract.

axioms (2)
  • [domain assumption] Retinex model can be used for illumination correction without rigid physical assumptions
    Basis for the Prior-Free Illumination Estimator component
  • [domain assumption] CLIP text embeddings supply perceptually meaningful guidance for image restoration
    Core premise of the Semantics-Guided Image Restorer
invented entities (2)
  • LUIQD-TD dataset [no independent evidence]
    purpose: Provide large-scale image-reference-text triplets for multimodal UIE training
    Newly constructed collection of 6,418 triplets; no external validation mentioned
  • ITSS loss [no independent evidence]
    purpose: Explicitly optimize semantic consistency between images and text descriptions
    Newly designed loss function; details of formulation absent from abstract

pith-pipeline@v0.9.0 · 5589 in / 1425 out tokens · 48675 ms · 2026-05-15T14:52:03.658370+00:00 · methodology


