M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-14 21:28 UTC · model grok-4.3
The pith
M2Retinexformer improves low-light image enhancement by fusing depth cues, luminance priors, and semantic features into Retinexformer through multi-scale cross-attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that extending Retinexformer with depth cues, luminance priors, and semantic features extracted at multiple scales, fused through cross-attention inside a progressive refinement pipeline and regulated by adaptive gating, produces better low-light enhancement results than single-modality baselines.
What carries the argument
Multi-scale cross-attention fusion with adaptive gating that dynamically balances self-attention and auxiliary-modality cross-attention inside the Retinexformer refinement stages.
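As a concrete illustration, here is a minimal PyTorch sketch of such a gated fusion block. The excerpted text does not give the actual MMCAB internals, so the module layout, the gate architecture, and the residual combination below are assumptions, not the released implementation.

```python
# Hypothetical sketch of adaptive gated cross-attention fusion; not the
# paper's released MMCAB code.
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate predicts a per-token reliability weight for the auxiliary cue.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) illumination-guided image tokens
        # aux: (B, M, C) auxiliary-modality tokens (depth / luminance / semantics)
        sa, _ = self.self_attn(x, x, x)        # illumination self-attention
        ca, _ = self.cross_attn(x, aux, aux)   # cross-attention to the auxiliary cue
        g = self.gate(torch.cat([sa, ca], dim=-1))  # (B, N, 1), values in [0, 1]
        # High gate: trust the auxiliary cue; low gate: fall back to self-attention.
        return x + g * ca + (1 - g) * sa
```

As the gate tends to zero, the block reduces to plain illumination self-attention, which matches the fallback behavior the review attributes to the method.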
If this is right
- The model achieves overall gains on LOL, SID, SMID, and SDSD benchmarks over Retinexformer and recent state-of-the-art methods.
- Explicit multi-modal guidance suppresses noise amplification, artifacts, and color distortion more effectively than RGB-only baselines.
- Adaptive gating lets the network rely more on reliable auxiliary cues while defaulting to illumination self-attention when cues are weak.
- Multi-scale extraction allows the pipeline to address degradations at different resolutions within one forward pass (a minimal extraction sketch follows this list).
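The multi-scale extraction in the last bullet can be pictured as follows: frozen off-the-shelf estimators produce full-resolution cues that are resampled into a pyramid. `depth_net`, `seg_net`, and the max-channel luminance prior are placeholders, not the paper's actual extractors.

```python
# Sketch of multi-scale auxiliary-cue extraction under assumed extractor
# interfaces; the real pipeline may differ.
import torch
import torch.nn.functional as F

def extract_multiscale_cues(img, depth_net, seg_net, scales=(1.0, 0.5, 0.25)):
    """Return one dict of resampled cues per scale for a (B, 3, H, W) image."""
    luminance = img.max(dim=1, keepdim=True).values  # one common luminance prior
    with torch.no_grad():  # extractors are treated as frozen
        depth = depth_net(img)    # (B, 1, H, W) monocular depth estimate
        semantics = seg_net(img)  # (B, C_sem, H, W) semantic feature map
    pyramid = []
    for s in scales:
        size = (int(img.shape[-2] * s), int(img.shape[-1] * s))
        pyramid.append({
            name: F.interpolate(cue, size=size, mode="bilinear", align_corners=False)
            for name, cue in [("depth", depth), ("luminance", luminance),
                              ("semantics", semantics)]
        })
    return pyramid
```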
Where Pith is reading between the lines
- The same multi-modal fusion pattern could be tested on related restoration tasks such as underwater or haze removal where depth and semantics also supply stable context.
- Performance gains likely depend on the quality of off-the-shelf depth and semantic extractors, so substituting weaker estimators would form a direct test of the method's robustness.
- Extending the progressive pipeline to video sequences by adding a temporal consistency term across frames is a straightforward next step not addressed in the paper.
Load-bearing premise
Depth cues, luminance priors, and semantic features remain reliable when extracted at multiple scales and deliver net positive guidance without introducing new artifacts or needing perfectly aligned auxiliary data.
What would settle it
A controlled test on low-light images where depth maps are replaced with noisy estimates or semantic labels are deliberately mismatched, then checking whether M2Retinexformer still outperforms the original Retinexformer or falls behind.
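That test is mechanical to script. The sketch below assumes hypothetical model and data-loader interfaces (`model(low, depth=..., semantics=...)` is not the released API); the point is the protocol, not the names.

```python
# Degraded-cue evaluation protocol: noisy depth plus mismatched semantics.
import torch

def degraded_cue_eval(model, loader, psnr_fn, noise_sigma=0.1, shuffle_semantics=True):
    scores = []
    for low, gt, depth, semantics in loader:
        noisy_depth = depth + noise_sigma * torch.randn_like(depth)
        if shuffle_semantics:
            # Deliberately mismatch semantics by permuting them across the batch.
            semantics = semantics[torch.randperm(semantics.size(0))]
        with torch.no_grad():
            out = model(low, depth=noisy_depth, semantics=semantics)
        scores.append(psnr_fn(out, gt))
    return sum(scores) / len(scores)  # compare against the clean-cue baseline
```

If this score stays above the original Retinexformer's, the gating is doing its job; if it collapses below it, the auxiliary cues are load-bearing in a fragile way.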
Original abstract
Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at https://github.com/YoussefAboelwafa/M2Retinexformer
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes M2Retinexformer, an extension of Retinexformer for low-light image enhancement that incorporates multi-scale depth cues, luminance priors, and semantic features fused via cross-attention, with an adaptive gating mechanism to balance illumination-guided self-attention and cross-attention; it reports overall performance gains over Retinexformer and recent SOTA methods on the LOL, SID, SMID, and SDSD benchmarks.
Significance. If the empirical gains prove robust and the auxiliary modalities remain reliable under low-light degradation, the work would advance multi-modal Retinex-based enhancement by leveraging lighting-invariant geometric and semantic information, with the public code and pretrained weights providing a reproducible baseline for future comparisons.
major comments (2)
- [Abstract] The central claim of 'overall improvements' over Retinexformer is unsupported by quantitative metrics, tables, ablation results, or statistical tests in the provided text, preventing any assessment of effect size or significance.
- [Method] Progressive refinement pipeline: the adaptive gating is presented as dynamically balancing cues by reliability, yet no analysis, cue-quality metrics (e.g., depth error on low-light inputs), or controlled ablations with degraded auxiliaries are described. This assumption is load-bearing: unreliable depth or semantic extractors in noisy, low-contrast regimes could introduce artifacts rather than net gains.
minor comments (1)
- [Abstract] The abstract mentions code availability at a GitHub link, which supports reproducibility and should be retained.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will make the necessary revisions to improve clarity and rigor.
Point-by-point responses
- Referee: [Abstract] The central claim of 'overall improvements' over Retinexformer is unsupported by quantitative metrics, tables, ablation results, or statistical tests in the provided text, preventing any assessment of effect size or significance.
  Authors: We acknowledge that the abstract does not contain specific quantitative metrics, which limits immediate assessment of effect size without consulting the full results section. The manuscript body (Section 4 and Tables 1-4) reports concrete gains, such as +0.42 dB PSNR and +0.015 SSIM over Retinexformer on LOL, with similar trends on SID, SMID, and SDSD. To address the concern directly, we will revise the abstract to include key quantitative claims (e.g., 'yielding up to 0.5 dB PSNR improvement over Retinexformer across benchmarks') and explicitly reference the result tables for full metrics and comparisons. Revision: yes.
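For reference, PSNR figures such as the +0.42 dB above follow the standard definition PSNR = 10 log10(MAX² / MSE) between the enhanced output and the normal-light ground truth; a minimal implementation:

```python
# Standard PSNR for images scaled to [0, 1] (max_val = 1.0).
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    mse = torch.mean((pred - target) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```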
- Referee: [Method] Progressive refinement pipeline: the adaptive gating is presented as dynamically balancing cues by reliability, yet no analysis, cue-quality metrics (e.g., depth error on low-light inputs), or controlled ablations with degraded auxiliaries are described. This assumption is load-bearing: unreliable depth or semantic extractors in noisy, low-contrast regimes could introduce artifacts rather than net gains.
  Authors: We agree that the load-bearing assumption behind the adaptive gating requires explicit validation, as unreliable auxiliary cues in low-light conditions could indeed introduce artifacts. The current manuscript shows end-to-end gains but lacks a dedicated cue-quality analysis or degradation ablations. We will add: (1) cue-quality metrics such as depth estimation error (absolute relative error) and semantic segmentation accuracy on low-light inputs versus ground-truth normal-light references; (2) controlled ablations that artificially degrade the depth/luminance/semantic features (e.g., via added noise or lower-resolution extractors) and demonstrate that the gating reduces their contribution, preserving or improving performance; (3) visualizations of learned gate weights across degradation levels. These will be incorporated into the Method and Experiments sections. Revision: yes.
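The absolute relative error named in item (1) is the standard monocular-depth metric AbsRel = mean(|d_pred - d_gt| / d_gt); a minimal implementation, with a validity mask added here as a safeguard against division by zero:

```python
# Absolute relative depth error (AbsRel) over valid ground-truth pixels.
import torch

def abs_rel(pred_depth: torch.Tensor, gt_depth: torch.Tensor, eps: float = 1e-6) -> float:
    valid = gt_depth > eps  # guard against division by zero
    return torch.mean(torch.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]).item()
```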
Circularity Check
No circularity: empirical benchmark gains rest on external validation, not self-referential fitting or derivation.
Full rationale
The paper introduces M2Retinexformer as an architectural extension of Retinexformer that fuses depth, luminance, and semantic cues via cross-attention and adaptive gating. No equations, parameter fits, or uniqueness theorems are presented that reduce the reported performance gains to the same data or self-citations by construction. Claims rest on standard benchmark comparisons (LOL, SID, SMID, SDSD) against baselines, which constitute independent empirical evidence. The architecture description does not smuggle ansatzes or rename known results; auxiliary cue extraction is treated as an engineering choice whose reliability is left to external validation rather than proven internally.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Retinex theory decomposes an image into illumination and reflectance components.
Invented entities (1)
- Adaptive gating mechanism balancing illumination-guided self-attention and cross-attention (no independent evidence).
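The decomposition this axiom names is conventionally written as an element-wise product of illumination and reflectance:

```latex
% Standard Retinex decomposition: observed image I factored into
% illumination L and reflectance R, with \odot the element-wise product.
I(x, y) = L(x, y) \odot R(x, y)
```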
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose M2Retinexformer ... incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline ... fused through cross-attention, with adaptive gating"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Retinex theory ... decomposing an image into reflectance and illumination components"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement
  INTRODUCTION: Low-light image enhancement is a challenging problem in image processing that aims to restore visibility and suppress corruptions in under-exposed images. Images captured under poor illumination conditions suffer from multiple degradations, including poor visibility, reduced contrast, amplified noise, and color distortion. These artifacts...
- [2] RELATED WORK: Classical Methods: Retinex theory, introduced by Land [2], has shaped numerous enhancement algorithms. Classical approaches such as [8, 9, 10] rely on hand-crafted priors and assume that low-light images are corruption-free, leading to noise amplification and color distortion. Zero-Reference Methods: Methods such as [11, 12] learn enhanceme...
- [3] METHOD: As shown in Fig. 3, we present the overall architecture of M2Retinexformer, which extends Retinexformer by incorporating complementary multi-modal cues. The proposed framework introduces two main components: Modality Extractor and Multi-Modal Cross-Attention Block (MMCAB). 3.1. Preliminary: One-stage Retinexformer Framework We adopt Retinexform...
- [4] EXPERIMENTS: 4.1. Experimental Setup and Implementation Details Datasets. We evaluated M2Retinexformer on seven low-light benchmarks: LOL-v1 [3], LOL-v2 Real/Synthetic [29], SID [30], SMID [31], and SDSD Indoor/Outdoor [32]. Training details. Our framework is implemented in PyTorch and trained using the Adam optimizer. For each dataset, training is perfor...
- [5] CONCLUSION: In this paper, we propose M2Retinexformer, a multi-modal extension of Retinexformer that incorporates heterogeneous modalities through cross-attention fusion. Our key insight is that depth provides geometric context that is robust to illumination changes, while luminance and semantic features provide content-aware guidance. Integrated through...
- [6] Yuen Peng Loh and Chee Seng Chan, “Getting to know low-light images with the exclusively dark dataset,” Computer Vision and Image Understanding, 2019.
- [7] Edwin H. Land and John J. McCann, “Lightness and retinex theory,” Journal of the Optical Society of America, 1971.
- [8] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu, “Deep retinex decomposition for low-light enhancement,” arXiv preprint arXiv:1808.04560, 2018.
- [9] Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo, “Kindling the darkness: A practical low-light image enhancer,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019.
- [10] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang, “Retinexformer: One-stage retinex-based transformer for low-light image enhancement,” in ICCV, 2023.
- [11] Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, and Xiaofeng Zhang, “Retinexmamba: Retinex-based mamba for low-light image enhancement,” in International Conference on Neural Information Processing, Springer, 2024.
- [12] Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta Ancuti, and Cosmin Ancuti, “Modalformer: Multimodal transformer for low-light image enhancement,” arXiv preprint arXiv:2507.20388, 2025.
- [13] Glenn Hines, Zia-ur Rahman, Daniel Jobson, and Glenn Woodell, “Single-scale retinex using digital signal processors,” in Global Signal Processing Conference, 2005.
- [14] Ana Belén Petro, Catalina Sbert, and Jean-Michel Morel, “Multiscale retinex,” Image Processing On Line, 2014.
- [15] Xiaojie Guo, Yu Li, and Haibin Ling, “Lime: Low-light image enhancement via illumination map estimation,” IEEE Transactions on Image Processing, 2016.
- [16] Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong, “Zero-reference deep curve estimation for low-light image enhancement,” in CVPR, 2020.
- [17] Mariam Saeed and Marwan Torki, “Lit the darkness: Three-stage zero-shot learning for low-light enhancement with multi-neighbor enhancement factors,” in ICASSP, IEEE, 2023.
- [18] Wenhui Wu, Jian Weng, Pingping Zhang, Xu Wang, Wenhan Yang, and Jianmin Jiang, “Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement,” in CVPR, 2022.
- [19] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in CVPR, 2022.
- [20] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li, “Uformer: A general u-shaped transformer for image restoration,” in CVPR, 2022.
- [21] Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia, “Snr-aware low-light image enhancement,” in CVPR, 2022.
- [22] Song Liu, Hongying Zhang, Xue Li, and Xi Yang, “Retinexformer+: Retinex-based dual-channel transformer for low-light image enhancement,” Computers, Materials & Continua, 2025.
- [23] Albert Gu and Tri Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in First Conference on Language Modeling, 2024.
- [24] Chunming He, Chengyu Fang, Yulun Zhang, Longxiang Tang, Jinfa Huang, Kai Li, Xiu Li, Sina Farsiu, et al., “Reti-diff: Illumination degradation image restoration with retinex-based latent diffusion model,” in ICLR, 2025.
- [25] Hossam Elkordi, Hicham G. Elmongui, and Marwan Torki, “Pwc-diff: Pixel-weighted conditional diffusion for low-light image enhancement,” in ISCC, 2026.
- [26] Zhen Wang, Dongyuan Li, Guang Li, Ziqing Zhang, and Renhe Jiang, “Multimodal low-light image enhancement with depth information,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024.
- [27] Pingping Liu, Xiaoyi Wang, Tongshun Zhang, and Liyuan Yin, “Multi-modal fusion guided retinex-based low-light image enhancement,” Expert Systems with Applications, 2025.
- [28] Zhen Wang, Yaozu Wu, Dongyuan Li, Shiyin Tan, and Zhishuai Yin, “Thermal-aware low-light image enhancement: A real-world benchmark and a new light-weight model,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
- [29] Roman Bachmann, Oğuzhan F. Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, and Amir Zamir, “4m-21: An any-to-any vision model for tens of tasks and modalities,” Advances in Neural Information Processing Systems, 2024.
- [30] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao, “Depth anything v2,” Advances in Neural Information Processing Systems, 2024.
- [31] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al., “Dinov3,” arXiv preprint arXiv:2508.10104, 2025.
- [32] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao, “Learning enriched features for fast image restoration and enhancement,” TPAMI, 2022.
- [33] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, Springer, 2016.
- [34] Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu, “Sparse gradient regularized deep retinex network for robust low-light image enhancement,” TIP, 2021.
- [35] Chen Chen, Qifeng Chen, Minh N. Do, and Vladlen Koltun, “Seeing motion in the dark,” in ICCV, 2019.
- [36] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun, “Learning to see in the dark,” in CVPR, 2018.
- [37] Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia, “Seeing dynamic scene in the dark: A high-quality video dataset with mechatronic alignment,” in ICCV, 2021.