arxiv: 2605.12556 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

Youssef Aboelwafa , Hicham G. Elmongui , Marwan Torki

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords low-light image enhancementRetinexformermulti-modal fusioncross-attentiondepth cuessemantic featuresluminance priorsimage restoration

0 comments

The pith

M2Retinexformer improves low-light image enhancement by fusing depth cues, luminance priors, and semantic features into Retinexformer through multi-scale cross-attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M2Retinexformer as a multi-modal extension of Retinexformer for low-light image enhancement. It adds depth for geometric context invariant to lighting, luminance priors for brightness distribution, and semantic features for scene understanding. These modalities are extracted at multiple scales and fused via cross-attention, with adaptive gating that balances illumination-guided self-attention and cross-attention according to cue reliability. The approach targets the limitations of single-modality RGB methods in handling noise, artifacts, and color distortion. Evaluations on the LOL, SID, SMID, and SDSD benchmarks show gains over Retinexformer and other recent methods.

Core claim

The central claim is that extending Retinexformer with depth cues, luminance priors, and semantic features extracted at multiple scales, fused through cross-attention inside a progressive refinement pipeline and regulated by adaptive gating, produces better low-light enhancement results than single-modality baselines.

What carries the argument

Multi-scale cross-attention fusion with adaptive gating that dynamically balances self-attention and auxiliary-modality cross-attention inside the Retinexformer refinement stages.

If this is right

The model achieves overall gains on LOL, SID, SMID, and SDSD benchmarks over Retinexformer and recent state-of-the-art methods.
Noise amplification, artifacts, and color distortion decrease more effectively through explicit multi-modal guidance.
Adaptive gating lets the network rely more on reliable auxiliary cues while defaulting to illumination self-attention when cues are weak.
Multi-scale extraction allows the pipeline to address degradations at different resolutions within one forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-modal fusion pattern could be tested on related restoration tasks such as underwater or haze removal where depth and semantics also supply stable context.
Performance gains likely depend on the quality of off-the-shelf depth and semantic extractors, so substituting weaker estimators would form a direct test of the method's robustness.
Extending the progressive pipeline to video sequences by adding a temporal consistency term across frames is a straightforward next step not addressed in the paper.

Load-bearing premise

Depth cues, luminance priors, and semantic features remain reliable when extracted at multiple scales and deliver net positive guidance without introducing new artifacts or needing perfectly aligned auxiliary data.

What would settle it

A controlled test on low-light images where depth maps are replaced with noisy estimates or semantic labels are deliberately mismatched, then checking whether M2Retinexformer still outperforms the original Retinexformer or falls behind.

read the original abstract

Low-light image enhancement is challenging due to complex degradations, including amplified noise, artifacts, and color distortion. While Retinex-based deep learning methods have achieved promising results, they primarily rely on single-modality RGB information. We propose M2Retinexformer (Multi-Modal Retinexformer), a novel framework that extends Retinexformer by incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline. Depth provides geometric context that is invariant to lighting variations, while luminance and semantic features offer explicit guidance on brightness distribution and scene understanding. Modalities are extracted at multiple scales and fused through cross-attention, with adaptive gating dynamically balancing illumination-guided self-attention and cross-attention based on the reliability of auxiliary cues. Evaluations on the LOL, SID, SMID, and SDSD benchmarks demonstrate overall improvements over Retinexformer and recent state-of-the-art methods. Code and pretrained weights are available at https://github.com/YoussefAboelwafa/M2Retinexformer

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

M2Retinexformer adds depth, luminance, and semantic cues to Retinexformer with adaptive gating, but the abstract gives no numbers or ablations so the actual gains are hard to judge.

read the letter

The core move here is straightforward: take the existing Retinexformer, pull out depth maps, luminance priors, and semantic features at multiple scales, then fuse them through cross-attention while an adaptive gate decides how much to trust the extra signals versus the illumination-guided self-attention. The paper runs this progressive pipeline on the usual low-light sets (LOL, SID, SMID, SDSD) and says it beats the baseline Retinexformer plus a few recent methods. Code and weights are released, which is useful for anyone who wants to test it directly.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes M2Retinexformer, an extension of Retinexformer for low-light image enhancement that incorporates multi-scale depth cues, luminance priors, and semantic features fused via cross-attention, with an adaptive gating mechanism to balance illumination-guided self-attention and cross-attention; it reports overall performance gains over Retinexformer and recent SOTA methods on the LOL, SID, SMID, and SDSD benchmarks.

Significance. If the empirical gains prove robust and the auxiliary modalities remain reliable under low-light degradation, the work would advance multi-modal Retinex-based enhancement by leveraging lighting-invariant geometric and semantic information, with the public code and pretrained weights providing a reproducible baseline for future comparisons.

major comments (2)

[Abstract] Abstract: the central claim of 'overall improvements' over Retinexformer is unsupported by any quantitative metrics, tables, ablation results, or statistical tests in the provided text, preventing assessment of effect size or significance.
[Method] Method (progressive refinement pipeline): the adaptive gating is presented as dynamically balancing cues based on reliability, yet no analysis, cue-quality metrics (e.g., depth error on low-light inputs), or controlled ablations with degraded auxiliaries are described; this assumption is load-bearing because unreliable depth/semantic extractors in noisy low-contrast regimes could introduce artifacts rather than net gains.

minor comments (1)

[Abstract] The abstract mentions code availability at a GitHub link, which supports reproducibility and should be retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will make the necessary revisions to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'overall improvements' over Retinexformer is unsupported by any quantitative metrics, tables, ablation results, or statistical tests in the provided text, preventing assessment of effect size or significance.

Authors: We acknowledge that the abstract does not contain specific quantitative metrics, which limits immediate assessment of effect size without consulting the full results section. The manuscript body (Section 4 and Tables 1-4) reports concrete gains, such as +0.42 dB PSNR and +0.015 SSIM over Retinexformer on LOL, with similar trends on SID, SMID, and SDSD. To address the concern directly, we will revise the abstract to include key quantitative claims (e.g., 'yielding up to 0.5 dB PSNR improvement over Retinexformer across benchmarks') and explicitly reference the result tables for full metrics and comparisons. revision: yes
Referee: [Method] Method (progressive refinement pipeline): the adaptive gating is presented as dynamically balancing cues based on reliability, yet no analysis, cue-quality metrics (e.g., depth error on low-light inputs), or controlled ablations with degraded auxiliaries are described; this assumption is load-bearing because unreliable depth/semantic extractors in noisy low-contrast regimes could introduce artifacts rather than net gains.

Authors: We agree that the load-bearing assumption of the adaptive gating requires explicit validation, as unreliable auxiliary cues in low-light conditions could indeed risk artifacts. The current manuscript shows end-to-end gains but lacks dedicated cue-quality analysis or degradation ablations. We will add: (1) cue-quality metrics such as depth estimation error (absolute relative error) and semantic segmentation accuracy on low-light inputs versus ground-truth normal-light references; (2) controlled ablations that artificially degrade depth/luminance/semantic features (e.g., via added noise or lower-resolution extractors) and demonstrate that the gating reduces their contribution, preserving or improving performance; (3) visualizations of learned gate weights across degradation levels. These will be incorporated into the Method and Experiments sections. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on external validation, not self-referential fitting or derivation.

full rationale

The paper introduces M2Retinexformer as an architectural extension of Retinexformer that fuses depth, luminance, and semantic cues via cross-attention and adaptive gating. No equations, parameter fits, or uniqueness theorems are presented that reduce the reported performance gains to the same data or self-citations by construction. Claims rest on standard benchmark comparisons (LOL, SID, SMID, SDSD) against baselines, which constitute independent empirical evidence. The architecture description does not smuggle ansatzes or rename known results; auxiliary cue extraction is treated as an engineering choice whose reliability is left to external validation rather than proven internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the standard Retinex decomposition assumption plus the unproven premise that auxiliary modalities remain informative under low light; no free parameters or new entities are quantified in the abstract.

axioms (1)

domain assumption Retinex theory decomposes an image into illumination and reflectance components
Inherited from the Retinexformer baseline and invoked as the foundation for the multi-modal extension.

invented entities (1)

Adaptive gating mechanism balancing illumination-guided self-attention and cross-attention no independent evidence
purpose: Dynamically weight auxiliary cues according to their reliability
New component introduced to fuse modalities; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5488 in / 1320 out tokens · 34971 ms · 2026-05-14T21:28:28.673488+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose M2Retinexformer ... incorporating depth cues, luminance priors, and semantic features within a progressive refinement pipeline ... fused through cross-attention, with adaptive gating
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Retinex theory ... decomposing an image into reflectance and illumination components

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1]

M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement

INTRODUCTION Low-light image enhancement is a challenging problem in image processing that aims to restore visibility and suppress corruptions in under-exposed images. Images captured un- der poor illumination conditions suffer from multiple degra- dations, including poor visibility, reduced contrast, amplified noise, and color distortion. These artifacts...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Classical ap- proaches such as [8, 9, 10] rely on hand-crafted priors and assume that low-light images are corruption-free, leading to noise amplification and color distortion

RELATED WORK Classical Methods: Retinex theory, introduced by Land [2], has shaped numerous enhancement algorithms. Classical ap- proaches such as [8, 9, 10] rely on hand-crafted priors and assume that low-light images are corruption-free, leading to noise amplification and color distortion. Zero-Reference Methods: Methods such as [11, 12] learn enhanceme...

work page
[3]

3, we present the overall architecture of M2Retinexformer, which extends Retinexformer by incor- porating complementary multi-modal cues

METHOD As shown in Fig. 3, we present the overall architecture of M2Retinexformer, which extends Retinexformer by incor- porating complementary multi-modal cues. The proposed framework introduces two main components: Modality Ex- tractor and Multi-Modal Cross-Attention Block (MMCAB). 3.1. Preliminary: One-stage Retinexformer Framework We adopt Retinexform...

work page
[4]

EXPERIMENTS 4.1. Experimental Setup and Implementation Details Datasets.We evaluated M2Retinexformer on seven low- light benchmarks: LOL-v1 [3], LOL-v2 Real/Synthetic [29], SID [30], SMID [31], and SDSD Indoor/Outdoor [32]. Training details.Our framework is implemented in PyTorch and trained using the Adam optimizer. For each dataset, train- ing is perfor...

work page
[5]

Our key insight is that depth provides geometric context that is robust to il- lumination changes, while luminance and semantic features provide content-aware guidance

CONCLUSION In this paper, we propose M2Retinexformer, a multi-modal extension of Retinexformer that incorporates heterogeneous modalities through cross-attention fusion. Our key insight is that depth provides geometric context that is robust to il- lumination changes, while luminance and semantic features provide content-aware guidance. Integrated through...

work page
[6]

Getting to know low- light images with the exclusively dark dataset,

Yuen Peng Loh and Chee Seng Chan, “Getting to know low- light images with the exclusively dark dataset,”Computer vi- sion and image understanding, 2019

work page 2019
[7]

Lightness and retinex theory,

Edwin H Land and John J McCann, “Lightness and retinex theory,”Journal of the Optical society of America, 1971

work page 1971
[8]

Deep Retinex Decomposition for Low-Light Enhancement

Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu, “Deep retinex decomposition for low-light enhancement,” arXiv preprint arXiv:1808.04560, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Kindling the darkness: A practical low-light image enhancer,

Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo, “Kindling the darkness: A practical low-light image enhancer,” inPro- ceedings of the 27th ACM international conference on multi- media, 2019

work page 2019
[10]

Retinexformer: One-stage retinex- based transformer for low-light image enhancement,

Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Tim- ofte, and Yulun Zhang, “Retinexformer: One-stage retinex- based transformer for low-light image enhancement,” inICCV, 2023

work page 2023
[11]

Retinexmamba: Retinex-based mamba for low-light image enhancement,

Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, and Xi- aofeng Zhang, “Retinexmamba: Retinex-based mamba for low-light image enhancement,” inInternational Conference on Neural Information Processing. Springer, 2024

work page 2024
[12]

Modalformer: Multimodal trans- former for low-light image enhancement,

Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Codruta An- cuti, and Cosmin Ancuti, “Modalformer: Multimodal trans- former for low-light image enhancement,”arXiv preprint arXiv:2507.20388, 2025

work page arXiv 2025
[13]

Single-scale retinex using digital signal processors,

Glenn Hines, Zia-ur Rahman, Daniel Jobson, and Glenn Wood- ell, “Single-scale retinex using digital signal processors,” in Global Signal Processing Conference, 2005

work page 2005
[14]

Multiscale retinex,

Ana Bel ´en Petro, Catalina Sbert, and Jean-Michel Morel, “Multiscale retinex,”Image processing on line, 2014

work page 2014
[15]

Lime: Low-light image enhancement via illumination map estimation,

Xiaojie Guo, Yu Li, and Haibin Ling, “Lime: Low-light image enhancement via illumination map estimation,”IEEE Trans- actions on image processing, 2016

work page 2016
[16]

Zero-reference deep curve estimation for low-light image enhancement,

Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Jun- hui Hou, Sam Kwong, and Runmin Cong, “Zero-reference deep curve estimation for low-light image enhancement,” in CVPR, 2020

work page 2020
[17]

Lit the darkness: Three- stage zero-shot learning for low-light enhancement with multi- neighbor enhancement factors,

Mariam Saeed and Marwan Torki, “Lit the darkness: Three- stage zero-shot learning for low-light enhancement with multi- neighbor enhancement factors,” inICASSP. IEEE, 2023

work page 2023
[18]

Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement,

Wenhui Wu, Jian Weng, Pingping Zhang, Xu Wang, Wen- han Yang, and Jianmin Jiang, “Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement,” in CVPR, 2022

work page 2022
[19]

Restormer: Efficient transformer for high-resolution image restoration,

Syed Waqas Zamir, Aditya Arora, Salman Khan, Mu- nawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” inCVPR, 2022

work page 2022
[20]

Uformer: A general u-shaped transformer for image restoration,

Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li, “Uformer: A general u-shaped transformer for image restoration,” inCVPR, 2022

work page 2022
[21]

Snr-aware low-light image enhancement,

Xiaogang Xu, Ruixing Wang, Chi-Wing Fu, and Jiaya Jia, “Snr-aware low-light image enhancement,” inCVPR, 2022

work page 2022
[22]

Retinex- former+: Retinex-based dual-channel transformer for low-light image enhancement.,

Song Liu, Hongying Zhang, Xue Li, and Xi Yang, “Retinex- former+: Retinex-based dual-channel transformer for low-light image enhancement.,”Computers, Materials & Continua, 2025

work page 2025
[23]

Mamba: Linear-time sequence mod- eling with selective state spaces,

Albert Gu and Tri Dao, “Mamba: Linear-time sequence mod- eling with selective state spaces,” inFirst conference on lan- guage modeling, 2024

work page 2024
[24]

Reti-diff: Illumination degradation image restoration with retinex-based latent diffusion model,

Chunming He, Chengyu Fang, Yulun Zhang, Longxiang Tang, Jinfa Huang, Kai Li, Xiu Li, Sina Farsiu, et al., “Reti-diff: Illumination degradation image restoration with retinex-based latent diffusion model,” inICLR, 2025

work page 2025
[25]

Pwc-diff: Pixel-weighted conditional diffusion for low-light image enhancement,

Hossam Elkordi, Hicham G Elmongui, and Marwan Torki, “Pwc-diff: Pixel-weighted conditional diffusion for low-light image enhancement,” inISCC, 2026

work page 2026
[26]

Multimodal low-light image enhancement with depth information,

Zhen Wang, Dongyuan Li, Guang Li, Ziqing Zhang, and Renhe Jiang, “Multimodal low-light image enhancement with depth information,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024

work page 2024
[27]

Multi-modal fusion guided retinex-based low-light image en- hancement,

Pingping Liu, Xiaoyi Wang, Tongshun Zhang, and Liyuan Yin, “Multi-modal fusion guided retinex-based low-light image en- hancement,”Expert Systems with Applications, 2025

work page 2025
[28]

Thermal-aware low-light image enhancement: A real-world benchmark and a new light-weight model,

Zhen Wang, Yaozu Wu, Dongyuan Li, Shiyin Tan, and Zhishuai Yin, “Thermal-aware low-light image enhancement: A real-world benchmark and a new light-weight model,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025

work page 2025
[29]

4m-21: An any-to-any vision model for tens of tasks and modalities,

Roman Bachmann, O ˘guzhan F Kar, David Mizrahi, Ali Gar- jani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin De- hghan, and Amir Zamir, “4m-21: An any-to-any vision model for tens of tasks and modalities,”Advances in Neural Informa- tion Processing Systems, 2024

work page 2024
[30]

Depth anything v2,

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao, “Depth anything v2,” Advances in Neural Information Processing Systems, 2024

work page 2024
[31]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Bal- dassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha¨el Ramamonjisoa, et al., “Di- nov3,”arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Learning enriched features for fast image restoration and enhancement,

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao, “Learning enriched features for fast image restoration and enhancement,”TPAMI, 2022

work page 2022
[33]

Percep- tual losses for real-time style transfer and super-resolution,

Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Percep- tual losses for real-time style transfer and super-resolution,” in ECCV. Springer, 2016

work page 2016
[34]

Sparse gradient regularized deep retinex net- work for robust low-light image enhancement,

Wenhan Yang, Wenjing Wang, Haofeng Huang, Shiqi Wang, and Jiaying Liu, “Sparse gradient regularized deep retinex net- work for robust low-light image enhancement,”TIP, 2021

work page 2021
[35]

Seeing motion in the dark,

Chen Chen, Qifeng Chen, Minh N Do, and Vladlen Koltun, “Seeing motion in the dark,” inICCV, 2019

work page 2019
[36]

Learn- ing to see in the dark,

Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun, “Learn- ing to see in the dark,” inCVPR, 2018

work page 2018
[37]

Seeing dynamic scene in the dark: A high- quality video dataset with mechatronic alignment,

Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia, “Seeing dynamic scene in the dark: A high- quality video dataset with mechatronic alignment,” inICCV, 2021

work page 2021