pith. sign in

arxiv: 2605.23531 · v1 · pith:RYI3V5QBnew · submitted 2026-05-22 · 💻 cs.CV

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

Pith reviewed 2026-05-25 05:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords low-light image enhancementpixel-space processingsemantic promptingDINO featuresimage denoisingdetail recoverycomputer vision
0
0 comments X

The pith

PixIE enhances low-light images by injecting DINOv3 features into per-pixel modulation blocks after cross-scale denoising.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PixIE, a feed-forward pixel-space method for low-light image enhancement that first applies cross-scale denoising to suppress noise and keep structure, then refines output with DINO-Prompted Pixel Blocks. These blocks use patch-conditioned per-pixel modulation driven by intermediate features from DINOv3, supported by Spatial-Channel Compaction for efficient attention and Multi-Receptive-Field Pixel Embedding for neighborhood context. The approach targets noise, contrast loss, and semantic ambiguity together. If the method works as described, it would produce images with higher reconstruction fidelity and perceptual quality than prior techniques on standard benchmarks.

Core claim

PixIE performs cross-scale denoising followed by refinement in DINO-Prompted Pixel Blocks that inject intermediate DINOv3 features via patch-conditioned, spatially continuous per-pixel modulation; Spatial-Channel Compaction folds features into a compact grid to bound attention cost across scales, while Multi-Receptive-Field Pixel Embedding supplies neighborhood-aware representations before prompting, yielding average PSNR gains of 1.9-15.0 percent and LPIPS reductions of 8.5-44.4 percent on LLIE benchmarks along with sharper details and stable textures.

What carries the argument

DINO-Prompted Pixel Blocks (DPPB) that inject intermediate DINOv3 features via patch-conditioned per-pixel modulation to refine details after initial denoising.

If this is right

  • Cross-scale denoising suppresses noise while preserving structure before semantic refinement.
  • Spatial-Channel Compaction enables pixel-attention computation with bounded cost across multiple scales.
  • Multi-Receptive-Field Pixel Embedding increases robustness to signal-dependent noise compared with point-wise embeddings.
  • The overall pipeline recovers sharper details and more stable textures than recent state-of-the-art methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting structure could be tested on related restoration tasks such as dehazing or low-light video to check if foundation-model features transfer.
  • Efficiency from compaction might allow the framework to run on mobile hardware if the modulation blocks are further quantized.
  • Replacing DINOv3 with a different foundation model could reveal whether the gains depend on specific semantic properties of that model.

Load-bearing premise

The assumption that DINOv3 features can be injected via patch-conditioned per-pixel modulation to improve detail recovery without introducing semantic errors or artifacts in noisy low-light inputs.

What would settle it

A set of low-light test images where the enhanced outputs exhibit semantic artifacts, such as invented textures or misidentified object boundaries, traceable to mismatched DINOv3 feature injection.

Figures

Figures reproduced from arXiv: 2605.23531 by David Bull, Guoxi Huang, Nantheera Anantrasirichai, Ruirui Lin.

Figure 1
Figure 1. Figure 1: DINOv3 ViT-S/16 features under low-light degradation. Left: Mean token￾wise cosine similarity across the 12 transformer layers for three comparisons: Clean vs. Low-Match (noise), Low-Match vs. Low-Raw (illumination), and Clean vs. Low-Raw (noise and illumination). Similarity increases with depth, indicating progressively more stable representations; dotted vertical lines mark the layers {2, 5, 8, 11} used … view at source ↗
Figure 2
Figure 2. Figure 2: Visual comparison of RetinexFormer [2], CIDNet [49], and PixIE (Ours) on an example LoLv2-Real test image. The low-light input is histogram-stretched to match the ground truth for better noise visualization. PixIE demonstrates superior noise sup￾pression and detail restoration. hoods. Relying solely on single-pixel statistics makes it difficult to distinguish true signals from noise [7, 8, 46], motivating … view at source ↗
Figure 3
Figure 3. Figure 3: The overall pipeline of our proposed PixIE. Given a low-light input, the cross￾scale denoising stream suppresses noise fine-to-coarse, followed by Multi-Receptive￾Field Pixel Embedding (MRPE), and DINO-Prompted Pixel Blocks (DPPBs) inject DINO semantic guidance at each scale, and multi-scale fusion aggregates refined fea￾tures to predict a residual correction Iˆ = I + R. Transformer blocks with fine-to-coa… view at source ↗
Figure 4
Figure 4. Figure 4: Patch-conditioned modulation strategy comparison: (A) Global broadcast ap￾plies one shared vector to all pixels. (B) Patch-wise constant uses one token per patch, yielding piecewise-constant modulation and grid seams at patch boundaries. (C) Per￾pixel MLP predicts pixel-wise parameters within each patch, but independent per-patch prediction may introduce boundary discontinuities. (D) Ours upsamples the tok… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison of PixIE with recent state-of-the-art methods on LoLv1 (top row), LoLv2-Real (second row), and LSRW (bottom row) test sets, respectively. 0.902. These results demonstrate that our pixel-space enhancement effectively balances noise suppression with structural fidelity, even in datasets like LSRW and LOLv2 that are captured under real, heavy low-light noise. Furthermore, PixIE achieves… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison of PixIE with recent state-of-the-art methods on five unpaired datasets to show generalization. where they are easily suppressed by aggressive denoising or global illumination correction. In the top row (LOLv1), most methods fail to recover subtle clothing textures, whereas PixIE better preserves these high-frequency patterns. PixIE restores the fine details of the ping-pong table wh… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative ablation results of PixIE on the LoLv2-Real dataset. ‘full mod’ denotes spatially continuous modulation in full resolution. We compare (i) w/o MDTA and w/o full mod, (ii) w/o MDTA and w full mod, and (iii) w MDTA and w full mod (full PixIE). As shown in [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
read the original abstract

Low-light images exhibit severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically-prompted by a vision foundation model. PixIE first performs a cross-scale denoising to suppress noise and preserve structure, then refines details with DINO-Prompted Pixel Blocks (DPPB) that inject intermediate DINOv3 features via patch-conditioned, spatially continuous per-pixel modulation. We introduce a Spatial-Channel Compaction (SCC), which folds features into a compact spatial grid and compresses in the channel dimension, so pixel-attention is computed efficiently with bounded cost across scales. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves the average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further demonstrate that PixIE recovers sharper details and more stable textures, resulting in improved reconstruction fidelity and perceptual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PixIE, a feed-forward pixel-space low-light image enhancement (LLIE) framework that first applies cross-scale denoising and then refines details using DINO-Prompted Pixel Blocks (DPPB) to inject intermediate DINOv3 features via patch-conditioned per-pixel modulation. It introduces Spatial-Channel Compaction (SCC) for efficient pixel-attention and Multi-Receptive-Field Pixel Embedding (MRPE) for neighborhood-aware representations. Experiments claim average PSNR gains of 1.9-15.0% and LPIPS reductions of 8.5-44.4% over recent SOTA methods on LLIE benchmarks, with qualitative improvements in detail and texture stability.

Significance. If the reported gains hold under rigorous validation and the DPPB modulation proves robust to signal-dependent noise without semantic artifacts, the work could meaningfully advance LLIE by showing how vision foundation model features can be efficiently integrated into pixel-space processing. The SCC mechanism for bounded-cost attention across scales is a potentially useful efficiency contribution if its implementation details are fully specified.

major comments (2)
  1. [DPPB and experiments sections] The central empirical claim of consistent PSNR/LPIPS gains rests on the assumption that DINOv3 features (trained on standard lighting) can be injected via DPPB without introducing semantic mismatches or artifacts in noisy low-light inputs; the manuscript provides no quantitative ablation isolating DPPB's contribution or failure-case analysis on signal-dependent noise, which is load-bearing for attributing improvements to the prompting mechanism rather than the denoising or embedding stages.
  2. [Experiments and results] The abstract and method description report average metric improvements but the manuscript lacks per-dataset tables with standard deviations, dataset splits, or statistical tests; without these, it is impossible to determine whether the 1.9-15.0% PSNR range reflects robust gains or is driven by particular benchmarks or post-hoc tuning.
minor comments (2)
  1. [Method] Notation for patch-conditioned modulation and the exact form of per-pixel modulation in DPPB should be formalized with equations to allow reproducibility.
  2. [Method] The paper should clarify the cross-scale denoising architecture (e.g., number of scales, loss terms) to distinguish its contribution from the novel DPPB component.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the empirical support for PixIE.

read point-by-point responses
  1. Referee: [DPPB and experiments sections] The central empirical claim of consistent PSNR/LPIPS gains rests on the assumption that DINOv3 features (trained on standard lighting) can be injected via DPPB without introducing semantic mismatches or artifacts in noisy low-light inputs; the manuscript provides no quantitative ablation isolating DPPB's contribution or failure-case analysis on signal-dependent noise, which is load-bearing for attributing improvements to the prompting mechanism rather than the denoising or embedding stages.

    Authors: We agree that a dedicated quantitative ablation isolating DPPB is necessary to attribute gains specifically to the prompting mechanism. The manuscript presents the full framework results but does not isolate DPPB from the cross-scale denoising and MRPE stages. In the revised version we will add an ablation study that removes or substitutes the DPPB module and reports the resulting metric changes. We will also include a failure-case analysis on low-light images exhibiting strong signal-dependent noise to examine potential semantic mismatches or artifacts introduced by DINOv3 features. revision: yes

  2. Referee: [Experiments and results] The abstract and method description report average metric improvements but the manuscript lacks per-dataset tables with standard deviations, dataset splits, or statistical tests; without these, it is impossible to determine whether the 1.9-15.0% PSNR range reflects robust gains or is driven by particular benchmarks or post-hoc tuning.

    Authors: We concur that per-dataset breakdowns with variability measures would improve transparency. The reported ranges are averages across the evaluated LLIE benchmarks, but the manuscript does not tabulate individual dataset results or standard deviations. In revision we will add per-dataset tables that include PSNR and LPIPS for each benchmark, along with standard deviations computed over multiple runs where available, and explicit details on the train/test splits employed. We will also explore the inclusion of statistical significance tests to support the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

full rationale

The paper describes a proposed architecture (cross-scale denoising + DPPB with DINOv3 injection via SCC and MRPE) and validates it via standard benchmark metrics (PSNR, LPIPS) on LLIE datasets. No equations or claims reduce by construction to fitted parameters or self-referential definitions; performance numbers are external measurements, not tautological. No load-bearing self-citations or uniqueness theorems are invoked in the provided text. This is a standard empirical ML paper whose central claims rest on reproducible experiments rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, free parameters, axioms, or invented scientific entities are described.

pith-pipeline@v0.9.0 · 5754 in / 1155 out tokens · 26789 ms · 2026-05-25T05:08:56.187105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 8 internal anchors

  1. [1]

    In: International Conference on Neural Informa- tion Processing (2024),https://api.semanticscholar.org/CorpusID:269604860

    Bai, J., Yin, Y., He, Q., Li, Y., Zhang, X.: Retinexmamba: Retinex-based mamba for low-light image enhancement. In: International Conference on Neural Informa- tion Processing (2024),https://api.semanticscholar.org/CorpusID:269604860

  2. [2]

    In: IEEE/CVF ICCV

    Cai, Y., Bian, H., Lin, J., Wang, H., Timofte, R., Zhang, Y.: Retinexformer: One- stage retinex-based transformer for low-light image enhancement. In: IEEE/CVF ICCV. pp. 12504–12513 (October 2023)

  3. [3]

    In: ECCV

    Chang, M., Li, Q., Feng, H., Xu, Z.: Spatial-adaptive network for single image de- noising. In: ECCV. p. 171–187. Springer-Verlag, Berlin, Heidelberg (2020).https: //doi.org/10.1007/978-3-030-58577-8_11,https://doi.org/10.1007/978-3- 030-58577-8_11

  4. [4]

    arXiv preprint arXiv:2504.07963 (2025)

    Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)

  5. [5]

    arXiv preprint arXiv:2511.18822 (2025)

    Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)

  6. [6]

    In: ICML

    Crowson, K., Baumann, S.A., Birch, A., Abraham, T.M., Kaplan, D.Z., Shippole, E.: Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In: ICML. JMLR.org (2024)

  7. [7]

    IEEE TIP17(10), 1737–1754 (2008).https://doi.org/10.1109/TIP.2008.2001399

    Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical poissonian-gaussian noise modeling and fitting for single-image raw-data. IEEE TIP17(10), 1737–1754 (2008).https://doi.org/10.1109/TIP.2008.2001399

  8. [8]

    In: NeurIPS

    Gou, Y., Hu, P., Lv, J., Zhou, J.T., Peng, X.: Multi-scale adaptive network for single image denoising. In: NeurIPS. Curran Associates Inc. (2022)

  9. [9]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

    Guo, C.G., Li, C., Guo, J., Loy, C.C., Hou, J., Kwong, S., Cong, R.: Zero-reference deep curve estimation for low-light image enhancement. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 1780– 1789 (June 2020)

  10. [10]

    IEEE Transactions on image processing26(2), 982–993 (2016)

    Guo, X., Li, Y., Ling, H.: Lime: Low-light image enhancement via illumination map estimation. IEEE Transactions on image processing26(2), 982–993 (2016)

  11. [11]

    Journal of Visual Communication and Image Representation90, 103712 (2023)

    Hai, J., Xuan, Z., Yang, R., Hao, Y., Zou, F., Lin, F., Han, S.: R2rnet: Low- light image enhancement via real-low to real-normal network. Journal of Visual Communication and Image Representation90, 103712 (2023)

  12. [12]

    In: ICML

    Hoogeboom, E., Heek, J., Salimans, T.: Simple diffusion: end-to-end diffusion for high resolution images. In: ICML. ICML’23, JMLR.org (2023)

  13. [13]

    Advances in Neural Information Processing Systems36, 79734–79747 (2023)

    Hou, J., Zhu, Z., Hou, J., Liu, H., Zeng, H., Yuan, H.: Global structure-aware diffusion process for low-light image enhancement. Advances in Neural Information Processing Systems36, 79734–79747 (2023)

  14. [14]

    In: Machine Learning from Challenging Data 2025

    Huang, G., Lin, R., Li, Y., Bull, D., Anantrasirichai, N.: Bvi-mamba: video en- hancement using a visual state-space model for low-light and underwater environ- ments. In: Machine Learning from Challenging Data 2025. vol. 13460, pp. 74–81. SPIE (2025)

  15. [15]

    Proceedings of the AAAI Conference on Artificial Intelligence (2026)

    Huang, G., Yang, Q., Qi, Z., Lin, R., Bull, D., Anantrasirichai, N.: Bayesian neu- ral networks for one-to-many mapping in image enhancement. Proceedings of the AAAI Conference on Artificial Intelligence (2026)

  16. [16]

    IEEE/CVF TCE53(4), 1752–1758 (2007)

    Ibrahim, H., Pik Kong, N.S.: Brightness Preserving Dynamic Histogram Equaliza- tion for Image Contrast Enhancement. IEEE/CVF TCE53(4), 1752–1758 (2007)

  17. [17]

    ACM TOG42(6), 1–14 (2023) 16 Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

    Jiang, H., Luo, A., Fan, H., Han, S., Liu, S.: Low-light image enhancement with wavelet-based diffusion models. ACM TOG42(6), 1–14 (2023) 16 Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

  18. [18]

    In: ECCV (2024)

    Jiang, H., Luo, A., Liu, X., Han, S., Liu, S.: Lightendiffusion: Unsupervised low- light image enhancement with latent-retinex diffusion models. In: ECCV (2024)

  19. [19]

    IEEE TIP30, 2340–2349 (2021)

    Jiang, Y., Gong, X., Liu, D., Cheng, Y., Fang, C., Shen, X., Yang, J., Zhou, P., Wang, Z.: Enlightengan: Deep light enhancement without paired supervision. IEEE TIP30, 2340–2349 (2021)

  20. [20]

    arXiv preprint arXiv:2306.15870 (2023)

    Jin, Z., Chen, S., Chen, Y., Xu, Z., Feng, H.: Let segment anything help image dehaze. arXiv preprint arXiv:2306.15870 (2023)

  21. [21]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980(2014),https://api.semanticscholar.org/CorpusID:6628106

  22. [22]

    Auto-Encoding Variational Bayes

    Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014), http://arxiv.org/abs/1312.6114

  23. [23]

    Segment Anything

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)

  24. [24]

    Scientific American237 6, 108–28 (1977)

    Land, E.H.: The retinex theory of color vision. Scientific American237 6, 108–28 (1977)

  25. [25]

    IEEE transactions on image processing22(12), 5372–5384 (2013)

    Lee, C., Lee, C., Kim, C.S.: Contrast enhancement based on layered difference representation of 2d histograms. IEEE transactions on image processing22(12), 5372–5384 (2013)

  26. [26]

    Li, S., Liu, M., Zhang, Y., Chen, S., Li, H., Dou, Z., Chen, H.: Sam-deblur: Let segmentanythingboostimagedeblurring.In:ICASSP.pp.2445–2449.IEEE(2024)

  27. [27]

    In: IEEE ICASSP

    Lin, J., Anantrasirichai, N., Bull, D.: Multi-scale denoising in the feature space for low-light instance segmentation. In: IEEE ICASSP. pp. 1–5 (2025)

  28. [28]

    arXiv preprint arXiv:2312.01677 (2023)

    Lin, X., Yue, J., Chan, K.C., Qi, L., Ren, C., Pan, J., Yang, M.H.: Multi-task image restoration guided by robust dino features. arXiv preprint arXiv:2312.01677 (2023)

  29. [29]

    Pattern Recognition61, 650–662 (2017)

    Lore,K.G.,Akintayo,A.,Sarkar,S.:Llnet:Adeepautoencoderapproachtonatural low-light image enhancement. Pattern Recognition61, 650–662 (2017)

  30. [30]

    In: ICLR (2023),https://api

    Luo, Z., Gustafsson, F.K., Zhao, Z., Sjolund, J., Schon, T.B.: Controlling vision- language models for multi-task image restoration. In: ICLR (2023),https://api. semanticscholar.org/CorpusID:263605463

  31. [31]

    In: BMVC (2018)

    Lv, F., Lu, F., Wu, J., Lim, C.: Mbllen: Low-light image/video enhancement using cnns. In: BMVC (2018)

  32. [32]

    IEEE Transactions on Image Processing24(11), 3345–3356 (2015)

    Ma, K., Zeng, K., Wang, Z.: Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing24(11), 3345–3356 (2015)

  33. [33]

    DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    Ma, Z., Wei, L., Wang, S., Zhang, S., Tian, Q.: Deco: Frequency-decoupled pixel diffusion for end-to-end image generation. arXiv preprint arXiv:2511.19365 (2025)

  34. [34]

    completely blind

    Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters20(3), 209–212 (2013).https: //doi.org/10.1109/LSP.2012.2227726

  35. [35]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,Howes,R.,Huang,P.Y.,Xu,H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ...

  36. [36]

    In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: IEEE/CVF ICCV. pp. 4172–4182 (2023).https://doi.org/10.1109/ICCV51070.2023.00387

  37. [37]

    In: ECCV

    Qi, C., Tu, Z., Ye, K., Delbracio, M., Milanfar, P., Chen, Q., Talebi, H.: Spire: Se- mantic prompt-driven image restoration. In: ECCV. pp. 446–464. Springer (2024) Abbreviated paper title 17

  38. [38]

    In: ICML (2021),https://api

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021),https://api. semanticscholar.org/CorpusID:231591445

  39. [39]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024),https://arxiv.org/ abs/2408.00714

  40. [40]

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. CVPR pp. 10674–10685 (2021), https://api.semanticscholar.org/CorpusID:245335280

  41. [41]

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025),https://ar...

  42. [42]

    Multimedia Tools and Applications77, 9211–9231 (2018)

    Vonikakis, V., Kouskouridas, R., Gasteratos, A.: On the evaluation of illumina- tion compensation algorithms. Multimedia Tools and Applications77, 9211–9231 (2018)

  43. [43]

    In: AAAI (2023)

    Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: AAAI (2023)

  44. [44]

    IEEE transactions on image processing 22(9), 3538–3548 (2013)

    Wang, S., Zheng, J., Hu, H.M., Li, B.: Naturalness preserved enhancement algo- rithm for non-uniform illumination images. IEEE transactions on image processing 22(9), 3538–3548 (2013)

  45. [45]

    In: BMVC (2018)

    Wei, C., Wang, W., Yang, W., Liu, J.: Deep retinex decomposition for low-light enhancement. In: BMVC (2018)

  46. [46]

    IEEE TPAMI44(11), 8520–8537 (2021)

    Wei, K., Fu, Y., Zheng, Y., Yang, J.: Physics-based noise modeling for extreme low-light photography. IEEE TPAMI44(11), 8520–8537 (2021)

  47. [47]

    In: CVPR (2022)

    Xu, X., Wang, R., Fu, C.W., Jia, J.: Snr-aware low-light image enhancement. In: CVPR (2022)

  48. [48]

    ACM TOMM21(11) (Nov 2025).https://doi.org/10

    Xue, M., He, J., Wang, W., Zhou, M.: Low-light image enhancement via clip-fourier guided wavelet diffusion. ACM TOMM21(11) (Nov 2025).https://doi.org/10. 1145/3764933,https://doi.org/10.1145/3764933

  49. [49]

    Yan, Q., Feng, Y., Zhang, C., Pang, G., Shi, K., Wu, P., Dong, W., Sun, J., Zhang, Y.:Hvi:Anewcolorspaceforlow-lightimageenhancement.In:IEEE/CVFCVPR. pp. 5678–5687 (2025).https://doi.org/10.1109/CVPR52734.2025.00533

  50. [50]

    IEEE Transactions on Image Processing30, 2072–2086 (2021)

    Yang, W., Wang, W., Huang, H., Wang, S., Liu, J.: Sparse gradient regularized deep retinex network for robust low-light image enhancement. IEEE Transactions on Image Processing30, 2072–2086 (2021)

  51. [51]

    Yu, Y., Xiong, W., Nie, W., Sheng, Y., Liu, S., Luo, J.: Pixeldit: Pixel diffusion transformers for image generation (2025),https://arxiv.org/abs/2511.20645

  52. [52]

    In: IEEE/CVF CVPR (2023)

    Yuhui, W., Chen, P., Guoqing, W., Yang, Y., Jiwei, W., Chongyi, L., Shen, H.T.: Learning semantic-aware knowledge guidance for low-light image enhancement. In: IEEE/CVF CVPR (2023)

  53. [53]

    In: CVPR (2022)

    Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: CVPR (2022)

  54. [54]

    IEEE/CVF CVPR pp

    Zhang, Q., Liu, X., Li, W., Chen, H., Liu, J., Hu, J., Xiong, Z., Yuan, C., Wang, Y.: Distilling semantic priors from sam to efficient image restoration models. IEEE/CVF CVPR pp. 25409–25419 (2024),https://api.semanticscholar.org/ CorpusID:268681086 18 Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

  55. [55]

    In: IEEE/CVF CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF CVPR (2018)

  56. [56]

    Diffusion Transformers with Representation Autoencoders

    Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025)

  57. [57]

    In: Proceedings of the IEEE/CVF Winter conference on applications of computer vision

    Zheng, S., Gupta, G.: Semantic-guided zero-shot learning for low-light image/video enhancement. In: Proceedings of the IEEE/CVF Winter conference on applications of computer vision. pp. 581–590 (2022)

  58. [58]

    In: ACM MM (2024),https: //openreview.net/forum?id=oQahsz6vWe

    Zou, W., Gao, H., Yang, W., Liu, T.: Wave-mamba: Wavelet state space model for ultra-high-definition low-light image enhancement. In: ACM MM (2024),https: //openreview.net/forum?id=oQahsz6vWe