arxiv: 2604.02787 · v1 · submitted 2026-04-03 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers

Shreshth Saini , Hakan Gedik , Neil Birkbeck , Yilin Wang , Balu Adsumilli , Alan C. Bovik

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords SDR-to-HDR conversiondiffusion transformersinverse tone mappingphysically-guided adaptationperceptual modulationHDR reconstructionimage tone expansion

0 comments

The pith

LumaFlux adapts a pretrained diffusion transformer with physical luminance injection and perceptual modulation to convert 8-bit SDR images into accurate 10-bit HDR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LumaFlux as a diffusion transformer that lifts SDR content to HDR by adapting a large pretrained model rather than training from scratch. It adds three targeted components: a Physically-Guided Adaptation module that feeds luminance, spatial, and frequency information into attention layers through low-rank residuals; a Perceptual Cross-Modulation layer that conditions chroma and texture on vision-encoder features; and an HDR Residual Coupler that blends signals under timestep-adaptive schedules. A final Rational-Quadratic Spline decoder produces smooth tone curves for highlight expansion. The authors also release a large curated SDR-HDR training set and an expert-graded evaluation benchmark. If the approach holds, existing 8-bit libraries could be upgraded to match modern HDR displays without manual tone-mapping adjustments.

Core claim

LumaFlux is a first physically and perceptually guided diffusion transformer for SDR-to-HDR reconstruction obtained by adapting a large pretrained DiT. The model injects luminance, spatial descriptors, and frequency cues into attention via low-rank residuals in the Physically-Guided Adaptation module, stabilizes chroma and texture through FiLM-conditioned Perceptual Cross-Modulation, and fuses signals with a timestep- and layer-adaptive HDR Residual Coupler. A lightweight Rational-Quadratic Spline decoder then reconstructs interpretable tone fields that enhance the VAE output. Supported by a newly curated large-scale SDR-HDR corpus and an expert-graded benchmark, the method reports superior

What carries the argument

Physically-Guided Adaptation (PGA) module that injects luminance and frequency cues into DiT attention via low-rank residuals, combined with Perceptual Cross-Modulation (PCM) via FiLM conditioning and an HDR Residual Coupler under adaptive modulation, plus a Rational-Quadratic Spline decoder for tone fields.

If this is right

Produces higher luminance reconstruction accuracy than prior inverse tone-mapping methods across the established benchmarks.
Maintains better perceptual color fidelity with only minimal added parameters.
Enables more stable highlight and exposure expansion through the spline-based tone decoder.
Benefits from the new large-scale SDR-HDR corpus for training robustness on varied content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular guidance pattern could be tested on video sequences if frame-to-frame consistency losses are added.
Similar low-rank cue injection might improve other diffusion-based enhancement tasks such as denoising or low-light recovery.
Deployment on consumer devices would still require separate validation on mobile camera pipelines not covered in the expert benchmark.

Load-bearing premise

The proposed Physically-Guided Adaptation, Perceptual Cross-Modulation, and HDR Residual Coupler modules together with the new training corpus will generalize to real-world SDR degradations, stylistic variations, and camera pipelines beyond the curated expert-graded benchmark.

What would settle it

Test LumaFlux on SDR images captured by camera models and processing pipelines absent from the new benchmark and measure whether luminance reconstruction error and color fidelity drop below the reported baseline margins.

read the original abstract

The rapid adoption of HDR-capable devices has created a pressing need to convert the 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, a first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction by adapting a large pretrained DiT. Our LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LumaFlux adds PGA, PCM, and a residual coupler to adapt DiTs for SDR-to-HDR plus a new corpus, but the abstract gives no metrics so the superiority claim stays unproven.

read the letter

The main takeaway is that this paper adapts a pretrained diffusion transformer for inverse tone mapping by adding three specific modules—Physically-Guided Adaptation that feeds luminance and frequency info through low-rank residuals, Perceptual Cross-Modulation that applies FiLM conditioning from a vision encoder to keep chroma stable, and an HDR Residual Coupler that blends signals with timestep-aware modulation—plus a Rational-Quadratic Spline decoder and a newly curated large-scale SDR-HDR dataset with an expert-graded benchmark. That combination is the concrete new piece; it tries to steer the diffusion process with physical and perceptual priors instead of generic fine-tuning or fixed operators. The dataset effort alone is useful for the field because prior ITM work has been limited by small or synthetic pairs. The modules look like a reasonable way to inject domain knowledge without blowing up parameter count. The soft spot is exactly what the abstract leaves out: no quantitative results, no ablation tables, no error bars, and no clear description of how baselines were run or what data was held out. Without those, it is impossible to tell whether the claimed gains in luminance and color fidelity come from the new components or from better data matching. The generalization worry also lands—the curated expert-graded SDRs may not capture typical compression artifacts, sensor variations, or stylistic tone curves, so any advantage could shrink on real-world inputs. This is aimed at researchers working on computational photography and HDR pipelines who want to see how diffusion models can be specialized with low-rank and conditioning tricks. If the full paper supplies the missing experiments and releases the corpus, it is worth a serious referee because the architectural choices are specific enough to test and build on even if the final gains are modest.

Referee Report

3 major / 2 minor

Summary. The paper introduces LumaFlux, a physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction. It adapts a pretrained DiT with three new modules—Physically-Guided Adaptation (PGA) using low-rank residuals for luminance/spatial/frequency cues, Perceptual Cross-Modulation (PCM) via FiLM conditioning for chroma/texture stability, and an HDR Residual Coupler with timestep/layer-adaptive modulation—plus a Rational-Quadratic Spline decoder. The work also curates a large-scale SDR-HDR training corpus and establishes a new expert-graded evaluation benchmark, claiming superior luminance reconstruction and perceptual color fidelity over state-of-the-art baselines with minimal added parameters.

Significance. If the performance claims hold under rigorous evaluation, the work would represent a meaningful advance in inverse tone-mapping by demonstrating that targeted physical/perceptual priors can be injected into large diffusion transformers to improve generalization beyond fixed operators. The introduction of a new large-scale corpus and reproducible benchmark is a concrete contribution that could support future research, even if the architectural innovations require further validation.

major comments (3)

[Abstract and §4] Abstract and §4 (Evaluation): The central claim that LumaFlux 'outperforms state-of-the-art baselines' in luminance reconstruction and perceptual color fidelity is unsupported by any quantitative metrics, tables, ablation studies, error bars, baseline implementation details, or data exclusion criteria. Without these, the headline result cannot be assessed and the generalization argument remains untestable.
[§3.2–3.4] §3.2–3.4 (PGA, PCM, HDR Residual Coupler): The description of how low-rank residuals, FiLM conditioning, and timestep-adaptive coupling inject physical/perceptual priors is high-level; no equations or pseudocode show the exact modulation schedule or how these modules interact with the DiT attention layers. This makes it impossible to verify whether the claimed parameter efficiency and stability arise from the proposed mechanisms or from dataset matching.
[§5] §5 (Generalization discussion): The paper asserts robustness to real-world degradations, stylistic variations, and camera pipelines, yet provides no out-of-distribution test set (e.g., consumer camera SDR with compression artifacts or non-expert tone curves). The skeptic concern that performance may be driven by corpus match rather than the architectural priors is therefore unaddressed.

minor comments (2)

[Abstract and §3] The abstract and method sections use several new acronyms (PGA, PCM) without an initial glossary; a short table of module names and their roles would improve readability.
[§4 and figures] Figure captions and benchmark description should explicitly state the number of images, resolution, and exact expert-grading protocol to enable reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the empirical validation, technical clarity, and generalization claims. We address each major comment below and commit to revisions that improve verifiability without altering the core contributions.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Evaluation): The central claim that LumaFlux 'outperforms state-of-the-art baselines' in luminance reconstruction and perceptual color fidelity is unsupported by any quantitative metrics, tables, ablation studies, error bars, baseline implementation details, or data exclusion criteria. Without these, the headline result cannot be assessed and the generalization argument remains untestable.

Authors: We agree that explicit quantitative support must be prominent. The full manuscript contains comparative results in §4, but to directly address this concern we will add comprehensive tables reporting PSNR, SSIM, LPIPS, and HDR-specific metrics (e.g., luminance error, color fidelity) with error bars from multiple seeds, full ablation studies on each module, baseline re-implementation details (including training protocols and data exclusion criteria), and statistical significance tests. These additions will make the performance claims fully verifiable. revision: yes
Referee: [§3.2–3.4] §3.2–3.4 (PGA, PCM, HDR Residual Coupler): The description of how low-rank residuals, FiLM conditioning, and timestep-adaptive coupling inject physical/perceptual priors is high-level; no equations or pseudocode show the exact modulation schedule or how these modules interact with the DiT attention layers. This makes it impossible to verify whether the claimed parameter efficiency and stability arise from the proposed mechanisms or from dataset matching.

Authors: We will revise §3.2–3.4 to include the precise mathematical definitions: the low-rank residual formulation for PGA (luminance/spatial/frequency injection), the FiLM conditioning equations for PCM, and the timestep/layer-adaptive modulation schedule for the HDR Residual Coupler. We will also add pseudocode illustrating the integration with DiT attention layers, parameter counts, and forward-pass interactions to demonstrate that efficiency and stability derive from the proposed mechanisms. revision: yes
Referee: [§5] §5 (Generalization discussion): The paper asserts robustness to real-world degradations, stylistic variations, and camera pipelines, yet provides no out-of-distribution test set (e.g., consumer camera SDR with compression artifacts or non-expert tone curves). The skeptic concern that performance may be driven by corpus match rather than the architectural priors is therefore unaddressed.

Authors: The curated benchmark already incorporates expert-graded SDR versions spanning stylistic and pipeline variations. To further mitigate the corpus-match concern, we will expand §5 with explicit out-of-distribution experiments on consumer-camera SDR (including compression artifacts and non-expert tone curves) using held-out data, reporting both quantitative metrics and qualitative examples to isolate the contribution of the architectural priors. revision: partial

Circularity Check

0 steps flagged

No circularity: novel modules and curated corpus provide independent content

full rationale

The paper's central claims rest on the introduction of three new architectural modules (Physically-Guided Adaptation, Perceptual Cross-Modulation, HDR Residual Coupler) plus a newly curated large-scale SDR-HDR training corpus and expert-graded evaluation benchmark. These elements are presented as independent contributions rather than reductions of prior results. No equations, parameter-fitting steps, or self-citations are shown that would make any prediction equivalent to its inputs by construction. The derivation chain therefore remains self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities. The approach relies on a pretrained DiT backbone whose adaptation parameters are learned from the new corpus; any low-rank dimensions or modulation schedules are implicit training choices.

pith-pipeline@v0.9.0 · 5619 in / 1227 out tokens · 49104 ms · 2026-05-13T19:48:19.228565+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Physically-Guided Adaptation (PGA) module that injects luminance... through low-rank residuals; Perceptual Cross-Modulation (PCM) layer... FiLM conditioning; HDR Residual Coupler... Rational-Quadratic Spline decoder

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generating HDR Video from SDR Video
cs.CV 2026-05 unverdicted novelty 7.0

A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

[1]

Xiangyu Chen, Zhengwen Zhang, Jimmy S. J. Ren, Lynhoo Tian, Y. Qiao, and Chao Dong. A new journey from sdrtv to hdrtv.2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4480–4489,

work page 2021
[2]

ffmpeg tool

FFmpeg Developers. ffmpeg tool. [Software], 2016.http://ffmpeg.org/. Version be1d324. Cheng Guo, Leidong Fan, Ziyu Xue, and Xiuhua Jiang. Learning a practical sdr-to-hdrtv up-conversion using new dataset and degradation models. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22231–22241,

work page 2016
[3]

Assran, Q

doi: 10.1109/CVPR52729.2023.02129. Gang He, Kepeng Xu, Li Xu, Chang Wu, Ming Sun, Xing Wen, and Yu-Wing Tai. Sdrtv-to-hdrtv via hierarchical dynamic context feature mapping. InProceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 2890–2898, New York, NY, USA,

work page doi:10.1109/cvpr52729.2023.02129 2023
[4]

ISBN 9781450392037

Association for Computing Machinery. ISBN 9781450392037. doi: 10.1145/3503161.3548043.https://doi.org/10.1145/3503161.3548043. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851,

work page doi:10.1145/3503161.3548043.https://doi.org/10.1145/3503161.3548043
[5]

ISBN 9798400701085

Association for Computing Machinery. ISBN 9798400701085. doi: 10.1145/3581783.3612199. https://doi.org/10.1145/3581783.3612199. International Telecommunication Union. Bt.2020-2: Parameter values for ultra-high definition television systems for production and international programme exchange. Technical Report BT.2020-2, International Telecommunication Unio...

work page doi:10.1145/3581783.3612199 2020
[6]

Soo Ye Kim, Jihyong Oh, and Munchurl Kim

https://www.itu.int/. Soo Ye Kim, Jihyong Oh, and Munchurl Kim. Jsi-gan: Gan-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for uhd hdr video. InAAAI Conference on Artificial Intelligence, 2019a. https://api.semanticscholar.org/CorpusID:202542696. Soo Ye Kim, Jihyong Oh, and Munchurl Kim. Deep sr-itm: Joint lea...

work page 2019
[7]

2304.13625 , archivePrefix=

Rafal K Mantiuk, Dounia Hammou, and Param Hanji. Hdr-vdp-3: A multi-metric for predicting image differences, quality and contrast distortions in high dynamic range and regular content.arXiv preprint arXiv:2304.13625,

work page arXiv
[8]

Tiktok by the numbers

Omnicore. Tiktok by the numbers. [Online], 2024.https://www.omnicoreagency.com/tiktok-statistics/. Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. InProceedings of the AAAI conference on artificial intelligence, volume 32,

work page 2024
[9]

Chug: Crowdsourced user-generated hdr video quality dataset

Shreshth Saini, Alan C Bovik, Neil Birkbeck, Yilin Wang, and Balu Adsumilli. Chug: Crowdsourced user-generated hdr video quality dataset. In2025 IEEE International Conference on Image Processing (ICIP), pages 2504–2509. IEEE, 2025a. 14 Shreshth Saini, Shashank Gupta, and Alan C Bovik. Rectified-cfg++ for flow based models.arXiv preprint arXiv:2510.07631, ...

work page arXiv 2084
[10]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

doi: 10.1109/TIP.2003.819861. Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986,

work page doi:10.1109/tip.2003.819861 2003
[11]

Dynamic diffusion transformer.arXiv preprint arXiv:2410.03456,

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, and Yang You. Dynamic diffusion transformer.arXiv preprint arXiv:2410.03456,

work page arXiv
[12]

Flashvsr: Towards real-time diffusion-based streaming video super-resolution.arXiv preprint arXiv:2510.12747, 2025

Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, and Tianfan Xue. Flashvsr: Towards real-time diffusion-based streaming video super-resolution.arXiv preprint arXiv:2510.12747,

work page arXiv
[13]

Inverse tone mapping must invert the OETF, restore the BT.709→BT.2020 gamut, and expand tone while preserving chromatic relationships and avoiding banding

15 Appendix A Background The SDR formation model xsdr = Γ709 OETF M2020→709 xhdr Lmax +ϵ compresses both dynamic range and color gamut. Inverse tone mapping must invert the OETF, restore the BT.709→BT.2020 gamut, and expand tone while preserving chromatic relationships and avoiding banding. PQ luminance is linearized asL= (ΓPQ EOTF)−1(y)·L max. Perceptual...

work page 2020
[14]

introduce hierarchical feature modulation and dynamic context transformation. However, these convolutional networks are limited by local receptive fields and struggle to generalize to unseen degradations, especially compression noise and mixed tone curves commonly found in user-generated content (UGC). They also operate in fixed color spaces, often Rec.70...

work page 2020
[15]

exploit self-attention for long-range correlation, but they remain fully supervised and use exposure fusion instead of direct HDR prediction. In addition, most prior datasets (e.g., HDRTV1K (Chen et al., 2021), HDRTV4K (Guo et al., 2023)) are either synthetic or tone-mapped from HDR masters using fixed operators, leading to narrow domain diversity. Diffus...

work page 2021
[16]

However, these datasets lack the diversity and real-capture noise found in consumer SDR videos

and HDRTV4K (Guo et al., 2023), each providing paired SDR–HDR frames derived via controlled tone mapping. However, these datasets lack the diversity and real-capture noise found in consumer SDR videos. The LIVE-TMHDR dataset (Venkataramanan and Bovik, 2024), originally developed for HDR tone-mapping quality assessment, contains 40 HDR source videos and 10...

work page 2023
[17]

Expert Graded SDR

and CHUG (Saini et al., 2025a) to form the first large-scale SDR–HDR corpus, unified under PQ-encoded BT.2020 representation. We also propose a new evaluation benchmark, LumaEval, establishing a high-fidelity reference for fair, perceptual HDR reconstruction assessment. C Diffusion Priors C.1 Relative-structure bias in diffusion transformers Let x∈R H×W×3...

work page 2020