MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction
Pith reviewed 2026-06-26 00:14 UTC · model grok-4.3
The pith
MambaRaw reconstructs 4K raw images from JPEG previews using selective state space models that avoid quadratic attention costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MambaRaw claims that Spatial-Energy Coupled Context Modeling, formed by TileMambaBlock selective scanning on dense tiles plus identity-initialized EAR refinement, estimates entropy parameters for raw signals more efficiently than attention while preserving or improving reconstruction accuracy, thereby setting a new state of the art for JPEG-guided raw image recovery at 4K resolution.
What carries the argument
The Spatial-Energy Coupled Context Modeling mechanism combines TileMambaBlock (Mamba-style selective scanning restricted to information-dense tiles) with Energy-Aware Refinement (EAR, an identity-initialized residual module) to produce entropy parameters without the quadratic scaling of attention.
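To make "scanning only information-dense tiles" concrete, here is a minimal sketch of one way such a selection step could look. The density score (per-tile variance), the `keep_ratio` parameter, and both function names are assumptions made for illustration; the text above does not say how MambaRaw scores or selects tiles.

```python
from statistics import pvariance


def split_into_tiles(image, tile):
    """Yield ((tile_row, tile_col), flat_pixels) for non-overlapping tile x tile blocks."""
    height, width = len(image), len(image[0])
    for r in range(0, height - tile + 1, tile):
        for c in range(0, width - tile + 1, tile):
            pixels = [image[r + dr][c + dc] for dr in range(tile) for dc in range(tile)]
            yield (r // tile, c // tile), pixels


def select_dense_tiles(image, tile, keep_ratio):
    """Return indices of the highest-variance tiles.

    ASSUMPTION: variance stands in for "information density"; the paper's
    actual criterion is not given in this summary.
    """
    scored = [(pvariance(px), idx) for idx, px in split_into_tiles(image, tile)]
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, round(len(scored) * keep_ratio))
    return sorted(idx for _, idx in scored[:keep])


# Toy 4x4 image: flat everywhere except a textured top-right tile.
toy = [
    [0, 0, 0, 9],
    [0, 0, 9, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
selected = select_dense_tiles(toy, tile=2, keep_ratio=0.25)
print(selected)
```

In a design like this, only the selected tiles would be passed to the sequence model, so the scan cost tracks the number of kept tiles rather than the full image area.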
If this is right
- PSNR increases by 1.2-1.4 dB over strong metadata baselines at low bitrates.
- End-to-end coding latency drops by approximately 9 percent.
- The approach remains computationally feasible at 4K resolution where attention scales poorly.
- New state-of-the-art results hold across the three tested camera datasets.
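For a sense of scale, the PSNR figure can be converted into a mean-squared-error ratio using the standard definition PSNR = 10·log10(peak² / MSE). The short sketch below does that arithmetic; the function names are illustrative, and the conversion assumes both methods are measured against the same peak value.

```python
import math


def psnr(mse, peak=1.0):
    """Standard PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE of the improved method divided by MSE of the baseline."""
    return 10.0 ** (-gain_db / 10.0)


low = mse_ratio_for_gain(1.2)
high = mse_ratio_for_gain(1.4)
print(f"1.2 dB gain -> MSE ratio {low:.3f}")
print(f"1.4 dB gain -> MSE ratio {high:.3f}")
```

Under that definition, a 1.2 to 1.4 dB gain corresponds to roughly 24% to 28% lower mean squared error than the baseline.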
Where Pith is reading between the lines
- The selective-scanning design could extend to other high-resolution entropy-modeling tasks such as video compression.
- Further gains might appear if the tile-selection heuristic inside TileMambaBlock is learned rather than fixed.
- Mobile or embedded pipelines could adopt the method once the three-brand generalization is confirmed on additional sensors.
Load-bearing premise
The TileMambaBlock and EAR modules can accurately model raw-signal entropy across diverse cameras without the quadratic cost of attention.
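The cost gap behind this premise can be illustrated with simple counting. The sketch below assumes a 3840×2160 frame and a feature map downsampled by a factor of 16; both numbers are illustrative assumptions, not values taken from the paper. It compares the number of pairwise interactions in full self-attention with the number of steps in a single linear scan.

```python
def token_count(width, height, stride):
    """Number of feature-map positions after downsampling by `stride`."""
    return (width // stride) * (height // stride)


def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens


def scan_steps(tokens):
    """A single sequential scan visits each token once."""
    return tokens


# ASSUMED setting: UHD frame, stride-16 feature map.
tokens = token_count(3840, 2160, stride=16)
print(tokens, attention_pairs(tokens), scan_steps(tokens))
```

In this assumed setting the feature map has 32,400 positions, so full attention touches about 1.05 billion token pairs while one scan touches 32,400 positions.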
What would settle it
Running the method on a fourth unseen camera brand at 4K resolution and measuring whether the reported PSNR gain and latency reduction both disappear.
Abstract
In-camera JPEG previews are ubiquitous in raw image formats and provide an sRGB reference at negligible storage cost. Although existing metadata-based reconstruction frameworks can exploit this side information when recovering raw images, their context models often become computationally expensive, especially at high resolution (e.g., 4K raw images), because attention mechanisms scale quadratically with feature-map size, which hinders their practical application. To address these limitations, we propose MambaRaw, a JPEG-conditioned metadata-based raw image reconstruction framework that uses State Space Models (SSMs) to estimate entropy parameters efficiently. Our key contribution comprises a Spatial-Energy Coupled Context Modeling mechanism with two lightweight modules: (1) TileMambaBlock, which performs Mamba-style selective scanning only on information-dense tiles to improve efficiency; and (2) Energy-Aware Refinement (EAR), an identity-initialized residual module that enhances feature representation to match the long-tail energy distribution of raw signals. Extensive experiments on three camera datasets (Sony, Olympus, Samsung) show consistent improvements over strong metadata-based baselines and set a new state of the art for JPEG-guided raw reconstruction with great efficiency. Notably, at low metadata bitrates, MambaRaw increases PSNR by 1.2–1.4 dB and reduces end-to-end coding latency by about 9%. Code is released at https://github.com/Peizeli1/MambaRaw.
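The phrase "identity-initialized residual module" can be illustrated with a small sketch. One common way to obtain this property is to zero-initialize the last layer of the residual branch, so that the module returns its input unchanged before training. The class below follows that pattern in plain Python; its name, layer sizes, and ReLU branch are hypothetical and are not taken from the MambaRaw code.

```python
import random


class IdentityInitResidual:
    """y = x + W_out * relu(W_in * x), with W_out starting at zero.

    ASSUMPTION: zero-initializing the output projection is one generic way
    to get identity behavior at initialization; EAR's internals may differ.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w_in = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.w_out = [[0.0] * hidden for _ in range(dim)]  # zero init

    def __call__(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_in]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]
        return [v + d for v, d in zip(x, delta)]


block = IdentityInitResidual(dim=3, hidden=4)
features = [0.5, -1.0, 2.0]
print(block(features) == features)
```

Starting from the identity means that adding such a module to an existing context model does not change its output at the first training step; the refinement is learned as a correction on top.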
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MambaRaw, a JPEG-conditioned framework for 4K raw image reconstruction that replaces attention-based entropy modeling with selective state space models. The core contribution is a Spatial-Energy Coupled Context Modeling mechanism comprising the TileMambaBlock (selective scanning restricted to information-dense tiles) and the identity-initialized Energy-Aware Refinement (EAR) module. Experiments on Sony, Olympus, and Samsung datasets report consistent PSNR gains of 1.2–1.4 dB at low metadata bitrates together with an approximately 9 % reduction in end-to-end latency, establishing a new state of the art while releasing code.
Significance. If the reported efficiency and accuracy advantages prove robust, the work supplies a practical linear-complexity alternative to quadratic attention for high-resolution metadata-guided reconstruction, with direct relevance to in-camera pipelines. The public code release supports reproducibility and is a clear strength.
major comments (1)
- [Experiments on three camera datasets] Experiments (three camera datasets): all quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB / 9 % gains.
minor comments (1)
- [Abstract] Abstract: the latency reduction is given as “about 9 %”; reporting the exact measured value together with any standard deviation or number of runs would improve precision.
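The kind of reporting requested here could be produced with a few lines of timing code. The sketch below is a generic measurement harness, not the authors' benchmark: `time_runs`, `summarize`, and `relative_reduction` are hypothetical helper names, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def time_runs(fn, runs=10, warmup=2):
    """Call `fn` repeatedly and return per-run wall-clock times in seconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Number of runs, mean, and sample standard deviation."""
    return {
        "runs": len(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
    }


def relative_reduction(baseline_mean, method_mean):
    """Latency reduction of the method relative to the baseline, in percent."""
    return 100.0 * (baseline_mean - method_mean) / baseline_mean


stats = summarize(time_runs(lambda: sum(range(10_000)), runs=5, warmup=1))
print(stats)
```

Reporting `runs`, `mean`, and `stdev` for both the baseline and the proposed method, together with the resulting relative reduction, would replace "about 9 %" with a figure that readers can check.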
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on generalization. We address the single major comment below.
- Referee: Experiments (three camera datasets): all quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB / 9 % gains.
Authors: We agree that broader sensor diversity would strengthen generalization claims. The Sony, Olympus, and Samsung datasets are the standard benchmarks in prior raw reconstruction literature and cover multiple manufacturers, even though they share the Bayer CFA. Our experiments follow this established protocol to enable direct comparison. Cross-sensor transfer tests were not included because raw reconstruction is typically sensor-specific due to differing noise profiles and ISPs. In the revision we will add a limitations paragraph discussing this scope and noting it as future work. The consistent gains across the three datasets still support the effectiveness of the proposed modules for the evaluated setting.
Revision: partial
- Additional experiments on further camera brands or cross-sensor transfer, which would require new raw datasets outside the current work's scope.
Circularity Check
No circularity: empirical architecture evaluation on external datasets
The paper introduces MambaRaw, a new SSM-based architecture with TileMambaBlock and identity-initialized EAR modules for JPEG-guided raw reconstruction. All reported gains (PSNR, latency) are obtained by training and testing on three external camera datasets (Sony, Olympus, Samsung). No equations, derivations, or predictions are presented that reduce to fitted inputs by construction, nor are there load-bearing self-citations or uniqueness theorems. The work is self-contained against external benchmarks, consistent with the default expectation of no significant circularity.