MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction

Fanhu Zeng; Haotian Zhang; Peize Li; Tongda Xu; Xingguo Xu; Xingtong Ge; Xinjie Zhang; Yan Wang

arxiv: 2606.24479 · v1 · pith:XW4OF6VZnew · submitted 2026-06-23 · 💻 cs.CV

Peize Li, Fanhu Zeng, Tongda Xu, Xingguo Xu, Xinjie Zhang, Xingtong Ge, Haotian Zhang, Yan Wang

Pith reviewed 2026-06-26 00:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords raw image reconstruction · state space models · Mamba · JPEG-guided reconstruction · entropy parameter estimation · 4K imaging · image compression

The pith

MambaRaw reconstructs 4K raw images from JPEG previews using selective state space models that avoid quadratic attention costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MambaRaw as a JPEG-conditioned framework for recovering raw images that replaces attention-based entropy modeling with state space models. It adds a Spatial-Energy Coupled Context Modeling step built from TileMambaBlock, which scans only information-dense tiles, and an identity-initialized Energy-Aware Refinement module to match the long-tail energy distribution of raw signals. Experiments on Sony, Olympus, and Samsung camera data show consistent PSNR gains of 1.2-1.4 dB at low metadata bitrates together with roughly 9 percent lower end-to-end latency. A reader would care because the method makes high-resolution raw reconstruction practical at the scale where attention becomes prohibitive.

Core claim

MambaRaw claims that Spatial-Energy Coupled Context Modeling, formed by TileMambaBlock selective scanning on dense tiles plus identity-initialized EAR refinement, estimates entropy parameters for raw signals more efficiently than attention while preserving or improving reconstruction accuracy, thereby setting a new state of the art for JPEG-guided raw image recovery at 4K resolution.
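One way to picture the "identity-initialized" part of this claim: a residual branch whose output projection starts at zero returns its input unchanged at initialization, so inserting the module cannot alter the base model's behaviour before training. The sketch below only illustrates that property in plain Python. The class name `ResidualRefiner`, the two-layer shape, and the ReLU are assumptions made for illustration; they are not the paper's EAR design, which this page does not specify.

```python
import random


class ResidualRefiner:
    """Hypothetical residual block: y = x + W2 * relu(W1 * x), W2 starts at zero."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Input projection gets small random weights.
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        # Output projection is all zeros, so the residual branch adds nothing yet.
        self.w2 = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        delta = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        return [v + d for v, d in zip(x, delta)]


refiner = ResidualRefiner(dim=3, hidden=4)
features = [0.5, -1.0, 2.0]
print(refiner(features) == features)  # the block is an identity map at init
```

Training would then move `w2` away from zero only as far as the loss rewards it, which is the usual motivation for this kind of initialization.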

What carries the argument

Spatial-Energy Coupled Context Modeling mechanism that combines TileMambaBlock (Mamba-style selective scanning restricted to information-dense tiles) and Energy-Aware Refinement (identity-initialized residual module) to produce entropy parameters without quadratic attention scaling.
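The tile-restriction idea can be sketched as a top-k selection over per-tile energy. Figure 2's caption on this page mentions an L2 energy map and a keep ratio ρ; everything else here (the function name `select_dense_tiles`, summing energy per tile, the tie-breaking, the toy map) is an assumption for illustration and may differ from how TileMambaBlock actually picks tiles.

```python
def select_dense_tiles(energy, tile, keep_ratio=0.5):
    """Return the set of (row, col) tile indices with the highest summed energy.

    energy: 2-D list of per-pixel energies (e.g. squared L2 norm over channels).
    tile: tile side length; the map height and width must be multiples of it.
    """
    h, w = len(energy), len(energy[0])
    if h % tile or w % tile:
        raise ValueError("map size must be a multiple of the tile size")
    scores = {}
    for ty in range(h // tile):
        for tx in range(w // tile):
            scores[(ty, tx)] = sum(
                energy[y][x]
                for y in range(ty * tile, (ty + 1) * tile)
                for x in range(tx * tile, (tx + 1) * tile)
            )
    # Keep the top `keep_ratio` fraction of tiles, and at least one tile.
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Toy 4x4 energy map split into four 2x2 tiles.
toy = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 3, 3],
]
print(sorted(select_dense_tiles(toy, tile=2, keep_ratio=0.5)))
```

Only the returned tiles would then be passed to the sequence model; the remaining tiles skip that step, which is where the compute saving would come from.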

If this is right

PSNR increases by 1.2-1.4 dB over strong metadata baselines at low bitrates.
End-to-end coding latency drops by approximately 9 percent.
The approach remains computationally feasible at 4K resolution where attention scales poorly.
New state-of-the-art results hold across the three tested camera datasets.
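To make the scaling point concrete, the back-of-the-envelope count below compares pairwise attention work with a single linear scan. It assumes "4K" means 3840×2160 pixels and that the context model sees features downsampled by a stride of 16; both numbers are assumptions, since this page does not give the paper's latent resolution. The function name `context_cost` is hypothetical.

```python
def context_cost(height, width, stride=16, keep_ratio=0.5):
    """Count tokens and the pairwise vs. linear work they imply."""
    tokens = (height // stride) * (width // stride)
    kept = int(tokens * keep_ratio)
    return {
        "tokens": tokens,
        "attention_pairs": tokens * tokens,  # full self-attention: O(N^2)
        "scan_steps": tokens,                # one recurrent pass: O(N)
        "scan_steps_on_kept_tiles": kept,    # scan restricted to kept tokens
    }


for name, value in context_cost(2160, 3840).items():
    print(f"{name}: {value:,}")
```

Under these assumptions the feature map has 32,400 tokens, so full attention touches about 1.05 billion token pairs while a scan visits each token once. These are operation counts, not measured latency.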

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selective-scanning design could extend to other high-resolution entropy-modeling tasks such as video compression.
Further gains might appear if the tile-selection heuristic inside TileMambaBlock is learned rather than fixed.
Mobile or embedded pipelines could adopt the method once the three-brand generalization is confirmed on additional sensors.

Load-bearing premise

The TileMambaBlock and EAR modules can accurately model raw-signal entropy across diverse cameras without the quadratic cost of attention.

What would settle it

Running the method on a fourth unseen camera brand at 4K resolution and measuring whether the reported PSNR gain and latency reduction both disappear.
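For anyone running such a check, PSNR is 10·log10(peak² / MSE), so a gain in dB maps directly to a ratio of mean squared errors. The helper below is a minimal sketch in plain Python; the function names and the flat-list image representation are assumptions for illustration, and the paper's exact evaluation protocol (peak value, channel handling) is not stated on this page.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """PSNR in dB between two equal-length flat lists of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """MSE(new) / MSE(old) implied by a PSNR gain in dB at a fixed peak."""
    return 10.0 ** (-gain_db / 10.0)


print(round(psnr([0.0, 0.0], [0.1, 0.1]), 2))
print(round(mse_ratio_for_gain(1.2), 3), round(mse_ratio_for_gain(1.4), 3))
```

By this relation, a 1.2 to 1.4 dB gain corresponds to roughly 24 to 28 percent lower mean squared error, which gives a sense of how large an effect would have to vanish on a new sensor.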

Figures

Figures reproduced from arXiv: 2606.24479 by Fanhu Zeng, Haotian Zhang, Peize Li, Tongda Xu, Xingguo Xu, Xingtong Ge, Xinjie Zhang, Yan Wang.

**Figure 1.** Motivation and Comparison. (a) Convolution-based methods have limited receptive fields, which restricts long-range spatial modeling. (b) MambaRaw uses a spatial–energy coupled context model. It applies TileMambaBlock for selective scanning on information-dense tiles and uses EAR for energy-guided refinement.

**Figure 2.** Energy analysis and tile selection. (a) Long-tail energy distribution motivating EAR. (b) Spatial L2 energy map showing selected high-energy tiles (cyan) at ρ = 0.5. (c) Impact of keep ratio ρ; ρ = 0.5 offers the optimal accuracy-speed trade-off. Here, $\tilde{F}$ is the JPEG-conditioned feature, $F_{\mathrm{in}}$ is the context input before SSM processing, $F_c$ is the TileMambaBlock output, and $F'$ is the final EAR output.

**Figure 3.** The Overall Framework of MambaRaw. We adopt a two-level VAE architecture conditioned on the available JPEG preview. The core innovation lies in the Level-1 Context Model, where we replace standard separate spatial/channel contexts with a coupled design: TileMambaBlock for efficient long-range spatial modeling on selected information-dense tiles, and EAR for lightweight energy-guided refinement.

**Figure 4.** RD curves over the NUS dataset (Samsung NX2000, Olympus E-PL6, Sony SLT-A57) following the setting of [29]. The left and right columns report PSNR and SSIM, respectively. For variable-rate models, a single model is trained for each curve and different operating points are obtained by changing the rate–distortion hyperparameter of the trained model.

**Figure 5.** Qualitative comparison on Sony SLT-A57. Error maps show the per-pixel maximum absolute error over the three channels (after gamma correction for visibility); darker indicates smaller error.

read the original abstract

In-camera JPEG previews are ubiquitous in raw image formats and provide an sRGB reference at negligible storage cost. Although existing metadata-based reconstruction frameworks can exploit this side information when recovering raw images, their context models often become computationally expensive at high resolution (e.g., 4K raw images) because attention mechanisms scale quadratically with feature-map size, which hinders their practical application. To address these limitations, we propose MambaRaw, a JPEG-conditioned metadata-based raw image reconstruction framework that uses State Space Models (SSMs) to estimate entropy parameters efficiently. Our key contribution is a Spatial-Energy Coupled Context Modeling mechanism with two lightweight modules: (1) TileMambaBlock, which performs Mamba-style selective scanning only on information-dense tiles to improve efficiency; and (2) Energy-Aware Refinement (EAR), an identity-initialized residual module that enhances feature representation to match the long-tail energy distribution of raw signals. Extensive experiments on three camera datasets (Sony, Olympus, Samsung) show consistent improvements over strong metadata-based baselines and set a new state of the art for JPEG-guided raw reconstruction with high efficiency. Notably, at low metadata bitrates, MambaRaw increases PSNR by 1.2–1.4 dB and reduces end-to-end coding latency by about 9%. Code is released at https://github.com/Peizeli1/MambaRaw.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MambaRaw swaps attention for selective Mamba scanning plus two lightweight modules and reports 1.2-1.4 dB PSNR gains with lower latency on three camera datasets.

MambaRaw adapts state space models to JPEG-conditioned raw reconstruction by scanning only dense tiles with TileMambaBlock and adding an identity-initialized Energy-Aware Refinement step. The result is a context model that avoids quadratic attention cost at 4K while estimating entropy parameters for the raw signal.
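For readers new to selective state space models, the recurrence behind "Mamba-style selective scanning" can be sketched with a single scalar state. This is a deliberately simplified illustration: the step size, input gate, and output gate all depend on the current input, which is what "selective" refers to, but the scalar parameters `a`, `w_delta`, `w_b`, and `w_c` are assumptions standing in for learned projections, and nothing here reflects the paper's actual TileMambaBlock.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar-state selective scan over a 1-D sequence.

    Per step: delta_t = softplus(w_delta * x_t)
              h_t = exp(delta_t * a) * h_{t-1} + delta_t * (w_b * x_t) * x_t
              y_t = (w_c * x_t) * h_t
    """
    h, ys = 0.0, []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps the step size > 0
        decay = math.exp(delta * a)                # in (0, 1) because a < 0
        h = decay * h + delta * (w_b * x) * x      # input-dependent state update
        ys.append((w_c * x) * h)                   # input-dependent readout
    return ys


print(selective_scan([1.0, 0.5, -0.5]))
```

The loop visits each element once and carries a fixed-size state, so cost grows linearly with sequence length. That property is what the letter contrasts with quadratic attention.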

The paper does the straightforward thing well: it identifies the compute wall in prior metadata-based methods and targets it with SSMs that scale better. The two new modules look like reasonable engineering choices for handling raw data's energy distribution without extra overhead. Gains hold across the three tested datasets, code is released, and the latency cut of about 9% is the kind of number that matters for device workflows.

The main limitation is narrow testing. All numbers come from Sony, Olympus, and Samsung sensors, which share similar CFA and noise traits. If the selective scanning and refinement overfit to those statistics, the efficiency edge and SOTA status will not automatically carry to other brands. The abstract also gives headline improvements without visible error bars or exhaustive ablations, so the exact contribution of each module is harder to pin down.

This is for researchers building efficient on-device or metadata pipelines in computational photography. A reader already working on alternatives to transformers in vision would get concrete value from the modules and the released implementation. The work is grounded enough and the task practical enough that it deserves a serious referee rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The manuscript presents MambaRaw, a JPEG-conditioned framework for 4K raw image reconstruction that replaces attention-based entropy modeling with selective state space models. The core contribution is a Spatial-Energy Coupled Context Modeling mechanism comprising the TileMambaBlock (selective scanning restricted to information-dense tiles) and the identity-initialized Energy-Aware Refinement (EAR) module. Experiments on Sony, Olympus, and Samsung datasets report consistent PSNR gains of 1.2–1.4 dB at low metadata bitrates together with an approximately 9% reduction in end-to-end latency, establishing a new state of the art while releasing code.

Significance. If the reported efficiency and accuracy advantages prove robust, the work supplies a practical linear-complexity alternative to quadratic attention for high-resolution metadata-guided reconstruction, with direct relevance to in-camera pipelines. The public code release supports reproducibility and is a clear strength.

major comments (1)

[Experiments on three camera datasets] All quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB and 9% gains.

minor comments (1)

[Abstract] The latency reduction is given as “about 9%”; reporting the exact measured value together with the standard deviation and the number of runs would improve precision.
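The kind of reporting asked for here can be sketched with the standard library: time each pipeline over several runs after a warm-up, then report mean, standard deviation, and run count alongside the relative reduction. The helper names `time_runs` and `relative_reduction`, the default run counts, and the stand-in workloads are all assumptions for illustration. GPU pipelines would additionally need device synchronization around the timer, which this sketch omits.

```python
import statistics
import time


def time_runs(fn, runs=20, warmup=3):
    """Return (mean seconds, sample std dev, number of timed runs) for fn()."""
    for _ in range(warmup):
        fn()  # discard warm-up calls so caches and lazy setup do not skew timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples), len(samples)


def relative_reduction(baseline_mean, candidate_mean):
    """Fractional latency reduction of the candidate relative to the baseline."""
    return (baseline_mean - candidate_mean) / baseline_mean


# Stand-in workloads; replace with the two end-to-end coding pipelines.
base_mean, base_std, n = time_runs(lambda: sum(range(200_000)), runs=5)
cand_mean, cand_std, _ = time_runs(lambda: sum(range(180_000)), runs=5)
print(f"baseline {base_mean:.6f} s ± {base_std:.6f} (n={n})")
print(f"reduction {relative_reduction(base_mean, cand_mean):.1%}")
```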

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on generalization. We address the single major comment below.

Referee: All quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB and 9% gains.

Authors: We agree that broader sensor diversity would strengthen generalization claims. The Sony, Olympus, and Samsung datasets are the standard benchmarks in prior raw reconstruction literature and cover multiple manufacturers, even though they share the Bayer CFA. Our experiments follow this established protocol to enable direct comparison. Cross-sensor transfer tests were not included because raw reconstruction is typically sensor-specific due to differing noise profiles and ISPs. In the revision we will add a limitations paragraph discussing this scope and noting it as future work. The consistent gains across the three datasets still support the effectiveness of the proposed modules for the evaluated setting. revision: partial

standing simulated objections not resolved

Additional experiments on further camera brands or cross-sensor transfer, which would require new raw datasets outside the current work's scope.

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluation on external datasets

The paper introduces MambaRaw, a new SSM-based architecture with TileMambaBlock and identity-initialized EAR modules for JPEG-guided raw reconstruction. All reported gains (PSNR, latency) are obtained by training and testing on three external camera datasets (Sony, Olympus, Samsung). No equations, derivations, or predictions are presented that reduce to fitted inputs by construction, nor are there load-bearing self-citations or uniqueness theorems. The work is self-contained against external benchmarks, consistent with the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, training details, or parameter counts provided to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5803 in / 1123 out tokens · 19383 ms · 2026-06-26T00:14:04.645636+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 10 canonical work pages · 6 internal anchors

[1] Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 (2016)
[2] Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 (2018)
[3] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011)
[4] Chen, H., Han, W., Zheng, H., Shen, J.: Rawmamba: Unified srgb-to-raw de-rendering with state space model. arXiv preprint arXiv:2411.11717 (2024)
[5] Chen, Y., Qin, H., Zhang, Z., Magno, M., Benini, L., Li, Y.: Q-mambair: Accurate quantized mamba for efficient image restoration. arXiv preprint arXiv:2503.21970 (2025)
[6] Chen, Y., Lyu, Z., He, B., Hu, H., Wang, Q., Tian, Y., Song, L., Zhang, W., Lu, G.: Cmic: Content-adaptive mamba for learned image compression. arXiv preprint arXiv:2508.02192 (2025)
[7] Cheng, D., Prasad, D.K., Brown, M.S.: Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. Journal of the Optical Society of America A 31(5), 1049–1058 (2014)
[8] Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7939–7948 (2020)
[9] Dao, T., Gu, A.: Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060 (2024)
[10] Gao, G., You, P., Pan, R., Han, S., Zhang, Y., Dai, Y., Lee, H.: Neural image compression via attentional multi-scale back projection and frequency decomposition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14677–14686 (2021)
[11] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First conference on language modeling (2024)
[12] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)
[13] Guo, H., Guo, Y., Zha, Y., Zhang, Y., Li, W., Dai, T., Xia, S.T., Li, Y.: Mambairv2: Attentive state space restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28124–28133 (2025)
[14] Guo, Z., Zhang, Z., Feng, R., Chen, Z.: Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology 32(4), 2329–2341 (2021)
[15] Hatamizadeh, A., Kautz, J.: Mambavision: A hybrid mamba-transformer vision backbone. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25261–25270 (2025)
[16] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5718–5727 (2022)
[17] He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14771–14780 (2021)
[18] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
[19] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C.: Localmamba: Visual state space model with windowed selective scan. In: European conference on computer vision. pp. 12–22. Springer (2024)
[20] Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., Qiao, Y.: Videomamba: State space model for efficient video understanding. In: European conference on computer vision. pp. 237–255. Springer (2024)
[21] Li, M., Ma, K., You, J., Zhang, D., Zuo, W.: Efficient and effective context-based convolutional entropy modeling for image compression. IEEE Transactions on Image Processing 29, 5900–5911 (2020)
[22] Liu, J., Sun, H., Katto, J.: Learned image compression with mixed transformer-cnn architectures. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14388–14397 (2023)
[23] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., Liu, Y.: Vmamba: Visual state space model. Advances in neural information processing systems 37, 103031–103063 (2024)
[24] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)
[25] Ma, C., Wang, Z., Liao, R., Ye, Y.: A cross channel context model for latents in deep image compression. arXiv preprint arXiv:2103.02884 (2021)
[26] Ma, J., Li, F., Wang, B.: U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024)
[27] Minnen, D., Ballé, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems 31 (2018)
[28] Minnen, D., Singh, S.: Channel-wise autoregressive entropy models for learned image compression. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 3339–3343. IEEE (2020)
[29] Nam, S., Punnappurath, A., Brubaker, M.A., Brown, M.S.: Learning srgb-to-raw-rgb de-rendering with content-aware metadata. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17704–17713 (2022)
[30] Patel, Y., Appalaraju, S., Manmatha, R.: Saliency driven perceptual image compression. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 227–236 (2021)
[31] Punnappurath, A., Brown, M.S.: Spatially aware metadata for raw reconstruction. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 218–226 (2021)
[32] Qin, S., Lu, Y., Zhou, Y., Li, J., Ren, Y., Xue, Y., Xia, S.T., Chen, B.: Freqsic: Frequency-aware stereo image compression with bi-directional checkerboard context model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19393–19402 (2026)
[33] Qin, S., Wang, J., Zhou, Y., Chen, B., Luo, T., An, B., Dai, T., Xia, S.T., Wang, Y.: Cassic: Towards content-adaptive state-space models for learned image compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15727–15736 (2025)
[34] Qin, S., Zhang, X., Liu, Z., Wang, J., Chen, B., Li, J., Ren, Y., Xia, S.T., Zhang, J.: Mambasic: Mamba-based stereo image compression with bi-directional multi-reference entropy model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5306–5315 (2026)
[35] Shi, Y., Xia, B., Jin, X., Wang, X., Zhao, T., Xia, X., Xiao, X., Yang, W.: Vmambair: Visual state space model for image restoration. IEEE Transactions on Circuits and Systems for Video Technology 35(6), 5560–5574 (2025)
[36] Tian, Y., Ling, X., Geng, C., Hu, Q., Lu, G., Zha, G.: Smc++: Masked learning of unsupervised video semantic compression. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
[37] Tian, Y., Lu, G., Min, X., Che, Z., Zhai, G., Guo, G., Gao, Z.: Self-conditioned probabilistic learning of video rescaling. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4490–4499 (2021)
[38] Tian, Y., Lu, G., Yan, Y., Zhai, G., Chen, L., Gao, Z.: A coding framework and benchmark towards low-bitrate video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5852–5872 (2024)
[39] Tian, Y., Lu, G., Zhai, G.: Free-vsc: Free semantics from visual foundation models for unsupervised video semantic compression. In: European Conference on Computer Vision. pp. 163–183. Springer (2024)
[40] Tian, Y., Lu, G., Zhai, G., Gao, Z.: Non-semantics suppressed mask learning for unsupervised video semantic compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13610–13622 (2023)
[41] Wallace, G.K.: The jpeg still picture compression standard. Communications of the ACM 34(4), 30–44 (1991)
[42] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Raw image reconstruction with learned compact metadata. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18206–18215 (2023)
[43] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Beyond learned metadata-based raw image reconstruction. International Journal of Computer Vision 132(12), 5514–5533 (2024)
[44] Warenkorb, L.R.: Information technology-high efficiency coding and media delivery in heterogeneous environments-part 3: 3d audio (2015)
[45] Wu, C., Wang, L., Zheng, Z., Cui, Y., Yang, Z., Chen, X., Zhang, Y., Jiang, W., Xia, J.: Scan clusters, not pixels: A cluster-centric paradigm for efficient ultra-high-definition image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15528–15537 (2026)
[46] Xing, Y., Qian, Z., Chen, Q.: Invertible image signal processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6287–6296 (2021)
[47] Zeng, F., Tang, H., Shao, Y., Chen, S., Shao, L., Wang, Y.: Mambaic: State space models for high-performance learned image compression. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18041–18050 (2025)
[48] Zhang, J., Nguyen, A.T., Han, X., Trinh, V.Q.H., Qin, H., Samaras, D., Hosseini, M.S.: 2dmamba: Efficient state space model for image representation with applications on giga-pixel whole slide image classification. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3583–3592 (2025)
[49] Zhou, Y., Zhou, P., Ng, T.K.: Efficient cascaded multiscale adaptive network for image restoration. In: European Conference on Computer Vision. pp. 92–110. Springer (2024)
[50] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)
[51] Zhu, Y., Yang, Y., Cohen, T.: Transformer-based transform coding. In: International conference on learning representations (2022)
[52] Zou, R., Song, C., Zhang, Z.: The devil is in the details: Window-based attention for image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17492–17501 (2022)

[1] [1]

arXiv preprint arXiv:1611.01704 (2016) 3

Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 (2016) 3

work page arXiv 2016

[2] [2]

Variational image compression with a scale hyperprior

Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 (2018) 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

In: CVPR 2011

Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011) 10

2011

[4] [4]

arXiv preprint arXiv:2411.11717 (2024) 4

Chen, H., Han, W., Zheng, H., Shen, J.: Rawmamba: Unified srgb-to-raw de- rendering with state space model. arXiv preprint arXiv:2411.11717 (2024) 4

work page arXiv 2024

[5] [5]

Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration

Chen, Y., Qin, H., Zhang, Z., Magno, M., Benini, L., Li, Y.: Q-mambair: Accurate quantized mamba for efficient image restoration. arXiv preprint arXiv:2503.21970 (2025) 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

arXiv preprint arXiv:2508.02192 (2025) 4

Chen, Y., Lyu, Z., He, B., Hu, H., Wang, Q., Tian, Y., Song, L., Zhang, W., Lu, G.: Cmic: Content-adaptive mamba for learned image compression. arXiv preprint arXiv:2508.02192 (2025) 4

work page arXiv 2025

[7] [7]

Journal of the Optical Society of America A31(5), 1049–1058 (2014) 10

Cheng, D., Prasad, D.K., Brown, M.S.: Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. Journal of the Optical Society of America A31(5), 1049–1058 (2014) 10

2014

[8] [8]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7939– 7948 (2020) 3 16 P. Liet al

2020

[9] [9]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao, T., Gu, A.: Transformers are ssms: Generalized models and efficient algo- rithms through structured state space duality. arXiv preprint arXiv:2405.21060 (2024) 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] Gao, G., You, P., Pan, R., Han, S., Zhang, Y., Dai, Y., Lee, H.: Neural image compression via attentional multi-scale back projection and frequency decomposition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14677–14686 (2021)

[11] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First conference on language modeling (2024)

[12] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

[13] Guo, H., Guo, Y., Zha, Y., Zhang, Y., Li, W., Dai, T., Xia, S.T., Li, Y.: Mambairv2: Attentive state space restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28124–28133 (2025)

[14] Guo, Z., Zhang, Z., Feng, R., Chen, Z.: Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology 32(4), 2329–2341 (2021)

[15] Hatamizadeh, A., Kautz, J.: Mambavision: A hybrid mamba-transformer vision backbone. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25261–25270 (2025)

[16] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5718–5727 (2022)

[17] He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14771–14780 (2021)

[18] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)

[19] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C.: Localmamba: Visual state space model with windowed selective scan. In: European conference on computer vision. pp. 12–22. Springer (2024)

[20] Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., Qiao, Y.: Videomamba: State space model for efficient video understanding. In: European conference on computer vision. pp. 237–255. Springer (2024)

[21] Li, M., Ma, K., You, J., Zhang, D., Zuo, W.: Efficient and effective context-based convolutional entropy modeling for image compression. IEEE Transactions on Image Processing 29, 5900–5911 (2020)

[22] Liu, J., Sun, H., Katto, J.: Learned image compression with mixed transformer-cnn architectures. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14388–14397 (2023)

[23] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., Liu, Y.: Vmamba: Visual state space model. Advances in neural information processing systems 37, 103031–103063 (2024)

[24] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

[25] Ma, C., Wang, Z., Liao, R., Ye, Y.: A cross channel context model for latents in deep image compression. arXiv preprint arXiv:2103.02884 (2021)

[26] Ma, J., Li, F., Wang, B.: U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024)

[27] Minnen, D., Ballé, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems 31 (2018)

[28] Minnen, D., Singh, S.: Channel-wise autoregressive entropy models for learned image compression. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 3339–3343. IEEE (2020)

[29] Nam, S., Punnappurath, A., Brubaker, M.A., Brown, M.S.: Learning srgb-to-raw-rgb de-rendering with content-aware metadata. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17704–17713 (2022)

[30] Patel, Y., Appalaraju, S., Manmatha, R.: Saliency driven perceptual image compression. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 227–236 (2021)

[31] Punnappurath, A., Brown, M.S.: Spatially aware metadata for raw reconstruction. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 218–226 (2021)

[32] Qin, S., Lu, Y., Zhou, Y., Li, J., Ren, Y., Xue, Y., Xia, S.T., Chen, B.: Freqsic: Frequency-aware stereo image compression with bi-directional checkerboard context model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19393–19402 (2026)

[33] Qin, S., Wang, J., Zhou, Y., Chen, B., Luo, T., An, B., Dai, T., Xia, S.T., Wang, Y.: Cassic: Towards content-adaptive state-space models for learned image compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15727–15736 (2025)

[34] Qin, S., Zhang, X., Liu, Z., Wang, J., Chen, B., Li, J., Ren, Y., Xia, S.T., Zhang, J.: Mambasic: Mamba-based stereo image compression with bi-directional multi-reference entropy model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5306–5315 (2026)

[35] Shi, Y., Xia, B., Jin, X., Wang, X., Zhao, T., Xia, X., Xiao, X., Yang, W.: Vmambair: Visual state space model for image restoration. IEEE Transactions on Circuits and Systems for Video Technology 35(6), 5560–5574 (2025)

[36] Tian, Y., Ling, X., Geng, C., Hu, Q., Lu, G., Zha, G.: Smc++: Masked learning of unsupervised video semantic compression. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

[37] Tian, Y., Lu, G., Min, X., Che, Z., Zhai, G., Guo, G., Gao, Z.: Self-conditioned probabilistic learning of video rescaling. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4490–4499 (2021)

[38] Tian, Y., Lu, G., Yan, Y., Zhai, G., Chen, L., Gao, Z.: A coding framework and benchmark towards low-bitrate video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5852–5872 (2024)

[39] Tian, Y., Lu, G., Zhai, G.: Free-vsc: Free semantics from visual foundation models for unsupervised video semantic compression. In: European Conference on Computer Vision. pp. 163–183. Springer (2024)

[40] Tian, Y., Lu, G., Zhai, G., Gao, Z.: Non-semantics suppressed mask learning for unsupervised video semantic compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13610–13622 (2023)

[41] Wallace, G.K.: The jpeg still picture compression standard. Communications of the ACM 34(4), 30–44 (1991)

[42] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Raw image reconstruction with learned compact metadata. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18206–18215 (2023)

[43] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Beyond learned metadata-based raw image reconstruction. International Journal of Computer Vision 132(12), 5514–5533 (2024)

[44] Warenkorb, L.R.: Information technology-high efficiency coding and media delivery in heterogeneous environments-part 3: 3d audio (2015)

[45] Wu, C., Wang, L., Zheng, Z., Cui, Y., Yang, Z., Chen, X., Zhang, Y., Jiang, W., Xia, J.: Scan clusters, not pixels: A cluster-centric paradigm for efficient ultra-high-definition image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15528–15537 (2026)

[46] Xing, Y., Qian, Z., Chen, Q.: Invertible image signal processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6287–6296 (2021)

[47] Zeng, F., Tang, H., Shao, Y., Chen, S., Shao, L., Wang, Y.: Mambaic: State space models for high-performance learned image compression. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18041–18050 (2025)

[48] Zhang, J., Nguyen, A.T., Han, X., Trinh, V.Q.H., Qin, H., Samaras, D., Hosseini, M.S.: 2dmamba: Efficient state space model for image representation with applications on giga-pixel whole slide image classification. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3583–3592 (2025)

[49] Zhou, Y., Zhou, P., Ng, T.K.: Efficient cascaded multiscale adaptive network for image restoration. In: European Conference on Computer Vision. pp. 92–110. Springer (2024)

[50] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)

[51] Zhu, Y., Yang, Y., Cohen, T.: Transformer-based transform coding. In: International conference on learning representations (2022)

[52] Zou, R., Song, C., Zhang, Z.: The devil is in the details: Window-based attention for image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17492–17501 (2022)

A More Details

A.1 Network Architecture

Our MambaRaw framework directly adopts the two-level JPEG-conditioned learned-context ba...