arxiv: 2606.24479 · v1 · pith:XW4OF6VZnew · submitted 2026-06-23 · 💻 cs.CV

MambaRaw: Selective State Space Modeling for Efficient 4K Raw Image Reconstruction

Pith reviewed 2026-06-26 00:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords raw image reconstruction · state space models · Mamba · JPEG-guided reconstruction · entropy parameter estimation · 4K imaging · image compression

The pith

MambaRaw reconstructs 4K raw images from JPEG previews using selective state space models that avoid quadratic attention costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MambaRaw as a JPEG-conditioned framework for recovering raw images that replaces attention-based entropy modeling with state space models. It adds a Spatial-Energy Coupled Context Modeling step built from TileMambaBlock, which scans only information-dense tiles, and an identity-initialized Energy-Aware Refinement module to match the long-tail energy distribution of raw signals. Experiments on Sony, Olympus, and Samsung camera data show consistent PSNR gains of 1.2-1.4 dB at low metadata bitrates together with roughly 9 percent lower end-to-end latency. A reader would care because the method makes high-resolution raw reconstruction practical at the scale where attention becomes prohibitive.
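To make the scaling argument concrete, the sketch below counts pairwise interactions for full self-attention against visit steps for a sequential scan on a 4K frame. The 16x latent downsampling factor and the helper names are assumptions chosen for illustration; this summary does not state the latent resolution the paper actually uses.

```python
def token_count(width: int, height: int, downsample: int) -> int:
    """Number of latent positions after integer downsampling (assumed factor)."""
    return (width // downsample) * (height // downsample)


def attention_pairs(n_tokens: int) -> int:
    """Full self-attention compares every token with every token: N^2 pairs."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int, n_directions: int = 1) -> int:
    """A sequential state space scan visits each token once per direction: O(N)."""
    return n_tokens * n_directions


# 3840 x 2160 frame with an assumed 16x downsampling gives a 240 x 135 grid.
n = token_count(3840, 2160, downsample=16)
print(n, attention_pairs(n), scan_steps(n))
```

Under these assumptions the grid holds 32,400 positions, so full attention touches about one billion pairs while a single scan touches 32,400 positions. Restricting the scan to a subset of tiles shrinks the linear term further.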

Core claim

MambaRaw claims that Spatial-Energy Coupled Context Modeling, formed by TileMambaBlock selective scanning on dense tiles plus identity-initialized EAR refinement, estimates entropy parameters for raw signals more efficiently than attention while preserving or improving reconstruction accuracy, thereby setting a new state of the art for JPEG-guided raw image recovery at 4K resolution.
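What "identity-initialized" means for a residual module can be sketched in a few lines. The class below is hypothetical: the layer sizes, the ReLU nonlinearity, and the zeroed output projection are assumptions, and zero-initializing the last projection is only one common way to obtain an identity start, not necessarily the one the paper uses.

```python
import random


class IdentityInitResidual:
    """Hypothetical residual block y = x + W_out @ relu(W_in @ x).

    W_out starts at zero, so the block is exactly the identity at
    initialization and only departs from it as training updates W_out.
    """

    def __init__(self, dim: int, hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Small random input projection with shape (hidden, dim).
        self.w_in = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        # Zero output projection with shape (dim, hidden): the branch vanishes.
        self.w_out = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x: list[float]) -> list[float]:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w_in]
        branch = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]
        return [xi + bi for xi, bi in zip(x, branch)]


block = IdentityInitResidual(dim=4, hidden=8)
features = [0.5, -1.0, 2.0, 0.0]
print(block(features) == features)
```

The design choice this illustrates is that a refinement stage added to an existing context model cannot degrade its output at the start of training, because the stage initially passes features through unchanged.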

What carries the argument

Spatial-Energy Coupled Context Modeling mechanism that combines TileMambaBlock (Mamba-style selective scanning restricted to information-dense tiles) and Energy-Aware Refinement (identity-initialized residual module) to produce entropy parameters without quadratic attention scaling.
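A minimal sketch of energy-based tile selection follows, assuming non-overlapping square tiles ranked by summed squared activation and a keep ratio ρ as mentioned in the Figure 2 caption. The function names, the single-channel input, and the top-k rule are illustrative assumptions rather than the paper's implementation.

```python
import math


def tile_energies(feat: list[list[float]], tile: int) -> dict[tuple[int, int], float]:
    """Sum of squared values (squared L2 energy) for each non-overlapping tile."""
    h, w = len(feat), len(feat[0])
    energies = {}
    for ty in range(0, h, tile):
        for tx in range(0, w, tile):
            energies[(ty // tile, tx // tile)] = sum(
                feat[y][x] ** 2
                for y in range(ty, min(ty + tile, h))
                for x in range(tx, min(tx + tile, w))
            )
    return energies


def select_tiles(feat: list[list[float]], tile: int, keep_ratio: float) -> list[tuple[int, int]]:
    """Return (row, col) indices of the highest-energy tiles, keeping ceil(ratio * count)."""
    energies = tile_energies(feat, tile)
    k = max(1, math.ceil(keep_ratio * len(energies)))
    ranked = sorted(energies, key=energies.get, reverse=True)
    return sorted(ranked[:k])


feat = [
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 3.0, 3.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.5, 0.0],
]
print(select_tiles(feat, tile=2, keep_ratio=0.5))
```

With a keep ratio of 0.5 on this toy map, the two tiles with the largest energy are kept, and only those would be passed to the scan. The remaining tiles would bypass it, which is where the compute saving comes from.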

If this is right

  • PSNR increases by 1.2-1.4 dB over strong metadata baselines at low bitrates.
  • End-to-end coding latency drops by approximately 9 percent.
  • The approach remains computationally feasible at 4K resolution where attention scales poorly.
  • New state-of-the-art results hold across the three tested camera datasets.
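To put the decibel figures in perspective, the snippet below converts a PSNR gain into the corresponding change in mean squared error using the standard definition PSNR = 10·log10(peak² / MSE). The function names and the normalized peak value of 1.0 are illustrative assumptions.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE is multiplied when PSNR rises by gain_db (same peak)."""
    return 10.0 ** (-gain_db / 10.0)


for gain in (1.2, 1.4):
    print(f"+{gain} dB -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")
```

A gain of 1.2 to 1.4 dB therefore corresponds to roughly 24 to 28 percent lower mean squared error at the same bitrate.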

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective-scanning design could extend to other high-resolution entropy-modeling tasks such as video compression.
  • Further gains might appear if the tile-selection heuristic inside TileMambaBlock is learned rather than fixed.
  • Mobile or embedded pipelines could adopt the method once the three-brand generalization is confirmed on additional sensors.

Load-bearing premise

The TileMambaBlock and EAR modules can accurately model raw-signal entropy across diverse cameras without the quadratic cost of attention.

What would settle it

Running the method on a fourth unseen camera brand at 4K resolution and measuring whether the reported PSNR gain and latency reduction both disappear.

Figures

Figures reproduced from arXiv: 2606.24479 by Fanhu Zeng, Haotian Zhang, Peize Li, Tongda Xu, Xingguo Xu, Xingtong Ge, Xinjie Zhang, Yan Wang.

Figure 1: Motivation and Comparison. (a) Convolution-based methods have limited receptive fields, which restricts long-range spatial modeling. (b) MambaRaw uses a spatial–energy coupled context model. It applies TileMambaBlock for selective scanning on information-dense tiles and uses EAR for energy-guided refinement.
Figure 2: Energy analysis and tile selection. (a) Long-tail energy distribution motivating EAR. (b) Spatial L2 energy map showing selected high-energy tiles (cyan) at ρ = 0.5. (c) Impact of keep ratio ρ; ρ = 0.5 offers the optimal accuracy-speed trade-off. Here, F̃ is the JPEG-conditioned feature, F_in is the context input before SSM processing, F_c is the TileMambaBlock output, and F′ is the final EAR output.
Figure 3: The Overall Framework of MambaRaw. We adopt a two-level VAE architecture conditioned on the available JPEG preview. The core innovation lies in the Level-1 Context Model, where we replace standard separate spatial/channel contexts with a coupled design: TileMambaBlock for efficient long-range spatial modeling on selected information-dense tiles, and EAR for lightweight energy-guided refinement.
Figure 4: RD curves over the NUS dataset (Samsung NX2000, Olympus E-PL6, Sony SLT-A57) following the setting of [29]. The left and right columns report PSNR and SSIM, respectively. For variable-rate models, a single model is trained for each curve and different operating points are obtained by changing the rate–distortion hyperparameter of the trained model.
Figure 5: Qualitative comparison on Sony SLT-A57. Error maps show the per-pixel maximum absolute error over the three channels (after gamma correction for visibility); darker indicates smaller error.
Abstract

In-camera JPEG previews are ubiquitous in raw image formats and provide an sRGB reference at negligible storage cost. Although existing metadata-based reconstruction frameworks can exploit this side information when recovering raw images, their context models often become computationally expensive, especially at high resolution (e.g., 4K raw images), because attention mechanisms scale quadratically with feature-map size, which hinders their practical application. To address these limitations, we propose MambaRaw, a JPEG-conditioned metadata-based raw image reconstruction framework that uses State Space Models (SSMs) to estimate entropy parameters efficiently. Our key contribution is a Spatial-Energy Coupled Context Modeling mechanism with two lightweight modules: (1) TileMambaBlock, which performs Mamba-style selective scanning only on information-dense tiles to improve efficiency; and (2) Energy-Aware Refinement (EAR), an identity-initialized residual module that enhances feature representation to match the long-tail energy distribution of raw signals. Extensive experiments on three camera datasets (Sony, Olympus, Samsung) show consistent improvements over strong metadata-based baselines and set a new state of the art for JPEG-guided raw reconstruction with great efficiency. Notably, at low metadata bitrates, MambaRaw increases PSNR by 1.2–1.4 dB and reduces end-to-end coding latency by about 9%. Code is released at https://github.com/Peizeli1/MambaRaw.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents MambaRaw, a JPEG-conditioned framework for 4K raw image reconstruction that replaces attention-based entropy modeling with selective state space models. The core contribution is a Spatial-Energy Coupled Context Modeling mechanism comprising the TileMambaBlock (selective scanning restricted to information-dense tiles) and the identity-initialized Energy-Aware Refinement (EAR) module. Experiments on Sony, Olympus, and Samsung datasets report consistent PSNR gains of 1.2–1.4 dB at low metadata bitrates together with an approximately 9 % reduction in end-to-end latency, establishing a new state of the art while releasing code.

Significance. If the reported efficiency and accuracy advantages prove robust, the work supplies a practical linear-complexity alternative to quadratic attention for high-resolution metadata-guided reconstruction, with direct relevance to in-camera pipelines. The public code release supports reproducibility and is a clear strength.

major comments (1)
  1. [Experiments on three camera datasets] All quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB / 9 % gains.
minor comments (1)
  1. [Abstract] The latency reduction is given as “about 9 %”; reporting the exact measured value together with any standard deviation or number of runs would improve precision.
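
The kind of reporting this minor comment asks for can be sketched with a small timing harness. The names `benchmark` and `relative_reduction`, the warm-up count, and the default of ten runs are illustrative assumptions, not the authors' measurement protocol.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], runs: int = 10, warmup: int = 2) -> dict[str, float]:
    """Time fn over several runs; report mean and sample standard deviation in ms."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {
        "runs": float(runs),
        "mean_ms": statistics.mean(samples),
        "stdev_ms": statistics.stdev(samples) if runs > 1 else 0.0,
    }


def relative_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Latency reduction of the candidate relative to the baseline, in percent."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


stats = benchmark(lambda: sum(range(10_000)), runs=5)
print(stats)
print(relative_reduction(100.0, 91.0))
```

On a GPU the timed callable would also need to synchronize the device before the timer stops; otherwise asynchronous kernel launches make the measurement too short.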

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on generalization. We address the single major comment below.

  1. Referee: Experiments (three camera datasets): all quantitative results and the SOTA claim rest exclusively on Sony, Olympus, and Samsung sensors that share similar CFA patterns and noise statistics. The central assertion that TileMambaBlock plus EAR accurately estimates entropy parameters for raw signals across diverse cameras therefore lacks supporting evidence from additional brands or cross-sensor transfer tests; this directly affects the generalization and practical applicability of the 1.2–1.4 dB / 9 % gains.

    Authors: We agree that broader sensor diversity would strengthen generalization claims. The Sony, Olympus, and Samsung datasets are the standard benchmarks in prior raw reconstruction literature and cover multiple manufacturers, even though they share the Bayer CFA. Our experiments follow this established protocol to enable direct comparison. Cross-sensor transfer tests were not included because raw reconstruction is typically sensor-specific due to differing noise profiles and ISPs. In the revision we will add a limitations paragraph discussing this scope and noting it as future work. The consistent gains across the three datasets still support the effectiveness of the proposed modules for the evaluated setting. revision: partial

standing simulated objections not resolved
  • Additional experiments on further camera brands or cross-sensor transfer, which would require new raw datasets outside the current work's scope.

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluation on external datasets

The paper introduces MambaRaw, a new SSM-based architecture with TileMambaBlock and identity-initialized EAR modules for JPEG-guided raw reconstruction. All reported gains (PSNR, latency) are obtained by training and testing on three external camera datasets (Sony, Olympus, Samsung). No equations, derivations, or predictions are presented that reduce to fitted inputs by construction, nor are there load-bearing self-citations or uniqueness theorems. The work is self-contained against external benchmarks, consistent with the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no equations, training details, or parameter counts provided to identify free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5803 in / 1123 out tokens · 19383 ms · 2026-06-26T00:14:04.645636+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 10 canonical work pages · 6 internal anchors

  1. [1] Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. arXiv preprint arXiv:1611.01704 (2016)

  2. [2] Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 (2018)

  3. [3] Bychkovsky, V., Paris, S., Chan, E., Durand, F.: Learning photographic global tonal adjustment with a database of input/output image pairs. In: CVPR 2011. pp. 97–104. IEEE (2011)

  4. [4] Chen, H., Han, W., Zheng, H., Shen, J.: Rawmamba: Unified srgb-to-raw de-rendering with state space model. arXiv preprint arXiv:2411.11717 (2024)

  5. [5] Chen, Y., Qin, H., Zhang, Z., Magno, M., Benini, L., Li, Y.: Q-mambair: Accurate quantized mamba for efficient image restoration. arXiv preprint arXiv:2503.21970 (2025)

  6. [6] Chen, Y., Lyu, Z., He, B., Hu, H., Wang, Q., Tian, Y., Song, L., Zhang, W., Lu, G.: Cmic: Content-adaptive mamba for learned image compression. arXiv preprint arXiv:2508.02192 (2025)

  7. [7] Cheng, D., Prasad, D.K., Brown, M.S.: Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. Journal of the Optical Society of America A 31(5), 1049–1058 (2014)

  8. [8] Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7939–7948 (2020)

  9. [9] Dao, T., Gu, A.: Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060 (2024)

  10. [10] Gao, G., You, P., Pan, R., Han, S., Zhang, Y., Dai, Y., Lee, H.: Neural image compression via attentional multi-scale back projection and frequency decomposition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14677–14686 (2021)

  11. [11] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First conference on language modeling (2024)

  12. [12] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

  13. [13] Guo, H., Guo, Y., Zha, Y., Zhang, Y., Li, W., Dai, T., Xia, S.T., Li, Y.: Mambairv2: Attentive state space restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28124–28133 (2025)

  14. [14] Guo, Z., Zhang, Z., Feng, R., Chen, Z.: Causal contextual prediction for learned image compression. IEEE Transactions on Circuits and Systems for Video Technology 32(4), 2329–2341 (2021)

  15. [15] Hatamizadeh, A., Kautz, J.: Mambavision: A hybrid mamba-transformer vision backbone. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25261–25270 (2025)

  16. [16] He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5718–5727 (2022)

  17. [17] He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14771–14780 (2021)

  18. [18] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)

  19. [19] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C.: Localmamba: Visual state space model with windowed selective scan. In: European conference on computer vision. pp. 12–22. Springer (2024)

  20. [20] Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., Qiao, Y.: Videomamba: State space model for efficient video understanding. In: European conference on computer vision. pp. 237–255. Springer (2024)

  21. [21] Li, M., Ma, K., You, J., Zhang, D., Zuo, W.: Efficient and effective context-based convolutional entropy modeling for image compression. IEEE Transactions on Image Processing 29, 5900–5911 (2020)

  22. [22] Liu, J., Sun, H., Katto, J.: Learned image compression with mixed transformer-cnn architectures. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14388–14397 (2023)

  23. [23] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., Liu, Y.: Vmamba: Visual state space model. Advances in neural information processing systems 37, 103031–103063 (2024)

  24. [24] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 10012–10022 (2021)

  25. [25] Ma, C., Wang, Z., Liao, R., Ye, Y.: A cross channel context model for latents in deep image compression. arXiv preprint arXiv:2103.02884 (2021)

  26. [26] Ma, J., Li, F., Wang, B.: U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 (2024)

  27. [27] Minnen, D., Ballé, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems 31 (2018)

  28. [28] Minnen, D., Singh, S.: Channel-wise autoregressive entropy models for learned image compression. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 3339–3343. IEEE (2020)

  29. [29] Nam, S., Punnappurath, A., Brubaker, M.A., Brown, M.S.: Learning srgb-to-raw-rgb de-rendering with content-aware metadata. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17704–17713 (2022)

  30. [30] Patel, Y., Appalaraju, S., Manmatha, R.: Saliency driven perceptual image compression. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 227–236 (2021)

  31. [31] Punnappurath, A., Brown, M.S.: Spatially aware metadata for raw reconstruction. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 218–226 (2021)

  32. [32] Qin, S., Lu, Y., Zhou, Y., Li, J., Ren, Y., Xue, Y., Xia, S.T., Chen, B.: Freqsic: Frequency-aware stereo image compression with bi-directional checkerboard context model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19393–19402 (2026)

  33. [33] Qin, S., Wang, J., Zhou, Y., Chen, B., Luo, T., An, B., Dai, T., Xia, S.T., Wang, Y.: Cassic: Towards content-adaptive state-space models for learned image compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15727–15736 (2025)

  34. [34] Qin, S., Zhang, X., Liu, Z., Wang, J., Chen, B., Li, J., Ren, Y., Xia, S.T., Zhang, J.: Mambasic: Mamba-based stereo image compression with bi-directional multi-reference entropy model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5306–5315 (2026)

  35. [35] Shi, Y., Xia, B., Jin, X., Wang, X., Zhao, T., Xia, X., Xiao, X., Yang, W.: Vmambair: Visual state space model for image restoration. IEEE Transactions on Circuits and Systems for Video Technology 35(6), 5560–5574 (2025)

  36. [36] Tian, Y., Ling, X., Geng, C., Hu, Q., Lu, G., Zha, G.: Smc++: Masked learning of unsupervised video semantic compression. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  37. [37] Tian, Y., Lu, G., Min, X., Che, Z., Zhai, G., Guo, G., Gao, Z.: Self-conditioned probabilistic learning of video rescaling. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4490–4499 (2021)

  38. [38] Tian, Y., Lu, G., Yan, Y., Zhai, G., Chen, L., Gao, Z.: A coding framework and benchmark towards low-bitrate video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(8), 5852–5872 (2024)

  39. [39] Tian, Y., Lu, G., Zhai, G.: Free-vsc: Free semantics from visual foundation models for unsupervised video semantic compression. In: European Conference on Computer Vision. pp. 163–183. Springer (2024)

  40. [40] Tian, Y., Lu, G., Zhai, G., Gao, Z.: Non-semantics suppressed mask learning for unsupervised video semantic compression. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13610–13622 (2023)

  41. [41] Wallace, G.K.: The jpeg still picture compression standard. Communications of the ACM 34(4), 30–44 (1991)

  42. [42] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Raw image reconstruction with learned compact metadata. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18206–18215 (2023)

  43. [43] Wang, Y., Yu, Y., Yang, W., Guo, L., Chau, L.P., Kot, A.C., Wen, B.: Beyond learned metadata-based raw image reconstruction. International Journal of Computer Vision 132(12), 5514–5533 (2024)

  44. [44] Warenkorb, L.R.: Information technology-high efficiency coding and media delivery in heterogeneous environments-part 3: 3d audio (2015)

  45. [45] Wu, C., Wang, L., Zheng, Z., Cui, Y., Yang, Z., Chen, X., Zhang, Y., Jiang, W., Xia, J.: Scan clusters, not pixels: A cluster-centric paradigm for efficient ultra-high-definition image restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15528–15537 (2026)

  46. [46] Xing, Y., Qian, Z., Chen, Q.: Invertible image signal processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6287–6296 (2021)

  47. [47] Zeng, F., Tang, H., Shao, Y., Chen, S., Shao, L., Wang, Y.: Mambaic: State space models for high-performance learned image compression. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18041–18050 (2025)

  48. [48] Zhang, J., Nguyen, A.T., Han, X., Trinh, V.Q.H., Qin, H., Samaras, D., Hosseini, M.S.: 2dmamba: Efficient state space model for image representation with applications on giga-pixel whole slide image classification. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 3583–3592 (2025)

  49. [49] Zhou, Y., Zhou, P., Ng, T.K.: Efficient cascaded multiscale adaptive network for image restoration. In: European Conference on Computer Vision. pp. 92–110. Springer (2024)

  50. [50] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)

  51. [51] Zhu, Y., Yang, Y., Cohen, T.: Transformer-based transform coding. In: International conference on learning representations (2022)

  52. [52] Zou, R., Song, C., Zhang, Z.: The devil is in the details: Window-based attention for image compression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17492–17501 (2022)