
arxiv: 2606.10939 · v1 · pith:AUW2S4RMnew · submitted 2026-06-09 · 💻 cs.CV

PENet+: A Lightweight Residual Transformer Framework for Efficient Image Steganalysis

Pith reviewed 2026-06-27 13:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords image steganalysis · lightweight network · residual transformer · PENet · ALASKA2 · efficiency optimization · high-pass filters · MobileNetV2

The pith

PENet+ cuts up to 45.5% of the parameters and about 97% of the FLOPs from the PENet residual transformer for image steganalysis while reportedly holding detection accuracy steady on ALASKA2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PENet+ as a lighter version of the earlier PENet residual transformer model used to detect hidden data inside digital images. It keeps the original self-attention structure intact but narrows the channels feeding the classifier, refines the initial high-pass filter bank with a top-K SRM-Gabor selection, and swaps the backbone for a MobileNetV2-style inverted residual network. These steps produce large drops in parameter count and floating-point operations on a fixed ALASKA2 JPEG QF90 protocol at 512 by 512 resolution. The authors argue that the efficiency gains come with negligible loss in detection performance on their separate 19,000-cover evaluation set.

Core claim

PENet+ is reported to retain the detection accuracy of the re-evaluated PENet baseline, with negligible loss, on a disjoint ALASKA2 JPEG QF90 evaluation set of 19,000 cover images by applying three targeted changes: a progressive narrowing of channels between spatial pyramid pooling and the first fully connected layer, an activation-aware early aggregation that selects a balanced set of 31 SRM-Gabor high-pass filters, and replacement of the backbone with MobileNetV2-style inverted residual blocks. The balanced K=31 configuration, which the authors say matches or surpasses heavier filter settings, yields up to 45.5% fewer parameters and roughly 97% fewer FLOPs at 512 by 512 resolution.

What carries the argument

Classifier-streamlining stage that progressively narrows SPP-to-FC1 input channels, paired with activation-aware top-K SRM-Gabor HPF selection and a MobileNetV2-style inverted residual backbone.
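The abstract does not spell out the selection rule, so the following is only a minimal sketch of what a balanced top-K pick could look like. It assumes each filter has a scalar activation score (for example a mean absolute response) and that the quota is filled per family. The function name `select_balanced_top_k`, the bank sizes (30 SRM, 32 Gabor), and the scores themselves are hypothetical; only the 16 Gabor + 15 SRM split comes from the paper.

```python
from typing import Dict, List


def select_balanced_top_k(
    scores: Dict[str, List[float]],
    quota: Dict[str, int],
) -> Dict[str, List[int]]:
    """Pick the highest-scoring filter indices within each filter family.

    scores: family name -> per-filter activation score. The statistic is an
            assumption; the paper's exact criterion is not given in the abstract.
    quota:  family name -> number of filters to keep from that family.
    """
    selected = {}
    for family, family_scores in scores.items():
        k = quota[family]
        if k > len(family_scores):
            raise ValueError(f"quota {k} exceeds {len(family_scores)} {family} filters")
        # Rank indices by descending score; ties keep the lower index first.
        ranked = sorted(range(len(family_scores)), key=lambda i: (-family_scores[i], i))
        selected[family] = sorted(ranked[:k])
    return selected


# Hypothetical scores for an assumed bank of 30 SRM and 32 Gabor filters.
srm_scores = [((7 * i) % 30) / 30 for i in range(30)]
gabor_scores = [((5 * i) % 32) / 32 for i in range(32)]

chosen = select_balanced_top_k(
    {"srm": srm_scores, "gabor": gabor_scores},
    {"srm": 15, "gabor": 16},  # the balanced K=31 split reported in the paper
)
total_k = sum(len(indices) for indices in chosen.values())
print(total_k, chosen)
```

Filling a quota per family, rather than taking a global top 31, is what keeps the set balanced: a single family with uniformly larger responses cannot crowd out the other.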

If this is right

  • Resource-limited devices become practical hosts for residual-transformer steganalysis without redesigning the attention layers.
  • The same narrowing and top-K refinement steps can be applied to other residual transformer steganalysis models to reduce their compute footprint.
  • Preserving negative filter responses via PReLU retains weak stego signals that ReLU would discard.
  • A balanced 16-Gabor plus 15-SRM filter set at K=31 performs at least as well as larger or unbalanced sets at lower cost.
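The PReLU point above can be illustrated with a few scalar filter responses. This is a sketch under stated assumptions, not the paper's code: the sample residual values are made up, and the slope is fixed at 0.25 (the initial value proposed by He et al. [24]), whereas in a trained network it is a learned parameter.

```python
def relu(x: float) -> float:
    """Standard rectifier: every negative response collapses to zero."""
    return x if x > 0.0 else 0.0


def prelu(x: float, slope: float = 0.25) -> float:
    """Parametric rectifier: negative responses are scaled, not discarded.

    slope is learned in practice; 0.25 here is only the common initial value.
    """
    return x if x > 0.0 else slope * x


# Hypothetical high-pass filter residuals around a possible embedding change.
residuals = [-0.8, -0.1, 0.0, 0.3]

relu_out = [relu(r) for r in residuals]
prelu_out = [prelu(r) for r in residuals]

# Count how many outputs still carry a negative sign after each activation.
relu_negatives = sum(1 for y in relu_out if y < 0.0)
prelu_negatives = sum(1 for y in prelu_out if y < 0.0)
print(relu_out, prelu_out, relu_negatives, prelu_negatives)
```

After ReLU the two negative residuals are indistinguishable from a zero response, while PReLU keeps both their sign and their relative magnitude, which is the property the authors argue matters for weak stego signals.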

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to related forensic tasks such as detecting manipulated images where memory and latency constraints are similar.
  • Hardware measurements on actual edge devices would be needed to confirm whether the reported FLOP reductions translate into proportional speed or power savings.
  • The top-K selection rule might be tested on other embedding algorithms beyond the ALASKA2 protocol to check its generality.

Load-bearing premise

The chosen channel-narrowing schedule, K=31 filter selection, and backbone swap preserve the original model's detection accuracy on the ALASKA2 split, with no more than the negligible loss the authors report.

What would settle it

A statistically significant drop in detection accuracy below the PENet baseline when the K=31 balanced configuration is evaluated on the separate 19,000-cover ALASKA2 JPEG QF90 set.
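One way to make "statistically significant drop" concrete is a one-sided two-proportion z-test on the number of correct decisions. The sketch below is an assumption about how such a check could be run, not something the paper reports: the function name, the sample size, and all counts are hypothetical, and because both models would be scored on the same images, a paired test such as McNemar's would usually be the sharper tool when per-image predictions are available.

```python
import math


def accuracy_drop_p_value(correct_base: int, correct_new: int, n: int) -> float:
    """One-sided two-proportion z-test for 'the new model is less accurate'.

    correct_base, correct_new: correct decisions of each model on n test images.
    Returns P(Z >= z) under the null hypothesis of equal accuracy.
    """
    p_base = correct_base / n
    p_new = correct_new / n
    pooled = (correct_base + correct_new) / (2 * n)
    standard_error = math.sqrt(2 * pooled * (1 - pooled) / n)
    if standard_error == 0.0:
        return 1.0  # both models all-correct or all-wrong: no evidence of a drop
    z = (p_base - p_new) / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts on an evaluation set of n images.
n = 19000
p_equal = accuracy_drop_p_value(17000, 17000, n)
p_small_drop = accuracy_drop_p_value(17000, 16990, n)
p_large_drop = accuracy_drop_p_value(17000, 16000, n)
print(p_equal, p_small_drop, p_large_drop)
```

A small p-value for the K=31 configuration against the PENet baseline would count against the accuracy-preservation claim; a large one is consistent with it but does not prove equivalence.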

Figures

Figures reproduced from arXiv: 2606.10939 by Dongsu Kim, Haneol Jang, Jincheol AN, Youngjoon Yoo.

Figure 3. For an IR block with spatial size $(H, W)$, expansion $r$, and output channels $C_{\text{out}}$, the FLOPs are

$$\text{FLOPs}_{\text{exp}\,1\times 1} = HW\,C_{\text{in}}\,(r\,C_{\text{in}}), \tag{4}$$

$$\text{FLOPs}_{\text{DW}} = HW \cdot 9\,(r\,C_{\text{in}}), \tag{5}$$

$$\text{FLOPs}_{\text{proj}\,1\times 1} = HW\,(r\,C_{\text{in}})\,C_{\text{out}}. \tag{6}$$

Replacing a full 3×3 convolution with $9HWC^2$ FLOPs by a depthwise 3×3 convolution with $9HWC$ FLOPs followed by two 1×1 pointwise projections with $2HWC^2$ FLOPs reduces the spatial cost by roughly a factor of $C$ at hig…
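The FLOP counts in Eqs. (4) to (6) of the Figure 3 caption are easy to check numerically. The sketch below simply evaluates those three expressions and compares them with a full 3×3 convolution. It assumes stride 1, no bias terms, and the caption's counting convention; the function names and the example sizes (64×64 feature map, 32 channels, r = 1) are illustrative choices, not values from the paper.

```python
def ir_block_flops(h: int, w: int, c_in: int, c_out: int, r: int) -> dict:
    """FLOPs of one inverted residual (IR) block, following Eqs. (4)-(6)."""
    hidden = r * c_in                     # expanded channel width
    expand = h * w * c_in * hidden        # Eq. (4): 1x1 expansion
    depthwise = h * w * 9 * hidden        # Eq. (5): 3x3 depthwise
    project = h * w * hidden * c_out      # Eq. (6): 1x1 projection
    return {
        "expand": expand,
        "depthwise": depthwise,
        "project": project,
        "total": expand + depthwise + project,
    }


def full_conv3x3_flops(h: int, w: int, c_in: int, c_out: int) -> int:
    """FLOPs of a standard 3x3 convolution: 9 * H * W * C_in * C_out."""
    return 9 * h * w * c_in * c_out


# Illustrative sizes only; r = 1 reproduces the 9HWC + 2HWC^2 case in the caption.
H, W, C = 64, 64, 32
ir = ir_block_flops(H, W, c_in=C, c_out=C, r=1)
full = full_conv3x3_flops(H, W, C, C)
print(f"full 3x3: {full:,}  IR block: {ir['total']:,}  ratio: {full / ir['total']:.2f}")
```

With r = 1 the spatial (3×3) part alone shrinks by exactly a factor of C, while the two pointwise projections add a 2HWC² term, so the overall saving is smaller than C and depends on the channel width and expansion ratio.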
Original abstract

Image steganalysis, the detection of hidden information embedded in digital images, is a core component of modern cybersecurity and digital forensics. Recent residual Transformer architectures, such as the Pixel-Difference-Convolution and Enhanced-Transformer-Network (PENet) [1], achieve strong detection accuracy, but their computational and memory demands hinder deployment in resource-constrained settings. We present PENet+, a lightweight steganalysis framework that preserves PENet's discriminative structure while substantially improving efficiency. Rather than redesigning or compressing the attention blocks, we retain PENet's self-attention topology for reproducibility and add a classifier-streamlining stage that progressively narrows the SPP-to-FC1 input channels (SPP: spatial pyramid pooling; FC1: first fully connected layer), yielding large reductions in parameters and FLOPs with negligible accuracy loss. We further refine the high-pass-filter (HPF) stem with an activation-aware mechanism that aggregates HPF responses early and selects a balanced SRM-Gabor top-K subset, and we replace PENet's backbone with a MobileNetV2-style inverted residual network. A balanced configuration with K=31 filters (16 Gabor + 15 SRM) matches or surpasses heavier settings at lower compute. Finally, we motivate PReLU from a steganalysis standpoint, arguing that preserving negative responses helps capture weak stego cues that ReLU suppresses. On a disjoint ALASKA2 JPEG QF90 protocol at 512x512 resolution (5,000 cover images for training, validation, and internal testing; a separate 19,000-cover evaluation set), PENet+ achieves up to 45.5% fewer parameters and about 97% fewer FLOPs than the re-evaluated PENet baseline, offering a computationally efficient direction for resource-constrained steganalysis. Device-level latency and power measurements remain future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents PENet+, a lightweight variant of the PENet residual Transformer for image steganalysis. It retains the original self-attention topology while adding a classifier-streamlining stage that narrows SPP-to-FC1 channels, refines the HPF stem via an activation-aware SRM-Gabor top-K selection (balanced K=31), substitutes a MobileNetV2-style inverted residual backbone, and motivates PReLU to preserve negative responses for weak stego cues. On a disjoint ALASKA2 JPEG QF90 512x512 protocol (5k training/validation/internal test covers; separate 19k evaluation covers), it claims up to 45.5% fewer parameters and ~97% fewer FLOPs than the re-evaluated PENet baseline with negligible accuracy loss.

Significance. If the accuracy-preservation claim holds, the work supplies a concrete efficiency direction for resource-constrained steganalysis, an area where existing high-accuracy detectors are often impractical. Retaining the original self-attention topology for reproducibility is a positive design choice that aids verification.

major comments (1)
  1. [Abstract] The central claim that the three modifications (classifier-channel narrowing, K=31 SRM-Gabor top-K, MobileNetV2-style backbone) incur only 'negligible accuracy loss' and that the K=31 configuration 'matches or surpasses heavier settings' is unsupported by any reported detection error rates, AUC values, or ablation tables on the 19,000-cover evaluation set. Without these numbers the efficiency gains (45.5% params, 97% FLOPs) cannot be assessed against possible degradation in discriminative power.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that the claims regarding accuracy preservation require explicit supporting metrics on the evaluation set to allow proper assessment of the efficiency gains.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the three modifications (classifier-channel narrowing, K=31 SRM-Gabor top-K, MobileNetV2-style backbone) incur only 'negligible accuracy loss' and that the K=31 configuration 'matches or surpasses heavier settings' is unsupported by any reported detection error rates, AUC values, or ablation tables on the 19,000-cover evaluation set. Without these numbers the efficiency gains (45.5% params, 97% FLOPs) cannot be assessed against possible degradation in discriminative power.

    Authors: We agree that the abstract as written does not include the specific detection error rates or AUC values on the 19,000-cover evaluation set, making the 'negligible accuracy loss' and 'matches or surpasses' claims difficult to evaluate directly from the abstract alone. In the revised version we will add these metrics (e.g., the error rates and AUC for the K=31 configuration versus the baseline and heavier K settings) to the abstract so that the efficiency numbers can be assessed against any accuracy impact. The experimental section already reports results on the disjoint evaluation protocol; the revision will simply make the abstract self-contained with the key numbers.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; efficiency and accuracy claims are direct empirical measurements on external ALASKA2 benchmark

Full rationale

The paper describes three architectural modifications (classifier-channel narrowing, K=31 SRM-Gabor top-K selection, MobileNetV2-style backbone) to the cited PENet baseline and reports measured parameter/FLOP reductions plus an assertion of negligible accuracy loss on a disjoint 19k-image ALASKA2 evaluation set. No equations, fitted parameters, or self-citation chains are shown that would make the reported numbers equivalent to the inputs by construction. The central claims rest on external benchmark evaluation rather than any self-definitional or fitted-input reduction. Absence of explicit ablation numbers is an evidence gap, not circularity. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard deep-learning assumptions plus two domain-specific modeling choices whose justification is given only at the level of the abstract.

free parameters (2)
  • K = 31
    Top-K filter count set to 31 (16 Gabor + 15 SRM); chosen to balance accuracy and compute.
  • classifier channel narrowing schedule
    Progressive reduction of SPP-to-FC1 channels; exact ratios not stated in abstract but treated as tunable design parameter.
axioms (1)
  • domain assumption: Preserving negative filter responses via PReLU captures weak stego cues that ReLU would suppress.
    Stated as a steganalysis-specific motivation without supporting derivation or ablation in the provided abstract.
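To see why the classifier channel narrowing listed above is such an effective lever, it helps to count the weights in the SPP-to-FC1 connection. The sketch below is only an illustration under assumed values: the pyramid levels (1, 2, 4), the FC1 width of 1024, and the channel schedule 256 → 128 → 64 are hypothetical, since the abstract does not state the actual ratios.

```python
from typing import Sequence


def spp_fc1_params(channels: int, fc1_units: int,
                   pyramid_levels: Sequence[int] = (1, 2, 4)) -> int:
    """Parameter count of a fully connected layer fed by spatial pyramid pooling.

    Each pyramid level L pools every channel into an L x L grid, so the
    flattened SPP output has channels * sum(L * L) features.
    """
    bins = sum(level * level for level in pyramid_levels)
    in_features = channels * bins
    return in_features * fc1_units + fc1_units  # weights + biases


# Hypothetical narrowing schedule for the channels entering SPP.
fc1_units = 1024
schedule = [256, 128, 64]
params = {c: spp_fc1_params(c, fc1_units) for c in schedule}

for c in schedule:
    saving = 1 - params[c] / params[schedule[0]]
    print(f"channels={c:4d}  FC1 params={params[c]:,}  saving={saving:.1%}")
```

Because the FC1 weight matrix scales linearly with the number of channels entering SPP, halving those channels removes roughly half of that layer's parameters without touching the attention blocks.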

pith-pipeline@v0.9.1-grok · 5871 in / 1387 out tokens · 23811 ms · 2026-06-27T13:27:36.221666+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 2 linked inside Pith

  1. [1]

    Color image steganalysis based on pixel difference convolution and enhanced transformer with selective pooling,

    S. Yin, X. Li, Y. Zhang, S. Liu, and L. Wang, “Color image steganalysis based on pixel difference convolution and enhanced transformer with selective pooling,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 5129–5143, 2023

  2. [2]

    Deep learning for steganalysis via convolutional neural networks,

    Y. Qian, J. Dong, W. Wang, T. Tan et al., “Deep learning for steganalysis via convolutional neural networks,” in IS&T Electronic Imaging: Media Watermarking, Security, and Forensics, 2015

  3. [3]

    Deep learning hierarchical representations for image steganalysis,

    J. Ye, J. Ni, and Y. Yi, “Deep learning hierarchical representations for image steganalysis,” in Proceedings of the ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec). ACM, 2017, pp. 67–73

  4. [4]

    Structural design of convolutional neural networks for steganalysis,

    G. Xu, H.-Z. Wu, and Y.-Q. Shi, “Structural design of convolutional neural networks for steganalysis,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 4044–4048

  5. [5]

    Yedroudj-net: An efficient cnn for spatial steganalysis,

    M. Yedroudj, F. Comby, and M. Chaumont, “Yedroudj-net: An efficient cnn for spatial steganalysis,” IEEE Access, vol. 6, pp. 18065–18077, 2018

  6. [6]

    Deep residual network for steganalysis of digital images,

    M. Boroumand, M. Chen, and J. Fridrich, “Deep residual network for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 5, pp. 1181–1193, 2019

  7. [7]

    Steganalyzing images of arbitrary size with cnns,

    C.-F. Tsang and J. Fridrich, “Steganalyzing images of arbitrary size with cnns,” in Electronic Imaging, vol. 30, no. 7, 2018, pp. 121-1–121-8

  8. [8]

    A siamese cnn for image steganalysis,

    W. You, J. Ni, Y. Zhang, and Z. Qian, “A siamese cnn for image steganalysis,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 291–306, 2021

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021, arXiv:2010.11929

  10. [10]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022

  11. [11]

    Rich models for steganalysis of digital images,

    J. Fridrich and T. Kodovský, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012

  12. [12]

    Mobilenetv2: Inverted residuals and linear bottlenecks,

    M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520

  13. [13]

    Shufflenet: An extremely efficient convolutional neural network for mobile devices,

    X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848–6856

  14. [14]

    Efficientnet: Rethinking model scaling for convolutional neural networks,

    M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 6105–6114

  15. [15]

    Towards faster training of global covariance pooling networks by iterative matrix square root normalization,

    P. Li, J. Xie, Q. Wang, and Z. Gao, “Towards faster training of global covariance pooling networks by iterative matrix square root normalization,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 947–955

  16. [16]

    Fast and effective global covariance pooling network for image steganalysis,

    X. Deng, B. Chen, W. Luo, and D. Luo, “Fast and effective global covariance pooling network for image steganalysis,” in Proc. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2019, pp. 230–234

  17. [17]

    Depth-wise separable convolutions and multi-level pooling for an efficient spatial cnn-based steganalysis,

    R. Zhang, F. Zhu, J. Liu, and G. Liu, “Depth-wise separable convolutions and multi-level pooling for an efficient spatial cnn-based steganalysis,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1138–1150, 2020

  18. [18]

    How to pretrain for steganalysis,

    J. Butora, Y. Yousfi, and J. Fridrich, “How to pretrain for steganalysis,” in Proc. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2021, pp. 143–148

  19. [19]

    Imagenet pretrained cnns for jpeg steganalysis,

    Y. Yousfi, J. Butora, E. Khvedchenya, and J. Fridrich, “Imagenet pretrained cnns for jpeg steganalysis,” in Proc. IEEE Int. Workshop on Information Forensics and Security (WIFS), 2020, pp. 1–6

  20. [20]

    Improving efficientnet for jpeg steganalysis,

    Y. Yousfi, J. Butora, and J. Fridrich, “Improving efficientnet for jpeg steganalysis,” in Proc. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2021, pp. 149–157

  21. [21]

    Mobilenets: Efficient convolutional neural networks for mobile vision applications,

    A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017

  22. [22]

    Rectifier nonlinearities improve neural network acoustic models,

    A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013

  23. [23]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,

    S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, vol. 107, pp. 3–11, 2018

  24. [24]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,

    K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034

  25. [25]

    Spatial pyramid pooling in deep convolutional networks for visual recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015

  26. [26]

    Minimizing the embedding impact in steganography,

    T. Filler, J. Judas, and J. Fridrich, “Minimizing the embedding impact in steganography,” in Media Watermarking, Security, and Forensics, ser. Proc. SPIE, vol. 6505, 2007, p. 650502

  27. [27]

    Universal distortion function for steganography in an arbitrary domain,

    V. Holub, J. Fridrich, and T. Denemark, “Universal distortion function for steganography in an arbitrary domain,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 12, pp. 2408–2424, 2015

  28. [28]

    Using statistical image model for jpeg steganography: Uniform embedding revisited,

    L. Guo, J. Ni, W. Su, C. Tang, and Y. Q. Shi, “Using statistical image model for jpeg steganography: Uniform embedding revisited,” IEEE Transactions on Information Forensics and Security, vol. 10, no. 12, pp. 2669–2680, 2015

  29. [29]

    A new cost function for spatial image steganography,

    B. Li, M. Wang, J. Huang, and X. Li, “A new cost function for spatial image steganography,” in IEEE International Conference on Image Processing (ICIP), 2014, pp. 4206–4210

  30. [30]

    The alaska steganalysis challenge: A first step towards steganalysis in the wild,

    R. Cogranne, P. Bas, J. Fridrich, et al., “The alaska steganalysis challenge: A first step towards steganalysis in the wild,” in Proceedings of the ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2019, pp. 125–137

  31. [31]

    Breaking alaska: Color separation for steganalysis in jpeg domain,

    Y. Yousfi, J. Butora, J. Fridrich, and Q. Giboulot, “Breaking alaska: Color separation for steganalysis in jpeg domain,” in Proc. ACM Workshop on Information Hiding and Multimedia Security (IH&MMSec), 2019, pp. 138–149

JINCHEOL AN received the B.S. degree in mechanical engineering from Chung-Ang University, Seoul, South Korea, in 2025, where he is curr...