pith. machine review for the scientific record.

arxiv: 2604.26461 · v1 · submitted 2026-04-29 · 💻 cs.CV

Recognition: unknown

PKS⁴: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords video understanding · state space models · kinematic priors · action recognition · efficient temporal modeling · parameter-efficient adaptation · linear complexity scanning

The pith

PKS^4 inserts one plug-and-play module that extracts kinematic priors from frame differences to drive parallel selective state space scanners for video temporal modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the core tension in video understanding: dense attention costs grow quadratically with sequence length, while standard state space models lose 2D spatial structure and demand heavy masked pre-training to regain it. The method keeps a conventional 2D vision backbone for spatial features and adds only a single PKS^4 module that first builds kinematic priors from inter-frame differences and correlations. These priors then steer linear-complexity state space models that scan in parallel along time at each spatial location, adaptively adjusting update rates and read-write behavior at every step. The authors report that this reaches state-of-the-art accuracy on action recognition benchmarks while converging in just 20 epochs and using roughly ten times less training compute than pure video state space models.
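
The abstract gives no equations for the prior itself. As a rough illustration, a frame-difference-plus-correlation prior over patch tokens might look like the sketch below; the tensor layout, the cosine-similarity correlation operator, and the gating by one-minus-correlation are illustrative assumptions, not the paper's encoder.

    import torch
    import torch.nn.functional as F

    def kinematic_prior(x: torch.Tensor) -> torch.Tensor:
        """Illustrative kinematic prior from inter-frame differences and
        correlations. x: patch tokens of shape (B, T, H, W, C). Every
        design choice here is an assumption, not the paper's encoder."""
        # Variation operator: first-order temporal difference, a crude
        # stand-in for local displacement and motion-boundary cues.
        diff = x - torch.roll(x, shifts=1, dims=1)
        diff[:, 0] = 0.0  # the first frame has no predecessor

        # Correlation operator: cosine similarity between consecutive
        # frames at the same spatial site.
        xn = F.normalize(x, dim=-1)
        corr = (xn * torch.roll(xn, shifts=1, dims=1)).sum(-1, keepdim=True)
        corr[:, 0] = 1.0

        # Low correlation (i.e., motion) amplifies the difference cue.
        return diff * (1.0 - corr)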

Core claim

The authors claim that kinematic priors derived from local displacements and motion boundaries can guide a set of parallel, temporally selective state space scanners attached to a fixed 2D backbone, thereby supplying temporal dynamics at linear cost without breaking spatial layout or requiring multi-layer adapters and extensive pre-training.
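
Read in standard selective state space (Mamba-style) notation, the claim amounts to a per-location discretized linear recurrence whose step size and read-write matrices are conditioned on the kinematic prior; the equations below follow that convention and are a reconstruction, not formulas from the paper:

$$h_t^{(p)} = e^{\Delta_t^{(p)} A}\, h_{t-1}^{(p)} + \Delta_t^{(p)} B_t^{(p)} x_t^{(p)}, \qquad y_t^{(p)} = C_t^{(p)} h_t^{(p)}$$

where $p$ indexes a spatial location, $x_t^{(p)}$ is the patch token at that location and time, and $\Delta_t^{(p)}$, $B_t^{(p)}$, $C_t^{(p)}$ are functions of the kinematic prior: the "update speed" maps to $\Delta$, the "read-write strategy" to $B$ and $C$. One scan over $t$ costs $O(T)$ per location, hence linear complexity overall.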

What carries the argument

The Parallel Kinematic Selective State Space Scanner (PKS^4): a single plug-and-play module whose Kinematic Prior Encoder produces motion cues that modulate the speed and read-write gates of linear state space models running in parallel across spatial positions along the time axis.
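
A minimal PyTorch sketch of the parallel per-location scanning idea, using a diagonal selective SSM as a stand-in for the paper's scanner; the module layout, the softplus discretization of the step size, and the prior-conditioned gates are assumptions for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PerLocationSelectiveScan(nn.Module):
        """Illustrative stand-in for the parallel scanners: a diagonal
        selective SSM run independently at every spatial site along
        time. Names and sizes are assumptions, not the authors' design."""

        def __init__(self, dim: int):
            super().__init__()
            self.log_a = nn.Parameter(torch.zeros(dim))  # state decay rate
            self.to_delta = nn.Linear(dim, dim)          # update speed from prior
            self.to_b = nn.Linear(dim, dim)              # write gate from prior
            self.to_c = nn.Linear(dim, dim)              # read gate from prior

        def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
            # x, prior: (B, T, H, W, C). Flatten space so that every
            # location scans in parallel as its own sequence.
            b, t, h, w, c = x.shape
            x = x.reshape(b, t, h * w, c)
            prior = prior.reshape(b, t, h * w, c)

            delta = F.softplus(self.to_delta(prior))           # > 0
            a_bar = torch.exp(-delta * torch.exp(self.log_a))  # per-step decay
            b_in = delta * self.to_b(prior) * x                # modulated write
            c_out = self.to_c(prior)                           # modulated read

            # Sequential scan along time only: O(T) per location, and
            # spatial positions never enter the scan order.
            state = x.new_zeros(b, h * w, c)
            ys = []
            for step in range(t):
                state = a_bar[:, step] * state + b_in[:, step]
                ys.append(c_out[:, step] * state)
            return torch.stack(ys, dim=1).reshape(b, t, h, w, c)

A production version would replace the Python loop with a parallel associative scan; the sketch only makes the structural points concrete: cost grows linearly in T, and the 2D layout is preserved because each spatial location keeps its own recurrence.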

If this is right

  • State-of-the-art accuracy on both spatial-heavy and temporal-heavy action recognition benchmarks.
  • Convergence after only 20 training epochs.
  • Roughly 10 times lower training compute than pure video state space models.
  • Linear computational complexity that avoids quadratic attention costs and the activation memory of deep adapter stacks.
  • Retention of a standard 2D backbone's spatial semantics without disruption from global temporal scanning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same kinematic-prior extraction step could be attached to other backbone families or modalities to add temporal or sequential awareness at low cost.
  • Because the module is single and plug-and-play, it might allow existing image models to be upgraded for video tasks with minimal architectural change.
  • If the priors prove robust, the method could extend to longer untrimmed videos where memory limits currently block attention-based approaches.
  • Applying the same parallel-per-location scanning idea to prediction or generation tasks would test whether the kinematic modulation generalizes beyond classification.

Load-bearing premise

Kinematic priors computed from inter-frame differences and correlations alone can sufficiently steer the state space models to capture all necessary temporal dynamics while preserving spatial relationships.

What would settle it

Train the model on a standard action recognition benchmark such as Kinetics-400 or Something-Something and measure whether it matches or exceeds reported state-of-the-art accuracy after only 20 epochs while using at least an order of magnitude less training compute than comparable video state space model baselines.
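
The efficiency half of that test reduces to a single ratio. A toy acceptance check, with every number a placeholder rather than a measurement:

    # Hypothetical acceptance check for the efficiency claim; the
    # GPU-hour figures are placeholders, not values from the paper.
    pks4_gpu_hours = 120.0        # measured cost of PKS^4, 20 epochs
    baseline_gpu_hours = 1400.0   # measured cost of a pure video SSM

    ratio = baseline_gpu_hours / pks4_gpu_hours
    print(f"training-compute reduction: {ratio:.1f}x")
    assert ratio >= 10.0, "~10x lower training compute not met"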

Figures

Figures reproduced from arXiv: 2604.26461 by Hailun Zhang, Lingjie Zeng, Qijun Zhao, Xiwen Wang.

Figure 1: (a) Global attention suffers from quadratic computational complexity, leading to a massive computational bottleneck. (b) Deep adapters require storing intermediate activations across the entire backbone for back-propagation, severely suffering from an activation-memory (VRAM) bottleneck. (c) State Space Models (SSMs) flatten 3D video tokens into a 1D sequence, destroying the innate 2D spatial relationships…

Figure 2: (a) Comparison of training memory usage with…

Figure 3: Overview of PKS^4. Given intermediate token features from a ViT layer, we first split CLS and patch tokens. The Kinematic Prior Encoder processes patch tokens through a temporal sliding window. The correlation and variation operators are sequentially applied to explicitly capture rich kinematic priors. To preserve innate spatial structures, we deploy parallel scanning along the temporal dimension for each sp…
original abstract

Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Parallel Kinematic Selective State Space Scanners (PKS^4), a single plug-and-play module inserted into a standard 2D vision backbone for video understanding. A Kinematic Prior Encoder extracts local displacements and motion boundaries from inter-frame correlations and differences; these priors modulate parallel temporal SSM scanners operating independently at each spatial location. The design avoids quadratic temporal attention, multi-layer adapters, and global scanning while claiming linear complexity. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks are reported to achieve state-of-the-art performance, with convergence in 20 epochs and approximately 10× lower training compute than pure video SSMs.

Significance. If the performance and efficiency claims hold under rigorous validation, the work would offer a promising route to scalable video modeling by preserving 2D spatial structure through parallel per-location scanning while injecting kinematic priors to accelerate convergence. The reported 10× training-compute reduction and avoidance of deep adapter memory overhead would be practically significant for long-sequence video tasks.

major comments (1)
  1. [Method description (paragraph 2)] The architecture description states that a single PKS^4 module suffices, with parallel scanners driven solely by local inter-frame kinematic priors. Because each scanner processes its spatial site independently along time, cross-location motion coherence and long-range temporal structure must emerge only from the backbone features and prior modulation. This premise is load-bearing for the SOTA claim on temporal-heavy benchmarks and the 20-epoch convergence; without ablations on sequence length, number of inserted modules, or long-horizon action subsets, the sufficiency of the single-module design remains unverified.
minor comments (2)
  1. [Abstract] The abstract asserts SOTA results and a 10× compute reduction but supplies no numerical values, dataset names, baseline comparisons, or table references; a concise summary of key metrics should be added.
  2. [Title and Abstract] The acronym PKS^4 is used in the title and abstract before its expansion; the full name should appear on first use.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to include the suggested ablations, thereby strengthening the validation of the single-module design.

point-by-point responses
  1. Referee: The architecture description states that a single PKS^4 module suffices, with parallel scanners driven solely by local inter-frame kinematic priors. Because each scanner processes its spatial site independently along time, cross-location motion coherence and long-range temporal structure must emerge only from the backbone features and prior modulation. This premise is load-bearing for the SOTA claim on temporal-heavy benchmarks and the 20-epoch convergence; without ablations on sequence length, number of inserted modules, or long-horizon action subsets, the sufficiency of the single-module design remains unverified.

    Authors: We agree that explicit ablations would further substantiate the sufficiency of the single PKS^4 module. In the revised manuscript we will add: (i) experiments inserting 1 vs. 2–3 PKS^4 modules, (ii) results across varying sequence lengths, and (iii) a breakdown on long-horizon action subsets within the temporal-heavy benchmarks. Our existing results on temporal-heavy datasets already indicate that local kinematic priors suffice to drive effective per-location temporal scanning; cross-location coherence and longer-range structure are supplied by the frozen 2D backbone features, enabling SOTA performance and 20-epoch convergence without multi-layer adapters or global scanning. revision: yes

Circularity Check

0 steps flagged

No circularity; the architectural proposal and empirical results are independent of self-referential definitions or fitted inputs.

full rationale

The paper introduces PKS^4 as a plug-and-play module that extracts kinematic priors from inter-frame correlations/differences and deploys parallel per-location temporal SSM scanners on a retained 2D backbone. All performance claims (SOTA on spatial- and temporal-heavy benchmarks, 20-epoch convergence, ~10x lower training compute) are presented as outcomes of experiments rather than any derivation, equation, or prediction that reduces by construction to the method's own inputs or fitted parameters. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises, and the design choices are explicitly motivated as alternatives to attention, multi-layer adapters, and masked pre-training. The derivation chain is therefore self-contained as an empirical architectural contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on two newly introduced components whose effectiveness is asserted but not independently evidenced in the provided abstract. No explicit free parameters are named. The approach assumes standard properties of state space models and kinematic feature extraction.

axioms (1)
  • domain assumption State space models can track temporal sequences with linear complexity while being guided by external priors
    Invoked when the kinematic priors are said to adaptively modulate SSM update speeds and read-write strategies.
invented entities (2)
  • Kinematic Prior Encoder no independent evidence
    purpose: Extracts local displacements and motion boundaries from inter-frame correlations and differences to drive the SSMs
    New module introduced to supply motion information; no independent evidence or falsifiable prediction supplied in abstract.
  • Parallel Kinematic Selective State Space Scanners (PKS^4) no independent evidence
    purpose: Perform linear-complexity temporal scanning at each spatial location while preserving 2D structure
    Core novel module; effectiveness asserted via experimental claims but not independently verified here.

pith-pipeline@v0.9.0 · 5596 in / 1476 out tokens · 89033 ms · 2026-05-07T13:45:01.645857+00:00 · methodology

