Recognition: 2 theorem links · Lean Theorem
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3
The pith
A Mamba-based multimodal model processes video sequences with linear complexity, captioning them competitively while running about three times faster than Transformer equivalents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ABMamba is a fully open multimodal large language model that uses deep state space models as its language backbone and adds an Aligned Hierarchical Bidirectional Scan module to process video inputs across multiple temporal resolutions. The result is linear computational complexity with competitive captioning performance on VATEX and MSR-VTT and roughly three times the throughput of typical Transformer-based MLLMs.
What carries the argument
The Aligned Hierarchical Bidirectional Scan module, which aligns and scans video features bidirectionally at several temporal resolutions before feeding them into the Mamba backbone.
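To make this mechanism concrete, the sketch below shows one way a hierarchical bidirectional scan over per-frame video features could work: downsample in time at several strides, scan each level forward and backward, and align the results back to the original frame rate. The function names, the toy decay-based recurrence, and the averaging merge are illustrative assumptions; the provided text does not specify the paper's module at this level of detail.

```python
# Minimal sketch (not the authors' code): a hierarchical bidirectional scan
# over per-frame video features. Shapes, the toy recurrence, and the merge
# rule are assumptions made purely for illustration.
import numpy as np

def linear_scan(x, decay=0.9):
    """Toy causal scan: h_t = decay * h_{t-1} + x_t (stand-in for an SSM recurrence)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = decay * h + x_t
        out[t] = h
    return out

def bidirectional_scan(x):
    """Scan forward and backward in time, then sum the two passes."""
    fwd = linear_scan(x)
    bwd = linear_scan(x[::-1])[::-1]
    return fwd + bwd

def hierarchical_bidirectional_scan(frames, levels=3):
    """Scan at several temporal resolutions and align results back to frame rate."""
    T, _ = frames.shape
    merged = np.zeros_like(frames)
    for lvl in range(levels):
        stride = 2 ** lvl
        coarse = frames[::stride]                           # downsample in time
        scanned = bidirectional_scan(coarse)                # bidirectional scan at this scale
        upsampled = np.repeat(scanned, stride, axis=0)[:T]  # align to original frame rate
        merged += upsampled / levels                        # simple average across scales
    return merged

if __name__ == "__main__":
    feats = np.random.randn(32, 64)   # 32 frames, 64-dim features (illustrative)
    print(hierarchical_bidirectional_scan(feats).shape)  # (32, 64)
```

The point of the sketch is the cost profile: every pass is a single linear recurrence over the sequence, so the total work grows linearly with the number of frames rather than quadratically as with attention.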
If this is right
- Video captioning becomes feasible for longer sequences because computational cost grows linearly rather than quadratically with length.
- Fully open MLLMs gain practicality for video tasks by reaching competitive accuracy at three times the throughput of attention-based models.
- The state space model backbone can serve as a drop-in replacement for attention in multimodal video settings without custom retraining for each task.
- Multiple-resolution scanning preserves temporal structure across scales, supporting caption quality that matches existing MLLMs on standard benchmarks.
Where Pith is reading between the lines
- The same linear scan approach could be tested on related tasks such as video question answering or temporal action localization to measure whether efficiency gains transfer.
- For deployment on edge devices, the reduced memory footprint from linear complexity might allow real-time captioning of streaming video where Transformer models currently cannot run.
- If the multi-resolution alignment proves robust, it could be adapted to other sequential modalities like audio or sensor data streams that share long-range dependency challenges.
Load-bearing premise
The Aligned Hierarchical Bidirectional Scan module captures the key temporal dependencies in videos at multiple resolutions without losing information or needing extra task-specific adjustments.
What would settle it
Running ABMamba on longer video sequences from VATEX and measuring both captioning accuracy and actual wall-clock throughput to check whether performance falls below Transformer baselines or the reported speed gain disappears.
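A throughput check of the kind described could look like the following sketch, which times a placeholder captioning callable on progressively longer synthetic feature sequences. The `caption_fn` callable, the sequence lengths, and the feature dimension are assumptions for illustration, not the paper's evaluation protocol.

```python
# Sketch of a wall-clock throughput check over increasing sequence lengths.
# `caption_fn` is a placeholder for whichever captioning model is under test.
import time
import numpy as np

def measure_throughput(caption_fn, num_frames_list=(64, 256, 1024), feat_dim=64, repeats=3):
    results = {}
    for T in num_frames_list:
        frames = np.random.randn(T, feat_dim)      # stand-in for extracted video features
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            caption_fn(frames)
            timings.append(time.perf_counter() - start)
        results[T] = T / min(timings)              # frames processed per second (best of N)
    return results

if __name__ == "__main__":
    dummy = lambda frames: frames.mean(axis=0)     # trivial stand-in model
    for T, fps in measure_throughput(dummy).items():
        print(f"{T:5d} frames -> {fps:,.0f} frames/s")
```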
Original abstract
In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ABMamba, a fully open multimodal large language model for video captioning that replaces quadratic Transformer attention with Deep State Space Models (Mamba) as the language backbone and adds a novel Aligned Hierarchical Bidirectional Scan module to handle multi-resolution temporal dependencies in video sequences. It claims linear computational complexity together with competitive performance on standard benchmarks such as VATEX and MSR-VTT and approximately three times higher throughput than typical MLLMs.
Significance. If the empirical claims are substantiated with detailed metrics and ablations, the work would provide a concrete demonstration that state-space models can serve as a scalable backbone for open video MLLMs, addressing a key limitation of attention-based architectures on long sequences.
major comments (2)
- Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.
- §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.
minor comments (2)
- The abstract and introduction would benefit from explicit citation of the exact Mamba variant (e.g., Mamba-2 or original) and the precise video-captioning metrics used (CIDEr, BLEU-4, etc.).
- Notation for the state-space parameters and the bidirectional scan directions could be formalized with a short equation block to improve reproducibility.
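For orientation, a generic version of the kind of equation block this comment asks for is shown below: the standard discretized state-space recurrence used in Mamba-style layers, with the backward direction obtained by running the same recurrence over the time-reversed sequence. Parameter names follow common usage (e.g., Gu & Dao [16]), and the two directions share parameters here only for compactness; whether ABMamba's notation matches cannot be determined from the provided text.

```latex
% Standard discretized SSM recurrence (Mamba-style); notation is generic,
% not necessarily the paper's, and both scan directions share parameters
% here purely for brevity.
\begin{aligned}
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t, &\qquad y_t &= C\, h_t,\\
\overleftarrow{h}_t &= \bar{A}\, \overleftarrow{h}_{t+1} + \bar{B}\, x_t, &\qquad
y_t^{\text{bi}} &= C\bigl(h_t + \overleftarrow{h}_t\bigr).
\end{aligned}
```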
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the potential impact of ABMamba. We address the two major comments point by point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.
  Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we will update the abstract to reference concrete metrics (e.g., CIDEr scores on VATEX and MSR-VTT), name the primary baselines (such as Video-LLaMA and other MLLMs), point to the relevant tables for the throughput comparison, and note that the reported gains are consistent across runs. These additions will make the central claims directly verifiable while preserving the abstract's brevity. revision: yes
- Referee: §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.
  Authors: We acknowledge that the current exposition in §3 remains somewhat high-level. In the revision we will expand this section with a precise algorithmic description and mathematical formulation of the alignment procedure, including how the hierarchical bidirectional scans are synchronized across resolutions and how state representations are merged without discarding information. We will also add a dedicated paragraph and supporting ablation demonstrating that the module operates without task-specific hyper-parameter tuning, thereby confirming both generality and information preservation. revision: yes
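One possible formalization of the alignment-and-merge step promised above, stated only as an editorial illustration: pool the feature sequence at stride 2^l, scan each level bidirectionally, upsample back to the original temporal resolution, and combine with per-level weights. The operators and the weights below are assumptions, not taken from the paper.

```latex
% Illustrative only: Pool, BiScan, Up, and the weights w_l are assumed,
% not the paper's actual definitions.
\begin{aligned}
x^{(l)} &= \operatorname{Pool}_{2^{l}}(x), \\
y^{(l)} &= \operatorname{BiScan}\bigl(x^{(l)}\bigr), \\
y &= \sum_{l=0}^{L-1} w_l \,\operatorname{Up}_{2^{l}}\bigl(y^{(l)}\bigr).
\end{aligned}
```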
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper presents ABMamba as an architectural proposal extending state-space models with a new Aligned Hierarchical Bidirectional Scan module for video sequences. All performance claims (competitive accuracy on VATEX/MSR-VTT plus 3x throughput) are framed as empirical outcomes measured on external public benchmarks. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text that reduce any result to its own inputs by construction. The design choices are described at the level of module composition rather than self-referential definitions or uniqueness theorems imported from prior self-work. The central claims therefore remain independent of the paper's own fitted values or internal renamings.
Axiom & Free-Parameter Ledger
invented entities (1)
- Aligned Hierarchical Bidirectional Scan module: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Unclear relation between the paper passage and the cited Recognition theorem.
  Paper passage: "ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2024)
- [2] Banerjee, S., Lavie, A.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: ACL. pp. 65–72 (2005)
- [3] Baron, E., Zimerman, I., Wolf, L.: A 2-Dimensional State Space Layer for Spatial Inductive Bias. In: ICLR (2024)
- [4] Black, K., Brown, N., Driess, D., et al.: π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)
- [5]
- [6] Brohan, A., Brown, N., Carbajal, J., et al.: RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817 (2022)
- [7] Carreira, J., Noland, E., Banki-Horvath, A., et al.: A Short Note about Kinetics-
- [8] arXiv preprint arXiv:1808.01340 (2018)
- [9] Chen, D., Dolan, W.: Collecting highly parallel data for paraphrase evaluation. In: ACL. pp. 190–200 (2011)
- [10] Chen, Z., Wu, J., Wang, W., et al.: InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In: CVPR. pp. 24185–24198 (2024)
- [11] Chu, X., Qiao, L., Zhang, X., et al.: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024)
- [12] Cui, C., Ma, Y., Cao, X., et al.: Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles. In: WACVW. pp. 902–909 (2024)
- [13] Dao, T., Gu, A.: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In: ICML (2024)
- [14] Deitke, M., Clark, C., Lee, S., et al.: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. arXiv preprint arXiv:2409.17146 (2024)
- [15] Goko, M., Kambara, M., Saito, D., et al.: Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations. In: CoRL (2024)
- [16] Gu, A., Dao, T.: Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In: CoLM (2024)
- [17] Gu, A., Goel, K., Ré, C.: Efficiently Modeling Long Sequences with Structured State Spaces. In: ICLR (2022)
- [18] Gu, A., Johnson, I., Goel, K., et al.: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. In: NeurIPS (2021)
- [19] He, H., Bai, Y., et al.: MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection. In: NeurIPS. pp. 71162–71187 (2024)
- [20] Hu, V., Baumann, S.A., Gui, M., et al.: ZigMa: A DiT-style Zigzag Mamba Diffusion Model. In: ECCV (2024)
- [21] Kalman, R.: A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82(1), 35–45 (1960)
- [22] Karamcheti, S., Nair, S., Balakrishna, A., et al.: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models. In: ICML (2024)
- [23] Katharopoulos, A., Vyas, A., Pappas, N., et al.: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In: ICML (2020)
- [24] Kim, J., Pertsch, K., Karamcheti, S., et al.: OpenVLA: An Open-Source Vision-Language-Action Model. In: CoRL. pp. 2679–2713 (2024)
- [25] Krishna, R., Hata, K., Ren, F., et al.: Dense-Captioning Events in Videos. In: ICCV. pp. 706–715 (2017)
- [26] Li, B., Zhang, Y., Guo, D., et al.: LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326 (2024)
- [27] Li, C., Gan, Z., et al.: Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Found. Trends Comput. Graph. Vis. 16(1-2), 1–214 (2024)
- [28] Li, K., Li, X., Wang, Y., et al.: VideoMamba: State Space Model for Efficient Video Understanding. In: ECCV (2024)
- [29] Li, Y., Song, Y., Cao, L., et al.: TGIF: A New Dataset and Benchmark on Animated GIF Description. In: CVPR. pp. 4641–4650 (2016)
- [30] Liang, Z., Xu, Y., Hong, Y., et al.: A Survey of Multimodel Large Language Models. In: CAICE. pp. 405–409 (2024)
- [31] Lieber, O., Lenz, B., Bata, H., et al.: Jamba: Hybrid Transformer-Mamba Language Models. In: ICLR (2025)
- [32] Lin, B., Ye, Y., Zhu, B., et al.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In: EMNLP. pp. 5971–5984 (2024)
- [33] Lin, C.: ROUGE: A Package For Automatic Evaluation Of Summaries. In: ACL. pp. 74–81 (2004)
- [34] Liu, H., Li, C., Li, Y., et al.: Improved Baselines with Visual Instruction Tuning. In: CVPR. pp. 26296–26306 (2024)
- [35] Liu, H., Li, C., Wu, Q., et al.: Visual Instruction Tuning. In: NeurIPS. pp. 34892–34916 (2023)
- [36] Liu, X., Shu, Y., Liu, Z., et al.: Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding. arXiv preprint arXiv:2503.18478 (2025)
- [37] Liu, X., Zhang, C., Zhang, L.: Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv preprint arXiv:2405.04404 (2024)
- [38] Liu, Y., Tian, Y., Zhao, Y., et al.: VMamba: Visual State Space Model. In: NeurIPS. pp. 103031–103063 (2024)
- [39] Maaz, M., Rasheed, H., et al.: Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In: ACL. pp. 12585–12602 (2024)
- [40] Miech, A., Zhukov, D., Alayrac, J.B., et al.: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV. pp. 2630–2640 (2019)
- [41] Nguyen, E., Goel, K., Gu, A., et al.: S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces. In: NeurIPS. pp. 2846–2861 (2022)
- [42] Nguyen, T., Bin, Y., Xiao, J., et al.: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. In: ACL. pp. 3636–3657 (2024)
- [43] Oquab, M., Darcet, T., Moutakanni, T., et al.: DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024)
- [44] Papineni, K., Roukos, S., Ward, T., et al.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: ACL. pp. 311–318 (2002)
- [45] Patro, B., Agneeswaran, V.: SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series. arXiv preprint arXiv:2403.15360 (2024)
- [46] Patro, N., Agneeswaran, S.: Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges. arXiv preprint arXiv:2404.16112 (2024)
- [47] Pini, S., Cornia, M., Bolelli, F., et al.: M-VAD Names: A Dataset for Video Captioning with Naming. Multimedia Tools Appl. 78(10), 14007–14027 (2019)
- [48] Qiao, Y., Yu, Z., Guo, L., et al.: VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv preprint arXiv:2403.13600 (2024)
- [49] Radford, A., Kim, J.W., Hallacy, C., et al.: Learning Transferable Visual Models from Natural Language Supervision. In: ICML. pp. 8748–8763 (2021)
- [50] Rahman, M.M., Tutul, A.A., Nath, A., et al.: Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv preprint arXiv:2410.03105 (2024)
- [51] Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., et al.: Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530 (2024)
- [52] Rohrbach, A., Rohrbach, M., Qiu, W., et al.: Coherent Multi-Sentence Video Description with Variable Level of Detail. In: GCPR. pp. 184–195 (2014)
- [53] Sarto, S., Barraco, M., Cornia, M., et al.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: CVPR. pp. 6914–6924 (2023)
- [54] Shiyu, W., Haixu, W., Xiaoming, S., Tengge, H., Huakun, L., Lintao, M., James, Z., Jun, Z.: TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In: ICLR (2024)
- [55] Smith, J., Warrington, A., Linderman, S.: Simplified State Space Layers for Sequence Modeling. In: ICLR (2023)
- [56] Sun, Y., Dong, L., Huang, S., et al.: Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621 (2023)
- [57] Tang, Y., Bi, J., Xu, S., et al.: Video Understanding with Large Language Models: A Survey. arXiv preprint arXiv:2312.17432 (2023)
- [58] Tian, X., Gu, J., Li, B., et al.: DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. In: CoRL (2024)
- [59] Tong, C., He, S., Shao, Z., et al.: G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o. In: AAAI. pp. 7419–7427 (2025)
- [60] Tong, S., Brown, E., Wu, P., et al.: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In: NeurIPS. pp. 87310–87356 (2024)
- [61] Vedantam, R., Zitnick, L., Parikh, D.: CIDEr: Consensus-based Image Description Evaluation. In: CVPR. pp. 4566–4575 (2015)
- [62] Wang, X., Wang, S., et al.: State Space Model for New-Generation Network Alternative to Transformers: A Survey. arXiv preprint arXiv:2404.09516 (2024)
- [63] Wang, X., Wu, J., Chen, J., et al.: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In: ICCV. pp. 4580–4590 (2019)
- [64] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In: ICLR (2023)
- [65] Xing, Y., Lan, X., Wang, R., et al.: EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment. In: ICLR (2025)
- [66] Xu, J., Mei, T., Yao, T., et al.: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In: CVPR. pp. 5288–5296 (2016)
- [67] Xu, R., Yang, S., Wang, Y., et al.: Visual Mamba: A Survey and New Outlooks. arXiv preprint arXiv:2404.18861 (2024)
- [68] Xu, Z., Zhang, Y., Xie, E., et al.: DriveGPT4: Interpretable End-to-End Autonomous Driving via Large Language Model. IEEE RA-L 9(10), 8186–8193 (2024)
- [69] Yue, X., Song, Y., Asai, A., et al.: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. In: ICLR (2024)
- [70] Zhai, X., Mustafa, B., Kolesnikov, A., et al.: Sigmoid Loss for Language Image Pre-Training. In: ICCV. pp. 11975–11986 (2023)
- [71] Zhang, B., Li, K., et al.: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. arXiv preprint arXiv:2501.13106 (2025)
- [72] Zhang, Y., Wu, J., Li, W., et al.: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713 (2024)
- [73] Zhang, Z., Chong, K.: Comparison between First-Order Hold with Zero-Order Hold in Discretization of Input-Delay Nonlinear Systems. In: ICCAS. pp. 2892–2896 (2007)
- [74] Zhao, H., Zhang, M., Zhao, W., et al.: Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. In: AAAI. pp. 10421–10429 (2025)
- [75] Zhou, L., Xu, C., Corso, J.: Towards Automatic Learning of Procedures from Web Instructional Videos. In: AAAI. pp. 7590–7598 (2018)
- [76] Zhu, L., Liao, B., Zhang, Q., et al.: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In: ICML (2024)
- [77] Zou, B., Guo, Z., Hu, X., et al.: RhythmMamba: Fast, Lightweight, and Accurate Remote Physiological Measurement. In: AAAI. pp. 11077–11085 (2025)