pith. machine review for the scientific record.

arxiv: 2604.08050 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video captioning · multimodal large language models · state space models · Mamba · efficient video processing · temporal dependencies · hierarchical scan

The pith

A Mamba-based multimodal model processes video sequences with linear complexity, captioning them competitively while running about three times faster than Transformer equivalents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the quadratic scaling problem that makes current multimodal large language models impractical for long video sequences in captioning tasks. It replaces attention mechanisms with state space models and introduces a module that scans video features bidirectionally at multiple temporal resolutions to preserve dependencies. If the approach holds, it would let open models handle extended videos on ordinary hardware without the usual compute explosion, directly improving throughput on benchmarks like VATEX and MSR-VTT.
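
As a rough worked comparison (textbook asymptotics, not numbers taken from the paper): for a sequence of L video-plus-text tokens, model width d, and SSM state size N, the per-layer sequence-mixing costs scale as

```latex
% Standard cost estimates, not figures reported in the paper.
\mathrm{Cost}_{\mathrm{attention}}(L) = \mathcal{O}\!\left(L^{2} d\right), \qquad
\mathrm{Cost}_{\mathrm{SSM\;scan}}(L) = \mathcal{O}\!\left(L \, d \, N\right)
% Going from 2048 to 8192 tokens multiplies the attention term by 16,
% but the linear-time scan term only by 4.
```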

Core claim

ABMamba builds a fully open multimodal large language model on deep state space models as the language backbone and adds an Aligned Hierarchical Bidirectional Scan module that processes video inputs across multiple temporal resolutions. The result is linear computational complexity with competitive captioning performance on VATEX and MSR-VTT, at roughly three times the throughput of typical Transformer-based MLLMs.

What carries the argument

The Aligned Hierarchical Bidirectional Scan module, which aligns and scans video features bidirectionally at several temporal resolutions before feeding them into the Mamba backbone.
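
A minimal sketch of one way such a module could work, under assumptions that are ours rather than the paper's: frame features are average-pooled to a few temporal resolutions, each resolution is scanned forward and backward with a linear-time recurrence (a simple gated stand-in for Mamba's selective scan), and the results are re-aligned to the original length and fused. Every function name, the scale set, and the fusion-by-averaging choice are hypothetical.

```python
import torch
import torch.nn.functional as F

def linear_scan(x, decay=0.9):
    """Linear-time stand-in for a selective SSM scan (not the paper's kernel):
    a gated cumulative recurrence over the time axis of (B, L, D) features."""
    h = torch.zeros_like(x[:, 0])
    outputs = []
    for t in range(x.shape[1]):
        h = decay * h + (1.0 - decay) * x[:, t]
        outputs.append(h)
    return torch.stack(outputs, dim=1)

def hierarchical_bidirectional_scan(x, scales=(1, 2, 4)):
    """Hypothetical AHBS-style pass: pool to several temporal resolutions,
    scan each forward and backward, re-align to the original length, average."""
    B, L, D = x.shape
    fused = torch.zeros_like(x)
    for s in scales:
        # pool along time to a coarser resolution ((B, D, L) layout for pooling)
        xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s).transpose(1, 2)
        fwd = linear_scan(xs)                  # past-to-future pass
        bwd = linear_scan(xs.flip(1)).flip(1)  # future-to-past pass
        ys = fwd + bwd
        # align back to the original number of frame tokens
        ys = F.interpolate(ys.transpose(1, 2), size=L,
                           mode="linear", align_corners=False).transpose(1, 2)
        fused = fused + ys / len(scales)
    return fused  # (B, L, D), handed to the language-model backbone

# Toy usage: 2 clips, 64 frame tokens, 256-dim features.
feats = torch.randn(2, 64, 256)
print(hierarchical_bidirectional_scan(feats).shape)  # torch.Size([2, 64, 256])
```

The point of the sketch is the cost profile rather than the exact design: each scale needs on the order of L/s scan steps, so the whole pass stays linear in the number of frame tokens.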

If this is right

  • Video captioning becomes feasible for longer sequences because computational cost grows linearly rather than quadratically with length.
  • Fully open MLLMs gain practicality for video tasks by reaching competitive accuracy at three times the throughput of attention-based models.
  • The state space model backbone can serve as a drop-in replacement for attention in multimodal video settings without custom retraining for each task.
  • Multiple-resolution scanning preserves temporal structure across scales, supporting caption quality that matches existing MLLMs on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same linear scan approach could be tested on related tasks such as video question answering or temporal action localization to measure whether efficiency gains transfer.
  • For deployment on edge devices, the reduced memory footprint from linear complexity might allow real-time captioning of streaming video where Transformer models currently cannot run.
  • If the multi-resolution alignment proves robust, it could be adapted to other sequential modalities like audio or sensor data streams that share long-range dependency challenges.

Load-bearing premise

The Aligned Hierarchical Bidirectional Scan module captures the key temporal dependencies in videos at multiple resolutions without losing information or needing extra task-specific adjustments.

What would settle it

Running ABMamba on longer video sequences from VATEX and measuring both captioning accuracy and actual wall-clock throughput to check whether performance falls below Transformer baselines or the reported speed gain disappears.
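
A hedged sketch of that measurement, assuming only that each model exposes some callable that takes a batch of video features and returns generated token ids; the function and variable names are placeholders, not the paper's code.

```python
import time

def measure_throughput(generate_fn, batches, warmup=2):
    """Wall-clock generated-tokens-per-second for any captioning callable.

    `generate_fn(batch)` stands in for whichever decoding entry point
    ABMamba or a Transformer baseline provides; on GPU, synchronize the
    device (e.g. torch.cuda.synchronize()) before reading the clock.
    """
    for batch in batches[:warmup]:          # warm up kernels and caches
        generate_fn(batch)
    n_tokens = 0
    start = time.perf_counter()
    for batch in batches[warmup:]:
        outputs = generate_fn(batch)
        n_tokens += sum(len(ids) for ids in outputs)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Running the same harness on progressively longer VATEX clips, together with CIDEr/BLEU-4 scoring of the outputs, would show whether both the roughly threefold throughput gap and the accuracy parity survive at lengths beyond those reported.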

Figures

Figures reproduced from arXiv: 2604.08050 by Daichi Yashima, Komei Sugiura, Seitaro Otsuki, Shuhei Kurita, Shuntaro Suzuki, Yusuke Oda.

Figure 1. A typical use case of ABMamba. ABMamba generates relevant and de…
Figure 2. The architecture of ABMamba. Given a video with a language prompt, …
Figure 3. The overview of Aligned Hierarchical Bidirectional Scan (AHBS) mod…
Figure 4. Qualitative results of ABMamba and baseline methods on the VATEX …
Figure 5. Failure case from ABMamba and baselines on the VATEX benchmark.
read the original abstract

In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ABMamba, a fully open multimodal large language model for video captioning that replaces quadratic Transformer attention with Deep State Space Models (Mamba) as the language backbone and adds a novel Aligned Hierarchical Bidirectional Scan module to handle multi-resolution temporal dependencies in video sequences. It claims linear computational complexity together with competitive performance on standard benchmarks such as VATEX and MSR-VTT and approximately three times higher throughput than typical MLLMs.

Significance. If the empirical claims are substantiated with detailed metrics and ablations, the work would provide a concrete demonstration that state-space models can serve as a scalable backbone for open video MLLMs, addressing a key limitation of attention-based architectures on long sequences.

major comments (2)
  1. Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.
  2. §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.
minor comments (2)
  1. The abstract and introduction would benefit from explicit citation of the exact Mamba variant (e.g., Mamba-2 or original) and the precise video-captioning metrics used (CIDEr, BLEU-4, etc.).
  2. Notation for the state-space parameters and the bidirectional scan directions could be formalized with a short equation block to improve reproducibility.
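
A sketch of the kind of short equation block the report has in mind, written in the generic discretized selective-SSM form (with the common first-order simplification for the discretized input matrix) rather than transcribed from the manuscript:

```latex
% Generic Mamba-style discretized recurrence; not taken from the paper.
% x_t is the input token, h_t the hidden state, y_t the output.
h_t = \bar{A}_t \, h_{t-1} + \bar{B}_t \, x_t, \qquad y_t = C_t \, h_t,
\qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t \approx \Delta_t B_t .
% One possible way to write the bidirectional scan at a given resolution:
y_t = \mathrm{SSM}_{\rightarrow}(x_{1:t}) + \mathrm{SSM}_{\leftarrow}(x_{T:t}).
```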

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential impact of ABMamba. We address the two major comments point by point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.

    Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we will update the abstract to reference concrete metrics (e.g., CIDEr scores on VATEX and MSR-VTT), name the primary baselines (such as Video-LLaMA and other MLLMs), point to the relevant tables for the throughput comparison, and note that the reported gains are consistent across runs. These additions will make the central claims directly verifiable while preserving the abstract's brevity. revision: yes

  2. Referee: §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.

    Authors: We acknowledge that the current exposition in §3 remains somewhat high-level. In the revision we will expand this section with a precise algorithmic description and mathematical formulation of the alignment procedure, including how the hierarchical bidirectional scans are synchronized across resolutions and how state representations are merged without discarding information. We will also add a dedicated paragraph and supporting ablation demonstrating that the module operates without task-specific hyper-parameter tuning, thereby confirming both generality and information preservation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents ABMamba as an architectural proposal extending state-space models with a new Aligned Hierarchical Bidirectional Scan module for video sequences. All performance claims (competitive accuracy on VATEX/MSR-VTT plus 3x throughput) are framed as empirical outcomes measured on external public benchmarks. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text that reduce any result to its own inputs by construction. The design choices are described at the level of module composition rather than self-referential definitions or uniqueness theorems imported from prior self-work. The central claims therefore remain independent of the paper's own fitted values or internal renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified effectiveness of the newly introduced scan module and on standard assumptions about Mamba scaling; no free parameters, axioms, or invented entities are quantified in the abstract.

invented entities (1)
  • Aligned Hierarchical Bidirectional Scan module no independent evidence
    purpose: Process video sequences at multiple temporal resolutions with bidirectional scanning while preserving linear complexity
    Newly proposed component whose behavior is asserted but not derived or independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1254 out tokens · 40905 ms · 2026-05-10T17:52:34.845720+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 21 canonical work pages · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2024)

  2. [2]

    Banerjee, S., Lavie, A.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: ACL. pp. 65–72 (2005)

  3. [3]

    In: ICLR (2024)

    Baron, E., Zimerman, I., Wolf, L.: A 2-Dimensional State Space Layer for Spatial Inductive Bias. In: ICLR (2024)

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., et al.: π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)

  5. [5]

    Blakeman, A., Basant, A., Khattar, A., et al.: Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. arXiv preprint arXiv:2504.03624 (2025)

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., et al.: RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817 (2022)

  7. [7]

    Carreira, J., Noland, E., Banki-Horvath, A., et al.: A Short Note about Kinetics-600.

  8. [8]

    arXiv preprint arXiv:1808.01340 (2018)

  9. [9]

    Chen, D., Dolan, W.: Collecting highly parallel data for paraphrase evaluation. In: ACL. pp. 190–200 (2011)

  10. [10]

    In: CVPR

    Chen, Z., Wu, J., Wang, W., et al.: InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In: CVPR. pp. 24185–24198 (2024)

  11. [11]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    Chu, X., Qiao, L., Zhang, X., et al.: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024)

  12. [12]

    In: WACVW

    Cui, C., Ma, Y., Cao, X., et al.: Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles. In: WACVW. pp. 902–909 (2024)

  13. [13]

    In: ICML (2024)

    Dao, T., Gu, A.: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In: ICML (2024)

  14. [14]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Deitke, M., Clark, C., Lee, S., et al.: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. arXiv preprint arXiv:2409.17146 (2024)

  15. [15]

    Goko, M., Kambara, M., Saito, D., et al.: Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations. In: CoRL (2024)

  16. [16]

    In: CoLM (2024)

    Gu, A., Dao, T.: Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In: CoLM (2024)

  17. [17]

    In: ICLR (2022)

    Gu, A., Goel, K., Ré, C.: Efficiently Modeling Long Sequences with Structured State Spaces. In: ICLR (2022)

  18. [18]

    In: NeurIPS (2021)

    Gu, A., Johnson, I., Goel, K., et al.: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. In: NeurIPS (2021)

  19. [19]

    In: NeurIPS

    He, H., Bai, Y., et al.: MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection. In: NeurIPS. pp. 71162–71187 (2024)

  20. [20]

    In: ECCV (2024)

    Hu, V., Baumann, S.A., Gui, M., et al.: ZigMa: A DiT-style Zigzag Mamba Diffusion Model. In: ECCV (2024)

  21. [21]

    Kalman, R.: A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82(1), 35–45 (1960)

  22. [22]

    In: ICML (2024)

    Karamcheti, S., Nair, S., Balakrishna, A., et al.: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models. In: ICML (2024)

  23. [23]

    In: ICML (2020)

    Katharopoulos, A., Vyas, A., Pappas, N., et al.: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In: ICML (2020)

  24. [24]

    In: CoRL

    Kim, J., Pertsch, K., Karamcheti, S., et al.: OpenVLA: An Open-Source Vision-Language-Action Model. In: CoRL. pp. 2679–2713 (2024)

  25. [25]

    In: ICCV

    Krishna, R., Hata, K., Ren, F., et al.: Dense-Captioning Events in Videos. In: ICCV. pp. 706–715 (2017)

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., et al.: LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326 (2024)

  27. [27]

    Li, C., Gan, Z., et al.: Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Found. Trends Comput. Graph. Vis. 16(1–2), 1–214 (2024)

  28. [28]

    In: ECCV (2024)

    Li, K., Li, X., Wang, Y., et al.: VideoMamba: State Space Model for Efficient Video Understanding. In: ECCV (2024)

  29. [29]

    In: CVPR

    Li, Y., Song, Y., Cao, L., et al.: TGIF: A New Dataset and Benchmark on Animated GIF Description. In: CVPR. pp. 4641–4650 (2016)

  30. [30]

    In: CAICE

    Liang, Z., Xu, Y., Hong, Y., et al.: A Survey of Multimodel Large Language Models. In: CAICE. pp. 405–409 (2024)

  31. [31]

    In: ICLR (2025)

    Lieber, O., Lenz, B., Bata, H., et al.: Jamba: Hybrid Transformer-Mamba Language Models. In: ICLR (2025)

  32. [32]

    In: EMNLP

    Lin, B., Ye, Y., Zhu, B., et al.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In: EMNLP. pp. 5971–5984 (2024)

  33. [33]

    Lin, C.: ROUGE: A Package For Automatic Evaluation Of Summaries. In: ACL. pp. 74–81 (2004)

  34. [34]

    In: CVPR

    Liu, H., Li, C., Li, Y., et al.: Improved Baselines with Visual Instruction Tuning. In: CVPR. pp. 26296–26306 (2024)

  35. [35]

    In: NeurIPS

    Liu, H., Li, C., Wu, Q., et al.: Visual Instruction Tuning. In: NeurIPS. pp. 34892–34916 (2023)

  36. [36]

    Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

    Liu, X., Shu, Y., Liu, Z., et al.: Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding. arXiv preprint arXiv:2503.18478 (2025)

  37. [37]

    arXiv preprint arXiv:2405.04404 (2024)

    Liu, X., Zhang, C., Zhang, L.: Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv preprint arXiv:2405.04404 (2024)

  38. [38]

    In: NeurIPS

    Liu, Y., Tian, Y., Zhao, Y., et al.: VMamba: Visual State Space Model. In: NeurIPS. pp. 103031–103063 (2024)

  39. [39]

    Maaz, M., Rasheed, H., et al.: Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In: ACL. pp. 12585–12602 (2024)

  40. [40]

    In: ICCV

    Miech, A., Zhukov, D., Alayrac, J.B., et al.: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV. pp. 2630–2640 (2019)

  41. [41]

    In: NeurIPS

    Nguyen, E., Goel, K., Gu, A., et al.: S4ND: modeling images and videos as multi-dimensional signals using state spaces. In: NeurIPS. pp. 2846–2861 (2022)

  42. [42]

    Nguyen, T., Bin, Y., Xiao, J., et al.: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. In: ACL. pp. 3636–3657 (2024)

  43. [43]

    TMLR (2024)

    Oquab, M., Darcet, T., Moutakanni, T., et al.: DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024)

  44. [44]

    Papineni, K., Roukos, S., Ward, T., et al.: BLEU: a Method for Automatic Eval- uation of Machine Translation. In: ACL. pp. 311–318 (2002)

  45. [45]

    arXiv preprint arXiv:2403.15360 (2024)

    Patro, B., Agneeswaran, V.: SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv preprint arXiv:2403.15360 (2024)

  46. [46]

    Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

    Patro, N., Agneeswaran, S.: Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges. arXiv preprint arXiv:2404.16112 (2024)

  47. [47]

    Multimedia Tools Appl.78(10), 14007–14027 (2019)

    Pini, S., Cornia, M., Bolelli, F., et al.: M-VAD Names: a Dataset for Video Captioning with Naming. Multimedia Tools Appl. 78(10), 14007–14027 (2019)

  48. [48]

    arXiv preprint arXiv:2403.13600 (2024)

    Qiao, Y., Yu, Z., Guo, L., et al.: Vl-mamba: Exploring state space models for multimodal learning. arXiv preprint arXiv:2403.13600 (2024)

  49. [49]

    In: ICML

    Radford, A., Kim, J.W., Hallacy, C., et al.: Learning Transferable Visual Models from Natural Language Supervision. In: ICML. pp. 8748–8763 (2021)

  50. [50]

    arXiv preprint arXiv:2410.03105 (2024)

    Rahman, M.M., Tutul, A.A., Nath, A., et al.: Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv preprint arXiv:2410.03105 (2024)

  51. [51]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., et al.: Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530 (2024)

  52. [52]

    In: GCPR

    Rohrbach, A., Rohrbach, M., Qiu, W., et al.: Coherent multi-sentence video description with variable level of detail. In: GCPR. pp. 184–195 (2014)

  53. [53]

    In: CVPR

    Sarto, S., Barraco, M., Cornia, M., et al.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: CVPR. pp. 6914–6924 (2023)

  54. [54]

    In: ICLR (2024)

    Shiyu, W., Haixu, W., Xiaoming, S., Tengge, H., Huakun, L., Lintao, M., James, Z., Jun, Z.: TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In: ICLR (2024)

  55. [55]

    In: ICLR (2023)

    Smith, J., Warrington, A., Linderman, S.: Simplified State Space Layers for Se- quence Modeling. In: ICLR (2023)

  56. [56]

    Retentive Network: A Successor to Transformer for Large Language Models

    Sun, Y., Dong, L., Huang, S., et al.: Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621 (2023)

  57. [57]

    Video Understanding with Large Language Models: A Survey

    Tang, Y., Bi, J., Xu, S., et al.: Video Understanding with Large Language Models: A Survey. arXiv preprint arXiv:2312.17432 (2023)

  58. [58]

    In: CoRL (2024)

    Tian, X., Gu, J., Li, B., et al.: DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. In: CoRL (2024)

  59. [59]

    In: AAAI

    Tong, C., He, S., Shao, Z., et al.: G-VEval: A versatile metric for evaluating image and video captions using GPT-4o. In: AAAI. pp. 7419–7427 (2025)

  60. [60]

    In: NeurIPS

    Tong, S., Brown, E., Wu, P., et al.: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. NeurIPS pp. 87310–87356 (2024)

  61. [61]

    In: CVPR

    Vedantam, R., Zitnick, L., Parikh, D.: CIDEr: Consensus-based Image Description Evaluation. In: CVPR. pp. 4566–4575 (2015)

  62. [62]

    arXiv preprint arXiv:2404.09516 (2024)

    Wang, X., Wang, S., et al.: State Space Model for New-Generation Network Alternative to Transformers: A Survey. arXiv preprint arXiv:2404.09516 (2024)

  63. [63]

    In: ICCV

    Wang, X., Wu, J., Chen, J., et al.: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In: ICCV. pp. 4580–4590 (2019)

  64. [64]

    In: ICLR (2023)

    Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In: ICLR (2023)

  65. [65]

    In: ICLR (2025)

    Xing, Y., Lan, X., Wang, R., et al.: EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment. In: ICLR (2025)

  66. [66]

    In: CVPR

    Xu, J., Mei, T., Yao, T., et al.: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In: CVPR. pp. 5288–5296 (2016)

  67. [67]

    arXiv preprint arXiv:2404.18861 (2024)

    Xu, R., Yang, S., Wang, Y., et al.: Visual Mamba: A Survey and New Outlooks. arXiv preprint arXiv:2404.18861 (2024)

  68. [68]

    Xu, Z., Zhang, Y., Xie, E., et al.: DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model. IEEE RA-L 9(10), 8186–8193 (2024)

  69. [69]

    In: ICLR (2024)

    Yue, X., Song, Y., Asai, A., et al.: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. In: ICLR (2024)

  70. [70]

    In: ICCV

    Zhai, X., Mustafa, B., Kolesnikov, A., et al.: Sigmoid Loss for Language Image Pre-Training. In: ICCV. pp. 11975–11986 (2023)

  71. [71]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., et al.: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. arXiv preprint arXiv:2501.13106 (2025)

  72. [72]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., et al.: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713 (2024)

  73. [73]

    In: ICCAS

    Zhang, Z., Chong, K.: Comparison between First-Order Hold with Zero-Order Hold in Discretization of Input-Delay Nonlinear Systems. In: ICCAS. pp. 2892–2896 (2007)

  74. [74]

    In: AAAI

    Zhao, H., Zhang, M., Zhao, W., et al.: Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. In: AAAI. pp. 10421–10429 (2025)

  75. [75]

    In: AAAI

    Zhou, L., Xu, C., Corso, J.: Towards automatic learning of procedures from web instructional videos. In: AAAI. p. 7590–7598 (2018)

  76. [76]

    In: ICML (2024)

    Zhu, L., Liao, B., Zhang, Q., et al.: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In: ICML (2024)

  77. [77]

    Zou, B., Guo, Z., Hu, X., et al.: RhythmMamba: Fast, Lightweight, and Accurate Remote Physiological Measurement. In: AAAI. pp. 11077–11085 (2025)