pith. machine review for the scientific record.

arxiv: 2604.08050 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video captioning · multimodal large language models · state space models · Mamba · efficient video processing · temporal dependencies · hierarchical scan

The pith

A Mamba-based multimodal model processes video sequences with linear complexity, captioning them competitively while running about three times faster than Transformer equivalents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the quadratic scaling problem that makes current multimodal large language models impractical for long video sequences in captioning tasks. It replaces attention mechanisms with state space models and introduces a module that scans video features bidirectionally at multiple temporal resolutions to preserve dependencies. If the approach holds, it would let open models handle extended videos on ordinary hardware without the usual compute explosion, directly improving throughput on benchmarks like VATEX and MSR-VTT.
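
As a rough worked comparison (textbook asymptotics, not numbers taken from the paper): for a sequence of L video-plus-text tokens, model width d, and SSM state size N, the per-layer sequence-mixing costs scale as

```latex
% Standard cost estimates, not figures reported in the paper.
\mathrm{Cost}_{\mathrm{attention}}(L) = \mathcal{O}\!\left(L^{2} d\right), \qquad
\mathrm{Cost}_{\mathrm{SSM\;scan}}(L) = \mathcal{O}\!\left(L \, d \, N\right)
% Going from 2048 to 8192 tokens multiplies the attention term by 16,
% but the linear-time scan term only by 4.
```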

Core claim

ABMamba builds a fully open multimodal large language model on deep state space models as the language backbone and adds an Aligned Hierarchical Bidirectional Scan module that processes video inputs across multiple temporal resolutions. The result is linear computational complexity with competitive captioning performance on VATEX and MSR-VTT, at roughly three times the throughput of typical Transformer-based MLLMs.

What carries the argument

The Aligned Hierarchical Bidirectional Scan module, which aligns and scans video features bidirectionally at several temporal resolutions before feeding them into the Mamba backbone.
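
A minimal sketch of one way such a module could work, under assumptions that are ours rather than the paper's: frame features are average-pooled to a few temporal resolutions, each resolution is scanned forward and backward with a linear-time recurrence (a simple gated stand-in for Mamba's selective scan), and the results are re-aligned to the original length and fused. Every function name, the scale set, and the fusion-by-averaging choice are hypothetical.

```python
import torch
import torch.nn.functional as F

def linear_scan(x, decay=0.9):
    """Linear-time stand-in for a selective SSM scan (not the paper's kernel):
    a gated cumulative recurrence over the time axis of (B, L, D) features."""
    h = torch.zeros_like(x[:, 0])
    outputs = []
    for t in range(x.shape[1]):
        h = decay * h + (1.0 - decay) * x[:, t]
        outputs.append(h)
    return torch.stack(outputs, dim=1)

def hierarchical_bidirectional_scan(x, scales=(1, 2, 4)):
    """Hypothetical AHBS-style pass: pool to several temporal resolutions,
    scan each forward and backward, re-align to the original length, average."""
    B, L, D = x.shape
    fused = torch.zeros_like(x)
    for s in scales:
        # pool along time to a coarser resolution ((B, D, L) layout for pooling)
        xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s).transpose(1, 2)
        fwd = linear_scan(xs)                  # past-to-future pass
        bwd = linear_scan(xs.flip(1)).flip(1)  # future-to-past pass
        ys = fwd + bwd
        # align back to the original number of frame tokens
        ys = F.interpolate(ys.transpose(1, 2), size=L,
                           mode="linear", align_corners=False).transpose(1, 2)
        fused = fused + ys / len(scales)
    return fused  # (B, L, D), handed to the language-model backbone

# Toy usage: 2 clips, 64 frame tokens, 256-dim features.
feats = torch.randn(2, 64, 256)
print(hierarchical_bidirectional_scan(feats).shape)  # torch.Size([2, 64, 256])
```

The point of the sketch is the cost profile rather than the exact design: each scale needs on the order of L/s scan steps, so the whole pass stays linear in the number of frame tokens.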

If this is right

  • Video captioning becomes feasible for longer sequences because computational cost grows linearly rather than quadratically with length.
  • Fully open MLLMs gain practicality for video tasks by reaching competitive accuracy at three times the throughput of attention-based models.
  • The state space model backbone can serve as a drop-in replacement for attention in multimodal video settings without custom retraining for each task.
  • Multiple-resolution scanning preserves temporal structure across scales, supporting caption quality that matches existing MLLMs on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same linear scan approach could be tested on related tasks such as video question answering or temporal action localization to measure whether efficiency gains transfer.
  • For deployment on edge devices, the reduced memory footprint from linear complexity might allow real-time captioning of streaming video where Transformer models currently cannot run.
  • If the multi-resolution alignment proves robust, it could be adapted to other sequential modalities like audio or sensor data streams that share long-range dependency challenges.

Load-bearing premise

The Aligned Hierarchical Bidirectional Scan module captures the key temporal dependencies in videos at multiple resolutions without losing information or needing extra task-specific adjustments.

What would settle it

Running ABMamba on longer video sequences from VATEX and measuring both captioning accuracy and actual wall-clock throughput to check whether performance falls below Transformer baselines or the reported speed gain disappears.
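
A hedged sketch of that measurement, assuming only that each model exposes some callable that takes a batch of video features and returns generated token ids; the function and variable names are placeholders, not the paper's code.

```python
import time

def measure_throughput(generate_fn, batches, warmup=2):
    """Wall-clock generated-tokens-per-second for any captioning callable.

    `generate_fn(batch)` stands in for whichever decoding entry point
    ABMamba or a Transformer baseline provides; on GPU, synchronize the
    device (e.g. torch.cuda.synchronize()) before reading the clock.
    """
    for batch in batches[:warmup]:          # warm up kernels and caches
        generate_fn(batch)
    n_tokens = 0
    start = time.perf_counter()
    for batch in batches[warmup:]:
        outputs = generate_fn(batch)
        n_tokens += sum(len(ids) for ids in outputs)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

Running the same harness on progressively longer VATEX clips, together with CIDEr/BLEU-4 scoring of the outputs, would show whether both the roughly threefold throughput gap and the accuracy parity survive at lengths beyond those reported.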

Figures

Figures reproduced from arXiv: 2604.08050 by Daichi Yashima, Komei Sugiura, Seitaro Otsuki, Shuhei Kurita, Shuntaro Suzuki, Yusuke Oda.

Figure 1. A typical use case of ABMamba. ABMamba generates relevant and de…
Figure 2. The architecture of ABMamba. Given a video with a language prompt, …
Figure 3. The overview of Aligned Hierarchical Bidirectional Scan (AHBS) mod…
Figure 4. Qualitative results of ABMamba and baseline methods on the VATEX …
Figure 5. Failure case from ABMamba and baselines on the VATEX benchmark.
read the original abstract

In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ABMamba, a fully open multimodal large language model for video captioning that replaces quadratic Transformer attention with Deep State Space Models (Mamba) as the language backbone and adds a novel Aligned Hierarchical Bidirectional Scan module to handle multi-resolution temporal dependencies in video sequences. It claims linear computational complexity together with competitive performance on standard benchmarks such as VATEX and MSR-VTT and approximately three times higher throughput than typical MLLMs.

Significance. If the empirical claims are substantiated with detailed metrics and ablations, the work would provide a concrete demonstration that state-space models can serve as a scalable backbone for open video MLLMs, addressing a key limitation of attention-based architectures on long sequences.

major comments (2)
  1. Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.
  2. §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.
minor comments (2)
  1. The abstract and introduction would benefit from explicit citation of the exact Mamba variant (e.g., Mamba-2 or original) and the precise video-captioning metrics used (CIDEr, BLEU-4, etc.).
  2. Notation for the state-space parameters and the bidirectional scan directions could be formalized with a short equation block to improve reproducibility.
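
A sketch of the kind of short equation block the report has in mind, written in the generic discretized selective-SSM form (with the common first-order simplification for the discretized input matrix) rather than transcribed from the manuscript:

```latex
% Generic Mamba-style discretized recurrence; not taken from the paper.
% x_t is the input token, h_t the hidden state, y_t the output.
h_t = \bar{A}_t \, h_{t-1} + \bar{B}_t \, x_t, \qquad y_t = C_t \, h_t,
\qquad \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t \approx \Delta_t B_t .
% One possible way to write the bidirectional scan at a given resolution:
y_t = \mathrm{SSM}_{\rightarrow}(x_{1:t}) + \mathrm{SSM}_{\leftarrow}(x_{T:t}).
```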

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the potential impact of ABMamba. We address the two major comments point by point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Abstract: the central claim of 'competitive performance' and 'approximately three times higher throughput' is stated without any numerical scores, baseline names, table references, statistical tests, or error bars, rendering the primary empirical contribution unverifiable from the given text.

    Authors: We agree that the abstract would be strengthened by greater specificity. In the revised manuscript we will update the abstract to reference concrete metrics (e.g., CIDEr scores on VATEX and MSR-VTT), name the primary baselines (such as Video-LLaMA and other MLLMs), point to the relevant tables for the throughput comparison, and note that the reported gains are consistent across runs. These additions will make the central claims directly verifiable while preserving the abstract's brevity. revision: yes

  2. Referee: §3 (Aligned Hierarchical Bidirectional Scan module): the description of how alignment across temporal resolutions is achieved and whether it preserves information without task-specific tuning is high-level only; this mechanism is load-bearing for both the claimed generality and the absence of information loss.

    Authors: We acknowledge that the current exposition in §3 remains somewhat high-level. In the revision we will expand this section with a precise algorithmic description and mathematical formulation of the alignment procedure, including how the hierarchical bidirectional scans are synchronized across resolutions and how state representations are merged without discarding information. We will also add a dedicated paragraph and supporting ablation demonstrating that the module operates without task-specific hyper-parameter tuning, thereby confirming both generality and information preservation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents ABMamba as an architectural proposal extending state-space models with a new Aligned Hierarchical Bidirectional Scan module for video sequences. All performance claims (competitive accuracy on VATEX/MSR-VTT plus 3x throughput) are framed as empirical outcomes measured on external public benchmarks. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text that reduce any result to its own inputs by construction. The design choices are described at the level of module composition rather than self-referential definitions or uniqueness theorems imported from prior self-work. The central claims therefore remain independent of the paper's own fitted values or internal renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the unverified effectiveness of the newly introduced scan module and on standard assumptions about Mamba scaling; no free parameters, axioms, or invented entities are quantified in the abstract.

invented entities (1)
  • Aligned Hierarchical Bidirectional Scan module no independent evidence
    purpose: Process video sequences at multiple temporal resolutions with bidirectional scanning while preserving linear complexity
    Newly proposed component whose behavior is asserted but not derived or independently evidenced in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1254 out tokens · 40905 ms · 2026-05-10T17:52:34.845720+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 21 canonical work pages · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2024)

  2. [2]

    Banerjee, S., Lavie, A.: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: ACL. pp. 65–72 (2005)

  3. [3]

    In: ICLR (2024)

    Baron, E., Zimerman, I., Wolf, L.: A 2-Dimensional State Space Layer for Spatial Inductive Bias. In: ICLR (2024)

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., et al.: π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024)

  5. [5]

    Blakeman, A., Basant, A., Khattar, A., et al.: Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. arXiv preprint arXiv:2504.03624 (2025)

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., et al.: RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817 (2022)

  7. [7]

    Carreira, J., Noland, E., Banki-Horvath, A., et al.: A Short Note about Kinetics-600.

  8. [8]

    arXiv preprint arXiv:1808.01340 (2018)

  9. [9]

    Chen, D., Dolan, W.: Collecting highly parallel data for paraphrase evaluation. In: ACL. pp. 190–200 (2011)

  10. [10]

    In: CVPR

    Chen, Z., Wu, J., Wang, W., et al.: InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In: CVPR. pp. 24185–24198 (2024)

  11. [11]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    Chu, X., Qiao, L., Zhang, X., et al.: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv preprint arXiv:2402.03766 (2024)

  12. [12]

    In: WACVW

    Cui, C., Ma, Y., Cao, X., et al.: Drive as You Speak: Enabling Human-Like Interaction with Large Language Models in Autonomous Vehicles. In: WACVW. pp. 902–909 (2024)

  13. [13]

    In: ICML (2024)

    Dao, T., Gu, A.: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. In: ICML (2024)

  14. [14]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Deitke, M., Clark, C., Lee, S., et al.: Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. arXiv preprint arXiv:2409.17146 (2024)

  15. [15]

    Goko, M., Kambara, M., Saito, D., et al.: Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations. In: CoRL (2024)

  16. [16]

    In: CoLM (2024)

    Gu, A., Dao, T.: Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In: CoLM (2024)

  17. [17]

    In: ICLR (2022)

    Gu, A., Goel, K., Ré, C.: Efficiently Modeling Long Sequences with Structured State Spaces. In: ICLR (2022)

  18. [18]

    In: NeurIPS (2021)

    Gu, A., Johnson, I., Goel, K., et al.: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. In: NeurIPS (2021)

  19. [19]

    In: NeurIPS

    He, H., Bai, Y., et al.: MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection. In: NeurIPS. pp. 71162–71187 (2024)

  20. [20]

    In: ECCV (2024)

    Hu, V., Baumann, S.A., Gui, M., et al.: ZigMa: A DiT-style Zigzag Mamba Diffusion Model. In: ECCV (2024)

  21. [21]

    Kalman, R.: A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82(1), 35–45 (1960)

  22. [22]

    In: ICML (2024)

    Karamcheti, S., Nair, S., Balakrishna, A., et al.: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models. In: ICML (2024)

  23. [23]

    In: ICML (2020)

    Katharopoulos, A., Vyas, A., Pappas, N., et al.: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In: ICML (2020)

  24. [24]

    In: CoRL

    Kim, J., Pertsch, K., Karamcheti, S., et al.: OpenVLA: An Open-Source Vision-Language-Action Model. In: CoRL. pp. 2679–2713 (2024)

  25. [25]

    In: ICCV

    Krishna, R., Hata, K., Ren, F., et al.: Dense-Captioning Events in Videos. In: ICCV. pp. 706–715 (2017)

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., et al.: LLaVA-OneVision: Easy Visual Task Transfer. arXiv preprint arXiv:2408.03326 (2024)

  27. [27]

    Li, C., Gan, Z., et al.: Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Found. Trends Comput. Graph. Vis. 16(1–2), 1–214 (2024)

  28. [28]

    In: ECCV (2024)

    Li, K., Li, X., Wang, Y., et al.: VideoMamba: State Space Model for Efficient Video Understanding. In: ECCV (2024)

  29. [29]

    In: CVPR

    Li, Y., Song, Y., Cao, L., et al.: TGIF: A New Dataset and Benchmark on Animated GIF Description. In: CVPR. pp. 4641–4650 (2016)

  30. [30]

    In: CAICE

    Liang, Z., Xu, Y., Hong, Y., et al.: A Survey of Multimodel Large Language Models. In: CAICE. pp. 405–409 (2024)

  31. [31]

    In: ICLR (2025)

    Lieber, O., Lenz, B., Bata, H., et al.: Jamba: Hybrid Transformer-Mamba Language Models. In: ICLR (2025)

  32. [32]

    In: EMNLP

    Lin, B., Ye, Y., Zhu, B., et al.: Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. In: EMNLP. pp. 5971–5984 (2024)

  33. [33]

    Lin, C.: ROUGE: A Package For Automatic Evaluation Of Summaries. In: ACL. pp. 74–81 (2004)

  34. [34]

    In: CVPR

    Liu, H., Li, C., Li, Y., et al.: Improved Baselines with Visual Instruction Tuning. In: CVPR. pp. 26296–26306 (2024)

  35. [35]

    In: NeurIPS

    Liu, H., Li, C., Wu, Q., et al.: Visual Instruction Tuning. In: NeurIPS. pp. 34892–34916 (2023)

  36. [36]

    Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

    Liu, X., Shu, Y., Liu, Z., et al.: Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding. arXiv preprint arXiv:2503.18478 (2025)

  37. [37]

    arXiv preprint arXiv:2405.04404 (2024)

    Liu, X., Zhang, C., Zhang, L.: Vision Mamba: A Comprehensive Survey and Taxonomy. arXiv preprint arXiv:2405.04404 (2024)

  38. [38]

    In: NeurIPS

    Liu, Y., Tian, Y., Zhao, Y., et al.: VMamba: Visual State Space Model. In: NeurIPS. pp. 103031–103063 (2024)

  39. [39]

    Maaz, M., Rasheed, H., et al.: Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. In: ACL. pp. 12585–12602 (2024)

  40. [40]

    In: ICCV

    Miech, A., Zhukov, D., Alayrac, J.B., et al.: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV. pp. 2630–2640 (2019)

  41. [41]

    In: NeurIPS

    Nguyen, E., Goel, K., Gu, A., et al.: S4ND: modeling images and videos as multi-dimensional signals using state spaces. In: NeurIPS. pp. 2846–2861 (2022)

  42. [42]

    Nguyen, T., Bin, Y., Xiao, J., et al.: Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives. In: ACL. pp. 3636–3657 (2024)

  43. [43]

    TMLR (2024)

    Oquab, M., Darcet, T., Moutakanni, T., et al.: DINOv2: Learning Robust Visual Features without Supervision. TMLR (2024)

  44. [44]

    Papineni, K., Roukos, S., Ward, T., et al.: BLEU: a Method for Automatic Eval- uation of Machine Translation. In: ACL. pp. 311–318 (2002)

  45. [45]

    arXiv preprint arXiv:2403.15360 (2024)

    Patro, B., Agneeswaran, V.: SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv preprint arXiv:2403.15360 (2024)

  46. [46]

    Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

    Patro, N., Agneeswaran, S.: Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges. arXiv preprint arXiv:2404.16112 (2024)

  47. [47]

    Multimedia Tools Appl.78(10), 14007–14027 (2019)

    Pini, S., Cornia, M., Bolelli, F., et al.: M-VAD Names: a Dataset for Video Captioning with Naming. Multimedia Tools Appl. 78(10), 14007–14027 (2019)

  48. [48]

    arXiv preprint arXiv:2403.13600 (2024)

    Qiao, Y., Yu, Z., Guo, L., et al.: Vl-mamba: Exploring state space models for multimodal learning. arXiv preprint arXiv:2403.13600 (2024)

  49. [49]

    In: ICML

    Radford, A., Kim, J.W., Hallacy, C., et al.: Learning Transferable Visual Models from Natural Language Supervision. In: ICML. pp. 8748–8763 (2021)

  50. [50]

    arXiv preprint arXiv:2410.03105 (2024)

    Rahman, M.M., Tutul, A.A., Nath, A., et al.: Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv preprint arXiv:2410.03105 (2024)

  51. [51]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., et al.: Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530 (2024)

  52. [52]

    In: GCPR

    Rohrbach, A., Rohrbach, M., Qiu, W., et al.: Coherent multi-sentence video description with variable level of detail. In: GCPR. pp. 184–195 (2014)

  53. [53]

    In: CVPR

    Sarto, S., Barraco, M., Cornia, M., et al.: Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. In: CVPR. pp. 6914–6924 (2023)

  54. [54]

    In: ICLR (2024)

    Shiyu, W., Haixu, W., Xiaoming, S., Tengge, H., Huakun, L., Lintao, M., James, Z., Jun, Z.: TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. In: ICLR (2024)

  55. [55]

    In: ICLR (2023)

    Smith, J., Warrington, A., Linderman, S.: Simplified State Space Layers for Se- quence Modeling. In: ICLR (2023)

  56. [56]

    Retentive Network: A Successor to Transformer for Large Language Models

    Sun, Y., Dong, L., Huang, S., et al.: Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621 (2023)

  57. [57]

    Video Understanding with Large Language Models: A Survey

    Tang, Y., Bi, J., Xu, S., et al.: Video Understanding with Large Language Models: A Survey. arXiv preprint arXiv:2312.17432 (2023)

  58. [58]

    In: CoRL (2024)

    Tian, X., Gu, J., Li, B., et al.: DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. In: CoRL (2024)

  59. [59]

    In: AAAI

    Tong, C., He, S., Shao, Z., et al.: G-VEval: A versatile metric for evaluating image and video captions using GPT-4o. In: AAAI. pp. 7419–7427 (2025)

  60. [60]

    In: NeurIPS

    Tong, S., Brown, E., Wu, P., et al.: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. NeurIPS pp. 87310–87356 (2024)

  61. [61]

    In: CVPR

    Vedantam, R., Zitnick, L., Parikh, D.: CIDEr: Consensus-based Image Description Evaluation. In: CVPR. pp. 4566–4575 (2015)

  62. [62]

    arXiv preprint arXiv:2404.09516 (2024)

    Wang, X., Wang, S., et al.: State Space Model for New-Generation Network Alternative to Transformers: A Survey. arXiv preprint arXiv:2404.09516 (2024)

  63. [63]

    In: ICCV

    Wang, X., Wu, J., Chen, J., et al.: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In: ICCV. pp. 4580–4590 (2019)

  64. [64]

    In: ICLR (2023)

    Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., Long, M.: TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In: ICLR (2023)

  65. [65]

    In: ICLR (2025)

    Xing, Y., Lan, X., Wang, R., et al.: EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment. In: ICLR (2025)

  66. [66]

    In: CVPR

    Xu, J., Mei, T., Yao, T., et al.: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In: CVPR. pp. 5288–5296 (2016)

  67. [67]

    arXiv preprint arXiv:2404.18861 (2024)

    Xu, R., Yang, S., Wang, Y., et al.: Visual Mamba: A Survey and New Outlooks. arXiv preprint arXiv:2404.18861 (2024)

  68. [68]

    Xu, Z., Zhang, Y., Xie, E., et al.: DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model. IEEE RA-L 9(10), 8186–8193 (2024)

  69. [69]

    In: ICLR (2024)

    Yue, X., Song, Y., Asai, A., et al.: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. In: ICLR (2024)

  70. [70]

    In: ICCV

    Zhai, X., Mustafa, B., Kolesnikov, A., et al.: Sigmoid Loss for Language Image Pre-Training. In: ICCV. pp. 11975–11986 (2023)

  71. [71]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., et al.: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding. arXiv preprint arXiv:2501.13106 (2025)

  72. [72]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., et al.: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713 (2024)

  73. [73]

    In: ICCAS

    Zhang, Z., Chong, K.: Comparison between First-Order Hold with Zero-Order Hold in Discretization of Input-Delay Nonlinear Systems. In: ICCAS. pp. 2892–2896 (2007)

  74. [74]

    In: AAAI

    Zhao, H., Zhang, M., Zhao, W., et al.: Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. In: AAAI. pp. 10421–10429 (2025)

  75. [75]

    In: AAAI

    Zhou, L., Xu, C., Corso, J.: Towards automatic learning of procedures from web instructional videos. In: AAAI. p. 7590–7598 (2018)

  76. [76]

    In: ICML (2024)

    Zhu, L., Liao, B., Zhang, Q., et al.: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In: ICML (2024)

  77. [77]

    Zou, B., Guo, Z., Hu, X., et al.: RhythmMamba: Fast, Lightweight, and Accurate Remote Physiological Measurement. In: AAAI. pp. 11077–11085 (2025)