Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Hyuntak Lee; Jihoon Chung; Jinwoo Choi; Jongseo Lee; Sooa Kim; Sunghun Kim

arxiv: 2605.22823 · v1 · pith:Z6MJA2LZnew · submitted 2026-05-21 · 💻 cs.CV

Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords Video-LLMs · directional motion blindness · motion direction · instruction tuning · projector objective · MoDirect dataset · DeltaDirect · temporal video understanding

The pith

Video large language models fail at signed motion direction because accessible signals are not bound to verbal answers, but a projector-level objective closes the gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Video-LLMs perform near chance (25 percent for four options) when asked whether a simple object moves left, right, up, or down. Tracing the pipeline shows that motion direction stays linearly accessible in the vision encoder, projector, and LLM hidden states, yet the model does not link that signal to the correct answer choice. The authors introduce the MoDirect dataset family and DeltaDirect, a projector-level objective that trains the model to predict normalized 2-D motion vectors from feature deltas between adjacent frames. With DeltaDirect, accuracy rises from 25.9 percent to 85.4 percent on MoDirect-SynBench and improves by 21.9 points over the vanilla baseline on MoDirect-RealBench, without real-world tuning data and while preserving standard video-understanding performance.

Core claim

Although motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. DeltaDirect addresses this by predicting normalized 2-D motion vectors from adjacent-frame feature deltas at the projector level.
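To make "linearly accessible" concrete, here is a minimal sketch of a linear probe on toy data. It is an illustration under stated assumptions, not the paper's probe: the feature construction (`toy_feature`), the dimensionality, the noise level, and the choice of a nearest-class-mean classifier (which has linear decision boundaries) are all hypothetical.

```python
import random

# Hypothetical unit vectors for the four signed directions.
DIRECTIONS = {"left": (-1.0, 0.0), "right": (1.0, 0.0),
              "up": (0.0, 1.0), "down": (0.0, -1.0)}

def toy_feature(label, rng, dim=16, noise=0.2):
    """Toy hidden state: direction written into dims 0-1, Gaussian noise in every dim."""
    vec = [rng.gauss(0.0, noise) for _ in range(dim)]
    vec[0] += DIRECTIONS[label][0]
    vec[1] += DIRECTIONS[label][1]
    return vec

def make_split(n_per_class, rng):
    """Build a labelled toy dataset with n_per_class samples per direction."""
    return [(toy_feature(lab, rng), lab) for lab in DIRECTIONS for _ in range(n_per_class)]

def fit_class_means(samples):
    """Fit the probe: one mean feature vector per direction label."""
    sums, counts = {}, {}
    for vec, lab in samples:
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(means, vec):
    """Assign the label whose class mean is closest in squared Euclidean distance."""
    def sq_dist(lab):
        return sum((a - b) ** 2 for a, b in zip(vec, means[lab]))
    return min(means, key=sq_dist)

def accuracy(means, samples):
    hits = sum(predict(means, vec) == lab for vec, lab in samples)
    return hits / len(samples)

rng = random.Random(0)
probe = fit_class_means(make_split(50, rng))      # train split
probe_acc = accuracy(probe, make_split(50, rng))  # held-out split
chance = 1.0 / len(DIRECTIONS)
print(f"probe accuracy {probe_acc:.3f} vs chance {chance:.2f}")
```

A probe that scores far above chance while the model's question-answering accuracy stays near chance is the pattern the paper calls the direction binding gap.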

What carries the argument

DeltaDirect, a diagnosis-driven projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas to close the direction binding gap.
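The objective can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the 2-output linear head (`linear_head`), the 3-dimensional toy features, and the epsilon in `normalize_2d` are hypothetical, and the loss follows the mean squared error over adjacent-frame pairs quoted later on this page.

```python
import math

def feature_deltas(frame_feats):
    """Differences between adjacent-frame feature vectors: T frames give T-1 deltas."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frame_feats, frame_feats[1:])]

def normalize_2d(vec, eps=1e-8):
    """Scale a 2-D displacement to (approximately) unit length; eps avoids division by zero."""
    norm = math.hypot(vec[0], vec[1])
    return (vec[0] / (norm + eps), vec[1] / (norm + eps))

def linear_head(delta, weight, bias):
    """Hypothetical 2-output linear head applied to one feature delta."""
    return tuple(sum(w * d for w, d in zip(row, delta)) + b
                 for row, b in zip(weight, bias))

def motion_vector_loss(preds, targets):
    """Mean over adjacent-frame pairs of the squared L2 error between 2-D vectors."""
    assert preds and len(preds) == len(targets)
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(preds, targets))
    return total / len(preds)

# Toy projector outputs for T=3 frames; dims 0-1 happen to track object position.
frames = [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
deltas = feature_deltas(frames)
weight = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # hand-picked weights, not learned
bias = [0.0, 0.0]
preds = [linear_head(d, weight, bias) for d in deltas]
targets = [normalize_2d((2.0, 0.0)) for _ in deltas]  # rightward motion
loss = motion_vector_loss(preds, targets)
print(preds, loss)
```

In this sketch the head only sees frame-to-frame differences, so static appearance (the constant third feature) cancels out and only the displacement signal reaches the loss.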

If this is right

Instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4% on MoDirect-SynBench.
DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline on MoDirect-RealBench without real-world tuning data.
The method preserves standard video-understanding performance while addressing the motion direction failure.
Visual complexity reduces motion signal magnitude and limits out-of-domain generalization, which the objective partially mitigates.
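The magnitude claim in the last point can be illustrated with a small sketch. It assumes, hypothetically, that a direction concept vector is a difference of class-mean features; this page does not state how the paper computes its concept vectors, and the toy numbers below are invented for illustration only.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_vector(feats_a, feats_b):
    """Assumed definition: mean feature of class A minus mean feature of class B."""
    return [a - b for a, b in zip(mean_vector(feats_a), mean_vector(feats_b))]

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def cosine_similarity(u, v):
    """Orientation agreement between two concept vectors."""
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))

def magnitude_ratio(target_vec, source_vec):
    """Target-domain magnitude relative to the source domain (1.0 means equal strength)."""
    return l2_norm(target_vec) / l2_norm(source_vec)

# Invented toy features: "right" vs "left" on a source domain and on an OOD domain.
source = concept_vector([[1.0, 0.0, 0.2], [1.0, 0.0, 0.4]],
                        [[-1.0, 0.0, 0.2], [-1.0, 0.0, 0.4]])
ood = concept_vector([[0.3, 0.0, 0.5], [0.3, 0.0, 0.7]],
                     [[-0.3, 0.0, 0.5], [-0.3, 0.0, 0.7]])
print(cosine_similarity(source, ood), magnitude_ratio(ood, source))
```

In this toy case the two vectors point the same way but the out-of-domain one is much shorter, which mirrors the "shared orientation, weak magnitude" pattern described in the figure captions.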

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Binding gaps between visual features and language outputs may affect other basic perceptual features in multimodal models.
Projector-level objectives could provide an efficient route to add missing low-level capabilities without retraining the full model.
Applying the same diagnosis to diagonal, rotational, or 3-D motion would test how far normalized 2-D vector prediction extends.

Load-bearing premise

The argument assumes that linear accessibility of motion direction in hidden states implies the main problem is a binding gap, that this gap can be fixed at the projector, and that the fix generalizes beyond synthetic data without trading off other capabilities.

What would settle it

A model trained with DeltaDirect showing no gain or a drop on real-world videos with complex backgrounds would indicate the binding-gap diagnosis is incomplete or the projector fix does not transfer.

Figures

Figures reproduced from arXiv: 2605.22823 by Hyuntak Lee, Jihoon Chung, Jinwoo Choi, Jongseo Lee, Sooa Kim, Sunghun Kim.

**Figure 1.** Directional motion blindness in Video-LLMs. (a) Given a simple synthetic video of a yellow circle moving from left to right, recent Video-LLMs correctly identify the object’s color but answer the wrong motion direction. (b) Across Video-LLMs, appearance recognition is high, yet signed motion direction accuracy remains much lower, often near chance. To understand where this failure arises, we trace motion d…

**Figure 2.** Direction is decodable, but not converted into the answer. On Primitive-on-Syn in MODIRECT-SYNBENCH, motion direction remains linearly decodable throughout LLaVA-Video-7B [78], yet QA accuracy stays near chance, exposing the direction binding gap. We organize the analysis around three research questions. (i) Can the failure be explained by missing direction supervision or insufficient input-side scaffoldin…

**Figure 3.** The direction binding gap is universal across Video-LLMs. Panel: LLaVA-Video on Primitive-on-Syn. x-axis: Layer; y-axis: Accuracy (%). Series: Linear probe, Logit lens, Linear probe w/ MoDirect-Inst, Logit lens w/ MoDirect-Inst.

**Figure 5.** Shared orientation, weak magnitude. (a) Instruction tuning closes the binding gap on Primitive-on-Syn, but the gap reopens on OOD domains; DeltaDirect narrows it across domains. (b) Direction concept vector orientations align across domains after instruction tuning, with late-layer cosine similarity exceeding 0.9. (c) Despite this alignment, concept-vector magnitude decreases with visual complexity, reveal…

**Figure 6.** DeltaDirect. These observations motivate a simple design principle: make the projector output carry a stronger signed displacement signal before it enters the LLM. We therefore introduce DeltaDirect, a training-only auxiliary objective applied to the projector output. Instead of adding learned motion tokens, or a motion-specific encoder at inference time, DeltaDirect uses synthetic 2-D motion vectors as …

**Figure 7.** LLM-based semantic classification tends to over-predict direction. (a) Comparison of overall direction ratios. While human annotations show that only a small fraction of QA pairs require direction understanding, the LLM predicts direction at a substantially higher rate, further supporting the over-prediction tendency. (b) Row-normalized confusion matrix comparing LLM predictions with human annotations. The…

**Figure 8.** Visual prompting examples. We compare the original input with two visual cue variants designed to make directional information more explicit. (a) Plain shows the unmodified video frame without any additional visual cue. (b) Colored Edge marks the four image borders with distinct colors. (c) Text Edge annotates the borders with directional words. D.2.2 Text Prompt Text prompting modifies the input question …

**Figure 9.** Magnitude collapses on OOD domains across all three backbones. Each cell shows the per-layer ratio of the concept-vector magnitude on the row’s domain to that on the source domain P-Syn. The P-Syn row is therefore 1.00 by construction, while OOD rows below 1.00 indicate that OOD magnitudes are smaller than the source magnitude. P / C denotes the subject type (primitive shape or cutout image) and Syn / Re…

**Figure 10.** Identity-vs-motion-direction trade-off on the post-projector representation. Each point is one feature construction (Single, T-mean, Stack, Delta, Concat. Delta), averaged over the four MODIRECT-SYNBENCH domains; error bars span the domain min and max. Single-frame and temporal-mean features encode identity well but barely exceed chance (25 %) on motion direction (right-low region). Temporal-stack additio…

**Figure 11.** DeltaDirect restores the OOD motion direction concept vector magnitude across video-LLM backbones. For each backbone, we plot the direction concept vector magnitude on each OOD domain (Cutout-on-Syn, Primitive-on-Real, Cutout-on-Real) as a ratio to the same model’s source-domain Primitive-on-Syn magnitude. The green dashed line marks the Primitive-on-Syn reference at 1.0. The MODIRECT-INST baseline (gray)…

**Figure 12.** Examples of MODIRECT-INST. Panels: (a) Primitive-on-Syn, (b) Cutout-on-Syn, (c) Primitive-on-Real, (d) Cutout-on-Real.

**Figure 13.** Examples of MODIRECT-SYNBENCH.

**Figure 14.** Examples of MODIRECT-REALBENCH. Panel text (visual prompting variants Plain, Color Edge, and Text Edge): [Question] From the viewer's perspective, in which direction is the object moving in this video? A. Right B. Left. [Question] From the viewer's perspective, which colored edge does the object move toward? A. Green B. Yellow. [Question] From the viewer's perspective, which colored edge does the ob…

**Figure 15.** Default Prompting Example.

**Figure 16.** Temporal Prompting Example.

**Figure 17.** Grid Prompting Example.

**Figure 18.** Qualitative comparison on Something-Something v2 with an open-ended description prompt. Compared to the baseline LLaVA-Video, DeltaDirect generates a more grounded description of both the object and its leftward motion. Blue text highlights motion- and direction-related expressions.

**Figure 19.** Qualitative comparison on Something-Something v2 with an open-ended description prompt. Compared to the baseline LLaVA-Video, DeltaDirect generates a more grounded description of both the object and its leftward motion. Red text highlights motion- and direction-related expressions. YouCook2 User Prompt How does the hand interact with the sandwich in the video? DeltaDirect The hand is seen adding cheese s…

**Figure 20.** Qualitative comparison on YouCook2 under an open-ended video understanding prompt. Compared to the baseline LLaVA-Video, DeltaDirect tends to generate explicit directional motion descriptions, which can sometimes be unsupported by the actual video content. Red text highlights potentially incorrect expressions.

Original abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Video-LLMs exhibit directional motion blindness: they perform near chance on signed image-plane motion direction (left/right/up/down) in simple videos, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. Linear probing shows that motion direction remains accessible in the vision encoder, projector, and LLM hidden states but is not bound to the verbal answer, indicating a direction binding gap. The authors introduce the MoDirect dataset family and DeltaDirect, a projector-level regression objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. This yields gains from 25.9% to 85.4% on MoDirect-SynBench and +21.9 points on MoDirect-RealBench (without real-world tuning data), while preserving standard video-understanding performance.

Significance. If the results hold, this work is significant for pinpointing and mitigating a fundamental perceptual limitation in Video-LLMs via a targeted, diagnosis-driven fix at the projector level. Concrete accuracy lifts, the linear-probing diagnosis, and out-of-domain gains from synthetic-only training are strengths. The linked GitHub code supports reproducibility. This approach could guide more reliable temporal and motion capabilities in multimodal models without broad capability trade-offs.

major comments (2)

[Abstract and results on MoDirect-RealBench] The 21.9-point gain on MoDirect-RealBench without real-world tuning data is load-bearing for the generalization claim that DeltaDirect closes the direction binding gap in general Video-LLMs. The motion direction concept vector analysis acknowledges that visual complexity weakens signal magnitude and limits out-of-domain generalization, yet no ablation or direct measurement compares signal strength or regression performance on real vs. synthetic feature deltas (e.g., due to background, texture, or multi-object motion).
[Diagnosis section (linear probing experiments)] The diagnosis localizes the failure to a direction binding gap based on linear accessibility of motion direction from encoder, projector, and LLM states. However, this requires stronger evidence that the issue is specifically binding (rather than, e.g., attention dilution or output formatting), including probe details, exact quantification thresholds for accessibility, and controls showing the signal is not utilized despite presence.

minor comments (2)

[Abstract] The abstract omits specifics on baseline Video-LLM models, number of runs, statistical significance, and data exclusion rules for the accuracy numbers.
[Method section] Clarify the exact formulation of the DeltaDirect objective (e.g., loss function, normalization details) and how feature deltas are computed across frames to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to incorporate additional evidence and details as outlined.

Point-by-point responses

Referee: [Abstract and results on MoDirect-RealBench] The 21.9-point gain on MoDirect-RealBench without real-world tuning data is load-bearing for the generalization claim that DeltaDirect closes the direction binding gap in general Video-LLMs. The motion direction concept vector analysis acknowledges that visual complexity weakens signal magnitude and limits out-of-domain generalization, yet no ablation or direct measurement compares signal strength or regression performance on real vs. synthetic feature deltas (e.g., due to background, texture, or multi-object motion).

Authors: We agree that a direct comparison of regression performance on real versus synthetic feature deltas would strengthen the generalization claims. The observed +21.9 point gain on RealBench without real-world tuning data already provides evidence of domain-agnostic improvement, but we will add a new ablation in the revised manuscript reporting MSE and signal magnitude for DeltaDirect regression when applied to real-world feature deltas (extracted from the vision encoder and projector) compared to synthetic ones. This will quantify the effect of visual complexity and better support the out-of-domain results.

revision: yes

Referee: [Diagnosis section (linear probing experiments)] The diagnosis localizes the failure to a direction binding gap based on linear accessibility of motion direction from encoder, projector, and LLM states. However, this requires stronger evidence that the issue is specifically binding (rather than, e.g., attention dilution or output formatting), including probe details, exact quantification thresholds for accessibility, and controls showing the signal is not utilized despite presence.

Authors: We acknowledge the need for more rigorous supporting evidence. In the revised manuscript we will expand the diagnosis section with: (1) complete specifications of the linear probe models including architecture and training hyperparameters, (2) exact quantification thresholds (e.g., probe accuracy levels and statistical significance criteria) used to deem a signal linearly accessible, and (3) additional control experiments such as attention map analysis and output-format ablations to demonstrate that the accessible motion direction signal is not utilized in generating the verbal answer despite its presence in the hidden states.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation introduces independent auxiliary objective without reduction to inputs or self-citations

Full rationale

The paper traces motion direction signals through the Video-LLM pipeline to identify linear accessibility in encoder/projector/LLM states but a binding gap to answer tokens. It then defines DeltaDirect as an explicit new projector-level regression loss that predicts normalized 2-D motion vectors directly from adjacent-frame feature deltas, trained on the introduced MoDirect synthetic data. This is not a fitted parameter renamed as prediction, nor does any central claim reduce by construction to prior outputs or self-citations. Reported gains (e.g., 25.9% to 85.4% on SynBench, +21.9 points on RealBench) are presented as empirical results of applying this new objective, with no equations or steps that equate the claimed prediction to the training signal itself. The chain is self-contained against external benchmarks and does not rely on load-bearing self-citations or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the newly introduced DeltaDirect objective and MoDirect datasets together with the standard machine-learning assumption that linear probes can reveal accessible information in intermediate representations.

axioms (1)

domain assumption: Motion direction information remains linearly decodable from vision encoder, projector, and LLM hidden states.
Used to localize the failure to a binding gap rather than missing signal.

invented entities (2)

DeltaDirect (no independent evidence)
purpose: Projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas
New training signal introduced to close the direction binding gap.

directional motion blindness (no independent evidence)
purpose: Label for the observed failure of Video-LLMs to bind motion direction to verbal answers
Newly coined term for the diagnosed limitation.

pith-pipeline@v0.9.0 · 5819 in / 1444 out tokens · 45753 ms · 2026-05-22T05:38:01.253163+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

DeltaDirect predicts normalized 2-D motion vectors from adjacent-frame projector-feature deltas... $\mathcal{L}_{\mathrm{MVP}} = \frac{1}{T-1} \sum_{t} \lVert \hat{m}_t - m_t \rVert_2^2$

IndisputableMonolith/Foundation/ArrowOfTime.lean · TemporalSequence · unclear

We uniformly sample T=8 frames per video
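
One way to read the uniform sampling of T=8 frames is sketched below. The exact rule is not given on this page, so the segment-midpoint scheme and the function name `uniform_frame_indices` are assumptions for illustration only.

```python
def uniform_frame_indices(num_frames, num_samples=8):
    """Pick one frame index from the middle of each of num_samples equal segments.

    Assumed scheme (not confirmed by the source): segment i covers
    [i * num_frames / num_samples, (i + 1) * num_frames / num_samples).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
            for i in range(num_samples)]

print(uniform_frame_indices(64))  # indices for a 64-frame clip
print(uniform_frame_indices(4))   # short clips repeat frames to reach T=8
```

With T=8 sampled frames there are T-1=7 adjacent-frame pairs, which is the number of terms averaged in the loss quoted above.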

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

[1] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In NeurIPS, 2024.

[2] Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations. In CVPR, 2025.

[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.

[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

[6] Richard T Born and David C Bradley. Structure and function of visual area mt. Annu. Rev. Neurosci., 28(1):157–189, 2005.

[7] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.

[8] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

[9] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024.

[10] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024.

[11] Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, and Olga Russakovsky. Unifying specialized visual encoders for video language models. In ICML, 2025.

[12] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[13] Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. Tvbench: Redesigning video-language evaluation. arXiv:2410.07752, 2024.

[14] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-augmented RGB stream for action recognition. In CVPR, 2019.

[15] Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, and Ying Tai. Motionsight: Boosting fine-grained motion understanding in multimodal LLMs. In ICLR, 2026.

[16] Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, and Zhuotao Tian. Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging. In ICLR, 2026.

[17] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025.

[18] Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vis...

[19] James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014.

[20] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.

[21] Wes Gurnee and Max Tegmark. Language models represent space and time. In ICLR, 2024.

[22] Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In CVPR, 2025.

[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

[24] Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, and Marcelo H. Ang. Self-supervised motion learning from static images. In CVPR, 2021.

[25] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[26] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017.

[27] Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, and Yadong Mu. Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization. In ICML, 2024.

[28] Minji Kim, Taekyung Kim, and Bohyung Han. Map the flow: Revealing hidden pathways of information in videoLLMs. In ICLR, 2026.

[29] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022.

[30] Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. MotionSqueeze: Neural motion feature learning for video understanding. In ECCV, 2020.

[31] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer. TMLR, 2025.

[32] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023.

[33] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.

[34]

Temporal reasoning transfer from text to video

Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Temporal reasoning transfer from text to video. InICLR, 2025. 1, 3

work page 2025
[35]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Yi Liu, Shuhuai Ren, Yuanxin Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. InECCV, 2024. 3

work page 2024
[36]

TEA: Temporal excitation and aggregation for action recognition

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. TEA: Temporal excitation and aggregation for action recognition. InCVPR, 2020. 3

[37] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024. 1, 3, 9, 49

[38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024. 1, 3, 9, 32, 49

[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2, 6

[40] Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In ECCV, 2024. 3

[41] Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, and Wei Gao. Flow4Agent: Long-form video understanding via motion prior from optical flow. In ICCV, 2025. 3

[42] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In ACL, 2024. 1, 3, 9, 16, 22, 49, 50

[43] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA: Efficient frontier visual lang...

[44] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In ACL, 2024. 3, 25

[45] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

[46] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. 2, 3, 6, 33, 34, 36

[47] Ken Nakayama. Biological image motion processing: a review. Vision Research, 25(5):625–660, 1985. doi: 10.1016/0042-6989(85)90171-3. 1

[48] Hong Nguyen, Dung Tran, Hieu Hoang, Phong Nguyen, and Shrikanth Narayanan. MOOSE: Pay attention to temporal dynamics for video understanding via optical flows. arXiv preprint arXiv:2506.01119,

[49] nostalgebraist. Interpreting GPT: The logit lens. LessWrong, August 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2026-02-22. 6, 23, 34

[50] Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations. In ICLR,

[51] Yoonah Park, Haesung Pyun, and Yohan Jo. Bridging the knowledge-prediction gap in llms on multiple-choice questions. arXiv preprint arXiv:2509.23782, 2025. 5, 34

[52] Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. In NeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

[53] Daiqing Qi, Dongliang Guo, Hanzhang Yuan, Handong Zhao, Mengxuan Hu, Lehan Yang, and Sheng Li. Improve temporal reasoning in multimodal large language models via video contrastive decoding. In NeurIPS, 2025. 1

[54] Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. 2, 6, 33, 37

[55] Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. In ICLR, 2023. 4

[56] Mohammadreza Salehi, Jae S Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, and Ali Farhadi. Actionatlas: A videoqa benchmark for domain-specialized action recognition. In NeurIPS, 2024. 1

[57] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In ICPR, 2004. 3, 9, 49

[58] Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In ICLR, 2025. 1, 3, 9, 21, 49

[59] Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, and Rahul Sukthankar. D3D: Distilled 3D networks for video action recognition. In WACV, 2020. 3

[60] Yucheng Sun, Alessandro Stolfo, and Mrinmaya Sachan. Probing for arithmetic errors in language models. In EMNLP, 2025. 5, 34

[61] Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Language models linearly represent sentiment. In ICML, 2024. 2, 6, 33, 37

[62] Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, and Tao Chen. Favor-bench: A comprehensive benchmark for fine-grained video motion understanding. In NeurIPS, 3, 9, 16, 22, 49, 50

[64] Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. TDN: Temporal difference networks for efficient action recognition. In CVPR, 2021. 3

[65] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In ECCV, 2024. 1, 3

[66] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 4

[67] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 16

[68] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 9, 16, 22, 49, 50

[69] Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. In NeurIPS, 2025. 3

[70] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 18, 44

[71] Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, and Zehuan Yuan. Self-supervised video representation learning with motion-aware masked autoencoders. arXiv preprint arXiv:2210.04154, 2022. 3

[72] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In ICLR, 2025. 17, 32, 48, 49

[73] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In ICLR, 2026. 4

[74] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023. 18, 28, 44

[75] Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, and Wenwu Zhu. PhyVLLM: Physics-guided video language model with motion-appearance disentanglement. arXiv preprint arXiv:2512.04532, 2025. 3

[76] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 3, 9, 32, 49

[77] Jianrui Zhang, Mu Cai, and Yong Jae Lee. Vinoground: Scrutinizing lmms over dense temporal reasoning with short videos. arXiv preprint arXiv:2410.02763, 2024. 3, 9, 16, 22, 49, 50

[78] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In NAACL, 2025. 16

[79] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/. 1, 32

[80] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. TMLR, 2025. 3, 4, 6, 9, 17, 25, 28, 31, 32, 33, 48, 49, 50, 51


[1] [1]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InNeurIPS, 2024. 2, 6, 33, 37

work page 2024

[2] [2]

Mash- vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal represen- tations

Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. Mash- vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal represen- tations. InCVPR, 2025. 1, 3

work page 2025

[3] [3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

9, 17, 18, 32, 37, 48, 49, 51

work page

[5] [5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2. 5-vl technical report.a...

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Structure and function of visual area mt.Annu

Richard T Born and David C Bradley. Structure and function of visual area mt.Annu. Rev. Neurosci., 28 (1):157–189, 2005. 1

work page 2005

[7] [7]

Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. InCVPR, 2024. 4

work page 2024

[8] [8]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, 2024. 9, 17, 18, 26, 32, 49

work page 2024

[9] [9]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. InNeurIPS, 2024. 4

work page 2024

[10] [10]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Unifying specialized visual encoders for video language models

Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, and Olga Rus- sakovsky. Unifying specialized visual encoders for video language models. InICML, 2025

work page 2025

[12] [12]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 1, 3, 9, 49

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. Tvbench: Redesigning video-language evaluation.arXiv:2410.07752, 2024. 3

work page arXiv 2024

[14] [14]

MARS: Motion-augmented RGB stream for action recognition

Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-augmented RGB stream for action recognition. InCVPR, 2019. 3

work page 2019

[15] [15]

Motionsight: Boosting fine-grained motion understanding in multimodal LLMs

Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, and Ying Tai. Motionsight: Boosting fine-grained motion understanding in multimodal LLMs. InICLR, 2026. 3

work page 2026

[16] [16]

Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging

Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, and Zhuotao Tian. Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging. InICLR, 2026. 9, 17, 49 10

work page 2026

[17] [17]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025. 1

work page 2025

[18] [18]

Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning

Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vis...

work page 2025

[19] [19]

Psychology press, 2014

James J Gibson.The ecological approach to visual perception: classic edition. Psychology press, 2014. 1

work page 2014

[20] [20]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InICCV, 2017. 3, 9, 27, 49

work page 2017

[21] [21]

Language models represent space and time

Wes Gurnee and Max Tegmark. Language models represent space and time. InICLR, 2024. 5, 34, 35, 36

work page 2024

[22] [22]

Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understand- ing for vision language models. InCVPR, 2025. 1, 3, 9, 17, 22, 49, 50

work page 2025

[23] [23]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022. 6

work page 2022

[24] [24]

Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, and Marcelo H. Ang. Self- supervised motion learning from static images. InCVPR, 2021. 3

work page 2021

[25] [25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Tgif-qa: Toward spatio- temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio- temporal reasoning in visual question answering. InCVPR, 2017. 9, 16, 22, 49, 50

work page 2017

[27] [27]

Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization

Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, and Yadong Mu. Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization. InICML, 2024. 3

work page 2024

[28] [28]

Map the flow: Revealing hidden pathways of information in videoLLMs

Minji Kim, Taekyung Kim, and Bohyung Han. Map the flow: Revealing hidden pathways of information in videoLLMs. InICLR, 2026. 3, 4, 6

work page 2026

[29] [29]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. InNeurIPS, 2022. 4

work page 2022

[30] [30]

MotionSqueeze: Neural motion feature learning for video understanding

Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. MotionSqueeze: Neural motion feature learning for video understanding. InECCV, 2020. 3

work page 2020

[31] [31]

LLaV A-onevision: Easy visual task transfer.TMLR, 2025

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-onevision: Easy visual task transfer.TMLR, 2025. 1, 3, 9, 17, 18, 25, 32, 37, 48, 49, 51

work page 2025

[32] [32]

Inference-time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. InNeurIPS, 2023. 2, 5, 6, 33, 37

work page 2023

[33] [33]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InCVPR, 2024. 1, 3, 4, 9, 16, 18, 19, 22, 25, 44, 49, 50

work page 2024

[34] [34]

Temporal reasoning transfer from text to video

Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Temporal reasoning transfer from text to video. InICLR, 2025. 1, 3

work page 2025

[35] [35]

Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

Shicheng Li, Lei Li, Yi Liu, Shuhuai Ren, Yuanxin Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. InECCV, 2024. 3

work page 2024

[36] [36]

TEA: Temporal excitation and aggregation for action recognition

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. TEA: Temporal excitation and aggregation for action recognition. InCVPR, 2020. 3

work page 2020

[37] [37]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. InECCV, 2024. 1, 3, 9, 49 11

work page 2024

[38] [38]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InEMNLP, 2024. 1, 3, 9, 32, 49

work page 2024

[39] [39]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InECCV, 2014. 2, 6

work page 2014

[40] [40]

St-llm: Large language models are effective temporal learners

Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. InECCV, 2024. 3

work page 2024

[41] [41]

Flow4Agent: Long-form video under- standing via motion prior from optical flow

Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, and Wei Gao. Flow4Agent: Long-form video under- standing via motion prior from optical flow. InICCV, 2025. 3

work page 2025

[42] [42]

Tempcompass: Do video llms really understand videos? InACL, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InACL, 2024. 1, 3, 9, 16, 22, 49, 50

work page 2024

[43] [43]

NVILA: Efficient frontier visual language models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De- An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA: Efficient frontier visual lang...

work page 2025

[44] [44]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024. 3, 25

work page 2024

[45] [45]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InNeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

work page 2023

[46] [46]

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. InFirst Conference on Language Modeling, 2024. 2, 3, 6, 33, 34, 36

work page 2024

[47] [47]

Biological image motion processing: a review.Vision Research, 25(5):625–660, 1985

Ken Nakayama. Biological image motion processing: a review.Vision Research, 25(5):625–660, 1985. doi: 10.1016/0042-6989(85)90171-3. 1

work page doi:10.1016/0042-6989(85)90171-3 1985

[48] [48]

MOOSE: Pay atten- tion to temporal dynamics for video understanding via optical flows.arXiv preprint arXiv:2506.01119,

Hong Nguyen, Dung Tran, Hieu Hoang, Phong Nguyen, and Shrikanth Narayanan. MOOSE: Pay atten- tion to temporal dynamics for video understanding via optical flows.arXiv preprint arXiv:2506.01119,

work page arXiv

[49] [49]

Interpreting GPT: The logit lens

nostalgebraist. Interpreting GPT: The logit lens. LessWrong, August 2020. URLhttps://www. lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2026-02-22. 6, 23, 34

work page 2020

[50] [50]

Llms know more than they show: On the intrinsic representation of llm hallucinations

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations. InICLR,

work page

[51] [51]

Bridging the knowledge-prediction gap in llms on multiple- choice questions.arXiv preprint arXiv:2509.23782, 2025

Yoonah Park, Haesung Pyun, and Yohan Jo. Bridging the knowledge-prediction gap in llms on multiple- choice questions.arXiv preprint arXiv:2509.23782, 2025. 5, 34

work page arXiv 2025

[52] [52]

Perception test: A diagnostic benchmark for multimodal video models

Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. InNeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

work page 2023

[53] [53]

Improve temporal reasoning in multimodal large language models via video contrastive decoding

Daiqing Qi, Dongliang Guo, Hanzhang Yuan, Handong Zhao, Mengxuan Hu, Lehan Yang, and Sheng Li. Improve temporal reasoning in multimodal large language models via video contrastive decoding. In NeurIPS, 2025. 1

work page 2025

[54] [54]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. 2, 6, 33, 37

work page 2024

[55] [55]

Leveraging large language models for multiple choice question answering

Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arxiv (2022). InICLR, 2023. 4

work page 2022

[56] [56]

Actionatlas: A videoqa benchmark for domain-specialized action recognition

Mohammadreza Salehi, Jae S Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Han- naneh Hajishirzi, and Ali Farhadi. Actionatlas: A videoqa benchmark for domain-specialized action recognition. InNeurIPS, 2024. 1

work page 2024

[57] [57]

Recognizing human actions: a local svm approach

Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. InICPR, 2004. 3, 9, 49 12

work page 2004

[58] [58]

Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In ICLR, 2025. 1, 3, 9, 21, 49

work page 2025

[59] [59]

Stroud, David A

Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, and Rahul Sukthankar. D3D: Distilled 3D networks for video action recognition. InWACV, 2020. 3

work page 2020

[60] [60]

Probing for arithmetic errors in language mod- els

Yucheng Sun, Alessandro Stolfo, and Mrinmaya Sachan. Probing for arithmetic errors in language mod- els. InEMNLP, 2025. 5, 34

work page 2025

[61] [61]

Language models linearly rep- resent sentiment

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Language models linearly rep- resent sentiment. InICML, 2024. 2, 6, 33, 37

work page 2024

[62] [62]

Favor-bench: A comprehensive benchmark for fine-grained video motion understanding

Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang YU, and Tao Chen. Favor-bench: A comprehensive benchmark for fine-grained video motion understanding. InNeurIPS,

work page

[63] [63]

3, 9, 16, 22, 49, 50

work page

[64] [64]

TDN: Temporal difference networks for efficient action recognition

Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. TDN: Temporal difference networks for efficient action recognition. InCVPR, 2021. 3

work page 2021

[65] [65]

Internvideo2: Scaling foundation models for multimodal video understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. InECCV, 2024. 1, 3

work page 2024

[66] [66]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022. 4

work page 2022

[67] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 16

[68] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 9, 16, 22, 49, 50

[69] Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. In NeurIPS, 2025. 3

[70] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 18, 44

[71] Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, and Zehuan Yuan. Self-supervised video representation learning with motion-aware masked autoencoders. arXiv preprint arXiv:2210.04154, 2022. 3

[72] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In ICLR, 2025. 17, 32, 48, 49

[73] Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In ICLR, 2026. 4

[74] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023. 18, 28, 44

[75] Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, and Wenwu Zhu. PhyVLLM: Physics-guided video language model with motion-appearance disentanglement. arXiv preprint arXiv:2512.04532, 2025. 3

[76] Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 3, 9, 32, 49

[77] Jianrui Zhang, Mu Cai, and Yong Jae Lee. Vinoground: Scrutinizing lmms over dense temporal reasoning with short videos. arXiv preprint arXiv:2410.02763, 2024. 3, 9, 16, 22, 49, 50

[78] Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In NAACL, 2025. 16

[79] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/. 1, 32

[80] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-video: Video instruction tuning with synthetic data. TMLR, 2025. 3, 4, 6, 9, 17, 25, 28, 31, 32, 33, 48, 49, 50, 51