
arxiv: 2605.22823 · v1 · pith:Z6MJA2LZnew · submitted 2026-05-21 · 💻 cs.CV

Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords Video-LLMs · directional motion blindness · motion direction · instruction tuning · projector objective · MoDirect dataset · DeltaDirect · temporal video understanding

The pith

Video large language models fail at signed motion direction because accessible signals are not bound to verbal answers, but a projector-level objective closes the gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Video-LLMs perform near chance when asked whether a simple object is moving left, right, up, or down. Tracing the signal through the pipeline, the authors find that motion direction stays linearly accessible in the vision encoder, projector, and LLM hidden states, yet the model does not link that signal to the correct answer choice. They introduce the MoDirect dataset family and DeltaDirect, a projector-level objective that trains the model to predict normalized 2-D motion vectors from feature deltas between adjacent frames. Instruction tuning with DeltaDirect raises accuracy from 25.9% to 85.4% on MoDirect-SynBench and adds 21.9 points on MoDirect-RealBench, without real-world tuning data and while preserving standard video-understanding performance.

Core claim

Although motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. DeltaDirect addresses this by predicting normalized 2-D motion vectors from adjacent-frame feature deltas at the projector level.
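One way to make "linearly accessible" concrete is a linear probe: a single linear map fit on frozen features to predict the four direction labels. The sketch below is illustrative only. The features are random stand-ins with an injected direction signal, and `make_toy_features`, `fit_linear_probe`, the feature dimension, and the least-squares fit are all assumptions, not the paper's probe setup.

```python
import numpy as np

DIRECTIONS = ["left", "right", "up", "down"]

def make_toy_features(n, dim, signal, rng):
    """Random stand-in 'hidden states' with a direction signal on 4 fixed axes (toy data)."""
    labels = rng.integers(0, len(DIRECTIONS), size=n)
    feats = rng.normal(size=(n, dim))
    feats[np.arange(n), labels] += signal  # class k shifts coordinate k
    return feats, labels

def fit_linear_probe(feats, labels, n_classes):
    """Least-squares fit of a linear map (plus bias) onto one-hot labels."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    y = np.eye(n_classes)[labels]
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights

def probe_accuracy(weights, feats, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == labels))

rng = np.random.default_rng(0)
train_x, train_y = make_toy_features(2000, 32, signal=4.0, rng=rng)
test_x, test_y = make_toy_features(500, 32, signal=4.0, rng=rng)
probe = fit_linear_probe(train_x, train_y, len(DIRECTIONS))
acc = probe_accuracy(probe, test_x, test_y)
print(f"probe accuracy: {acc:.3f} (chance = {1 / len(DIRECTIONS):.2f})")
```

A probe that scores well above 25% while the model's own multiple-choice answers stay near chance is the pattern the review calls a direction binding gap.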

What carries the argument

DeltaDirect, a diagnosis-driven projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas to close the direction binding gap.
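As a rough illustration of what such an objective could look like, the sketch below computes adjacent-frame deltas on projector outputs, regresses them to a 2-D vector, and compares against a unit-length motion target. The mean pooling over tokens, the linear head, the MSE loss, and every function name here are assumptions made for illustration; the page does not state the paper's exact formulation.

```python
import numpy as np

def normalize_motion(displacement, eps=1e-8):
    """Scale 2-D displacement vectors (dx, dy) to unit length."""
    norm = np.linalg.norm(displacement, axis=-1, keepdims=True)
    return displacement / np.maximum(norm, eps)

def delta_direction_loss(projected, head, displacement):
    """Toy auxiliary loss on projector outputs.

    projected:    (T, N, D) projector output for T frames with N tokens each
    head:         (D, 2) linear regression head (hypothetical)
    displacement: (T - 1, 2) per-step object displacement from a synthetic renderer
    """
    deltas = projected[1:] - projected[:-1]   # adjacent-frame feature deltas, (T-1, N, D)
    pooled = deltas.mean(axis=1)              # mean over tokens (assumption), (T-1, D)
    predicted = pooled @ head                 # predicted 2-D motion, (T-1, 2)
    target = normalize_motion(displacement)
    return float(np.mean((predicted - target) ** 2))  # MSE is an assumption

rng = np.random.default_rng(0)
frames, tokens, dim = 8, 16, 32
projected = rng.normal(size=(frames, tokens, dim))
head = rng.normal(scale=0.01, size=(dim, 2))
rightward = np.tile([3.0, 0.0], (frames - 1, 1))  # object moves 3 px right per step
loss = delta_direction_loss(projected, head, rightward)
print(f"auxiliary loss: {loss:.4f}")
```

Because the target is normalized, the head is asked only for the signed direction of motion, not its speed. In a real training setup this term would be added to the usual language-modeling loss and dropped at inference.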

If this is right

  • Instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4% on MoDirect-SynBench.
  • DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline on MoDirect-RealBench without real-world tuning data.
  • The method preserves standard video-understanding performance while addressing the motion direction failure.
  • Visual complexity reduces motion signal magnitude and limits out-of-domain generalization, which the objective partially mitigates.
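The magnitude point in the last bullet can be made concrete with a small sketch. Here a "concept vector" is taken to be the difference between class-mean hidden states (right minus left), which is an assumed construction, and the two domains are toy data in which the out-of-domain signal is simply weaker. All names and numbers are hypothetical.

```python
import numpy as np

def concept_vector(hidden, labels, positive, negative):
    """Difference of class-mean hidden states (assumed construction)."""
    return hidden[labels == positive].mean(axis=0) - hidden[labels == negative].mean(axis=0)

def compare_to_source(source_vec, other_vec):
    """Return (cosine similarity, magnitude ratio other/source)."""
    source_norm = np.linalg.norm(source_vec)
    other_norm = np.linalg.norm(other_vec)
    cosine = float(source_vec @ other_vec / (source_norm * other_norm))
    return cosine, float(other_norm / source_norm)

def toy_domain(n, dim, strength, rng):
    """Toy hidden states: direction is encoded on axis 0 with the given strength."""
    labels = rng.integers(0, 2, size=n)       # 0 = left, 1 = right
    signs = 2.0 * labels - 1.0
    hidden = rng.normal(scale=0.1, size=(n, dim))
    hidden[:, 0] += signs * strength
    return hidden, labels

rng = np.random.default_rng(0)
src_hidden, src_labels = toy_domain(1000, 64, strength=1.0, rng=rng)
ood_hidden, ood_labels = toy_domain(1000, 64, strength=0.3, rng=rng)
src_vec = concept_vector(src_hidden, src_labels, positive=1, negative=0)
ood_vec = concept_vector(ood_hidden, ood_labels, positive=1, negative=0)
cosine, ratio = compare_to_source(src_vec, ood_vec)
print(f"cosine = {cosine:.3f}, magnitude ratio = {ratio:.3f}")
```

In this toy setup the two vectors point the same way (cosine near 1) while the out-of-domain magnitude is a fraction of the source magnitude, which is the "shared orientation, weak magnitude" pattern described for the real models.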

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Binding gaps between visual features and language outputs may affect other basic perceptual features in multimodal models.
  • Projector-level objectives could provide an efficient route to add missing low-level capabilities without retraining the full model.
  • Applying the same diagnosis to diagonal, rotational, or 3-D motion would test how far normalized 2-D vector prediction extends.

Load-bearing premise

That motion direction is linearly accessible from hidden states implies the main failure is a binding gap, and that this gap can be fixed at the projector in a way that generalizes beyond synthetic data without trading off other capabilities.

What would settle it

A model trained with DeltaDirect showing no gain or a drop on real-world videos with complex backgrounds would indicate the binding-gap diagnosis is incomplete or the projector fix does not transfer.

Figures

Figures reproduced from arXiv: 2605.22823 by Hyuntak Lee, Jihoon Chung, Jinwoo Choi, Jongseo Lee, Sooa Kim, Sunghun Kim.

Figure 1: Directional motion blindness in Video-LLMs. (a) Given a simple synthetic video of a yellow circle moving from left to right, recent Video-LLMs correctly identify the object’s color but answer the wrong motion direction. (b) Across Video-LLMs, appearance recognition is high, yet signed motion direction accuracy remains much lower, often near chance. To understand where this failure arises, we trace motion d…
Figure 2: Direction is decodable, but not converted into the answer. On Primitive-on-Syn in MoDirect-SynBench, motion direction remains linearly decodable throughout LLaVA-Video-7B [78], yet QA accuracy stays near chance, exposing the direction binding gap. We organize the analysis around three research questions. (i) Can the failure be explained by missing direction supervision or insufficient input-side scaffoldin…
Figure 3: The direction binding gap is universal across Video-LLMs. [Plot: LLaVA-Video on Primitive-on-Syn; x-axis: Layer; y-axis: Accuracy (%); series: Linear probe, Logit lens, Linear probe w/ MoDirect-Inst, Logit lens w/ MoDirect-Inst.]
Figure 5: Shared orientation, weak magnitude. (a) Instruction tuning closes the binding gap on Primitive-on-Syn, but the gap reopens on OOD domains; DeltaDirect narrows it across domains. (b) Direction concept vector orientations align across domains after instruction tuning, with late-layer cosine similarity exceeding 0.9. (c) Despite this alignment, concept-vector magnitude decreases with visual complexity, reveal…
Figure 6: DeltaDirect. These observations motivate a simple design principle: make the projector output carry a stronger signed displacement signal before it enters the LLM. We therefore introduce DeltaDirect, a training-only auxiliary objective applied to the projector output. Instead of adding learned motion tokens, or a motion-specific encoder at inference time, DeltaDirect uses synthetic 2-D motion vectors as …
Figure 7: LLM-based semantic classification tends to over-predict direction. (a) Comparison of overall direction ratios. While human annotations show that only a small fraction of QA pairs require direction understanding, the LLM predicts direction at a substantially higher rate, further supporting the over-prediction tendency. (b) Row-normalized confusion matrix comparing LLM predictions with human annotations. The…
Figure 8: Visual prompting examples. We compare the original input with two visual cue variants designed to make directional information more explicit. (a) Plain shows the unmodified video frame without any additional visual cue. (b) Colored Edge marks the four image borders with distinct colors. (c) Text Edge annotates the borders with directional words. D.2.2 Text Prompt Text prompting modifies the input question …
Figure 9: Magnitude collapses on OOD domains across all three backbones. Each cell shows the per-layer ratio of the concept-vector magnitude on the row’s domain to that on the source domain P-Syn. The P-Syn row is therefore 1.00 by construction, while OOD rows below 1.00 indicate that OOD magnitudes are smaller than the source magnitude. P / C denotes the subject type (primitive shape or cutout image) and Syn / Re…
Figure 10: Identity-vs-motion-direction trade-off on the post-projector representation. Each point is one feature construction (Single, T-mean, Stack, Delta, Concat. Delta), averaged over the four MoDirect-SynBench domains; error bars span the domain min and max. Single-frame and temporal-mean features encode identity well but barely exceed chance (25%) on motion direction (right-low region). Temporal-stack additio…
Figure 11: DeltaDirect restores the OOD motion direction concept vector magnitude across video-LLM backbones. For each backbone, we plot the direction concept vector magnitude on each OOD domain (Cutout-on-Syn, Primitive-on-Real, Cutout-on-Real) as a ratio to the same model’s source-domain Primitive-on-Syn magnitude. The green dashed line marks the Primitive-on-Syn reference at 1.0. The MoDirect-Inst baseline (gray)…
Figure 12: Examples of MoDirect-Inst. Panels: (a) Primitive-on-Syn, (b) Cutout-on-Syn, (c) Primitive-on-Real, (d) Cutout-on-Real.
Figure 13: Examples of MoDirect-SynBench.
Figure 14: Examples of MoDirect-RealBench. [Figure text spilled from an adjacent panel: example direction questions for the Plain, Color Edge, and Text Edge visual-prompting variants, e.g. "From the viewer's perspective, in which direction is the object moving in this video? A. Right B. Left" and "From the viewer's perspective, which colored edge does the object move toward? A. Green B. Yellow".]
Figure 15: Default Prompting Example.
Figure 16: Temporal Prompting Example.
Figure 17: Grid Prompting Example.
Figure 18: Qualitative comparison on Something-Something v2 with an open-ended description prompt. Compared to the baseline LLaVA-Video, DeltaDirect generates a more grounded description of both the object and its leftward motion. Blue text highlights motion- and direction-related expressions.
Figure 19: Qualitative comparison on Something-Something v2 with an open-ended description prompt. Compared to the baseline LLaVA-Video, DeltaDirect generates a more grounded description of both the object and its leftward motion. Red text highlights motion- and direction-related expressions. [Figure text spilled from an adjacent panel: YouCook2 example; user prompt "How does the hand interact with the sandwich in the video?"; DeltaDirect response begins "The hand is seen adding cheese s…"]
Figure 20: Qualitative comparison on YouCook2 under an open-ended video understanding prompt. Compared to the baseline LLaVA-Video, DeltaDirect tends to generate explicit directional motion descriptions, which can sometimes be unsupported by the actual video content. Red text highlights potentially incorrect expressions.
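Figure 3 contrasts a linear probe with a logit lens. As a reminder of what the second readout does, the sketch below projects per-layer hidden states through an unembedding matrix and picks the answer option with the highest logit. It is a simplified toy: the identity-like unembedding, the option token ids, and the omission of the final normalization layer that a real model would apply are all assumptions.

```python
import numpy as np

def logit_lens_choice(hidden_by_layer, unembed, option_token_ids):
    """For each layer, return the index of the answer option with the highest logit.

    hidden_by_layer:  (L, D) hidden state at the answer position, one row per layer
    unembed:          (D, V) output embedding (unembedding) matrix
    option_token_ids: token ids of the answer letters (hypothetical values below)
    """
    logits = hidden_by_layer @ unembed            # (L, V)
    option_logits = logits[:, option_token_ids]   # (L, num_options)
    return np.argmax(option_logits, axis=1)

rng = np.random.default_rng(0)
layers, dim, vocab = 6, 16, 50
unembed = np.eye(dim, vocab)                 # toy unembedding: token k reads coordinate k
option_ids = np.array([10, 11, 12, 13])      # hypothetical ids for "A", "B", "C", "D"
hidden = rng.normal(scale=0.01, size=(layers, dim))
hidden[-1] += unembed[:, option_ids[2]]      # only the last layer commits to option "C"
choices = logit_lens_choice(hidden, unembed, option_ids)
print(choices)
```

Unlike a trained probe, this readout uses only the model's own output head, so a layer where the probe succeeds and the logit lens does not is one where the information is present but not yet expressed as an answer.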
original abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
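The abstract's point that above-chance scores can come from prediction bias rather than direction understanding can be checked with a simple report: overall accuracy next to the share of each predicted answer and per-direction recall. The sketch below uses a made-up, fully biased responder on 100 toy labels; `direction_report` and the data are hypothetical and not the paper's evaluation code.

```python
from collections import Counter

DIRECTIONS = ("left", "right", "up", "down")

def direction_report(truth, predicted):
    """Accuracy, share of each predicted direction, and per-direction recall."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    share = Counter(predicted)
    recall = {}
    for d in DIRECTIONS:
        total = sum(t == d for t in truth)
        hits = sum(t == d and p == d for t, p in zip(truth, predicted))
        recall[d] = hits / total if total else 0.0
    return {
        "accuracy": correct / len(truth),
        "prediction_share": {d: share[d] / len(predicted) for d in DIRECTIONS},
        "recall": recall,
    }

truth = list(DIRECTIONS) * 25            # 100 toy labels, 25 per direction
always_right = ["right"] * len(truth)    # a responder that ignores the video
report = direction_report(truth, always_right)
print(report["accuracy"], report["recall"])
```

On a balanced four-way set this responder lands exactly at the 25% chance level, with perfect recall on one direction and zero on the rest. On an unbalanced set the same behavior would score above chance, which is why the prediction share and per-direction recall are worth reading alongside accuracy.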

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Video-LLMs exhibit directional motion blindness: they perform near chance on signed image-plane motion direction (left/right/up/down) in simple videos, and the above-chance cases are largely attributable to prediction biases rather than genuine direction understanding. Linear probing shows that motion direction remains accessible in the vision encoder, projector, and LLM hidden states but is not bound to the verbal answer, indicating a direction binding gap. The authors introduce the MoDirect dataset family and DeltaDirect, a projector-level regression objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. This yields gains from 25.9% to 85.4% on MoDirect-SynBench and +21.9 points on MoDirect-RealBench (without real-world tuning data), while preserving standard video-understanding performance.

Significance. If the results hold, this work is significant for pinpointing and mitigating a fundamental perceptual limitation in Video-LLMs via a targeted, diagnosis-driven fix at the projector level. Concrete accuracy lifts, the linear-probing diagnosis, and out-of-domain gains from synthetic-only training are strengths. The linked GitHub code supports reproducibility. This approach could guide more reliable temporal and motion capabilities in multimodal models without broad capability trade-offs.

major comments (2)
  1. [Abstract and results on MoDirect-RealBench] The 21.9-point gain on MoDirect-RealBench without real-world tuning data is load-bearing for the generalization claim that DeltaDirect closes the direction binding gap in general Video-LLMs. The motion direction concept vector analysis acknowledges that visual complexity weakens signal magnitude and limits out-of-domain generalization, yet no ablation or direct measurement compares signal strength or regression performance on real vs. synthetic feature deltas (e.g., due to background, texture, or multi-object motion).
  2. [Diagnosis section (linear probing experiments)] The diagnosis localizes the failure to a direction binding gap based on linear accessibility of motion direction from encoder, projector, and LLM states. However, this requires stronger evidence that the issue is specifically binding (rather than, e.g., attention dilution or output formatting), including probe details, exact quantification thresholds for accessibility, and controls showing the signal is not utilized despite presence.
minor comments (2)
  1. [Abstract] The abstract omits specifics on baseline Video-LLM models, number of runs, statistical significance, and data exclusion rules for the accuracy numbers.
  2. [Method section] Clarify the exact formulation of the DeltaDirect objective (e.g., loss function, normalization details) and how feature deltas are computed across frames to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to incorporate additional evidence and details as outlined.

point-by-point responses
  1. Referee: [Abstract and results on MoDirect-RealBench] The 21.9-point gain on MoDirect-RealBench without real-world tuning data is load-bearing for the generalization claim that DeltaDirect closes the direction binding gap in general Video-LLMs. The motion direction concept vector analysis acknowledges that visual complexity weakens signal magnitude and limits out-of-domain generalization, yet no ablation or direct measurement compares signal strength or regression performance on real vs. synthetic feature deltas (e.g., due to background, texture, or multi-object motion).

    Authors: We agree that a direct comparison of regression performance on real versus synthetic feature deltas would strengthen the generalization claims. The observed +21.9 point gain on RealBench without real-world tuning data already provides evidence of domain-agnostic improvement, but we will add a new ablation in the revised manuscript reporting MSE and signal magnitude for DeltaDirect regression when applied to real-world feature deltas (extracted from the vision encoder and projector) compared to synthetic ones. This will quantify the effect of visual complexity and better support the out-of-domain results. revision: yes

  2. Referee: [Diagnosis section (linear probing experiments)] The diagnosis localizes the failure to a direction binding gap based on linear accessibility of motion direction from encoder, projector, and LLM states. However, this requires stronger evidence that the issue is specifically binding (rather than, e.g., attention dilution or output formatting), including probe details, exact quantification thresholds for accessibility, and controls showing the signal is not utilized despite presence.

    Authors: We acknowledge the need for more rigorous supporting evidence. In the revised manuscript we will expand the diagnosis section with: (1) complete specifications of the linear probe models including architecture and training hyperparameters, (2) exact quantification thresholds (e.g., probe accuracy levels and statistical significance criteria) used to deem a signal linearly accessible, and (3) additional control experiments such as attention map analysis and output-format ablations to demonstrate that the accessible motion direction signal is not utilized in generating the verbal answer despite its presence in the hidden states. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation introduces independent auxiliary objective without reduction to inputs or self-citations

full rationale

The paper traces motion direction signals through the Video-LLM pipeline to identify linear accessibility in encoder/projector/LLM states but a binding gap to answer tokens. It then defines DeltaDirect as an explicit new projector-level regression loss that predicts normalized 2-D motion vectors directly from adjacent-frame feature deltas, trained on the introduced MoDirect synthetic data. This is not a fitted parameter renamed as prediction, nor does any central claim reduce by construction to prior outputs or self-citations. Reported gains (e.g., 25.9% to 85.4% on SynBench, +21.9 points on RealBench) are presented as empirical results of applying this new objective, with no equations or steps that equate the claimed prediction to the training signal itself. The chain is self-contained against external benchmarks and does not rely on load-bearing self-citations or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the newly introduced DeltaDirect objective and MoDirect datasets together with the standard machine-learning assumption that linear probes can reveal accessible information in intermediate representations.

axioms (1)
  • [domain assumption] Motion direction information remains linearly decodable from vision encoder, projector, and LLM hidden states
    Used to localize the failure to a binding gap rather than missing signal.
invented entities (2)
  • DeltaDirect (no independent evidence)
    purpose: Projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas
    New training signal introduced to close the direction binding gap.
  • directional motion blindness (no independent evidence)
    purpose: Label for the observed failure of Video-LLMs to bind motion direction to verbal answers
    Newly coined term for the diagnosed limitation.

pith-pipeline@v0.9.0 · 5819 in / 1444 out tokens · 45753 ms · 2026-05-22T05:38:01.253163+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 8 internal anchors

  1. [1]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In NeurIPS, 2024.

  2. [2]

    Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations

    Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations. In CVPR, 2025.

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.

  5. [5]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.a...

  6. [6]

    Structure and function of visual area mt

    Richard T Born and David C Bradley. Structure and function of visual area mt. Annu. Rev. Neurosci., 28(1):157–189, 2005.

  7. [7]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In CVPR, 2024.

  8. [8]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

  9. [9]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In NeurIPS, 2024.

  10. [10]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024.

  11. [11]

    Unifying specialized visual encoders for video language models

    Jihoon Chung, Tyler Zhu, Max Gonzalez Saez-Diez, Juan Carlos Niebles, Honglu Zhou, and Olga Russakovsky. Unifying specialized visual encoders for video language models. In ICML, 2025.

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  13. [13]

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. Tvbench: Redesigning video-language evaluation. arXiv:2410.07752, 2024.

  14. [14]

    MARS: Motion-augmented RGB stream for action recognition

    Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-augmented RGB stream for action recognition. In CVPR, 2019.

  15. [15]

    Motionsight: Boosting fine-grained motion understanding in multimodal LLMs

    Yipeng Du, Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Xiang Li, Jian Yang, Zhenheng Yang, and Ying Tai. Motionsight: Boosting fine-grained motion understanding in multimodal LLMs. In ICLR, 2026.

  16. [16]

    Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging

    Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, and Zhuotao Tian. Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging. In ICLR, 2026.

  17. [17]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025.

  18. [18]

    Ocrbench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning

    Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, and Xiang Bai. Ocrbench v2: An improved benchmark for evaluating large multimodal models on vis...

  19. [19]

    The ecological approach to visual perception: classic edition

    James J Gibson. The ecological approach to visual perception: classic edition. Psychology press, 2014.

  20. [20]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.

  21. [21]

    Language models represent space and time

    Wes Gurnee and Max Tegmark. Language models represent space and time. In ICLR, 2024.

  22. [22]

    Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. In CVPR, 2025.

  23. [23]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

  24. [24]

    Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, and Marcelo H. Ang. Self-supervised motion learning from static images. In CVPR, 2021.

  25. [25]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.

  26. [26]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017.

  27. [27]

    Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization

    Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, and Yadong Mu. Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization. In ICML, 2024.

  28. [28]

    Map the flow: Revealing hidden pathways of information in videoLLMs

    Minji Kim, Taekyung Kim, and Bohyung Han. Map the flow: Revealing hidden pathways of information in videoLLMs. In ICLR, 2026.

  29. [29]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In NeurIPS, 2022.

  30. [30]

    MotionSqueeze: Neural motion feature learning for video understanding

    Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho. MotionSqueeze: Neural motion feature learning for video understanding. In ECCV, 2020.

  31. [31]

    LLaVA-onevision: Easy visual task transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-onevision: Easy visual task transfer. TMLR, 2025.

  32. [32]

    Inference-time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In NeurIPS, 2023.

  33. [33]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In CVPR, 2024.

  34. [34]

    Temporal reasoning transfer from text to video

    Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, and Qi Liu. Temporal reasoning transfer from text to video. In ICLR, 2025.

  35. [35]

    Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models

    Shicheng Li, Lei Li, Yi Liu, Shuhuai Ren, Yuanxin Liu, Rundong Gao, Xu Sun, and Lu Hou. Vitatecs: A diagnostic dataset for temporal concept understanding of video-language models. In ECCV, 2024.

  36. [36]

    TEA: Temporal excitation and aggregation for action recognition

    Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, and Limin Wang. TEA: Temporal excitation and aggregation for action recognition. In CVPR, 2020.

  37. [37]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In ECCV, 2024.

  38. [38]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024.

  39. [39]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.

  40. [40]

    St-llm: Large language models are effective temporal learners

    Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In ECCV, 2024. 3

  41. [41]

    Flow4Agent: Long-form video understanding via motion prior from optical flow

    Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, and Wei Gao. Flow4Agent: Long-form video understanding via motion prior from optical flow. In ICCV, 2025. 3

  42. [42]

    Tempcompass: Do video llms really understand videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In ACL, 2024. 1, 3, 9, 16, 22, 49, 50

  43. [43]

    NVILA: Efficient frontier visual language models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. NVILA: Efficient frontier visual lang...

  44. [44]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In ACL, 2024. 3, 25

  45. [45]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

  46. [46]

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. 2, 3, 6, 33, 34, 36

  47. [47]

    Biological image motion processing: a review

    Ken Nakayama. Biological image motion processing: a review. Vision Research, 25(5):625–660, 1985. doi: 10.1016/0042-6989(85)90171-3. 1

  48. [48]

    MOOSE: Pay attention to temporal dynamics for video understanding via optical flows

    Hong Nguyen, Dung Tran, Hieu Hoang, Phong Nguyen, and Shrikanth Narayanan. MOOSE: Pay attention to temporal dynamics for video understanding via optical flows. arXiv preprint arXiv:2506.01119,

  49. [49]

    Interpreting GPT: The logit lens

    nostalgebraist. Interpreting GPT: The logit lens. LessWrong, August 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens. Accessed: 2026-02-22. 6, 23, 34

  50. [50]

    Llms know more than they show: On the intrinsic representation of llm hallucinations

    Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations. In ICLR,

  51. [51]

    Bridging the knowledge-prediction gap in llms on multiple-choice questions

    Yoonah Park, Haesung Pyun, and Yohan Jo. Bridging the knowledge-prediction gap in llms on multiple-choice questions. arXiv preprint arXiv:2509.23782, 2025. 5, 34

  52. [52]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. In NeurIPS, 2023. 1, 3, 9, 16, 22, 49, 50

  53. [53]

    Improve temporal reasoning in multimodal large language models via video contrastive decoding

    Daiqing Qi, Dongliang Guo, Hanzhang Yuan, Handong Zhao, Mengxuan Hu, Lehan Yang, and Sheng Li. Improve temporal reasoning in multimodal large language models via video contrastive decoding. In NeurIPS, 2025. 1

  54. [54]

    Steering llama 2 via contrastive activation addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. 2, 6, 33, 37

  55. [55]

    Leveraging large language models for multiple choice question answering

    Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. In ICLR, 2023. 4

  56. [56]

    Actionatlas: A videoqa benchmark for domain-specialized action recognition

    Mohammadreza Salehi, Jae S Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, and Ali Farhadi. Actionatlas: A videoqa benchmark for domain-specialized action recognition. In NeurIPS, 2024. 1

  57. [57]

    Recognizing human actions: a local svm approach

    Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In ICPR, 2004. 3, 9, 49

  58. [58]

    Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models

    Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, and Arman Cohan. Tomato: Assessing visual temporal reasoning capabilities in multimodal foundation models. In ICLR, 2025. 1, 3, 9, 21, 49

  59. [59]

    D3D: Distilled 3D networks for video action recognition

    Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, and Rahul Sukthankar. D3D: Distilled 3D networks for video action recognition. In WACV, 2020. 3

  60. [60]

    Probing for arithmetic errors in language models

    Yucheng Sun, Alessandro Stolfo, and Mrinmaya Sachan. Probing for arithmetic errors in language models. In EMNLP, 2025. 5, 34

  61. [61]

    Language models linearly represent sentiment

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Language models linearly represent sentiment. In ICML, 2024. 2, 6, 33, 37

  62. [62]

    Favor-bench: A comprehensive benchmark for fine-grained video motion understanding

    Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, and Tao Chen. Favor-bench: A comprehensive benchmark for fine-grained video motion understanding. In NeurIPS,

  63. [63]

    3, 9, 16, 22, 49, 50

  64. [64]

    TDN: Temporal difference networks for efficient action recognition

    Limin Wang, Zhan Tong, Bin Ji, and Gangshan Wu. TDN: Temporal difference networks for efficient action recognition. In CVPR, 2021. 3

  65. [65]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In ECCV, 2024. 1, 3

  66. [66]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. 4

  67. [67]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 16

  68. [68]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021. 9, 16, 22, 49, 50

  69. [69]

    Seeing the arrow of time in large multimodal models

    Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. In NeurIPS, 2025. 3

  70. [70]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 18, 44

  71. [71]

    Self-supervised video representation learning with motion-aware masked autoencoders

    Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, and Zehuan Yuan. Self-supervised video representation learning with motion-aware masked autoencoders. arXiv preprint arXiv:2210.04154, 2022. 3

  72. [72]

    mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In ICLR, 2025. 17, 32, 48, 49

  73. [73]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. In ICLR, 2026. 4

  74. [74]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023. 18, 28, 44

  75. [75]

    PhyVLLM: Physics-guided video language model with motion-appearance disentanglement

    Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, and Wenwu Zhu. PhyVLLM: Physics-guided video language model with motion-appearance disentanglement. arXiv preprint arXiv:2512.04532, 2025. 3

  76. [76]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 3, 9, 32, 49

  77. [77]

    Vinoground: Scrutinizing lmms over dense temporal reasoning with short videos

    Jianrui Zhang, Mu Cai, and Yong Jae Lee. Vinoground: Scrutinizing lmms over dense temporal reasoning with short videos. arXiv preprint arXiv:2410.02763, 2024. 3, 9, 16, 22, 49, 50

  78. [78]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In NAACL, 2025. 16

  79. [79]

    Llava-next: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/. 1, 32

  80. [80]

    LLaVA-Video: Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. TMLR, 2025. 3, 4, 6, 9, 17, 25, 28, 31, 32, 33, 48, 49, 50, 51

Showing first 80 references.