CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3
The pith
Vision-language models struggle to construct precise causal chains when answering questions about video events.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CaST-Bench supplies 2,066 complex causal questions over 1,015 videos in which each question demands that a model identify and localize a chain of multiple spatio-temporal evidences; the chains are annotated via temporal segments and bounding-box tracks created through a human-AI pipeline, and the benchmark includes novel metrics that separately measure answer correctness and the degree of visual-evidence grounding achieved.
What carries the argument
Causal chain annotations that mark the specific temporal segments and bounding-box tracks linking cause to effect in each video.
If this is right
- Models able to build explicit causal chains will show higher accuracy on causal video questions.
- Grounded chain construction will reduce answers driven by spurious correlations.
- Explicit evidence chains will make model outputs more transparent to users.
- Future vision-language models will need dedicated mechanisms for assembling spatio-temporal causal sequences.
Where Pith is reading between the lines
- The same annotation style could be applied to test causal reasoning in domains such as robotics or medical video analysis.
- Metrics that separately score grounding may become standard for judging reliability in multimodal systems.
- Training procedures that reward chain construction rather than final-answer matching could be developed using this benchmark format.
Load-bearing premise
The human-AI pipeline produces annotations that accurately identify the true causal chains in the videos without major bias or localization mistakes.
What would settle it
A vision-language model that reaches high answer accuracy on the benchmark questions while failing to localize or reference the annotated causal segments and boxes, or a large-scale human review that finds frequent mismatches between the provided annotations and the actual causal events in the videos.
Figures
read the original abstract
Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CaST-Bench, a new benchmark with 2,066 causal questions over 1,015 videos. Questions require identifying and localizing multi-step causal chains via temporal segments and bounding-box tracks, constructed through a human-AI collaborative pipeline. Novel metrics evaluate both answer correctness and visual-evidence grounding. Experiments conclude that current VLMs struggle with causal questions primarily because of limited ability to construct precise and grounded causal chains.
Significance. If the annotations prove valid, the benchmark and metrics would usefully isolate causal-chain reasoning from surface perception and spurious correlation, providing a concrete testbed for improving VLM transparency and robustness in video. The fine-grained spatio-temporal grounding annotations are a concrete contribution that could support future work on verifiable reasoning.
major comments (3)
- [§3] §3 (Dataset Construction): The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.
- [§5] §5 (Experiments): The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.
- [Evaluation Metrics] Evaluation Metrics section: The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'high-quality dataset' is used without supporting statistics; move any available agreement or quality numbers from the appendix into the main text.
- [Table 1] Table 1 or equivalent: Clarify whether the 2,066 questions are unique or include multiple questions per video, and report the distribution of chain lengths.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below. We agree that additional quantitative details and clarifications will strengthen the manuscript and will incorporate them in the revision.
read point-by-point responses
-
Referee: [§3] §3 (Dataset Construction): The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.
Authors: We acknowledge the absence of reported quantitative validation metrics. The construction involved iterative human review, but agreement and error rates were not quantified in the text. In revision we will add inter-annotator agreement on a sampled subset, localization error statistics, and verification that annotated chains are minimal and causally sufficient. These additions will support the abstract's attribution. revision: yes
-
Referee: [§5] §5 (Experiments): The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.
Authors: The current experiments demonstrate low VLM performance on causal questions using the grounding metrics. To strengthen the isolation claim we will add ablations contrasting causal versus non-causal questions and report correlations between grounding scores and answer accuracy. These analyses will be included in the revised experiments section. revision: yes
-
Referee: [Evaluation Metrics] Evaluation Metrics section: The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.
Authors: We will expand the Evaluation Metrics section to explicitly specify the scoring rules for partial chain coverage, including the penalty structure relative to full causal sufficiency. This clarification will demonstrate that low scores primarily reflect reasoning limitations rather than metric artifacts. revision: yes
Circularity Check
No circularity: benchmark and metrics constructed independently from new annotations and evaluations
full rationale
The paper introduces a new dataset (2,066 questions over 1,015 videos) and novel evaluation metrics via a human-AI pipeline and empirical testing on VLMs. No derivation reduces to fitted inputs, self-definitions, or self-citation chains; the central claims rest on the new annotations and observed model performance rather than any equation or prior result that is redefined or refit within the work. This is the standard case of a benchmark paper whose results are externally falsifiable against the released data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human-AI collaborative annotation can produce high-quality causal chain labels using temporal segments and bounding-box tracks.
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Cg-bench: Clue-grounded question answering benchmark for long video understanding
Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg- bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075,
-
[3]
Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chao- fan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, et al. Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in Neural Informa- tion Processing Systems, 37:92554–92580, 2024. 3
work page 2024
-
[4]
Cross-modal causal rela- tion alignment for video question grounding
Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. Cross-modal causal rela- tion alignment for video question grounding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24087–24096, 2025. 3
work page 2025
-
[5]
Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, and Xihui Liu. Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench- r1.arXiv preprint arXiv:2503.24376, 2025. 2
-
[6]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 2, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025
Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025. 2, 3, 5
work page 2025
-
[9]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 3, 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Causalvqa: A physically grounded causal reasoning benchmark for video models
Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Am- mar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025. 3
-
[12]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3
work page 2025
-
[13]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Egoexobench: A benchmark for first-and third-person view video understanding in mllms
Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025. 3
- [15]
-
[16]
Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 6
work page 2025
-
[17]
Tongyi Lab. Qwen3-vl: Large multimodal language mod- els by alibaba cloud.https://huggingface.co/ collections/Qwen/qwen3- vl, 2025. Model avail- able at Hugging Face. Accessed: 2025-11-12. 2, 6
work page 2025
-
[18]
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Tvqa+: Spatio-temporal grounding for video question answering
Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020. 2, 3
work page 2020
-
[20]
Jiangtong Li, Li Niu, and Liqing Zhang. From representa- tion to reasoning: Towards both evidence and commonsense reasoning for video question-answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21273–21282, 2022. 3
work page 2022
-
[21]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Mvbench: A comprehensive multi-modal video understand- ing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3
work page 2024
-
[23]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024. 2
work page 2024
-
[25]
Video-chatgpt: Towards detailed video un- derstanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 2
work page 2024
-
[26]
Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 3
-
[27]
Gpt-5.https://openai.com, 2025
OpenAI. Gpt-5.https://openai.com, 2025. Large language model. 6
work page 2025
-
[28]
Causal inference in statistics: An overview
Judea Pearl. Causal inference in statistics: An overview
-
[29]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. En- hancing video-llm reasoning via agent-of-thoughts distilla- tion.arXiv preprint arXiv:2412.01694, 2024. 3
-
[31]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2
work page 2024
-
[32]
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025. 2
-
[34]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6, 13
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3
-
[36]
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 3
work page 2021
-
[38]
Can i trust your answer? visually grounded video question answering
Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 2, 3
work page 2024
-
[39]
Mimo-vl technical report, 2025
LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. 6
work page 2025
-
[40]
Video question answer- ing via gradually refined attention over appearance and mo- tion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. InProceedings of the 25th ACM international confer- ence on Multimedia, pages 1645–1653, 2017. 3
work page 2017
-
[41]
Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli´c. Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025. 2
-
[42]
Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, et al. Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025. 3
-
[43]
Discovering the real association: Multimodal causal rea- soning in video question answering
Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal rea- soning in video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19027–19036, 2023. 3
work page 2023
-
[44]
Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 3
-
[45]
Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 12745–12752, 2025. 3
work page 2025
-
[46]
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025
Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025. 2
-
[48]
Llava- next: A strong zero-shot video understanding model, 2024
Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model, 2024. 2, 6
work page 2024
-
[49]
Where does it exist: Spatio-temporal video grounding for multi-form sentences
Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 10668–10677, 2020. 2, 5
work page 2020
-
[50]
Llamafac- tory: Unified efficient fine-tuning of 100+ language mod- els
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafac- tory: Unified efficient fine-tuning of 100+ language mod- els. InProceedings of the 62nd Annual Meeting of the As- sociation for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...
work page 2024
-
[51]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 2 CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering Supplementary Material
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
9: CaST-Bench Samples from Each Category Sec
Appendix Overview The organization of the appendix is as follows: Sec. 9: CaST-Bench Samples from Each Category Sec. 10: Details of Data Annotation Pipeline Sec. 11: Details of Experiment Setup Sec. 13: Case Studies and Failure Analysis Sec. 14: Social Impact, License, and Access
-
[53]
CaST-Bench Samples from Each Category Due to page limitation in the main manuscript, we show more examples regarding all question types, as follows. •Causal ExplanationQuestions that explain the reasons (why) or mechanisms (how) behind actions or events. –Why questions (reasons), shown in Fig. 9 –How questions (mechanisms), shown in Fig. 10 •Counterfactua...
-
[54]
Details of Data Annotation Pipeline 10.1. Video Selection Our benchmark targets causal reasoning in realistic, clut- tered scenes where multiple actors interact over time. As discussed in Sec. 4, carefully curating the raw video pool is essential: studio footage or single-actor clips often lack the competing causal cues and spatial ambiguity needed to str...
-
[55]
2) A version of the original image where the background is blurred to isolate the target instance
The original image where the target instance is marked with a green outline (the outline is an overlay, not part of the object). 2) A version of the original image where the background is blurred to isolate the target instance. Objective: Write exactly one English description that refers only to the target instance and its scene/context in the original im...
-
[56]
**Text Description**: A sentence identifying the target object and its surrounding scene and context
-
[57]
**Video Clip**: A silent video focused on the target object. The object is highlighted with a green border for tracking purposes. **Task**: Generate a time-stamped log detailing the specific dynamics of the specified object shown in the video. **Rules**: - Source of Truth: The video clip is the source of truth. The text input is for context only. If a fra...
-
[58]
Evaluation Prompt All evaluated VLMs shared a single unified prompt for video QA
Details of Experiment Setup 11.1. Evaluation Prompt All evaluated VLMs shared a single unified prompt for video QA. For the multiple-choice setting, the exact prompt is provided in Prompt 7. 11.2. VLM Hyperparameter Configuration We configure all VLMs withmax new tokensset to 2048, limiting each sample to at most 2,048 generated to- kens (excluding the in...
work page 2048
-
[59]
Evaluation Suite 12.1. Grounded Causal Chain Evaluation Evaluating the correctness of a predicted causal chain is fundamentally harder than checking a single grounding tar- get. A model must recover every actor that participates in the causal process, align their evidences across different time ranges, and ensure the supporting boxes stay faithful to the ...
-
[60]
A causal question about a video event {question}
-
[61]
The ground-truth conclusion answer {gt_answer}
-
[62]
The ground-truth causal reasoning process {gt_reasoning}
-
[63]
A test-taker model’s generated conclusion answer {pred_answer}
-
[64]
Assign a separate score (0{10) for each dimension according to the standards below
A test-taker model’s generated causal reasoning process {pred_reasoning} Task: Evaluate the test-taker model’s reasoning across four dimensions. Assign a separate score (0{10) for each dimension according to the standards below. Evaluation Dimensions:
-
[65]
Answer Conclusion Correctness * Compare only the model’s conclusion answer to the ground-truth conclusion answer; ignore any reasoning content. * Judge semantic equivalence, polarity, entity/attribute correctness, and numeric/unit consistency; penalize contradictions, material vagueness, or hedging that alters commitment
-
[66]
Causal Chain Logical Consistency * Evaluate only the generated answer and reasoning; do not reference ground-truth answer or ground-truth reasoning. * Verify that causes precede effects and that the causal sequence is minimal yet sufficient to explain the answer. * Identify any logical leaps, post-hoc reasoning, or teleological claims lacking justificatio...
-
[67]
Evidence Coverage & Completeness * Use the ground-truth causal reasoning as the reference. Check that the generated causal reasoning includes and aligns with its key entities, events, moments, and causal steps. * Evaluate recall of essential causal steps and contextual conditions relative to the ground truth. * Penalize missing core components, contradict...
-
[68]
Evidence{Conclusion Overall Justification * Consider the generated answer and the generated reasoning together: does the provided reasoning justify the stated answer, and do both align with the ground truth overall? * Assess the logical coherence from evidence to conclusion, calibration of confidence, and global plausibility. Scoring Standards for each di...
-
[69]
How do customers know where to line up?
Case Studies and Failure Analysis Beyond the quantitative ablation studies and error analysis presented in the main paper, we here perform a qualitative analysis of the case studies and failure patterns exhibited by the evaluated models. Vulnerability to Spurious Visual ConfoundersA core design principle of CaST-Bench is the inclusion of distrac- tors tha...
-
[70]
A": "Because the person in the long dark coat told the child to stop moving
Social Impact, License, and Access 14.1. Broader Impact CaST-Bench advances the field of VLMs by shifting the fo- cus from surface-level perception to deep, grounded causal reasoning, a capability essential for sophisticated video analysis and anticipation tasks. By mandating that mod- els validate their answers with explicit spatio-temporal evi- dence, t...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.