FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Pith reviewed 2026-06-30 18:07 UTC · model grok-4.3
The pith
Open-source VLMs underperform on fine-grained human activity understanding while FineAgent boosts their results on FineBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FineBench is introduced with 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos focusing on detailed person movement, person interaction, and object manipulation. Evaluation shows proprietary models like GPT-5 perform respectably but open-source VLMs significantly underperform, especially with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. FineAgent, a modular framework leveraging a Localizer and a Descriptor, consistently improves the performance of various open VLMs on FineBench.
What carries the argument
FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor to improve fine-grained video reasoning.
If this is right
- FineBench serves as a rigorous testbed for future research into fine-grained human-centric video understanding.
- FineAgent provides a practical approach to enhance reasoning capabilities in current VLMs.
- Improvements from FineAgent indicate that modular additions can address weaknesses in spatial and subtle distinction tasks.
- Open-source models have significant room to improve to match proprietary performance on fine-grained tasks.
Where Pith is reading between the lines
- Similar localization and description modules could be tested on non-human activity video tasks to see if gains generalize.
- The focus on multi-person spatial reasoning points to a potential need for VLMs to better model group dynamics explicitly.
- High annotation density in the benchmark may encourage creation of more grounded training datasets for VLMs.
Load-bearing premise
The dense QA annotations in FineBench accurately reflect true fine-grained distinctions in human movements and interactions without significant annotation errors or biases.
What would settle it
An experiment where open-source VLMs achieve comparable performance to proprietary models on FineBench without modifications like FineAgent would challenge the identified performance gaps.
Figures
read the original abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FineBench, a human-centric VQA benchmark with 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 min each), targeting fine-grained person movement, interactions, object manipulation, and compositional actions. Evaluation of VLMs shows proprietary models (e.g., GPT-5) perform respectably while open-source VLMs lag, especially on spatial reasoning in multi-person scenes. The authors propose FineAgent, a modular framework with Localizer and Descriptor modules, and report consistent performance gains for open VLMs on FineBench. Code and project page are released.
Significance. If the benchmark annotations prove reliable, FineBench supplies a large-scale, dense testbed for fine-grained human activity understanding that existing benchmarks lack, and FineAgent demonstrates a practical, modular route to improve spatial/multi-person reasoning in current VLMs. The public release of code and annotations supports reproducibility and follow-on work.
major comments (2)
- [Dataset construction] Dataset construction section: the paper provides no inter-annotator agreement statistics, no expert re-labeling of a held-out subset, and no error analysis on the 199,420 QA pairs. This is load-bearing for the central claim that open-source VLMs underperform on spatial reasoning in multi-person scenes, because annotation noise, ambiguous wording, or visually irresolvable distinctions could produce the observed gaps.
- [Evaluation] Evaluation section: no details are given on how frame-level grounding is enforced in the dense QA pairs or on statistical significance testing of the reported performance deltas between models and between FineAgent variants.
minor comments (2)
- [Abstract] The abstract refers to 'GPT-5'; clarify the exact proprietary models and versions used in the experiments.
- [FineAgent framework] FineAgent is described at a high level; explicit pseudocode or a diagram showing the Localizer–Descriptor–VLM pipeline would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of annotation quality and evaluation details.
read point-by-point responses
-
Referee: [Dataset construction] Dataset construction section: the paper provides no inter-annotator agreement statistics, no expert re-labeling of a held-out subset, and no error analysis on the 199,420 QA pairs. This is load-bearing for the central claim that open-source VLMs underperform on spatial reasoning in multi-person scenes, because annotation noise, ambiguous wording, or visually irresolvable distinctions could produce the observed gaps.
Authors: We agree that explicit validation of annotation reliability is essential to support claims about model performance gaps. In the revised version we will add: (i) inter-annotator agreement statistics (Cohen’s kappa and percentage agreement) computed on a randomly sampled subset of 5,000 QA pairs annotated by three independent annotators; (ii) results of expert re-labeling on a held-out set of 2,000 pairs by two domain experts; and (iii) a concise error analysis categorizing sources of potential noise (ambiguous wording, visually irresolvable cases, etc.). These additions will be placed in a new subsection of the dataset construction section. revision: yes
-
Referee: [Evaluation] Evaluation section: no details are given on how frame-level grounding is enforced in the dense QA pairs or on statistical significance testing of the reported performance deltas between models and between FineAgent variants.
Authors: We acknowledge the need for greater transparency. The revised manuscript will expand the evaluation section with: (i) a description of how frame-level grounding is enforced (each QA pair is tied to a specific temporal window of 1–3 seconds and spatial bounding boxes provided during annotation, with verification that the question text references only information visible inside that window); and (ii) statistical significance testing (paired bootstrap resampling with 10,000 iterations and reported p-values) for all reported performance deltas between models and between FineAgent ablations. These details will be added without altering the experimental results. revision: yes
Circularity Check
No circularity: empirical benchmark and framework with no derivations or self-referential reductions
full rationale
The paper introduces FineBench (a dataset of 199k QA pairs on 64 videos) and FineAgent (a modular VLM enhancement) via empirical evaluation. No equations, fitted parameters, or derivation chains exist that could reduce predictions to inputs by construction. Claims rest on external model performance measurements and annotations rather than any self-definitional or self-citation load-bearing steps. The annotation quality assumption is a standard empirical risk, not a circularity issue.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 5, 6, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2, 3
-
[3]
Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, et al. Hv-mmbench: Benchmark- ing mllms for human-centric video understanding.arXiv preprint arXiv:2507.04909, 2025. 3
-
[4]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 3, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Vlmevalkit: An open-source toolkit for evaluating large multi-modality models
Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024. 5
2024
-
[6]
Hsu, and Shang-Hong Lai
Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung- Ting Su, Winston H. Hsu, and Shang-Hong Lai. Hermes: temporal-coherent long-form understanding with episodes and semantics, 2024. 3
2024
-
[7]
Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, and Winston H. Hsu. Moviecore: Cognitive reasoning in movies, 2025. 2
2025
-
[8]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever compre- hensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Ava: A video dataset of spatio-temporally localized atomic visual actions
Chunhui Gu, Chen Sun, David A Ross, Carl V ondrick, Car- oline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056,
-
[10]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Mvbench: A comprehensive multi-modal video understand- ing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3
2024
-
[12]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3
2023
-
[13]
Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023. 2, 3
2023
-
[14]
SmolVLM: Redefining small and efficient multimodal models
Andr ´es Marafioti, Orr Zohar, Miquel Farr ´e, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Hello gpt-4o.https : / / openai
OpenAI. Hello gpt-4o.https : / / openai . com / index/hello-gpt-4o/, 2024. [Accessed 01-11-2024]. 6
2024
-
[16]
Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025
OpenAI. Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025. [Accessed 31- 08-2025]. 5, 6
2025
-
[17]
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation
Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ash- mal Vayani, Mukund S Chettiar, Amandeep Singh, Mubarak Shah, and Deval Pandya. Humanibench: A human-centric framework for large multimodal models evaluation.arXiv preprint arXiv:2505.11454, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Moviechat: From dense token to sparse memory for long video understanding
Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2, 3
2024
-
[19]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Star: A benchmark for situated reasoning in real-world videos
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2024. 2, 3
2024
-
[21]
Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Informa- tion Processing Systems, 37:28828–28857, 2024
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Informa- tion Processing Systems, 37:28828–28857, 2024. 3
2024
-
[22]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 2, 3
2021
-
[23]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 2, 3
2016
-
[24]
xgen-mm (blip-3): A family of open large multimodal models.arXiv preprint arXiv:2408.08872, 2024
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yu- tong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models.arXiv preprint arXiv:2408.08872, 2024. 6
-
[25]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 3, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024. 3, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
mplug- owl2: Revolutionizing multi-modal large language model with modality collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug- owl2: Revolutionizing multi-modal large language model with modality collaboration. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13040–13051, 2024. 6
2024
-
[28]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yuet- ing Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 9127–9134, 2019. 2, 3
2019
-
[29]
Evf-sam: Early vision-language fusion for text-prompted segment anything model, 2024
Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model, 2024. 7
2024
-
[30]
Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, and Ying Shen. Humanvbench: Exploring human-centric video understanding capabilities of mllms with synthetic benchmark data.arXiv preprint arXiv:2412.17574, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.