pith. sign in

arxiv: 2605.19846 · v3 · pith:QQ76WJIFnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI· cs.CL

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Pith reviewed 2026-06-30 18:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords fine-grained video understandingvision-language modelshuman activity understandingvideo question answering benchmarkmulti-person spatial reasoningmodular agent frameworklong-form video analysis
0
0 comments X

The pith

Open-source VLMs underperform on fine-grained human activity understanding while FineAgent boosts their results on FineBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FineBench, a benchmark of nearly 200,000 dense QA pairs from 64 long videos, to evaluate how well vision-language models grasp subtle human movements, interactions, and object manipulations. Tests reveal that open-source models lag behind proprietary ones, with particular difficulty in spatial reasoning among multiple people and spotting small differences in actions. The authors then introduce FineAgent, a framework that adds localization and description steps to improve VLM outputs on this benchmark. This setup matters for developing AI that can handle real-world scenarios needing precise activity interpretation. If correct, it highlights specific weaknesses in current models and offers a way to address them.

Core claim

FineBench is introduced with 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos focusing on detailed person movement, person interaction, and object manipulation. Evaluation shows proprietary models like GPT-5 perform respectably but open-source VLMs significantly underperform, especially with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. FineAgent, a modular framework leveraging a Localizer and a Descriptor, consistently improves the performance of various open VLMs on FineBench.

What carries the argument

FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor to improve fine-grained video reasoning.

If this is right

  • FineBench serves as a rigorous testbed for future research into fine-grained human-centric video understanding.
  • FineAgent provides a practical approach to enhance reasoning capabilities in current VLMs.
  • Improvements from FineAgent indicate that modular additions can address weaknesses in spatial and subtle distinction tasks.
  • Open-source models have significant room to improve to match proprietary performance on fine-grained tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar localization and description modules could be tested on non-human activity video tasks to see if gains generalize.
  • The focus on multi-person spatial reasoning points to a potential need for VLMs to better model group dynamics explicitly.
  • High annotation density in the benchmark may encourage creation of more grounded training datasets for VLMs.

Load-bearing premise

The dense QA annotations in FineBench accurately reflect true fine-grained distinctions in human movements and interactions without significant annotation errors or biases.

What would settle it

An experiment where open-source VLMs achieve comparable performance to proprietary models on FineBench without modifications like FineAgent would challenge the identified performance gaps.

Figures

Figures reproduced from arXiv: 2605.19846 by Gueter Josmy Faure, Hung-Ting Su, Jia-Fong Yeh, Min-Hung Chen, Winston H. Hsu.

Figure 1
Figure 1. Figure 1: (a) Examples of question types in FineBench which go beyond summarization to cover person posture, person-object interaction, and person-person interaction. (b) The capture of temporal evolution of interaction labels across frames, emphasizing spatial granularity (e.g., distinguish individuals in the same frame) and temporal granularity (e.g., resolving transitions between similar but distinct actions). Ab… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of Annotated Persons per Keyframe. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: VLM performance analysis on FineBench detailing accuracy variations. (a) Performance degradation with increasing number [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Workflow of FineAgent. It begins with (1) prompt ac￾tivation for the Localizer and Descriptor. (2) The Localizer and Descriptor, both Foundation models, provide bounding box coor￾dinates and textual captions. (3) Finally, the VLM uses this pro￾cessed information during inference. interactions compared to object-centric actions. To ad￾dress these limitations, we propose FineAgent, a modular framework design… view at source ↗
read the original abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FineBench, a human-centric VQA benchmark with 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 min each), targeting fine-grained person movement, interactions, object manipulation, and compositional actions. Evaluation of VLMs shows proprietary models (e.g., GPT-5) perform respectably while open-source VLMs lag, especially on spatial reasoning in multi-person scenes. The authors propose FineAgent, a modular framework with Localizer and Descriptor modules, and report consistent performance gains for open VLMs on FineBench. Code and project page are released.

Significance. If the benchmark annotations prove reliable, FineBench supplies a large-scale, dense testbed for fine-grained human activity understanding that existing benchmarks lack, and FineAgent demonstrates a practical, modular route to improve spatial/multi-person reasoning in current VLMs. The public release of code and annotations supports reproducibility and follow-on work.

major comments (2)
  1. [Dataset construction] Dataset construction section: the paper provides no inter-annotator agreement statistics, no expert re-labeling of a held-out subset, and no error analysis on the 199,420 QA pairs. This is load-bearing for the central claim that open-source VLMs underperform on spatial reasoning in multi-person scenes, because annotation noise, ambiguous wording, or visually irresolvable distinctions could produce the observed gaps.
  2. [Evaluation] Evaluation section: no details are given on how frame-level grounding is enforced in the dense QA pairs or on statistical significance testing of the reported performance deltas between models and between FineAgent variants.
minor comments (2)
  1. [Abstract] The abstract refers to 'GPT-5'; clarify the exact proprietary models and versions used in the experiments.
  2. [FineAgent framework] FineAgent is described at a high level; explicit pseudocode or a diagram showing the Localizer–Descriptor–VLM pipeline would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of annotation quality and evaluation details.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the paper provides no inter-annotator agreement statistics, no expert re-labeling of a held-out subset, and no error analysis on the 199,420 QA pairs. This is load-bearing for the central claim that open-source VLMs underperform on spatial reasoning in multi-person scenes, because annotation noise, ambiguous wording, or visually irresolvable distinctions could produce the observed gaps.

    Authors: We agree that explicit validation of annotation reliability is essential to support claims about model performance gaps. In the revised version we will add: (i) inter-annotator agreement statistics (Cohen’s kappa and percentage agreement) computed on a randomly sampled subset of 5,000 QA pairs annotated by three independent annotators; (ii) results of expert re-labeling on a held-out set of 2,000 pairs by two domain experts; and (iii) a concise error analysis categorizing sources of potential noise (ambiguous wording, visually irresolvable cases, etc.). These additions will be placed in a new subsection of the dataset construction section. revision: yes

  2. Referee: [Evaluation] Evaluation section: no details are given on how frame-level grounding is enforced in the dense QA pairs or on statistical significance testing of the reported performance deltas between models and between FineAgent variants.

    Authors: We acknowledge the need for greater transparency. The revised manuscript will expand the evaluation section with: (i) a description of how frame-level grounding is enforced (each QA pair is tied to a specific temporal window of 1–3 seconds and spatial bounding boxes provided during annotation, with verification that the question text references only information visible inside that window); and (ii) statistical significance testing (paired bootstrap resampling with 10,000 iterations and reported p-values) for all reported performance deltas between models and between FineAgent ablations. These details will be added without altering the experimental results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and framework with no derivations or self-referential reductions

full rationale

The paper introduces FineBench (a dataset of 199k QA pairs on 64 videos) and FineAgent (a modular VLM enhancement) via empirical evaluation. No equations, fitted parameters, or derivation chains exist that could reduce predictions to inputs by construction. Claims rest on external model performance measurements and annotations rather than any self-definitional or self-citation load-bearing steps. The annotation quality assumption is a standard empirical risk, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and framework paper with no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5825 in / 997 out tokens · 26148 ms · 2026-06-30T18:07:55.741176+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

    cs.CV 2026-06 unverdicted novelty 7.0

    NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).

Reference graph

Works this paper leans on

30 extracted references · 13 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 3, 5, 6, 7, 8

  2. [2]

    Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2, 3

  3. [3]

    Hv-mmbench: Benchmark- ing mllms for human-centric video understanding.arXiv preprint arXiv:2507.04909, 2025

    Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, et al. Hv-mmbench: Benchmark- ing mllms for human-centric video understanding.arXiv preprint arXiv:2507.04909, 2025. 3

  4. [4]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 3, 6, 8

  5. [5]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024. 5

  6. [6]

    Hsu, and Shang-Hong Lai

    Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung- Ting Su, Winston H. Hsu, and Shang-Hong Lai. Hermes: temporal-coherent long-form understanding with episodes and semantics, 2024. 3

  7. [7]

    Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, and Winston H. Hsu. Moviecore: Cognitive reasoning in movies, 2025. 2

  8. [8]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever compre- hensive evaluation benchmark of multi-modal llms in video analysis.arXiv preprint arXiv:2405.21075, 2024. 3

  9. [9]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A Ross, Carl V ondrick, Car- oline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056,

  10. [10]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

  11. [11]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3

  12. [12]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 3

  13. [13]

    Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding.Advances in Neural In- formation Processing Systems, 36:46212–46244, 2023. 2, 3

  14. [14]

    SmolVLM: Redefining small and efficient multimodal models

    Andr ´es Marafioti, Orr Zohar, Miquel Farr ´e, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025. 6

  15. [15]

    Hello gpt-4o.https : / / openai

    OpenAI. Hello gpt-4o.https : / / openai . com / index/hello-gpt-4o/, 2024. [Accessed 01-11-2024]. 6

  16. [16]

    Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025

    OpenAI. Introducing gpt-5.https://openai.com/ index/introducing- gpt- 5/, 2025. [Accessed 31- 08-2025]. 5, 6

  17. [17]

    HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

    Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ash- mal Vayani, Mukund S Chettiar, Amandeep Singh, Mubarak Shah, and Deval Pandya. Humanibench: A human-centric framework for large multimodal models evaluation.arXiv preprint arXiv:2505.11454, 2025. 2

  18. [18]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2, 3

  19. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of con- text.arXiv preprint arXiv:2403.05530, 2024. 5, 6

  20. [20]

    Star: A benchmark for situated reasoning in real-world videos

    Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2024. 2, 3

  21. [21]

    Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Informa- tion Processing Systems, 37:28828–28857, 2024

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Informa- tion Processing Systems, 37:28828–28857, 2024. 3

  22. [22]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 2, 3

  23. [23]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016. 2, 3

  24. [24]

    xgen-mm (blip-3): A family of open large multimodal models.arXiv preprint arXiv:2408.08872, 2024

    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yu- tong Dai, Michael S Ryoo, et al. xgen-mm (blip-3): A family of open large multimodal models.arXiv preprint arXiv:2408.08872, 2024. 6

  25. [25]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 3, 6, 8

  26. [26]

    mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

    Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl3: Towards long image-sequence understanding in multi-modal large language models.arXiv preprint arXiv:2408.04840, 2024. 3, 6, 8

  27. [27]

    mplug- owl2: Revolutionizing multi-modal large language model with modality collaboration

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, and Fei Huang. mplug- owl2: Revolutionizing multi-modal large language model with modality collaboration. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13040–13051, 2024. 6

  28. [28]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yuet- ing Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 9127–9134, 2019. 2, 3

  29. [29]

    Evf-sam: Early vision-language fusion for text-prompted segment anything model, 2024

    Yuxuan Zhang, Tianheng Cheng, Lianghui Zhu, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Evf-sam: Early vision-language fusion for text-prompted segment anything model, 2024. 7

  30. [30]

    HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

    Ting Zhou, Daoyuan Chen, Qirui Jiao, Bolin Ding, Yaliang Li, and Ying Shen. Humanvbench: Exploring human-centric video understanding capabilities of mllms with synthetic benchmark data.arXiv preprint arXiv:2412.17574, 2024. 3