pith. machine review for the scientific record.

arxiv: 2604.23407 · v1 · submitted 2026-04-25 · 💻 cs.CV · cs.AI

Recognition: unknown

PushupBench: Your VLM is not good at counting pushups

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · repetition counting · video understanding · temporal reasoning · fine-tuning transfer · benchmark evaluation

The pith

Vision-language models fail to count repetitions in video, but training them to count improves broader temporal understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a benchmark of 446 pushup videos to measure how accurately vision-language models can count repeated actions. Frontier models reach only 42 percent exact accuracy while smaller models hover near 6 percent and often guess the most frequent count instead of analyzing timing. Fine-tuning on just one thousand counting examples produces gains across multiple video benchmarks that test understanding of sequences and events. This pattern indicates that repetition counting serves as a compact test and training signal for the temporal reasoning skills needed in general video comprehension.

Core claim

Large vision-language models can identify actions in video but cannot reliably count how often those actions repeat. On the introduced set of 446 clips averaging 36.7 seconds each, the strongest model achieves 42.1 percent exact count accuracy. Open-source 4B models score around 6 percent, no better than supervised baselines that simply predict the modal count. Fine-tuning on 1k counting examples raises scores on MVBench, PerceptionTest, and TVBench, suggesting that practice with precise repetition tracking transfers to wider video understanding tasks.
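To make the comparison concrete, here is a minimal sketch, not the paper's code, of the two quantities at stake: exact-count accuracy and the score earned by always predicting the modal count. The counts in the example are toy numbers, not PushupBench statistics.

```python
# Minimal sketch, not the paper's code: exact-count accuracy and the
# modal-count baseline that weaker models are said to collapse toward.
# The toy counts below are illustrative, not PushupBench statistics.
from collections import Counter

def exact_accuracy(predictions, ground_truth):
    """Fraction of clips whose predicted count matches the label exactly."""
    hits = sum(int(p == g) for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

def modal_count_baseline(train_counts, eval_counts):
    """Accuracy of always predicting the most frequent training count."""
    modal = Counter(train_counts).most_common(1)[0][0]
    return exact_accuracy([modal] * len(eval_counts), eval_counts)

gt    = [10, 12, 10, 8, 10, 15]    # ground-truth repetition counts (toy)
preds = [10, 11, 10, 8, 9, 15]     # model's predicted counts (toy)
print(exact_accuracy(preds, gt))        # ≈ 0.667
print(modal_count_baseline(gt, gt))     # 0.5: what always guessing the mode earns
```

A model whose score sits near the second number is exploiting the count distribution rather than analyzing the video.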

What carries the argument

The PushupBench collection of long-form video clips, used both to expose counting failures and to supply the 1k-sample fine-tuning data that produces the cross-benchmark transfer gains.

If this is right

  • Models whose counting accuracy rises will also improve on other tasks that require tracking event duration and order.
  • Evaluation that reports only overall accuracy can hide models that succeed by exploiting dataset biases rather than genuine sequence analysis.
  • A small counting dataset can act as an efficient auxiliary objective for strengthening temporal capabilities without requiring massive new video collections.
  • The persistent gap between frontier and open models on exact counting reveals a current limitation in how pretraining encodes precise time-based quantification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other simple temporal probes, such as ordering events or estimating durations, could serve as similarly lightweight training signals for video models.
  • Benchmark suites might add counting subtasks as a quick diagnostic filter before running full suites of video evaluations.
  • Pretraining objectives that emphasize quantity and timing more explicitly could reduce the need for later fine-tuning on counting.

Load-bearing premise

The measured gains on other video benchmarks after counting fine-tuning result specifically from improved temporal reasoning instead of generic adaptation or data overlap.

What would settle it

An experiment that applies the same volume of fine-tuning data to a non-counting task and still obtains equal or larger gains on the same video benchmarks would falsify the claim that counting practice is the operative factor.
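As a sketch of that decision rule, under the assumption of entirely placeholder scores (nothing below is a result from the paper), the comparison reduces to a per-benchmark check of deltas:

```python
# Hypothetical sketch of the proposed control: compare per-benchmark deltas
# from the counting fine-tune against a matched non-counting fine-tune with
# the same data volume. Every score below is a placeholder, not a paper result.
def deltas(after, before):
    return {k: round(after[k] - before[k], 2) for k in before}

base     = {"MVBench": 55.0, "PerceptionTest": 60.0, "TVBench": 45.0}   # placeholder
counting = {"MVBench": 57.2, "PerceptionTest": 61.9, "TVBench": 49.5}   # placeholder
control  = {"MVBench": 56.0, "PerceptionTest": 60.4, "TVBench": 45.8}   # placeholder

d_count, d_ctrl = deltas(counting, base), deltas(control, base)
# The counting-as-proxy reading fails if the control matches or beats the
# counting fine-tune on every benchmark.
falsified = all(d_ctrl[k] >= d_count[k] for k in base)
print(d_count, d_ctrl, "falsified:", falsified)
```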

Figures

Figures reproduced from arXiv: 2604.23407 by Jiaqi Su, Jiarun Chen, Karun Sharma, Shengzhi Li, Shichao Pei.

Figure 1: Predicted vs. ground truth repetition counts.
Figure 2: Ground truth distribution in training data (left) …
Figure 3: Exercise type diversity in PushupBench. …
Figure 4: On-screen counter removal for benchmark curation. Left: original frame with rep counter “03” visible. Right: same frame with the counter edited out to prevent text-based shortcuts.
Figure 5: Examples of on-screen clues enabling reward …
Figure 6: Reward function ablation over 180 training steps (initial 968-sample experiment). We compare three …
Original abstract

Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score ~6%, matching supervised baselines. We show that accuracy alone misleads -- weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning. PushupBench is incorporated in lmms-eval (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted at pushupbench.com/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PushupBench, a benchmark of 446 long-form video clips (avg. 36.7s) for evaluating repetition counting in VLMs. It reports that the best frontier model reaches 42.1% exact accuracy while open-source 4B models score ~6% (matching supervised baselines). Accuracy is shown to be misleading because weaker models exploit the modal count rather than reason temporally. Fine-tuning on 1k counting samples yields gains on MVBench (+2.15), PerceptionTest (+1.88), and TVBench (+4.54), suggesting counting as a proxy for broader temporal reasoning. The benchmark is integrated into lmms-eval.

Significance. If the transfer results hold after appropriate controls, the work would be significant: it supplies a focused, challenging benchmark for a core temporal skill that current VLMs handle poorly, demonstrates a low-cost fine-tuning route to broader video understanding gains, and contributes an open evaluation harness. The modal-count exploitation analysis is a useful diagnostic insight.

major comments (2)
  1. [Experiments section (transfer results)] The claim that counting fine-tuning improves temporal reasoning specifically is load-bearing for the final suggestion in the abstract, yet no control is described for generic video-adaptation effects (e.g., a matched non-counting video fine-tuning baseline with the same 1k samples and compute). Without it, the +2.15 / +1.88 / +4.54 deltas cannot be attributed to counting rather than instruction tuning or dataset statistics.
  2. [Evaluation protocol and results] The reported exact accuracies and transfer deltas lack error bars, confidence intervals, or an explicit statement of the evaluation protocol (prompt templates, frame sampling, exact-match criterion for variable-length clips). This weakens the quantitative claims in the abstract and the table/figure reporting the 42.1% and ~6% figures.
minor comments (2)
  1. [Abstract] The phrase 'matching supervised baselines' should name the specific baselines and their scores for immediate comparison.
  2. [Dataset description] Confirm that PushupBench clips have no overlap with the target benchmarks (MVBench, PerceptionTest, TVBench) or provide an explicit audit; this is a minor but necessary clarification for the transfer interpretation.
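One hedged starting point for such an overlap audit is an exact-duplicate check via content hashing; the sketch below assumes hypothetical local directories and would miss re-encoded or trimmed clips, which need perceptual or near-duplicate matching instead.

```python
# Hypothetical first-pass overlap audit: flag byte-identical video files
# shared between PushupBench and a target benchmark. Paths are placeholders;
# re-encoded or trimmed duplicates would need perceptual hashing instead.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk=1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def hashes(root: str) -> dict:
    return {sha256_of(p): p for p in Path(root).rglob("*.mp4")}

pushup = hashes("data/pushupbench")        # hypothetical local copy
target = hashes("data/mvbench_videos")     # hypothetical local copy
overlap = set(pushup) & set(target)
print(f"{len(overlap)} byte-identical clips", [target[h].name for h in overlap])
```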

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important areas for strengthening the manuscript's claims on transfer effects and quantitative rigor. We address each point below and will revise the paper accordingly to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: Experiments section (transfer results): the claim that counting fine-tuning improves temporal reasoning specifically is load-bearing for the final suggestion in the abstract, yet no control is described for generic video-adaptation effects (e.g., a matched non-counting video fine-tuning baseline with the same 1k samples and compute). Without it, the +2.15 / +1.88 / +4.54 deltas cannot be attributed to counting rather than instruction tuning or dataset statistics.

    Authors: We agree that this control is essential to substantiate the specific benefit of counting-based fine-tuning over generic video adaptation. In the revised manuscript, we will add a matched baseline experiment: fine-tuning the same model on 1k non-counting video samples (drawn from general video QA or captioning tasks with equivalent duration and compute) and report the resulting transfer deltas on MVBench, PerceptionTest, and TVBench for direct comparison. This will allow us to isolate whether the observed gains stem from counting as a temporal proxy. revision: yes

  2. Referee: Evaluation protocol and results: the reported exact accuracies and transfer deltas lack error bars, confidence intervals, or an explicit statement of the evaluation protocol (prompt templates, frame sampling, exact-match criterion for variable-length clips). This weakens the quantitative claims in the abstract and Table/Figure reporting the 42.1% and ~6% figures.

    Authors: We acknowledge that more detailed reporting is needed for reproducibility and to support the quantitative claims. We will revise the manuscript to include: (1) error bars or confidence intervals (via bootstrap resampling over the 446 clips or multiple prompt variations where feasible); (2) an explicit evaluation protocol section detailing prompt templates, frame sampling (uniform at 1 fps for long clips), and the exact-match criterion (numerical accuracy within a small tolerance for phrasing variations). These details will be added to the main text and an appendix. revision: yes
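Read as a sketch of the promised protocol rather than the authors' actual choices, the first-integer parsing rule and a percentile bootstrap over the clips could look like the following; the regex and the 1,000-resample setting are illustrative assumptions.

```python
# Minimal sketch, under assumptions: parse a count from free-form model output
# and bootstrap a confidence interval for exact accuracy. The regex and the
# resample count are illustrative choices, not the paper's protocol.
import random
import re

def parse_count(response: str):
    """Take the first integer in the model's reply as its predicted count."""
    m = re.search(r"\d+", response)
    return int(m.group()) if m else None

def accuracy_with_ci(predictions, ground_truth, n_boot=1000, alpha=0.05, seed=0):
    """Exact-match accuracy with a percentile-bootstrap confidence interval."""
    rng = random.Random(seed)
    correct = [int(p == g) for p, g in zip(predictions, ground_truth)]
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo, hi = accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

gt  = [12, 8, 20, 15]                                    # toy labels
raw = ["I count 12 pushups.", "Roughly 10", "20 reps", "about 15 push-ups"]
preds = [parse_count(r) for r in raw]
print(accuracy_with_ci(preds, gt))                       # e.g. (0.75, (0.25, 1.0))
```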

Circularity Check

0 steps flagged

No circularity: empirical dataset, evaluations, and benchmark deltas are independent measurements

full rationale

The paper introduces a new dataset (PushupBench) and reports direct accuracy measurements on frontier and open-source VLMs, followed by fine-tuning experiments that produce observable deltas on external public benchmarks (MVBench, PerceptionTest, TVBench). No equations, parameters, or derivations are presented; the transfer claim is an empirical observation rather than a reduction to fitted inputs or self-referential definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The results are falsifiable against the stated benchmarks and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions about video annotation quality and benchmark independence; no new free parameters or invented entities are introduced beyond the dataset itself.

axioms (1)
  • domain assumption: Video clips are correctly labeled with ground-truth repetition counts.
    Required for any accuracy or transfer measurement to be meaningful.

pith-pipeline@v0.9.0 · 5480 in / 1246 out tokens · 30923 ms · 2026-05-08T08:29:47.207977+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, and Yuki M. Asano. 2024. Lost in time: A new temporal benchmark for VideoLLMs. arXiv preprint arXiv:2410.07752

  2. [2]

    Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. 2020. Counting out time: Class agnostic video repetition counting in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10387--10396

  3. [3]

    Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, and Andrew Zisserman. 2024. OVR: A dataset for open vocabulary temporal repetition counting in videos. arXiv preprint arXiv:2407.17085

  4. [4]

    Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, and Jiahao Zhang. 2025. Your vision-language model can't even count to 20: Exposing the failures of VLMs in compositional counting. arXiv preprint arXiv:2510.04401

  5. [5]

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. 2025. MotionBench: Benchmarking and improving fine-grained video motion understanding for vision language models. arXiv preprint arXiv:2501.02955

  6. [6]

    Huazhang Hu, Sixun Dong, Yiqun Zhao, Dongze Lian, Zhengxin Li, and Shenghua Gao. 2022. TransRAC: Encoding multi-scale temporal correlation with transformers for repetitive action counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19013--19022

  7. [7]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, and others. 2024. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195--22206

  8. [8]

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, and others. 2023. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748--42761

  9. [9]

    Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, and Donald E. Brown. 2025. Can vision-language models count? a synthetic benchmark and analysis of attention-based interventions. arXiv preprint arXiv:2511.17722

  10. [10]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, and others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  11. [11]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279--1297

  12. [12]

    Jiankang Wang and others. 2025. SpaceVLLM: Endowing multimodal large language model with spatio-temporal video grounding capability. arXiv preprint arXiv:2503.13983

  13. [13]

    Ziyu Yao, Xuxin Cheng, and Yuexian Zou. 2023. PoseRAC: Pose saliency transformer for repetitive action counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14024--14034

  14. [14]

    Huaidong Zhang, Xuemiao Xu, Guoqiang Han, and Shengfeng He. 2020. Context-aware and scale-insensitive temporal repetition counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 670--678


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...