SER: Learning to Ground Video Reasoning with Semantic Evidence Rewards
Pith reviewed 2026-06-26 00:14 UTC · model grok-4.3
The pith
SER replaces IoU overlap checks with a referee VLM that scores evidence relevance and localization for video reasoning training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SER reformulates spatio-temporal evidence grounding as a constrained verification task. A referee VLM evaluates model-generated evidence claims on two axes, relevance and localization quality, while a temporal penalty discourages loose timing. This design reduces reliance on dense box annotations and supports end-to-end training on standard video QA data. On the V-STAR benchmark, SER reports 49.6% mLGM, a 3.0-point improvement over the evidence-grounded baseline Open-o3-Video.
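The abstract does not give the exact reward formula, so the following is only a minimal sketch of how two referee scores and a temporal penalty could be combined. Everything in it is an assumption made for illustration: the names `RefereeVerdict`, `temporal_penalty`, and `semantic_evidence_reward`, the [0, 1] score scale, the multiplicative combination, the reading of "loose timing" as an overly wide claimed span, and the parameters `max_fraction` and `penalty_weight`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RefereeVerdict:
    """Hypothetical referee output; both scores are assumed to lie in [0, 1]."""
    relevance: float
    localization: float


def temporal_penalty(claimed_span: Tuple[float, float],
                     video_duration: float,
                     max_fraction: float = 0.5) -> float:
    """Assumed penalty in [0, 1] that grows once the claimed span covers
    more than `max_fraction` of the video (one reading of "loose timing")."""
    start, end = claimed_span
    if end < start or video_duration <= 0:
        raise ValueError("invalid span or duration")
    fraction = min(1.0, (end - start) / video_duration)
    excess = max(0.0, fraction - max_fraction)
    return excess / (1.0 - max_fraction)


def semantic_evidence_reward(verdict: RefereeVerdict,
                             claimed_span: Tuple[float, float],
                             video_duration: float,
                             penalty_weight: float = 0.5) -> float:
    """Assumed combination: product of the two referee scores minus a
    weighted temporal penalty, clipped at zero."""
    base = verdict.relevance * verdict.localization
    penalty = penalty_weight * temporal_penalty(claimed_span, video_duration)
    return max(0.0, base - penalty)


if __name__ == "__main__":
    tight = semantic_evidence_reward(RefereeVerdict(1.0, 1.0), (0.0, 10.0), 100.0)
    loose = semantic_evidence_reward(RefereeVerdict(1.0, 1.0), (0.0, 75.0), 100.0)
    print(tight, loose)
```

In this sketch a claim that spans most of the video is rewarded less than a tight one even when the referee scores are identical, which is the qualitative behaviour the temporal penalty is described as encouraging.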
What carries the argument
Semantic Evidence Reward (SER), which substitutes a referee VLM checker for pixel-level IoU computation when scoring evidence claims.
If this is right
- Training depends far less on dense spatio-temporal box annotations.
- Answer accuracy and evidence grounding improve together rather than trading off.
- The reward is less sensitive to small boundary perturbations than IoU-based alternatives.
- Models can be trained end-to-end on existing video QA datasets.
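The boundary-sensitivity point in the list above can be illustrated with plain temporal IoU on 1-D intervals. This is a generic sketch, not the paper's reward code; the function name `temporal_iou` and the example spans are chosen for illustration only.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two closed time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # The same 0.5 s shift applied to a short and a long ground-truth span.
    short_iou = temporal_iou((10.5, 12.5), (10.0, 12.0))
    long_iou = temporal_iou((10.5, 30.5), (10.0, 30.0))
    print(f"short span IoU: {short_iou:.3f}")
    print(f"long span IoU:  {long_iou:.3f}")
```

A half-second offset costs the 2-second span far more IoU than the 20-second span, even though the evidence shown is arguably just as relevant in both cases. That is the kind of geometric sensitivity a semantic referee score is meant to dampen.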
Where Pith is reading between the lines
- The same referee-VLM pattern could be tested on image or audio reasoning tasks where semantic alignment matters more than exact overlap.
- Training stability may vary with the choice or size of the referee model.
- Combining SER with other reward signals such as answer correctness could produce further gains.
Load-bearing premise
A separate referee VLM can judge the relevance and localization quality of the model's generated evidence claims reliably and without systematic bias.
What would settle it
Human raters evaluating the same evidence claims show low agreement with the referee VLM's relevance and localization scores, or the performance gain disappears when a different referee model is substituted.
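One way such a human-versus-referee comparison could be scored is with chance-corrected agreement. The sketch below computes Cohen's kappa over categorical labels; the choice of kappa, the function name `cohens_kappa`, and the toy labels are assumptions for illustration and are not data or analysis from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy labels only (1 = evidence judged relevant, 0 = not relevant).
    human = [1, 1, 0, 0, 1, 0, 1, 0]
    referee = [1, 1, 0, 0, 1, 0, 0, 1]
    print(cohens_kappa(human, referee))
```

A kappa near zero on a held-out set of evidence claims would be the "low agreement" outcome described above, while a value close to one would support the load-bearing premise.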
Original abstract
Video MLLMs often struggle with fine-grained spatio-temporal reasoning, sometimes generating correct answers based on irrelevant frames or objects. Although outputting spatio-temporal evidence during reasoning is a promising direction, existing RL frameworks typically rely on geometry-only (IoU) rewards, which can be sensitive to boundary perturbations and overlook semantic alignment. To address this, we propose Semantic Evidence Reward (SER), which reformulates spatio-temporal evidence grounding as a constrained verification task. Instead of computing pixel-level overlap, SER uses a referee VLM as a local checker to evaluate model-generated evidence claims across two dimensions: relevance and localization quality, combined with a temporal penalty. This design reduces the reliance on dense box annotations and enables training directly on standard video QA data. On the V-STAR benchmark, SER achieves 49.6% mLGM, improving by 3.0 points over the strong evidence-grounded baseline Open-o3-Video, demonstrating its potential in enhancing both answer accuracy and evidence grounding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Semantic Evidence Reward (SER) for video MLLMs, reformulating spatio-temporal evidence grounding as a constrained verification task. SER replaces IoU-based rewards with scores from a referee VLM that evaluates generated evidence on relevance and localization quality, combined with a temporal penalty. This reduces reliance on dense box annotations and allows training on standard video QA data. On the V-STAR benchmark, SER reports 49.6% mLGM, a 3.0-point gain over the Open-o3-Video baseline.
Significance. If the referee VLM scores prove reliable, SER could meaningfully advance evidence-grounded video reasoning by reducing annotation requirements and mitigating the semantic misalignment of geometry-only rewards. The reported mLGM improvement suggests the approach may jointly boost answer accuracy and evidence quality, though this conclusion hinges on the referee VLM, a proxy for evidence quality that the paper has not yet validated.
major comments (2)
- [Abstract (SER design paragraph)] The central claim that the referee VLM serves as an accurate local checker for relevance and localization is load-bearing for the entire RL reward and training loop, yet no validation against human judgments, inter-annotator agreement, or calibration metrics is provided.
- [Abstract (results paragraph)] The 3.0-point mLGM gain to 49.6% is presented as demonstrating effectiveness, but the absence of error bars, run counts, baseline implementation details, or referee-choice ablations makes it impossible to determine whether the improvement is robust or attributable to the proposed reward.
minor comments (1)
- The abstract refers to Open-o3-Video as a 'strong evidence-grounded baseline' without specifying its evidence mechanism or providing a direct comparison table.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting two key areas: validation of the referee VLM and statistical robustness of the reported gains. We address each comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: [Abstract (SER design paragraph)] The central claim that the referee VLM serves as an accurate local checker for relevance and localization is load-bearing for the entire RL reward and training loop, yet no validation against human judgments, inter-annotator agreement, or calibration metrics is provided.
  Authors: We agree this is a substantive gap. The manuscript does not include direct human validation or calibration of the referee VLM scores. While we drew on prior VLM verification literature for the design, the absence of such metrics leaves the reliability of the reward signal unverified in this work. In the revised manuscript we will add a targeted human study on a held-out set of evidence claims, reporting agreement and calibration statistics to support the central claim.
  Revision: yes
- Referee: [Abstract (results paragraph)] The 3.0-point mLGM gain to 49.6% is presented as demonstrating effectiveness, but the absence of error bars, run counts, baseline implementation details, or referee-choice ablations makes it impossible to determine whether the improvement is robust or attributable to the proposed reward.
  Authors: The reported 49.6% mLGM and 3.0-point gain are from single training runs, and the current manuscript provides limited implementation details or ablations on referee VLM choice. We acknowledge that this limits assessment of robustness. In revision we will expand the results section with additional run details where compute permits, clearer baseline reproduction notes, and a referee-VLM ablation to isolate the contribution of the proposed reward.
  Revision: partial
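The error bars requested in the exchange above could be reported along the following lines once several training runs exist. This is a minimal sketch: the function name `summarize_runs` and the `RunSummary` fields are hypothetical, and the numbers in the demo are arbitrary placeholders, not results from the paper.

```python
import statistics
from typing import NamedTuple, Sequence


class RunSummary(NamedTuple):
    n_runs: int
    mean: float
    sample_std: float  # standard deviation with Bessel's correction (n - 1)
    std_error: float   # sample_std / sqrt(n_runs)


def summarize_runs(scores: Sequence[float]) -> RunSummary:
    """Summarize one metric across independent training runs (seeds)."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed to estimate spread")
    mean = statistics.mean(scores)
    sample_std = statistics.stdev(scores)
    return RunSummary(len(scores), mean, sample_std, sample_std / len(scores) ** 0.5)


if __name__ == "__main__":
    # Arbitrary placeholder values; substitute per-seed metric scores.
    placeholder_scores = [1.0, 2.0, 3.0, 4.0]
    summary = summarize_runs(placeholder_scores)
    print(f"{summary.mean:.2f} +/- {summary.sample_std:.2f} over {summary.n_runs} runs")
```

With a single run, as the authors acknowledge, no spread can be estimated at all, which is why the function refuses fewer than two scores.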
Circularity Check
No circularity detected in derivation
The paper defines SER via an external referee VLM that scores generated evidence on relevance and localization (plus a temporal penalty), then uses those scores as the RL reward. The referee is a separate model, so the scoring function does not depend on the policy model's parameters: although it is applied to the model's outputs, the reward is not fitted to the target data, not self-defined in terms of the prediction, and not justified by self-citation chains. The reported 49.6% mLGM is an empirical benchmark result rather than a quantity forced by the reward definition itself. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the referee VLM provides accurate and unbiased judgments of relevance and localization quality.