hub

arXiv preprint arXiv:2504.06958 (2025)

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang · 2025 · arXiv 2504.06958

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassing GPT-5.4.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

From Priors to Perception: Grounding Video-LLMs in Physical Reality

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.

Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

Video-ToC: Video Tree-of-Cue Reasoning

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 3 refs

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly 2x faster convergence on video reasoning benchmarks.

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

cs.SD · 2026-04-15 · unverdicted · novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

citing papers explorer

Showing 17 of 17 citing papers.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 34
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding cs.CV · 2026-05-13 · unverdicted · none · ref 13
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection cs.CV · 2026-05-11 · unverdicted · none · ref 26
MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassing GPT-5.4.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 16
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
From Priors to Perception: Grounding Video-LLMs in Physical Reality cs.CV · 2026-05-06 · unverdicted · none · ref 19
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs cs.CV · 2026-05-02 · unverdicted · none · ref 5
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
Co-Evolving Policy Distillation cs.LG · 2026-04-29 · unverdicted · none · ref 24
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
Video-ToC: Video Tree-of-Cue Reasoning cs.CV · 2026-04-22 · unverdicted · none · ref 6
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding cs.CV · 2026-04-16 · unverdicted · none · ref 16
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding cs.CV · 2026-04-06 · unverdicted · none · ref 16
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning cs.CV · 2026-04-06 · unverdicted · none · ref 22
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering cs.CV · 2026-04-02 · unverdicted · none · ref 10
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 23 · 3 links
VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly 2x faster convergence on video reasoning benchmarks.
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt cs.SD · 2026-04-15 · unverdicted · none · ref 20
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering cs.CV · 2026-04-09 · unverdicted · none · ref 16
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation cs.CV · 2026-05-08 · unverdicted · none · ref 12
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 20
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

arXiv preprint arXiv:2504.06958 (2025)

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer