Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 5years
2026 5verdicts
UNVERDICTED 5roles
background 2polarities
background 2representative citing papers
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
citing papers explorer
-
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
-
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
-
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
-
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.