EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Do Video-LLMs have consistent temporal understanding when videos capture the same event from different viewpoints? To study this question, we introduce EgoExo-Con(sistency), a benchmark of synchronized egocentric and exocentric video pairs with human-refined queries that ensure all concepts are visible in both viewpoints. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; and (2) when naively finetuned on synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates superior temporal understanding, especially in cross-view consistency. All resources are available at https://minjoong507.github.io/projects/EgoExo-Con/
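The abstract says the benchmark scores consistency across viewpoints in addition to per-view correctness, but it does not spell out the metric. The sketch below shows one plausible reading for the Temporal Grounding task, under two assumptions that are not from the paper: a prediction counts as correct when its temporal IoU with the ground-truth span reaches a threshold, and a pair counts as consistent only when both the egocentric and the exocentric prediction are correct. The names `PairResult`, `temporal_iou`, and `score_pairs`, and the 0.5 threshold, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class PairResult:
    ego_pred: Span  # span predicted from the egocentric video
    exo_pred: Span  # span predicted from the exocentric video
    gold: Span      # shared ground truth (assumes the pair is synchronized)


def score_pairs(results: List[PairResult], thresh: float = 0.5) -> Dict[str, float]:
    """Per-view accuracy plus the fraction of pairs correct in BOTH views.

    Assumption: "consistent" means correct in both viewpoints at once.
    """
    n = len(results)
    if n == 0:
        return {"ego_acc": 0.0, "exo_acc": 0.0, "consistency": 0.0}
    ego = exo = both = 0
    for r in results:
        ego_ok = temporal_iou(r.ego_pred, r.gold) >= thresh
        exo_ok = temporal_iou(r.exo_pred, r.gold) >= thresh
        ego += ego_ok
        exo += exo_ok
        both += ego_ok and exo_ok
    return {"ego_acc": ego / n, "exo_acc": exo / n, "consistency": both / n}
```

Under this definition the consistency score can never exceed the lower of the two per-view accuracies. That matches the abstract's observation that consistency results fall well below single-view performance.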
Forward citations
Cited by 3 Pith papers
- EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
  EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
- Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
  SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...
- EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
  A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.