EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Angela Yao; Byoung-Tak Zhang; Junbin Xiao; Junghyun Kim; Minjoon Jung

arxiv: 2510.26113 · v2 · pith:UZ62NATTnew · submitted 2025-10-30 · 💻 cs.CV · cs.AI

Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao

classification: cs.CV, cs.AI

keywords: temporal, viewpoints, consistency, egoexo-con, understanding, across, consistent, models

Do Video-LLMs have consistent temporal understanding when videos capture the same event from different viewpoints? To study this question, we introduce EgoExo-Con(sistency), a benchmark of synchronized egocentric and exocentric video pairs with human-refined queries that ensure all concepts are visible in both viewpoints. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with cross-view results far worse than their single-view performance; and (2) when naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates superior temporal understanding, especially in cross-view consistency. All resources are available at https://minjoong507.github.io/projects/EgoExo-Con/
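The distinction the abstract draws between per-view correctness and cross-view consistency can be illustrated with a small sketch for the Temporal Grounding task. This is not the benchmark's official scoring code. It assumes, hypothetically, that a prediction counts as correct when its temporal IoU with the ground-truth span reaches a threshold, that synchronized views share one ground-truth span per query, and that a query is "consistent" only when both views are correct. The names `temporal_iou`, `view_scores`, and `iou_threshold` are illustrative.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def view_scores(
    ego_preds: List[Span],
    exo_preds: List[Span],
    gts: List[Span],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, float]:
    """Per-view accuracy and the fraction of queries correct in BOTH views."""
    if not gts or not (len(ego_preds) == len(exo_preds) == len(gts)):
        raise ValueError("inputs must be non-empty and of equal length")
    ego_hits = [temporal_iou(p, g) >= iou_threshold for p, g in zip(ego_preds, gts)]
    exo_hits = [temporal_iou(p, g) >= iou_threshold for p, g in zip(exo_preds, gts)]
    both = [e and x for e, x in zip(ego_hits, exo_hits)]
    n = len(gts)
    return {
        "ego": sum(ego_hits) / n,
        "exo": sum(exo_hits) / n,
        "consistent": sum(both) / n,
    }


# Toy example with three queries (values are made up for illustration).
scores = view_scores(
    ego_preds=[(0.0, 10.0), (5.0, 15.0), (0.0, 4.0)],
    exo_preds=[(0.0, 10.0), (20.0, 30.0), (0.0, 4.0)],
    gts=[(0.0, 10.0), (5.0, 15.0), (10.0, 20.0)],
)
print(scores)
```

Under this definition the consistent score can never exceed the lower of the two single-view scores, which is one way to read the abstract's observation that consistency results fall below single-view performance.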

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
cs.CV · 2026-05 · unverdicted · novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
cs.CV · 2026-05 · conditional · novelty 7.0

SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV · 2026-05 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.