
arxiv: 2510.26113 · v2 · pith:UZ62NATTnew · submitted 2025-10-30 · 💻 cs.CV · cs.AI

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

keywords temporal · viewpoints · consistency · egoexo-con · understanding · across · consistent · models

Do Video-LLMs maintain consistent temporal understanding when videos capture the same event from different viewpoints? To study this question, we introduce EgoExo-Con(sistency), a benchmark of synchronized egocentric and exocentric video pairs with human-refined queries that ensure all concepts are visible in both viewpoints. EgoExo-Con emphasizes two temporal understanding tasks, Temporal Verification and Temporal Grounding, and evaluates not only correctness but also consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performance; and (2) when naively finetuned on synchronized videos of both viewpoints, models show improved consistency but often underperform those trained on a single view. To address these limitations, we propose View-GRPO, a novel reinforcement learning framework that strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. View-GRPO demonstrates superior temporal understanding, particularly in cross-view consistency. All resources are available at https://minjoong507.github.io/projects/EgoExo-Con/
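The idea of scoring consistency in addition to per-view correctness can be illustrated with a minimal sketch for the Temporal Grounding task. This is not the benchmark's official metric: the `Span` representation, the function names, the IoU threshold of 0.5, and the assumption that one ground-truth span applies to both synchronized views are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a time span: (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def cross_view_scores(
    ego_preds: List[Span],
    exo_preds: List[Span],
    golds: List[Span],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Dict[str, float]:
    """Per-view accuracy plus the share of queries answered correctly in BOTH views.

    Assumes the videos are synchronized, so a single gold span is shared
    by the egocentric and exocentric prediction for each query.
    """
    if not golds or not (len(ego_preds) == len(exo_preds) == len(golds)):
        raise ValueError("inputs must be non-empty and of equal length")

    ego_hits = [temporal_iou(p, g) >= threshold for p, g in zip(ego_preds, golds)]
    exo_hits = [temporal_iou(p, g) >= threshold for p, g in zip(exo_preds, golds)]
    # A query counts as consistent only if both viewpoints are correct.
    both_hits = [e and x for e, x in zip(ego_hits, exo_hits)]

    n = len(golds)
    return {
        "ego": sum(ego_hits) / n,
        "exo": sum(exo_hits) / n,
        "consistent": sum(both_hits) / n,
    }
```

Under this definition the consistency score can never exceed the lower of the two single-view scores, which is one way to read the abstract's observation that consistency results fall below single-view performance.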



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.

  2. Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

    cs.CV · 2026-05 · conditional · novelty 7.0

    SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...

  3. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.