arxiv: 2603.20190 · v2 · pith:3B4EOGX2new · submitted 2026-03-20 · 💻 cs.CV

CoVR-R: Reason-Aware Composed Video Retrieval

classification 💻 cs.CV
keywords video · after-effects · covr · reasoning · retrieval · edit · benchmark · causal

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
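
The abstract reports results in terms of recall at K. As a minimal sketch of how that metric is commonly computed for retrieval with one ground-truth target per query, the snippet below counts a query as a hit when its target appears among the top K ranked candidates. The function name `recall_at_k`, the dictionary layout, and the toy video IDs are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Dict, Sequence


def recall_at_k(
    rankings: Dict[str, Sequence[str]],
    targets: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose target video is in the top-k ranked candidates.

    rankings: query id -> candidate video ids, best match first (hypothetical layout).
    targets:  query id -> the single ground-truth target video id.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not targets:
        return 0.0
    hits = sum(
        1
        for query_id, target in targets.items()
        # A query with no ranking is treated as a miss.
        if target in rankings.get(query_id, ())[:k]
    )
    return hits / len(targets)


# Toy data (made up for illustration): three composed queries.
rankings = {
    "q1": ["v3", "v1", "v2"],
    "q2": ["v2", "v4", "v1"],
    "q3": ["v5", "v6", "v7"],
}
targets = {"q1": "v1", "q2": "v1", "q3": "v9"}

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, targets, k):.3f}")
```

On this toy data no target is ranked first, one of three targets is in the top 2, and two of three are in the top 3. Scores such as "48.78 Recall@1" in the listing below are the same kind of quantity expressed as a percentage, though each paper's exact evaluation protocol may differ.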

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

    cs.IR 2026-05 unverdicted novelty 6.0

    ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.

  2. R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

    cs.CV 2026-05 unverdicted novelty 5.0

    R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

  3. Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

    cs.CV 2026-06 unverdicted novelty 4.0

    Training-free composed video retrieval pipeline using DINOv3 for candidate selection and video-LLM reasoning achieves 48.78 Recall@1 and 51.48 Recall@5 on the CVPR 2026 challenge test set.

  4. Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R

    cs.CV 2026-05 unverdicted novelty 3.0

    Dual-route top-k retrieval with 1v1 VLM reranking reaches 95.28 R@1 on CoVR-R hidden test by merging text and DFN visual routes then using conservative pairwise VLM comparisons.