CoVR-R: Reason-Aware Composed Video Retrieval

Alaa Mostafa Lasheen; Dmitry Demidov; Fahad Khan; Omkar Thawakar; Rao Muhammad Anwer; Sai Prasanna Teja Reddy Bogireddy; Vaishnav Potlapalli; Viswanatha Reddy Gajjala

arxiv: 2603.20190 · v2 · pith:3B4EOGX2new · submitted 2026-03-20 · 💻 cs.CV


Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

classification 💻 cs.CV

keywords video · after-effects · covr · reasoning · retrieval · edit · benchmark · causal



Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on Recall@K and particularly excels on implicit-effect subsets. Automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
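The abstract reports results as recall at K. As a minimal sketch of how that metric is typically computed, assuming each query has exactly one correct target video and the retriever returns a ranked list of candidate IDs, the function below counts the fraction of queries whose target appears in the top K. The function name, the video IDs, and the toy rankings are hypothetical and are not taken from the paper or its released code.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int,
) -> float:
    """Fraction of queries whose target video is among the top-k candidates.

    Assumes one ground-truth target per query (an assumption of this sketch).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one target per ranked list")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not target_ids:
        return 0.0
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Hypothetical toy data: three queries, each with a ranked candidate list.
rankings = [
    ["v3", "v1", "v7"],  # target v1 is ranked second
    ["v2", "v9", "v4"],  # target v4 is ranked third
    ["v5", "v6", "v8"],  # target v0 is not retrieved at all
]
targets = ["v1", "v4", "v0"]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, targets, k):.3f}")
```

Benchmark tables usually report this value as a percentage, so a score such as 48.78 Recall@1 corresponds to a fraction of 0.4878 here.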

This paper has not been read by Pith yet.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReCoVR: Closing the Loop in Interactive Composed Video Retrieval
cs.IR 2026-05 unverdicted novelty 6.0

ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking
cs.CV 2026-05 unverdicted novelty 5.0

R^3 is a zero-shot pipeline that generates reasoning traces to augment composed video queries, fuses scores via agreement-gated residual, and re-ranks candidates for the CoVR-R challenge.

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning
cs.CV 2026-06 unverdicted novelty 4.0

A training-free composed video retrieval pipeline using DINOv3 for candidate selection and video-LLM reasoning achieves 48.78 Recall@1 and 51.48 Recall@5 on the CVPR 2026 challenge test set.

Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R
cs.CV 2026-05 unverdicted novelty 3.0

Dual-route top-k retrieval with 1v1 VLM reranking reaches 95.28 R@1 on CoVR-R hidden test by merging text and DFN visual routes then using conservative pairwise VLM comparisons.