Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Fatih Porikli; Hwanjun Song; Jihwan Bang; Jisu Shin; Juntae Lee; Kyuwoong Hwang; Nicole Hee-Yeon Kim; Yuho Lee

arxiv: 2606.13141 · v1 · pith:QE6KHWNGnew · submitted 2026-06-11 · 💻 cs.AI

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Yuho Lee , Jisu Shin , Nicole Hee-Yeon Kim , Jihwan Bang , Juntae Lee , Kyuwoong Hwang , Fatih Porikli , Hwanjun Song This is my paper

Pith reviewed 2026-06-27 06:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords VideoRAGRetrieval-Augmented GenerationLong VideosEgocentric VideoChunk-adaptive RerankingMultimodal RetrievalBenchmark Construction

0 comments

The pith

CARVE retrieves and supplies video chunks under individually chosen modality-granularity settings rather than one fixed setting per query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that VideoRAG progress is blocked by benchmarks that let answers be produced without video evidence and by retrieval methods that lock every chunk to the same modality and temporal scale. It introduces V-RAGBench, a set of query-evidence-answer triplets that forces retrieval to be tested on its own. CARVE runs multiple retrievers in parallel across different configurations and uses chunk-adaptive reranking to pick the best configuration for each chunk separately. The selected chunks then reach the generator already carrying their individual configuration choices, producing an interleaved evidence set that query-level methods cannot create. If correct, this shows that respecting per-chunk differences improves end-to-end performance over uniform retrieval choices.

Core claim

CARVE outperforms eight recent VideoRAG baselines by running parallel retrievers across modality-granularity configurations and applying chunk-adaptive reranking to select a winning configuration for each chunk, so that the generator receives evidence interleaving multiple configurations instead of a single one shared across the query.

What carries the argument

Chunk-adaptive reranking that selects a per-chunk winning configuration from parallel retrievers across different modality and granularity settings.

If this is right

Retrieval and generation can be evaluated independently because each triplet explicitly links query, required chunks, and answer.
Evidence reaching the generator can interleave chunks retrieved under different modality-granularity pairs.
Performance gains arise specifically from allowing configuration decisions to vary at the chunk level rather than the query level.
A simple parallel-retrieval plus reranking pipeline suffices to beat prior single-configuration VideoRAG methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Video retrieval systems may gain by treating each chunk as its own retrieval problem instead of making one decision for the entire query.
The same chunk-level selection idea could be applied after retrieval, for example by letting the generator request alternative configurations for weak chunks.
V-RAGBench could be extended to measure how often real user queries actually need multiple configurations within one answer.

Load-bearing premise

The benchmark queries are built so the supplied evidence chunks are both necessary and sufficient for the answer, and the chunk-level configuration choices made at retrieval stay optimal once the same chunks reach the generator.

What would settle it

An experiment that forces every chunk in CARVE's output to use the single best query-level configuration and measures whether answer quality drops compared with the original interleaved version.

Figures

Figures reproduced from arXiv: 2606.13141 by Fatih Porikli, Hwanjun Song, Jihwan Bang, Jisu Shin, Juntae Lee, Kyuwoong Hwang, Nicole Hee-Yeon Kim, Yuho Lee.

**Figure 1.** Figure 1: The overview of CARVE: Stage 1 builds a candidate pool via chunk-wise parallel retrieval, while Stage 2 performs chunk-adaptive reranking. The final top-k evidence with their winning configuration is passed to the generator in a modality-interleaved form. than as principled retrieval. Although retrieval is the core component on which all these frameworks ultimately rest, its validation is left at a shallow… view at source ↗

**Figure 2.** Figure 2: Overview of the V-RAGBench Construction Pipeline. We construct [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of the queries not grounded with a unique evidence chunk. [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗

**Figure 4.** Figure 4: Heatmaps of Recall@5 and nDCG@5 under different final retrieval depths. [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt for QA Generation. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt for GPT Answerability Filtering. You have NO frames or visual input. Try to answer using ONLY common sense and the question. Question: {question} [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for GPT Blind Check Filtering. You are a strict evaluator judging whether a candidate answer to a question about video content is correct, using a reference answer as the ground truth. Question: {question} Reference Answer: {reference} Candidate Answer: {candidate} Evaluation criteria: - YES: The candidate answer captures the key factual content of the reference answer. It does not need to match wor… view at source ↗

**Figure 8.** Figure 8: Prompt for LLM Judge. You are analyzing an egocentric (first-person) video clip from someone's day. Describe what the person is seeing, doing, and experiencing during this clip. Use first-person perspective ("I am...", "I see...") and be specific about: - Activities and actions being performed - Objects being handled or interacted with - Environment and location details - Any notable events or transitions … view at source ↗

**Figure 9.** Figure 9: Prompt for Text Clip Memory Generation. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗

**Figure 10.** Figure 10: Prompt for Text Key Frame Memory Generation. [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗

**Figure 11.** Figure 11: Prompt for Text Key Frame and Text Clip Answering. [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗

**Figure 12.** Figure 12: Prompt for Text Combined Answering. You are a helpful assistant answering questions about video content. The video segments above are from a first-person view video, arranged in chronological order. Question: {question} Based on the video content above, provide a clear and concise answer. If there is not enough information to answer, say so [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗

**Figure 13.** Figure 13: Prompt for Visual Clip Answering. 29 [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

**Figure 14.** Figure 14: Prompt for Visual Key Frame Answering. You are a helpful assistant answering questions about video content. The video segments and keyframe images above are from a first-person view video, arranged in chronological order. Question: {question} Based on all the visual content above, provide a clear and concise answer. If there is not enough information to answer, say so [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗

**Figure 15.** Figure 15: Prompt for Visual Combined Answering. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_15.png] view at source ↗

read the original abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

V-RAGBench forces actual video use in answers and CARVE adapts retrieval config per chunk, but the abstract shows no numbers so the size of any gain is unclear.

read the letter

The paper's main contribution is V-RAGBench, a set of query-evidence chunk-answer triplets for long videos that prevents answering without the video, and CARVE, which runs several retrievers in parallel across different modality and granularity settings then picks the best one per chunk via reranking.

This is new in how it decouples the evaluation and allows different retrieval setups for different parts of the video. It does a good job pointing out that existing VideoRAG work often uses one configuration for an entire query, which misses chunk-level differences, and that benchmarks can be solved without retrieval.

The approach is straightforward: parallel retrieval plus chunk-wise selection, leading to interleaved configurations in the final evidence passed to the generator. That part is a reasonable incremental step.

The soft spots are in the evaluation. The abstract mentions outperformance over eight baselines but gives no numbers, error bars, or dataset sizes, so it's impossible to gauge how solid the results are. The stress-test point is on target here—the reranking happens at retrieval time, but there's no evidence shown that the chosen configurations are also best for the generator stage. If the advantage is mostly from running multiple retrievers rather than the adaptive part, the claim weakens. Also, without full details it's unclear if the triplets truly make retrieval necessary and sufficient.

This paper is for people working on retrieval-augmented systems for video. A reader in multimodal RAG would get value from the benchmark design and the idea of per-chunk adaptation. It deserves a serious referee because the problems it targets are practical and the method is easy to reproduce, even if the current writeup leaves the strength of the gains open.

Recommendation: Send it out for peer review so the experiments can be properly checked.

Referee Report

2 major / 1 minor

Summary. The paper identifies two limitations in VideoRAG for long egocentric videos: benchmarks that permit answers without video content, and methods that apply a single modality-granularity configuration per query. It introduces V-RAGBench, a dataset of query-evidence chunk-answer triplets designed for decoupled retrieval and generation evaluation, and proposes CARVE, which runs parallel retrievers over multiple configurations and applies chunk-adaptive reranking to select a winning configuration per chunk. Each chunk is then passed to the generator under its retrieval-selected configuration, producing an interleaved evidence set. The authors report that CARVE outperforms eight recent VideoRAG baselines and that the resulting interleaved configurations cannot be achieved by query-level approaches.

Significance. If the empirical claims hold under rigorous evaluation, the work would be significant for VideoRAG by demonstrating the value of chunk-level rather than query-level adaptation across modalities and granularities. The introduction of V-RAGBench directly targets a known evaluation flaw and enables more faithful assessment. The CARVE approach is conceptually simple and leverages existing retrievers, which strengthens its potential impact if the chunk-adaptive propagation is shown to be the source of gains rather than an ensemble effect.

major comments (2)

The central claim that CARVE's advantage stems from propagating chunk-specific configuration decisions from retrieval to generation rests on the assumption that the reranking objective used at retrieval time aligns with the needs of the downstream generator. The provided abstract and description contain no ablation or analysis (e.g., comparing retrieval-selected vs. generator-optimal configurations per chunk) that would confirm this alignment; without such evidence the reported outperformance could arise simply from running eight parallel retrievers rather than from the adaptive mechanism.
The soundness assessment notes that the abstract supplies no quantitative results, error bars, or dataset statistics. The full manuscript must include these (with held-out test splits and statistical significance) in the experimental section to substantiate the claim of outperforming eight baselines; otherwise the central empirical result cannot be verified as load-bearing.

minor comments (1)

The abstract would benefit from a brief quantitative statement of the performance gains (e.g., average improvement across metrics and datasets) to allow readers to gauge the magnitude of the reported outperformance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of our central claim and evaluation rigor. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: The central claim that CARVE's advantage stems from propagating chunk-specific configuration decisions from retrieval to generation rests on the assumption that the reranking objective used at retrieval time aligns with the needs of the downstream generator. The provided abstract and description contain no ablation or analysis (e.g., comparing retrieval-selected vs. generator-optimal configurations per chunk) that would confirm this alignment; without such evidence the reported outperformance could arise simply from running eight parallel retrievers rather than from the adaptive mechanism.

Authors: We agree that an explicit ablation demonstrating alignment between the retrieval-time reranking objective and downstream generator performance would strengthen the central claim regarding chunk-level propagation. The current results show overall gains and interleaved configurations, but do not isolate the adaptive mechanism from potential ensemble effects. In the revised manuscript we will add a targeted analysis on held-out data that compares CARVE-selected configurations per chunk against generator-optimal configurations (measured by downstream answer quality), to quantify the degree of alignment and rule out pure ensemble explanations. revision: yes
Referee: The soundness assessment notes that the abstract supplies no quantitative results, error bars, or dataset statistics. The full manuscript must include these (with held-out test splits and statistical significance) in the experimental section to substantiate the claim of outperforming eight baselines; otherwise the central empirical result cannot be verified as load-bearing.

Authors: The full manuscript already reports quantitative results with error bars, dataset statistics, held-out test splits, and statistical significance tests in the experimental section. We will revise the presentation to make these elements more prominent and ensure all baseline comparisons are accompanied by the requested statistical details. revision: partial

Circularity Check

0 steps flagged

No circularity; method design and empirical claims are independent

full rationale

The paper describes CARVE as running parallel retrievers across configurations followed by chunk-adaptive reranking, with each chunk then supplied under its selected configuration. This interleaved behavior is a direct, explicit consequence of the method definition rather than a derived prediction or fitted result. No equations, parameter fitting, self-citations, or uniqueness theorems appear in the provided text. The outperformance claim is presented as an empirical observation on V-RAGBench; the benchmark and method are introduced as new contributions without reducing to prior self-referential inputs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical axioms or free parameters are visible in the abstract. The method relies on the domain assumption that multiple modality-granularity configurations can be meaningfully compared at chunk level and that the reranker can reliably identify the winning configuration.

axioms (1)

domain assumption Multiple modality-granularity configurations can be run in parallel and compared per chunk without prohibitive compute cost.
Implicit in the description of CARVE running parallel retrievers.

pith-pipeline@v0.9.1-grok · 5742 in / 1285 out tokens · 13561 ms · 2026-06-27T06:29:06.665063+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

89 extracted references · 8 linked inside Pith

[1]

Rankarena: A unified platform for evaluating retrieval, reranking and rag with human and llm feedback

Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankarena: A unified platform for evaluating retrieval, reranking and rag with human and llm feedback. InCIKM, 2025

2025
[2]

A survey on rag with llms.Procedia Computer Science, 246:3781–3790, 2024

Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz. A survey on rag with llms.Procedia Computer Science, 246:3781–3790, 2024

2024
[3]

k-means++: The advantages of careful seeding

David Arthur, Sergei Vassilvitskii, et al. k-means++: The advantages of careful seeding. In Soda, volume 7, pages 1027–1035, 2007

2007
[4]

Query expansion techniques for information retrieval: a survey.Information Processing & Management, 56(5):1698–1735, 2019

Hiteshwar Kumar Azad and Akshay Deepak. Query expansion techniques for information retrieval: a survey.Information Processing & Management, 56(5):1698–1735, 2019

2019
[5]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[6]

Wear: An outdoor sports dataset for wearable and egocentric activity recognition.IMWUT, 2024

Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, and Michael Moeller. Wear: An outdoor sports dataset for wearable and egocentric activity recognition.IMWUT, 2024

2024
[7]

Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025

Mustafa Chasmai, Gauri Jagatap, Gouthaman KV , Grant Van Horn, Subhransu Maji, and Andrea Fanelli. Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025

arXiv 2025
[8]

Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. In ICCV, 2025

2025
[9]

Cg-bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075, 2024

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg-bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075, 2024

arXiv 2024
[10]

Towards multimodal lifelong understanding: A dataset and agentic baseline.arXiv preprint arXiv:2603.05484, 2026

Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, et al. Towards multimodal lifelong understanding: A dataset and agentic baseline.arXiv preprint arXiv:2603.05484, 2026

arXiv 2026
[11]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. InNeurIPS, 2024

2024
[12]

Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024
[13]

Word2passage: Word-level importance re-weighting for query expansion

Jeonghwan Choi, Minjeong Ban, Minseok Kim, and Hwanjun Song. Word2passage: Word-level importance re-weighting for query expansion. InACL, 2025

2025
[14]

A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

1960
[15]

Grounded question-answering in long egocentric videos

Shangzhe Di and Weidi Xie. Grounded question-answering in long egocentric videos. InCVPR, 2024

2024
[16]

Semantic event graphs for long-form video question answering

Aradhya Dixit and Tianxi Liang. Semantic event graphs for long-form video question answering. arXiv preprint arXiv:2601.06097, 2026

arXiv 2026
[17]

Splade: Sparse lexical and expansion model for first stage ranking

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. Splade: Sparse lexical and expansion model for first stage ranking. InSIGIR, 2021

2021
[18]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025. 10

2025
[19]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

2022
[20]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. InAAAI, 2025

2025
[21]

Cova: Text-guided composed video retrieval for audio-visual content.arXiv preprint arXiv:2601.22508, 2026

Gyuwon Han, Young Kyun Jang, and Chanho Eom. Cova: Text-guided composed video retrieval for audio-visual content.arXiv preprint arXiv:2601.22508, 2026

arXiv 2026
[22]

Dat: Dynamic alpha tuning for hybrid retrieval in retrieval- augmented generation.arXiv preprint arXiv:2503.23013, 2025

Hsin-Ling Hsu and Jengnan Tzeng. Dat: Dynamic alpha tuning for hybrid retrieval in retrieval- augmented generation.arXiv preprint arXiv:2503.23013, 2025

arXiv 2025
[23]

M-llm based video frame selection for efficient video understanding

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, and Trishul Chilimbi. M-llm based video frame selection for efficient video understanding. InCVPR, 2025

2025
[24]

Prunevid: Visual token pruning for efficient video large language models

Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. InACL, 2025

2025
[25]

Video recap: Recursive captioning of hour-long videos.arXiv preprint arXiv:2402.13250, 2024

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video recap: Recursive captioning of hour-long videos.arXiv preprint arXiv:2402.13250, 2024

arXiv 2024
[26]

Ravu: Retrieval augmented video understanding with compositional reasoning over graph.arXiv preprint arXiv:2505.03173, 2025

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Ravu: Retrieval augmented video understanding with compositional reasoning over graph.arXiv preprint arXiv:2505.03173, 2025

arXiv 2025
[27]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. InCVPR, 2017

2017
[28]

Videorag: Retrieval- augmented generation over video corpus

Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. Videorag: Retrieval- augmented generation over video corpus. InACL, 2025

2025
[29]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InEMNLP, 2023

2023
[30]

Language repository for long video understanding

Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S Ryoo. Language repository for long video understanding. InACL, 2025

2025
[31]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

2020
[32]

Video summarization with large language models

Min Jung Lee, Dayoung Gong, and Minsu Cho. Video summarization with large language models. InCVPR, 2025

2025
[33]

Tvqa: Localized, compositional video question answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. Tvqa: Localized, compositional video question answering. InEMNLP, 2018

2018
[34]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InNeurIPS, 2020

2020
[35]

Videochat: Chat-centric video understanding.Science China Information Sciences, 2025

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 2025

2025
[36]

Lla- trieval: Llm-verified retrieval for verifiable generation

Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, and Xipeng Qiu. Lla- trieval: Llm-verified retrieval for verifiable generation. InNAACL, 2024. 11

2024
[37]

Video-oasis: Rethinking evaluation of video understanding.arXiv preprint arXiv:2603.29616, 2026

Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee, Taeoh Kim, Dongyoon Wee, and Yukyung Choi. Video-oasis: Rethinking evaluation of video understanding.arXiv preprint arXiv:2603.29616, 2026

arXiv 2026
[38]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InEMNLP, 2024

2024
[39]

Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

arXiv 2024
[40]

Multi-granularity correspondence learning from long-term noisy videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, and Xi Peng. Multi-granularity correspondence learning from long-term noisy videos. InICLR, 2024

2024
[41]

Video paragraph captioning as a text summarization task

Hui Liu and Xiaojun Wan. Video paragraph captioning as a text summarization task. InACL, 2021

2021
[42]

Video-rag: Visually-aligned retrieval-augmented long video comprehension

Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, et al. Video-rag: Visually-aligned retrieval-augmented long video comprehension. InNeurIPS, 2025

2025
[43]

Query rewriting in retrieval- augmented large language models

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval- augmented large language models. InEMNLP, 2023

2023
[44]

Drvideo: Document retrieval based long video understanding

Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document retrieval based long video understanding. InCVPR, 2025

2025
[45]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024

2024
[46]

Sparse meets dense: A hybrid approach to enhance scientific document retrieval.arXiv preprint arXiv:2401.04055, 2024

Priyanka Mandikal and Raymond Mooney. Sparse meets dense: A hybrid approach to enhance scientific document retrieval.arXiv preprint arXiv:2401.04055, 2024

arXiv 2024
[47]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InNeurIPS, 2023

2023
[48]

Advancing egocentric video question answering with multimodal large language models.arXiv preprint arXiv:2504.04550, 2025

Alkesh Patel, Vibhav Chitalia, and Yinfei Yang. Advancing egocentric video question answering with multimodal large language models.arXiv preprint arXiv:2504.04550, 2025

arXiv 2025
[49]

Category-specific video summarization

Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. Category-specific video summarization. InECCV, 2014

2014
[50]

Rag-fusion: a new take on retrieval-augmented generation.arXiv preprint arXiv:2402.03367, 2024

Zackary Rackauckas. Rag-fusion: a new take on retrieval-augmented generation.arXiv preprint arXiv:2402.03367, 2024

arXiv 2024
[51]

Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, and Hyo Jin Kim. Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

arXiv 2026
[52]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InCVPR, 2024

2024
[53]

Videorag: Retrieval-augmented generation with extreme long-context videos

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. Videorag: Retrieval-augmented generation with extreme long-context videos. InKDD, 2026

2026
[54]

Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retriev- ers

Kunal Sawarkar, Abhilasha Mangal, and Shivam Raj Solanki. Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retriev- ers. InMIPR, 2024

2024
[55]

Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding

Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. InNeurIPS, 2025

2025
[56]

Agentic retrieval-augmented generation: A survey on agentic rag.arXiv preprint arXiv:2501.09136, 2025

Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag.arXiv preprint arXiv:2501.09136, 2025. 12

Pith/arXiv arXiv 2025
[57]

Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023

Pith/arXiv arXiv 2023
[58]

Egograph: Temporal knowl- edge graph for egocentric video understanding.arXiv preprint arXiv:2602.23709, 2026

Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, and Jifei Song. Egograph: Temporal knowl- edge graph for egocentric video understanding.arXiv preprint arXiv:2602.23709, 2026

arXiv 2026
[59]

Guided query refinement: Multimodal hybrid retrieval with test-time optimization

Omri Uzan, Asaf Yehudai, Eyal Shnarch, Ariel Gera, et al. Guided query refinement: Multimodal hybrid retrieval with test-time optimization. 2026

2026
[60]

Dynamic-vlm: Simple dynamic visual token compression for videollm

Han Wang, Yuxiang Nie, Yongjie Ye, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, and Can Huang. Dynamic-vlm: Simple dynamic visual token compression for videollm. InICCV, 2025

2025
[61]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[62]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. InECCV, 2024

2024
[63]

Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models

Yujing Wang, Hainan Zhang, Liang Pang, Binghui Guo, Hongwei Zheng, and Zhiming Zheng. Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models. InAAAI, 2025

2025
[64]

Infogain-rag: Boosting retrieval-augmented generation through document information gain-based reranking and filtering

Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, and Han Li. Infogain-rag: Boosting retrieval-augmented generation through document information gain-based reranking and filtering. InEMNLP, 2025

2025
[65]

Videotree: Adaptive tree-based video representation for llm reasoning on long videos

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. InCVPR, 2025

2025
[66]

Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S Ryoo, and Juan Carlos Niebles. Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

Pith/arXiv arXiv 2025
[67]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InNeurIPS, 2024

2024
[68]

C-pack: Packaged resources to advance general chinese embedding, 2023

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

2023
[69]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. InMM, 2017

2017
[70]

E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation.arXiv preprint arXiv:2508.01546, 2025

Zeyu Xu, Junkang Zhang, Qiang Wang, and Yi Liu. E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation.arXiv preprint arXiv:2508.01546, 2025

arXiv 2025
[71]

Adavideorag: Omni-contextual adaptive retrieval-augmented efficient long video under- standing.arXiv preprint arXiv:2506.13589, 2025

Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, and Dacheng Tao. Adavideorag: Omni-contextual adaptive retrieval-augmented efficient long video under- standing.arXiv preprint arXiv:2506.13589, 2025

arXiv 2025
[72]

Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

Pith/arXiv arXiv 2024
[73]

Egolife: Towards egocentric life assistant

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. InCVPR, 2025

2025
[74]

Graph-to-frame rag: Visual-space knowledge fusion for training-free and auditable video reasoning.arXiv preprint arXiv:2604.04372, 2026

Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, and Nong Xiao. Graph-to-frame rag: Visual-space knowledge fusion for training-free and auditable video reasoning.arXiv preprint arXiv:2604.04372, 2026

Pith/arXiv arXiv 2026
[75]

Generative frame sampler for long video understanding

Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, and Junnan Li. Generative frame sampler for long video understanding. InACL, 2025. 13

2025
[76]

Univer- salrag: Retrieval-augmented generation over corpora of diverse modalities and granularities

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. Univer- salrag: Retrieval-augmented generation over corpora of diverse modalities and granularities. 2026

2026
[77]

Worldmm: Dynamic multimodal memory agent for long video reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. InCVPR, 2026

2026
[78]

Rankrag: Unifying context ranking with retrieval-augmented generation in llms

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. InNeurIPS, 2024

2024
[79]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019

2019
[80]

T5gemma 2: Seeing, reading, and understanding longer.arXiv preprint arXiv:2512.14856, 2025

Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, et al. T5gemma 2: Seeing, reading, and understanding longer.arXiv preprint arXiv:2512.14856, 2025

arXiv 2025

Showing first 80 references.

[1] [1]

Rankarena: A unified platform for evaluating retrieval, reranking and rag with human and llm feedback

Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankarena: A unified platform for evaluating retrieval, reranking and rag with human and llm feedback. InCIKM, 2025

2025

[2] [2]

A survey on rag with llms.Procedia Computer Science, 246:3781–3790, 2024

Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz. A survey on rag with llms.Procedia Computer Science, 246:3781–3790, 2024

2024

[3] [3]

k-means++: The advantages of careful seeding

David Arthur, Sergei Vassilvitskii, et al. k-means++: The advantages of careful seeding. In Soda, volume 7, pages 1027–1035, 2007

2007

[4] [4]

Query expansion techniques for information retrieval: a survey.Information Processing & Management, 56(5):1698–1735, 2019

Hiteshwar Kumar Azad and Akshay Deepak. Query expansion techniques for information retrieval: a survey.Information Processing & Management, 56(5):1698–1735, 2019

2019

[5] [5]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[6] [6]

Wear: An outdoor sports dataset for wearable and egocentric activity recognition.IMWUT, 2024

Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, and Michael Moeller. Wear: An outdoor sports dataset for wearable and egocentric activity recognition.IMWUT, 2024

2024

[7] [7]

Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025

Mustafa Chasmai, Gauri Jagatap, Gouthaman KV , Grant Van Horn, Subhransu Maji, and Andrea Fanelli. Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025

arXiv 2025

[8] [8]

Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents

Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, and Yali Wang. Lvagent: Long video understanding by multi-round dynamical collaboration of mllm agents. In ICCV, 2025

2025

[9] [9]

Cg-bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075, 2024

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg-bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075, 2024

arXiv 2024

[10] [10]

Towards multimodal lifelong understanding: A dataset and agentic baseline.arXiv preprint arXiv:2603.05484, 2026

Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, et al. Towards multimodal lifelong understanding: A dataset and agentic baseline.arXiv preprint arXiv:2603.05484, 2026

arXiv 2026

[11] [11]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. InNeurIPS, 2024

2024

[12] [12]

Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024

Pith/arXiv arXiv 2024

[13] [13]

Word2passage: Word-level importance re-weighting for query expansion

Jeonghwan Choi, Minjeong Ban, Minseok Kim, and Hwanjun Song. Word2passage: Word-level importance re-weighting for query expansion. InACL, 2025

2025

[14] [14]

A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

1960

[15] [15]

Grounded question-answering in long egocentric videos

Shangzhe Di and Weidi Xie. Grounded question-answering in long egocentric videos. InCVPR, 2024

2024

[16] [16]

Semantic event graphs for long-form video question answering

Aradhya Dixit and Tianxi Liang. Semantic event graphs for long-form video question answering. arXiv preprint arXiv:2601.06097, 2026

arXiv 2026

[17] [17]

Splade: Sparse lexical and expansion model for first stage ranking

Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. Splade: Sparse lexical and expansion model for first stage ranking. InSIGIR, 2021

2021

[18] [18]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR, 2025. 10

2025

[19] [19]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InCVPR, 2022

2022

[20] [20]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. InAAAI, 2025

2025

[21] [21]

Cova: Text-guided composed video retrieval for audio-visual content.arXiv preprint arXiv:2601.22508, 2026

Gyuwon Han, Young Kyun Jang, and Chanho Eom. Cova: Text-guided composed video retrieval for audio-visual content.arXiv preprint arXiv:2601.22508, 2026

arXiv 2026

[22] [22]

Dat: Dynamic alpha tuning for hybrid retrieval in retrieval- augmented generation.arXiv preprint arXiv:2503.23013, 2025

Hsin-Ling Hsu and Jengnan Tzeng. Dat: Dynamic alpha tuning for hybrid retrieval in retrieval- augmented generation.arXiv preprint arXiv:2503.23013, 2025

arXiv 2025

[23] [23]

M-llm based video frame selection for efficient video understanding

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, and Trishul Chilimbi. M-llm based video frame selection for efficient video understanding. InCVPR, 2025

2025

[24] [24]

Prunevid: Visual token pruning for efficient video large language models

Xiaohu Huang, Hao Zhou, and Kai Han. Prunevid: Visual token pruning for efficient video large language models. InACL, 2025

2025

[25] [25]

Video recap: Recursive captioning of hour-long videos.arXiv preprint arXiv:2402.13250, 2024

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video recap: Recursive captioning of hour-long videos.arXiv preprint arXiv:2402.13250, 2024

arXiv 2024

[26] [26]

Ravu: Retrieval augmented video understanding with compositional reasoning over graph.arXiv preprint arXiv:2505.03173, 2025

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Ravu: Retrieval augmented video understanding with compositional reasoning over graph.arXiv preprint arXiv:2505.03173, 2025

arXiv 2025

[27] [27]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. InCVPR, 2017

2017

[28] [28]

Videorag: Retrieval- augmented generation over video corpus

Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. Videorag: Retrieval- augmented generation over video corpus. InACL, 2025

2025

[29] [29]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InEMNLP, 2023

2023

[30] [30]

Language repository for long video understanding

Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S Ryoo. Language repository for long video understanding. InACL, 2025

2025

[31] [31]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP, 2020

2020

[32] [32]

Video summarization with large language models

Min Jung Lee, Dayoung Gong, and Minsu Cho. Video summarization with large language models. InCVPR, 2025

2025

[33] [33]

Tvqa: Localized, compositional video question answering

Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. Tvqa: Localized, compositional video question answering. InEMNLP, 2018

2018

[34] [34]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InNeurIPS, 2020

2020

[35] [35]

Videochat: Chat-centric video understanding.Science China Information Sciences, 2025

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.Science China Information Sciences, 2025

2025

[36] [36]

Lla- trieval: Llm-verified retrieval for verifiable generation

Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, and Xipeng Qiu. Lla- trieval: Llm-verified retrieval for verifiable generation. InNAACL, 2024. 11

2024

[37] [37]

Video-oasis: Rethinking evaluation of video understanding.arXiv preprint arXiv:2603.29616, 2026

Geuntaek Lim, Minho Shim, Sungjune Park, Jaeyun Lee, Inwoong Lee, Taeoh Kim, Dongyoon Wee, and Yukyung Choi. Video-oasis: Rethinking evaluation of video understanding.arXiv preprint arXiv:2603.29616, 2026

arXiv 2026

[38] [38]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InEMNLP, 2024

2024

[39] [39]

Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

arXiv 2024

[40] [40]

Multi-granularity correspondence learning from long-term noisy videos

Yijie Lin, Jie Zhang, Zhenyu Huang, Jia Liu, Zujie Wen, and Xi Peng. Multi-granularity correspondence learning from long-term noisy videos. InICLR, 2024

2024

[41] [41]

Video paragraph captioning as a text summarization task

Hui Liu and Xiaojun Wan. Video paragraph captioning as a text summarization task. InACL, 2021

2021

[42] [42]

Video-rag: Visually-aligned retrieval-augmented long video comprehension

Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, et al. Video-rag: Visually-aligned retrieval-augmented long video comprehension. InNeurIPS, 2025

2025

[43] [43]

Query rewriting in retrieval- augmented large language models

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval- augmented large language models. InEMNLP, 2023

2023

[44] [44]

Drvideo: Document retrieval based long video understanding

Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, and Jianfei Cai. Drvideo: Document retrieval based long video understanding. InCVPR, 2025

2025

[45] [45]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InACL, 2024

2024

[46] [46]

Sparse meets dense: A hybrid approach to enhance scientific document retrieval.arXiv preprint arXiv:2401.04055, 2024

Priyanka Mandikal and Raymond Mooney. Sparse meets dense: A hybrid approach to enhance scientific document retrieval.arXiv preprint arXiv:2401.04055, 2024

arXiv 2024

[47] [47]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. InNeurIPS, 2023

2023

[48] [48]

Advancing egocentric video question answering with multimodal large language models.arXiv preprint arXiv:2504.04550, 2025

Alkesh Patel, Vibhav Chitalia, and Yinfei Yang. Advancing egocentric video question answering with multimodal large language models.arXiv preprint arXiv:2504.04550, 2025

arXiv 2025

[49] [49]

Category-specific video summarization

Danila Potapov, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. Category-specific video summarization. InECCV, 2014

2014

[50] [50]

Rag-fusion: a new take on retrieval-augmented generation.arXiv preprint arXiv:2402.03367, 2024

Zackary Rackauckas. Rag-fusion: a new take on retrieval-augmented generation.arXiv preprint arXiv:2402.03367, 2024

arXiv 2024

[51] [51]

Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, and Hyo Jin Kim. Agentic very long video understanding.arXiv preprint arXiv:2601.18157, 2026

arXiv 2026

[52] [52]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. InCVPR, 2024

2024

[53] [53]

Videorag: Retrieval-augmented generation with extreme long-context videos

Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. Videorag: Retrieval-augmented generation with extreme long-context videos. InKDD, 2026

2026

[54] [54]

Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retriev- ers

Kunal Sawarkar, Abhilasha Mangal, and Shivam Raj Solanki. Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retriev- ers. InMIPR, 2024

2024

[55] [55]

Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding

Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. InNeurIPS, 2025

2025

[56] [56]

Agentic retrieval-augmented generation: A survey on agentic rag.arXiv preprint arXiv:2501.09136, 2025

Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. Agentic retrieval-augmented generation: A survey on agentic rag.arXiv preprint arXiv:2501.09136, 2025. 12

Pith/arXiv arXiv 2025

[57] [57]

Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023

Pith/arXiv arXiv 2023

[58] [58]

Egograph: Temporal knowl- edge graph for egocentric video understanding.arXiv preprint arXiv:2602.23709, 2026

Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, and Jifei Song. Egograph: Temporal knowl- edge graph for egocentric video understanding.arXiv preprint arXiv:2602.23709, 2026

arXiv 2026

[59] [59]

Guided query refinement: Multimodal hybrid retrieval with test-time optimization

Omri Uzan, Asaf Yehudai, Eyal Shnarch, Ariel Gera, et al. Guided query refinement: Multimodal hybrid retrieval with test-time optimization. 2026

2026

[60] [60]

Dynamic-vlm: Simple dynamic visual token compression for videollm

Han Wang, Yuxiang Nie, Yongjie Ye, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, and Can Huang. Dynamic-vlm: Simple dynamic visual token compression for videollm. InICCV, 2025

2025

[61] [61]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024

[62] [62]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. InECCV, 2024

2024

[63] [63]

Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models

Yujing Wang, Hainan Zhang, Liang Pang, Binghui Guo, Hongwei Zheng, and Zhiming Zheng. Maferw: Query rewriting with multi-aspect feedbacks for retrieval-augmented large language models. InAAAI, 2025

2025

[64] [64]

Infogain-rag: Boosting retrieval-augmented generation through document information gain-based reranking and filtering

Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, and Han Li. Infogain-rag: Boosting retrieval-augmented generation through document information gain-based reranking and filtering. InEMNLP, 2025

2025

[65] [65]

Videotree: Adaptive tree-based video representation for llm reasoning on long videos

Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, and Mohit Bansal. Videotree: Adaptive tree-based video representation for llm reasoning on long videos. InCVPR, 2025

2025

[66] [66]

Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S Ryoo, and Juan Carlos Niebles. Active video perception: Iterative evidence seeking for agentic long video understanding.arXiv preprint arXiv:2512.05774, 2025

Pith/arXiv arXiv 2025

[67] [67]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InNeurIPS, 2024

2024

[68] [68]

C-pack: Packaged resources to advance general chinese embedding, 2023

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

2023

[69] [69]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. InMM, 2017

2017

[70] [70]

E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation.arXiv preprint arXiv:2508.01546, 2025

Zeyu Xu, Junkang Zhang, Qiang Wang, and Yi Liu. E-vrag: Enhancing long video understanding with resource-efficient retrieval augmented generation.arXiv preprint arXiv:2508.01546, 2025

arXiv 2025

[71] [71]

Adavideorag: Omni-contextual adaptive retrieval-augmented efficient long video under- standing.arXiv preprint arXiv:2506.13589, 2025

Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, and Dacheng Tao. Adavideorag: Omni-contextual adaptive retrieval-augmented efficient long video under- standing.arXiv preprint arXiv:2506.13589, 2025

arXiv 2025

[72] [72]

Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

Pith/arXiv arXiv 2024

[73] [73]

Egolife: Towards egocentric life assistant

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. InCVPR, 2025

2025

[74] [74]

Graph-to-frame rag: Visual-space knowledge fusion for training-free and auditable video reasoning.arXiv preprint arXiv:2604.04372, 2026

Songyuan Yang, Weijiang Yu, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, and Nong Xiao. Graph-to-frame rag: Visual-space knowledge fusion for training-free and auditable video reasoning.arXiv preprint arXiv:2604.04372, 2026

Pith/arXiv arXiv 2026

[75] [75]

Generative frame sampler for long video understanding

Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, and Junnan Li. Generative frame sampler for long video understanding. InACL, 2025. 13

2025

[76] [76]

Univer- salrag: Retrieval-augmented generation over corpora of diverse modalities and granularities

Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. Univer- salrag: Retrieval-augmented generation over corpora of diverse modalities and granularities. 2026

2026

[77] [77]

Worldmm: Dynamic multimodal memory agent for long video reasoning

Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, and Sung Ju Hwang. Worldmm: Dynamic multimodal memory agent for long video reasoning. InCVPR, 2026

2026

[78] [78]

Rankrag: Unifying context ranking with retrieval-augmented generation in llms

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. InNeurIPS, 2024

2024

[79] [79]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019

2019

[80] [80]

T5gemma 2: Seeing, reading, and understanding longer.arXiv preprint arXiv:2512.14856, 2025

Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, et al. T5gemma 2: Seeing, reading, and understanding longer.arXiv preprint arXiv:2512.14856, 2025

arXiv 2025