QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Junyeong Kim; Woojun Jung

arxiv: 2604.24052 · v1 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-08 04:47 UTC · model grok-4.3

keywords: video summarization · reference-free evaluation · multimodal question answering · factuality · chronology · coverage · benchmark dataset · human correlation

The pith

QEVA evaluates video summaries without reference texts by using multimodal questions answered from the source video itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes QEVA as a reference-free way to score how well a narrative video summary covers key events, stays factual, and maintains correct order. It works by generating questions about the original video and checking whether the summary can answer them correctly through a vision-language model. This avoids the need for human-written reference summaries that limit older metrics like n-gram overlap or LLM judges. The authors release MLVU(VS)-Eval, a benchmark of 800 summaries from 200 videos, to support consistent testing. Experiments report that QEVA matches human ratings more closely than prior methods on standard correlation measures.

Core claim

QEVA is a reference-free evaluation metric that assesses candidate summaries directly against source videos through multimodal question answering. It measures three dimensions—Coverage, Factuality, and Chronology—by generating targeted questions from the video and verifying whether the summary provides accurate responses. On the introduced MLVU(VS)-Eval benchmark of 800 summaries from 200 videos, QEVA achieves higher correlation with human judgments than existing reference-dependent approaches, as quantified by Kendall's τ_b, τ_c, and Spearman's ρ.
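For reference, the three correlation statistics named above have standard textbook definitions, restated here as a reading aid. The notation is chosen for this note and is not taken from the paper: $C$ and $D$ are the numbers of concordant and discordant pairs among $n$ scored summaries, $t_i$ and $u_j$ are the sizes of the tie groups in the metric scores and the human ratings, $m$ is the smaller of the two counts of distinct values, and $d_i$ is the rank difference for summary $i$.

```latex
\begin{align*}
n_0 &= \tfrac{1}{2}\,n(n-1), \qquad
n_1 = \sum_i \tfrac{1}{2}\,t_i(t_i-1), \qquad
n_2 = \sum_j \tfrac{1}{2}\,u_j(u_j-1), \\
\tau_b &= \frac{C - D}{\sqrt{(n_0 - n_1)\,(n_0 - n_2)}}, \\
\tau_c &= \frac{2\,(C - D)}{n^2\,\frac{m-1}{m}}, \\
\rho &= 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)}
  \quad \text{(no ties; otherwise the Pearson correlation of the ranks).}
\end{align*}
```

$\tau_b$ corrects for ties in either variable, while $\tau_c$ is the variant usually preferred when the two variables take different numbers of distinct values, which is typical when a continuous metric is compared with a coarse human rating scale.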

What carries the argument

Multimodal question-answering process that generates questions from the source video and scores summary answers along coverage, factuality, and chronology dimensions.
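A minimal sketch of how per-dimension scores could be derived from question-answering outcomes. The source does not specify the scoring or aggregation rule, so everything here is an assumption for illustration: the `QAResult` record, the "fraction of correctly answered questions" rule, and the unweighted mean used for the overall score are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("coverage", "factuality", "chronology")


@dataclass
class QAResult:
    dimension: str  # one of DIMENSIONS
    correct: bool   # whether the judged answer matched the expected one


def dimension_scores(results):
    """Fraction of correctly answered questions per dimension (assumed rule)."""
    scores = {}
    for dim in DIMENSIONS:
        hits = [r.correct for r in results if r.dimension == dim]
        # A dimension with no questions is left out rather than scored as 0.
        if hits:
            scores[dim] = sum(hits) / len(hits)
    return scores


def overall_score(scores):
    """Unweighted mean over dimensions (assumption; not the paper's stated rule)."""
    return mean(list(scores.values()))


# Toy outcomes standing in for real VLM judgments.
results = [
    QAResult("coverage", True), QAResult("coverage", False),
    QAResult("factuality", True), QAResult("factuality", True),
    QAResult("chronology", False), QAResult("chronology", True),
]
scores = dimension_scores(results)
print(scores, round(overall_score(scores), 3))
```

Keeping the three scores separate before any aggregation makes it possible to report per-dimension correlations with human ratings, which a single blended score would hide.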

If this is right

Video summary evaluation no longer requires expensive human-written reference texts.
Errors in event ordering and factual inaccuracies become directly detectable through question responses.
The MLVU(VS)-Eval benchmark supplies a fixed test set for comparing any new evaluation method.
Future summarization models can be trained or selected using a metric less sensitive to reference choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The question-answering approach might inherit hidden biases when the same vision-language model generates both the summary and the evaluation questions.
Adapting the question-generation step could let the method apply to text-only or audio-only narrative summaries.
Reliability would improve if question selection were made more systematic rather than model-dependent.

Load-bearing premise

The vision-language model used for question generation and answering accurately captures nuanced video semantics without injecting its own biases or factual errors.

What would settle it

Run QEVA and existing metrics on a fresh collection of video summaries with independent human ratings; if QEVA's correlation with those ratings is no higher than that of reference-based baselines, the advantage claim fails.
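The comparison described here can be sketched in a few lines of dependency-free Python. This is an illustrative implementation of Kendall's τ_b and Spearman's ρ, not the paper's evaluation code; the ratings and the two metric score lists (`metric_a`, `metric_b`) are made-up toy values.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b, with the standard correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only  # pairs not tied in x
    untied_y = concordant + discordant + tied_x_only  # pairs not tied in y
    return (concordant - discordant) / sqrt(untied_x * untied_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [1, 2, 3, 4, 5]                 # toy human ratings
metric_a = [0.1, 0.4, 0.35, 0.8, 0.9]   # toy scores from one metric
metric_b = [0.5, 0.2, 0.6, 0.3, 0.7]    # toy scores from another metric
for name, scores in (("metric_a", metric_a), ("metric_b", metric_b)):
    print(name, round(kendall_tau_b(human, scores), 3),
          round(spearman_rho(human, scores), 3))
```

On real data the same two functions would be applied to each metric's scores against the same human ratings, and the advantage claim stands or falls on which metric yields the larger values.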

Figures

Figures reproduced from arXiv: 2604.24052 by Junyeong Kim, Woojun Jung.

**Figure 1.** Overview of existing video summarization …

**Figure 2.** Detailed illustration of QEVA’s multimodal question-answering methodology. Given a video and …

**Figure 3.** Detailed prompts used by QEVA for multimodal question-answer generation across three distinct …

**Figure 4.** Methodology figure of QEVA. Prompt text shown in the figure: You are an expert instructor in the course “Deep Video Understanding through Summarization”. ## Objective Given a textual video summary, your task is to generate exactly 10 clear and precise quiz questions. These questions will measure whether a generated video summary is factually accurate and consistent with the original video content. ## Abilities to test - Accurate recogniti…

**Figure 5.** Methodology figure of QEVA

**Figure 6.** Methodology figure of QEVA

Original abstract

Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QEVA gives a workable reference-free QA metric for video summaries plus a new benchmark, but the higher human correlations rest on untested VLM reliability for factuality and chronology.

The main point is that QEVA scores candidate video summaries directly against the source video by generating questions and using a vision-language model to answer them, producing scores on coverage, factuality, and chronology. The authors also release MLVU(VS)-Eval, an annotated set of 800 summaries from 200 videos drawn from the MLVU dataset. This is the concrete new piece: a reference-free approach tied to those three dimensions plus the benchmark data itself.

The paper does a reasonable job laying out why reference-based metrics are limiting for this task and shows reported gains in Kendall's τ_b, τ_c, and Spearman's ρ against human judgments. That kind of direct comparison is the right direction for evaluation work. The benchmark alone gives the field something usable for future tests.

The soft spots sit in the execution and validation. The abstract supplies no specifics on how questions are written, how VLM answers are turned into dimension scores, or any checks against VLM errors such as missed temporal order or invented details. Without ablations across models or human review of the QA outputs, it is unclear whether the higher correlations reflect better measurement or simply the VLM's own biases lining up with human ones. The stress-test concern holds on the information given.

This paper is for people working on video summarization evaluation who need reference-free options. A reader who wants a new dataset to run experiments on will get immediate value; someone expecting a fully validated metric will need the methods expanded. It deserves a serious referee because the benchmark is new and the core idea is practical, even if the current evidence for superiority is thin. I would send it out for review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes QEVA, a reference-free evaluation metric for narrative video summarization that assesses candidate summaries directly against source videos via multimodal question answering on three dimensions: Coverage, Factuality, and Chronology. It introduces the MLVU(VS)-Eval benchmark with 800 summaries from 200 videos and reports that QEVA achieves higher correlation with human judgments than prior metrics, as measured by Kendall's τ_b, τ_c, and Spearman's ρ.

Significance. If the central claims hold after validation, QEVA would advance evaluation practices in video summarization by removing reliance on human-written references while targeting nuanced aspects such as chronology. The creation of MLVU(VS)-Eval provides a reusable, annotated resource that supports reproducible comparisons across methods.

major comments (3)

[§4.2] Question Generation and Scoring: The process for deriving questions that specifically probe Coverage, Factuality, and Chronology is described at a high level only; without explicit templates, prompting strategies, or answer-matching rules (e.g., exact match vs. embedding similarity), it is impossible to verify that the VLM outputs faithfully isolate these dimensions rather than reflecting VLM-internal temporal or factual biases.
[§5.1] Correlation Experiments: No ablation is presented on the choice of vision-language model or on human validation of the VLM-generated answers themselves; the reported superiority in Kendall's τ_b/τ_c and Spearman's ρ therefore cannot be separated from potential confounds arising from the VLM's own error patterns on long-range narrative content.
[Table 2] Main Results: The correlation tables do not report statistical significance tests, confidence intervals, or per-dimension breakdowns; this weakens the claim that QEVA is demonstrably superior across all three dimensions.
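To make the request for confidence intervals concrete, here is a hedged sketch of a percentile bootstrap over (metric score, human rating) pairs. It is one common way to obtain such intervals, not a procedure described in the paper; the function names, the toy data, and the use of Pearson correlation as the example statistic are all assumptions (Kendall's τ_b or Spearman's ρ could be passed as `stat` instead).

```python
import random
from math import sqrt


def pearson(a, b):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def bootstrap_ci(x, y, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(x, y), resampling pairs jointly."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        try:
            estimates.append(stat(xs, ys))
        except ZeroDivisionError:
            continue  # degenerate resample with a constant variable; skip it
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1,
                          int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


# Toy data: metric scores and human ratings for ten summaries.
metric = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3, 0.6, 0.8, 0.1, 0.95]
human = [1, 2, 3, 4, 5, 2, 3, 4, 1, 5]
lower, upper = bootstrap_ci(metric, human, pearson)
print(round(lower, 3), round(upper, 3))
```

Resampling the pairs jointly, rather than each list separately, preserves the association being measured. Two metrics can then be compared by checking whether their intervals overlap, or more directly by bootstrapping the difference between their correlations.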

minor comments (2)

[§3] The notation used for the three dimension scores (e.g., how Coverage, Factuality, and Chronology are aggregated into a final QEVA score) would benefit from an explicit equation or pseudocode block.
[§2] A few citations to recent VLM-based evaluation papers (post-2023) are absent from the related-work discussion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving clarity, reproducibility, and statistical rigor in our work. We address each major comment below and outline the revisions we will make.

Referee: [§4.2] Question Generation and Scoring: The process for deriving questions that specifically probe Coverage, Factuality, and Chronology is described at a high level only; without explicit templates, prompting strategies, or answer-matching rules (e.g., exact match vs. embedding similarity), it is impossible to verify that the VLM outputs faithfully isolate these dimensions rather than reflecting VLM-internal temporal or factual biases.

Authors: We agree that the current description in §4.2 is insufficient for full reproducibility and verification of dimension isolation. In the revised manuscript, we will expand this section with the exact prompting templates and strategies used to generate questions for each dimension, along with the answer-matching rules (cosine similarity on embeddings with a fixed threshold for open-ended responses). We will also add concrete examples demonstrating how the questions target Coverage, Factuality, and Chronology specifically, and discuss steps taken to minimize VLM-internal biases.

Revision: yes

Referee: [§5.1] Correlation Experiments: No ablation is presented on the choice of vision-language model or on human validation of the VLM-generated answers themselves; the reported superiority in Kendall's τ_b/τ_c and Spearman's ρ therefore cannot be separated from potential confounds arising from the VLM's own error patterns on long-range narrative content.

Authors: We acknowledge that ablations on VLM choice and direct human validation of VLM answers would strengthen the results and help rule out confounds. In the revision, we will add an ablation comparing QEVA performance across multiple VLMs. We will also include a human validation study on a subset of VLM-generated answers, reporting agreement metrics to demonstrate reliability on narrative content.

Revision: yes

Referee: [Table 2] Main Results: The correlation tables do not report statistical significance tests, confidence intervals, or per-dimension breakdowns; this weakens the claim that QEVA is demonstrably superior across all three dimensions.

Authors: We agree that statistical tests, confidence intervals, and per-dimension results are needed to support the superiority claims. We will revise Table 2 and the surrounding text to include p-values for all reported correlations, bootstrap-derived confidence intervals, and separate breakdowns of Kendall's τ_b/τ_c and Spearman's ρ for Coverage, Factuality, and Chronology individually.

Revision: yes
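The answer-matching rule named in the first response (cosine similarity on embeddings with a fixed threshold) can be illustrated with a short sketch. The embedding model and the threshold value are not stated in the source, so the vectors below are toy stand-ins for real sentence embeddings and the `0.8` default is a placeholder assumption.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_u * norm_v)


def answers_match(predicted_vec, expected_vec, threshold=0.8):
    """True when the two answer embeddings are at least `threshold` similar.

    The 0.8 default is a placeholder, not the authors' value.
    """
    return cosine_similarity(predicted_vec, expected_vec) >= threshold


# Toy 2-D "embeddings" standing in for real sentence-embedding vectors.
print(answers_match([3.0, 4.0], [6.0, 8.0]))  # same direction
print(answers_match([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Because the threshold decides which answers count as correct, reporting how the correlations move as it varies would address part of the referee's concern about dimension isolation.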

Circularity Check

0 steps flagged

No circularity: QEVA metric and benchmark defined independently; correlations measured against external human judgments

The paper defines QEVA directly via multimodal QA on Coverage/Factuality/Chronology dimensions against source video, introduces an independent annotated benchmark MLVU(VS)-Eval with 800 summaries, and reports empirical correlations (Kendall τ_b/τ_c, Spearman ρ) with human judgments on that benchmark. No equation or step reduces the metric output to its own inputs by construction, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on self-citation. The derivation chain is self-contained against external human labels.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified.
