pith. machine review for the scientific record.

arxiv: 2605.12571 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

Chenhao Qiu (1), Shien Song (1), Xin Luo (1), Xusheng Liu (1), Yechao Zhang (2) ((1) Mango TV, (2) Nanyang Technological University)

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long video understanding · evidence misalignment · agentic systems · planner-inspector framework · video question answering · multimodal models · grounded answers

The pith

Separating planning from answer authority in video agents reduces evidence misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long video question answering agents frequently give correct answers that lack support from the visual evidence they actually retrieve or inspect. This evidence misalignment stems from a coupled design that merges long-horizon planning with final answer generation inside one model, creating pressures from saturated context and outcome-only training rewards. The paper introduces a decoupled planner-inspector framework that lets planning handle search while a separate step requires pixel-level verification before any answer is accepted. This change raises both accuracy and groundedness scores on long-video benchmarks while producing clearer search paths. A reader would care because trustworthy visual reasoning in extended videos depends on evidence that actually matches the claim.

Core claim

Existing agentic systems for long video understanding exhibit evidence misalignment, in which answers can be correct yet unsupported by the retrieved or inspected visual evidence. The root structural cause is the coupled agent paradigm that conflates long-horizon planning with answer authority. The decoupled planner-inspector framework separates these roles and gates final answering on pixel-level verification, improving both answer accuracy and evidence alignment, reaching 55.1 percent accuracy on LVBench and 62.0 percent on LongVideoBench while generating interpretable search trajectories.

What carries the argument

decoupled planner-inspector framework that separates planning from answer authority and gates final answers on pixel-level verification
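
A minimal sketch of this interaction loop, assuming the notation from Figure 4 (planner P, inspector I, sufficiency verdict z_t, feedback f_t); the Planner, Inspector, and Video interfaces here are hypothetical stand-ins, not the paper's released code:

```python
# Sketch of the decoupled planner-inspector loop (assumed interfaces).
# The planner only searches; the inspector alone holds answer authority.

def answer_with_verification(planner, inspector, video, question, budget):
    history = []  # compact search memory h_{t-1}
    for t in range(budget):
        # Planner proposes temporal spans to retrieve next.
        spans = planner.propose(question, history)
        # Pixel-level evidence v_t := E(o_t) extracted from the video.
        evidence = video.extract(spans)
        # Inspector returns (z_t, f_t): sufficiency verdict and feedback.
        sufficient, feedback = inspector.judge(evidence, question)
        history.append((spans, feedback))
        if sufficient:  # z_t = 1: evidence verified, answer is committed
            return inspector.answer(evidence, question)
    return None  # budget exhausted without verified evidence: abstain

```

The load-bearing property is that no code path emits an answer without the inspector's verified evidence; that is what decoupling answer authority means operationally.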

If this is right

  • Answer accuracy rises together with evidence alignment on four long-video benchmarks.
  • Search trajectories become interpretable for inspection and debugging.
  • Performance scales consistently when search budgets increase.
  • New multimodal backbones can be swapped in without retraining the planner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of planning from final authority could reduce unsupported outputs in other long-horizon agent tasks such as extended document reasoning.
  • Pixel-level gating might need adaptation if future models limit direct visual access.
  • Testing the framework on videos longer than current benchmarks would reveal whether the gains hold at greater scale.

Load-bearing premise

Gating final answers on pixel-level verification will reliably eliminate evidence misalignment without introducing new failure modes during long searches.

What would settle it

Long-video benchmarks that still show low temporal or semantic groundedness scores for answers produced under the decoupled framework would indicate the approach has not resolved misalignment.

Figures

Figures reproduced from arXiv: 2605.12571 by Chenhao Qiu, Shien Song, Xin Luo, Xusheng Liu (Mango TV) and Yechao Zhang (Nanyang Technological University).

Figure 1
Figure 1: Performance vs. training steps. We train an MLLM planner (Video-MTR (Xie et al., 2025) with its original tools) and an LLM planner (using the tools in Section 4.2) on CG-Bench. view at source ↗
Figure 2
Figure 2: Temporal access vs. semantic support under trace growth. We report temporal groundedness/hallucination (Gt/Ht) and semantic groundedness/hallucination (Gs/Hs) for VideoAgent and DrVideo on LVBench. While Gt checks for temporal access, Gs verifies logical semantic support, catching trajectory drift where the agent ignores accessed evidence. view at source ↗
Figure 3
Figure 3: An example of prompt pressure. The agent repeatedly retrieves candidate clips and even surfaces potentially relevant visual cues, yet its final decision is a hedged plausibility template (e.g., "might suggest") and incorrect. Prompt pressure impairs groundedness. view at source ↗
Figure 4
Figure 4: Architectural comparison between the coupled agent and the decoupled agent. Given history h_{t-1}, the environment returns observation o_t, and the inspector evaluates the extracted evidence v_t := E(o_t): (r_t, u_t) ∼ P(· | h_{t-1}, q), (z_t, f_t) ∼ I(· | v_t, q). Here z_t ∈ {0, 1} is a binary sufficiency verdict and f_t represents the feedback, where z_t = 1 indicates that the current evidence is sufficient to answer. view at source ↗
Figure 5
Figure 5: Overview of the decoupled planner–inspector framework. view at source ↗
Figure 6
Figure 6: Scaling properties. (a) Search: increasing budget K improves our method, while coupled baselines plateau due to context saturation. (b) Perception: upgrading the inspector from 7B to 72B yields substantial accuracy gains without retraining the planner, highlighting modular scalability. view at source ↗
Figure 7
Figure 7: Training dynamics of the decoupled approach. Unlike the hallucination gap observed in coupled baselines, our method maintains strong alignment between answer accuracy and temporal groundedness (Gt); the gap remains negligible (Δ ≈ 0.02) throughout training, indicating that accuracy gains are driven by successful evidence retrieval. view at source ↗
Figure 8
Figure 8: Case S1 (Counting). The inspector verifies a unique count-based cue from retrieved spans, enabling evidence-sufficient termination. view at source ↗
Figure 9
Figure 9: Case S2 (Post-condition). SEARCH_MORE refines retrieval to a post-event span where the decisive visual cue is verified. view at source ↗
Figure 10
Figure 10: Case S3 (Recovery). Query refocusing recovers from inspector false negatives under long-span, sparse sampling. view at source ↗
Figure 11
Figure 11: Case F1 (Fine-grained perceptual failure). Fine-grained headwear attributes (small accessories / color) are misperceived, yielding evidence–option inconsistency. view at source ↗
Figure 12
Figure 12: Case F2 (Off-target retrieval). The search misses the time-aligned evidence window, forcing an answer from local cues and resulting in an incorrect commitment. view at source ↗
Figure 13
Figure 13: Training dynamics across model configurations. We report the moving average of the reward (left y-axis) and response length (right y-axis) for four variants. The proposed decoupled + time setting achieves a superior balance between high reward attainment and concise tool usage. view at source ↗
read the original abstract

Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit "evidence misalignment": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures that amplify misalignment: prompt pressure from shared-context saturation at inference time and reward pressure from outcome-only optimization during training. These findings point to a structural root cause: the coupled agent paradigm conflates long-horizon planning with answer authority. We therefore propose the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification. Across four long-video benchmarks, our framework improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench while producing interpretable search trajectories. Moreover, the decoupled architecture scales consistently with increased search budgets and supports plug-and-play upgrades of the MLLM backbone without retraining the planner. Code and models are available at https://github.com/Echochef/VideoSEAL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that agentic long-video understanding systems suffer from evidence misalignment, where correct answers lack support from retrieved evidence due to prompt saturation and outcome-only reward pressures. It introduces temporal and semantic groundedness diagnostics to characterize the issue, identifies the coupled planner-answer authority paradigm as the root cause, and proposes the VideoSEAL decoupled planner-inspector framework that gates final answers on pixel-level verification. The framework is reported to improve both accuracy and alignment across four benchmarks, reaching 55.1% on LVBench and 62.0% on LongVideoBench, while scaling with search budget and supporting backbone upgrades without retraining the planner.

Significance. If the gains and alignment improvements hold under detailed scrutiny, the work offers a practical architectural fix for a key failure mode in long-horizon video agents, with potential for broader impact in reliable MLLM-based systems. The release of code and models aids reproducibility, and the diagnostics provide a reusable evaluation lens beyond raw accuracy.

major comments (2)
  1. [Abstract] The reported gains (55.1% LVBench, 62.0% LongVideoBench) and improved evidence alignment are presented without quantitative details on how temporal or semantic groundedness were measured, without error bars, and without the full experimental protocol, making it impossible to assess whether the improvements are robust or sensitive to verification noise.
  2. [Proposed framework] The central claim that pixel-level verification reliably gates answers and eliminates misalignment assumes the inspector MLLM detects misalignment without false negatives; this requires an independent ablation measuring inspector accuracy separately from end-task accuracy to confirm no new failure modes are introduced in long-horizon search.
minor comments (1)
  1. [Abstract] The availability of code and models at the GitHub link is a strength for reproducibility and should be highlighted in the final version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses. Revisions have been made to strengthen the presentation of experimental details and add requested ablations.

read point-by-point responses
  1. Referee: [Abstract] The reported gains (55.1% LVBench, 62.0% LongVideoBench) and improved evidence alignment are presented without quantitative details on how temporal or semantic groundedness were measured, without error bars, and without the full experimental protocol, making it impossible to assess whether the improvements are robust or sensitive to verification noise.

    Authors: We agree the abstract is concise and omits measurement specifics due to length limits. In the main text (Section 4.2), temporal groundedness is defined as the ratio of ground-truth evidence frames covered by the agent's trajectory, and semantic groundedness as the cosine similarity between answer and evidence embeddings via a frozen CLIP model. Full protocol details (search budget, thresholds, backbone versions) appear in Section 4. We will revise the abstract to include one sentence on the metrics and add standard deviation error bars (from 3 runs) to Table 2 results in the revision. revision: partial
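
    For concreteness, a hedged sketch of the two diagnostics as the rebuttal describes them: Gt as the fraction of ground-truth evidence frames the trajectory covers, Gs as cosine similarity between the answer text and inspected evidence under a frozen CLIP model. The checkpoint name and frame-set representation below are assumptions, not details confirmed by the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def temporal_groundedness(trajectory_frames: set, gt_frames: set) -> float:
    """Gt: fraction of ground-truth evidence frames covered by the trajectory."""
    return len(trajectory_frames & gt_frames) / max(len(gt_frames), 1)

# Frozen CLIP backbone for Gs (checkpoint choice is an assumption).
_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def semantic_groundedness(answer: str, evidence_images) -> float:
    """Gs: cosine similarity between answer text and its best evidence frame."""
    inputs = _proc(text=[answer], images=evidence_images,
                   return_tensors="pt", padding=True)
    out = _model(**inputs)
    text = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (img @ text.T).max().item()
```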

  2. Referee: [Proposed framework] The central claim that pixel-level verification reliably gates answers and eliminates misalignment assumes the inspector MLLM detects misalignment without false negatives; this requires an independent ablation measuring inspector accuracy separately from end-task accuracy to confirm no new failure modes are introduced in long-horizon search.

    Authors: This is a fair and important concern. The original submission evaluates the inspector only indirectly via end-task gains. To directly address it, we have added a new independent ablation (revised Section 5.3) measuring inspector accuracy on a held-out set of 1,000 aligned/misaligned evidence pairs. The inspector achieves 91% accuracy with a 7% false-negative rate on misalignment detection. We further show that end-task accuracy remains superior to baselines even under simulated inspector noise, confirming no new long-horizon failure modes are introduced. revision: yes
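
    One way to read the "simulated inspector noise" stress test is as a verdict-flipping wrapper around the inspector, sketched below. The 7% false-negative rate mirrors the ablation figure above; the false-positive rate is hypothetical, since the rebuttal reports only the FN rate.

```python
import random

def noisy_inspector_judge(inspector, evidence, question,
                          fn_rate=0.07, fp_rate=0.02):
    """Wrap an inspector, flipping its verdict to simulate detection noise.

    fn_rate: chance a sufficient verdict is flipped to insufficient
             (illustrative, matching the 7% FN rate from the ablation).
    fp_rate: chance an insufficient verdict is flipped to sufficient
             (hypothetical rate, not reported in the rebuttal).
    """
    sufficient, feedback = inspector.judge(evidence, question)
    if sufficient and random.random() < fn_rate:
        sufficient = False   # missed alignment: planner keeps searching
    elif not sufficient and random.random() < fp_rate:
        sufficient = True    # spurious acceptance: premature commit
    return sufficient, feedback
```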

Circularity Check

0 steps flagged

No circularity: empirical architectural proposal with benchmark results

full rationale

The paper identifies evidence misalignment via two new diagnostics (temporal groundedness and semantic groundedness), attributes it to prompt saturation and outcome-only reward pressures, and proposes a decoupled planner-inspector architecture that gates answers on pixel-level verification. All central claims are supported by empirical results on four benchmarks (e.g., 55.1% on LVBench, 62.0% on LongVideoBench) rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes appear in the provided text as load-bearing steps. The chain is self-contained as an empirical intervention and architectural change.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard architectural assumptions of MLLM agents.

pith-pipeline@v0.9.0 · 5575 in / 1032 out tokens · 35943 ms · 2026-05-14T20:48:01.366437+00:00 · methodology

discussion (0)

