arxiv: 2601.03926 · v2 · submitted 2026-01-07 · 💻 cs.CL

Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models

Haeun Jang, Hwan Chang, Hwanhee Lee

Pith reviewed 2026-05-16 16:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords document policy preservation · large vision-language models · safety benchmark · multimodal reasoning · information leakage · policy compliance · document question answering

The pith

Large vision-language models leak sensitive document information when synthesizing answers across text and images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Doc-PP, a benchmark built from real-world reports to test how large vision-language models handle document questions while obeying user-defined non-disclosure policies. It identifies a Reasoning-Induced Safety Gap in which models disclose restricted details during complex cross-modal reasoning or multi-step synthesis, bypassing existing safeguards. The authors also show that supplying extracted text raises answer accuracy but heightens leakage risk. They propose DVA, a decompose-verify-aggregate method that separates reasoning steps from policy checks and cuts violations compared with standard prompting.

Core claim

Models exhibit a Reasoning-Induced Safety Gap by leaking sensitive information when answers require complex synthesis or aggregation across visual and textual elements, even under explicit non-disclosure policies; the DVA framework reduces this leakage by decoupling reasoning from policy verification.

What carries the argument

Doc-PP benchmark for multimodal policy preservation, together with the DVA structural inference framework that decomposes queries into separate reasoning and verification stages.
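
The decoupling idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `dva_answer`, the `SubAnswer` type, the refusal string, and all `toy_*` helpers are hypothetical, and the model calls are replaced by plain Python stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubAnswer:
    question: str
    answer: str


REFUSAL = "[withheld by policy]"  # hypothetical placeholder text


def dva_answer(
    query: str,
    decompose: Callable[[str], List[str]],
    answer_step: Callable[[str], str],
    violates_policy: Callable[[str], bool],
    aggregate: Callable[[str, List[SubAnswer]], str],
) -> str:
    # Decompose: split the query into simpler sub-questions.
    sub_questions = decompose(query)
    # Reason: answer each sub-question with no policy logic mixed in.
    drafts = [SubAnswer(q, answer_step(q)) for q in sub_questions]
    # Verify: check every intermediate answer against the policy separately.
    checked = [
        SubAnswer(d.question, REFUSAL if violates_policy(d.answer) else d.answer)
        for d in drafts
    ]
    # Aggregate: compose the final answer from verified pieces only.
    final = aggregate(query, checked)
    # Verify once more, since aggregation itself can surface restricted content.
    return REFUSAL if violates_policy(final) else final


# Toy stand-ins for model calls (assumptions, for illustration only).
TOY_FACTS = {
    "What is the report title?": "The title is Annual Review.",
    "What is the total budget?": "The total budget is 500 units.",
}


def toy_decompose(query: str) -> List[str]:
    return [part.strip() for part in query.split(" AND ")]


def toy_answer(question: str) -> str:
    return TOY_FACTS.get(question, "Unknown.")


def toy_violates(text: str) -> bool:
    # Toy policy: budget figures must not be disclosed.
    return "budget is" in text.lower()


def toy_aggregate(query: str, parts: List[SubAnswer]) -> str:
    return " ".join(p.answer for p in parts)


result = dva_answer(
    "What is the report title? AND What is the total budget?",
    toy_decompose, toy_answer, toy_violates, toy_aggregate,
)
print(result)
```

The point of the structure is that the reasoning callables never see the policy and the verifier never has to reason about the document; in a real system each callable would presumably wrap an LVLM call.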

If this is right

Current safety techniques fail to block leakage that arises only after models combine information from multiple modalities or steps.
Providing OCR-extracted text improves perception yet raises the chance of policy violations.
DVA delivers a stronger baseline for policy-compliant answers than ordinary prompting or direct generation.
Document question-answering systems need safeguards that explicitly track and enforce inferred content rather than surface-level tokens.
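
A toy contrast makes the last point concrete. The sketch below is an illustration only, assuming a policy that restricts a single numeric total; `surface_filter_flags` and `inference_aware_flags` are hypothetical names, and the subset-sum check stands in for whatever inference tracking a real safeguard would need.

```python
import re
from itertools import combinations


def surface_filter_flags(answer: str, restricted_value: int) -> bool:
    # Surface-level check: flag only if the restricted number appears verbatim.
    return str(restricted_value) in answer


def inference_aware_flags(answer: str, restricted_value: int) -> bool:
    # Toy inference check: flag if the restricted number appears, or if any
    # group of numbers in the answer adds up to it.
    numbers = [int(n) for n in re.findall(r"\d+", answer)]
    if restricted_value in numbers:
        return True
    for size in range(2, len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == restricted_value:
                return True
    return False


# Hypothetical answer: the restricted total (120) is never stated,
# but it can be derived from the two parts.
answer = "Region A sold 70 units and Region B sold 50 units."
print(surface_filter_flags(answer, 120))
print(inference_aware_flags(answer, 120))
```

Here the token filter passes the answer while the inference-aware check flags it, which mirrors the gap the paper describes between surface matching and content a reader can derive.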

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same leakage pattern is likely to appear in other high-stakes multimodal domains such as medical records or legal filings.
Benchmarks could be extended to test policies that change with user identity or that span several linked documents.
Training or prompting that forces explicit decomposition steps may lower leakage without large accuracy losses.

Load-bearing premise

The real-world reports chosen for the benchmark faithfully represent the range of dynamic, user-defined non-disclosure policies that occur in practice.

What would settle it

Testing the same models on a new collection of documents carrying fresh user-defined policies and checking whether the measured leakage rate falls substantially below the rates reported on Doc-PP.
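
One way to make this comparison concrete is sketched below. It assumes leakage is scored per response as a boolean and uses a standard pooled two-proportion z statistic; the paper's own metric may differ, the function names are hypothetical, and the counts in the usage lines are placeholders rather than results.

```python
import math
from typing import Sequence


def leakage_rate(flags: Sequence[bool]) -> float:
    # Fraction of responses judged to leak restricted content.
    return sum(flags) / len(flags)


def two_proportion_z(leaks_a: int, n_a: int, leaks_b: int, n_b: int) -> float:
    # Pooled two-proportion z statistic for the difference in leakage rates.
    p_a = leaks_a / n_a
    p_b = leaks_b / n_b
    pooled = (leaks_a + leaks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both collections leak always or never; no difference to test.
        return 0.0
    return (p_a - p_b) / se


# Placeholder counts, not measurements from the paper.
z = two_proportion_z(leaks_a=60, n_a=100, leaks_b=40, n_b=100)
print(round(z, 3))
```

A large positive statistic would indicate that the first collection leaks more often than the second by more than sampling noise would suggest.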

Original abstract

The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Doc-PP is a practical new benchmark for policy leakage in multimodal document QA, but its claims rest on unverified annotations and limited evaluation details.

The paper's main contribution is Doc-PP, a benchmark built from real-world reports that tests whether LVLMs respect explicit non-disclosure policies when answers require cross-modal reasoning. It also introduces DVA, a decompose-verify-aggregate prompting approach meant to reduce leakage compared with standard methods. The core observation, that models leak more when forced to synthesize across text and images, matches what many practitioners see in deployment, and the finding that extracted text boosts perception but increases leakage is a useful data point. Grounding the benchmark in actual reports rather than synthetic cases gives it a realism that text-only safety work often lacks.

That said, the evaluation leaves several gaps. No inter-annotator agreement numbers are reported for the policy labels, so it is hard to know whether the measured leakage reflects model behavior or inconsistent ground truth. The quantitative comparisons with baselines are described at a high level, without enough detail on splits, metrics, or statistical significance to judge how large the DVA gains really are.

The paper is aimed at researchers working on multimodal safety and document QA systems. It is the kind of benchmark paper that can move the conversation forward if the annotation process and full results hold up, so it is worth sending to referees rather than desk-rejecting. A serious review would likely focus on tightening the evaluation protocol and clarifying how policies were operationalized.

Referee Report

1 major / 0 minor

Summary. The paper introduces Doc-PP, a benchmark of real-world multimodal documents with dynamic non-disclosure policies, to evaluate LVLMs on document QA. It reports a Reasoning-Induced Safety Gap in which models leak sensitive information during complex cross-modal synthesis, shows that extracted text improves perception but increases leakage, and proposes the DVA (Decompose-Verify-Aggregation) framework that decouples reasoning from policy verification and outperforms standard prompting.

Significance. If the benchmark annotations prove reliable, the work is significant for exposing a concrete failure mode in current LVLM safety that is missed by text-only or implicit-norm evaluations. The DVA framework supplies a reproducible structural baseline for policy-compliant multimodal inference.

major comments (1)

[Section 3] Policy construction is described as manual extraction from real-world reports, yet no inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa) or conflict-resolution protocol are reported. Because the Reasoning-Induced Safety Gap claim rests entirely on the accuracy of these ground-truth policy-violation labels, the absence of agreement metrics leaves open the possibility that observed leakage reflects annotation inconsistency rather than model failure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on annotation reliability. We agree that inter-annotator agreement metrics are essential to substantiate the ground-truth policy-violation labels and will incorporate them in the revision.

Referee: [Section 3] Policy construction is described as manual extraction from real-world reports, yet no inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa) or conflict-resolution protocol are reported. Because the Reasoning-Induced Safety Gap claim rests entirely on the accuracy of these ground-truth policy-violation labels, the absence of agreement metrics leaves open the possibility that observed leakage reflects annotation inconsistency rather than model failure.

Authors: We acknowledge the validity of this concern. In the revised manuscript we will add a dedicated subsection in Section 3 reporting inter-annotator agreement on a randomly sampled subset of 200 documents. Two additional annotators independently labeled policy-violation spans; we will report Cohen’s kappa (pairwise) and Fleiss’ kappa (multi-annotator) together with the conflict-resolution protocol (majority vote followed by adjudication by a senior annotator for disagreements). These statistics will directly support the reliability of the Reasoning-Induced Safety Gap findings.

Revision: yes
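
For reference, the pairwise statistic named in this exchange is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label frequencies. The sketch below computes it in plain Python; the function name and the "leak"/"safe" labels are illustrative assumptions, not the paper's annotation scheme, and Fleiss' kappa for more than two annotators is not shown.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    # Cohen's kappa for two annotators labeling the same items.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum((counts_a[k] / n) * (counts_b[k] / n) for k in counts_a)
    if p_e == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only.
annotator_1 = ["leak", "leak", "safe", "safe"]
annotator_2 = ["leak", "safe", "safe", "safe"]
print(cohens_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on three of four items (p_o = 0.75) against a chance level of 0.5, giving kappa = 0.5.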

Circularity Check

0 steps flagged

No significant circularity in benchmark or framework

The paper introduces Doc-PP as an empirical benchmark constructed from real-world reports and proposes the DVA framework based on experimental evaluations of LVLMs. No equations, derivations, or mathematical claims are present in the provided text. The central claims about the Reasoning-Induced Safety Gap rest on observed model behaviors rather than any reduction to fitted parameters, self-definitions, or load-bearing self-citations. The work is self-contained as a benchmark evaluation without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Central claims rest on the newly introduced benchmark and framework; the abstract specifies no free parameters or axioms, and the only invented entities are the Doc-PP benchmark and the DVA framework themselves.

invented entities (2)

Doc-PP benchmark · no independent evidence
purpose: Evaluate policy preservation in multimodal document QA
Newly constructed dataset from real-world reports

DVA framework · no independent evidence
purpose: Decouple reasoning from policy verification
Proposed structural inference method

pith-pipeline@v0.9.0 · 5480 in / 1060 out tokens · 25544 ms · 2026-05-16T16:27:04.902508+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We introduce Doc-PP ... DVA (Decompose–Verify–Aggregation), a structural inference framework that decouples reasoning from policy verification."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Our evaluation highlights a systemic Reasoning-Induced Safety Gap"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Policy-Invisible Violations in LLM-Based Agents
cs.AI · 2026-04 · unverdicted · novelty 6.0

LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.