pith. machine review for the scientific record.

arxiv: 2601.03926 · v2 · submitted 2026-01-07 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models


Pith reviewed 2026-05-16 16:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords document policy preservation · large vision-language models · safety benchmark · multimodal reasoning · information leakage · policy compliance · document question answering

The pith

Large vision-language models leak sensitive document information when synthesizing answers across text and images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Doc-PP, a benchmark built from real-world reports to test how large vision-language models handle document questions while obeying user-defined non-disclosure policies. It identifies a Reasoning-Induced Safety Gap in which models disclose restricted details during complex cross-modal reasoning or multi-step synthesis, bypassing existing safeguards. The authors also show that supplying extracted text raises answer accuracy but heightens leakage risk. They propose DVA, a decompose-verify-aggregate method that separates reasoning steps from policy checks and cuts violations compared with standard prompting.

Core claim

Models exhibit a Reasoning-Induced Safety Gap by leaking sensitive information when answers require complex synthesis or aggregation across visual and textual elements, even under explicit non-disclosure policies; the DVA framework reduces this leakage by decoupling reasoning from policy verification.

What carries the argument

Doc-PP benchmark for multimodal policy preservation, together with the DVA structural inference framework that decomposes queries into separate reasoning and verification stages.
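The decompose-verify-aggregate idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the `SubAnswer` type, the toy facts, and the keyword-based policy check are hypothetical placeholders, not the authors' implementation. In particular, this sketch verifies each partial answer before aggregation; the summary above does not specify at which granularity DVA applies its checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SubAnswer:
    question: str
    answer: str
    allowed: bool  # True if the partial answer passed the policy check


def dva(
    question: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str], str],
    violates_policy: Callable[[str], bool],
) -> Tuple[str, List[SubAnswer]]:
    """Decompose the query, verify each partial answer, aggregate what is permitted."""
    parts: List[SubAnswer] = []
    for sub_q in decompose(question):
        sub_a = answer(sub_q)  # reasoning step, run without policy logic
        parts.append(SubAnswer(sub_q, sub_a, not violates_policy(sub_a)))

    kept = [p.answer for p in parts if p.allowed]
    withheld = sum(1 for p in parts if not p.allowed)
    text = " ".join(kept)
    if withheld:
        note = f"[{withheld} detail(s) withheld by policy]"
        text = f"{text} {note}" if text else note
    return text, parts


# Hypothetical stand-ins for a model and a policy checker.
FACTS = {
    "What was total revenue?": "Total revenue was 10M.",
    "What was the CEO's salary?": "The CEO's salary was 2M.",
}


def toy_decompose(question: str) -> List[str]:
    return list(FACTS)  # a real system would derive sub-questions from the query


def toy_answer(sub_question: str) -> str:
    return FACTS[sub_question]


def toy_violates(text: str) -> bool:
    return "salary" in text.lower()  # placeholder: a keyword match, not inference-aware


final, parts = dva(
    "Summarize revenue and executive pay.", toy_decompose, toy_answer, toy_violates
)
print(final)
```

The keyword check is deliberately naive. A usable verifier would have to judge inferred content rather than surface tokens, and would likely also need to re-check the aggregated answer, since permitted fragments can combine into a restricted fact.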

If this is right

  • Current safety techniques fail to block leakage that arises only after models combine information from multiple modalities or steps.
  • Providing OCR-extracted text improves perception yet raises the chance of policy violations.
  • DVA delivers a stronger baseline for policy-compliant answers than ordinary prompting or direct generation.
  • Document question-answering systems need safeguards that explicitly track and enforce inferred content rather than surface-level tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same leakage pattern is likely to appear in other high-stakes multimodal domains such as medical records or legal filings.
  • Benchmarks could be extended to test policies that change with user identity or that span several linked documents.
  • Training or prompting that forces explicit decomposition steps may lower leakage without large accuracy losses.

Load-bearing premise

The real-world reports chosen for the benchmark faithfully represent the range of dynamic, user-defined non-disclosure policies that occur in practice.

What would settle it

Testing the same models on a new collection of documents carrying fresh user-defined policies and checking whether the measured leakage rate falls substantially below the rates reported on Doc-PP.
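One way to make "falls substantially below" concrete is to compare the two leakage rates with a two-proportion z-test. This is a sketch under stated assumptions: the choice of test is ours, not the paper's, and the counts below are made-up placeholders rather than results from Doc-PP or any other dataset.

```python
import math
from typing import Sequence


def leakage_rate(leaked: Sequence[bool]) -> float:
    """Fraction of responses flagged as policy violations."""
    return sum(leaked) / len(leaked)


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: both samples share the same underlying leakage rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # both samples are all-leak or all-safe; no evidence of a gap
    return (p1 - p2) / se


# Hypothetical per-response flags (True = leak), invented for illustration only.
doc_pp_flags = [True] * 60 + [False] * 40
fresh_flags = [True] * 30 + [False] * 70

z = two_proportion_z(
    sum(doc_pp_flags), len(doc_pp_flags), sum(fresh_flags), len(fresh_flags)
)
print(f"Doc-PP rate={leakage_rate(doc_pp_flags):.2f}, "
      f"fresh rate={leakage_rate(fresh_flags):.2f}, z={z:.2f}")
```

A large positive z would indicate that leakage on the fresh documents is lower than on Doc-PP by more than sampling noise explains, which is the outcome that would undercut the benchmark's representativeness.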

Original abstract

The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Doc-PP, a benchmark of real-world multimodal documents with dynamic non-disclosure policies, to evaluate LVLMs on document QA. It reports a Reasoning-Induced Safety Gap in which models leak sensitive information during complex cross-modal synthesis, shows that extracted text improves perception but increases leakage, and proposes the DVA (Decompose-Verify-Aggregation) framework that decouples reasoning from policy verification and outperforms standard prompting.

Significance. If the benchmark annotations prove reliable, the work is significant for exposing a concrete failure mode in current LVLM safety that is missed by text-only or implicit-norm evaluations. The DVA framework supplies a reproducible structural baseline for policy-compliant multimodal inference.

major comments (1)
  1. [Section 3] Section 3: Policy construction is described as manual extraction from real-world reports, yet no inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa) or conflict-resolution protocol are reported. Because the Reasoning-Induced Safety Gap claim rests entirely on the accuracy of these ground-truth policy-violation labels, the absence of agreement metrics leaves open the possibility that observed leakage reflects annotation inconsistency rather than model failure.
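For reference, the pairwise agreement statistic the referee asks for can be computed as below. This is a minimal sketch, assuming each annotator assigns one categorical label per item (for example 1 = violation, 0 = no violation); the label vectors are invented for illustration and are not drawn from Doc-PP.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators over ten items.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.6
```

Here the annotators agree on 8 of 10 items (0.8 observed) against 0.5 expected by chance, giving kappa = 0.6.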

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on annotation reliability. We agree that inter-annotator agreement metrics are essential to substantiate the ground-truth policy-violation labels and will incorporate them in the revision.

Point-by-point responses
  1. Referee: [Section 3] Section 3: Policy construction is described as manual extraction from real-world reports, yet no inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa) or conflict-resolution protocol are reported. Because the Reasoning-Induced Safety Gap claim rests entirely on the accuracy of these ground-truth policy-violation labels, the absence of agreement metrics leaves open the possibility that observed leakage reflects annotation inconsistency rather than model failure.

    Authors: We acknowledge the validity of this concern. In the revised manuscript we will add a dedicated subsection in Section 3 reporting inter-annotator agreement on a randomly sampled subset of 200 documents. Two additional annotators independently labeled policy-violation spans; we will report Cohen’s kappa (pairwise) and Fleiss’ kappa (multi-annotator) together with the conflict-resolution protocol (majority vote followed by adjudication by a senior annotator for disagreements). These statistics will directly support the reliability of the Reasoning-Induced Safety Gap findings.

    revision: yes
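The multi-annotator statistic and the resolution protocol named in the response can be sketched as follows. The assumptions are ours: each item receives one categorical label from every annotator, the same number of annotators rates every item, and the `adjudicate` callback is a hypothetical stand-in for the senior annotator. The example labels are invented.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items each labeled by the same number of raters (>= 2)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    categories = sorted({label for item in ratings for label in item})
    counts: List[List[int]] = [
        [Counter(item)[c] for c in categories] for item in ratings
    ]
    # Per-item agreement: fraction of rater pairs that agree.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in proportions)
    if p_e == 1:
        return 1.0  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


def resolve(votes: Sequence[Hashable],
            adjudicate: Callable[[Sequence[Hashable]], Hashable]) -> Hashable:
    """Strict majority wins; otherwise defer to the adjudicator."""
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:
        return label
    return adjudicate(votes)


# Hypothetical labels: four items, three annotators each (1 = violation).
labels = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
kappa = fleiss_kappa(labels)
gold = [resolve(votes, adjudicate=lambda v: v[0]) for votes in labels]
print(round(kappa, 3), gold)
```

With three annotators and binary labels a strict majority always exists, so the adjudicator is only reached when the label set is larger or the annotator count is even.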

Circularity Check

0 steps flagged

No significant circularity in benchmark or framework

Full rationale

The paper introduces Doc-PP as an empirical benchmark constructed from real-world reports and proposes the DVA framework based on experimental evaluations of LVLMs. No equations, derivations, or mathematical claims are present in the provided text. The central claims about the Reasoning-Induced Safety Gap rest on observed model behaviors rather than any reduction to fitted parameters, self-definitions, or load-bearing self-citations. The work is self-contained as a benchmark evaluation without circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Central claims rest on the newly introduced benchmark and framework. The abstract specifies no free parameters or axioms, and the only invented entities are the benchmark and the framework themselves.

invented entities (2)
  • Doc-PP benchmark · no independent evidence
    purpose: Evaluate policy preservation in multimodal document QA
    Newly constructed dataset from real-world reports
  • DVA framework · no independent evidence
    purpose: Decouple reasoning from policy verification
    Proposed structural inference method

pith-pipeline@v0.9.0 · 5480 in / 1060 out tokens · 25544 ms · 2026-05-16T16:27:04.902508+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Policy-Invisible Violations in LLM-Based Agents

    cs.AI · 2026-04 · unverdicted · novelty 6.0

    LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.