ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
ContextLens lets LLMs assess legal compliance with privacy and AI safety rules by answering crafted questions on applicability, principles, and provisions, even when contexts are ambiguous or incomplete.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ContextLens instructs LLMs to answer a set of crafted questions spanning applicability, general principles, and detailed provisions, assessing compliance against pre-defined priorities and rules. By explicitly grounding the input context in the legal domain and identifying both known and unknown factors, the method improves LLM performance on GDPR and EU AI Act compliance benchmarks and surfaces ambiguous or missing information.
What carries the argument
The set of crafted questions on applicability, general principles, and detailed provisions, which the LLM answers to ground ambiguous contexts and flag unknown factors for compliance evaluation.
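The mechanism can be pictured as a question-driven grounding loop. The sketch below is illustrative only: the question wordings, stage names, and the `ask_llm` stub are invented here and are not the paper's actual templates.

```python
# Illustrative sketch of ContextLens-style grounding; question texts and
# the `ask_llm` stub are hypothetical, not the paper's real prompts.
QUESTIONS = {
    "applicability": "Does the regulation apply to this case? Answer yes/no/unknown with a reason.",
    "principles": "Which general principles are at stake? List them, or answer 'unknown'.",
    "provisions": "Which detailed provisions are triggered, and is each satisfied, violated, or unknown?",
}

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def ground_context(case_text: str, ask=ask_llm) -> dict:
    """Answer each crafted question in turn, separating known from unknown factors."""
    answers, unknown = {}, []
    for stage, question in QUESTIONS.items():
        reply = ask(f"Case: {case_text}\n\nQuestion: {question}")
        answers[stage] = reply
        if "unknown" in reply.lower():
            unknown.append(stage)  # flag stages where context is missing or ambiguous
    return {"answers": answers, "unknown_factors": unknown}
```

The key design point the review highlights is the last branch: instead of collapsing everything into one verdict, stages whose answers come back "unknown" are surfaced explicitly.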
If this is right
- LLM-based compliance checks become usable on real-world inputs that lack complete or clear context.
- Assessment pipelines can now explicitly surface ambiguous and missing factors rather than producing opaque verdicts.
- The same question set works across GDPR and EU AI Act benchmarks without retraining the underlying model.
- Compliance evaluation shifts from direct outcome prediction to structured grounding against legal provisions.
Where Pith is reading between the lines
- The same question-answering structure could be adapted to other regulatory regimes by swapping the priority and rule definitions.
- Downstream systems might use the identified missing factors to trigger targeted data collection or clarification steps before final decisions.
- If the questions prove robust across domains, they could serve as a reusable scaffold for auditing LLM safety and privacy outputs.
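The second speculation above — using identified missing factors to trigger clarification before a final decision — could look like the following routing sketch. The policy and names here are invented for illustration; nothing of the sort appears in the paper.

```python
# Hedged sketch: how a downstream system might consume ContextLens-style
# "unknown factor" flags. The routing policy is hypothetical.
def route_assessment(result: dict) -> str:
    """Return a next action based on which factors were flagged as unknown."""
    unknown = result.get("unknown_factors", [])
    if not unknown:
        return "finalize"            # context complete: emit the compliance verdict
    if "applicability" in unknown:
        return "escalate_to_expert"  # can't even tell whether the regulation applies
    return "request_clarification:" + ",".join(unknown)
```

This is the practical payoff of explicit unknown-factor identification: the pipeline can defer or escalate instead of guessing.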
Load-bearing premise
The chosen questions are sufficient to capture the essential legal compliance requirements and LLMs can reliably ground ambiguous contexts and identify missing factors without systematic errors.
What would settle it
A controlled experiment on a new compliance scenario in which ContextLens either fails to improve accuracy over direct assessment baselines or misses a critical missing factor that human experts identify as decisive for the outcome.
Original abstract
Individuals' concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs' compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ContextLens, a semi-rule-based framework that leverages LLMs to ground ambiguous and incomplete real-world contexts for legal compliance assessment in privacy and safety. Instead of direct outcome evaluation, it instructs LLMs to answer a fixed set of crafted questions spanning applicability, general principles, and detailed provisions aligned with pre-defined rules, while also identifying known and unknown factors. Experiments on GDPR and EU AI Act compliance benchmarks claim significant improvements over existing baselines with no training required, plus the ability to flag ambiguous and missing factors.
Significance. If the reported gains hold under rigorous validation, ContextLens would offer a practical, training-free method for improving LLM-based legal compliance reasoning in imperfect contexts, with secondary utility in surfacing information gaps. This could be relevant for AI safety and privacy tooling, but the significance is constrained by the absence of evidence that the static question template captures jurisdiction-specific precedents, proportionality tests, or fact-specific inferences that often determine actual legal outcomes.
major comments (2)
- [Abstract and experimental evaluation] The central claim that ContextLens 'significantly improve[s] LLMs' compliance assessment and surpass[es] existing baselines without any training' (abstract) is load-bearing but unsupported by details on exact baseline implementations, prompt templates, statistical tests, or effect sizes; without these, it is impossible to distinguish genuine context-modeling gains from prompt-engineering artifacts.
- [ContextLens framework description] The framework's reliance on a static, hand-crafted question set for applicability, general principles, and detailed provisions (abstract) is presented as sufficient to produce reliable compliance scores, yet the manuscript provides no expert legal validation, comparison to full case-law analysis, or test for systematic LLM mis-grounding of ambiguities; this directly undermines the claim that unknown factors are reliably identified.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment point by point below, proposing targeted revisions to improve clarity and transparency while maintaining the manuscript's focus on a practical, training-free framework.
Point-by-point responses
- Referee: [Abstract and experimental evaluation] The central claim that ContextLens 'significantly improve[s] LLMs' compliance assessment and surpass[es] existing baselines without any training' (abstract) is load-bearing but unsupported by details on exact baseline implementations, prompt templates, statistical tests, or effect sizes; without these, it is impossible to distinguish genuine context-modeling gains from prompt-engineering artifacts.
Authors: We agree that the experimental evaluation section requires greater specificity to support the central claims. In the revised manuscript, we will add: (1) the complete prompt templates for ContextLens and all baselines, (2) precise descriptions of baseline implementations including any shared prompting strategies, (3) statistical significance tests (e.g., paired t-tests with reported p-values), and (4) effect sizes. These additions will allow readers to assess whether gains arise from the structured question-based grounding rather than prompt variations alone. Our existing ablation studies already isolate the contribution of the applicability, principles, and provisions components.
Revision: yes
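The statistics the rebuttal promises are standard. A minimal stdlib sketch of the paired t statistic and Cohen's d over per-case scores (assuming binary per-case accuracy; `scipy.stats.ttest_rel` would additionally give the p-value):

```python
# Paired t statistic and Cohen's d for paired designs, stdlib only.
# `scores_a` / `scores_b` are illustrative per-case accuracies (1 = correct).
from statistics import mean, stdev

def paired_t_and_d(scores_a, scores_b):
    """t statistic (df = n - 1) and Cohen's d on paired per-case scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)                  # sample standard deviation of differences
    t = mean(diffs) / (sd / n ** 0.5)  # paired t statistic
    d = mean(diffs) / sd               # Cohen's d for paired designs
    return t, d
```

Reporting both is what separates "significant" from "large": t (with its p-value) addresses chance, d addresses effect size.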
- Referee: [ContextLens framework description] The framework's reliance on a static, hand-crafted question set for applicability, general principles, and detailed provisions (abstract) is presented as sufficient to produce reliable compliance scores, yet the manuscript provides no expert legal validation, comparison to full case-law analysis, or test for systematic LLM mis-grounding of ambiguities; this directly undermines the claim that unknown factors are reliably identified.
Authors: We recognize that expert legal validation and direct comparison against full case-law analysis would strengthen claims about reliability. However, such validation requires access to legal experts and jurisdiction-specific case databases that exceed the scope and resources of this work. The current benchmarks already contain ambiguous and incomplete contexts, and we evaluate unknown-factor identification through explicit flagging performance. We will expand the limitations section to explicitly discuss the absence of precedent-based or proportionality testing and the risk of LLM mis-grounding, while outlining directions for future expert-in-the-loop studies.
Revision: partial
- Deferred to future work: expert legal validation of the static question set and systematic comparison against full case-law analysis, as both require external legal expertise and proprietary data sources unavailable for the current study.
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper introduces ContextLens as a semi-rule-based prompting framework that directs LLMs to answer a fixed set of crafted questions spanning applicability, general principles, and detailed provisions for GDPR and EU AI Act compliance assessment. This method is defined directly by the question template and LLM grounding instructions rather than by any fitted parameters, self-referential equations, or load-bearing self-citations that reduce the central claim to its own inputs. Empirical results on existing benchmarks are presented as independent validation, with no uniqueness theorems, ansatzes, or renamings that collapse the framework back onto prior fitted quantities or author-defined constructs. The derivation chain remains self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models can reliably answer crafted legal questions to ground context and identify known versus unknown factors.