ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
ContextLens lets LLMs assess legal compliance with privacy and AI safety rules by answering crafted questions on applicability, principles, and provisions, even when contexts are ambiguous or incomplete.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ContextLens instructs LLMs to answer a set of crafted questions spanning applicability, general principles, and detailed provisions, assessing compliance against pre-defined priorities and rules. By explicitly grounding the input context in the legal domain and identifying both known and unknown factors, the method improves LLM performance on GDPR and EU AI Act compliance benchmarks and surfaces ambiguous or missing information.
What carries the argument
The set of crafted questions on applicability, general principles, and detailed provisions, which the LLM answers to ground ambiguous contexts and flag unknown factors for compliance evaluation.
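The mechanism can be pictured as a question-driven grounding loop. The sketch below is illustrative only: the question wordings, stage names, and the `ask_llm` stub are invented here and are not the paper's actual templates.

```python
# Illustrative sketch of ContextLens-style grounding; question texts and
# the `ask_llm` stub are hypothetical, not the paper's real prompts.
QUESTIONS = {
    "applicability": "Does the regulation apply to this case? Answer yes/no/unknown with a reason.",
    "principles": "Which general principles are at stake? List them, or answer 'unknown'.",
    "provisions": "Which detailed provisions are triggered, and is each satisfied, violated, or unknown?",
}

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def ground_context(case_text: str, ask=ask_llm) -> dict:
    """Answer each crafted question in turn, separating known from unknown factors."""
    answers, unknown = {}, []
    for stage, question in QUESTIONS.items():
        reply = ask(f"Case: {case_text}\n\nQuestion: {question}")
        answers[stage] = reply
        if "unknown" in reply.lower():
            unknown.append(stage)  # flag stages where context is missing or ambiguous
    return {"answers": answers, "unknown_factors": unknown}
```

The key design point the review highlights is the last branch: instead of collapsing everything into one verdict, stages whose answers come back "unknown" are surfaced explicitly.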
If this is right
- LLM-based compliance checks become usable on real-world inputs that lack complete or clear context.
- Assessment pipelines can now explicitly surface ambiguous and missing factors rather than producing opaque verdicts.
- The same question set works across GDPR and EU AI Act benchmarks without retraining the underlying model.
- Compliance evaluation shifts from direct outcome prediction to structured grounding against legal provisions.
Where Pith is reading between the lines
- The same question-answering structure could be adapted to other regulatory regimes by swapping the priority and rule definitions.
- Downstream systems might use the identified missing factors to trigger targeted data collection or clarification steps before final decisions.
- If the questions prove robust across domains, they could serve as a reusable scaffold for auditing LLM safety and privacy outputs.
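The second speculation above — using identified missing factors to trigger clarification before a final decision — could look like the following routing sketch. The policy and names here are invented for illustration; nothing of the sort appears in the paper.

```python
# Hedged sketch: how a downstream system might consume ContextLens-style
# "unknown factor" flags. The routing policy is hypothetical.
def route_assessment(result: dict) -> str:
    """Return a next action based on which factors were flagged as unknown."""
    unknown = result.get("unknown_factors", [])
    if not unknown:
        return "finalize"            # context complete: emit the compliance verdict
    if "applicability" in unknown:
        return "escalate_to_expert"  # can't even tell whether the regulation applies
    return "request_clarification:" + ",".join(unknown)
```

This is the practical payoff of explicit unknown-factor identification: the pipeline can defer or escalate instead of guessing.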
Load-bearing premise
The chosen questions are sufficient to capture the essential legal compliance requirements and LLMs can reliably ground ambiguous contexts and identify missing factors without systematic errors.
What would settle it
A controlled experiment on a new compliance scenario in which ContextLens either fails to improve accuracy over direct assessment baselines or misses a critical missing factor that human experts identify as decisive for the outcome.
Original abstract
Individuals' concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs' compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ContextLens, a semi-rule-based framework that leverages LLMs to ground ambiguous and incomplete real-world contexts for legal compliance assessment in privacy and safety. Instead of direct outcome evaluation, it instructs LLMs to answer a fixed set of crafted questions spanning applicability, general principles, and detailed provisions aligned with pre-defined rules, while also identifying known and unknown factors. Experiments on GDPR and EU AI Act compliance benchmarks claim significant improvements over existing baselines with no training required, plus the ability to flag ambiguous and missing factors.
Significance. If the reported gains hold under rigorous validation, ContextLens would offer a practical, training-free method for improving LLM-based legal compliance reasoning in imperfect contexts, with secondary utility in surfacing information gaps. This could be relevant for AI safety and privacy tooling, but the significance is constrained by the absence of evidence that the static question template captures jurisdiction-specific precedents, proportionality tests, or fact-specific inferences that often determine actual legal outcomes.
major comments (2)
- [Abstract and experimental evaluation] The central claim that ContextLens 'significantly improve[s] LLMs' compliance assessment and surpass[es] existing baselines without any training' (abstract) is load-bearing but unsupported by details on exact baseline implementations, prompt templates, statistical tests, or effect sizes; without these, it is impossible to distinguish genuine context-modeling gains from prompt-engineering artifacts.
- [ContextLens framework description] The framework's reliance on a static, hand-crafted question set for applicability, general principles, and detailed provisions (abstract) is presented as sufficient to produce reliable compliance scores, yet the manuscript provides no expert legal validation, comparison to full case-law analysis, or test for systematic LLM mis-grounding of ambiguities; this directly undermines the claim that unknown factors are reliably identified.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment point by point below, proposing targeted revisions to improve clarity and transparency while maintaining the manuscript's focus on a practical, training-free framework.
Point-by-point responses
- Referee: [Abstract and experimental evaluation] The central claim that ContextLens 'significantly improve[s] LLMs' compliance assessment and surpass[es] existing baselines without any training' (abstract) is load-bearing but unsupported by details on exact baseline implementations, prompt templates, statistical tests, or effect sizes; without these, it is impossible to distinguish genuine context-modeling gains from prompt-engineering artifacts.
Authors: We agree that the experimental evaluation section requires greater specificity to support the central claims. In the revised manuscript, we will add: (1) the complete prompt templates for ContextLens and all baselines, (2) precise descriptions of baseline implementations including any shared prompting strategies, (3) statistical significance tests (e.g., paired t-tests with reported p-values), and (4) effect sizes. These additions will allow readers to assess whether gains arise from the structured question-based grounding rather than prompt variations alone. Our existing ablation studies already isolate the contribution of the applicability, principles, and provisions components.
Revision: yes
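The statistics the rebuttal promises are standard. A minimal stdlib sketch of the paired t statistic and Cohen's d over per-case scores (assuming binary per-case accuracy; `scipy.stats.ttest_rel` would additionally give the p-value):

```python
# Paired t statistic and Cohen's d for paired designs, stdlib only.
# `scores_a` / `scores_b` are illustrative per-case accuracies (1 = correct).
from statistics import mean, stdev

def paired_t_and_d(scores_a, scores_b):
    """t statistic (df = n - 1) and Cohen's d on paired per-case scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)                  # sample standard deviation of differences
    t = mean(diffs) / (sd / n ** 0.5)  # paired t statistic
    d = mean(diffs) / sd               # Cohen's d for paired designs
    return t, d
```

Reporting both is what separates "significant" from "large": t (with its p-value) addresses chance, d addresses effect size.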
- Referee: [ContextLens framework description] The framework's reliance on a static, hand-crafted question set for applicability, general principles, and detailed provisions (abstract) is presented as sufficient to produce reliable compliance scores, yet the manuscript provides no expert legal validation, comparison to full case-law analysis, or test for systematic LLM mis-grounding of ambiguities; this directly undermines the claim that unknown factors are reliably identified.
Authors: We recognize that expert legal validation and direct comparison against full case-law analysis would strengthen claims about reliability. However, such validation requires access to legal experts and jurisdiction-specific case databases that exceed the scope and resources of this work. The current benchmarks already contain ambiguous and incomplete contexts, and we evaluate unknown-factor identification through explicit flagging performance. We will expand the limitations section to explicitly discuss the absence of precedent-based or proportionality testing and the risk of LLM mis-grounding, while outlining directions for future expert-in-the-loop studies.
Revision: partial
- Deferred to future work: expert legal validation of the static question set and systematic comparison against full case-law analysis, as both require external legal expertise and proprietary data sources unavailable for the current study.
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper introduces ContextLens as a semi-rule-based prompting framework that directs LLMs to answer a fixed set of crafted questions spanning applicability, general principles, and detailed provisions for GDPR and EU AI Act compliance assessment. This method is defined directly by the question template and LLM grounding instructions rather than by any fitted parameters, self-referential equations, or load-bearing self-citations that reduce the central claim to its own inputs. Empirical results on existing benchmarks are presented as independent validation, with no uniqueness theorems, ansatzes, or renamings that collapse the framework back onto prior fitted quantities or author-defined constructs. The derivation chain remains self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models can reliably answer crafted legal questions to ground context and identify known versus unknown factors.