pith. machine review for the scientific record.

arxiv: 2605.02472 · v1 · submitted 2026-05-04 · 💻 cs.CL


Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication


Pith reviewed 2026-05-09 16:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords: legal reasoning · neuro-symbolic AI · deterministic contract language · amortized intelligence · graph-based execution · legal auditability · large language models · reasoning consistency

The pith

Legal texts are translated once by an LLM into a deterministic graph for accurate, low-cost, auditable adjudication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a neuro-symbolic method to handle computational legal clauses that require complex logic. An LLM performs a single translation of the legal text into Deterministic Autonomous Contract Language (DACL), a typed graph intermediate representation. Subsequent adjudication runs on deterministic graph execution rather than repeated probabilistic inference from large reasoning models. This yields near-perfect consistency, avoids the reasoning cliff of probabilistic models, cuts compute costs by more than 90 percent in high-volume settings, and generates visually auditable traces that satisfy legal requirements.

Core claim

The central claim is that legal reasoning at scale becomes reliable when an LLM is used only once to convert text into DACL, a typed graph representation, after which all adjudication occurs through deterministic graph executions that carry a complete, visually inspectable trace. Against runtime baselines such as GPT-5.2 and Gemini 3 Pro, the resulting DACL-based agent delivers near-perfect consistency, removes the reasoning cliff, reduces costs by over 90 percent, and meets strict auditability standards for legal systems.

What carries the argument

Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation that encodes legal clauses so that adjudication reduces to deterministic execution with an auditable trace.
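The paper does not publish DACL's concrete syntax, so the node names, types, and operators below are purely illustrative. As a minimal sketch of the pattern it describes, a typed clause graph paired with a deterministic evaluator that records an audit trace might look like:

```python
from dataclasses import dataclass

# Hypothetical sketch only: DACL's actual representation is not specified
# in the reviewed text; ops and shapes here are invented for illustration.

@dataclass(frozen=True)
class Node:
    op: str        # "const", "input", "ge", "mul", or "if"
    args: tuple = ()

def evaluate(node: Node, facts: dict, trace: list):
    """Deterministically evaluate a clause graph, appending every step to the trace."""
    if node.op == "const":
        value = node.args[0]
    elif node.op == "input":
        value = facts[node.args[0]]
    elif node.op == "ge":
        value = evaluate(node.args[0], facts, trace) >= evaluate(node.args[1], facts, trace)
    elif node.op == "mul":
        value = evaluate(node.args[0], facts, trace) * evaluate(node.args[1], facts, trace)
    elif node.op == "if":
        cond = evaluate(node.args[0], facts, trace)
        value = evaluate(node.args[1] if cond else node.args[2], facts, trace)
    else:
        raise ValueError(f"unknown op {node.op!r}")
    trace.append((node.op, value))  # the trace is the inspectable record
    return value

# "A late fee of 5% applies if payment is more than 30 days overdue."
late_fee = Node("if", (
    Node("ge", (Node("input", ("days_overdue",)), Node("const", (30,)))),
    Node("mul", (Node("input", ("amount",)), Node("const", (0.05,)))),
    Node("const", (0.0,)),
))

trace = []
fee = evaluate(late_fee, {"days_overdue": 45, "amount": 1000.0}, trace)
# fee is 50.0; trace lists every node visited with its computed value
```

The point of the pattern is that once the graph exists, every adjudication is a pure function of the input facts, and the trace doubles as the audit artifact.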

If this is right

  • Compute costs drop by more than 90 percent in high-volume legal workflows because the expensive LLM step occurs only once.
  • Near-perfect consistency is achieved because probabilistic inference is replaced by deterministic graph execution.
  • The reasoning cliff disappears because errors are confined to the initial translation rather than accumulating across repeated model calls.
  • Every adjudication produces a visually auditable trace that satisfies the transparency demands of legal systems.
  • Production systems become feasible where repeated large-model inference would otherwise remain too expensive or error-prone.
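The over-90-percent figure is a straightforward consequence of amortization whenever per-adjudication cost collapses after the one-time translation. A back-of-envelope sketch, with wholly hypothetical prices (the paper's actual cost figures are not reproduced here):

```python
# Illustrative amortization arithmetic; all dollar amounts are invented.
llm_translation_cost = 2.00    # one-time LLM pass, per contract
graph_execution_cost = 0.0005  # per adjudication, deterministic execution
llm_runtime_cost = 0.05        # per adjudication with a runtime LRM baseline

def amortized_saving(n_adjudications: int) -> float:
    """Fractional cost saving of translate-once vs. LLM-at-runtime."""
    baseline = n_adjudications * llm_runtime_cost
    amortized = llm_translation_cost + n_adjudications * graph_execution_cost
    return 1.0 - amortized / baseline

amortized_saving(10_000)  # well above 0.9 with these numbers
```

At low volume the one-time translation dominates and the saving can be negative; the claim is specifically about high-volume workflows.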

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same one-time translation plus deterministic execution pattern could be tested in regulatory compliance or financial contract monitoring where rule-based logic dominates.
  • Measuring translation accuracy across different legal jurisdictions would show how much jurisdiction-specific knowledge must be supplied to the LLM.
  • The auditable graph traces open the possibility of automated review pipelines in which non-lawyers can inspect intermediate steps without needing to re-run the model.
  • Pre-computing DACL graphs for standard contract templates could further amortize the initial cost across thousands of similar cases.

Load-bearing premise

A single LLM pass can produce a complete and accurate DACL graph from arbitrary legal text without semantic loss or logical gaps that would affect downstream deterministic execution.

What would settle it

The claim would be falsified by a test set of complex legal clauses on which the outcomes of DACL graph executions systematically diverge from the consensus of independent legal experts evaluating the same inputs.
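Such a falsification test reduces to comparing graph outcomes against an expert majority vote over a shared case set. A minimal harness, with invented case labels and outcomes:

```python
from collections import Counter

# Hypothetical harness: clause outcomes and expert votes are illustrative.

def expert_consensus(votes: list) -> str:
    """Majority label among independent expert votes."""
    return Counter(votes).most_common(1)[0][0]

def disagreement_rate(cases: list) -> float:
    """cases: list of (graph_outcome, [expert votes]) pairs."""
    disagreements = sum(
        1 for graph_outcome, votes in cases
        if graph_outcome != expert_consensus(votes)
    )
    return disagreements / len(cases)

cases = [
    ("liable",     ["liable", "liable", "not_liable"]),
    ("not_liable", ["not_liable", "not_liable", "not_liable"]),
    ("liable",     ["not_liable", "not_liable", "liable"]),  # a miss
]
disagreement_rate(cases)  # 1/3 on this toy set
```

A systematically high disagreement rate on complex clauses, rather than isolated misses, is what would count against the central claim.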

Figures

Figures reproduced from arXiv: 2605.02472 by Stanisław Sójka, Witold Kowalczyk.

Figure 1: Architectural Comparison: Baseline vs. Amortized Intelligence. The baseline (top) relies on expensive, probabilistic inference-time compute for every transaction, leading to linear cost scaling and potential hallucinations. Our proposed approach (bottom) shifts reasoning to a one-time compile-time step, translating the contract into a DACL graph. This enables the runtime agent to execute deterministic logi…
Figure 2: Error Taxonomy by Model Configuration. The stacked bars show the total failure count for each model (N = 400 events). Arithmetic (AH) errors remain negligible across all models, confirming that the difficulty lies in state tracking rather than computation. This presents a paradoxical finding: frontier models have mastered the computational primitives of law (arithmetic) but lack the structural fidelity to…
read the original abstract

Legal texts often contain computational legal clauses--provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation. Adjudication then relies on deterministic graph executions with a visually auditable trace. In comparison against runtime LRM baselines (including GPT-5.2 and Gemini 3 Pro), our DACL-based Agent achieves near-perfect consistency and mitigates the "reasoning cliff" observed in probabilistic models. The system reduces compute costs by over 90% in high-volume workflows while satisfying the strict auditability requirements of legal adjudication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a neuro-symbolic framework called Amortized Intelligence for legal adjudication. An LLM is used once to translate legal text into Deterministic Autonomous Contract Language (DACL), a typed graph intermediate representation. Adjudication then proceeds via deterministic graph execution with a visually auditable trace. The DACL-based agent is claimed to deliver near-perfect consistency, mitigate the reasoning cliff seen in probabilistic LRMs, reduce compute costs by over 90% in high-volume workflows, and satisfy legal auditability requirements, outperforming runtime baselines such as GPT-5.2 and Gemini 3 Pro.

Significance. If the central claims were substantiated with rigorous evidence, the work would be significant for neuro-symbolic AI and legal informatics. Offloading reasoning to deterministic execution after a single neural translation step could address inconsistency and cost barriers that currently limit LLM deployment in high-stakes legal settings, while the emphasis on structural auditability directly responds to regulatory and practical needs for explainable adjudication systems.

major comments (2)
  1. [Abstract] The performance claims of near-perfect consistency, mitigation of the reasoning cliff, and over 90% cost reduction are asserted without any reported data, baselines, error analysis, methodology details, or quantitative comparisons to the cited LRM baselines. No evidence is supplied to support the central assertions.
  2. [Abstract] The load-bearing assumption that a single LLM pass produces a complete, semantically faithful DACL graph from arbitrary legal text (without omissions, mis-typings, or logical gaps) is stated but unsupported by any metrics on translation fidelity such as clause coverage, type correctness, or expert agreement on graph structure. Deterministic execution is only as reliable as its input graph.
minor comments (1)
  1. [Abstract] The acronym DACL and the term Amortized Intelligence are introduced without an explicit definition, formal syntax, or comparison to related intermediate representations in the neuro-symbolic literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which correctly identifies areas where the abstract's claims require stronger evidentiary support. We address each point below and will revise the manuscript accordingly to improve clarity and substantiation without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] The performance claims of near-perfect consistency, mitigation of the reasoning cliff, and over 90% cost reduction are asserted without any reported data, baselines, error analysis, methodology details, or quantitative comparisons to the cited LRM baselines. No evidence is supplied to support the central assertions.

    Authors: We agree that the abstract asserts these performance outcomes without including or referencing supporting data, which limits its standalone value. The manuscript body contains the relevant experimental comparisons to GPT-5.2 and Gemini 3 Pro along with discussions of consistency and cost, but these are not summarized or cited in the abstract itself. We will revise the abstract to incorporate concise quantitative highlights drawn from the evaluation sections and add explicit references to the corresponding figures, tables, and methodology details. revision: yes

  2. Referee: [Abstract] The load-bearing assumption that a single LLM pass produces a complete, semantically faithful DACL graph from arbitrary legal text (without omissions, mis-typings, or logical gaps) is stated but unsupported by any metrics on translation fidelity such as clause coverage, type correctness, or expert agreement on graph structure. Deterministic execution is only as reliable as its input graph.

    Authors: This observation is accurate and central to the framework's validity. The manuscript describes the translation process but provides no quantitative metrics assessing its fidelity or completeness. We will add a new evaluation subsection focused on the translation step, reporting clause coverage, type correctness, and expert agreement metrics, and will reference these results in the revised abstract to substantiate the assumption. revision: yes
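The fidelity metrics the rebuttal promises (clause coverage, type correctness, expert agreement) are simple ratios once the counts exist. A hypothetical sketch of how they might be computed; all function names and inputs are invented for illustration:

```python
# Illustrative translation-fidelity metrics; the paper defines none of these.

def clause_coverage(source_clauses: set, graph_clauses: set) -> float:
    """Fraction of source clauses represented in the translated graph."""
    return len(source_clauses & graph_clauses) / len(source_clauses)

def type_correctness(typed_nodes: int, well_typed_nodes: int) -> float:
    """Fraction of graph nodes whose assigned type checks out."""
    return well_typed_nodes / typed_nodes

def expert_agreement(labels_a: list, labels_b: list) -> float:
    """Raw agreement between two annotators on graph-structure judgments."""
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)
```

Reporting these three numbers would directly address the referee's second major comment, since deterministic execution inherits any gap in the translated graph.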

Circularity Check

0 steps flagged

No circularity; claims rest on unquantified empirical comparison without self-referential derivations

full rationale

The paper describes a neuro-symbolic pipeline in which an LLM performs a one-time translation of legal text into a DACL graph, after which adjudication proceeds via deterministic execution. The headline results (near-perfect consistency, mitigation of the reasoning cliff, >90% cost reduction) are asserted via comparison to runtime LRM baselines. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the provided text. The central claim therefore does not reduce to its own inputs by construction; it is an empirical assertion whose verification would require external metrics on translation fidelity, which are not supplied but whose absence does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the unverified assumption that legal logic can be losslessly captured in a typed graph and that one LLM translation suffices for all subsequent deterministic runs. No free parameters are stated. DACL is introduced as a new entity without independent evidence of completeness.

axioms (1)
  • domain assumption: An LLM can translate arbitrary legal text into a complete and semantically faithful DACL graph representation.
    This is the load-bearing premise of the neuro-symbolic offloading step described in the abstract.
invented entities (1)
  • Deterministic Autonomous Contract Language (DACL): no independent evidence
    purpose: Typed graph intermediate representation that enables deterministic execution and visual audit trails for legal clauses
    New language introduced to replace probabilistic reasoning; no external validation or prior reference provided in the abstract.

pith-pipeline@v0.9.0 · 5453 in / 1442 out tokens · 31618 ms · 2026-05-09T16:09:32.924216+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 15 canonical work pages
