pith. machine review for the scientific record.

arxiv: 2604.04443 · v1 · submitted 2026-04-06 · 💻 cs.CL

Recognition: no theorem link

DeonticBench: A Benchmark for Reasoning over Rules

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords deontic reasoning · LLM benchmark · legal reasoning · Prolog · rule-based reasoning · symbolic execution · tax law · housing law

The pith

Large language models reach only 44 percent accuracy on complex deontic reasoning tasks drawn from real tax codes, policies, and statutes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark of more than six thousand tasks that test how AI systems handle obligations, permissions, and prohibitions under detailed real-world rules. Tasks come from U.S. federal taxes, airline baggage rules, immigration procedures, and state housing law. Models may reason directly in text or convert the statutes and facts into Prolog programs that can be executed for exact answers. Current frontier models top out at 44.4 percent on the hardest numeric subset and 46.6 macro-F1 on housing cases. Training improves the quality of generated Prolog programs, yet reinforcement learning still leaves the models unable to solve the tasks reliably.

Core claim

DEONTICBENCH supplies 6,232 tasks across four domains and supports both free-form language reasoning and an optional workflow that translates rules and facts into executable Prolog, complete with reference programs for every instance. Across frontier LLMs and coding models, the best hard-subset scores are 44.4 percent on SARA Numeric and 46.6 macro-F1 on Housing, while supervised fine-tuning and reinforcement learning raise Prolog generation quality without producing reliable solutions to the tasks.

What carries the argument

The dual workflow that lets models either reason directly in language or translate statutes plus case facts into executable Prolog programs that return formal interpretations and explicit traces.
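The contrast between free-text reasoning and executable rules can be made concrete with a toy sketch, written here in Python rather than the paper's Prolog. The filing threshold and the exception predicate below are hypothetical illustrations, not values taken from the benchmark:

```python
# Illustrative sketch only (not the paper's reference Prolog): one deontic
# pattern from the tax domain -- a default obligation defeated by an
# exception -- encoded as an executable rule.

def filing_obligation(facts):
    """Default rule: gross income above a threshold obligates filing,
    unless a statutory exception defeats the default."""
    if facts.get("qualifying_exception"):    # hypothetical exception predicate
        return False                         # exception defeats the obligation
    return facts["gross_income"] > 13_850    # illustrative threshold

case = {"gross_income": 236_422, "qualifying_exception": False}
print(filing_obligation(case))  # -> True: the obligation holds
```

The point of the symbolic route is exactly this determinism: once the statute is encoded, defaults and exceptions are applied mechanically, and the solver's trace shows which rule fired.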

If this is right

  • Frontier models remain unreliable for high-stakes reasoning about obligations and permissions in legal and policy domains.
  • Training on symbolic program generation improves output quality yet does not close the gap to reliable task performance.
  • Benchmarks that combine long-context language input with formal execution traces are required to measure progress on rule-based reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Tighter integration of language models with external solvers could be tested directly on the same tasks to measure whether performance jumps.
  • The persistent failure after training suggests that deontic reasoning may depend on mechanisms for tracking exceptions and priorities that current pre-training does not supply.
  • The benchmark could serve as a training signal for future models that aim to combine natural-language understanding with formal deduction.

Load-bearing premise

The chosen statutes, case facts, and Prolog translations faithfully represent the central difficulties of real-world deontic reasoning without introducing selection or formalization artifacts.

What would settle it

A model that consistently exceeds 70 percent accuracy on the hard subsets while using the supplied Prolog workflow, or a legal-expert audit that finds the reference translations do not match actual statute interpretations.

Figures

Figures reproduced from arXiv: 2604.04443 by Akhil Deo, Benjamin Van Durme, Guangyao Dou, Jingyu Zhang, Luis Brena, Nils Holzenberger, William Jurayj.

Figure 1: Walkthrough of a DEONTICBENCH instance in the symbolic setting. (1) Given the full problem context, the model performs deontic reasoning to identify and apply the relevant rules. (2) The LLM translates the problem into Prolog code. (3) The generated Prolog is executed by the SWI-Prolog solver. The illustrated example is a 2017 tax-liability case.
Figure 2: Performance decomposition analysis for SARA Numeric and Airline. Each model …
Figure 3: Reasoning effort ablation on SARA Numeric hard cases. Each panel corresponds …
Figure 4: Performance decomposition analysis for SARA Binary, Housing, and USCIS. Each …
Original abstract

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage, immigration, and housing law. Tasks support direct chain-of-thought reasoning in natural language or an optional symbolic workflow in which models translate statutes and facts into executable Prolog (with reference programs released for all instances). Evaluations across frontier LLMs and coding models show peak hard-subset performance of 44.4% on SARA Numeric and 46.6 macro-F1 on Housing; supervised fine-tuning and reinforcement learning improve Prolog generation quality but do not yield reliable task solutions.

Significance. If the tasks and reference Prolog programs faithfully encode the source statutes and facts, the benchmark would provide a valuable, reproducible resource for studying long-context deontic reasoning in high-stakes domains and for comparing neural versus symbolic approaches. The release of reference programs and the dual workflow are concrete strengths that support future work on improving rule-based reasoning.

major comments (2)
  1. [§3] §3 (Benchmark Construction): No validation is reported for the statute-to-Prolog translations (e.g., logical equivalence checks between natural-language and Prolog outcomes, expert review of deontic operators, or inter-annotator agreement on formalizations). This is load-bearing for the central claims, because the headline results (44.4% on SARA Numeric hard subset, 46.6 macro-F1 on Housing) and the conclusion that RL methods fail to solve the tasks reliably presuppose that the reference programs are faithful encodings; unexamined simplifications or encoding errors would make model failures artifacts of the benchmark rather than genuine deontic-reasoning limits.
  2. [§4] §4 (Task Subsets and Metrics): The definition and construction of the 'hard' subsets (used for the 44.4% figure) and the precise criteria for macro-F1 on Housing are not detailed with respect to selection bias controls or coverage of deontic edge cases. This weakens the interpretation that the reported numbers demonstrate broad model limitations rather than properties of the chosen instances.
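Since the second major comment turns on how macro-F1 is computed, it is worth recalling that macro-F1 averages per-label F1 without weighting by label frequency, so rare deontic labels count as much as common ones. A minimal sketch, with a hypothetical label set (the paper's exact Housing labels are not specified here):

```python
# Minimal macro-F1 sketch; label names are hypothetical illustrations.

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = ["permitted", "prohibited", "obligated", "permitted"]
pred = ["permitted", "permitted", "obligated", "permitted"]
print(round(macro_f1(gold, pred, ["permitted", "prohibited", "obligated"]), 3))  # -> 0.6
```

Under this metric a model that never predicts a rare label is penalized heavily on that label's F1, which is why the exact label set and class balance matter for interpreting the 46.6 figure.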
minor comments (1)
  1. [Table 1] Table 1 and Figure 2: Axis labels and caption text could more explicitly distinguish direct CoT accuracy from solver-based accuracy to avoid reader confusion when comparing the two workflows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of benchmark validation and subset construction. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): No validation is reported for the statute-to-Prolog translations (e.g., logical equivalence checks between natural-language and Prolog outcomes, expert review of deontic operators, or inter-annotator agreement on formalizations). This is load-bearing for the central claims, because the headline results (44.4% on SARA Numeric hard subset, 46.6 macro-F1 on Housing) and the conclusion that RL methods fail to solve the tasks reliably presuppose that the reference programs are faithful encodings; unexamined simplifications or encoding errors would make model failures artifacts of the benchmark rather than genuine deontic-reasoning limits.

    Authors: We agree that the absence of reported validation steps for the Prolog translations is a limitation, as the reference programs are central to the benchmark's utility and claims. The manuscript currently relies on releasing the programs for external verification without detailing internal checks. In the revision, we will expand §3 with a dedicated subsection on the translation methodology, including how deontic operators were mapped to Prolog predicates, sample logical equivalence verifications (by running Prolog on held-out instances and comparing to natural-language ground truth), and any expert consultation performed during construction. We will also discuss potential encoding limitations explicitly. revision: yes

  2. Referee: [§4] §4 (Task Subsets and Metrics): The definition and construction of the 'hard' subsets (used for the 44.4% figure) and the precise criteria for macro-F1 on Housing are not detailed with respect to selection bias controls or coverage of deontic edge cases. This weakens the interpretation that the reported numbers demonstrate broad model limitations rather than properties of the chosen instances.

    Authors: We concur that the hard-subset criteria and Housing metric details require clarification to support the interpretation of model limitations. The hard subsets were constructed using factors such as rule count, exception presence, and context length, but these were not fully enumerated. In the revised manuscript, we will augment §4 with explicit selection criteria, quantitative statistics on deontic feature coverage (e.g., obligations, permissions, prohibitions, conflicts), and confirmation that selection avoided unintended bias beyond the stated rules. We will also specify the exact label set and macro-F1 computation for Housing. revision: yes

Circularity Check

0 steps flagged

No circularity: new benchmark with empirical evaluations

Full rationale

The paper introduces DEONTICBENCH as a new collection of 6,232 tasks with released reference Prolog programs. All reported numbers (44.4% on SARA Numeric hard subset, 46.6 macro-F1 on Housing) are direct empirical measurements of model performance on these tasks. No equations, fitted parameters, or predictions appear that reduce to prior inputs by construction. No self-citations are invoked to justify uniqueness theorems or ansatzes. The derivation chain consists solely of task construction followed by external evaluation, which is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; no free parameters, mathematical axioms, or new invented entities are introduced. The work rests on existing legal texts and the Prolog language as external tools.

pith-pipeline@v0.9.0 · 5581 in / 1074 out tokens · 39365 ms · 2026-05-10T20:18:16.564455+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Many-Tier Instruction Hierarchy in LLM Agents

cs.CL · 2026-04 · unverdicted · novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

Reference graph

Works this paper leans on

44 extracted references · 26 canonical work pages · cited by 1 Pith paper · 10 internal anchors
