Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
Pith reviewed 2026-05-10 02:03 UTC · model grok-4.3
The pith
Requiring AI to explicitly identify missing information before deciding overcomes factual presumptuousness in legal adjudication, reaching 89% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that factual presumptuousness in AI adjudication is systematic across leading platforms but addressable through a structured prompting framework called SPEC. On a novel benchmark built from official training materials that systematically varies information completeness, standard RAG approaches reach only 15% accuracy on insufficient-information cases; advanced prompting improves handling of inconclusive cases but over-corrects by deferring on clear ones; and SPEC achieves 89% overall accuracy by requiring explicit identification of missing information before any determination, thereby supporting rather than supplanting human judgment wherever evidence is incomplete.
What carries the argument
SPEC (Structured Prompting for Evidence Checklists), a prompting structure that requires the AI to produce an explicit list of missing information before outputting any adjudication decision.
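The paper's exact prompt wording and checklist contents are not reproduced in this review, so the checklist items, function names, and response format in the sketch below are illustrative assumptions; it shows only the general shape of an evidence-checklist wrapper that forces an explicit missing-information pass before any determination.

```python
# Minimal sketch of a SPEC-style evidence-checklist wrapper. The checklist
# items, prompt wording, and parsing conventions are illustrative assumptions,
# not the authors' actual implementation.
from dataclasses import dataclass, field

# Hypothetical checklist for a separation-from-employment issue.
SEPARATION_CHECKLIST = [
    "reason for separation as stated by the claimant",
    "reason for separation as stated by the employer",
    "final incident date and description",
    "whether the claimant was warned about the conduct",
]

@dataclass
class Adjudication:
    decision: str                      # "eligible", "ineligible", or "defer"
    missing: list = field(default_factory=list)

def build_spec_prompt(case_text: str, checklist=SEPARATION_CHECKLIST) -> str:
    """Build a prompt that requires an explicit checklist pass before any decision."""
    items = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are assisting with an unemployment insurance determination.\n"
        f"Case file:\n{case_text}\n\n"
        "Step 1. For each required fact below, state PRESENT or MISSING:\n"
        f"{items}\n\n"
        "Step 2. If any required fact is MISSING, output DECISION: DEFER and list the "
        "missing facts. Only if every fact is PRESENT may you output DECISION: ELIGIBLE "
        "or DECISION: INELIGIBLE with a brief rationale."
    )

def parse_spec_response(response: str) -> Adjudication:
    """Parse the model's reply; any missing fact forces deferral."""
    upper = response.upper()
    missing = [ln.strip("- ").split(":", 1)[0].strip()
               for ln in response.splitlines() if "MISSING" in ln.upper()]
    if missing or "DECISION: DEFER" in upper:
        return Adjudication("defer", missing)
    if "DECISION: INELIGIBLE" in upper:
        return Adjudication("ineligible")
    if "DECISION: ELIGIBLE" in upper:
        return Adjudication("eligible")
    return Adjudication("defer")  # fail closed if the response ignores the format
```

The design point carried by the framework, as described above, is the ordering constraint: the checklist pass is produced first and gates the determination, so an incomplete case file yields a deferral with named gaps rather than a confident guess.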
If this is right
- Standard RAG-based approaches achieve only 15% accuracy when information is insufficient.
- Advanced prompting improves accuracy on inconclusive cases but causes over-deferral on cases with sufficient evidence.
- SPEC reaches 89% overall accuracy while deferring appropriately on inconclusive cases.
- Presumptuousness is systematic but can be addressed, forming a necessary step toward AI that supports human judgment in adjudication.
- The problem and the fix appear across four leading AI platforms.
Where Pith is reading between the lines
- Similar evidence-checklist prompting could be tested in other evidence-dependent fields such as medical diagnosis or regulatory review.
- Benchmarks that deliberately vary information completeness may become useful for measuring AI reliability beyond standard accuracy scores.
- Embedding explicit deferral signals could allow hybrid systems where AI surfaces cases needing extra human fact-finding, lowering overall error rates (a minimal routing sketch follows this list).
- The use of official domain guidance to build the benchmark indicates that access to authoritative training materials is important for creating effective safeguards.
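A minimal sketch of that hybrid routing idea follows, assuming each case has already been scored by a deferral-aware model; the field names, queue names, and example cases are hypothetical, not drawn from the paper.

```python
# Minimal sketch of a hybrid triage loop: an explicit deferral signal routes
# incomplete cases to human fact-finding instead of forcing a determination.
# Field names and example cases are illustrative assumptions.
from typing import Iterable, List, Tuple

def route_cases(results: Iterable[Tuple[str, str, List[str]]]):
    """results: (case_id, decision, missing_facts) triples, where decision is
    'eligible', 'ineligible', or 'defer'."""
    auto_decided, needs_fact_finding = [], []
    for case_id, decision, missing in results:
        if decision == "defer":
            # Surface exactly which facts a human adjudicator should collect.
            needs_fact_finding.append((case_id, missing))
        else:
            auto_decided.append((case_id, decision))
    return auto_decided, needs_fact_finding

# Toy example: one complete case, one with a missing separation detail.
auto, human = route_cases([
    ("case-001", "eligible", []),
    ("case-002", "defer", ["final incident date and description"]),
])
```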
Load-bearing premise
The benchmark systematically varies information completeness in a manner representative of real-world adjudication bottlenecks, and the performance results on the four leading AI platforms generalize.
What would settle it
Running SPEC on a new collection of unemployment insurance cases with independently verified missing-information patterns, or in live deployment by the department, to check whether the 89% accuracy and appropriate deferral behavior persist.
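A minimal sketch of that check, under the assumption that each case record carries a sufficiency flag, a gold label, and a model prediction (the field names are hypothetical): accuracy and deferral rate are reported separately for sufficient- and insufficient-evidence cases, which is the breakdown an "appropriate deferral" claim has to be tested against.

```python
# Minimal sketch of the replication check: accuracy and deferral rate broken
# down by evidence sufficiency. Field names ('sufficient', 'gold', 'predicted')
# are illustrative assumptions about the data layout.

def breakdown(cases):
    """cases: list of dicts with keys 'sufficient' (bool), 'gold' (str),
    and 'predicted' (str); 'defer' is the gold label for insufficient cases."""
    stats = {}
    for subset, flag in (("sufficient", True), ("insufficient", False)):
        rows = [c for c in cases if c["sufficient"] == flag]
        if not rows:
            continue
        stats[subset] = {
            "n": len(rows),
            "accuracy": sum(c["predicted"] == c["gold"] for c in rows) / len(rows),
            "deferral_rate": sum(c["predicted"] == "defer" for c in rows) / len(rows),
        }
    return stats

# Toy example (not real benchmark data).
print(breakdown([
    {"sufficient": True, "gold": "eligible", "predicted": "eligible"},
    {"sufficient": False, "gold": "defer", "predicted": "ineligible"},
]))
```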
Original abstract
A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking. This challenge is particularly acute in legal applications, where a core task for attorneys, judges, and administrators is to determine whether evidence is sufficient to reach a conclusion. We study this problem in the important setting of unemployment insurance adjudication, which has seen rapid integration of AI systems and where the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually. First, through a collaboration with the Colorado Department of Labor and Employment, we secure rare access to official training materials and guidance to design a novel benchmark that systematically varies in information completeness. Second, we evaluate four leading AI platforms and show that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient. Third, advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases. Fourth, we introduce a structured framework requiring explicit identification of missing information before any determination (SPEC, Structured Prompting for Evidence Checklists). SPEC achieves 89% overall accuracy, while appropriately deferring when evidence is insufficient -- demonstrating that presumptuousness in legal AI is systematic but addressable, and that doing so is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that AI systems exhibit systematic factual presumptuousness in high-stakes legal adjudication, particularly unemployment insurance decisions. Using a novel benchmark constructed from Colorado Department of Labor and Employment training materials that varies information completeness, it shows standard RAG approaches achieve only 15% accuracy on insufficient-evidence cases. Advanced prompting improves inconclusive-case handling but over-corrects on clear cases. The introduced SPEC (Structured Prompting for Evidence Checklists) framework, which mandates explicit missing-information identification before any determination, reaches 89% overall accuracy while appropriately deferring, demonstrating that presumptuousness is addressable and that AI should support rather than supplant human judgment on evidence sufficiency.
Significance. If the benchmark is representative of real adjudication bottlenecks and the performance gains generalize, the work is significant for AI deployment in legal and administrative domains. It supplies concrete empirical evidence of a known limitation (presumptuousness) with a practical, structured prompting intervention that improves deferral behavior without sacrificing accuracy on complete cases. Credit is due for the rare collaboration yielding official training materials, the independently constructed benchmark, and the emphasis on systems that reliably flag evidence insufficiency rather than forcing decisions.
Major comments (2)
- [Benchmark construction section] The claim that the novel benchmark 'systematically varies information completeness' in a manner representative of real-world adjudication requires quantitative support (e.g., frequency distributions of missing-fact categories, inter-expert agreement rates on sufficiency labels, or a mapping to actual docket statistics). Without this, the reported 15% RAG accuracy on insufficient cases and the 89% SPEC gain risk being artifacts of the authors' instantiation of 'insufficient' rather than a general property, directly undermining the central conclusion that presumptuousness is systematic and addressable.
- [Experimental evaluation section] The concrete accuracy figures (15% for RAG, 89% for SPEC) are presented without sample sizes, confidence intervals, statistical significance tests, or precise protocols for labeling sufficiency and measuring deferral. This detail is load-bearing for assessing whether the performance gap and the appropriate-deferral claim hold beyond the specific benchmark instances.
Minor comments (2)
- [Abstract] The abstract refers to 'four leading AI platforms' without naming them; listing the specific models or systems in the abstract or introduction would improve immediate clarity for readers.
- Consider adding a table summarizing the benchmark's information-completeness variations (e.g., number of cases per completeness level and example missing-fact types) to make the evaluation setup more transparent.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We have carefully considered the major comments and provide point-by-point responses below, along with proposed revisions to address the concerns raised.
Point-by-point responses
Referee: Benchmark construction section: the claim that the novel benchmark 'systematically varies information completeness' in a manner representative of real-world adjudication requires quantitative support (e.g., frequency distributions of missing-fact categories, inter-expert agreement rates on sufficiency labels, or mapping to actual docket statistics). Without this, the reported 15% RAG accuracy on insufficient cases and the 89% SPEC gain risk being artifacts of the authors' instantiation of 'insufficient' rather than a general property, directly undermining the central conclusion that presumptuousness is systematic and addressable.
Authors: We appreciate this feedback on strengthening the benchmark validation. The benchmark was designed by starting with complete case scenarios from the official Colorado Department of Labor and Employment training materials and systematically varying the presence of key evidentiary facts as defined in those materials. In the revised manuscript, we will add a table showing the frequency distribution of missing-fact categories (e.g., missing wage information, missing separation details) across the insufficient-evidence cases. We did not compute inter-expert agreement rates because the sufficiency determinations follow explicit criteria from the official guidance, which we treat as ground truth. Regarding mapping to docket statistics, we are unable to provide this due to restrictions on accessing aggregated real-world case data; however, we will include additional discussion on how the training materials are used to train adjudicators and thus reflect real adjudication requirements. These additions will better support the claim that the benchmark captures systematic aspects of presumptuousness. Revision: partial.
Referee: Experimental evaluation section: the concrete accuracy figures (15% for RAG, 89% for SPEC) are presented without sample sizes, confidence intervals, statistical significance tests, or precise protocols for labeling sufficiency and measuring deferral. This detail is load-bearing for assessing whether the performance gap and appropriate-deferral claim hold beyond the specific benchmark instances.
Authors: We agree that these details are essential for reproducibility and assessing the robustness of our results. The benchmark consists of 200 instances, with 100 cases having sufficient evidence and 100 with insufficient evidence. In the revised version, we will explicitly state the sample size, report 95% confidence intervals calculated via bootstrapping, and include results of statistical significance tests (e.g., McNemar's test) comparing the methods. Additionally, we will expand the experimental setup section to provide precise protocols: sufficiency labels are assigned based on whether all required facts per the official checklist are present, and deferral is measured by whether the model outputs that evidence is insufficient rather than making a determination. These revisions will allow readers to better evaluate the generalizability of the findings. Revision: yes. (A generic sketch of these calculations appears after the responses below.)
- Not addressed in the proposed revision: mapping the benchmark instances to actual docket statistics from the Colorado Department of Labor and Employment, as this data is not accessible due to privacy constraints.
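A generic sketch of the statistics the authors commit to in their second response above (95% bootstrap confidence intervals on per-method accuracy and McNemar's test over the 200 paired benchmark instances); this is a textbook recipe, not the authors' analysis code, and the inputs are placeholders.

```python
# Generic recipe for the promised statistics: bootstrap confidence intervals
# for accuracy and an exact McNemar test on paired correctness indicators.
# Not the authors' analysis code; inputs are placeholders.
import math
import random

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """correct: list of 0/1 indicators, one per benchmark instance."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired 0/1 outcomes for two methods on the same cases."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)   # A right, B wrong
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)   # B right, A wrong
    n, k = b + c, min(b, c)
    if n == 0:
        return b, c, 1.0
    # Two-sided exact binomial p-value on the discordant pairs (p = 0.5 under H0).
    p = 2 * sum(math.comb(n, i) for i in range(k + 1)) / (2 ** n)
    return b, c, min(p, 1.0)
```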
Circularity Check
No significant circularity detected
Full rationale
The paper constructs its benchmark from external official Colorado Department of Labor and Employment training materials, evaluates four existing AI platforms on it to report baseline performance (e.g., 15% accuracy on insufficient-information cases), and introduces the SPEC prompting framework whose 89% overall accuracy is measured directly on that same benchmark. No equations, parameter fits, self-citations, or definitional steps are present in the provided text that would reduce the central claims to tautological re-expression of the inputs. The derivation chain consists of independent empirical measurement rather than self-referential construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The official training materials and guidance from the Colorado Department of Labor and Employment accurately reflect real adjudication information-sufficiency requirements.