Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice

Andrew Lou; David Shin

arxiv: 2606.23716 · v1 · pith:2WYXV2TVnew · submitted 2026-06-16 · 💻 cs.CY · cs.AI

Pith reviewed 2026-06-26 22:24 UTC · model grok-4.3

keywords: legal AI benchmarks · pro se access to justice · LLM robustness · legal reasoning evaluation · access to justice · pro se litigants · benchmark upper bound

The pith

Legal AI benchmarks measure performance only on expert-preprocessed inputs rather than the raw prompts typical of people without lawyers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that benchmarks used to evaluate legal AI cannot support claims about improving access to justice because they test models on inputs already cleaned and structured by legal experts. This setup captures only an upper bound of what models can achieve. Real access-to-justice scenarios involve pro se litigants whose inputs contain noisy narratives, buried facts, omissions, incorrect assumptions about the law, and surface errors. These features match conditions shown in machine learning research to cause large language models to degrade through long-context issues, underspecification, hallucination, and sensitivity to small changes. A small experiment applying perturbations to an existing legal benchmark illustrates the resulting performance gap, and the authors conclude that new benchmarks focused on pro se-like inputs are required to make access-to-justice claims testable.

Core claim

Benchmarks that evaluate legal reasoning on expert-preprocessed inputs measure only the upper bound of model performance. Access to justice for pro se litigants depends on the lower bound: performance on inputs containing noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These conditions align with known LLM degradation factors such as long-context sensitivity, underspecification, hallucination, and typographical perturbations, and a small perturbation experiment on a legal benchmark illustrates the resulting gap.

What carries the argument

The upper-bound versus lower-bound distinction in legal AI evaluation, where the upper bound arises from expert-preprocessed inputs and the lower bound from pro se input degradations.

If this is right

If model development continues to rely only on upper-bound benchmarks, the performance gap for actual pro se users may stay hidden or grow larger.
Access-to-justice claims about large language models will lack empirical grounding until benchmarks directly test robustness to pro se-like inputs.
Legal AI systems may fail to deliver benefits for self-represented individuals without explicit focus on handling unprocessed user inputs.
New benchmark designs must incorporate pro se input characteristics to allow claims about improved access to justice to be tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Creating test sets by systematically adding common pro se errors to existing legal cases could provide a practical way to measure the lower bound.
The same upper-bound versus lower-bound distinction could apply to AI tools intended for non-experts in other technical domains such as medical or financial advice.
Prioritizing training data that includes variable-quality user text might close the observed gap more directly than scaling model size alone.
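The first extension above, systematically adding errors to existing cases, might look something like the following sketch. It is not the authors' code: the source only states that typos were injected every 2, 3, or 4 words, so the specific typo operator used here (swapping two adjacent characters) and the names `perturb_typos`, `every_n`, and `seed` are assumptions made for illustration.

```python
import random


def perturb_typos(text: str, every_n: int, seed: int = 0) -> str:
    """Inject a typo into every `every_n`-th word of `text`.

    Assumption: a "typo" is modeled as a swap of two adjacent characters.
    The paper does not specify its typo operator.
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    rng = random.Random(seed)  # seeded so perturbed test sets are reproducible
    words = text.split()       # note: collapses runs of whitespace
    for i in range(every_n - 1, len(words), every_n):
        word = words[i]
        if len(word) < 2:
            continue  # nothing to swap in a one-character word
        j = rng.randrange(len(word) - 1)
        # Swap characters j and j+1 (a no-op if they happen to be identical).
        words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


# Hypothetical example sentence, not taken from LEXam.
question = "The tenant stopped paying rent after the landlord refused to repair the heating"
for n in (4, 3, 2):  # a smaller n means more words are perturbed
    print(n, perturb_typos(question, every_n=n))
```

Fixing the seed keeps the clean and perturbed versions of each question paired, so any accuracy difference can be attributed to the perturbation rather than to a change in the question set.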

Load-bearing premise

That degradations caused by pro se inputs such as noise and omissions produce effects on models comparable to those documented for long-context sensitivity, hallucination, and typographical perturbations in general machine learning research.

What would settle it

A controlled test finding no meaningful performance drop when legal benchmark cases are altered to add noisy narratives, buried facts, omissions, and typographical errors typical of pro se inputs.
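A minimal sketch of how such a controlled test could be scored, assuming multiple-choice questions answered once in clean form and once in perturbed form. The function names and the toy labels are hypothetical and carry no results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answer labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def robustness_gap(clean: Sequence[str], perturbed: Sequence[str],
                   gold: Sequence[str]) -> float:
    """Accuracy on clean inputs minus accuracy on perturbed inputs.

    A value near zero would count against the paper's premise; a large
    positive value would be consistent with it.
    """
    return accuracy(clean, gold) - accuracy(perturbed, gold)


# Invented toy labels for illustration only.
gold = ["A", "C", "B", "D"]
clean_preds = ["A", "C", "B", "A"]      # 3 of 4 correct
perturbed_preds = ["A", "B", "B", "A"]  # 2 of 4 correct
gap = robustness_gap(clean_preds, perturbed_preds, gold)
print(f"accuracy drop: {gap:.2f}")
```

Because the same questions appear in both conditions, a paired comparison of this kind isolates the effect of the perturbation. A real study would also need a sample large enough to separate the gap from noise.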

Figures

Figures reproduced from arXiv: 2606.23716 by Andrew Lou, David Shin.

**Figure 1.** Clean and Typo-Perturbed LEXam Multiple-Choice Question. We test typo distortions at three different frequencies to mimic increasing orders of severity (every 2 words, every 3 words and every 4 words) across three different models.

**Figure 2.** Model Accuracy Across Typo Perturbation Frequencies. For the small sample of questions we tested, we observe a general degradation in the legal reasoning ability of all models. Moreover, …

**Figure 4.** Model Accuracy Under Padding-Based Context Dilution. In the second version of this perturbation, we intersperse the content of the question throughout the whole prompt. We do so by inserting two filler sentences between each question sentence as well as between each answer choice. By not including the question as a contiguous chunk of the prompt, this setting tests the ability of the model to repeatedly ide…

**Figure 5.** Interleaved Context Dilution Design. We used three models to test the effect of interleaving filler sentences.

**Figure 6.** Model Accuracy Under Interleaved Context Dilution. In both flavors of the context dilution perturbation, we observe similar trends to the typo perturbation. Results from the general LLM sensitivity literature seem to translate similarly in affecting legal reasoning of LLMs. By inserting filler sentences of beautiful scenery, we also note that we are purposely estimating a best-case scenario and still obse…
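The interleaved dilution described in the captions above (two filler sentences between each question sentence and between each answer choice) could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the filler sentences, the naive regex sentence splitter, and the choice to also place fillers between the last question sentence and the first answer choice are all hypothetical.

```python
import re
from itertools import cycle
from typing import List, Sequence

# Hypothetical scenery-style fillers; the paper's actual sentences are not given.
FILLERS = [
    "The valley was covered in morning mist.",
    "A river wound slowly between the hills.",
    "Sunlight fell across the quiet meadow.",
]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace.

    Assumption: good enough for a sketch; it will mis-split abbreviations.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def interleave_dilution(question: str, choices: Sequence[str],
                        fillers: Sequence[str] = FILLERS,
                        per_gap: int = 2) -> str:
    """Insert `per_gap` filler sentences between consecutive prompt units."""
    if not fillers:
        raise ValueError("at least one filler sentence is required")
    filler_iter = cycle(fillers)  # reuse fillers in order when they run out
    units = split_sentences(question) + list(choices)
    parts: List[str] = []
    for k, unit in enumerate(units):
        parts.append(unit)
        if k < len(units) - 1:  # no filler after the final unit
            parts.extend(next(filler_iter) for _ in range(per_gap))
    return " ".join(parts)


# Hypothetical question and choices, not taken from LEXam.
prompt = interleave_dilution(
    "A signs a lease with B. B stops paying rent. Can A terminate?",
    ["(a) Yes", "(b) No"],
)
print(prompt)
```

Using neutral filler keeps the legally relevant content unchanged, so any accuracy loss reflects the difficulty of locating that content rather than conflicting information.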

Original abstract

Legal AI benchmark research frequently invokes the assumption that large language models can improve access to justice, including for people who cannot access lawyers in order to understand and exercise their legal rights. We argue that current benchmarks are not equipped to support this assumption because they evaluate legal reasoning over inputs that have already been preprocessed by legal experts, which measures the upper bound of model performance. Access to justice depends on a lower bound: how models perform when inputs come from pro se litigants, whose prompts may contain noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These degradations are comparable to conditions under which LLMs are known to degrade in the general machine learning literature, including long-context sensitivity, underspecification, hallucination, and typographical perturbations. We connect evidence from pro se literature with this body of machine learning research and present a small perturbation experiment on LEXam, a legal benchmark, to illustrate the gap between these two bounds. If model development continues to focus on benchmarks that measure only the upper bound, this gap may remain hidden or even widen. We conclude by calling for legal benchmarks that directly measure robustness under pro se-like inputs so that access-to-justice claims about legal AI can become empirically testable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that legal benchmarks test clean expert inputs while access-to-justice claims need robustness on raw pro se inputs, and the distinction holds up as a useful critique.

The main thing here is that existing legal benchmarks only measure performance on inputs already cleaned up by lawyers, so they cannot support claims that models will help people without lawyers. The paper frames this as an upper-bound versus lower-bound problem and links pro se input problems like noise, omissions, and folk assumptions to known LLM weaknesses such as hallucination and context sensitivity.

What is new is the direct mapping from pro se literature to those ML robustness results, plus the small perturbation experiment on LEXam that shows the gap in practice. The argument is straightforward and avoids overclaiming; it treats the experiment as illustration rather than proof of effect size.

The work is strongest when it sticks to the logical point that access-to-justice statements require testing under pro se conditions. The citation pattern draws from both fields without circularity.

The main limitation is that the experiment remains small and illustrative, so the paper does not deliver quantitative evidence on how large the performance drop actually is across benchmarks. That keeps the contribution more as a call for better evaluation than a completed demonstration.

This is for researchers building or evaluating legal AI tools who care about deployment outside controlled settings. Anyone working on access-to-justice applications or benchmark design will get value from the framing.

It deserves peer review because the central distinction is clear and the assumption it challenges appears in many papers. I would send it to referees.

Referee Report

0 major / 3 minor

Summary. The paper claims that legal AI benchmarks evaluate LLMs on expert-preprocessed inputs (upper bound of performance) but access-to-justice applications require robustness on raw pro se litigant inputs containing noise, buried facts, omissions, folk-legal assumptions, and errors (lower bound). It connects pro se literature to documented LLM failure modes (long-context sensitivity, underspecification, hallucination, typographical perturbations), presents a small illustrative perturbation experiment on LEXam to show the gap, and calls for new benchmarks that directly test pro se-like conditions so that access-to-justice claims become empirically testable.

Significance. If the upper/lower-bound distinction holds, the work is significant for identifying a structural mismatch in how legal benchmarks are constructed relative to real-world pro se use cases. It explicitly links pro se input characteristics to general ML degradation conditions and supplies an illustrative experiment as a concrete starting point rather than a quantitative proof of effect size. This framing could productively redirect benchmark development toward falsifiable robustness tests.

minor comments (3)

[Abstract] The phrase 'a small perturbation experiment on LEXam' is used without defining LEXam or briefly characterizing the perturbation types; a one-sentence gloss would aid readers who encounter the paper before the methods section.
The mapping from pro se input traits to specific LLM failure modes is presented as comparable; adding a short table or enumerated list that pairs each pro se characteristic with the corresponding ML literature citation would improve traceability without altering the argument.
[Conclusion] The conclusion calls for 'legal benchmarks that directly measure robustness under pro se-like inputs'; specifying one or two minimal design requirements (e.g., inclusion of unedited narrative prompts, omission of key facts) would make the recommendation more actionable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our core argument, and the recommendation for minor revision. We are pleased that the distinction between upper-bound performance on expert-preprocessed inputs and the lower-bound robustness needed for pro se inputs is viewed as significant for redirecting benchmark development.

Circularity Check

0 steps flagged

No significant circularity

The paper advances its central claim by citing independent pro se literature on input characteristics and general ML literature on LLM failure modes (long-context sensitivity, hallucination, etc.), then illustrates the gap with a perturbation experiment on the external LEXam benchmark. No equations, parameter fitting, self-definitional mappings, or load-bearing self-citations appear; the upper-bound versus lower-bound distinction is argued from external evidence without reducing to the paper's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that pro se input degradations match known LLM failure modes from general ML literature; this is drawn from cited bodies of work rather than new axioms or parameters introduced in the paper.

pith-pipeline@v0.9.1-grok · 5748 in / 1139 out tokens · 35219 ms · 2026-06-26T22:24:18.408364+00:00 · methodology

