
arxiv: 2606.23716 · v1 · pith:2WYXV2TVnew · submitted 2026-06-16 · 💻 cs.CY · cs.AI

Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice

Pith reviewed 2026-06-26 22:24 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords legal AI benchmarks · pro se access to justice · LLM robustness · legal reasoning evaluation · access to justice · pro se litigants · benchmark upper bound

The pith

Legal AI benchmarks measure performance only on expert-preprocessed inputs rather than the raw prompts typical of people without lawyers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that benchmarks used to evaluate legal AI cannot support claims about improving access to justice because they test models on inputs already cleaned and structured by legal experts. This setup captures only an upper bound of what models can achieve. Real access-to-justice scenarios involve pro se litigants whose inputs contain noisy narratives, buried facts, omissions, incorrect assumptions about the law, and surface errors. These features match conditions shown in machine learning research to cause large language models to degrade through long-context issues, underspecification, hallucination, and sensitivity to small changes. A small experiment applying perturbations to an existing legal benchmark illustrates the resulting performance gap, and the authors conclude that new benchmarks focused on pro se-like inputs are required to make access-to-justice claims testable.

Core claim

Benchmarks that evaluate legal reasoning on inputs preprocessed by legal experts measure only the upper bound of model performance. Access to justice for pro se litigants depends instead on the lower bound: performance on inputs containing noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These conditions align with known LLM degradation factors such as long-context sensitivity, underspecification, hallucination, and typographical perturbations, and a small perturbation experiment on the LEXam legal benchmark illustrates the resulting gap.

What carries the argument

The upper-bound versus lower-bound distinction in legal AI evaluation, where the upper bound arises from expert-preprocessed inputs and the lower bound from pro se input degradations.

If this is right

  • If model development continues to rely only on upper-bound benchmarks, the performance gap for actual pro se users may stay hidden or grow larger.
  • Access-to-justice claims about large language models will lack empirical grounding until benchmarks directly test robustness to pro se-like inputs.
  • Legal AI systems may fail to deliver benefits for self-represented individuals without explicit focus on handling unprocessed user inputs.
  • New benchmark designs must incorporate pro se input characteristics to allow claims about improved access to justice to be tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Creating test sets by systematically adding common pro se errors to existing legal cases could provide a practical way to measure the lower bound.
  • The same upper-bound versus lower-bound distinction could apply to AI tools intended for non-experts in other technical domains such as medical or financial advice.
  • Prioritizing training data that includes variable-quality user text might close the observed gap more directly than scaling model size alone.
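One way to picture the first extension above is a small pipeline that takes clean benchmark cases and applies pro se-style degradations in sequence, keeping each clean/perturbed pair for later comparison. The sketch below is illustrative only: `omit_sentence`, `prepend_assumption`, `build_pairs`, and the sample case text are hypothetical names and data, not anything taken from the paper, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# A perturbation maps a clean prompt to a degraded one.
Perturbation = Callable[[str], str]


@dataclass(frozen=True)
class CasePair:
    clean: str
    perturbed: str


def split_sentences(text: str) -> List[str]:
    # Naive split on whitespace that follows ., ! or ? (an assumption;
    # real legal text with abbreviations would need a proper tokenizer).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def omit_sentence(index: int) -> Perturbation:
    """Simulate an omission by dropping the sentence at `index`."""
    def apply(text: str) -> str:
        sentences = split_sentences(text)
        if not sentences:
            return text
        del sentences[index % len(sentences)]
        return " ".join(sentences)
    return apply


def prepend_assumption(assumption: str) -> Perturbation:
    """Simulate a folk-legal assumption stated before the facts."""
    def apply(text: str) -> str:
        return f"{assumption} {text}"
    return apply


def build_pairs(cases: List[str], perturbations: List[Perturbation]) -> List[CasePair]:
    """Apply every perturbation, in order, to each clean case."""
    pairs = []
    for case in cases:
        perturbed = case
        for perturb in perturbations:
            perturbed = perturb(perturbed)
        pairs.append(CasePair(clean=case, perturbed=perturbed))
    return pairs
```

Keeping the clean text next to the perturbed text means the same gold answer can be scored against both versions, which is what a lower-bound measurement would need.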

Load-bearing premise

That degradations caused by pro se inputs such as noise and omissions produce effects on models comparable to those documented for long-context sensitivity, hallucination, and typographical perturbations in general machine learning research.

What would settle it

A controlled test finding no meaningful performance drop when legal benchmark cases are altered to add noisy narratives, buried facts, omissions, and typographical errors typical of pro se inputs.
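As a rough sketch of how such a controlled test could be scored, the snippet below computes accuracy on the clean and the perturbed version of the same questions and reports the difference. The function names and the toy answer lists are assumptions made for illustration; the paper's own scoring procedure is not described on this page.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def bound_gap(clean_preds: Sequence[str],
              perturbed_preds: Sequence[str],
              gold: Sequence[str]) -> float:
    """Upper-bound accuracy (clean inputs) minus lower-bound accuracy
    (perturbed inputs) on the same questions."""
    return accuracy(clean_preds, gold) - accuracy(perturbed_preds, gold)


# Toy data, invented for this example only.
gold = ["A", "B", "C", "D"]
clean_preds = ["A", "B", "C", "A"]      # 3 of 4 correct
perturbed_preds = ["A", "C", "C", "A"]  # 2 of 4 correct
gap = bound_gap(clean_preds, perturbed_preds, gold)
```

A gap near zero across realistic perturbations would count against the paper's claim; a consistently positive gap would support it.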

Figures

Figures reproduced from arXiv: 2606.23716 by Andrew Lou, David Shin.

Figure 1: Clean and Typo-Perturbed LEXam Multiple-Choice Question. We test typo distortions at three different frequencies to mimic increasing orders of severity (every 2 words, every 3 words and every 4 words) across three different models. The results are as follows.
Figure 2: Model Accuracy Across Typo Perturbation Frequencies. For the small sample of questions we tested, we observe a general degradation in the legal reasoning ability of all models. Moreover, …
Figure 5: Interleaved Context Dilution Design. We used three models to test the effect of interleaving filler sentences. The results are as follows.
Figure 6: Model Accuracy Under Interleaved Context Dilution. In both flavors of the context dilution perturbation, we observe similar trends to the typo perturbation. Results from the general LLM sensitivity literature seem to translate similarly in affecting legal reasoning of LLMs. By inserting filler sentences of beautiful scenery, we also note that we are purposely estimating a best-case scenario and still obse…
Figure 4: Model Accuracy Under Padding-Based Context Dilution. In the second version of this perturbation, we intersperse the content of the question throughout the whole prompt. We do so by inserting two filler sentences between each question sentence as well as between each answer choice. By not including the question as a contiguous chunk of the prompt, this setting tests the ability of the model to repeatedly ide…
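The two perturbations named in the captions above (typos at a fixed word interval, and filler sentences interleaved between question sentences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the captions do not say how each typo is produced, so the adjacent-character swap here is a stand-in, and `add_typos`, `interleave_filler`, and their parameters are hypothetical names.

```python
import random
from typing import List


def add_typos(text: str, every_n: int, seed: int = 0) -> str:
    """Perturb every `every_n`-th word by swapping two adjacent inner characters.

    Assumption: a character swap stands in for a 'typo'. Words shorter than
    four characters are left unchanged, and whitespace is normalized to
    single spaces.
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    rng = random.Random(seed)
    words = text.split()
    for i in range(every_n - 1, len(words), every_n):
        word = words[i]
        if len(word) >= 4:
            # Pick j so that positions j and j+1 exclude the first and last character.
            j = rng.randrange(1, len(word) - 2)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)


def interleave_filler(sentences: List[str], filler: List[str], per_gap: int = 2) -> str:
    """Insert `per_gap` filler sentences between consecutive prompt sentences.

    `sentences` holds the question sentences and answer choices in order;
    `filler` is cycled through if it is shorter than the number of gaps needs.
    """
    if not filler:
        raise ValueError("filler must contain at least one sentence")
    out: List[str] = []
    next_filler = 0
    for i, sentence in enumerate(sentences):
        out.append(sentence)
        if i < len(sentences) - 1:
            for _ in range(per_gap):
                out.append(filler[next_filler % len(filler)])
                next_filler += 1
    return " ".join(out)
```

Smaller values of `every_n` give denser typos, so `every_n=2` corresponds to the heaviest of the three settings mentioned in the Figure 1 caption. With `per_gap=2`, the second function mirrors the "two filler sentences between each question sentence" design described under Figure 4.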
Original abstract

Legal AI benchmark research frequently invokes the assumption that large language models can improve access to justice, including for people who cannot access lawyers in order to understand and exercise their legal rights. We argue that current benchmarks are not equipped to support this assumption because they evaluate legal reasoning over inputs that have already been preprocessed by legal experts, which measures the upper bound of model performance. Access to justice depends on a lower bound: how models perform when inputs come from pro se litigants, whose prompts may contain noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These degradations are comparable to conditions under which LLMs are known to degrade in the general machine learning literature, including long-context sensitivity, underspecification, hallucination, and typographical perturbations. We connect evidence from pro se literature with this body of machine learning research and present a small perturbation experiment on LEXam, a legal benchmark, to illustrate the gap between these two bounds. If model development continues to focus on benchmarks that measure only the upper bound, this gap may remain hidden or even widen. We conclude by calling for legal benchmarks that directly measure robustness under pro se-like inputs so that access-to-justice claims about legal AI can become empirically testable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that legal AI benchmarks evaluate LLMs on expert-preprocessed inputs (upper bound of performance) but access-to-justice applications require robustness on raw pro se litigant inputs containing noise, buried facts, omissions, folk-legal assumptions, and errors (lower bound). It connects pro se literature to documented LLM failure modes (long-context sensitivity, underspecification, hallucination, typographical perturbations), presents a small illustrative perturbation experiment on LEXam to show the gap, and calls for new benchmarks that directly test pro se-like conditions so that access-to-justice claims become empirically testable.

Significance. If the upper/lower-bound distinction holds, the work is significant for identifying a structural mismatch in how legal benchmarks are constructed relative to real-world pro se use cases. It explicitly links pro se input characteristics to general ML degradation conditions and supplies an illustrative experiment as a concrete starting point rather than a quantitative proof of effect size. This framing could productively redirect benchmark development toward falsifiable robustness tests.

minor comments (3)
  1. [Abstract] The phrase 'a small perturbation experiment on LEXam' is used without defining LEXam or briefly characterizing the perturbation types; a one-sentence gloss would aid readers who encounter the paper before the methods section.
  2. The mapping from pro se input traits to specific LLM failure modes is presented as comparable; adding a short table or enumerated list that pairs each pro se characteristic with the corresponding ML literature citation would improve traceability without altering the argument.
  3. [Conclusion] The conclusion calls for 'legal benchmarks that directly measure robustness under pro se-like inputs'; specifying one or two minimal design requirements (e.g., inclusion of unedited narrative prompts, omission of key facts) would make the recommendation more actionable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our core argument, and the recommendation for minor revision. We are pleased that the distinction between upper-bound performance on expert-preprocessed inputs and the lower-bound robustness needed for pro se inputs is viewed as significant for redirecting benchmark development.

Circularity Check

0 steps flagged

No significant circularity


The paper advances its central claim by citing independent pro se literature on input characteristics and general ML literature on LLM failure modes (long-context sensitivity, hallucination, etc.), then illustrates the gap with a perturbation experiment on the external LEXam benchmark. No equations, parameter fitting, self-definitional mappings, or load-bearing self-citations appear; the upper-bound versus lower-bound distinction is argued from external evidence without reducing to the paper's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that pro se input degradations match known LLM failure modes from general ML literature; this is drawn from cited bodies of work rather than new axioms or parameters introduced in the paper.

pith-pipeline@v0.9.1-grok · 5748 in / 1139 out tokens · 35219 ms · 2026-06-26T22:24:18.408364+00:00 · methodology

