pith. machine review for the scientific record.

arxiv: 2603.19042 · v4 · submitted 2026-03-19 · 💻 cs.AI

Recognition: no theorem link

Man and machine: artificial intelligence and judicial decision making

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords: artificial intelligence · judicial decision making · risk assessment · criminal justice · human-AI interaction · pretrial decisions · sentencing bias

The pith

Empirical studies find AI decision aids produce modest or nonexistent changes in judges' pretrial and sentencing outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews research from computer science, economics, law, criminology, and psychology on AI risk assessment tools in criminal justice. It links the predictive performance of these tools, documented biases in human judicial decisions, and the limited evidence on how judges actually respond to algorithmic recommendations. The central finding is that AI aids shift real-world decisions only slightly or not at all. A reader would care because this suggests that widespread adoption of AI in courts may not quickly transform outcomes, while underscoring the need for better data on human-AI collaboration in uncertain legal settings.

Core claim

Using criminal justice risk assessment as the focal case, the review concludes that existing empirical evidence indicates the impact of AI decision-aid tools on pretrial and sentencing decisions is modest or nonexistent. The authors connect three strands of work: the performance and fairness of AI instruments, the strengths and biases of human judges, and the nature of AI-plus-human interactions. They identify gaps in understanding how judges navigate uncertain environments and how individual characteristics shape responses to AI advice.

What carries the argument

Synthetic review that integrates evidence on AI predictive validity, human judicial biases, and observed judge-AI interactions in criminal justice contexts.
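
Of these strands, predictive validity is the most readily quantified. As a concrete illustration of what that term means operationally in this literature, here is a minimal sketch that scores a toy risk model for discrimination (AUC) and calibration; the data are synthetic, scikit-learn is an assumed dependency, and none of the numbers come from the paper.

```python
# Sketch: how predictive validity of a risk instrument is typically
# scored -- discrimination (AUC) plus a decile calibration check.
# All data are synthetic; coefficients and sample sizes are
# illustrative assumptions, not values from the paper or any real tool.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 4))                    # hypothetical defendant features
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # rearrest indicator

model = LogisticRegression().fit(X[: n // 2], y[: n // 2])
scores = model.predict_proba(X[n // 2:])[:, 1]
y_test = y[n // 2:]

# Discrimination: probability a rearrested case outranks a non-rearrested one.
print("AUC:", round(roc_auc_score(y_test, scores), 3))

# Calibration: mean predicted vs. observed risk within score deciles.
edges = np.quantile(scores, np.linspace(0.1, 0.9, 9))
for d in range(10):
    mask = np.digitize(scores, edges) == d
    print(f"decile {d}: predicted {scores[mask].mean():.2f}, "
          f"observed {y_test[mask].mean():.2f}")
```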

If this is right

  • AI tools are unlikely to produce large immediate changes in judicial outcomes under current deployment patterns.
  • Future work should prioritize direct observation of how judges interpret and override AI advice rather than isolated AI-versus-human accuracy comparisons.
  • Interdisciplinary integration can reveal new insights into both algorithmic limitations and human decision processes.
  • Individual judge traits such as experience or risk tolerance may moderate responses to AI recommendations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar modest effects might appear in other high-stakes domains where professionals receive algorithmic advice, such as medical diagnosis or financial lending.
  • Courts could achieve more reliable decisions by combining modest AI inputs with targeted training on common human biases rather than replacing judges outright.
  • Longer-term studies tracking the same judges before and after AI adoption could isolate whether familiarity reduces or increases reliance on the tools.

Load-bearing premise

The collection of studies examined across fields accurately reflects the current state of knowledge without major selection bias or important unpublished gaps.

What would settle it

A randomized controlled trial in actual courts, assigning AI recommendations to some judges and not others, that produced large, statistically significant shifts in pretrial release rates or sentence lengths would falsify the modest-impact claim.
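
For scale, a back-of-the-envelope power calculation for such a trial. This is a minimal sketch assuming statsmodels and wholly hypothetical rates, a 60% baseline release rate and a 5-percentage-point shift from AI advice; neither number comes from the paper.

```python
# Sketch: sample size for the judge-randomized trial described above.
# Baseline and treated release rates are hypothetical assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.60  # assumed release rate without AI recommendations
treated = 0.65   # assumed rate if AI advice shifts decisions by 5 pp
h = proportion_effectsize(treated, baseline)  # Cohen's h, ~0.10 here

n_per_arm = NormalIndPower().solve_power(
    effect_size=h, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"decisions per arm: {n_per_arm:.0f}")  # ~735 decisions per arm
# Randomizing at the judge level (as courts likely would) inflates this
# requirement further via the clustering design effect.
```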

Original abstract

The integration of artificial intelligence (AI) technologies into judicial decision-making, particularly in pretrial, sentencing, and parole contexts, has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI's role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI-plus-human interactions. Across the fields of computer science, economics, law, criminology, and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision-aid tools on pretrial and sentencing decisions is modest or nonexistent, our review also reveals important gaps in the existing literature. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate uncertain decision-making environments, and examine how individual characteristics influence judges' responses to AI advice. We argue that AI-versus-human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers. We advocate greater interdisciplinary integration to foster cross-fertilization in future research.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a synthetic review across computer science, economics, law, criminology, and psychology on AI risk-assessment tools in judicial contexts (pretrial, sentencing, parole). It examines AI predictive validity and fairness, human judicial biases, and human-AI interaction patterns, concluding that existing empirical evidence shows AI decision aids have modest or nonexistent effects on pretrial and sentencing outcomes while identifying literature gaps and advocating greater interdisciplinary work.

Significance. If the synthesis holds, the review usefully connects disparate literatures and flags concrete gaps (e.g., limited study of judge-AI interaction under uncertainty and individual judge characteristics). The interdisciplinary framing and call for AI-versus-human comparisons are constructive, though the modest-impact claim rests entirely on the representativeness of the sampled studies rather than new data or derivations.

major comments (2)
  1. [Abstract / synthetic review] Abstract and review synthesis section: the headline claim that 'existing empirical evidence indicates that the impact of AI decision-aid tools on pretrial and sentencing decisions is modest or nonexistent' is presented without any description of the literature search protocol, inclusion/exclusion criteria, database sources, or bias-correction procedures (e.g., funnel-plot or trim-and-fill analysis). This omission is load-bearing because the conclusion is a direct summary of the reviewed corpus; absent these details it is impossible to evaluate selection or publication bias. (A minimal sketch of such a funnel check follows the minor comments below.)
  2. [Synthetic review / discussion] Discussion of cross-field evidence: the text reports findings from five disciplines but does not indicate how conflicting results (e.g., larger effects in one field versus null in another) were weighted or reconciled. Without an explicit aggregation method, the 'modest or nonexistent' summary cannot be assessed for robustness.
minor comments (2)
  1. [Abstract] The abstract states that 'researchers have made significant progress' in three areas but provides no quantitative indicators (e.g., number of studies per area or effect-size ranges) to support the characterization of progress.
  2. [Throughout] The manuscript would benefit from a summary table listing the key studies cited for AI performance, human bias, and interaction effects, with columns for sample size, outcome measure, and reported effect direction.
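
On the funnel-plot point in major comment 1, the sketch below shows what the requested diagnostic could look like: an Egger-style asymmetry regression on study-level estimates. The effects and standard errors are invented placeholders and statsmodels is an assumed dependency; running this for real would require extracting one effect size per reviewed study.

```python
# Sketch: Egger-style funnel asymmetry test for publication bias.
# Effects (percentage-point shifts in release rates) and standard
# errors are hypothetical placeholders, not extracted study values.
import numpy as np
import statsmodels.api as sm

effects = np.array([0.5, -0.2, 1.1, 0.0, 2.3, -0.4, 0.8, 1.6])
ses = np.array([0.4, 0.5, 0.9, 0.3, 1.2, 0.6, 0.7, 1.0])

# Regress standardized effects on precision; an intercept far from
# zero signals small-study asymmetry in the funnel plot.
z = effects / ses
precision = 1.0 / ses
fit = sm.OLS(z, sm.add_constant(precision)).fit()
print("Egger intercept:", round(fit.params[0], 2),
      "p-value:", round(fit.pvalues[0], 3))
```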

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and outline the revisions we will make to improve transparency and robustness of the synthetic review.

Point-by-point responses
  1. Referee: [Abstract / synthetic review] Abstract and review synthesis section: the headline claim that 'existing empirical evidence indicates that the impact of AI decision-aid tools on pretrial and sentencing decisions is modest or nonexistent' is presented without any description of the literature search protocol, inclusion/exclusion criteria, database sources, or bias-correction procedures (e.g., funnel-plot or trim-and-fill analysis). This omission is load-bearing because the conclusion is a direct summary of the reviewed corpus; absent these details it is impossible to evaluate selection or publication bias.

    Authors: We agree that greater transparency on the literature search process is warranted. Although the review is synthetic rather than a formal systematic review or meta-analysis, we will add a new subsection (likely in the introduction or a dedicated 'Review Approach' section) that describes the primary databases consulted (Google Scholar, Web of Science, SSRN, PubMed, and discipline-specific repositories), the core search terms and Boolean combinations used for each field, and the inclusion criteria (empirical studies on AI risk tools in judicial settings, published 2010–2023, with preference for randomized or quasi-experimental designs). We will also explicitly note that quantitative bias-correction methods such as funnel plots were not applied, as they are designed for meta-analyses of effect sizes rather than narrative syntheses across heterogeneous literatures. These additions will allow readers to better assess the scope and potential biases of the corpus we reviewed. revision: yes

  2. Referee: [Synthetic review / discussion] Discussion of cross-field evidence: the text reports findings from five disciplines but does not indicate how conflicting results (e.g., larger effects in one field versus null in another) were weighted or reconciled. Without an explicit aggregation method, the 'modest or nonexistent' summary cannot be assessed for robustness.

    Authors: We will revise the discussion section to include an explicit description of our synthesis approach. We will add a paragraph explaining that we reconciled cross-field findings by (1) prioritizing studies with stronger causal identification (e.g., field experiments and regression discontinuity designs from economics and criminology), (2) contextualizing results by decision stage (pretrial vs. sentencing) and outcome measure, and (3) highlighting methodological differences that explain apparent conflicts (e.g., lab-based psychology studies vs. observational field data). The 'modest or nonexistent' conclusion is driven primarily by the higher-quality empirical studies in economics and criminology; computer science and psychology contributions are used mainly to illuminate mechanisms rather than to override null findings. We will also add a summary table listing the key studies, their designs, and how each contributed to the overall assessment. This will make the aggregation logic transparent and allow readers to evaluate robustness; the sketch below illustrates the weighting logic in miniature. revision: yes
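
To make the design-weighted reconciliation in response 2 concrete, here is a toy vote-count in which stronger causal designs receive more weight. Every study, finding label, and weight is a hypothetical stand-in for the actual corpus.

```python
# Sketch: aggregate study findings weighted by causal-design strength.
# Studies, designs, findings, and weights are hypothetical placeholders
# illustrating the aggregation logic, not the reviewed literature.
DESIGN_WEIGHT = {
    "field_experiment": 3, "regression_discontinuity": 3,
    "quasi_experiment": 2, "observational": 1, "lab": 1,
}

studies = [  # (study, design, direction of estimated AI-aid effect)
    ("A", "field_experiment", "null"),
    ("B", "regression_discontinuity", "modest"),
    ("C", "observational", "large"),
    ("D", "quasi_experiment", "null"),
    ("E", "lab", "large"),
]

totals = {}
for _, design, finding in studies:
    totals[finding] = totals.get(finding, 0) + DESIGN_WEIGHT[design]

# Stronger designs dominate: "null"/"modest" outweigh "large" here.
print(totals)  # {'null': 5, 'modest': 3, 'large': 2}
```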

Circularity Check

0 steps flagged

No circularity: literature synthesis with no internal derivations or self-referential reductions

Full rationale

This is a synthetic review paper summarizing external empirical findings from computer science, economics, law, criminology, and psychology on AI risk assessment tools and judicial decision-making. No original equations, fitted parameters, predictions, or derivations are presented that could reduce by construction to inputs defined within the paper. The central claim about modest or nonexistent impact is explicitly framed as a summary of 'existing empirical evidence' from prior studies, with no self-citation chain or ansatz serving as load-bearing justification for the result itself. The claims are checked against external benchmarks rather than against the paper's own constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature review that introduces no new free parameters, axioms, or invented entities; it relies entirely on summarizing existing empirical studies from multiple disciplines.

pith-pipeline@v0.9.0 · 5554 in / 1025 out tokens · 43013 ms · 2026-05-15T08:17:05.356956+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors
