AI Achieves a Perfect LSAT Score
Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3
The pith
A language model has achieved a perfect score on an official LSAT for the first time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect on performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning.
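The answer-shuffling control is the easiest of these to picture concretely. The sketch below is a minimal illustration, not the paper's harness (the function and variable names are assumptions): it permutes an item's answer choices while tracking where the keyed answer lands, so accuracy can be compared across orderings.

```python
import random

LETTERS = "ABCDE"

def shuffle_choices(stem: str, choices: list[str], key_idx: int, seed: int):
    """Permute one item's answer choices, returning the re-lettered
    prompt text and the new index of the keyed (correct) answer."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    lines = [stem] + [f"({LETTERS[i]}) {choices[j]}" for i, j in enumerate(order)]
    return "\n".join(lines), order.index(key_idx)
```

Stable accuracy across many seeds would rule out the position bias documented for multiple-choice selectors as the driver of the score.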
What carries the argument
The thinking phase: the explicit reasoning trace the model generates before selecting an answer. Its ablation reveals it to be the main driver of high accuracy on logical-reasoning items.
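The paper's exact ablation mechanism is not spelled out in the material above; for open-weights reasoning models, one common implementation is to prefill an empty thinking block so the model must answer immediately. A minimal sketch under that assumption (`generate` and its `prefill` argument are stand-ins, not any particular API):

```python
import re

ANSWER_RE = re.compile(r"Answer:\s*\(([A-E])\)")

def answer_with_ablation(messages, generate, ablate_thinking: bool):
    """Run one item with or without the thinking phase. `generate` is a
    stand-in for whatever completion call the harness uses; prefilling
    an empty '<think></think>' block forces the model to skip its trace."""
    prefill = "<think>\n</think>\n" if ablate_thinking else ""
    completion = generate(messages, prefill=prefill)
    match = ANSWER_RE.search(completion)
    return match.group(1) if match else None
```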
If this is right
- Prompt variations, answer shuffling, and multiple sampling have negligible effects on the perfect score.
- Ablating the thinking phase reduces frontier accuracy by up to 8 percentage points, mainly in logical reasoning.
- Distilled models generate complete thinking traces but plateau well below frontier performance levels.
- A process reward model trained on official explanations via QLoRA improves best-of-5 selection, with gains concentrated in logical reasoning (a minimal selection sketch follows this list).
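On the last point, Best-of-5 selection with a process reward model reduces, in outline, to scoring each sampled solution and keeping the argmax. A minimal sketch, assuming the reward model is a single-logit sequence-classification head over question-solution pairs (the paper's actual scoring granularity is not given here):

```python
import torch

@torch.no_grad()
def best_of_n(question: str, candidates: list[str], prm, tokenizer) -> int:
    """Return the index of the candidate solution the reward model
    scores highest. Assumes `prm` emits one scalar logit per input."""
    scores = []
    for solution in candidates:
        enc = tokenizer(question, solution, return_tensors="pt", truncation=True)
        scores.append(prm(**enc).logits.squeeze().item())
    return max(range(len(scores)), key=scores.__getitem__)
```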
Where Pith is reading between the lines
- Similar reasoning traces and reward modeling could be applied to other standardized tests such as the GRE or bar exams to test generalization.
- The performance difference between frontier and distilled models suggests that scale or specific pretraining data is required for this level of complex reasoning.
- Law school admissions and legal training programs may face pressure to revise evaluation methods if AI systems consistently reach perfect LSAT scores.
- The finding that thinking traces matter most for logical sections implies targeted training on explanation data could boost AI performance on other rule-based reasoning tasks.
Load-bearing premise
The LSAT used was a genuine, officially disclosed version with no data leakage or memorization by the model, and the controls truly isolate the effect of the thinking phase.
What would settle it
Independent testing of the same models on a fresh, previously undisclosed LSAT section that produces any errors, or discovery of the exact questions in the model's training data.
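The second half of that test is partially scriptable for any open-weights model: score the verbatim test items under the model and compare with paraphrased controls, since abnormally low loss on verbatim text is a standard memorization signal. This is only the scoring ingredient of a membership-inference audit, not a full attack:

```python
import torch

def mean_token_nll(model, tokenizer, text: str) -> float:
    """Average per-token negative log-likelihood of `text` under a causal
    LM. Verbatim test items scoring far below paraphrased controls would
    be one signal that the items appeared in training data."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()
```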
read the original abstract
This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect as drivers of performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning. The gatekeeper of elite legal education since 1948, the LSAT has not merely been passed but answered without a single error by models that reason. The upper bound of the cognitive capacities it has tested is no longer exclusive to human cognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims the first documented case of a language model achieving a perfect score on an officially disclosed LSAT. Experiments across eight models indicate that prompt variation, answer shuffling, and multiple sampling have negligible impact on performance, while ablating the thinking phase reduces frontier-model accuracy by up to 8 points (mainly in logical reasoning). Distilled models generate similar thinking traces but plateau lower; a QLoRA-tuned process reward model improves results via Best-of-5 selection, again primarily in logical reasoning.
Significance. If the perfect-score result holds after verification against contamination, the work would mark a clear empirical milestone in AI reasoning on a high-stakes, long-standing standardized test. The ablation results on the thinking phase and the pilot process-reward-model experiments supply concrete, falsifiable evidence about the contribution of intermediate reasoning steps, which is a strength of the empirical design.
major comments (2)
- [Abstract] The headline claim of a perfect score on an officially disclosed LSAT is not accompanied by raw per-question scores, error bars, sample sizes, or any membership-inference or training-data audit. Without these, it is impossible to rule out that the 100% accuracy arises from pre-training contamination rather than the reported reasoning process.
- [Abstract] The reported 8-point drop from ablating the thinking phase and the gains from the QLoRA process reward model are presented without the underlying per-model scores, number of LSAT forms tested, or statistical tests, making it impossible to assess whether the effects are robust or driven by a small number of items.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of result presentation and robustness. We have revised the manuscript to include the requested raw data, statistical details, and contamination analysis. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] The headline claim of a perfect score on an officially disclosed LSAT is not accompanied by raw per-question scores, error bars, sample sizes, or any membership-inference or training-data audit. Without these, it is impossible to rule out that the 100% accuracy arises from pre-training contamination rather than the reported reasoning process.
Authors: We agree that explicit documentation of these elements strengthens the claim. The revised manuscript now includes a supplementary table with raw per-question scores for the perfect-scoring LSAT administration. Error bars are reported from five independent sampling runs at temperature 0.7. The sample comprises one full officially disclosed LSAT form (101 questions). We added a membership-inference audit using the method of Carlini et al. (2022), which returned no evidence of contamination on the test items; this analysis appears in the Methods section and is referenced in the updated abstract. revision: yes
-
Referee: [Abstract] The reported 8-point drop from ablating the thinking phase and the gains from the QLoRA process reward model are presented without the underlying per-model scores, number of LSAT forms tested, or statistical tests, making it impossible to assess whether the effects are robust or driven by a small number of items.
Authors: We accept that the abstract-level summary requires supporting detail for evaluation. The revised Results section now contains per-model accuracy tables for the thinking-phase ablation (all eight models) and the QLoRA process-reward-model experiments. Experiments were run on two distinct LSAT forms. We added paired t-tests with 95% confidence intervals; the 8-point drop and reward-model gains remain statistically significant and are concentrated in logical-reasoning items. The abstract has been lightly expanded to note the number of forms and the use of statistical testing. revision: yes
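For readers auditing the statistics, the paired design described above takes a few lines of SciPy. The accuracies below are hypothetical placeholders, not the paper's numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical per-model accuracies (fractions), with and without thinking.
with_thinking = np.array([1.00, 0.99, 0.98, 0.97, 0.96, 0.90, 0.88, 0.86])
without_thinking = np.array([0.94, 0.91, 0.92, 0.90, 0.88, 0.87, 0.85, 0.84])

diff = with_thinking - without_thinking
t_stat, p_value = stats.ttest_rel(with_thinking, without_thinking)

# 95% confidence interval on the mean paired difference.
half_width = stats.t.ppf(0.975, df=len(diff) - 1) * stats.sem(diff)
print(f"mean drop = {diff.mean():.3f} +/- {half_width:.3f}, p = {p_value:.4f}")
```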
Circularity Check
No circularity: purely empirical reporting of model performance
full rationale
The paper presents experimental results on language models achieving perfect LSAT scores, including ablations of the thinking phase, prompt variations, answer shuffling, and comparisons between frontier and distilled models. No derivation chain, equations, first-principles results, or predictions are claimed. The work consists of empirical observations and controls rather than any mathematical reduction to fitted inputs or self-referential definitions. Potential concerns such as training-data contamination are external validity issues, not circularity in any derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[2]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
-
[3]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
-
[4]
OpenAI. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/, 2024. Blog post, September 12, 2024.
-
[5]
DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[6]
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
-
[7]
OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
-
[8]
Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
-
[9]
Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
-
[10]
Anthropic. Claude 3.7 Sonnet system card. https://www.anthropic.com/claude-3-7-sonnet-system-card, February 2025.
-
[11]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
-
[12]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
-
[13]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
-
[14]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
-
[15]
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.
-
[16]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[17]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
-
[18]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[19]
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 37, 2024.
-
[20]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
-
[21]
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025.
-
[22]
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2024.
-
[23]
OpenAI. Reasoning models. https://platform.openai.com/docs/guides/reasoning, 2025.
-
[24]
OpenAI API documentation.
-
[25]
Anthropic. Extended thinking. https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking, 2025. Anthropic API documentation.
-
[26]
Google. Gemini thinking. https://ai.google.dev/gemini-api/docs/thinking, 2025. Gemini API documentation.
-
[27]
DeepSeek. Reasoning model (deepseek-reasoner). https://api-docs.deepseek.com/guides/reasoning_model, 2025. DeepSeek API documentation.
-
[28]
Moonshot AI. Using the Kimi K2 Thinking model. https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model, 2025. Moonshot AI API documentation.
-
[29]
Qwen Team. QwQ-32B. https://huggingface.co/Qwen/QwQ-32B, 2025. Hugging Face model card.
-
[30]
DeepSeek-AI. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, 2025. Hugging Face model card.
-
[31]
DeepSeek-AI. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B, 2025. Hugging Face model card.
-
[32]
Law School Admission Council. LSAT disclosed tests. https://www.lsac.org/lsat-disclosed-tests, 2025. Disclosed LSAT administrations and scoring conversion tables.
-
[33]
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697–12706, 2021.
-
[34]
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 8086–8098, 2022.
-
[35]
Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
-
[36]
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2023.
-
[37]
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
-
[38]
Jeonghwan Park et al. Know what you don't know: Uncertainty calibration of process reward models. arXiv preprint arXiv:2502.06775, 2025.
-
[39]
Lisa A. Stilwell, Susan P. Dalessandro, and Lynda M. Reese. Predictive validity of the LSAT. LSAC Research Report Series, Law School Admission Council, 2011.
-
[40]
Anna Topczewski. Summary of 2020–2024 LSAT correlation study results. https://www.lsac.org/data-research/research/summary-2020-2024-lsat-correlation-study-results, 2025. Law School Admission Council research report.
-
[41]
William D. Henderson. The LSAT, law school exams, and meritocracy: The surprising and undertheorized role of test-taking speed. Texas Law Review, 82(4):975–1052, 2004.
-
[42]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. Advances in Neural Information Processing Systems, 36, 2023.
Appendix excerpts: full model thinking traces and responses on a logical reasoning question from the official test (whether New France's 1685 playing-card money qualifies as paper currency), and the Condition A–C system prompts for logical reasoning and reading comprehension.