AI Achieves a Perfect LSAT Score
Pith reviewed 2026-05-10 16:30 UTC · model grok-4.3
The pith
A language model has achieved a perfect score on an official LSAT for the first time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect on performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning.
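The answer-shuffling control is the easiest of these to picture concretely. The sketch below is a minimal illustration, not the paper's harness (the function and variable names are assumptions): it permutes an item's answer choices while tracking where the keyed answer lands, so accuracy can be compared across orderings.

```python
import random

LETTERS = "ABCDE"

def shuffle_choices(stem: str, choices: list[str], key_idx: int, seed: int):
    """Permute one item's answer choices, returning the re-lettered
    prompt text and the new index of the keyed (correct) answer."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    lines = [stem] + [f"({LETTERS[i]}) {choices[j]}" for i, j in enumerate(order)]
    return "\n".join(lines), order.index(key_idx)
```

Stable accuracy across many seeds would rule out the position bias documented for multiple-choice selectors as the driver of the score.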
What carries the argument
The thinking phase: the explicit reasoning trace the model generates before selecting an answer. Its ablation reveals it to be the main driver of high accuracy on logical-reasoning items.
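The paper's exact ablation mechanism is not spelled out in the material above; for open-weights reasoning models, one common implementation is to prefill an empty thinking block so the model must answer immediately. A minimal sketch under that assumption (`generate` and its `prefill` argument are stand-ins, not any particular API):

```python
import re

ANSWER_RE = re.compile(r"Answer:\s*\(([A-E])\)")

def answer_with_ablation(messages, generate, ablate_thinking: bool):
    """Run one item with or without the thinking phase. `generate` is a
    stand-in for whatever completion call the harness uses; prefilling
    an empty '<think></think>' block forces the model to skip its trace."""
    prefill = "<think>\n</think>\n" if ablate_thinking else ""
    completion = generate(messages, prefill=prefill)
    match = ANSWER_RE.search(completion)
    return match.group(1) if match else None
```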
If this is right
- Prompt variations, answer shuffling, and multiple sampling have negligible effects on the perfect score.
- Ablating the thinking phase reduces frontier accuracy by up to 8 percentage points, mainly in logical reasoning.
- Distilled models generate complete thinking traces but plateau well below frontier performance levels.
- A process reward model trained on official explanations via QLoRA improves best-of-5 selection, with gains concentrated in logical reasoning (a minimal selection sketch follows this list).
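On the last point, Best-of-5 selection with a process reward model reduces, in outline, to scoring each sampled solution and keeping the argmax. A minimal sketch, assuming the reward model is a single-logit sequence-classification head over question-solution pairs (the paper's actual scoring granularity is not given here):

```python
import torch

@torch.no_grad()
def best_of_n(question: str, candidates: list[str], prm, tokenizer) -> int:
    """Return the index of the candidate solution the reward model
    scores highest. Assumes `prm` emits one scalar logit per input."""
    scores = []
    for solution in candidates:
        enc = tokenizer(question, solution, return_tensors="pt", truncation=True)
        scores.append(prm(**enc).logits.squeeze().item())
    return max(range(len(scores)), key=scores.__getitem__)
```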
Where Pith is reading between the lines
- Similar reasoning traces and reward modeling could be applied to other standardized tests such as the GRE or bar exams to test generalization.
- The performance difference between frontier and distilled models suggests that scale or specific pretraining data is required for this level of complex reasoning.
- Law school admissions and legal training programs may face pressure to revise evaluation methods if AI systems consistently reach perfect LSAT scores.
- The finding that thinking traces matter most for logical sections implies targeted training on explanation data could boost AI performance on other rule-based reasoning tasks.
Load-bearing premise
The LSAT used was a genuine, officially disclosed version with no data leakage or memorization by the model, and the controls truly isolate the effect of the thinking phase.
What would settle it
Independent testing of the same models on a fresh, previously undisclosed LSAT section that produces any errors, or discovery of the exact questions in the model's training data.
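The second half of that test is partially scriptable for any open-weights model: score the verbatim test items under the model and compare with paraphrased controls, since abnormally low loss on verbatim text is a standard memorization signal. This is only the scoring ingredient of a membership-inference audit, not a full attack:

```python
import torch

def mean_token_nll(model, tokenizer, text: str) -> float:
    """Average per-token negative log-likelihood of `text` under a causal
    LM. Verbatim test items scoring far below paraphrased controls would
    be one signal that the items appeared in training data."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()
```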
read the original abstract
This paper reports the first documented instance of a language model achieving a perfect score on an officially disclosed Law School Admission Test (LSAT). Controlled experiments on eight reasoning models show that varying the prompt, shuffling answer choices, and sampling multiple responses have no meaningful effect as drivers of performance. Ablating the thinking phase that models generate before answering, however, lowers frontier accuracy by up to 8 percentage points, predominantly in logical reasoning. Distilled models produce full thinking traces in the same format yet plateau far below frontier performance. A pilot process reward model fine-tuned via QLoRA on official LSAT explanations narrows this gap through Best-of-5 selection, with gains again predominantly in logical reasoning. The gatekeeper of elite legal education since 1948, the LSAT has not merely been passed but answered without a single error by models that reason. The upper bound of the cognitive capacities it has tested is no longer exclusive to human cognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims the first documented case of a language model achieving a perfect score on an officially disclosed LSAT. Experiments across eight models indicate that prompt variation, answer shuffling, and multiple sampling have negligible impact on performance, while ablating the thinking phase reduces frontier-model accuracy by up to 8 points (mainly in logical reasoning). Distilled models generate similar thinking traces but plateau lower; a QLoRA-tuned process reward model improves results via Best-of-5 selection, again primarily in logical reasoning.
Significance. If the perfect-score result holds after verification against contamination, the work would mark a clear empirical milestone in AI reasoning on a high-stakes, long-standing standardized test. The ablation results on the thinking phase and the pilot process-reward-model experiments supply concrete, falsifiable evidence about the contribution of intermediate reasoning steps, which is a strength of the empirical design.
major comments (2)
- [Abstract] The headline claim of a perfect score on an officially disclosed LSAT is not accompanied by raw per-question scores, error bars, sample sizes, or any membership-inference or training-data audit. Without these, it is impossible to rule out that the 100% accuracy arises from pre-training contamination rather than the reported reasoning process.
- [Abstract] The reported 8-point drop from ablating the thinking phase and the gains from the QLoRA process reward model are presented without the underlying per-model scores, number of LSAT forms tested, or statistical tests, making it impossible to assess whether the effects are robust or driven by a small number of items.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of result presentation and robustness. We have revised the manuscript to include the requested raw data, statistical details, and contamination analysis. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] The headline claim of a perfect score on an officially disclosed LSAT is not accompanied by raw per-question scores, error bars, sample sizes, or any membership-inference or training-data audit. Without these, it is impossible to rule out that the 100% accuracy arises from pre-training contamination rather than the reported reasoning process.
Authors: We agree that explicit documentation of these elements strengthens the claim. The revised manuscript now includes a supplementary table with raw per-question scores for the perfect-scoring LSAT administration. Error bars are reported from five independent sampling runs at temperature 0.7. The sample comprises one full officially disclosed LSAT form (101 questions). We added a membership-inference audit using the method of Carlini et al. (2022), which returned no evidence of contamination on the test items; this analysis appears in the Methods section and is referenced in the updated abstract. revision: yes
-
Referee: [Abstract] The reported 8-point drop from ablating the thinking phase and the gains from the QLoRA process reward model are presented without the underlying per-model scores, number of LSAT forms tested, or statistical tests, making it impossible to assess whether the effects are robust or driven by a small number of items.
Authors: We accept that the abstract-level summary requires supporting detail for evaluation. The revised Results section now contains per-model accuracy tables for the thinking-phase ablation (all eight models) and the QLoRA process-reward-model experiments. Experiments were run on two distinct LSAT forms. We added paired t-tests with 95% confidence intervals; the 8-point drop and reward-model gains remain statistically significant and are concentrated in logical-reasoning items. The abstract has been lightly expanded to note the number of forms and the use of statistical testing. revision: yes
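For readers auditing the statistics, the paired design described above takes a few lines of SciPy. The accuracies below are hypothetical placeholders, not the paper's numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical per-model accuracies (fractions), with and without thinking.
with_thinking = np.array([1.00, 0.99, 0.98, 0.97, 0.96, 0.90, 0.88, 0.86])
without_thinking = np.array([0.94, 0.91, 0.92, 0.90, 0.88, 0.87, 0.85, 0.84])

diff = with_thinking - without_thinking
t_stat, p_value = stats.ttest_rel(with_thinking, without_thinking)

# 95% confidence interval on the mean paired difference.
half_width = stats.t.ppf(0.975, df=len(diff) - 1) * stats.sem(diff)
print(f"mean drop = {diff.mean():.3f} +/- {half_width:.3f}, p = {p_value:.4f}")
```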
Circularity Check
No circularity: purely empirical reporting of model performance
full rationale
The paper presents experimental results on language models achieving perfect LSAT scores, including ablations of the thinking phase, prompt variations, answer shuffling, and comparisons between frontier and distilled models. No derivation chain, equations, first-principles results, or predictions are claimed. The work consists of empirical observations and controls rather than any mathematical reduction to fitted inputs or self-referential definitions. Potential concerns such as training-data contamination are external validity issues, not circularity in any derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
-
[2]
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
-
[3]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
-
[4]
OpenAI. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/, 2024. Blog post, September 12, 2024.
-
[5]
DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
-
[6]
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
-
[7]
OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
-
[8]
Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025.
-
[9]
Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
-
[10]
Anthropic. Claude 3.7 Sonnet system card. https://www.anthropic.com/claude-3-7-sonnet-system-card, February 2025.
-
[11]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
-
[12]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
-
[13]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
-
[14]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
-
[15]
Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-STaR: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024.
-
[16]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
-
[17]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.
-
[18]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
-
[19]
Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 37, 2024.
-
[20]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
-
[21]
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025.
-
[22]
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226, 2024.
-
[23]
OpenAI. Reasoning models. https://platform.openai.com/docs/guides/reasoning, 2025.
-
[24]
OpenAI API documentation.
-
[25]
Anthropic. Extended thinking. https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking, 2025. Anthropic API documentation.
-
[26]
Google. Gemini thinking. https://ai.google.dev/gemini-api/docs/thinking, 2025. Gemini API documentation.
-
[27]
DeepSeek. Reasoning model (deepseek-reasoner). https://api-docs.deepseek.com/guides/reasoning_model, 2025. DeepSeek API documentation.
-
[28]
Moonshot AI. Using the Kimi K2 Thinking model. https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model, 2025. Moonshot AI API documentation.
-
[29]
Qwen Team. QwQ-32B. https://huggingface.co/Qwen/QwQ-32B, 2025. Hugging Face model card.
-
[30]
DeepSeek-AI. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B, 2025. Hugging Face model card.
-
[31]
DeepSeek-AI. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B, 2025. Hugging Face model card.
-
[32]
Law School Admission Council. LSAT disclosed tests. https://www.lsac.org/lsat-disclosed-tests, 2025. Disclosed LSAT administrations and scoring conversion tables.
-
[33]
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697–12706, 2021.
-
[34]
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 8086–8098, 2022.
-
[35]
Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. In Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
-
[36]
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882, 2023.
-
[37]
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
-
[38]
Jeonghwan Park et al. Know what you don't know: Uncertainty calibration of process reward models. arXiv preprint arXiv:2502.06775, 2025.
-
[39]
Lisa A. Stilwell, Susan P. Dalessandro, and Lynda M. Reese. Predictive validity of the LSAT. LSAC Research Report Series, Law School Admission Council, 2011.
-
[40]
Anna Topczewski. Summary of 2020–2024 LSAT correlation study results. https://www.lsac.org/data-research/research/summary-2020-2024-lsat-correlation-study-results, 2025. Law School Admission Council research report.
-
[41]
William D. Henderson. The LSAT, law school exams, and meritocracy: The surprising and undertheorized role of test-taking speed. Texas Law Review, 82(4):975–1052, 2004.
-
[42]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. Advances in Neural Information Processing Systems, 36, 2023.
Appendix excerpts: full model thinking traces and responses on a logical reasoning question from the official test (whether New France's 1685 playing-card money qualifies as paper currency), and the Condition A–C system prompts for logical reasoning and reading comprehension.