pith. machine review for the scientific record.

arxiv: 2605.07268 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords LogiHard · combinatorial hardening · LLM reasoning · multiple-choice benchmarks · compositional failures · logical judgment · Item Response Theory · completeness verification

The pith

Converting multiple-choice questions into 2-order logical judgments exposes accuracy drops of 31-56% in frontier LLMs, tracing the failures to a combinatorial reasoning gap rather than missing knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogiHard as a deterministic framework that converts 0-order multiple-choice selection tasks into 2-order logical judgment tasks requiring additional reasoning steps and verification. Evaluations on the resulting LogiHard-2k dataset show consistent accuracy declines of 31 to 56 percent across twelve state-of-the-art models, accompanied by multi-select failure and early exit bias that human test-takers do not display. These drops transfer to MMLU with a 47 percent degradation while preserving logical validity, pointing to a domain-agnostic combinatorial reasoning gap rather than knowledge deficits. The authors tie the pattern to a training-induced completeness-verification deficit that current models inherit.
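The review does not reproduce the paper's transformation rules, but the general move can be sketched: instead of selecting one option, the model must return a truth judgment for every option and then verify that the judged set is complete. The function below is a hypothetical illustration of that 0-order-to-2-order rewrite, not the LogiHard procedure itself.

```python
# Hypothetical sketch of "combinatorial hardening": instead of asking a model
# to pick one option (0-order selection), require a truth judgment for every
# option plus a final completeness check (the 2-order judgment framing).
# This illustrates the general idea only; the paper's actual rules may differ.

def harden(question: str, options: dict[str, str]) -> str:
    """Rewrite a multiple-choice item as a per-option judgment task."""
    lines = [question, ""]
    for label, text in sorted(options.items()):
        lines.append(f"Statement {label}: {text}")
        lines.append(f"Judge Statement {label} as TRUE or FALSE and justify.")
    lines.append("Finally, list ALL labels you judged TRUE (there may be "
                 "one or several) and confirm none were skipped.")
    return "\n".join(lines)

hardened = harden(
    "Which numbers are prime?",
    {"A": "2 is prime", "B": "4 is prime", "C": "7 is prime"},
)
print(hardened)
```

Under this framing a model can no longer exit after finding one plausible answer, which is exactly the early-exit behavior the evaluation probes.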

Core claim

By deterministically transforming 0-order selection into 2-order judgment through combinatorial hardening and ranking items via 9-dimensional analysis of model traces with Item Response Theory for adaptive control, the work establishes that frontier LLMs exhibit large, consistent performance degradation on the hardened items. This degeneration arises specifically from a combinatorial reasoning gap and completeness-verification deficit rather than missing knowledge, as confirmed by the absence of comparable failures in humans and by zero-shot validity-preserving transfer to other benchmarks.

What carries the argument

The LogiHard framework, which performs deterministic combinatorial transformation of 0-order selection into 2-order logical judgment while integrating Item Response Theory for precise difficulty control.
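The paper's fitted item parameters are not given in this review, so the following is a generic sketch of the 2-parameter-logistic model that underlies IRT-based computerized adaptive testing: each item carries a discrimination `a` and difficulty `b`, and the adaptive loop administers the item with maximal Fisher information at the current ability estimate. All numbers are illustrative.

```python
import math

# 2PL item response model: probability an examinee of ability theta
# answers an item with discrimination a and difficulty b correctly.
def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Fisher information of an item at theta; computerized adaptive testing
# (CAT) picks the most informative next item, which is how difficulty
# can be controlled with fewer questions than a static benchmark.
def fisher_information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Illustrative item bank of (a, b) pairs and a current ability estimate.
items = [(1.2, -0.5), (0.9, 0.0), (1.5, 1.0)]
theta_hat = 0.8
best = max(items, key=lambda ab: fisher_information(theta_hat, *ab))
# best -> (1.5, 1.0): the hard, discriminating item is most informative here.
```
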

If this is right

  • LLMs exhibit multi-select failure and early exit bias that human test-takers avoid on the same items.
  • Zero-shot transfer produces 47 percent accuracy degradation on MMLU while preserving validity.
  • The aggregate degeneration remains consistent and domain-agnostic across tested benchmarks.
  • Performance collapse traces to a combinatorial reasoning gap and completeness-verification deficit induced by training rather than knowledge shortfalls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to additional high-stakes exam domains to map the scope of the reasoning gap without new data collection.
  • Model training that emphasizes explicit verification steps might reduce the observed early exit and multi-select patterns.
  • Static benchmarks risk underestimating limitations if they remain at 0-order selection without such hardening.

Load-bearing premise

The transformation from 0-order selection to 2-order judgment preserves logical validity without introducing artifacts or new knowledge demands, and the accuracy drops reflect a specific reasoning gap rather than surface changes or evaluation biases.

What would settle it

A direct comparison in which the same models maintain original accuracy levels on the combinatorially hardened 2-order versions or in which degradation correlates with domain-specific knowledge gaps instead of reasoning structure.

read the original abstract

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LogiHard, a deterministic framework that transforms 0-order multiple-choice selection questions into 2-order logical judgment tasks to increase reasoning overhead and steps. It combines this with Item Response Theory (IRT) and computerized adaptive testing (CAT) for difficulty control, constructs the LogiHard-2k dataset by ranking high-stakes exam items via 9-dimensional analysis of model thinking traces, and evaluates twelve frontier LLMs. The paper reports accuracy degradations of 31-56% on the hardened items, identifies LLM-specific failures (multi-select and early-exit bias) absent in humans, and shows a 47% drop (89.84% to 42.86%) under zero-shot MMLU transfer, attributing the consistent degeneration to a combinatorial reasoning gap and training-induced completeness-verification deficit rather than knowledge shortfalls.

Significance. If the transformations preserve logical validity and isolate increased reasoning demands without introducing format or length artifacts, the results would provide evidence of a fundamental compositional limitation in current LLMs that is distinct from knowledge deficits and not mitigated by scale. The integration of IRT/CAT for efficient, controlled evaluation and the 9-dimensional trace analysis for item selection represent methodological strengths over purely ad-hoc hardening approaches. The cross-domain MMLU transfer adds weight to the domain-agnostic claim.

major comments (3)
  1. [Abstract] Abstract: The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than surface artifacts or evaluation biases.
  2. [Abstract] Abstract (MMLU transfer paragraph): The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.
  3. [Abstract] Abstract (evaluation paragraph): The accuracy degradations (31% to 56%) and identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.
minor comments (2)
  1. [Abstract] The terms '0-order selection' and '2-order logical judgment' are used without a concise formal definition or example in the abstract; a short illustrative example would improve accessibility.
  2. [Abstract] No reference is made to prior IRT applications in LLM evaluation or to existing work on logical nesting in reasoning benchmarks; adding 2-3 targeted citations would strengthen the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment point-by-point below, outlining specific revisions to the manuscript that will incorporate the suggested improvements while preserving the core contributions of LogiHard.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than surface artifacts or evaluation biases.

    Authors: We agree that the abstract's brevity leaves the validity preservation claim underspecified. Section 3 of the full manuscript defines the deterministic transformation rules (converting 0-order selection to 2-order judgment via logical equivalence, where the model must judge whether a candidate satisfies the original condition), but we will revise the abstract to include a concise description of these rules and add an appendix with formal equivalence proofs, example transformations, and verification procedures. To address controls, we will incorporate new ablations in the revision that match prompt length, token count, and format across conditions, demonstrating that the observed degradations persist under these controls and are not attributable to surface artifacts. revision: yes

  2. Referee: [Abstract] Abstract (MMLU transfer paragraph): The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.

    Authors: The MMLU transfer used the identical LogiHard transformation rules and IRT-based item selection on a subset of 100 MMLU items. We will revise the abstract and methods section to explicitly confirm this, report per-item sample details, and include error bars (e.g., standard deviation across items and bootstrap confidence intervals). These additions will clarify that the 47% drop isolates the combinatorial reasoning factor rather than format biases. revision: yes

  3. Referee: [Abstract] Abstract (evaluation paragraph): The accuracy degradations (31% to 56%) and identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.

    Authors: We will add statistical controls (paired significance tests across models and items) to the evaluation section. The manuscript already notes that these biases are absent in human testees based on pilot observations; we will expand this with quantitative human performance metrics on the hardened items where available. For isolating the combinatorial element, we will include ablations comparing against length- and format-matched controls in the revision. These changes will strengthen the attribution to the completeness-verification deficit. revision: partial
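The statistical additions the rebuttal promises (error bars and paired significance tests) are standard machinery. A minimal stdlib sketch of both, run on synthetic 0/1 correctness data rather than the paper's results:

```python
import random
from math import comb

# Percentile bootstrap confidence interval for mean accuracy
# (the promised "error bars" on per-item scores).
def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Exact two-sided sign test on discordant item pairs (a McNemar-style
# "paired significance test"): items solved only in the original form
# versus items solved only in the hardened form.
def sign_test_p(n_orig_only: int, n_hard_only: int) -> float:
    n = n_orig_only + n_hard_only
    k = min(n_orig_only, n_hard_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

scores = [1] * 43 + [0] * 57        # synthetic: 43/100 items correct
low, high = bootstrap_ci(scores)    # interval around 0.43
p = sign_test_p(40, 5)              # strongly asymmetric discordant pairs
```

A heavily one-sided discordant count (many items solved only before hardening, few only after) yields a tiny p-value, which is the shape of evidence that would support the claimed degradation.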

Circularity Check

0 steps flagged

No significant circularity; core claims rest on direct empirical measurements

full rationale

The paper's central results consist of measured accuracy degradations (31-56% on LogiHard-2k items and 47% on MMLU zero-shot transfer) obtained by applying the described combinatorial transformation to selected questions and evaluating frontier models. IRT/CAT and 9-dimensional trace analysis serve only for item selection and difficulty ranking; they do not define or derive the reported failure modes (multi-select failure, early-exit bias, completeness-verification deficit) by construction. No equations reduce the performance gap to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and the validity-preservation assertion is presented as an independent property of the deterministic LogiHard mapping rather than a redefinition of the observed drops. The derivation therefore remains self-contained against external model evaluations and human baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that the combinatorial hardening preserves validity while increasing genuine reasoning demand; IRT parameters and the 9-dimensional ranking criteria are introduced without external validation in the abstract.

free parameters (1)
  • 9-dimensional analysis criteria
    Used to cognitively rank high-stakes examination questions via model thinking traces before transformation
axioms (1)
  • domain assumption · Combinatorial transformation preserves logical validity of original questions
    Invoked to support the claim of 'provable validity preservation' and domain-agnostic applicability
invented entities (1)
  • 2-order logical judgment · no independent evidence
    purpose: To increase thinking overhead and reasoning steps beyond 0-order selection
    New construct introduced by the framework to explain the observed failures

pith-pipeline@v0.9.0 · 5569 in / 1670 out tokens · 73799 ms · 2026-05-11T02:30:47.072674+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1]

    Claude opus 4.6 system card, 2026

    Anthropic. Claude opus 4.6 system card, 2026

  2. [2]

    A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026

    Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026

  3. [3]

    Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025

  4. [4]

    Benchmarking large language models under data contamination: A survey from static to dynamic evaluation

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Con...

  5. [5]

    Dynamic benchmarking of reasoning capabilities in code large language models under data contamination

    Simin Chen, Pranav Pusarla, and Baishakhi Ray. Dynamic benchmarking of reasoning capabilities in code large language models under data contamination. In Proceedings of the 42nd International Conference on Machine Learning (ICML). PMLR, 2025

  6. [6]

    Gemini 3.1 pro model card, February 2026

    Google DeepMind. Gemini 3.1 pro model card, February 2026

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  8. [8]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  9. [9]

    Glm-5: from vibe coding to agentic engineering, 2026

    GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Luce...

  10. [10]

    Changing answer order can decrease mmlu accuracy.arXiv preprint arXiv:2406.19470, 2024

    Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, and Megan Ung. Changing answer order can decrease mmlu accuracy.arXiv preprint arXiv:2406.19470, 2024

  11. [11]

    Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

  12. [12]

    Fluid language model benchmarking

    Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, and Noah A. Smith. Fluid language model benchmarking. In Second Conference on Language Modeling, 2025

  13. [13]

    Big-bench extra hard

    Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Yuanzhu Peter Chen, et al. Big-bench extra hard. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26473–26501, 2025

  14. [14]

    Same meaning, different scores: Lexical and syntactic sensitivity in llm evaluation.arXiv preprint arXiv:2602.17316, 2026

    Bogdan Kosti´c, Conor Fallon, Julian Risch, and Alexander Löser. Same meaning, different scores: Lexical and syntactic sensitivity in llm evaluation.arXiv preprint arXiv:2602.17316, 2026

  15. [15]

    Comparative evaluation of openai o1 and human performance in higher order cognition.Scientific Reports, 2025

    Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, and Xiaoming Zhai. Comparative evaluation of openai o1 and human performance in higher order cognition.Scientific Reports, 2025

  16. [16]

    From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024

  17. [17]

    Okbench: Democratizing llm evaluation with fully automated, on-demand, open knowledge benchmarking, 2025

    Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. Okbench: Democratizing llm evaluation with fully automated, on-demand, open knowledge benchmarking, 2025

  18. [18]

    Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding

    Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023

  19. [19]

    Logicot: Logical chain-of-thought instruction tuning

    Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. Logicot: Logical chain-of-thought instruction tuning. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2908–2921, 2023

  20. [20]

    Lord.Applications of Item Response Theory to Practical Testing Problems

    F.M. Lord.Applications of Item Response Theory to Practical Testing Problems. L. Erlbaum Associates, 1980

  21. [21]

    Do llms know when to not answer? investigating abstention abilities of large language models

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to not answer? investigating abstention abilities of large language models. InProceedings of the 31st International Conference on Computational Linguistics, pages 9329–9345, 2025

  22. [22]

    Frontier llms still struggle with simple reasoning tasks, 2025

    Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, and Csaba Szepesvári. Frontier llms still struggle with simple reasoning tasks, 2025

  23. [23]

    Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024

  24. [24]

    Close or cloze? assessing the robustness of large language models to adversarial perturbations via word recovery

    Luke Moffett and Bhuwan Dhingra. Close or cloze? assessing the robustness of large language models to adversarial perturbations via word recovery. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 6999–701...

  25. [25]

    s1: Simple test-time scaling, 2025

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025

  26. [26]

    NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav M...

  27. [27]

    Contamination detection for vlms using multi-modal semantic perturbation.International Conference on Learning Representations, 2026

    Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, and Yong Jae Lee. Contamination detection for vlms using multi-modal semantic perturbation.International Conference on Learning Representations, 2026

  28. [28]

    Large language models sensitivity to the order of options in multiple-choice questions

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, 2024

  29. [29]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  30. [30]

    None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks

    Eva Sánchez Salido, Julio Gonzalo, and Guillermo Marco. None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks. arXiv preprint arXiv:2502.12896, 2025

  31. [31]

    The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity

    Parshin Shojaee*, Iman Mirzadeh*, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. InNeurIPS, 2025

  32. [32]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  33. [33]

    The emperor’s new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination.arXiv preprint arXiv:2503.16402, 2025

    Yifan Sun, Han Wang, Dongbai Li, Gang Wang, and Huan Zhang. The emperor’s new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination.arXiv preprint arXiv:2503.16402, 2025

  34. [34]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261, 2022

  35. [35]

    None of the above, less of the right parallel patterns in human and llm performance on multi-choice questions answering

    Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less of the right parallel patterns in human and llm performance on multi-choice questions answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20112–20134, 2025

  36. [36]

    Kimi k2.5: Visual agentic intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  37. [37]

    Universal adversarial triggers for attacking and analyzing nlp, 2021

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp, 2021

  38. [38]

    Livebench: A challenging, contamination-free LLM benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. InThe...

  39. [39]

    AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge

    Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, and William Yang Wang. AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the ...

  40. [40]

    On memorization of large language models in logical reasoning

    Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedin...

  41. [41]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Adversarial distractor generation for mcqa: Leveraging in-context learning and rule-based approaches. Natural Language Processing Journal, 13:100186, 2025

    Gulsum Yigit and Mehmet Fatih Amasyali. Adversarial distractor generation for mcqa: Leveraging in-context learning and rule-based approaches. Natural Language Processing Journal, 13:100186, 2025

  43. [43]

    Mmlu-cf: A contamination-free multi-task language understanding benchmark

    Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei. Mmlu-cf: A contamination-free multi-task language understanding benchmark, 2024

    If Old Yan wins→domestic project paused (Q→S) **Evaluating each statement:** **Statement I:** "The company’s overseas project might not be damaged, and domestic product development project won’t be paused." From premises: P V Q. If P, then R. If Q, then S. So from P V Q, we get R V S (overseas damaged OR domestic paused). This means it’s NOT possible that...