pith. machine review for the scientific record.

arxiv: 2605.07268 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords LogiHard · combinatorial hardening · LLM reasoning · multiple-choice benchmarks · compositional failures · logical judgment · Item Response Theory · completeness verification

The pith

Converting multiple-choice questions into 2-order logical judgments exposes accuracy drops of 31-56% in frontier LLMs, tracing the failures to a combinatorial reasoning gap rather than missing knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogiHard as a deterministic framework that converts 0-order multiple-choice selection tasks into 2-order logical judgment tasks requiring additional reasoning steps and verification. Evaluations on the resulting LogiHard-2k dataset show consistent accuracy declines of 31 to 56 percent across twelve state-of-the-art models, accompanied by multi-select failure and early exit bias that human test-takers do not display. These drops transfer to MMLU with a 47 percent degradation while preserving logical validity, pointing to a domain-agnostic combinatorial reasoning gap rather than knowledge deficits. The authors tie the pattern to a training-induced completeness-verification deficit that current models inherit.
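The review does not reproduce the paper's transformation rules, but the general move can be sketched: instead of selecting one option, the model must return a truth judgment for every option and then verify that the judged set is complete. The function below is a hypothetical illustration of that 0-order-to-2-order rewrite, not the LogiHard procedure itself.

```python
# Hypothetical sketch of "combinatorial hardening": instead of asking a model
# to pick one option (0-order selection), require a truth judgment for every
# option plus a final completeness check (the 2-order judgment framing).
# This illustrates the general idea only; the paper's actual rules may differ.

def harden(question: str, options: dict[str, str]) -> str:
    """Rewrite a multiple-choice item as a per-option judgment task."""
    lines = [question, ""]
    for label, text in sorted(options.items()):
        lines.append(f"Statement {label}: {text}")
        lines.append(f"Judge Statement {label} as TRUE or FALSE and justify.")
    lines.append("Finally, list ALL labels you judged TRUE (there may be "
                 "one or several) and confirm none were skipped.")
    return "\n".join(lines)

hardened = harden(
    "Which numbers are prime?",
    {"A": "2 is prime", "B": "4 is prime", "C": "7 is prime"},
)
print(hardened)
```

Under this framing a model can no longer exit after finding one plausible answer, which is exactly the early-exit behavior the evaluation probes.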

Core claim

By deterministically transforming 0-order selection into 2-order judgment through combinatorial hardening and ranking items via 9-dimensional analysis of model traces with Item Response Theory for adaptive control, the work establishes that frontier LLMs exhibit large, consistent performance degradation on the hardened items. This degeneration arises specifically from a combinatorial reasoning gap and completeness-verification deficit rather than missing knowledge, as confirmed by the absence of comparable failures in humans and by zero-shot validity-preserving transfer to other benchmarks.

What carries the argument

The LogiHard framework, which performs deterministic combinatorial transformation of 0-order selection into 2-order logical judgment while integrating Item Response Theory for precise difficulty control.
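The paper's fitted item parameters are not given in this review, so the following is a generic sketch of the 2-parameter-logistic model that underlies IRT-based computerized adaptive testing: each item carries a discrimination `a` and difficulty `b`, and the adaptive loop administers the item with maximal Fisher information at the current ability estimate. All numbers are illustrative.

```python
import math

# 2PL item response model: probability an examinee of ability theta
# answers an item with discrimination a and difficulty b correctly.
def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Fisher information of an item at theta; computerized adaptive testing
# (CAT) picks the most informative next item, which is how difficulty
# can be controlled with fewer questions than a static benchmark.
def fisher_information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Illustrative item bank of (a, b) pairs and a current ability estimate.
items = [(1.2, -0.5), (0.9, 0.0), (1.5, 1.0)]
theta_hat = 0.8
best = max(items, key=lambda ab: fisher_information(theta_hat, *ab))
# best -> (1.5, 1.0): the hard, discriminating item is most informative here.
```
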

If this is right

  • LLMs exhibit multi-select failure and early exit bias that human test-takers avoid on the same items.
  • Zero-shot transfer produces 47 percent accuracy degradation on MMLU while preserving validity.
  • The aggregate degeneration remains consistent and domain-agnostic across tested benchmarks.
  • Performance collapse traces to a combinatorial reasoning gap and completeness-verification deficit induced by training rather than knowledge shortfalls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to additional high-stakes exam domains to map the scope of the reasoning gap without new data collection.
  • Model training that emphasizes explicit verification steps might reduce the observed early exit and multi-select patterns.
  • Static benchmarks risk underestimating limitations if they remain at 0-order selection without such hardening.

Load-bearing premise

The transformation from 0-order selection to 2-order judgment preserves logical validity without introducing artifacts or new knowledge demands, and the accuracy drops reflect a specific reasoning gap rather than surface changes or evaluation biases.

What would settle it

A direct comparison in which the same models maintain original accuracy levels on the combinatorially hardened 2-order versions or in which degradation correlates with domain-specific knowledge gaps instead of reasoning structure.

read the original abstract

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LogiHard, a deterministic framework that transforms 0-order multiple-choice selection questions into 2-order logical judgment tasks to increase reasoning overhead and steps. It combines this with Item Response Theory (IRT) and computerized adaptive testing (CAT) for difficulty control, constructs the LogiHard-2k dataset by ranking high-stakes exam items via 9-dimensional analysis of model thinking traces, and evaluates twelve frontier LLMs. The paper reports accuracy degradations of 31-56% on the hardened items, identifies LLM-specific failures (multi-select and early-exit bias) absent in humans, and shows a 47% drop (89.84% to 42.86%) under zero-shot MMLU transfer, attributing the consistent degeneration to a combinatorial reasoning gap and training-induced completeness-verification deficit rather than knowledge shortfalls.

Significance. If the transformations preserve logical validity and isolate increased reasoning demands without introducing format or length artifacts, the results would provide evidence of a fundamental compositional limitation in current LLMs that is distinct from knowledge deficits and not mitigated by scale. The integration of IRT/CAT for efficient, controlled evaluation and the 9-dimensional trace analysis for item selection represent methodological strengths over purely ad-hoc hardening approaches. The cross-domain MMLU transfer adds weight to the domain-agnostic claim.

major comments (3)
  1. [Abstract] Abstract: The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than surface artifacts or evaluation biases.
  2. [Abstract] Abstract (MMLU transfer paragraph): The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.
  3. [Abstract] Abstract (evaluation paragraph): The accuracy degradations (31% to 56%) and identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.
minor comments (2)
  1. [Abstract] The terms '0-order selection' and '2-order logical judgment' are used without a concise formal definition or example in the abstract; a short illustrative example would improve accessibility.
  2. [Abstract] No reference is made to prior IRT applications in LLM evaluation or to existing work on logical nesting in reasoning benchmarks; adding 2-3 targeted citations would strengthen the positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment point-by-point below, outlining specific revisions to the manuscript that will incorporate the suggested improvements while preserving the core contributions of LogiHard.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'provable validity preservation' for the 0-to-2-order combinatorial transformation is unsupported by any explicit transformation rules, equivalence verification procedure, or controls that hold prompt length, token count, answer format (selection vs. judgment), or logical nesting constant. This is load-bearing for attributing the 31-56% degradation specifically to a 'combinatorial reasoning gap' rather than surface artifacts or evaluation biases.

    Authors: We agree that the abstract's brevity leaves the validity preservation claim underspecified. Section 3 of the full manuscript defines the deterministic transformation rules (converting 0-order selection to 2-order judgment via logical equivalence, where the model must judge whether a candidate satisfies the original condition), but we will revise the abstract to include a concise description of these rules and add an appendix with formal equivalence proofs, example transformations, and verification procedures. To address controls, we will incorporate new ablations in the revision that match prompt length, token count, and format across conditions, demonstrating that the observed degradations persist under these controls and are not attributable to surface artifacts. revision: yes

  2. Referee: [Abstract] Abstract (MMLU transfer paragraph): The reported drop from 89.84% to 42.86% is presented without error bars, per-item sample details, or confirmation that the same hardening rules and IRT controls were applied to MMLU items. Without these, it is unclear whether the degradation isolates the intended reasoning factor or reflects new format-induced biases.

    Authors: The MMLU transfer used the identical LogiHard transformation rules and IRT-based item selection on a subset of 100 MMLU items. We will revise the abstract and methods section to explicitly confirm this, report per-item sample details, and include error bars (e.g., standard deviation across items and bootstrap confidence intervals). These additions will clarify that the 47% drop isolates the combinatorial reasoning factor rather than format biases. revision: yes

  3. Referee: [Abstract] Abstract (evaluation paragraph): The accuracy degradations (31% to 56%) and identification of 'multi-select failure and early exit bias' lack any mention of statistical controls, human baseline performance on the identical hardened items, or ablation isolating the combinatorial element from length/format changes. This weakens the causal link to a training-induced completeness-verification deficit.

    Authors: We will add statistical controls (paired significance tests across models and items) to the evaluation section. The manuscript already notes that these biases are absent in human testees based on pilot observations; we will expand this with quantitative human performance metrics on the hardened items where available. For isolating the combinatorial element, we will include ablations comparing against length- and format-matched controls in the revision. These changes will strengthen the attribution to the completeness-verification deficit. revision: partial
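The statistical additions the rebuttal promises (error bars and paired significance tests) are standard machinery. A minimal stdlib sketch of both, run on synthetic 0/1 correctness data rather than the paper's results:

```python
import random
from math import comb

# Percentile bootstrap confidence interval for mean accuracy
# (the promised "error bars" on per-item scores).
def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Exact two-sided sign test on discordant item pairs (a McNemar-style
# "paired significance test"): items solved only in the original form
# versus items solved only in the hardened form.
def sign_test_p(n_orig_only: int, n_hard_only: int) -> float:
    n = n_orig_only + n_hard_only
    k = min(n_orig_only, n_hard_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

scores = [1] * 43 + [0] * 57        # synthetic: 43/100 items correct
low, high = bootstrap_ci(scores)    # interval around 0.43
p = sign_test_p(40, 5)              # strongly asymmetric discordant pairs
```

A heavily one-sided discordant count (many items solved only before hardening, few only after) yields a tiny p-value, which is the shape of evidence that would support the claimed degradation.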

Circularity Check

0 steps flagged

No significant circularity; core claims rest on direct empirical measurements

full rationale

The paper's central results consist of measured accuracy degradations (31-56% on LogiHard-2k items and 47% on MMLU zero-shot transfer) obtained by applying the described combinatorial transformation to selected questions and evaluating frontier models. IRT/CAT and 9-dimensional trace analysis serve only for item selection and difficulty ranking; they do not define or derive the reported failure modes (multi-select failure, early-exit bias, completeness-verification deficit) by construction. No equations reduce the performance gap to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and the validity-preservation assertion is presented as an independent property of the deterministic LogiHard mapping rather than a redefinition of the observed drops. The derivation therefore remains self-contained against external model evaluations and human baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that the combinatorial hardening preserves validity while increasing genuine reasoning demand; IRT parameters and the 9-dimensional ranking criteria are introduced without external validation in the abstract.

free parameters (1)
  • 9-dimensional analysis criteria
    Used to cognitively rank high-stakes examination questions via model thinking traces before transformation
axioms (1)
  • domain assumption · Combinatorial transformation preserves logical validity of original questions
    Invoked to support the claim of 'provable validity preservation' and domain-agnostic applicability
invented entities (1)
  • 2-order logical judgment · no independent evidence
    purpose: To increase thinking overhead and reasoning steps beyond 0-order selection
    New construct introduced by the framework to explain the observed failures

pith-pipeline@v0.9.0 · 5569 in / 1670 out tokens · 73799 ms · 2026-05-11T02:30:47.072674+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1]

    Claude opus 4.6 system card, 2026

    Anthropic. Claude opus 4.6 system card, 2026

  2. [2]

    A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026

    Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649:1139–1146, 2026

  3. [3]

    Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025

  4. [4]

    Benchmarking large language models under data contamination: A survey from static to dynamic evaluation

    Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, and Baishakhi Ray. Benchmarking large language models under data contamination: A survey from static to dynamic evaluation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Con...

  5. [5]

    Dynamic benchmarking of reasoning capabilities in code large language models under data contamination

    Simin Chen, Pranav Pusarla, and Baishakhi Ray. Dynamic benchmarking of reasoning capabilities in code large language models under data contamination. In Proceedings of the 42nd International Conference on Machine Learning (ICML). PMLR, 2025

  6. [6]

    Gemini 3.1 pro model card, February 2026

    Google DeepMind. Gemini 3.1 pro model card, February 2026

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  8. [8]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  9. [9]

    Glm-5: from vibe coding to agentic engineering, 2026

    GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Luce...

  10. [10]

    Changing answer order can decrease mmlu accuracy.arXiv preprint arXiv:2406.19470, 2024

    Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, and Megan Ung. Changing answer order can decrease mmlu accuracy.arXiv preprint arXiv:2406.19470, 2024

  11. [11]

    Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

  12. [12]

    Fluid language model benchmarking

    Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, and Noah A. Smith. Fluid language model benchmarking. In Second Conference on Language Modeling, 2025

  13. [13]

    Big-bench extra hard

    Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K Jain, Virginia Aglietti, Disha Jindal, Yuanzhu Peter Chen, et al. Big-bench extra hard. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26473–26501, 2025

  14. [14]

    Same meaning, different scores: Lexical and syntactic sensitivity in llm evaluation.arXiv preprint arXiv:2602.17316, 2026

    Bogdan Kosti´c, Conor Fallon, Julian Risch, and Alexander Löser. Same meaning, different scores: Lexical and syntactic sensitivity in llm evaluation.arXiv preprint arXiv:2602.17316, 2026

  15. [15]

    Comparative evaluation of openai o1 and human performance in higher order cognition.Scientific Reports, 2025

    Ehsan Latif, Yifan Zhou, Shuchen Guo, Yizhu Gao, Lehong Shi, Matthew Nyaaba, Arne Bewerdorff, Xiantong Yang, and Xiaoming Zhai. Comparative evaluation of openai o1 and human performance in higher order cognition.Scientific Reports, 2025

  16. [16]

    From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024

  17. [17]

    Okbench: Democratizing llm evaluation with fully automated, on-demand, open knowledge benchmarking, 2025

    Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. Okbench: Democratizing llm evaluation with fully automated, on-demand, open knowledge benchmarking, 2025

  18. [18]

    Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding

    Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023

  19. [19]

    Logicot: Logical chain-of-thought instruction tuning

    Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, and Yue Zhang. Logicot: Logical chain-of-thought instruction tuning. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2908–2921, 2023

  20. [20]

    Lord.Applications of Item Response Theory to Practical Testing Problems

    F.M. Lord.Applications of Item Response Theory to Practical Testing Problems. L. Erlbaum Associates, 1980

  21. [21]

    Do llms know when to not answer? investigating abstention abilities of large language models

    Nishanth Madhusudhan, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. Do llms know when to not answer? investigating abstention abilities of large language models. InProceedings of the 31st International Conference on Computational Linguistics, pages 9329–9345, 2025

  22. [22]

    Frontier llms still struggle with simple reasoning tasks, 2025

    Alan Malek, Jiawei Ge, Nevena Lazic, Chi Jin, András György, and Csaba Szepesvári. Frontier llms still struggle with simple reasoning tasks, 2025

  23. [23]

    Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2024

  24. [24]

    Close or cloze? assessing the robustness of large language models to adversarial perturbations via word recovery

    Luke Moffett and Bhuwan Dhingra. Close or cloze? assessing the robustness of large language models to adversarial perturbations via word recovery. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguistics, pages 6999–701...

  25. [25]

    s1: Simple test-time scaling, 2025

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025

  26. [26]

    NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav M...

  27. [27]

    Contamination detection for vlms using multi-modal semantic perturbation.International Conference on Learning Representations, 2026

    Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, and Yong Jae Lee. Contamination detection for vlms using multi-modal semantic perturbation.International Conference on Learning Representations, 2026

  28. [28]

    Large language models sensitivity to the order of options in multiple-choice questions

    Pouya Pezeshkpour and Estevam Hruschka. Large language models sensitivity to the order of options in multiple-choice questions. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 2006–2017, 2024

  29. [29]

    Qwq-32b: Embracing the power of reinforcement learning, March 2025

    Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

  30. [30]

    None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks

    Eva Sánchez Salido, Julio Gonzalo, and Guillermo Marco. None of the others: a general technique to distinguish reasoning from memorization in multiple-choice llm evaluation benchmarks. arXiv preprint arXiv:2502.12896, 2025

  31. [31]

    The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity

    Parshin Shojaee*, Iman Mirzadeh*, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. InNeurIPS, 2025

  32. [32]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  33. [33]

    The emperor’s new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination.arXiv preprint arXiv:2503.16402, 2025

    Yifan Sun, Han Wang, Dongbai Li, Gang Wang, and Huan Zhang. The emperor’s new clothes in benchmarking? a rigorous examination of mitigation strategies for llm benchmark data contamination.arXiv preprint arXiv:2503.16402, 2025

  34. [34]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them.arXiv preprint arXiv:2210.09261, 2022

  35. [35]

    None of the above, less of the right parallel patterns in human and llm performance on multi-choice questions answering

    Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, and Yun-Nung Chen. None of the above, less of the right parallel patterns in human and llm performance on multi-choice questions answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20112–20134, 2025

  36. [36]

    Kimi k2.5: Visual agentic intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  37. [37]

    Universal adversarial triggers for attacking and analyzing nlp, 2021

    Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing nlp, 2021

  38. [38]

    Livebench: A challenging, contamination-free LLM benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-free LLM benchmark. InThe...

  39. [39]

    AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge

    Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, and William Yang Wang. AntiLeakBench: Preventing data contamination by automatically constructing benchmarks with updated real-world knowledge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the ...

  40. [40]

    On memorization of large language models in logical reasoning

    Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors,Proceedin...

  41. [41]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  42. [42]

    Adversarial distractor generation for mcqa: Leveraging in-context learning and rule-based approaches. Natural Language Processing Journal, 13:100186, 2025

    Gulsum Yigit and Mehmet Fatih Amasyali. Adversarial distractor generation for mcqa: Leveraging in-context learning and rule-based approaches. Natural Language Processing Journal, 13:100186, 2025

  43. [43]

    Mmlu-cf: A contamination-free multi-task language understanding benchmark

    Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, and Furu Wei. Mmlu-cf: A contamination-free multi-task language understanding benchmark, 2024

    If Old Yan wins→domestic project paused (Q→S) **Evaluating each statement:** **Statement I:** "The company’s overseas project might not be damaged, and domestic product development project won’t be paused." From premises: P V Q. If P, then R. If Q, then S. So from P V Q, we get R V S (overseas damaged OR domestic paused). This means it’s NOT possible that...