The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

Aik Beng Ng; Ashutosh Kumar; Avinash Anand; Avni Mittal; Erik Cambria; Mahisha Ramesh; Rajiv Ratn Shah; Simon See; Timothy Liu; Zhengkui Wang

arxiv: 2606.11470 · v1 · pith:7SXRB2XKnew · submitted 2026-06-09 · 💻 cs.CL

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

Avinash Anand , Mahisha Ramesh , Avni Mittal , Ashutosh Kumar , Erik Cambria , Zhengkui Wang , Timothy Liu , Aik Beng Ng

show 2 more authors

Simon See Rajiv Ratn Shah

This is my paper

Pith reviewed 2026-06-27 12:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM reasoningtaxonomyChain-of-Thoughtfailure modessurveymulti-hop reasoningagentic reasoningreinforcement learning

0 comments

The pith

A taxonomy sorts LLM reasoning into nine paradigms and catalogs their shared failure modes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews more than 300 papers to create a structured taxonomy of how large language models approach reasoning tasks. It groups existing work into nine categories that include Chain-of-Thought prompting, multi-hop inference, mathematical problem solving, common-sense reasoning, visual and temporal tasks, code generation, retrieval-augmented methods, tool and agent use, and reinforcement learning approaches. The work then extracts recurring limitations that appear across these groups, such as hallucinations during reasoning steps, brittle multi-step logic, weak causal understanding, and limited ability to transfer skills to new domains. A reader would care because the resulting map shows where current techniques consistently fall short and points to concrete directions for building more dependable systems.

Core claim

The paper establishes that LLM reasoning research can be organized into a taxonomy of nine paradigms—Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning—while methodological trends in prompting, architectures, and benchmarks are analyzed and recurring limitations including reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization are synthesized across the literature.

What carries the argument

The structured taxonomy of nine LLM reasoning paradigms that classifies methods, architectures, training objectives, and evaluation approaches while exposing shared failure patterns.

Load-bearing premise

The more than 300 papers selected from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology form a representative and unbiased sample of all current LLM reasoning research.

What would settle it

Discovery of a substantial body of LLM reasoning papers that cannot be placed in any of the nine taxonomy categories or that exhibit failure modes absent from the synthesized list.

Figures

Figures reproduced from arXiv: 2606.11470 by Aik Beng Ng, Ashutosh Kumar, Avinash Anand, Avni Mittal, Erik Cambria, Mahisha Ramesh, Rajiv Ratn Shah, Simon See, Timothy Liu, Zhengkui Wang.

**Figure 2.** Figure 2: A taxonomy of LLM reasoning paradigms. Thirty-six families of methods are arranged into a [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of research papers by reasoning paradigm. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of publications by reasoning paradigm and year. Darker cells indicate higher publi [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Taxonomy of reasoning types covered in this survey, organized by category and subcategory. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: A shared skeleton (Question → Decompose → Retrieve → Step → Verify → Aggregate → Answer) with thirteen paradigm "lines" showing which stations each one stops at. Teaches that paradigms are commitments about which stages matter [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Chain-of-Thought and Multi-Hop reasoning interaction. Each reasoning hop produces an inter [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Meta-reasoning engines arranged by training requirement. SQuARE operates through prompting [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Two dimensions of CoT improvement. The efficiency zone (bottom) maps methods that reduce token cost: CCoT and LightThinker compress rationales into dense tokens, Soft Thinking replaces discrete tokens with embedding mixtures, Token Budget caps per-question length, and Chain of Draft writes minimal drafts. The robustness zone (upper left) maps methods that audit or clean chains: DeltaBench benchmarks error … view at source ↗

**Figure 10.** Figure 10: When to apply Chain-of-Thought. CoT produces strong gains on symbolic and multi-step tasks. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Multi-hop reasoning failure modes. Hop 1 (entity retrieval) typically succeeds, but Hop 2 (knowl [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Four layers of architectural innovation for mathematical reasoning. From bottom to top: low-level [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: The inference-time feedback loop for mathematical reasoning. Exploration generates candidate [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗

**Figure 14.** Figure 14: Three reinforcing pillars of mathematical reasoning improvement. Process supervision provides [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Mathematical reasoning flow. Row 1: generic LLM reasoning pipeline (shared across domains). [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗

**Figure 16.** Figure 16: Commonsense reasoning flow. Row 1: generic pipeline. Row 2: commonsense example showing [PITH_FULL_IMAGE:figures/full_fig_p031_16.png] view at source ↗

**Figure 17.** Figure 17: Multimodal reasoning flow. Row 1: generic vision-language pipeline. Row 2: example showing [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗

**Figure 18.** Figure 18: Temporal reasoning flow. Row 1: generic pipeline. Row 2: temporal example showing time pars [PITH_FULL_IMAGE:figures/full_fig_p038_18.png] view at source ↗

**Figure 19.** Figure 19: Code reasoning flow. Row 1: generic pipeline. Row 2: code-specific example showing intent [PITH_FULL_IMAGE:figures/full_fig_p044_19.png] view at source ↗

**Figure 20.** Figure 20: How RAG research directions connect. Long-context methods feed unified architectures, which [PITH_FULL_IMAGE:figures/full_fig_p049_20.png] view at source ↗

**Figure 21.** Figure 21: Retrieval-augmented reasoning flow. Row 1: generic RAG pipeline. Row 2: example showing [PITH_FULL_IMAGE:figures/full_fig_p050_21.png] view at source ↗

**Figure 22.** Figure 22: Layered architecture of tool-augmented and agentic reasoning systems. From bottom to top: [PITH_FULL_IMAGE:figures/full_fig_p052_22.png] view at source ↗

**Figure 23.** Figure 23: Three levers of reinforcement learning for reasoning. The LLM policy is refined through: (1) learn [PITH_FULL_IMAGE:figures/full_fig_p057_23.png] view at source ↗

**Figure 24.** Figure 24: Three strategies for cross-lingual reasoning transfer. Path A merges reasoning-strong and [PITH_FULL_IMAGE:figures/full_fig_p060_24.png] view at source ↗

**Figure 25.** Figure 25: Three nested levels of meta-reasoning. The [PITH_FULL_IMAGE:figures/full_fig_p065_25.png] view at source ↗

**Figure 26.** Figure 26: The social reasoning paradox. Direct answering yields moderate bias with low transparency. [PITH_FULL_IMAGE:figures/full_fig_p067_26.png] view at source ↗

**Figure 27.** Figure 27: Accuracy is the tip; faithfulness, robustness, calibration, generalization, efficiency, and safety sit [PITH_FULL_IMAGE:figures/full_fig_p076_27.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem solving, and contextual understanding, their reasoning behavior is often inconsistent and sensitive to prompting strategies, task design, and model scale. This survey provides a systematic analysis of more than 300 recent papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology to examine how reasoning capabilities emerge in LLMs and where they fail. We make three main contributions. First, we introduce a structured taxonomy of LLM reasoning research, covering Chain-of-Thought reasoning, multi-hop reasoning, mathematical reasoning, common sense reasoning, visual and temporal reasoning, code and algorithmic reasoning, retrieval-augmented reasoning, tool-augmented and agentic reasoning, and reinforcement learning-based reasoning. Second, we analyze methodological trends across these paradigms, including prompting methods, model architectures, training objectives, reward modeling, and evaluation benchmarks. Third, we synthesize recurring limitations and failure modes, such as reasoning hallucinations, brittle multi-step inference, weak causal abstraction, and poor cross-domain generalization. By organizing a rapidly expanding literature, this survey offers a unified view of the current capabilities and limitations of reasoning in LLMs. We also identify emerging research directions, including meta-reasoning, self-evolving reasoning frameworks, multimodal reasoning, and socially grounded reasoning. Overall, this work aims to serve as a reference for developing more robust, interpretable, and generalizable reasoning systems in future language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a broad survey that organizes LLM reasoning work into categories but leaves the paper selection process undocumented, limiting how much weight the taxonomy can carry.

read the letter

The core of this paper is a taxonomy that buckets LLM reasoning into nine areas: chain-of-thought, multi-hop, mathematical, commonsense, visual/temporal, code/algorithmic, retrieval-augmented, tool/agentic, and RL-based. It then flags shared failure modes like hallucinations, brittle multi-step steps, weak causal abstraction, and poor cross-domain transfer, plus some forward-looking topics such as meta-reasoning and self-evolving systems.

It does a reasonable job of naming the main paradigms and sketching trends in prompting, architectures, and benchmarks. For someone entering the area, the single document that points to representative papers across those buckets can cut down on initial reading time.

The soft spot is the missing selection protocol. The abstract lists the databases but supplies no search strings, date ranges, inclusion rules, or deduplication steps. Without those, it is difficult to judge whether the taxonomy over-weights heavily cited CoT work or under-samples less visible multimodal or agentic lines. The failure-mode synthesis therefore sits on an unverified sample rather than a transparent corpus.

The paper contains no new methods, experiments, or formal derivations, so its contribution is organizational. That is fine for a survey, but it means the value is mainly navigational.

This is the sort of paper that helps early-career researchers or people shifting into reasoning work get oriented. Specialists already following the literature will probably find little they do not already know. It is coherent on its own terms and engages the cited papers without obvious internal contradictions, so it meets the threshold for peer review if the authors can document how the 300 papers were chosen.

Referee Report

1 major / 0 minor

Summary. The paper claims to deliver a systematic survey of LLM reasoning by analyzing more than 300 papers drawn from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology. It introduces a taxonomy organized around nine paradigms (Chain-of-Thought, multi-hop, mathematical, commonsense, visual/temporal, code/algorithmic, retrieval-augmented, tool/agentic, and RL-based reasoning), examines trends in prompting, architectures, objectives, and benchmarks, synthesizes recurring failure modes (hallucinations, brittle multi-step inference, weak causal abstraction, poor cross-domain generalization), and flags emerging directions such as meta-reasoning and multimodal reasoning.

Significance. If the selection and synthesis were demonstrably representative, the work would offer a useful organizing framework for a fast-moving subfield and could help researchers avoid duplicated effort on known failure modes. The survey format itself carries modest incremental value once the methodology is made transparent; absent that transparency, the claimed unification of capabilities and limitations rests on an unverified sample.

major comments (1)

[Abstract] Abstract (and any methodology section): the central claim of a 'systematic analysis of more than 300 recent papers' that yields a reliable taxonomy and synthesized failure modes is unsupported because the manuscript supplies no search strings, date cutoffs, inclusion/exclusion criteria, deduplication procedure, or relevance-screening protocol. Without these elements the representativeness of the corpus cannot be assessed and the taxonomy may systematically over- or under-weight particular paradigms.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the transparency of our literature search process must be improved to support the claim of a systematic analysis, and we will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract (and any methodology section): the central claim of a 'systematic analysis of more than 300 recent papers' that yields a reliable taxonomy and synthesized failure modes is unsupported because the manuscript supplies no search strings, date cutoffs, inclusion/exclusion criteria, deduplication procedure, or relevance-screening protocol. Without these elements the representativeness of the corpus cannot be assessed and the taxonomy may systematically over- or under-weight particular paradigms.

Authors: We acknowledge that the manuscript does not currently include explicit details on the search methodology. In the revised version we will insert a new 'Literature Collection and Screening' subsection (placed after the introduction) that documents: the primary search strings employed across arXiv, Semantic Scholar, Google Scholar, Papers with Code, and ACL Anthology (combinations of terms such as "LLM reasoning", "chain-of-thought", "multi-hop reasoning", "mathematical reasoning", "commonsense reasoning", "tool use", "agentic reasoning", and "failure modes"); the date range (primarily January 2022–March 2024, with selected foundational works from 2018–2021); inclusion criteria (peer-reviewed or preprint papers that either introduce a new reasoning method, benchmark, or failure analysis within one of the nine paradigms); exclusion criteria (non-English works, purely theoretical papers without empirical results, and duplicates); the deduplication procedure (title/abstract matching followed by manual review); and the relevance screening protocol (initial keyword filter followed by abstract-level review by the authors). We will also note that the collection was not performed under a formal PRISMA-style protocol. These additions will allow readers to assess corpus representativeness. If the documented process reveals gaps, we will temper the abstract's wording from "systematic analysis" to "comprehensive survey" as appropriate. revision: yes

Circularity Check

0 steps flagged

No circularity: survey aggregates external literature

full rationale

The paper is a literature survey that constructs a taxonomy by reviewing >300 external papers from public databases. It contains no equations, no fitted parameters, no predictions derived from internal data, and no load-bearing self-citations that reduce the central claims to prior author work by definition. The selection protocol is undocumented, but this is a methodological limitation, not a circular reduction of any derivation to its own inputs. All listed patterns (self-definitional, fitted-input-called-prediction, etc.) are absent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey; the central claims rest on the body of reviewed papers rather than new parameters, axioms, or postulated entities introduced by the authors.

pith-pipeline@v0.9.1-grok · 5854 in / 1162 out tokens · 32029 ms · 2026-06-27T12:58:10.285653+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

300 extracted references · 64 canonical work pages · 10 internal anchors

[1]

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Roman Abramov, Felix Steinbauer, and Gjergji Kasneci. Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers . arXiv preprint arXiv:2504.20752, 2025. doi:10.48550/arXiv.2504.20752. URL https://arxiv.org/abs/2504.20752

work page doi:10.48550/arxiv.2504.20752 2025
[2]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[3]

Open deep search: Democratizing search with open-source reasoning agents

Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, et al. Open deep search: Democratizing search with open-source reasoning agents. arXiv preprint arXiv:2503.20201, 2025. URL https://arxiv.org/abs/2503.20201

arXiv 2025
[4]

Can llms reason like humans? assessing theory of mind reasoning in llms for open-ended questions

Maryam Amirizaniani et al. Can llms reason like humans? assessing theory of mind reasoning in llms for open-ended questions. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024. URL https://dl.acm.org/doi/10.1145/3627673.3679832

work page doi:10.1145/3627673.3679832 2024
[5]

Training language models to reason efficiently, 2025

Daman Arora and Andrea Zanette. Training language models to reason efficiently, 2025. URL https://arxiv. org/abs/2502.04463. URL https://arxiv.org/abs/2502.04463

arXiv 2025
[6]

Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching

Simon A Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint arXiv:2503.05179, 2025. URL https://arxiv.org/abs/2503.05179

arXiv 2025
[7]

Adaptive domain modeling with language models: A multi-agent approach to task planning

Harisankar Babu, Philipp Schillinger, and Tamim Asfour. Adaptive domain modeling with language models: A multi-agent approach to task planning. arXiv preprint arXiv:2506.19592, 2025

arXiv 2025
[8]

Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7421--7454, 2024 a

2024
[9]

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks . arXiv preprint arXiv:2412.15204, 2024 b . doi:10.48550/arXiv.2412.15204. URL https://arxiv.org/abs/2412.15204

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15204 2024
[10]

Longwriter: Unleashing 10,000+ word generation from long context llms

Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longwriter: Unleashing 10,000+ word generation from long context llms. arXiv preprint arXiv:2408.07055, 2024 c . URL https://arxiv.org/abs/2408.07055

arXiv 2024
[11]

Hunter, and Katharina von der Wense

George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E. Hunter, and Katharina von der Wense. Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA . arXiv preprint arXiv:2412.10079, 2024. doi:10.48550/arXiv.2412.10079. URL https://arxiv.org/abs/2412.10079

work page doi:10.48550/arxiv.2412.10079 2024
[12]

A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023. URL https://arxiv.org/abs/2302.04023

Pith/arXiv arXiv 2023
[13]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Stickland, Tomek Korbak, and Owain Evans. The reversal curse: Llms trained on “a is b” fail to learn “b is a”. In International Conference on Learning Representations, volume 2024, pp.\ 18623--18642, 2024

2024
[14]

Reasoning language models: A blueprint

Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025. URL https://arxiv.org/abs/2501.11223

arXiv 2025
[15]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

1901
[16]

Hierarchical prompting taxonomy: A universal evaluation framework for large language models aligned with human cognitive principles

Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, and Aman Chadha. Hierarchical prompting taxonomy: A universal evaluation framework for large language models aligned with human cognitive principles. arXiv preprint arXiv:2406.12644, 2024

arXiv 2024
[17]

Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models

Huanqia Cai, Yijun Yang, and Winston Hu. Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698, 2025. URL https://arxiv.org/abs/2502.00698

arXiv 2025
[18]

Senticnet 7: A commonsense-based neurosymbolic ai framework for explainable sentiment analysis

Erik Cambria, Qian Liu, Sergio Decherchi, Frank Xing, and Kenneth Kwok. Senticnet 7: A commonsense-based neurosymbolic ai framework for explainable sentiment analysis. In Proceedings of the thirteenth language resources and evaluation conference, pp.\ 3829--3839, 2022

2022
[19]

Chart-based reasoning: Transferring capabilities from llms to vlms

Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from llms to vlms. arXiv preprint arXiv:2403.12596, 2024. URL https://arxiv.org/abs/2403.12596

arXiv 2024
[20]

Language models as compilers: Simulating pseudocode execution improves algorithmic reasoning in language models

Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, et al. Language models as compilers: Simulating pseudocode execution improves algorithmic reasoning in language models. arXiv preprint arXiv:2404.02575, 2024. URL https://arxiv.org/abs/2404.02575

arXiv 2024
[21]

AlphaMath Almost Zero: Process Supervision without Process

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath Almost Zero: Process Supervision without Process . Advances in Neural Information Processing Systems, 37, 2024 a . doi:10.52202/079017-0870. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/30dfe47a3ccbee68cffa0c19ccb1bc00-Abstract-Conference.html

work page doi:10.52202/079017-0870 2024
[22]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs . arXiv preprint arXiv:2412.18925, 2024 b . doi:10.48550/arXiv.2412.18925. URL https://arxiv.org/abs/2412.18925

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.18925 2024
[23]

Magdi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models

Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. Magdi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models. arXiv preprint arXiv:2402.01620, 2024 c . URL https://arxiv.org/abs/2402.01620

arXiv 2024
[24]

Reverse thinking makes llms stronger reasoners

Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. Reverse thinking makes llms stronger reasoners. arXiv preprint arXiv:2411.19865, 2024 d . URL https://arxiv.org/abs/2411.19865

arXiv 2024
[25]

Breaking language barriers in multilingual mathematical reasoning: Insights and observations

Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246, 2023. URL https://arxiv.org/abs/2310.20246

arXiv 2023
[26]

Judgelrm: Large reasoning models as a judge

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025 a . URL https://arxiv.org/abs/2504.00050

arXiv 2025
[27]

Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought

Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, and Wanxiang Che. Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought . Advances in Neural Information Processing Systems, 37: 0 54872--54904, 2024 e . doi:10.52202/079017-1740. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/62ab1c...

work page doi:10.52202/079017-1740 2024
[28]

Advancing tool-augmented large language models: Integrating insights from errors in inference trees

Sijia Chen, Yibo Wang, Yi-Feng Wu, Qingguo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Advancing tool-augmented large language models: Integrating insights from errors in inference trees. Advances in Neural Information Processing Systems, 37: 0 106555--106581, 2024 f . URL https://arxiv.org/abs/2406.07115

arXiv 2024
[29]

Tabfact: A large-scale dataset for table-based fact verification

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019

arXiv 1909
[30]

X. Chen, G. Li, Z. Wang, B. Jin, C. Qian, Y. Wang, and H. Ji. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025 b . URL https://arxiv.org/pdf/2505.02387

arXiv 2025
[31]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs . arXiv preprint arXiv:2412.21187, 2024 g . doi:10.48550/arXiv.2412.21187. URL https://arxiv.org/abs/2412.21187

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.21187 2024
[32]

Premise Order Matters in Reasoning with Large Language Models

Xinyun Chen, Ryan Andrew Chi, Xuezhi Wang, and Denny Zhou. Premise Order Matters in Reasoning with Large Language Models . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 6596--6620. PMLR, 2024 h . URL https://proceedings.mlr.press/v235/chen24i.html

2024
[33]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Jeffrey Cheng and Benjamin Van Durme. Compressed Chain of Thought: Efficient Reasoning Through Dense Representations . arXiv preprint arXiv:2412.13171, 2024. doi:10.48550/arXiv.2412.13171. URL https://arxiv.org/abs/2412.13171

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.13171 2024
[34]

Coral: Benchmarking multi-turn conversational retrieval-augmentation generation

Yiruo Cheng, Kelong Mao, Ziliang Zhao, Guanting Dong, Hongjin Qian, Yongkang Wu, Tetsuya Sakai, Ji-Rong Wen, and Zhicheng Dou. Coral: Benchmarking multi-turn conversational retrieval-augmentation generation. arXiv preprint arXiv:2410.23090, 2024. URL https://arxiv.org/abs/2410.23090

arXiv 2024
[35]

Thoughtsculpt: Reasoning with intermediate revision and search

Yizhou Chi, Kevin Yang, and Dan Klein. Thoughtsculpt: Reasoning with intermediate revision and search. arXiv preprint arXiv:2404.05966, 2024. URL https://arxiv.org/abs/2404.05966

arXiv 2024
[36]

What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, and Ivan P Yamshchikov. What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks . arXiv preprint arXiv:2504.07825, 2025. doi:10.48550/arXiv.2504.07825. URL https://arxiv.org/abs/2504.07825

work page doi:10.48550/arxiv.2504.07825 2025
[37]

Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models

Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3043--3054, 2023. URL https://arxiv.org/abs/2202.04053

arXiv 2023
[38]

M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv preprint arXiv:2411.04952, 2024. URL https://arxiv.org/abs/2411.04952

arXiv 2024
[39]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of machine learning research, 24 0 (240): 0 1--113, 2023

2023
[40]

Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models. arXiv preprint arXiv:2311.17667, 2023. URL https://arxiv.org/abs/2311.17667

arXiv 2023
[41]

BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering

Zheng Chu, Jingchang Chen, Qianglong Chen, Haotian Wang, Kun Zhu, Xiyuan Du, Weijiang Yu, Ming Liu, and Bing Qin. BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1229--1248. Associatio...

work page doi:10.18653/v1/2024.acl-long.67 2024
[42]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021
[43]

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025. URL https://arxiv.org/abs/2502.08235

arXiv 2025
[44]

Self-recognition in language models

Tim R Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, and Caglar Gulcehre. Self-recognition in language models. arXiv preprint arXiv:2407.06946, 2024. URL https://arxiv.org/abs/2407.06946

arXiv 2024
[45]

Benchmarks for Automated Commonsense Reasoning: A Survey

Ernest Davis. Benchmarks for Automated Commonsense Reasoning: A Survey . ACM Computing Surveys, 56 0 (4): 0 1--41, 2024. doi:10.1145/3615355. URL https://doi.org/10.1145/3615355

work page doi:10.1145/3615355 2024
[46]

Flow-dpo: Improving llm mathematical reasoning through online multi-agent learning

Yihe Deng and Paul Mineiro. Flow-dpo: Improving llm mathematical reasoning through online multi-agent learning. arXiv preprint arXiv:2410.22304, 2024. URL https://arxiv.org/abs/2410.22304

arXiv 2024
[47]

Retrievegpt: Merging prompts and mathematical models for enhanced code-mixed information retrieval

Aniket Deroy and Subhankar Maity. Retrievegpt: Merging prompts and mathematical models for enhanced code-mixed information retrieval. arXiv preprint arXiv:2411.04752, 2024. URL https://arxiv.org/abs/2411.04752

arXiv 2024
[48]

Mozer, and Sanjeev Arora

Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael C. Mozer, and Sanjeev Arora. Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving . Advances in Neural Information Processing Systems, 37: 0 19783--19812, 2024. doi:10.52202/079017-0623. URL htt...

work page doi:10.52202/079017-0623 2024
[49]

Mapeval: A map-based evaluation of geo-spatial reasoning in foundation models

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapeval: A map-based evaluation of geo-spatial reasoning in foundation models. arXiv preprint arXiv:2501.00316, 2024. URL https://arxiv.org/abs/2501.00316

arXiv 2024
[50]

Unleashing reasoning capability of llms via scalable question synthesis from scratch

Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, and Min Zhang. Unleashing reasoning capability of llms via scalable question synthesis from scratch. arXiv preprint arXiv:2410.18693, 2024. URL https://arxiv.org/abs/2410.18693

arXiv 2024
[51]

Progressive multimodal reasoning via active retrieval

Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Progressive multimodal reasoning via active retrieval. arXiv preprint arXiv:2412.14835, 2024. URL https://arxiv.org/abs/2412.14835

arXiv 2024
[52]

Diverse inference and verification for advanced reasoning

Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, et al. Diverse inference and verification for advanced reasoning. arXiv preprint arXiv:2502.09955, 2025. URL https://arxiv.org/abs/2502.09955

arXiv 2025
[53]

Competitive programming with large reasoning models

Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025. URL https://arxiv.org/abs/2502.06807

arXiv 2025
[54]

Multinrc: A challenging and native multilingual reasoning evaluation benchmark for llms

Alexander R Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, and Chen Xing. Multinrc: A challenging and native multilingual reasoning evaluation benchmark for llms. arXiv preprint arXiv:2507.17476, 2025

arXiv 2025
[55]

Slam: Towards efficient multilingual reasoning via selective language alignment

Yuchun Fan, Yongyu Mu, Yilin Wang, Lei Huang, Junhao Ruan, Bei Li, Tong Xiao, Shujian Huang, Xiaocheng Feng, and Jingbo Zhu. Slam: Towards efficient multilingual reasoning via selective language alignment. arXiv preprint arXiv:2501.03681, 2025. URL https://aclanthology.org/2025.coling-main.637/

arXiv 2025
[56]

Synworld: Virtual scenario synthesis for agentic action knowledge refinement

Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, et al. Synworld: Virtual scenario synthesis for agentic action knowledge refinement. arXiv preprint arXiv:2504.03561, 2025. URL https://arxiv.org/abs/2504.03561

arXiv 2025
[57]

Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs

Tianqing Fang, Zeming Chen, Yangqiu Song, and Antoine Bosselut. Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 11365--11384, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2...

work page doi:10.18653/v1/2024.acl-long.613 2024
[58]

Test of time: A benchmark for evaluating llms on temporal reasoning

Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, and Bryan Perozzi. Test of time: A benchmark for evaluating llms on temporal reasoning. arXiv preprint arXiv:2406.09170, 2024. URL https://arxiv.org/abs/2406.09170

arXiv 2024
[59]

Video-r1: Reinforcing video reasoning in mllms

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. URL https://arxiv.org/abs/2503.21776

Pith/arXiv arXiv 2025
[60]

Rag foundry: A framework for enhancing llms for retrieval augmented generation

Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, and Peter Izsak. Rag foundry: A framework for enhancing llms for retrieval augmented generation. arXiv preprint arXiv:2408.02545, 2024. URL https://arxiv.org/abs/2408.02545

arXiv 2024
[61]

SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models

Daniel Fleischer, Moshe Berchansky, Gad Markovits, and Moshe Wasserblat. SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models . arXiv preprint arXiv:2502.09390, 2025. doi:10.48550/arXiv.2502.09390. URL https://arxiv.org/abs/2502.09390

work page doi:10.48550/arxiv.2502.09390 2025
[62]

Re-adaptir: Improving information retrieval through reverse engineered adaptation

William Fleshman and Benjamin Van Durme. Re-adaptir: Improving information retrieval through reverse engineered adaptation. arXiv preprint arXiv:2406.14764, 2024. URL https://arxiv.org/abs/2406.14764

arXiv 2024
[63]

Multiple choice questions: Reasoning makes large language models (llms) more self-confident even when they are wrong

Tairan Fu, Javier Conde, Gonzalo Mart \' nez, Mar \' a Grandury, and Pedro Reviriego. Multiple choice questions: Reasoning makes large language models (llms) more self-confident even when they are wrong. arXiv preprint arXiv:2501.09775, 2025. URL https://arxiv.org/abs/2501.09775

Pith/arXiv arXiv 2025
[64]

Efficiently serving llm reasoning programs with certaindex

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, and Hao Zhang. Efficiently serving llm reasoning programs with certaindex. arXiv preprint arXiv:2412.20993, 2024. URL https://arxiv.org/abs/2412.20993

arXiv 2024
[65]

I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders

Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y Rogov, Elena Tutubalina, and Ivan Oseledets. I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders. arXiv preprint arXiv:2503.18878, 2025. URL https://arxiv.org/abs/2503.18878

arXiv 2025
[66]

Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, Lei Liang, and Jun Zhou. Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts . arXiv preprint arXiv:2405.19893, 2024. doi:10.48550/arXiv.2405.19893. URL https://arxiv.org/abs/2405.19893

work page doi:10.48550/arxiv.2405.19893 2024
[67]

Understanding social reasoning in language models with language models

Kanishk Gandhi, Jan-Philipp Fr \"a nken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36: 0 13518--13529, 2023. URL https://arxiv.org/abs/2306.15448

arXiv 2023
[68]

Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025. URL https://arxiv.org/abs/2503.01307

Pith/arXiv arXiv 2025
[69]

Efficient tool use with chain-of-abstraction reasoning

Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, and Tianlu Wang. Efficient tool use with chain-of-abstraction reasoning. arXiv preprint arXiv:2401.17464, 2024. URL https://arxiv.org/abs/2401.17464

arXiv 2024
[70]

Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues

Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, and Yi Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues. arXiv preprint arXiv:2502.01630, 2025. URL https://arxiv.org/abs/2502.01630

arXiv 2025
[71]

Scaling up test-time compute with latent reasoning: A recurrent depth approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. URL https://arxiv.org/abs/2502.05171

Pith/arXiv arXiv 2025
[72]

Shortcut learning in deep neural networks

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2 0 (11): 0 665--673, 2020

2020
[73]

Large language models are not strong abstract reasoners

Ga \"e l Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are not strong abstract reasoners. arXiv preprint arXiv:2305.19555, 2023

arXiv 2023
[74]

The multilingual mind: A survey of multilingual reasoning in language models

Akash Ghosh, Debayan Datta, Sriparna Saha, and Chirag Agarwal. The multilingual mind: A survey of multilingual reasoning in language models. arXiv preprint arXiv:2502.09457, 2025. URL https://arxiv.org/abs/2502.09457

arXiv 2025
[75]

Compositional preference models for aligning lms

Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023. URL https://arxiv.org/abs/2310.13011

arXiv 2023
[76]

Great models think alike and this undermines ai oversight

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, and Jonas Geiping. Great models think alike and this undermines ai oversight. arXiv preprint arXiv:2502.04313, 2025. URL https://arxiv.org/abs/2502.04313

arXiv 2025
[77]

Towards an ai co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025. URL https://arxiv.org/abs/2502.18864

Pith/arXiv arXiv 2025
[78]

Large language models orchestrating structured reasoning achieve kaggle grandmaster level

Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, et al. Large language models orchestrating structured reasoning achieve kaggle grandmaster level. arXiv preprint arXiv:2411.03562, 2024. URL https://arxiv.org/a...

arXiv 2024
[79]

Cruxeval: A benchmark for code reasoning, understanding and execution

Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024. URL https://arxiv.org/abs/2401.03065

Pith/arXiv arXiv 2024
[80]

Auditing prompt caching in language model apis

Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. Auditing prompt caching in language model apis. arXiv preprint arXiv:2502.07776, 2025. URL https://arxiv.org/abs/2502.07776

arXiv 2025

Showing first 80 references.

[1] [1]

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers

Roman Abramov, Felix Steinbauer, and Gjergji Kasneci. Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers . arXiv preprint arXiv:2504.20752, 2025. doi:10.48550/arXiv.2504.20752. URL https://arxiv.org/abs/2504.20752

work page doi:10.48550/arxiv.2504.20752 2025

[2] [2]

Gpt-4 technical report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[3] [3]

Open deep search: Democratizing search with open-source reasoning agents

Salaheddin Alzubi, Creston Brooks, Purva Chiniya, Edoardo Contente, Chiara von Gerlach, Lucas Irwin, Yihan Jiang, Arda Kaz, Windsor Nguyen, Sewoong Oh, et al. Open deep search: Democratizing search with open-source reasoning agents. arXiv preprint arXiv:2503.20201, 2025. URL https://arxiv.org/abs/2503.20201

arXiv 2025

[4] [4]

Can llms reason like humans? assessing theory of mind reasoning in llms for open-ended questions

Maryam Amirizaniani et al. Can llms reason like humans? assessing theory of mind reasoning in llms for open-ended questions. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024. URL https://dl.acm.org/doi/10.1145/3627673.3679832

work page doi:10.1145/3627673.3679832 2024

[5] [5]

Training language models to reason efficiently, 2025

Daman Arora and Andrea Zanette. Training language models to reason efficiently, 2025. URL https://arxiv. org/abs/2502.04463. URL https://arxiv.org/abs/2502.04463

arXiv 2025

[6] [6]

Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching

Simon A Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint arXiv:2503.05179, 2025. URL https://arxiv.org/abs/2503.05179

arXiv 2025

[7] [7]

Adaptive domain modeling with language models: A multi-agent approach to task planning

Harisankar Babu, Philipp Schillinger, and Tamim Asfour. Adaptive domain modeling with language models: A multi-agent approach to task planning. arXiv preprint arXiv:2506.19592, 2025

arXiv 2025

[8] [8]

Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7421--7454, 2024 a

2024

[9] [9]

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks . arXiv preprint arXiv:2412.15204, 2024 b . doi:10.48550/arXiv.2412.15204. URL https://arxiv.org/abs/2412.15204

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15204 2024

[10] [10]

Longwriter: Unleashing 10,000+ word generation from long context llms

Yushi Bai, Jiajie Zhang, Xin Lv, Linzhi Zheng, Siqi Zhu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longwriter: Unleashing 10,000+ word generation from long context llms. arXiv preprint arXiv:2408.07055, 2024 c . URL https://arxiv.org/abs/2408.07055

arXiv 2024

[11] [11]

Hunter, and Katharina von der Wense

George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E. Hunter, and Katharina von der Wense. Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA . arXiv preprint arXiv:2412.10079, 2024. doi:10.48550/arXiv.2412.10079. URL https://arxiv.org/abs/2412.10079

work page doi:10.48550/arxiv.2412.10079 2024

[12] [12]

A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023. URL https://arxiv.org/abs/2302.04023

Pith/arXiv arXiv 2023

[13] [13]

a is b” fail to learn “b is a

Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Stickland, Tomek Korbak, and Owain Evans. The reversal curse: Llms trained on “a is b” fail to learn “b is a”. In International Conference on Learning Representations, volume 2024, pp.\ 18623--18642, 2024

2024

[14] [14]

Reasoning language models: A blueprint

Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025. URL https://arxiv.org/abs/2501.11223

arXiv 2025

[15] [15]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

1901

[16] [16]

Hierarchical prompting taxonomy: A universal evaluation framework for large language models aligned with human cognitive principles

Devichand Budagam, Ashutosh Kumar, Mahsa Khoshnoodi, Sankalp KJ, Vinija Jain, and Aman Chadha. Hierarchical prompting taxonomy: A universal evaluation framework for large language models aligned with human cognitive principles. arXiv preprint arXiv:2406.12644, 2024

arXiv 2024

[17] [17]

Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models

Huanqia Cai, Yijun Yang, and Winston Hu. Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models. arXiv preprint arXiv:2502.00698, 2025. URL https://arxiv.org/abs/2502.00698

arXiv 2025

[18] [18]

Senticnet 7: A commonsense-based neurosymbolic ai framework for explainable sentiment analysis

Erik Cambria, Qian Liu, Sergio Decherchi, Frank Xing, and Kenneth Kwok. Senticnet 7: A commonsense-based neurosymbolic ai framework for explainable sentiment analysis. In Proceedings of the thirteenth language resources and evaluation conference, pp.\ 3829--3839, 2022

2022

[19] [19]

Chart-based reasoning: Transferring capabilities from llms to vlms

Victor Carbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, and Abhanshu Sharma. Chart-based reasoning: Transferring capabilities from llms to vlms. arXiv preprint arXiv:2403.12596, 2024. URL https://arxiv.org/abs/2403.12596

arXiv 2024

[20] [20]

Language models as compilers: Simulating pseudocode execution improves algorithmic reasoning in language models

Hyungjoo Chae, Yeonghyeon Kim, Seungone Kim, Kai Tzu-iunn Ong, Beong-woo Kwak, Moohyeon Kim, Seonghwan Kim, Taeyoon Kwon, Jiwan Chung, Youngjae Yu, et al. Language models as compilers: Simulating pseudocode execution improves algorithmic reasoning in language models. arXiv preprint arXiv:2404.02575, 2024. URL https://arxiv.org/abs/2404.02575

arXiv 2024

[21] [21]

AlphaMath Almost Zero: Process Supervision without Process

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. AlphaMath Almost Zero: Process Supervision without Process . Advances in Neural Information Processing Systems, 37, 2024 a . doi:10.52202/079017-0870. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/30dfe47a3ccbee68cffa0c19ccb1bc00-Abstract-Conference.html

work page doi:10.52202/079017-0870 2024

[22] [22]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs . arXiv preprint arXiv:2412.18925, 2024 b . doi:10.48550/arXiv.2412.18925. URL https://arxiv.org/abs/2412.18925

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.18925 2024

[23] [23]

Magdi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models

Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, and Mohit Bansal. Magdi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models. arXiv preprint arXiv:2402.01620, 2024 c . URL https://arxiv.org/abs/2402.01620

arXiv 2024

[24] [24]

Reverse thinking makes llms stronger reasoners

Justin Chih-Yao Chen, Zifeng Wang, Hamid Palangi, Rujun Han, Sayna Ebrahimi, Long Le, Vincent Perot, Swaroop Mishra, Mohit Bansal, Chen-Yu Lee, et al. Reverse thinking makes llms stronger reasoners. arXiv preprint arXiv:2411.19865, 2024 d . URL https://arxiv.org/abs/2411.19865

arXiv 2024

[25] [25]

Breaking language barriers in multilingual mathematical reasoning: Insights and observations

Nuo Chen, Zinan Zheng, Ning Wu, Ming Gong, Dongmei Zhang, and Jia Li. Breaking language barriers in multilingual mathematical reasoning: Insights and observations. arXiv preprint arXiv:2310.20246, 2023. URL https://arxiv.org/abs/2310.20246

arXiv 2023

[26] [26]

Judgelrm: Large reasoning models as a judge

Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, and Bingsheng He. Judgelrm: Large reasoning models as a judge. arXiv preprint arXiv:2504.00050, 2025 a . URL https://arxiv.org/abs/2504.00050

arXiv 2025

[27] [27]

Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought

Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, and Wanxiang Che. Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought . Advances in Neural Information Processing Systems, 37: 0 54872--54904, 2024 e . doi:10.52202/079017-1740. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/62ab1c...

work page doi:10.52202/079017-1740 2024

[28] [28]

Advancing tool-augmented large language models: Integrating insights from errors in inference trees

Sijia Chen, Yibo Wang, Yi-Feng Wu, Qingguo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Advancing tool-augmented large language models: Integrating insights from errors in inference trees. Advances in Neural Information Processing Systems, 37: 0 106555--106581, 2024 f . URL https://arxiv.org/abs/2406.07115

arXiv 2024

[29] [29]

Tabfact: A large-scale dataset for table-based fact verification

Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019

arXiv 1909

[30] [30]

X. Chen, G. Li, Z. Wang, B. Jin, C. Qian, Y. Wang, and H. Ji. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025 b . URL https://arxiv.org/pdf/2505.02387

arXiv 2025

[31] [31]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs . arXiv preprint arXiv:2412.21187, 2024 g . doi:10.48550/arXiv.2412.21187. URL https://arxiv.org/abs/2412.21187

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.21187 2024

[32] [32]

Premise Order Matters in Reasoning with Large Language Models

Xinyun Chen, Ryan Andrew Chi, Xuezhi Wang, and Denny Zhou. Premise Order Matters in Reasoning with Large Language Models . In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 6596--6620. PMLR, 2024 h . URL https://proceedings.mlr.press/v235/chen24i.html

2024

[33] [33]

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Jeffrey Cheng and Benjamin Van Durme. Compressed Chain of Thought: Efficient Reasoning Through Dense Representations . arXiv preprint arXiv:2412.13171, 2024. doi:10.48550/arXiv.2412.13171. URL https://arxiv.org/abs/2412.13171

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.13171 2024

[34] [34]

Coral: Benchmarking multi-turn conversational retrieval-augmentation generation

Yiruo Cheng, Kelong Mao, Ziliang Zhao, Guanting Dong, Hongjin Qian, Yongkang Wu, Tetsuya Sakai, Ji-Rong Wen, and Zhicheng Dou. Coral: Benchmarking multi-turn conversational retrieval-augmentation generation. arXiv preprint arXiv:2410.23090, 2024. URL https://arxiv.org/abs/2410.23090

arXiv 2024

[35] [35]

Thoughtsculpt: Reasoning with intermediate revision and search

Yizhou Chi, Kevin Yang, and Dan Klein. Thoughtsculpt: Reasoning with intermediate revision and search. arXiv preprint arXiv:2404.05966, 2024. URL https://arxiv.org/abs/2404.05966

arXiv 2024

[36] [36]

What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, and Ivan P Yamshchikov. What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks . arXiv preprint arXiv:2504.07825, 2025. doi:10.48550/arXiv.2504.07825. URL https://arxiv.org/abs/2504.07825

work page doi:10.48550/arxiv.2504.07825 2025

[37] [37]

Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models

Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 3043--3054, 2023. URL https://arxiv.org/abs/2202.04053

arXiv 2023

[38] [38]

M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv preprint arXiv:2411.04952, 2024. URL https://arxiv.org/abs/2411.04952

arXiv 2024

[39] [39]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of machine learning research, 24 0 (240): 0 1--113, 2023

2023

[40] [40]

Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models. arXiv preprint arXiv:2311.17667, 2023. URL https://arxiv.org/abs/2311.17667

arXiv 2023

[41] [41]

BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering

Zheng Chu, Jingchang Chen, Qianglong Chen, Haotian Wang, Kun Zhu, Xiyuan Du, Weijiang Yu, Ming Liu, and Bing Qin. BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1229--1248. Associatio...

work page doi:10.18653/v1/2024.acl-long.67 2024

[42] [42]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

Pith/arXiv arXiv 2021

[43] [43]

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025. URL https://arxiv.org/abs/2502.08235

arXiv 2025

[44] [44]

Self-recognition in language models

Tim R Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, and Caglar Gulcehre. Self-recognition in language models. arXiv preprint arXiv:2407.06946, 2024. URL https://arxiv.org/abs/2407.06946

arXiv 2024

[45] [45]

Benchmarks for Automated Commonsense Reasoning: A Survey

Ernest Davis. Benchmarks for Automated Commonsense Reasoning: A Survey . ACM Computing Surveys, 56 0 (4): 0 1--41, 2024. doi:10.1145/3615355. URL https://doi.org/10.1145/3615355

work page doi:10.1145/3615355 2024

[46] [46]

Flow-dpo: Improving llm mathematical reasoning through online multi-agent learning

Yihe Deng and Paul Mineiro. Flow-dpo: Improving llm mathematical reasoning through online multi-agent learning. arXiv preprint arXiv:2410.22304, 2024. URL https://arxiv.org/abs/2410.22304

arXiv 2024

[47] [47]

Retrievegpt: Merging prompts and mathematical models for enhanced code-mixed information retrieval

Aniket Deroy and Subhankar Maity. Retrievegpt: Merging prompts and mathematical models for enhanced code-mixed information retrieval. arXiv preprint arXiv:2411.04752, 2024. URL https://arxiv.org/abs/2411.04752

arXiv 2024

[48] [48]

Mozer, and Sanjeev Arora

Aniket Didolkar, Anirudh Goyal, Nan Rosemary Ke, Siyuan Guo, Michal Valko, Timothy Lillicrap, Danilo Rezende, Yoshua Bengio, Michael C. Mozer, and Sanjeev Arora. Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving . Advances in Neural Information Processing Systems, 37: 0 19783--19812, 2024. doi:10.52202/079017-0623. URL htt...

work page doi:10.52202/079017-0623 2024

[49] [49]

Mapeval: A map-based evaluation of geo-spatial reasoning in foundation models

Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, and Md Rizwan Parvez. Mapeval: A map-based evaluation of geo-spatial reasoning in foundation models. arXiv preprint arXiv:2501.00316, 2024. URL https://arxiv.org/abs/2501.00316

arXiv 2024

[50] [50]

Unleashing reasoning capability of llms via scalable question synthesis from scratch

Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Qiaoming Zhu, and Min Zhang. Unleashing reasoning capability of llms via scalable question synthesis from scratch. arXiv preprint arXiv:2410.18693, 2024. URL https://arxiv.org/abs/2410.18693

arXiv 2024

[51] [51]

Progressive multimodal reasoning via active retrieval

Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Progressive multimodal reasoning via active retrieval. arXiv preprint arXiv:2412.14835, 2024. URL https://arxiv.org/abs/2412.14835

arXiv 2024

[52] [52]

Diverse inference and verification for advanced reasoning

Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, et al. Diverse inference and verification for advanced reasoning. arXiv preprint arXiv:2502.09955, 2025. URL https://arxiv.org/abs/2502.09955

arXiv 2025

[53] [53]

Competitive programming with large reasoning models

Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025. URL https://arxiv.org/abs/2502.06807

arXiv 2025

[54] [54]

Multinrc: A challenging and native multilingual reasoning evaluation benchmark for llms

Alexander R Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, and Chen Xing. Multinrc: A challenging and native multilingual reasoning evaluation benchmark for llms. arXiv preprint arXiv:2507.17476, 2025

arXiv 2025

[55] [55]

Slam: Towards efficient multilingual reasoning via selective language alignment

Yuchun Fan, Yongyu Mu, Yilin Wang, Lei Huang, Junhao Ruan, Bei Li, Tong Xiao, Shujian Huang, Xiaocheng Feng, and Jingbo Zhu. Slam: Towards efficient multilingual reasoning via selective language alignment. arXiv preprint arXiv:2501.03681, 2025. URL https://aclanthology.org/2025.coling-main.637/

arXiv 2025

[56] [56]

Synworld: Virtual scenario synthesis for agentic action knowledge refinement

Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, et al. Synworld: Virtual scenario synthesis for agentic action knowledge refinement. arXiv preprint arXiv:2504.03561, 2025. URL https://arxiv.org/abs/2504.03561

arXiv 2025

[57] [57]

Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs

Tianqing Fang, Zeming Chen, Yangqiu Song, and Antoine Bosselut. Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 11365--11384, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2...

work page doi:10.18653/v1/2024.acl-long.613 2024

[58] [58]

Test of time: A benchmark for evaluating llms on temporal reasoning

Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, and Bryan Perozzi. Test of time: A benchmark for evaluating llms on temporal reasoning. arXiv preprint arXiv:2406.09170, 2024. URL https://arxiv.org/abs/2406.09170

arXiv 2024

[59] [59]

Video-r1: Reinforcing video reasoning in mllms

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025. URL https://arxiv.org/abs/2503.21776

Pith/arXiv arXiv 2025

[60] [60]

Rag foundry: A framework for enhancing llms for retrieval augmented generation

Daniel Fleischer, Moshe Berchansky, Moshe Wasserblat, and Peter Izsak. Rag foundry: A framework for enhancing llms for retrieval augmented generation. arXiv preprint arXiv:2408.02545, 2024. URL https://arxiv.org/abs/2408.02545

arXiv 2024

[61] [61]

SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models

Daniel Fleischer, Moshe Berchansky, Gad Markovits, and Moshe Wasserblat. SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models . arXiv preprint arXiv:2502.09390, 2025. doi:10.48550/arXiv.2502.09390. URL https://arxiv.org/abs/2502.09390

work page doi:10.48550/arxiv.2502.09390 2025

[62] [62]

Re-adaptir: Improving information retrieval through reverse engineered adaptation

William Fleshman and Benjamin Van Durme. Re-adaptir: Improving information retrieval through reverse engineered adaptation. arXiv preprint arXiv:2406.14764, 2024. URL https://arxiv.org/abs/2406.14764

arXiv 2024

[63] [63]

Multiple choice questions: Reasoning makes large language models (llms) more self-confident even when they are wrong

Tairan Fu, Javier Conde, Gonzalo Mart \' nez, Mar \' a Grandury, and Pedro Reviriego. Multiple choice questions: Reasoning makes large language models (llms) more self-confident even when they are wrong. arXiv preprint arXiv:2501.09775, 2025. URL https://arxiv.org/abs/2501.09775

Pith/arXiv arXiv 2025

[64] [64]

Efficiently serving llm reasoning programs with certaindex

Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, and Hao Zhang. Efficiently serving llm reasoning programs with certaindex. arXiv preprint arXiv:2412.20993, 2024. URL https://arxiv.org/abs/2412.20993

arXiv 2024

[65] [65]

I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders

Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y Rogov, Elena Tutubalina, and Ivan Oseledets. I have covered all the bases here: Interpreting reasoning features in large language models via sparse autoencoders. arXiv preprint arXiv:2503.18878, 2025. URL https://arxiv.org/abs/2503.18878

arXiv 2025

[66] [66]

Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts

Chunjing Gan, Dan Yang, Binbin Hu, Hanxiao Zhang, Siyuan Li, Ziqi Liu, Yue Shen, Lin Ju, Zhiqiang Zhang, Jinjie Gu, Lei Liang, and Jun Zhou. Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts . arXiv preprint arXiv:2405.19893, 2024. doi:10.48550/arXiv.2405.19893. URL https://arxiv.org/abs/2405.19893

work page doi:10.48550/arxiv.2405.19893 2024

[67] [67]

Understanding social reasoning in language models with language models

Kanishk Gandhi, Jan-Philipp Fr \"a nken, Tobias Gerstenberg, and Noah Goodman. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 36: 0 13518--13529, 2023. URL https://arxiv.org/abs/2306.15448

arXiv 2023

[68] [68]

Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025. URL https://arxiv.org/abs/2503.01307

Pith/arXiv arXiv 2025

[69] [69]

Efficient tool use with chain-of-abstraction reasoning

Silin Gao, Jane Dwivedi-Yu, Ping Yu, Xiaoqing Ellen Tan, Ramakanth Pasunuru, Olga Golovneva, Koustuv Sinha, Asli Celikyilmaz, Antoine Bosselut, and Tianlu Wang. Efficient tool use with chain-of-abstraction reasoning. arXiv preprint arXiv:2401.17464, 2024. URL https://arxiv.org/abs/2401.17464

arXiv 2024

[70] [70]

Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues

Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, and Yi Zhang. Tremu: Towards neuro-symbolic temporal reasoning for llm-agents with memory in multi-session dialogues. arXiv preprint arXiv:2502.01630, 2025. URL https://arxiv.org/abs/2502.01630

arXiv 2025

[71] [71]

Scaling up test-time compute with latent reasoning: A recurrent depth approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. URL https://arxiv.org/abs/2502.05171

Pith/arXiv arXiv 2025

[72] [72]

Shortcut learning in deep neural networks

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2 0 (11): 0 665--673, 2020

2020

[73] [73]

Large language models are not strong abstract reasoners

Ga \"e l Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are not strong abstract reasoners. arXiv preprint arXiv:2305.19555, 2023

arXiv 2023

[74] [74]

The multilingual mind: A survey of multilingual reasoning in language models

Akash Ghosh, Debayan Datta, Sriparna Saha, and Chirag Agarwal. The multilingual mind: A survey of multilingual reasoning in language models. arXiv preprint arXiv:2502.09457, 2025. URL https://arxiv.org/abs/2502.09457

arXiv 2025

[75] [75]

Compositional preference models for aligning lms

Dongyoung Go, Tomasz Korbak, Germ \'a n Kruszewski, Jos Rozen, and Marc Dymetman. Compositional preference models for aligning lms. arXiv preprint arXiv:2310.13011, 2023. URL https://arxiv.org/abs/2310.13011

arXiv 2023

[76] [76]

Great models think alike and this undermines ai oversight

Shashwat Goel, Joschka Struber, Ilze Amanda Auzina, Karuna K Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, and Jonas Geiping. Great models think alike and this undermines ai oversight. arXiv preprint arXiv:2502.04313, 2025. URL https://arxiv.org/abs/2502.04313

arXiv 2025

[77] [77]

Towards an ai co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025. URL https://arxiv.org/abs/2502.18864

Pith/arXiv arXiv 2025

[78] [78]

Large language models orchestrating structured reasoning achieve kaggle grandmaster level

Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, et al. Large language models orchestrating structured reasoning achieve kaggle grandmaster level. arXiv preprint arXiv:2411.03562, 2024. URL https://arxiv.org/a...

arXiv 2024

[79] [79]

Cruxeval: A benchmark for code reasoning, understanding and execution

Alex Gu, Baptiste Rozi \`e re, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024. URL https://arxiv.org/abs/2401.03065

Pith/arXiv arXiv 2024

[80] [80]

Auditing prompt caching in language model apis

Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. Auditing prompt caching in language model apis. arXiv preprint arXiv:2502.07776, 2025. URL https://arxiv.org/abs/2502.07776

arXiv 2025