Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Pith reviewed 2026-05-16 20:54 UTC · model grok-4.3
The pith
Rule-based RL on 5K logic puzzles induces reflection and verification in a 7B model, and these skills transfer to AIME and AMC.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 7B model trained via rule-based reinforcement learning on 5,000 synthetic logic puzzles acquires advanced reasoning skills such as reflection, verification, and summarization that do not appear in the logic corpus; these skills enable measurable generalization to the AIME and AMC mathematics competitions.
What carries the argument
Rule-based reinforcement learning driven by a stringent format reward that penalizes shortcut outputs, together with a system prompt that requires explicit thinking before answering.
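A minimal sketch of what such a rule-based reward could look like, assuming the think-then-answer tag convention the abstract describes; the tag names and numeric scores here are illustrative placeholders, not the paper's exact specification.

```python
import re

# Illustrative rule-based reward in the spirit described above. The <think>/<answer>
# tag convention and the numeric scores are assumptions for this sketch, not the
# paper's exact values.
THINK_RE = re.compile(r"<think>(.+?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """Penalize shortcut outputs that skip the explicit thinking or answer block."""
    if THINK_RE.search(completion) and ANSWER_RE.search(completion):
        return 1.0
    return -1.0

def answer_reward(completion: str, gold: str) -> float:
    """Exact-match check against the puzzle's mechanically verifiable solution."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0
    return 2.0 if match.group(1).strip().lower() == gold.strip().lower() else -1.0

def rule_based_reward(completion: str, gold: str) -> float:
    return format_reward(completion) + answer_reward(completion, gold)
```

Because both components reduce to string matching, the reward needs no learned judge, which is what makes the synthetic logic-puzzle setting cheap to verify at scale.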
If this is right
- Small synthetic datasets can suffice to induce transferable reasoning in language models.
- Strict format rewards stabilize RL training by reducing degenerate outputs.
- Logic puzzles function as effective training proxies for developing math-level reasoning.
- Reflection and verification emerge as by-products of outcome-based RL even when absent from the data.
Where Pith is reading between the lines
- The same recipe may extend to other structured domains such as code generation or scientific reasoning where verification is cheap.
- Interactions between the base model's pre-trained knowledge and the RL signal likely amplify the observed generalization.
- Increasing the volume of logic puzzles or varying their complexity could further strengthen transfer to harder benchmarks.
Load-bearing premise
The advanced reasoning behaviors are produced by the reinforcement learning updates rather than already present in the base model or triggered by the system prompt alone.
What would settle it
Run the identical 7B model on the same logic puzzles using only the system prompt and no reinforcement learning updates, then measure performance on AIME and AMC; comparable scores would indicate the skills were not induced by RL.
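A sketch of that control under stated assumptions: benchmark items arrive as (problem, gold answer) pairs, `generate_fn` wraps whatever inference stack serves each checkpoint, and the prompt wording is a placeholder rather than the paper's.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Placeholder system prompt; the paper's actual wording may differ.
SYSTEM_PROMPT = ("Think through the problem inside <think>...</think>, "
                 "then give only the final answer inside <answer>...</answer>.")

def extract_answer(text: str) -> Optional[str]:
    m = re.search(r"<answer>(.+?)</answer>", text, re.DOTALL)
    return m.group(1).strip() if m else None

def accuracy(generate_fn: Callable[[str], str],
             benchmark: Iterable[Tuple[str, str]]) -> float:
    """Score one checkpoint (base or RL-trained) on AIME/AMC-style items."""
    items = list(benchmark)
    correct = sum(
        extract_answer(generate_fn(f"{SYSTEM_PROMPT}\n\n{problem}")) == gold
        for problem, gold in items
    )
    return correct / len(items)

# If accuracy(base_generate, aime) is close to accuracy(rl_generate, aime), the
# prompt alone explains the result; a large gap attributes it to the RL updates.
```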
Original abstract
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in large reasoning models. To analyze reasoning dynamics, we use synthetic logic puzzles as training data due to their controllable complexity and straightforward answer verification. We make some key technical contributions that lead to effective and stable RL training: a system prompt that emphasizes the thinking and answering process, a stringent format reward function that penalizes outputs for taking shortcuts, and a straightforward training recipe that achieves stable convergence. Our 7B model develops advanced reasoning skills, such as reflection, verification, and summarization, that are absent from the logic corpus. Remarkably, after training on just 5K logic problems, it demonstrates generalization abilities to the challenging math benchmarks AIME and AMC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Logic-RL, a rule-based RL framework that trains a 7B LLM on 5K synthetic logic puzzles using a system prompt emphasizing the thinking/answering process and a stringent format reward that penalizes shortcuts. It claims this induces advanced reasoning behaviors (reflection, verification, summarization) absent from the training corpus and yields generalization to AIME and AMC math benchmarks.
Significance. If the attribution to RL holds, the result would indicate that rule-based RL on small, verifiable synthetic logic data can elicit complex reasoning capabilities and cross-domain generalization in LLMs, offering a data-efficient path distinct from large-scale math pretraining. The controllable complexity of the logic puzzles is a methodological strength for studying reasoning dynamics.
Major comments (2)
- [Experiments / Results] The central claim that RL induces reflection, verification, and summarization (and thereby AIME/AMC generalization) requires isolating the RL contribution from the base 7B model and the system prompt. No base-model performance under the identical prompt and format-reward constraints is reported, leaving the attribution unsupported.
- [Results] The AIME/AMC generalization results are presented without training curves, reward-component ablations, multiple random seeds, or statistical significance tests. This absence prevents assessment of whether the reported gains are robust or sensitive to prompt engineering alone.
Minor comments (1)
- [Abstract] The abstract refers to 'some key technical contributions' without enumerating them; an explicit bullet list would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to provide the requested controls and robustness analyses.
Point-by-point responses
- Referee: [Experiments / Results] The central claim that RL induces reflection, verification, and summarization (and thereby AIME/AMC generalization) requires isolating the RL contribution from the base 7B model and the system prompt. No base-model performance under the identical prompt and format-reward constraints is reported, leaving the attribution unsupported.
Authors: We agree that isolating the RL contribution is essential. In the revised manuscript we have added a direct baseline: the unmodified 7B model evaluated under the exact same system prompt and format-reward constraints (no RL updates). These results show the base model solves <5% of the logic puzzles and exhibits no AIME/AMC generalization, while the RL-trained model reaches the reported performance and displays the described reasoning behaviors. The comparison is now included in Section 4.2 and Table 3. revision: yes
- Referee: [Results] The AIME/AMC generalization results are presented without training curves, reward-component ablations, multiple random seeds, or statistical significance tests. This absence prevents assessment of whether the reported gains are robust or sensitive to prompt engineering alone.
Authors: We appreciate the request for robustness evidence. The revision adds (i) full training curves for both reward components and validation accuracy, (ii) ablations that isolate the format reward and answer reward, (iii) results averaged over three random seeds with standard deviations, and (iv) paired t-tests confirming statistical significance of the AIME/AMC gains. These appear in the new Figure 4, updated Table 2, and Appendix C. revision: yes
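A small sketch of the kind of robustness reporting requested here, using entirely hypothetical per-seed accuracies; `scipy.stats.ttest_rel` performs the paired test, with runs paired by seed.

```python
import statistics
from scipy.stats import ttest_rel

# Hypothetical per-seed benchmark accuracies (illustrative numbers only).
baseline_acc = [0.10, 0.12, 0.11]  # prompt-only base model, seeds 0-2
rl_acc = [0.27, 0.30, 0.29]        # RL-trained model, same seeds

def report(name: str, scores: list[float]) -> None:
    print(f"{name}: {statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f}")

report("baseline", baseline_acc)
report("RL-trained", rl_acc)

# Paired t-test over seed-matched runs; with only a few seeds this is a weak test,
# so per-problem pairing or bootstrap intervals would strengthen the claim.
result = ttest_rel(rl_acc, baseline_acc)
print(f"paired t-test: statistic={result.statistic:.2f}, p={result.pvalue:.4f}")
```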
Circularity Check
No circularity: the empirical generalization claim is a measured outcome, not a tautological derivation
Full rationale
The paper reports an empirical training run: rule-based RL on 5K synthetic logic puzzles produces a 7B model whose post-training performance on AIME/AMC is measured and reported. No derivation chain, equations, or fitted parameters are presented whose outputs reduce by construction to the inputs (no self-definitional scaling, no 'prediction' that is the training objective renamed, no uniqueness theorem imported from the same authors). The central claim is falsifiable external performance, not a logical identity. Absence of base-model ablations is an evidence gap, not a circularity; it does not make the reported numbers equivalent to the training data by definition.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 22 Pith papers
- Spurious Rewards: Rethinking Training Signals in RLVR
Spurious rewards in RLVR can produce large gains in mathematical reasoning for certain language models via GRPO's clipping bias amplifying pretraining behaviors like code reasoning.
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.
- Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
- Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
- Visual Perceptual to Conceptual First-Order Rule Learning Networks
γILP is a differentiable pipeline for inducing first-order rules from unlabeled image data, showing strong performance on symbolic relational datasets, relational images, and pure image datasets such as Kandinsky patterns.
- Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
- Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
- Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
- SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
- ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
- Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.
- SPHINX: A Synthetic Environment for Visual Perception and Reasoning
SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.
- WebThinker: Empowering Large Reasoning Models with Deep Research Capability
WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.
- ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
- Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering
IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.
- OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.
- OneThinker: All-in-one Reasoning Model for Image and Video
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
- Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
- From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Reference graph
Works this paper leans on
- [1]
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025
work page 2025
- [2]
Training verifiers to solve math word problems, 2021
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021
work page 2021
- [3]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
work page 2025
- [4]
Alphazero-like tree-search can guide large language model decoding and training, 2024
Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024
work page 2024
- [5]
Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024
work page 2024
- [6]
Interpretable contrastive monte carlo tree search reasoning, 2024
Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen. Interpretable contrastive monte carlo tree search reasoning, 2024
work page 2024
- [7]
Reinforce++: A simple and efficient approach for aligning large language models, 2025
Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025
work page 2025
- [8]
Let’s verify step by step, 2023
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023
work page 2023
- [9]
There may not be aha moment in r1-zero-like training — a pilot study
Zichen Liu, Changyu Chen, Wenjun Li, Tianyu Pang, Chao Du, and Min Lin. There may not be aha moment in r1-zero-like training — a pilot study. https://oatllm.notion.site/oat-zero, 2025. Notion Blog
work page 2025
- [10]
OpenAI. Introducing openai o1. https://openai.com/o1/, 2024. Accessed: 2024-10-02
work page 2024
- [11]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53728–53741. Curran Assoc...
work page 2023
- [12]
Proximal policy optimization algorithms, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017
work page 2017
- [13]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024
work page 2024
- [14]
Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017
work page 2017
- [15]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...
work page 2025
- [16]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Assoc...
work page 2022
- [17]
On memorization of large language models in logical reasoning, 2024
Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning, 2024
work page 2024
- [18]
Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024
work page 2024
- [19]
Towards large reasoning models: A survey of reinforced reasoning with large language models, 2025
Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models, 2025
work page 2025
- [20]
Haotian Xu. No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function, 2023
work page 2023
- [21]
Redstar: Does scaling long-cot data unlock better slow-reasoning systems?, 2025
Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, and Debing Zhang. Redstar: Does scaling long-cot data unlock better slow-reasoning systems?, 2025
work page 2025
- [22]
Qwen2.5-1m technical report, 2025
An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...
work page 2025
- [23]
Physics of language models: Part 2.1, grade-school math and the hidden reasoning process, 2024
Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.1, grade-school math and the hidden reasoning process, 2024
work page 2024
- [24]
I am not entirely sure; let’s re-check this step
Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, and Soroush Vosoughi. Working memory identifies reasoning limits in language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16896–16922, Miami, Florida, USA, November 2024. Association for Compu...
work page 2024
- [25]
If Samuel is a knight then Lily is a knave
Jack’s statement is: "If Samuel is a knight then Lily is a knave." . . . Now that we know Samuel is a knave, let’s analyze Samuel’s statement again: - Samuel’s statement: "If Jack is a knight then Lily is a knave." - An implication "If P then Q" is false only when P is true and Q is false. (Formula Application) Here, "If Jack is a knight then Lily is a kn...
- [26]
If Sophia is a knight then Evelyn is a knight
Benjamin: "If Sophia is a knight then Evelyn is a knight" - If Benjamin is a knight, then the statement must be true. Thus, if Sophia is a knight, Evelyn has to be a knight
- [27]
Benjamin is a knave or James is a knight
Michael: "Benjamin is a knave or James is a knight" . . . From Jacob's statement, if Jacob is a knight then Sophia must be a knight and Benjamin must be a knave. Reasoning in the manner of 卡文: - Suppose Jacob is a knight; then Jacob's statement "Benjamin is a knave and Sophia is a knight" is true. This means Benjamin is a knave and Sophia is a knight. - Since Sophia is a knight, by Benjamin's statement "If Sophia is a knight then Evelyn is also a knight", Evelyn is also a knight. . . . In summary, Sophia, Evelyn, ...