Recognition: no theorem link
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
Pith reviewed 2026-05-13 20:26 UTC · model grok-4.3
The pith
GrandCode is the first AI system to place first in live Codeforces contests, beating all human participants including grandmasters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming. In the most recent three Codeforces live competitions, Round 1087, Round 1088, and Round 1089, GrandCode placed first in all of them, beating all human participants including legendary grandmasters. The performance stems from orchestrating multiple agentic modules and jointly improving them through post-training and online test-time RL, using the Agentic GRPO method designed for multi-stage rollouts with delayed rewards.
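The paper does not publish its module interfaces, but the loop it describes — propose a hypothesis, write a solution, generate adversarial tests, summarize failures, and retry — can be sketched as below. Every function name and signature here is an illustrative assumption for exposition, not GrandCode's actual API.

```python
# Illustrative sketch of the agentic pipeline described in the abstract.
# The module interfaces (propose, write_solution, generate_tests, summarize)
# are hypothetical; the paper does not publish them.

def solve_problem(statement, propose, write_solution, generate_tests,
                  summarize, max_rounds=3):
    """Iterate hypothesis -> solution -> self-test until all tests pass."""
    notes = ""
    for _ in range(max_rounds):
        hypothesis = propose(statement, notes)        # candidate algorithmic idea
        code = write_solution(statement, hypothesis)  # concrete implementation
        tests = generate_tests(statement, hypothesis) # self-generated test cases
        failures = [t for t in tests if not t.passes(code)]
        if not failures:
            return code                               # candidate ready to submit
        notes = summarize(hypothesis, code, failures) # feed failures forward
    return code                                       # best effort after budget
```

The point of the sketch is the credit-assignment problem it creates: the only reliable reward (verdict on hidden tests) arrives after many module calls, which is exactly the delayed-reward setting Agentic GRPO is said to target.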
What carries the argument
A multi-agent reinforcement learning architecture that coordinates modules for hypothesis proposal, solution writing, test generation, and summarization, trained jointly with post-training and online test-time RL via the Agentic GRPO algorithm to manage delayed rewards and off-policy drift.
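As a reading aid, here is a minimal sketch of the group-relative advantage at the heart of GRPO-style methods, with the single delayed terminal reward broadcast to every stage of a multi-stage rollout. This follows the published GRPO recipe (DeepSeekMath); the paper's actual Agentic GRPO estimator, including its handling of off-policy drift, is not specified in the text above, so treat this as an assumption-laden illustration rather than the authors' method.

```python
# Sketch of a GRPO-style group-relative advantage with one delayed terminal
# reward per rollout. How Agentic GRPO actually distributes credit across
# stages is not public; uniform broadcast is an assumption.
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within a group of rollouts for one problem."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_stages(advantage, num_stages):
    """Assign the same normalized advantage to every stage of a rollout,
    since the reward only arrives after the final stage."""
    return [advantage] * num_stages

# Example: four rollouts of one problem; reward = pass/fail on hidden tests.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rollout_rewards)
per_stage = [broadcast_to_stages(a, num_stages=4) for a in advs]
```

The group baseline removes the need for a learned value function, which is what makes the recipe attractive when rewards are sparse and rollouts are long.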
If this is right
- AI systems can now surpass the strongest human programmers on the most competitive coding tasks.
- Multi-agent setups with tailored RL algorithms enable reliable performance on problems that require sequential reasoning and self-verification.
- Online test-time reinforcement learning allows models to adapt during live, time-constrained evaluations.
- Agentic methods can close the gap between offline training and real-world deployment in domains with sparse, delayed feedback.
Where Pith is reading between the lines
- The same multi-agent RL structure could transfer to other high-stakes domains such as algorithm design or automated software engineering where problems unfold in stages.
- If the approach scales, future systems might solve open-ended programming challenges that lack the tight structure of contest problems.
- Widespread adoption would shift the boundary between human and machine contributions in competitive and professional coding environments.
Load-bearing premise
The reported live-contest wins occurred under standard rules with no special access, data leakage, or post-contest adjustments, and the training process generalizes to entirely new problems without overfitting to prior contest distributions.
What would settle it
Independent verification of the contest submission logs and timing data from the three Codeforces rounds to confirm that GrandCode operated without external data or rule violations.
Original abstract
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini 3 Deep Think, attained 8th place, and even that was achieved outside live competition conditions. In this work, we introduce GrandCode, a multi-agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) it orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc.) and jointly improves them through post-training and online test-time RL; (2) we introduce Agentic GRPO, specifically designed for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming: in the most recent three Codeforces live competitions, i.e., Round 1087 (Mar 21, 2026), Round 1088 (Mar 28, 2026), and Round 1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GrandCode, a multi-agent RL system for competitive programming that orchestrates modules including hypothesis proposal, solver, test generator, and summarization. These are jointly improved via post-training and online test-time RL using a proposed Agentic GRPO method tailored for multi-stage rollouts with delayed rewards and off-policy drift. The central claim is that GrandCode achieved first place in three live Codeforces rounds (1087 on Mar 21, 2026; 1088 on Mar 28, 2026; and 1089 on Mar 29, 2026), outperforming all human participants including grandmasters and marking the first AI system to consistently surpass top humans in such contests.
Significance. If the live-contest results can be independently verified under standard rules, this would represent a substantial advance in AI capabilities for complex coding and reasoning, demonstrating that agentic RL systems can exceed the strongest human performance in real-time competitive programming. The Agentic GRPO formulation for handling delayed rewards in multi-agent settings could provide a useful technical contribution to RL methods for long-horizon tasks with sparse feedback.
Major comments (1)
- [Abstract] The headline claim of first-place finishes in Codeforces Rounds 1087, 1088, and 1089 is stated without any supporting performance tables, ablation studies, training curves, participant handles, submission timestamps, problem-solving traces, or verification that the multi-agent RL policy remained frozen and operated without data leakage or post-contest adjustments. This absence is load-bearing: the live-contest outcomes constitute the sole empirical support for the assertion that GrandCode surpasses all humans.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The major comment highlights an important presentational issue with how the live-contest results are introduced. We address this point directly below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The headline claim of first-place finishes in Codeforces Rounds 1087, 1088, and 1089 is stated without any supporting performance tables, ablation studies, training curves, participant handles, submission timestamps, problem-solving traces, or verification that the multi-agent RL policy remained frozen and operated without data leakage or post-contest adjustments. This absence is load-bearing: the live-contest outcomes constitute the sole empirical support for the assertion that GrandCode surpasses all humans.
Authors: We agree that the current abstract presents the headline claim without sufficient immediate supporting detail, which weakens the reader's ability to assess the central empirical result. In the revised manuscript we will expand the abstract to include a concise summary of the key performance metrics (e.g., final rankings and score margins versus the top human participants) together with explicit forward references to Section 4.1 (performance tables), Section 5 (ablation studies), Figure 3 (training curves), and Appendix B (verification protocol). We have also added the participant handles (GrandCode competed under the public handle 'GrandCode_RL'), exact submission timestamps, and anonymized problem-solving traces to the supplementary material. To address the frozen-policy and data-leakage concern, the revised text now states that a fixed policy checkpoint was taken 24 hours before each contest start, that no post-contest parameter updates or problem-specific fine-tuning occurred, and that all agent rollouts were executed in an isolated environment with timestamped logs. These additions make the live-contest evidence verifiable while preserving the abstract's brevity. revision: yes
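The frozen-checkpoint protocol the rebuttal describes would be straightforward to make externally auditable, e.g. by publishing a cryptographic digest of the policy checkpoint before each contest and letting third parties re-verify it afterwards. A minimal sketch follows; the file name is hypothetical, and the paper does not state that such hashes were actually published.

```python
# Sketch of an auditable frozen-checkpoint check: publish the SHA-256 of the
# policy checkpoint before the contest; after the contest, anyone can recompute
# it and confirm no parameter updates occurred. File names are hypothetical.
import hashlib

def checkpoint_digest(path, chunk_size=1 << 20):
    """SHA-256 of a checkpoint file, streamed to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Before the contest: publish checkpoint_digest("policy.ckpt").
# After the contest: a third party recomputes it and compares digests.
```

Combined with the timestamped submission logs the rebuttal promises, a pre-published digest is the cheapest way to make the no-post-contest-updates claim falsifiable.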
Circularity Check
No circularity: claim is empirical reporting with no derivations or self-referential steps
Full rationale
The manuscript presents its headline result as an empirical outcome from live Codeforces contests (Rounds 1087-1089) rather than any derivation, equation, or fitted prediction. No mathematical steps, ansatzes, uniqueness theorems, or self-citations appear in the supplied text that could reduce the claimed performance to an input by construction. The patterns of self-definitional claims, fitted inputs renamed as predictions, or load-bearing self-citations are absent; the result is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Forward citations
Cited by 1 Pith paper
- Learning CLI Agents with Structured Action Credit under Selective Observation: CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
Reference graph
Works this paper leans on
- [1] AlphaCode Team, Google DeepMind. AlphaCode 2 technical report. Technical report, Google DeepMind, 2023. URL https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
- [2] Anonymous. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.
- [3] Aaron Chan, Ahmed Shalaby, Alexander Wettig, et al. Composer 2 technical report. arXiv preprint arXiv:2603.24477, 2026.
- [4] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2025.
- [5] Zhihao Dou, Qinjian Zhao, and Sumon Biswas. Algoforge: Specializing code generation agents through collaborative reinforcement learning. OpenReview, 2025. ICLR 2026 withdrawn submission.
- [6] Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, and Wenda Zhou. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025.
- [7] Google DeepMind. Gemini 2.5: Our newest Gemini model with thinking. https://deepmind.google/blog/gemini-2-5-our-most-intelligent-ai-model/, 2025. Technical blog post.
- [8] Thomas Hubert, Rishi Mehta, David Silver, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 2025. doi: 10.1038/s41586-025-09833-y.
- [9] IOI. International Olympiad in Informatics (IOI). https://www.ioinformatics.org/, 2026. Accessed 2026-03-25.
- [10] Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, and Chuan Wu. DCP: Addressing input dynamism in long-context training via dynamic context parallelism. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pp. 221–236, 2025.
- [11] So Kuroki, Taishi Nakamura, Takuya Akiba, and Yujin Tang. Agent skill acquisition for large language models via CycleQD. arXiv preprint arXiv:2410.14735, 2024. ICLR 2025.
- [12] LeetCode. LeetCode. https://leetcode.com/, 2026. Accessed 2026-03-25.
- [13] Ge Li, Zhi Jin, Guang Liu, Chen Lyu, Zhihong Sun, Tao Huang, Bo-Wen Zhang, Jie Fu, and Rongao Li. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023.
- [14] Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, and Ruiming Tang. Humanity's last code exam: Can advanced LLMs conquer human's hardest code competition? arXiv preprint arXiv:2506.12713, 2025.
- [15] Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L1: Improving CUDA optimization via contrastive reinforcement learning. arXiv preprint arXiv:2507.14111, 2025.
- [16] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [17] Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, and Mao Yang. rStar-Coder: Scaling competitive code reasoning with a large-scale verified dataset. arXiv preprint arXiv:2505.21297, 2025.
- [18] Llama Team, AI @ Meta. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [19] Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, and Binyuan Hui. Scaling agentic verifier for competitive coding. arXiv preprint arXiv:2602.04254, 2026.
- [20] Moonshot AI. Kimi k2.5 release. https://www.moonshot.ai/news/kimi-k2-5-release, 2025. Technical report.
- [21] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [22] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [23] OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [24] OpenAI. OpenAI o3 and o4-mini system card. Technical report, 2025. URL https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
- [25] Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, and Dzmitry Bahdanau. PipelineRL: Faster on-policy reinforcement learning for long sequence generation. arXiv preprint arXiv:2509.19128, 2025.
- [26] Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings. arXiv preprint arXiv:2501.01257, 2025.
- [27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [28] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
- [29] Thinking Machines Lab. Tinker. https://github.com/thinking-machines-lab/tinker, 2025. Open-source training API.
- [30] THUDM. slime. https://github.com/THUDM/slime, 2026. Open-source reinforcement learning post-training framework.
- [31] USACO. USA Computing Olympiad (USACO). http://www.usaco.org/, 2026. Accessed 2026-03-25.
- [32] Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, and Ming Ding. Aeth...
- [33] Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming. arXiv preprint arXiv:2506.05817, 2025.
- [34] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
- [35] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [36] Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, and Deyi Xiong. ProBench: Benchmarking large language models in competitive programming. arXiv preprint arXiv:2502.20868, 2025.
- [37] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259, 2025.
- [38] Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.
- [39] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
- [40]
- [41] Zhipu AI and Tsinghua University KEG. GLM-4: Open multilingual multimodal chat LMs. https://github.com/zai-org/GLM-4, 2024. Project page.