Recognition: no theorem link
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
Pith reviewed 2026-05-13 20:26 UTC · model grok-4.3
The pith
GrandCode is the first AI system to place first in live Codeforces contests, beating all human participants including grandmasters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming. In the most recent three Codeforces live competitions, Round 1087, Round 1088, and Round 1089, GrandCode placed first in all of them, beating all human participants including legendary grandmasters. The performance stems from orchestrating multiple agentic modules and jointly improving them through post-training and online test-time RL, using the Agentic GRPO method designed for multi-stage rollouts with delayed rewards.
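The paper does not publish its module interfaces, but the loop it describes — propose a hypothesis, write a solution, generate adversarial tests, summarize failures, and retry — can be sketched as below. Every function name and signature here is an illustrative assumption for exposition, not GrandCode's actual API.

```python
# Illustrative sketch of the agentic pipeline described in the abstract.
# The module interfaces (propose, write_solution, generate_tests, summarize)
# are hypothetical; the paper does not publish them.

def solve_problem(statement, propose, write_solution, generate_tests,
                  summarize, max_rounds=3):
    """Iterate hypothesis -> solution -> self-test until all tests pass."""
    notes = ""
    for _ in range(max_rounds):
        hypothesis = propose(statement, notes)        # candidate algorithmic idea
        code = write_solution(statement, hypothesis)  # concrete implementation
        tests = generate_tests(statement, hypothesis) # self-generated test cases
        failures = [t for t in tests if not t.passes(code)]
        if not failures:
            return code                               # candidate ready to submit
        notes = summarize(hypothesis, code, failures) # feed failures forward
    return code                                       # best effort after budget
```

The point of the sketch is the credit-assignment problem it creates: the only reliable reward (verdict on hidden tests) arrives after many module calls, which is exactly the delayed-reward setting Agentic GRPO is said to target.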
What carries the argument
A multi-agent reinforcement learning architecture that coordinates modules for hypothesis proposal, solution writing, test generation, and summarization, trained jointly with post-training and online test-time RL via the Agentic GRPO algorithm to manage delayed rewards and off-policy drift.
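As a reading aid, here is a minimal sketch of the group-relative advantage at the heart of GRPO-style methods, with the single delayed terminal reward broadcast to every stage of a multi-stage rollout. This follows the published GRPO recipe (DeepSeekMath); the paper's actual Agentic GRPO estimator, including its handling of off-policy drift, is not specified in the text above, so treat this as an assumption-laden illustration rather than the authors' method.

```python
# Sketch of a GRPO-style group-relative advantage with one delayed terminal
# reward per rollout. How Agentic GRPO actually distributes credit across
# stages is not public; uniform broadcast is an assumption.
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize terminal rewards within a group of rollouts for one problem."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_stages(advantage, num_stages):
    """Assign the same normalized advantage to every stage of a rollout,
    since the reward only arrives after the final stage."""
    return [advantage] * num_stages

# Example: four rollouts of one problem; reward = pass/fail on hidden tests.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rollout_rewards)
per_stage = [broadcast_to_stages(a, num_stages=4) for a in advs]
```

The group baseline removes the need for a learned value function, which is what makes the recipe attractive when rewards are sparse and rollouts are long.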
If this is right
- AI systems can now surpass the strongest human programmers on the most competitive coding tasks.
- Multi-agent setups with tailored RL algorithms enable reliable performance on problems that require sequential reasoning and self-verification.
- Online test-time reinforcement learning allows models to adapt during live, time-constrained evaluations.
- Agentic methods can close the gap between offline training and real-world deployment in domains with sparse, delayed feedback.
Where Pith is reading between the lines
- The same multi-agent RL structure could transfer to other high-stakes domains such as algorithm design or automated software engineering where problems unfold in stages.
- If the approach scales, future systems might solve open-ended programming challenges that lack the tight structure of contest problems.
- Widespread adoption would shift the boundary between human and machine contributions in competitive and professional coding environments.
Load-bearing premise
The reported live-contest wins occurred under standard rules with no special access, data leakage, or post-contest adjustments, and the training process generalizes to entirely new problems without overfitting to prior contest distributions.
What would settle it
Independent verification of the contest submission logs and timing data from the three Codeforces rounds to confirm that GrandCode operated without external data or rule violations.
Original abstract
Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans in competitive programming: the most recent best result, Google's Gemini 3 Deep Think, attained 8th place, and even that was achieved outside live competition conditions. In this work, we introduce GrandCode, a multi-agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) it orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc.) and jointly improves them through post-training and online test-time RL; (2) we introduce Agentic GRPO, specifically designed for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming: in the most recent three Codeforces live competitions, i.e., Round 1087 (Mar 21, 2026), Round 1088 (Mar 28, 2026), and Round 1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GrandCode, a multi-agent RL system for competitive programming that orchestrates modules including hypothesis proposal, solver, test generator, and summarization. These are jointly improved via post-training and online test-time RL using a proposed Agentic GRPO method tailored for multi-stage rollouts with delayed rewards and off-policy drift. The central claim is that GrandCode achieved first place in three live Codeforces rounds (1087 on Mar 21, 2026; 1088 on Mar 28, 2026; and 1089 on Mar 29, 2026), outperforming all human participants including grandmasters and marking the first AI system to consistently surpass top humans in such contests.
Significance. If the live-contest results can be independently verified under standard rules, this would represent a substantial advance in AI capabilities for complex coding and reasoning, demonstrating that agentic RL systems can exceed the strongest human performance in real-time competitive programming. The Agentic GRPO formulation for handling delayed rewards in multi-agent settings could provide a useful technical contribution to RL methods for long-horizon tasks with sparse feedback.
Major comments (1)
- [Abstract] The headline claim of first-place finishes in Codeforces Rounds 1087, 1088, and 1089 is stated without any supporting performance tables, ablation studies, training curves, participant handles, submission timestamps, problem-solving traces, or verification that the multi-agent RL policy remained frozen and operated without data leakage or post-contest adjustments. This absence is load-bearing: the live-contest outcomes constitute the sole empirical support for the assertion that GrandCode surpasses all humans.
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. The major comment highlights an important presentational issue with how the live-contest results are introduced. We address this point directly below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The headline claim of first-place finishes in Codeforces Rounds 1087, 1088, and 1089 is stated without any supporting performance tables, ablation studies, training curves, participant handles, submission timestamps, problem-solving traces, or verification that the multi-agent RL policy remained frozen and operated without data leakage or post-contest adjustments. This absence is load-bearing: the live-contest outcomes constitute the sole empirical support for the assertion that GrandCode surpasses all humans.
Authors: We agree that the current abstract presents the headline claim without sufficient immediate supporting detail, which weakens the reader's ability to assess the central empirical result. In the revised manuscript we will expand the abstract to include a concise summary of the key performance metrics (e.g., final rankings and score margins versus the top human participants) together with explicit forward references to Section 4.1 (performance tables), Section 5 (ablation studies), Figure 3 (training curves), and Appendix B (verification protocol). We have also added the participant handles (GrandCode competed under the public handle 'GrandCode_RL'), exact submission timestamps, and anonymized problem-solving traces to the supplementary material. To address the frozen-policy and data-leakage concern, the revised text now states that a fixed policy checkpoint was taken 24 hours before each contest start, that no post-contest parameter updates or problem-specific fine-tuning occurred, and that all agent rollouts were executed in an isolated environment with timestamped logs. These additions make the live-contest evidence verifiable while preserving the abstract's brevity. revision: yes
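The frozen-checkpoint protocol the rebuttal describes would be straightforward to make externally auditable, e.g. by publishing a cryptographic digest of the policy checkpoint before each contest and letting third parties re-verify it afterwards. A minimal sketch follows; the file name is hypothetical, and the paper does not state that such hashes were actually published.

```python
# Sketch of an auditable frozen-checkpoint check: publish the SHA-256 of the
# policy checkpoint before the contest; after the contest, anyone can recompute
# it and confirm no parameter updates occurred. File names are hypothetical.
import hashlib

def checkpoint_digest(path, chunk_size=1 << 20):
    """SHA-256 of a checkpoint file, streamed to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Before the contest: publish checkpoint_digest("policy.ckpt").
# After the contest: a third party recomputes it and compares digests.
```

Combined with the timestamped submission logs the rebuttal promises, a pre-published digest is the cheapest way to make the no-post-contest-updates claim falsifiable.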
Circularity Check
No circularity: claim is empirical reporting with no derivations or self-referential steps
Full rationale
The manuscript presents its headline result as an empirical outcome from live Codeforces contests (Rounds 1087-1089) rather than any derivation, equation, or fitted prediction. No mathematical steps, ansatzes, uniqueness theorems, or self-citations appear in the supplied text that could reduce the claimed performance to an input by construction. The patterns of self-definitional claims, fitted inputs renamed as predictions, or load-bearing self-citations are absent; the result is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Forward citations
Cited by 1 Pith paper
- Learning CLI Agents with Structured Action Credit under Selective Observation: CLI agents trained with RL benefit from selective observation via σ-Reveal and structured credit assignment via A³ that leverages AST action sub-chains and trajectory margins.
Reference graph
Works this paper leans on
- [1] AlphaCode Team, Google DeepMind. AlphaCode 2 technical report. Technical report, Google DeepMind, 2023. URL https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
- [2] Anonymous. TTRL: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084, 2025.
- [3] Aaron Chan, Ahmed Shalaby, Alexander Wettig, et al. Composer 2 technical report. arXiv preprint arXiv:2603.24477, 2026.
- [4] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2025.
- [5] Zhihao Dou, Qinjian Zhao, and Sumon Biswas. Algoforge: Specializing code generation agents through collaborative reinforcement learning. OpenReview, 2025. ICLR 2026 withdrawn submission.
- [6] Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg Mürk, Rhythm Garg, Rui Shu, Szymon Sidor, Vineet Kosaraju, and Wenda Zhou. Competitive programming with large reasoning models. arXiv preprint arXiv:2502.06807, 2025.
- [7] Google DeepMind. Gemini 2.5: Our newest Gemini model with thinking. https://deepmind.google/blog/gemini-2-5-our-most-intelligent-ai-model/, 2025. Technical blog post.
- [8] Thomas Hubert, Rishi Mehta, David Silver, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 2025. doi: 10.1038/s41586-025-09833-y.
- [9] IOI. International Olympiad in Informatics (IOI). https://www.ioinformatics.org/, 2026. Accessed 2026-03-25.
- [10] Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, and Chuan Wu. DCP: Addressing input dynamism in long-context training via dynamic context parallelism. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pp. 221–236, 2025.
- [11] So Kuroki, Taishi Nakamura, Takuya Akiba, and Yujin Tang. Agent skill acquisition for large language models via CycleQD. arXiv preprint arXiv:2410.14735, 2024. ICLR 2025.
- [12] LeetCode. LeetCode. https://leetcode.com/, 2026. Accessed 2026-03-25.
- [13] Ge Li, Zhi Jin, Guang Liu, Chen Lyu, Zhihong Sun, Tao Huang, Bo-Wen Zhang, Jie Fu, and Rongao Li. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023.
- [14] Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, and Ruiming Tang. Humanity's last code exam: Can advanced LLMs conquer human's hardest code competition? arXiv preprint arXiv:2506.12713, 2025.
- [15] Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, and Chris Shum. CUDA-L1: Improving CUDA optimization via contrastive reinforcement learning. arXiv preprint arXiv:2507.14111, 2025.
- [16] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- [17] Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, and Mao Yang. rStar-Coder: Scaling competitive code reasoning with a large-scale verified dataset. arXiv preprint arXiv:2505.21297, 2025.
- [18] Llama Team, AI @ Meta. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [19] Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, and Binyuan Hui. Scaling agentic verifier for competitive coding. arXiv preprint arXiv:2602.04254, 2026.
- [20] Moonshot AI. Kimi k2.5 release. https://www.moonshot.ai/news/kimi-k2-5-release, 2025. Technical report.
- [21] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [22] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [23] OpenAI. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [24] OpenAI. OpenAI o3 and o4-mini system card. Technical report, 2025. URL https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
- [25] Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xiaoyin Chen, and Dzmitry Bahdanau. PipelineRL: Faster on-policy reinforcement learning for long sequence generation. arXiv preprint arXiv:2509.19128, 2025.
- [26] Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with human-comparable Elo ratings. arXiv preprint arXiv:2501.01257, 2025.
- [27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [28] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
- [29] Thinking Machines Lab. Tinker. https://github.com/thinking-machines-lab/tinker, 2025. Open-source training API.
- [30] THUDM. slime. https://github.com/THUDM/slime, 2026. Open-source reinforcement learning post-training framework.
- [31] USACO. USA Computing Olympiad (USACO). http://www.usaco.org/, 2026. Accessed 2026-03-25.
- [32] Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, and Ming Ding. Aeth...
- [33] Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, and Kai Shen. CodeContests+: High-quality test case generation for competitive programming. arXiv preprint arXiv:2506.05817, 2025.
- [34] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
- [35] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [36] Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, and Deyi Xiong. ProBench: Benchmarking large language models in competitive programming. arXiv preprint arXiv:2502.20868, 2025.
- [37] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint arXiv:2507.02259, 2025.
- [38] Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents. arXiv preprint arXiv:2601.01885, 2026.
- [39] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026.
- [40]
- [41] Zhipu AI and Tsinghua University KEG. GLM-4: Open multilingual multimodal chat LMs. https://github.com/zai-org/GLM-4, 2024. Project page.