Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

Jianfei Gao; Kai Chen; Kai Liu; Peijie Dong; Qipeng Guo; Shaoting Zhang; Xiaowen Chu; Xinchen Xie

arxiv: 2606.11634 · v1 · pith:624IDQGHnew · submitted 2026-06-10 · 💻 cs.AI

Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen

Pith reviewed 2026-06-27 10:16 UTC · model grok-4.3

classification 💻 cs.AI

keywords sliding-window attention · reinforcement learning · math reasoning · supervised fine-tuning · attention mechanisms · large language models · policy adaptation

The pith

Reinforcement learning on self-generated trajectories narrows the accuracy gap between sliding-window and full self-attention models on math reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that converting a pretrained self-attention model to sliding-window attention via supervised fine-tuning leaves a noticeable performance shortfall on math reasoning benchmarks. The authors attribute part of this shortfall to a mismatch between the long-range dependencies present in standard SFT data and the local context limit imposed by sliding-window attention. They then apply on-policy reinforcement learning that optimizes trajectories generated directly under the sliding-window constraint, allowing the data distribution to adapt to the architecture. Experiments across mathematical reasoning benchmarks show that this second stage recovers much of the lost accuracy while retaining the linear complexity benefit. The central empirical claim is that reinforcement learning alters the viability assessment one would reach from conversion and supervised fine-tuning alone.

Core claim

After efficient conversion of a pretrained self-attention model to sliding-window attention through supervised fine-tuning, a performance gap remains on math reasoning tasks; subsequent on-policy reinforcement learning on self-generated trajectories under the sliding-window constraint substantially closes this gap and restores most of the accuracy while preserving linear-complexity inference.
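
To make the sliding-window constraint concrete, the sketch below builds a causal sliding-window mask in plain Python. The function names and the convention that a window of size `window` lets each query see itself plus the previous `window - 1` tokens are illustrative assumptions; the paper's exact masking convention and kernels are not given in this summary.

```python
from typing import List


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed convention (not taken from the paper): causal attention where
    each query sees itself and the previous `window - 1` tokens.
    """
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def full_causal_mask(seq_len: int) -> List[List[bool]]:
    """Standard causal self-attention: query i sees every key j <= i."""
    return sliding_window_mask(seq_len, max(seq_len, 1))


if __name__ == "__main__":
    # Print the mask for 6 tokens and a window of 3 ("x" = visible key).
    for row in sliding_window_mask(6, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

Within a single layer, any token further back than the window is invisible to the query. Stacked layers can still pass information across longer spans indirectly, which is one reason a reasoning trace written for full attention may be harder, but not impossible, for a sliding-window model to follow.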

What carries the argument

The two-stage SWARR recipe: supervised fine-tuning conversion followed by on-policy reinforcement learning policy adaptation on architecture-constrained trajectories.

If this is right

Sliding-window attention models become competitive for math reasoning without requiring pretraining of a new base model from scratch.
The linear complexity advantage of sliding-window attention is retained after the reinforcement learning stage.
On-policy reinforcement learning can adapt generated trajectories to better fit architectural constraints such as limited attention range.
The viability of sliding-window attention for math reasoning depends on the training regime rather than conversion alone.
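
The linear-complexity point in the list above can be checked by counting query-key pairs. The sketch below is a back-of-the-envelope count, not a throughput measurement; the helper names `sa_pairs`, `swa_pairs`, and `kv_cache_tokens` are hypothetical, and the window convention (each query sees itself plus at most `w - 1` earlier tokens) is an assumption.

```python
from typing import Optional


def sa_pairs(n: int) -> int:
    """Query-key pairs scored by causal full self-attention over n tokens."""
    return n * (n + 1) // 2


def swa_pairs(n: int, w: int) -> int:
    """Query-key pairs scored by causal sliding-window attention.

    Assumes each query attends to itself and at most w - 1 earlier tokens.
    """
    if w < 1:
        raise ValueError("window must be >= 1")
    warmup = min(n, w)  # the first `warmup` queries still see every earlier token
    return warmup * (warmup + 1) // 2 + (n - warmup) * w


def kv_cache_tokens(n: int, w: Optional[int] = None) -> int:
    """Keys/values kept per layer after n tokens (w=None means full attention).

    Assumes a sliding-window model only retains the most recent w tokens.
    """
    return n if w is None else min(n, w)


if __name__ == "__main__":
    n, w = 32768, 4096
    print(sa_pairs(n), swa_pairs(n, w), kv_cache_tokens(n), kv_cache_tokens(n, w))
```

Under these assumptions, the full-attention count grows quadratically in `n`, while the sliding-window count grows linearly once `n` exceeds `w`. Measured efficiency depends on kernels and hardware, which is what the paper's Figure 3 reports.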

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same adaptation pattern could be tested on other efficiency-oriented attention variants beyond sliding windows.
It implies that training distributions for constrained architectures should be co-optimized rather than taken from unconstrained models.
The approach may reduce the cost of exploring alternative attention mechanisms by starting from existing pretrained checkpoints.

Load-bearing premise

The accuracy gap after supervised fine-tuning arises in part from a data-architecture mismatch that reinforcement learning on self-generated trajectories can correct.

What would settle it

A controlled experiment in which sliding-window attention models continue to show a large accuracy deficit relative to self-attention models even after the same reinforcement learning stage on identical math reasoning benchmarks.

Figures

Figures reproduced from arXiv: 2606.11634 by Jianfei Gao, Kai Chen, Kai Liu, Peijie Dong, Qipeng Guo, Shaoting Zhang, Xiaowen Chu, Xinchen Xie.

**Figure 1.** Overview of the SWARR pipeline. Stage 1: Efficient conversion with architecture-agnostic SFT, avoiding costly pretraining. Because current SFT data are mainly collected for SA models, they may contain long-range dependencies that are difficult for SWA to model. Stage 2: RL policy adaptation under the SWA constraint, which may mitigate the resulting data-architecture mismatch. […] use efficient kernels such as …

**Figure 2.** RL training curves for SA and SWA models. Panel (left) reports accuracy as a function of RL steps, …

**Figure 3.** Efficiency comparison of SA and SWA4k. (a) Throughput as a function of context length. (b) Maximum …

**Figure 4.** Case study comparing trajectories from SA-RL-1300 and SWA2k-RL-1500 on AIME24 with magic-number injection, matched by length distribution. A.1.2 Case Study of Generated Trajectories: Here, we provide a qualitative case study of generated trajectories to further illustrate the differences between SA and SWA. Tracing long generations directly is difficult, so we introduce a simple visualization based on …

Original abstract

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RL after SFT conversion narrows the SWA gap on math reasoning, but the data-mismatch explanation lacks direct support.

The main takeaway is that a two-stage process, SFT conversion to sliding-window attention followed by on-policy RL, can recover much of the accuracy that SWA loses relative to full self-attention on math reasoning tasks.

The paper describes SWARR as a practical adaptation recipe that starts from an existing pretrained SA model, applies SFT to switch to SWA without new pretraining, and then uses RL to adapt the policy on self-generated trajectories. The empirical observation that RL changes the viability conclusion one would reach from SFT alone is the central new piece. This keeps the linear-complexity benefit while showing improved benchmark results.

The work is straightforward about the motivation and the two-stage structure. It focuses on a real efficiency issue in long-context reasoning models and offers a concrete way to adapt without discarding pretrained weights.

The soft spot is the mechanism. The authors argue that SFT data contains long-range dependencies that SWA cannot handle well, and that RL trajectories adapt to the constraint. Yet the experiments compare only final accuracies; they do not report dependency-span statistics, attention distances, or parse depths between the SFT corpus and the RL data. The lift could therefore come from generic RL effects rather than architecture-specific adaptation. The abstract claims substantial narrowing but the provided text gives no effect sizes or controls, so the strength of the result is hard to judge without the full tables.
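
One way to probe the mechanism would be to measure how much attention a full-attention model places on tokens that a sliding window would hide. The sketch below is a hypothetical diagnostic, not an analysis reported by the paper; the function names and the input format (a causal attention matrix as nested lists whose rows sum to 1) are assumptions.

```python
from typing import Sequence


def mass_beyond_window(attn: Sequence[Sequence[float]], window: int) -> float:
    """Average share of attention mass on keys outside the window.

    attn[i][j] is the weight query i puts on key j. Rows are assumed to be
    causal (zero for j > i) and to sum to 1. A key is "outside" when
    i - j >= window, matching a window that covers the query itself plus
    the previous window - 1 tokens.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if not attn:
        return 0.0
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(weight for j, weight in enumerate(row) if i - j >= window)
    return total / len(attn)


def mean_attention_distance(attn: Sequence[Sequence[float]]) -> float:
    """Attention-weighted mean of the distance i - j, averaged over queries."""
    if not attn:
        return 0.0
    total = 0.0
    for i, row in enumerate(attn):
        total += sum(weight * (i - j) for j, weight in enumerate(row) if j <= i)
    return total / len(attn)
```

Comparing these two statistics between SFT traces and trajectories sampled after RL would give a direct, if coarse, test of whether RL shortens the dependencies the model relies on. Attention weights are only a proxy for dependency, so a result either way would need corroboration.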

This is relevant for groups working on efficient inference for reasoning LLMs. A reader interested in attention variants or RL adaptation would get a usable recipe and a clear empirical signal. It deserves peer review to verify the numbers and see whether the mechanism claim can be tightened.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SWARR, a two-stage recipe for adapting sliding-window attention (SWA) models to mathematical reasoning: (1) efficient conversion of a pretrained self-attention (SA) model to SWA via supervised fine-tuning (SFT), and (2) on-policy reinforcement learning (RL) policy adaptation on self-generated trajectories. The central empirical claim is that SWA underperforms SA after SFT alone, which the authors hypothesize is caused in part by a data-architecture mismatch (long-range dependencies in SFT data), and that RL narrows this gap substantially by adapting trajectories to better match the SWA constraint, recovering much of the lost accuracy while preserving linear-complexity benefits. The key contribution is the finding that RL changes the viability conclusion one would draw from conversion and SFT alone.

Significance. If the empirical result holds with mechanistic support, the work would show that on-policy RL can adapt model behavior to architectural constraints such as limited attention span, enabling efficient attention variants for reasoning without full pretraining. This offers a practical path for scaling long-context inference in math and agentic tasks. The recipe is concrete and the empirical framing is falsifiable in principle, though the manuscript supplies no quantitative results, baselines, or dependency measurements in the provided text.

major comments (2)

[Abstract and Experiments] Abstract and experimental results section: the central claim that RL narrows the SWA-SA gap specifically because on-policy trajectories adapt to reduce long-range dependencies (the explicit weakest assumption) is unsupported; only final accuracies are compared, with no reported statistics on dependency spans, attention rollout distances, or parse-tree depths between SFT corpus and RL trajectories. This leaves open that any lift is due to generic RL optimization benefits independent of the architecture mismatch hypothesis.
[Abstract] Abstract: the soundness of the empirical finding cannot be assessed because the text supplies no quantitative results, error bars, baseline comparisons, dataset details, or magnitude of gap recovery, contradicting the claim that RL 'substantially narrows the gap' and 'recovers much of the accuracy lost'.

minor comments (1)

The abstract and introduction would benefit from explicit citation of the specific math reasoning benchmarks and the exact performance deltas (with standard errors) that support the viability conclusion.
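
The magnitude of gap recovery mentioned in the comments above can be reported as a single ratio. The sketch below uses a hypothetical helper, `gap_recovered`; it is one possible definition, not a metric taken from the paper, and the numbers in the usage line are made-up placeholders rather than reported results.

```python
def gap_recovered(acc_sa: float, acc_swa_sft: float, acc_swa_rl: float) -> float:
    """Fraction of the SA-vs-SWA accuracy gap closed by the RL stage.

    0.0 means RL did not help; 1.0 means SWA after RL matches the SA
    reference; values above 1.0 mean SWA after RL exceeds it.
    """
    gap = acc_sa - acc_swa_sft
    if gap <= 0:
        raise ValueError("no gap to recover: SWA after SFT is not below SA")
    return (acc_swa_rl - acc_swa_sft) / gap


if __name__ == "__main__":
    # Placeholder accuracies for illustration only.
    print(gap_recovered(acc_sa=60.0, acc_swa_sft=40.0, acc_swa_rl=55.0))
```

The choice of SA reference matters: measuring against SA after SFT and against SA after the same RL stage can give very different ratios, so a report should state which one is used.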

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the empirical support for our claims. We address each major comment point by point below, with clarifications based on the full manuscript and commitments to revisions where the evidence can be improved without misrepresenting our results.

Referee: [Abstract and Experiments] Abstract and experimental results section: the central claim that RL narrows the SWA-SA gap specifically because on-policy trajectories adapt to reduce long-range dependencies (the explicit weakest assumption) is unsupported; only final accuracies are compared, with no reported statistics on dependency spans, attention rollout distances, or parse-tree depths between SFT corpus and RL trajectories. This leaves open that any lift is due to generic RL optimization benefits independent of the architecture mismatch hypothesis.

Authors: We agree that direct measurements of dependency spans, attention distances, or parse-tree depths would provide stronger mechanistic evidence for the data-architecture mismatch hypothesis and would help rule out generic RL benefits. The manuscript presents the performance differential (SFT vs. RL under SWA, compared to SA) as support for the viability claim and the adaptation hypothesis, but does not include those specific statistics. In revision we will add such analysis (e.g., attention rollout or dependency length comparisons) where feasible using existing trajectories.

Revision: partial

Referee: [Abstract] Abstract: the soundness of the empirical finding cannot be assessed because the text supplies no quantitative results, error bars, baseline comparisons, dataset details, or magnitude of gap recovery, contradicting the claim that RL 'substantially narrows the gap' and 'recovers much of the accuracy lost'.

Authors: The full manuscript contains a dedicated Experiments section with quantitative results on math reasoning benchmarks (including accuracies, gap recovery magnitudes, baseline comparisons to SA and other methods, dataset details such as GSM8K and MATH, and error bars from multiple runs). The abstract summarizes these findings at a high level due to length constraints. We will revise the abstract to explicitly reference the specific quantitative improvements reported in the experiments.

Revision: yes

Circularity Check

0 steps flagged

Empirical comparison with no derivation chain or self-referential reduction

The paper's central claim is an empirical observation from experiments: after SFT, SWA underperforms SA, but subsequent on-policy RL narrows the gap. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear in the provided text. The hypothesis about data-architecture mismatch is explicitly labeled as a hypothesis and is not used to derive results by construction; it is tested via accuracy benchmarks. No self-citations are invoked as load-bearing premises. The work is self-contained as a standard empirical recipe comparison and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, fitted parameters, or new postulated entities; the claim rests on an empirical hypothesis about data-architecture mismatch.


Pith/arXiv arXiv 2004

[21] [21]

Advances in neural information processing systems , volume =

Flashattention: Fast and memory-efficient exact attention with io-awareness , author =. Advances in neural information processing systems , volume =

[22] [22]

arXiv preprint arXiv:2503.20783 , year =

Understanding r1-zero-like training: A critical perspective , author =. arXiv preprint arXiv:2503.20783 , year =

Pith/arXiv arXiv

[23] [23]

arXiv preprint arXiv:2402.19427 , year =

Griffin: Mixing gated linear recurrences with local attention for efficient language models , author =. arXiv preprint arXiv:2402.19427 , year =

Pith/arXiv arXiv

[24] [24]

arXiv preprint arXiv:2401.04658 , year =

Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models , author =. arXiv preprint arXiv:2401.04658 , year =

arXiv

[25] [25]

arXiv preprint arXiv:1707.06347 , year =

Proximal policy optimization algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

Pith/arXiv arXiv

[26] [26]

arXiv preprint arXiv:2506.06395 , year =

Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models , author =. arXiv preprint arXiv:2506.06395 , year =

arXiv

[27] [27]

arXiv preprint arXiv:2507.20534 , year =

Kimi k2: Open agentic intelligence , author =. arXiv preprint arXiv:2507.20534 , year =

Pith/arXiv arXiv

[28] [28]

Advances in neural information processing systems , volume =

Attention is all you need , author =. Advances in neural information processing systems , volume =

[29] [29]

arXiv preprint arXiv:2009.14794 , year =

Rethinking attention with performers , author =. arXiv preprint arXiv:2009.14794 , year =

Pith/arXiv arXiv 2009

[30] [30]

First conference on language modeling , year =

Mamba: Linear-time sequence modeling with selective state spaces , author =. First conference on language modeling , year =

[31] [31]

arXiv preprint arXiv:2501.04519 , year =

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking , author =. arXiv preprint arXiv:2501.04519 , year =

Pith/arXiv arXiv

[32] [32]

American Invitational Mathematics Examination 2025 , author =

2025

[33] [33]

American Invitational Mathematics Examination - AIME 2024 , year =

MAA , title =. American Invitational Mathematics Examination - AIME 2024 , year =

2024

[34] [34]

arXiv preprint arXiv:2307.08621 , year =

Retentive network: A successor to transformer for large language models , author =. arXiv preprint arXiv:2307.08621 , year =

Pith/arXiv arXiv

[35] [35]

arXiv preprint arXiv:2305.13245 , year =

Gqa: Training generalized multi-query transformer models from multi-head checkpoints , author =. arXiv preprint arXiv:2305.13245 , year =

Pith/arXiv arXiv

[36] [36]

arXiv preprint arXiv:2503.18892 , year =

Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild , author =. arXiv preprint arXiv:2503.18892 , year =

Pith/arXiv arXiv

[37] [37]

arXiv preprint arXiv:2508.10925 , year =

gpt-oss-120b & gpt-oss-20b model card , author =. arXiv preprint arXiv:2508.10925 , year =

Pith/arXiv arXiv

[38] [38]

International conference on machine learning , pages =

Transformers are rnns: Fast autoregressive transformers with linear attention , author =. International conference on machine learning , pages =. 2020 , organization =

2020

[39] [39]

arXiv preprint arXiv:2407.14207 , year =

Longhorn: State space models are amortized online learners , author =. arXiv preprint arXiv:2407.14207 , year =

arXiv

[40] [40]

arXiv preprint arXiv:2506.13284 , year =

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy , author =. arXiv preprint arXiv:2506.13284 , year =

arXiv

[41] [41]

arXiv preprint arXiv:2508.14444 , year =

Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model , author =. arXiv preprint arXiv:2508.14444 , year =

Pith/arXiv arXiv

[42] [42]

arXiv preprint arXiv:2504.16084 , year =

Ttrl: Test-time reinforcement learning , author =. arXiv preprint arXiv:2504.16084 , year =

Pith/arXiv arXiv

[43] [43]

arXiv preprint arXiv:2507.19353 , year =

Smooth reading: Bridging the gap of recurrent llm to self-attention llm on long-context tasks , author =. arXiv preprint arXiv:2507.19353 , year =

arXiv

[44] [44]

arXiv preprint arXiv:2505.15431 , year =

Hunyuan-turbos: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought , author =. arXiv preprint arXiv:2505.15431 , year =

arXiv

[45] [45]

16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages =

Orca: A distributed serving system for \ Transformer-Based \ generative models , author =. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages =

[46] [46]

Advances in neural information processing systems , volume =

Training language models to follow instructions with human feedback , author =. Advances in neural information processing systems , volume =

[47] [47]

arXiv preprint arXiv:2111.00396 , year =

Efficiently modeling long sequences with structured state spaces , author =. arXiv preprint arXiv:2111.00396 , year =

Pith/arXiv arXiv

[48] [48]

GitHub repository , howpublished =

Jia LI and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu , title =. GitHub repository , howpublished =. 2024 , publisher =

2024

[49] [49]

Advances in Neural Information Processing Systems , volume =

Infllm: Training-free long-context extrapolation for llms with an efficient context memory , author =. Advances in Neural Information Processing Systems , volume =

[50] [50]

Advances in Neural Information Processing Systems , volume =

Why think step by step? reasoning emerges from the locality of experience , author =. Advances in Neural Information Processing Systems , volume =

[51] [51]

arXiv preprint arXiv:2506.02177 , year =

Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts , author =. arXiv preprint arXiv:2506.02177 , year =

arXiv

[52] [52]

arXiv preprint arXiv:2508.15763 , year =

Intern-s1: A scientific multimodal foundation model , author =. arXiv preprint arXiv:2508.15763 , year =

arXiv

[53] [53]

arXiv preprint arXiv:2505.24298 , year =

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning , author =. arXiv preprint arXiv:2505.24298 , year =

Pith/arXiv arXiv

[54] [54]

arXiv preprint arXiv:2507.06457 , year =

A systematic analysis of hybrid linear attention , author =. arXiv preprint arXiv:2507.06457 , year =

Pith/arXiv arXiv

[55] [55]

Advances in Neural Information Processing Systems , volume =

MetaLA: Unified optimal linear approximation to softmax attention map , author =. Advances in Neural Information Processing Systems , volume =

[56] [56]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

[57] [57]

Advances in Neural Information Processing Systems , volume =

Hierarchically gated recurrent neural network for sequence modeling , author =. Advances in Neural Information Processing Systems , volume =

[58] [58]

Xtuner: A toolkit for efficiently fine-tuning llm , author =

[59] [59]

Sky-t1: Train your own o1 preview model within \ 450 , author =

[60] [60]

arXiv preprint arXiv:2405.21060 , year =

Transformers are ssms: Generalized models and efficient algorithms through structured state space duality , author =. arXiv preprint arXiv:2405.21060 , year =

Pith/arXiv arXiv

[61] [61]

arXiv preprint arXiv:2506.13585 , year =

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention , author =. arXiv preprint arXiv:2506.13585 , year =

Pith/arXiv arXiv

[62] [62]

arXiv preprint arXiv:2503.24290 , year =

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model , author =. arXiv preprint arXiv:2503.24290 , year =

Pith/arXiv arXiv

[63] [63]

Bespoke Labs , title =

[64] [64]

arXiv preprint arXiv:2505.22425 , year =

Scaling Reasoning without Attention , author =. arXiv preprint arXiv:2505.22425 , year =

arXiv

[65] [65]

Advances in neural information processing systems , volume =

Chain-of-thought prompting elicits reasoning in large language models , author =. Advances in neural information processing systems , volume =

[66] [66]

arXiv preprint arXiv:2404.02078 , year =

Advancing llm reasoning generalists with preference trees , author =. arXiv preprint arXiv:2404.02078 , year =

arXiv

[67] [67]

Proceedings of the Twentieth European Conference on Computer Systems , pages =

Hybridflow: A flexible and efficient rlhf framework , author =. Proceedings of the Twentieth European Conference on Computer Systems , pages =

[68] [68]

arXiv preprint arXiv:2402.03300 , year =

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author =. arXiv preprint arXiv:2402.03300 , year =

Pith/arXiv arXiv

[69] [69]

Advances in neural information processing systems , volume =

Parallelizing linear transformers with the delta rule over sequence length , author =. Advances in neural information processing systems , volume =

[70] [70]

arXiv preprint arXiv:2412.06464 , year =

Gated delta networks: Improving mamba2 with delta rule , author =. arXiv preprint arXiv:2412.06464 , year =

Pith/arXiv arXiv

[71] [71]

arXiv preprint arXiv:2312.06635 , year =

Gated linear attention transformers with hardware-efficient training , author =. arXiv preprint arXiv:2312.06635 , year =

Pith/arXiv arXiv

[72] [72]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Native sparse attention: Hardware-aligned and natively trainable sparse attention , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

[73] [73]

arXiv preprint arXiv:2502.13189 , year =

Moba: Mixture of block attention for long-context llms , author =. arXiv preprint arXiv:2502.13189 , year =

Pith/arXiv arXiv

[74] [74]

arXiv preprint arXiv:2503.01496 , year =

Liger: Linearizing Large Language Models to Gated Recurrent Structures , author =. arXiv preprint arXiv:2503.01496 , year =

arXiv

[75] [75]

arXiv preprint arXiv:2505.15692 , volume =

Thought-augmented policy optimization: Bridging external guidance and internal capabilities , author =. arXiv preprint arXiv:2505.15692 , volume =

Pith/arXiv arXiv

[76] [76]

arXiv preprint arXiv:2404.07904 , year =

Hgrn2: Gated linear rnns with state expansion , author =. arXiv preprint arXiv:2404.07904 , year =

arXiv

[77] [77]

arXiv preprint arXiv:2403.19887 , year =

Jamba: A hybrid transformer-mamba language model , author =. arXiv preprint arXiv:2403.19887 , year =

Pith/arXiv arXiv

[78] [78]

arXiv preprint arXiv:2503.14476 , year =

Dapo: An open-source llm reinforcement learning system at scale , author =. arXiv preprint arXiv:2503.14476 , year =

Pith/arXiv arXiv

[79] [79]

2025 , eprint =

Not All Correct Answers Are Equal: Why Your Distillation Source Matters , author =. 2025 , eprint =

2025

[80] [80]

2025 , eprint =

Qwen3 Technical Report , author =. 2025 , eprint =

2025