hub

arXiv preprint arXiv:2409.12917 , year=

Training Language Models to Self-Correct via Reinforcement Learning , author= · 2024 · arXiv 2409.12917

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

cs.CL · 2024-12-30 · unverdicted · novelty 7.0

o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

cs.AI · 2026-04-29 · unverdicted · novelty 6.0

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

cs.AI · 2025-07-28 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

citing papers explorer

Showing 11 of 11 citing papers.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion cs.LG · 2026-05-05 · unverdicted · none · ref 56
Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs cs.CL · 2024-12-30 · unverdicted · none · ref 277
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability cs.LG · 2026-05-09 · unverdicted · none · ref 56
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories cs.AI · 2026-05-09 · unverdicted · none · ref 2
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
PaT: Planning-after-Trial for Efficient Test-Time Code Generation cs.CL · 2026-05-08 · unverdicted · none · ref 41
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture cs.SE · 2026-05-02 · unverdicted · none · ref 17
RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling cs.AI · 2026-04-29 · unverdicted · none · ref 18
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning cs.LG · 2026-04-19 · unverdicted · none · ref 7
A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation cs.CL · 2026-04-28 · unverdicted · none · ref 17
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 35
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 174
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

arXiv preprint arXiv:2409.12917 , year=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer