Introduces state commitment learning and Counterfactual Erasure RL (CERL) to train models to commit only persistent state, reducing answer dependence on hidden thoughts across math, logic, QA, and tool-use tasks without accuracy loss.
arXiv preprint arXiv:2505.17813
9 Pith papers cite this work.
citation-role summary
citation-polarity summary
years: 2026 (9)
roles: background (1)
polarities: background (1)
representative citing papers
Language models produce overcomplete reasoning traces: on average, 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments, then optimizes with auxiliary DPO to improve the accuracy-efficiency trade-off on math benchmarks.
VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models at a much smaller size.
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks by up to 19.2%.
Sophisticated prompting on Gemini 2.0 Flash achieves a 0.720 Concept Level Score on MedHopQA, outperforming the baseline by 0.155 and matching Gemini 2.5 Flash performance.
citing papers explorer
- State commitment learning: training language models to distinguish computation from memory
  Introduces state commitment learning and Counterfactual Erasure RL (CERL) to train models to commit only persistent state, reducing answer dependence on hidden thoughts across math, logic, QA, and tool-use tasks without accuracy loss.
- VSPO: Vector-Steered Policy Optimization for Behavioral Control
  VSPO samples rollouts at varying steering intensities to improve behavioral control in LLMs while preserving task accuracy.