General-Reasoner: Advancing LLM Reasoning Across All Domains
17 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Invariant Gradient Alignment for Robust Reasoning Distillation
  Invariant Gradient Alignment uses Logical Isomer Sets and a Continuous Gradient Conflict Mask to tighten OOD generalization bounds and boost empirical performance over ERM in reasoning distillation.
- ResMerge: Residual-based Spectral Merging of Large Language Models
  ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
- Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
  Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
- Harnessing LLM Agents with Skill Programs
  HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning benchmarks.
- CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
  CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
- Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
  RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer, an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
  Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
- TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models
  TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
- MoCo: A One-Stop Shop for Model Collaboration Research
  MoCo supplies a unified library of 26 collaboration strategies and benchmarks demonstrating average outperformance over single models in 61 percent of (model, data) pairs.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
  RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling
  C3RL is a new RL algorithm combining correctness, calibration, and reference accuracy rewards to improve LLM confidence calibration, enabling CAS to outperform majority voting with up to 12.33x lower inference cost.
- Trust Region On-Policy Distillation
  TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
- CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
  CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
- M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
  M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
- How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
  Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
- MUR: Momentum Uncertainty guided Reasoning for Large Language Models