hub

arXiv preprint arXiv:2312.06585

Beyond human data: Scaling self-training for problem-solving with language models · 2022 · arXiv 2312.06585

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

Space Syntax-guided Post-training for Residential Floor Plan Generation

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

cs.CV · 2025-03-21 · conditional · novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

cs.LG · 2024-10-10 · unverdicted · novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

cs.AI · 2026-03-23 · unverdicted · novelty 5.0

SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

cs.IR · 2024-06-17 · unverdicted · novelty 5.0

ITEM is a new iterative utility judgment loop for RAG that maps Schutz's three levels of relevance to retrieval, utility scoring, and generation, yielding measured gains on TREC DL, WebAP, GTI-NQ, and NQ.

PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues

cs.CL · 2026-04-20 · unverdicted · novelty 4.0

PRISMA augments self-training with direct preference optimization and an emotion-aware negotiation strategy chain-of-thought to produce more interpretable and effective negotiation dialogues on two new datasets.

Demystifying Long Chain-of-Thought Reasoning in LLMs

cs.CL · 2025-02-05 · unverdicted · novelty 4.0

Experiments show that long CoT reasoning in LLMs emerges with more training compute when reward shaping is used properly, and scaling verifiable rewards from noisy data helps especially on out-of-distribution tasks.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

cs.AI · 2025-03-31 · unverdicted · novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

cs.CV · 2026-05-20

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

cs.LG · 2026-05-02