pith. machine review for the scientific record. sign in

On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

years

2026 8

verdicts

UNVERDICTED 8

representative citing papers

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

Teacher-Guided Policy Optimization for LLM Distillation

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

AIPO: : Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.

citing papers explorer

Showing 8 of 8 citing papers.