arXiv preprint arXiv:2602.02482

19 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 4

citation-polarity summary: background 4

years: 2026 19

representative citing papers

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

CORE distills contrasts between successful and unsuccessful reasoning traces into compact natural-language insights that enable faster model self-improvement on reasoning tasks with fewer rollouts than parametric or other non-parametric baselines.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language-feedback learning as variational EM: the teacher policy refines itself via trust-region updates on outcomes, while the student learns dense token distributions on its own rollouts. It outperforms fixed-teacher baselines on reasoning and code tasks.

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

AXPO addresses the Thinking-Acting Gap in agentic RL training by targeted resampling of tool calls in all-wrong subgroups, gaining 1.8 percentage points over GRPO across nine multimodal benchmarks; its 8B model beats a 32B baseline on Pass@4.

ECHO: Terminal Agents Learn World Models for Free

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

ECHO is a hybrid RL objective that trains agents to predict environment observation tokens from their actions, doubling GRPO pass@1 on TerminalBench-2.0 while improving dynamics prediction on held-out trajectories.

Reinforcing Human Behavior Simulation via Verbal Feedback

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.

Self-Improving 4D Perception via Self-Distillation

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight benchmarks.

RL with Learnable Textual Feedback: A Bilevel Approach

cs.LG · 2026-05-23 · unverdicted · novelty 5.0

Bi-NAC frames RL with textual feedback as a Stackelberg bilevel program and reports that 2B and 6B models trained this way outperform larger GRPO baselines on MATH-500 and GPQA.

Physics-Guided Policy Optimization with Self-Distillation

cs.LG · 2026-06-02 · unverdicted · novelty 4.0

PGPO modulates per-step trust in self-distilled updates via a mutual-information estimate derived from a viscous-fluid analogy, preserves SGD weak-approximation order, and reports gains of up to 4.5 points on Science-QA while avoiding late-training collapse.
