arxiv: 2603.16856 · v2 · pith:KM6I5WOYnew · submitted 2026-03-17 · 💻 cs.CL

Online Experiential Learning for Language Models

classification 💻 cs.CL
keywords experiential, knowledge, learning, model, language, models, online, trajectories

The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.
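
The two-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the authors' implementation: the names `run_oel`, `collect`, `extract`, and `consolidate` are hypothetical, and the toy functions stand in for real trajectory collection, knowledge extraction, and on-policy context distillation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OELState:
    """Bookkeeping for the sketch (hypothetical, not from the paper)."""
    knowledge: List[str] = field(default_factory=list)  # accumulated experiential knowledge
    rounds_done: int = 0


def run_oel(
    collect: Callable[[int], List[str]],
    extract: Callable[[List[str], List[str]], List[str]],
    consolidate: Callable[[List[str]], None],
    num_rounds: int,
) -> OELState:
    """Iterate the two stages for a fixed number of rounds."""
    state = OELState()
    for round_idx in range(num_rounds):
        # User side: the current model interacts and produces trajectories.
        trajectories = collect(round_idx)
        # Stage 1: extract transferable knowledge and accumulate it.
        for item in extract(trajectories, state.knowledge):
            if item not in state.knowledge:
                state.knowledge.append(item)
        # Stage 2: consolidate knowledge into the model. Note that this
        # step receives only the knowledge, never the environment.
        consolidate(state.knowledge)
        state.rounds_done += 1
    return state


# Toy stand-ins (assumptions for illustration only).
def toy_collect(round_idx: int) -> List[str]:
    return [
        f"round {round_idx}: opened door with key",
        f"round {round_idx}: walked into wall",
    ]


def toy_extract(trajectories: List[str], known: List[str]) -> List[str]:
    # A real extractor would prompt a model; here we pattern-match.
    return ["keys open doors"] if any("key" in t for t in trajectories) else []


distilled: List[List[str]] = []  # records what each consolidation step saw


def toy_consolidate(knowledge: List[str]) -> None:
    # Placeholder for on-policy context distillation (no training happens here).
    distilled.append(list(knowledge))


state = run_oel(toy_collect, toy_extract, toy_consolidate, num_rounds=3)
print(state.rounds_done, state.knowledge)
```

In this sketch the consolidation callback is deliberately given only the accumulated knowledge, mirroring the abstract's point that the second stage requires no access to the user-side environment.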



Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning from Language Feedback via Variational Policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...

  2. Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

  3. Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting

    cs.LG 2026-05 unverdicted novelty 6.0

    TTRL-Guard mitigates the Correct-Answer Extinction Window in test-time RL via flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates, yielding best average pass@1 on Qwen mod...

  4. Evidence Over Plans: Online Trajectory Verification for Skill Distillation

    cs.AI 2026-05 unverdicted novelty 6.0

    PDI-guided distillation from environment-verified trajectories yields skills that surpass no-skill baselines and human-written skills across 86 tasks with far lower inference cost.

  5. ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage

    cs.LG 2026-05 unverdicted novelty 6.0

    ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.

  6. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  7. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents

    cs.LG 2026-04 unverdicted novelty 6.0

    Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.

  8. A Predictive Law for On-Policy Self-Distillation From World Feedback

    cs.LG 2026-05 unverdicted novelty 5.0

    A linear relationship between initial student-self-teacher performance gap and OPSD improvement provides a predictive law across contexts and model families.

  9. Echo: Learning from Experience Data via User-Driven Refinement

    cs.AI 2026-05 unverdicted novelty 5.0

    Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.

  10. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  11. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  12. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...

  13. A Brief Overview: On-Policy Self-Distillation In Large Language Models

    cs.HC 2026-05 unverdicted novelty 2.0

    This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.