Recognition: 2 theorem links
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3
The pith
A new SFT-free training paradigm, combining guided on-policy distillation with dual-level RL, lets small 2B/3B GUI agents reach state-of-the-art performance among lightweight models while staying competitive with much larger ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating generalized knowledge distillation with oracle trajectories and dynamic retrieval into GUI agents, together with a Multi-solution Dual-level GRPO framework that jointly optimizes macro subtask planning and micro execution matching, small-scale models overcome the limitations of imitation learning and achieve state-of-the-art results among lightweight agents while remaining competitive with larger-scale models.
What carries the argument
Guided On-policy Distillation with dynamic retrieval of oracle reference trajectories, combined with Multi-solution Dual-level GRPO for joint macro-micro alignment, which reduces hallucinations and improves exploration in long-horizon tasks.
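The micro-level machinery described here, scoring a predicted action against the best of several valid reference actions and standardizing rewards within a sampled group (GRPO-style), can be sketched as follows. The matching function `phi_gui`, the action format, and all names are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """A simplified GUI action: an operation type plus a click target."""
    op: str            # e.g. "click", "type", "scroll"
    x: float = 0.0     # normalized click coordinates in [0, 1]
    y: float = 0.0

def phi_gui(pred: Action, ref: Action) -> float:
    """Assumed matching score: 0 if the op differs, else scaled by click proximity."""
    if pred.op != ref.op:
        return 0.0
    dist = ((pred.x - ref.x) ** 2 + (pred.y - ref.y) ** 2) ** 0.5
    return max(0.0, 1.0 - dist)  # closer clicks score higher

def multi_solution_reward(pred: Action, valid_refs: list[Action]) -> float:
    """Score the prediction against the *best* of several valid solutions,
    so the agent is not penalized for choosing a different correct path."""
    return max(phi_gui(pred, ref) for ref in valid_refs)

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: rewards standardized within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

Taking the max over valid references is what makes the reward "multi-solution": any one correct action sequence earns full credit.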
If this is right
- Small-scale models can now handle complex, long-horizon GUI interactions without the overfitting and forgetting typical of SFT.
- Structured on-policy distillation and multi-solution exploration unlock performance limits that conventional imitation learning could not reach for 2B/3B agents.
- The method supports scalable training through automated generation of trajectories with multiple valid solutions.
Where Pith is reading between the lines
- The same distillation-plus-dual-level RL pattern could transfer to other agent domains such as web navigation or mobile app control where multiple valid action sequences exist.
- If the dynamic retrieval mechanism proves robust, it might reduce the need for ever-larger models in on-device automation by letting compact agents stay aligned with expert behavior over extended sessions.
- A natural next test would be to measure energy use and latency on actual edge hardware to confirm the practical on-device advantage.
Load-bearing premise
The automated data generation pipeline produces high-quality, accurate multi-solution annotations and oracle reference trajectories that generalize beyond the synthetic data.
What would settle it
Run the trained 2B or 3B models on a fresh set of real-world GUI tasks not seen in the synthetic benchmarks and measure whether they maintain the reported competitiveness with larger models; a large performance drop would falsify the generalization benefit.
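The falsification test above can be made concrete as a simple success-rate comparison. The relative-drop threshold and task outcomes below are illustrative assumptions, not values from the paper.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks the agent completed successfully."""
    return sum(outcomes) / len(outcomes)

def generalization_holds(synthetic: list[bool], held_out: list[bool],
                         max_rel_drop: float = 0.15) -> bool:
    """Return False (claim falsified) if success on fresh real-world tasks
    drops by more than max_rel_drop relative to the synthetic benchmarks."""
    base = success_rate(synthetic)
    fresh = success_rate(held_out)
    return (base - fresh) / base <= max_rel_drop
```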
Original abstract
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LiteGUI, an SFT-free training paradigm for lightweight (2B/3B) vision-language GUI agents. It introduces Guided On-policy Distillation that incorporates oracle reference trajectories and a dynamic retrieval mechanism to reduce hallucinations and address cognitive misalignment in multi-solution tasks, followed by a Multi-solution Dual-level GRPO framework that jointly optimizes macro-level subtask planning and micro-level execution. The authors construct an automated data generation pipeline to synthesize GUI trajectories with rich multi-solution annotations. Extensive experiments claim state-of-the-art performance among lightweight models while remaining competitive with substantially larger models across GUI benchmarks, with ablations attributing gains to the distillation and dual-level RL components.
Significance. If the central performance claims hold after addressing data validation, this work would meaningfully advance on-device GUI agents by showing how to mitigate SFT-induced overfitting and rigidity in small models through on-policy distillation and structured RL. The handling of multi-solution trajectories and long-horizon exploration via dual-level alignment could influence practical deployments in automated interaction, testing, and accessibility, provided the synthetic data pipeline generalizes reliably.
Major comments (2)
- [Section 3] Automated data generation pipeline (Section 3): No validation metrics are reported for the synthesized trajectories, such as human agreement rates on multi-solution annotations, error rates in generated action sequences, or coverage of real GUI interface variability. This is load-bearing because the Guided On-policy Distillation uses these oracle references and the GRPO framework depends on the multi-solution labels for rewards; without such evidence, reported gains in hallucination reduction and long-horizon performance cannot be confidently attributed to the training methods rather than artifacts of the data pipeline.
- [Section 5] Experimental results (Section 5, Tables 1-3): The SOTA and competitiveness claims are presented without error bars, standard deviations across multiple runs, or statistical significance tests. GUI agent benchmarks often exhibit high variance due to interface differences and task stochasticity, so the absence of these details undermines the reliability of the cross-model comparisons and ablation conclusions.
Minor comments (3)
- [Abstract] The abstract refers to 'all benchmarks' without naming them; the introduction or experimental setup should explicitly list the evaluation suites (e.g., AndroidControl, WebArena) for immediate clarity.
- [Section 4] Notation for the dual-level GRPO components (macro vs. micro rewards) could be more consistently defined when first introduced to aid readers in following the alignment objective.
- [Figures] Figure captions for the overall architecture and GRPO framework would benefit from additional detail on data flow between distillation and RL stages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important aspects for strengthening the reliability of our claims on LiteGUI. We address each major comment below and commit to revisions that enhance the manuscript without altering its core contributions.
Point-by-point responses
Referee: [Section 3] Automated data generation pipeline (Section 3): No validation metrics are reported for the synthesized trajectories, such as human agreement rates on multi-solution annotations, error rates in generated action sequences, or coverage of real GUI interface variability. This is load-bearing because the Guided On-policy Distillation uses these oracle references and the GRPO framework depends on the multi-solution labels for rewards; without such evidence, reported gains in hallucination reduction and long-horizon performance cannot be confidently attributed to the training methods rather than artifacts of the data pipeline.
Authors: We agree that explicit validation metrics for the automated data generation pipeline are important to substantiate the quality of the oracle references and multi-solution annotations, given their central role in Guided On-policy Distillation and the dual-level GRPO rewards. The original manuscript describes the pipeline's design for synthesizing trajectories with consistency checks against GUI interfaces but does not include quantitative validation such as human agreement rates or error statistics. In the revised version, we will add a dedicated subsection (or appendix) reporting error rates in generated action sequences evaluated on a held-out set of real-world GUI interfaces, coverage metrics for interface variability, and inter-annotator agreement rates on a sampled subset of multi-solution annotations. These additions will allow readers to better attribute performance gains to the proposed training methods. Revision: yes.
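The inter-annotator agreement the authors commit to reporting is a standard quantity; a minimal sketch using chance-corrected agreement (Cohen's kappa), with the label vocabulary assumed for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random
    # according to their own marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Raw percent agreement alone can look high by chance when one label dominates; kappa corrects for that, which matters if most synthesized annotations are "valid".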
Referee: [Section 5] Experimental results (Section 5, Tables 1-3): The SOTA and competitiveness claims are presented without error bars, standard deviations across multiple runs, or statistical significance tests. GUI agent benchmarks often exhibit high variance due to interface differences and task stochasticity, so the absence of these details undermines the reliability of the cross-model comparisons and ablation conclusions.
Authors: We concur that the absence of error bars, standard deviations, and statistical tests limits the robustness assessment of the reported results, especially in stochastic GUI environments. The original submission presented single-run point estimates for the main benchmarks and ablations. In the revised manuscript, we will conduct additional evaluation runs using multiple random seeds (minimum of three) for the primary results in Tables 1-3, reporting standard deviations and error bars. We will also include appropriate statistical significance tests (such as paired t-tests) for key comparisons to support the SOTA claims among lightweight models and competitiveness with larger models. Revision: yes.
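The multi-seed protocol the authors commit to can be sketched with standard tools. The per-seed scores below are placeholders, not results from the paper.

```python
import math
import statistics

def mean_std(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t(scores_a: list[float], scores_b: list[float]) -> float:
    """Paired t statistic over per-seed score differences (df = n - 1).
    Pairing by seed controls for seed-level variance shared by both models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Illustrative per-seed benchmark scores, not values from the paper.
ours = [71.2, 70.5, 72.0]
base = [68.9, 69.4, 68.1]
m, s = mean_std(ours)   # report as mean ± std in the tables
t = paired_t(ours, base)  # compare against the t distribution with n - 1 df
```

With only three seeds the test has little power, so the committed minimum of three seeds should be treated as a floor rather than a target.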
Circularity Check
No circularity detected in derivation chain
Full rationale
The paper presents an empirical training paradigm (Guided On-policy Distillation + Multi-solution Dual-level GRPO) built on an automated synthetic data pipeline that supplies oracle trajectories and annotations. No mathematical derivations, equations, or first-principles claims are described that reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance results are reported from experiments and ablations rather than tautological predictions, and the method relies on standard RL components plus external-style oracles without load-bearing self-referential loops.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose a novel SFT-free training paradigm... Guided On-policy Distillation... Multi-solution Dual-level GRPO... automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: the multi-solution reward R_ms(ŷ_t, A*_t) = max_{a* ∈ A*_t} φ_gui(ŷ_t, a*), the subtask reward R_sub = f_judge(...), and the GRPO clipped objective.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.