{"total":12,"items":[{"citing_arxiv_id":"2605.17497","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Supervised On-Policy Distillation for Reasoning Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-17T15:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSOPD converts intra-group correct-wrong contrast into process supervision by distilling a teacher distribution from the shortest correct completion into prefixes of the longest wrong completion, improving GRPO on AIME and HMMT benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":68,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14350","ref_index":183,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-14T04:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10292","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LeapTS reformulates forecasting as adaptive multi-horizon scheduling via hierarchical control and NCDEs, delivering at least 7.4% better performance and 2.6-5.3x faster inference than Transformer baselines while adapting to non-stationary dynamics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bound of LeapTS is no worse than the Direct and Recursive upper bounds: $B^{\\star}_{\\mathrm{LeapTS}} \\le \\min(B_{\\mathrm{dir}}, B_{\\mathrm{rec}})$. (29) Proof. By definition, the target residual is $h_{\\mathrm{res}} = h_{t+P} - \\hat{h}^{\\mathrm{coarse}}_{t+P}$. For a specific schedule $\\ell$, the total prediction error can be expanded as: $\\| h_{t+P} - (\\hat{h}^{\\mathrm{coarse}}_{t+P} + \\alpha \\hat{h}^{\\mathrm{sched}}_{t+P}) \\| = \\| h_{t+P} - \\hat{h}^{\\mathrm{coarse}}_{t+P} - \\alpha h_{\\mathrm{res}} + \\alpha h_{\\mathrm{res}} - \\alpha \\hat{h}^{\\mathrm{sched}}_{t+P} \\|$ (30) $= \\| (1-\\alpha) h_{\\mathrm{res}} + \\alpha (h_{\\mathrm{res}} - \\hat{h}^{\\mathrm{sched}}_{t+P}) \\|$. (31) By the triangle inequality: $\\| h_{t+P} - (\\hat{h}^{\\mathrm{coarse}}_{t+P} + \\alpha \\hat{h}^{\\mathrm{sched}}_{t+P}) \\| \\le (1-\\alpha) \\| h_{\\mathrm{res}} \\| + \\alpha \\| h_{\\mathrm{res}} - \\hat{h}^{\\mathrm{sched}}_{t+P} \\|$. (32) From Proposition 1, the coarse prediction error is bounded by $\\| h_{\\mathrm{res}} \\| \\le \\epsilon(P)$. The second term, $\\| h_{\\mathrm{res}} - \\hat{h}^{\\mathrm{sched}}_{t+P} \\|$, represents the error of the scheduling branch when fitting this residual using $K$ recursive segments. 
Following the unrolling logic of Proposition 2 with varying segment lengths"},{"citing_arxiv_id":"2605.09183","ref_index":7,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift","primary_cat":"cs.LG","submitted_at":"2026-05-09T21:48:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SeqRejectron constructs a stopping rule with a small set of validator policies to achieve horizon-free sample complexity for selective imitation learning under arbitrary dynamics shifts.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Let $\\pi_0 \\in \\arg\\min_{\\pi \\in \\Pi} \\widehat{d}_M(\\pi)$ be an empirical disagreement minimizer. For every penalty parameter $\\Lambda > 0$ and integer $K \\ge 1$, there exists a distribution $q^{\\star}$ over validator sets $\\Phi = (\\phi_1, \\ldots, \\phi_K) \\in \\Pi^K$ that simultaneously satisfies the following two properties: 1. The expected training disagreement is bounded by the empirical minimum: $\\mathbb{E}_{\\Phi \\sim q^{\\star}} \\left[ \\sum_{k=1}^{K} \\widehat{d}_M(\\phi_k) \\right] \\le K \\cdot \\widehat{d}_M(\\pi_0) + \\frac{1}{\\Lambda}$ (7) 2. For every target policy $\\pi \\in \\Pi$, the expected late-stop risk scales exclusively with its excess empirical risk over the base policy: $\\mathbb{E}_{\\Phi \\sim q^{\\star}} \\left[ \\frac{1}{n} \\sum_{j=1}^{n} \\mathbf{1}\\left[ \\tau_{\\pi_0,\\Phi}(T_j) > \\tau_{\\pi_0,\\{\\pi\\}}(T_j) \\right] \\right] \\le \\Lambda \\left( \\widehat{d}_M(\\pi) - \\widehat{d}_M(\\pi_0) \\right) + \\frac{1}{K}$ (8) Similar to Section 3.2, Lemma 5.1 is a non-constructive result, although it may be realized by no-regret play. 
We package the output Lemma 5."},{"citing_arxiv_id":"2605.07637","ref_index":1,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding","primary_cat":"cs.AI","submitted_at":"2026-05-08T12:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"$D = \\{\\tau^{\\hat{\\pi}}_u\\}_{u=1}^{n}$ is then used to train the policy. The learning objective minimizes the negative log-likelihood of expert actions: $\\theta^{\\star} = \\arg\\min_{\\theta} \\mathbb{E}_{(o_u, a^{\\hat{\\pi}}_u) \\sim D} \\left[ -\\log \\pi_{\\theta}(a^{\\hat{\\pi}}_u \\mid o_u) \\right]$. (1) After training, actions are sampled as $a_u \\sim \\pi_{\\theta}(o_u)$. Method The overall communication and action prediction workflow is illustrated in Fig. 2. At each time step $t \\in [1, \\ldots, L]$ and for each agent $u \\in [1, \\ldots, U]$, the model receives a structured observation $o^t_u = [\\text{cost-to-go}^t_u, i^t_u, n^t_{u,1}, \\ldots, n^t_{u,k}]$, (2) where $\\text{cost-to-go}^t_u$ is an egocentric cost-to-go matrix, $i^t_u$ contains the agent's own features (relative positions of current and goal locations, greedy action, and previous $k$ actions), and each $n^t$"},{"citing_arxiv_id":"2605.07505","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-08T09:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07401","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning myopic mixed-integer nonlinear model predictive control from expert demonstrations","primary_cat":"eess.SY","submitted_at":"2026-05-08T07:55:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A myopic MINMPC framework learns a value function offline via inverse optimization from expert data, allowing short horizons with near-optimal performance and strict integer feasibility online for hybrid systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06524","ref_index":44,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Process Matters more than Output for Distinguishing Humans from Machines","primary_cat":"cs.AI","submitted_at":"2026-05-07T16:30:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new battery of 30 cognitive tasks demonstrates 
that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05172","ref_index":78,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-06T17:40:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03065","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:36:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00118","ref_index":154,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Gemma 2: Improving Open Language Models at a Practical Size","primary_cat":"cs.CL","submitted_at":"2024-07-31T19:13:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}