{"total":11,"items":[{"citing_arxiv_id":"2605.20277","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TIF-GRPO uses integral feedback on pseudo-temporal trajectories to regulate anatomy-aware rewards in RL for clinical faithfulness in volumetric CT analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18083","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\\Delta$ Integration into Upcycled MoE","primary_cat":"cs.CL","submitted_at":"2026-05-18T08:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14057","ref_index":156,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents","primary_cat":"cs.CL","submitted_at":"2026-05-13T19:29:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, 
outperforming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14054","ref_index":92,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T19:23:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11922","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning","primary_cat":"cs.SE","submitted_at":"2026-05-12T10:36:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11739","ref_index":38,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-12T08:19:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable 
performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The gradient with respect to $\\Delta\\theta$ is: $g(\\Delta\\theta) := \\nabla_{\\Delta\\theta} L_{\\mathrm{OPD}} = A\\Delta\\theta - b$. (35) Gradient Descent Dynamics and Closed-Form Solution. Consider gradient descent on $\\Delta\\theta$ with fixed step size $\\eta > 0$: $\\Delta\\theta_{s+1} = \\Delta\\theta_s - \\eta g(\\Delta\\theta_s) = \\Delta\\theta_s - \\eta(A\\Delta\\theta_s - b) = (I - \\eta A)\\Delta\\theta_s + \\eta b$. (36) Starting from $\\Delta\\theta_0 = 0$ (initialization at the base model), we unroll the recursion: $\\Delta\\theta_1 = \\eta b$, (37) $\\Delta\\theta_2 = (I - \\eta A)\\eta b + \\eta b = \\eta[I + (I - \\eta A)]b$, (38) $\\Delta\\theta_s = \\eta \\sum_{j=0}^{s-1} (I - \\eta A)^j b$. (39) This is a geometric series of matrices. Assume $A$ is symmetric positive semidefinite (it is a Gram matrix of $J_c^\\top F_c^{1/2}$). Choose $\\eta$ such that $0 < \\eta < 2/\\lambda_{\\max}(A)$ to ensure convergence. Then $I - \\eta A$ has spectral radius less than 1, and the series converges to: $\\Delta\\theta_\\infty = \\eta(I - (I - \\eta A))^{-1} b = A^{-1} b$, (40) where $A^{-1}$ denotes the pseudo-inverse on the support of $A$."},{"citing_arxiv_id":"2605.09146","ref_index":68,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Thinking: Imagining in 360$^\\circ$ for Humanoid Visual Search","primary_cat":"cs.CV","submitted_at":"2026-05-09T20:10:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08766","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UserGPT Technical Report","primary_cat":"cs.IR","submitted_at":"2026-05-09T07:51:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization 
module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00610","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors","primary_cat":"cs.LG","submitted_at":"2026-05-01T12:20:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20244","ref_index":10,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Hybrid Policy Distillation for LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-22T06:46:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19638","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-21T16:27:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SafetyALFRED shows multimodal 
LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}