{"total":13,"items":[{"citing_arxiv_id":"2607.00248","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity","primary_cat":"cs.AI","submitted_at":"2026-06-30T22:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30616","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent","primary_cat":"cs.CL","submitted_at":"2026-06-29T17:50:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 35B MoE agent model trained on 45K-token trajectories via three-stage SFT and domain-routed distillation achieves leading or competitive scores against 1T models on SEAL-0, IFBench, HiPhO, FrontierScience-Olympiad and MolBench-Bind.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.15079","ref_index":236,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale","primary_cat":"cs.CL","submitted_at":"2026-06-13T03:21:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at 
scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00510","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:00:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27209","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments","primary_cat":"cs.AI","submitted_at":"2026-05-26T16:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27141","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions","primary_cat":"cs.AI","submitted_at":"2026-05-26T15:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20306","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:08:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WildRoadBench is a new dual-track benchmark on professionally annotated wild UAV road-damage images showing closed-source VLMs lead but leave over half the AP_50 metric on the table while agents lag and open-source models collapse on small targets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16909","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents","primary_cat":"cs.AI","submitted_at":"2026-05-16T09:49:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12070","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction","primary_cat":"cs.LG","submitted_at":"2026-05-12T12:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Missing old logits in async agentic RL entangle discrepancy and 
staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"1 Experimental Setup We evaluate Agentic RL tasks using two representative policy backbones: the dense Qwen3-4B model and the MoE Qwen3-30B-A3B model. Training data are drawn from an Agentic RL corpus spanning multiple environments. Evaluation covers the retail, airline, and telecom domains of τ²-Bench [3], together with the in-store and delivery splits of VitaBench [8]. For both benchmarks, we report task-level average success and pass metrics. To isolate the effect studied in this paper, we adopt an asynchronous RL setup with explicit control over the maximum version gap between rollout workers and the actor, which we cap at three. We eliminate additional staleness from minibatch reuse and multiple PPO epochs where possible, ensuring"},{"citing_arxiv_id":"2605.08766","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UserGPT Technical Report","primary_cat":"cs.IR","submitted_at":"2026-05-09T07:51:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07926","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM 
Agents","primary_cat":"cs.AI","submitted_at":"2026-05-08T15:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp performance drops with increasing depth.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"evaluation of their capabilities becomes essential. Existing agent benchmarks evaluate important aspects of this capability. Tool-calling benchmarks such as BFCL [12] focus on individual API invocations, including tool selection, schema following, and argument generation. More interactive benchmarks, including Tau2-Bench [2], TripBench [15], VitaBench [4], SWE-bench [6], and Gaia2 [3], extend evaluation to multi-step tasks in domains such as travel planning, customer service, software engineering, mobile applications, and web workflows. 
However, many of these tasks are grounded in familiar domains with recurring solution templates: booking travel, applying customer-service policies, or editing code and running tests."},{"citing_arxiv_id":"2605.01347","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2026-05-02T09:41:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5 pair (27B+9B), with students from 1.7B to 14B. Training data is ToolACE [26] (∼16K step-split agentic instances) for agentic tasks and OpenThoughts3 [14] (30K problems) for code tasks. The benchmarks instantiate the agentic-vs-code split of our task-adaptive divergence principle (Sec. 3.3), namely multi-step agentic tool use (BFCL-v4 [36], τ²-Bench [4], VitaBench [15]) and single-turn code generation (LiveCodeBench v6 [18], MBPP+ [3, 25]); we exclude math (already covered by recent OPD work [12, 22, 38]) and focus on the less-explored agentic and code regimes that expose the divergence-selection question motivating our theory. 
Against this setup we benchmark four reference paradigms: the undistilled instruct model"},{"citing_arxiv_id":"2604.27043","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CL-bench Life: Can Language Models Learn from Real-Life Context?","primary_cat":"cs.CL","submitted_at":"2026-04-29T17:44:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=flNZJ2eOet. [23] Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025. [24] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy. [25] Haichuan Hu, Quanjun Zhang, Ye Shang, Guoqing Xie, Chunrong Fang, Zhenyu Chen, and"}],"limit":50,"offset":0}