{"total":15,"items":[{"citing_arxiv_id":"2606.25996","ref_index":107,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Autodata: An agentic data scientist to create high quality synthetic data","primary_cat":"cs.AI","submitted_at":"2026-06-24T16:08:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Autodata introduces an agentic method with meta-optimization to create higher-quality synthetic data, yielding performance gains over standard methods on CS, legal, and math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21337","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams","primary_cat":"cs.LG","submitted_at":"2026-06-19T11:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DataClaw0 introduces an agentic data-tailoring paradigm, a 9B model trained on a synthetically generated dataset, and a new benchmark, claiming improved downstream adaptation in video generation, VQA, and GUI navigation under limited data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11520","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories","primary_cat":"cs.CL","submitted_at":"2026-06-09T23:44:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with 
Qwen3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11127","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation","primary_cat":"cs.CL","submitted_at":"2026-06-09T17:24:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Controlled experiments on synthetic post-training data show provenance-grounded gating and adaptive recovery improve yield and recall over baselines, with generator scale as the primary driver of downstream fine-tuning quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09138","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-08T07:35:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Claw-R1 provides a Gateway Server and Data Pool to manage step-level agent interaction traces as structured data assets for agentic RL training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07710","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing","primary_cat":"cs.LG","submitted_at":"2026-06-05T13:23:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and 
deliver up to 69.6% throughput gains over EAGLE-3.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02908","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents","primary_cat":"cs.CL","submitted_at":"2026-06-01T21:25:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WRIT is a synthesis pipeline that generates write-read intensive trajectories along axes of write-decision count and per-decision evidence burden, enabling a 4B model to outperform GPT-5.1 on τ²-bench with reduced inference tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10832","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T16:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes image-bank harness and ODE closed-loop data generation to boost multimodal deep search agents, reporting average score gains from 24.9% to 39.0% on 8 benchmarks for 8B model and 30.6% to 41.5% for 30B.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multi-source synthesis of legend-based classification on the 1948 map plus the canonical 11-original Trusteeship set, with cross-checks for the 1950 Somaliland exclusion. Visual_Dependency (5.0). The depicted-count of trust territories on the September 1948 map is recoverable only by visual interpretation of shaded regions and labels. 
Text-only sources can give the original total (11) but not the map's depicted count. Shortcut_Leakage (3.0). The original-set total is one snippet away from any UN page, but the depicted count still requires careful map reading. The final percentage is not directly leaked. Verifiability (5.0). Single objective numeric target, rounded to whole percent. Capability_Requirement (5.0). Demands precise zoom and crop to read the legend and small territory labels,"},{"citing_arxiv_id":"2605.10999","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillGen: Verified Inference-Time Agent Skill Synthesis","primary_cat":"cs.LG","submitted_at":"2026-05-09T19:24:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkillGen synthesizes auditable skills from agent trajectories via contrastive induction on successes and failures, then verifies net performance impact by comparing outcomes with and without the skill on identical tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Decoding is deterministic with temperature 0; the default output budget is 4,096 tokens, increased to 16,384 tokens for skill generation. C.4 SKILLGEN Hyperparameters. Unless otherwise noted, all runs use the same benchmark-specific configuration template. The induction stage uses at most eight failure clusters and eight success clusters, with adaptive k-means clustering over k∈[2,8] and a target cluster size of 15. The contrastive module keeps up to 20 nearest failure-success pairs. The generation prompt receives up to six failure clusters, six success clusters, and eight contrastive observations; web search is disabled. The main experiments use a maximum refinement budget of eight rounds. 
For candidate verification, the verification gate evaluates uniformly sampled construction-time verification instances from the"},{"citing_arxiv_id":"2605.06638","ref_index":7,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14116","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-15T17:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 1 illustrates a sample task description, with further details provided in the Appendix A.1. 4.2. Comparison with Existing Benchmarks Table 2 presents a comparison between FT-Bench and relevant existing benchmarks. Current lines of work on research agent evaluation encompass a broad spectrum of tasks, including ML model optimization [14], Kaggle-style engineering challenges [2], AI R&D [32, 47], and scientific code implementation [4]. 
Despite their breadth, these benchmarks exhibit common limitations: they either evaluate agents on isolated sub-tasks within constrained environments or restrict the scope to traditional ML paradigms, thereby failing to capture the unique challenges of modern LLM training, such as instruction formatting, domain-specific evaluation,"},{"citing_arxiv_id":"2507.20534","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi K2: Open Agentic Intelligence","primary_cat":"cs.LG","submitted_at":"2025-07-28T05:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Empirically, this approach significantly enhances the model's token efficiency, encouraging concise yet effective solutions across all domains. PTX Loss. To prevent the potential forgetting of valuable, high-quality data during joint RL training, we curate a dataset comprising hand-selected, high-quality samples and integrate it into the RL objective through an auxiliary PTX loss [55]. This strategy not only leverages the advantages of high-quality data, but also mitigates the risk of overfitting to the limited set of tasks explicitly present in the training regime. This augmentation substantially improves the model's generalization across a broader range of domains. 
Temperature Decay. For tasks such as creative writing and complex reasoning, we find that promoting exploration"},{"citing_arxiv_id":"2504.21318","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phi-4-reasoning Technical Report","primary_cat":"cs.AI","submitted_at":"2025-04-30T05:05:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19678","ref_index":146,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review","primary_cat":"cs.AI","submitted_at":"2025-04-28T11:08:22+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.","context_count":1,"top_context_role":"other","top_context_polarity":"background","context_text":"the biomedical domain, platforms like GeneAgent [141] and frameworks such as PRefLexOR [142] demonstrate enhanced reliability through self-verification and iterative refinement. Moreover, innovative solutions for research ideation, exemplified by SurveyX [143] and Chain-of-Ideas [144], as well as specialized frameworks for synthetic data generation [145] and chemical reasoning [146], collectively underscore the significant strides made in leveraging autonomous AI agents for complex, real-world tasks. Table V presents an overview of AI Agent frameworks. A. 
AI Agent frameworks AI agent frameworks represent a transformative paradigm in developing intelligent systems, combining the power of large language models with modular tools and utilities to"},{"citing_arxiv_id":"2501.06322","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Agent Collaboration Mechanisms: A Survey of LLMs","primary_cat":"cs.AI","submitted_at":"2025-01-10T19:56:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and identifies challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• More human behaviors and additional LLMs needed to study to ensure the key findings. [36] AgentInstruct S&C • Generates diverse natural language data with iterative cross-agent refinement, including cultural data • Ables to train more capable models from generated data through tools usage, agentic capabilities, etc. • Requires human to hand-construct generation flows. [88] SocialMind S&C • Integrates verbal, non-verbal, and social cues to generate in-situ suggestions via augmented reality glasses. • Designs and leverages a multi-modal, asmulti-tier collaborative agent system. • Requires advanced edge hardware to handle complex systems. [144] CulturePark S&C • Prompts LLM-based agents with various cultural backgrounds to simulate cross-cultural communication."}],"limit":50,"offset":0}