{"total":11,"items":[{"citing_arxiv_id":"2605.27820","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents","primary_cat":"cs.AI","submitted_at":"2026-05-27T01:28:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoBench is a new benchmark with 1,045 tasks and a simulated user environment showing that the best SOTA video-MLLM agents reach only 19.43% average accuracy on interactive multimodal tool-using tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19196","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?","primary_cat":"cs.CL","submitted_at":"2026-05-18T23:55:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17613","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VeriCache: Turning Lossy KV Cache into Lossless LLM Inference","primary_cat":"cs.AR","submitted_at":"2026-05-17T19:18:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical 
outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16508","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Scaling Laws of Skills in LLM Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-15T18:05:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00334","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?","primary_cat":"cs.AI","submitted_at":"2026-05-01T01:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22937","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs","primary_cat":"cs.CL","submitted_at":"2026-04-24T18:22:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective 
prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22821","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use","primary_cat":"cs.SD","submitted_at":"2026-04-17T16:41:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.24709","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards","primary_cat":"cs.LG","submitted_at":"2026-03-25T18:31:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A constrained-synthesis RL method with graduated rewards for atomic validity and orchestration consistency improves LLM turn accuracy on multi-step tool benchmarks and transfers to new API sets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.14703","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling","primary_cat":"cs.AI","submitted_at":"2025-10-16T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, 
outperforming outcome and coarse-grained alternatives on function-calling benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.07043","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"COMPASS: Benchmarking Constrained Optimization in LLM Agents","primary_cat":"cs.LG","submitted_at":"2025-10-08T14:09:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"COMPASS benchmark shows LLM agents reach 70-90% feasibility but only 20-60% optimality on constrained travel planning tasks, attributing the gap to insufficient search space exploration rather than tool use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19678","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review","primary_cat":"cs.AI","submitted_at":"2025-04-28T11:08:22+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"uate large language models (LLMs) across diverse and challenging domains. For instance, ENIGMAEVAL [64] assesses complex multimodal puzzle-solving by requiring the synthesis of textual and visual clues, while ComplexFuncBench [66] challenges models with multi-step function-calling tasks that mirror real-world scenarios. 
Humanity's Last Exam (HLE) [67] further raises the bar by presenting expert-level academic questions across a broad spectrum of subjects, thereby 6 TABLE II: Summary of LLM Benchmarks (Part 1) Benchmark / Dataset Year Evaluation Focus Key Features / Metrics Innovations/Techniques Observations ENIGMAEVAL [64] 2025 Multimodal Reasoning Contains 1,184 puzzles combining text and images; state-of-the-art"}],"limit":50,"offset":0}