SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.
hub
arXiv preprint arXiv:2305.18323 , year=
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
Complete cyclic subtask graphs offer a lens to measure when multi-agent revisitation aids recovery and exploration versus when it increases costs or is dominated by other bottlenecks in LLM agent workflows.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
citing papers explorer
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.