WebChallenger introduces PageMem and three architecture mechanisms to achieve competitive web navigation with open-weight LLMs on WebArena, VisualWebArena, Online-Mind2Web, and WorkArena without fine-tuning or site adapters.
Inducing programmatic skills for agentic tasks
19 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
polarities: background 3
representative citing papers
The framework estimates the context-dependent marginal utility of candidate skills from reward gaps between matched base and skill-augmented rollouts, then uses these estimates to filter skills and co-train the policy as a generator.
SelSkill applies dual-granularity preference learning to selective skill-or-skip decisions, improving task success by 10.9 points and execution precision by 29.1 points on ALFWorld with Qwen3-8B.
SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
Multimodal skills retaining visual figures improve CUA benchmark scores by 8.3 points over text-only equivalents generated from the same source content.
SkillMigrator reduces LLM-action counts by 8-10% on WebArena and Mind2Web by transferring web skills via layout-matched transferable interaction patterns.
Catalogs ten patterns and synthesizes a four-layer reference architecture for skill harnessing in LLM agents, evaluated via cross-instantiation on eight systems.
An iterative framework lets LLMs learn procedural assessment skills for rubric construction, improving automated scoring on all ten ASAP-SAS items and often exceeding expert rubrics while showing cross-item transfer.
Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
CoM organizes memory fragments into evolving inference paths with adaptive truncation, delivering 7.5-10.4% accuracy gains on long-memory benchmarks at 2.7% of the token cost and 6% of the latency of complex alternatives.
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
DataCOPE uses verifier-guided contrastive distillation from agent trajectories to discover skills, yielding average gains of 9.71% on report-style and 32.30% on reasoning-style data analysis tasks across four model settings.
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
citing papers explorer
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
  Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.
- Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents
  CoM organizes memory fragments into evolving inference paths with adaptive truncation, delivering 7.5-10.4% accuracy gains on long-memory benchmarks at 2.7% of the token cost and 6% of the latency of complex alternatives.