Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
hub Canonical reference
Frontiers Comput
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 16polarities
background 16representative citing papers
Agentic Workflow Injection is a new injection vulnerability class in LLM-augmented GitHub Actions, with two patterns (P2A and P2S) detected via the TaintAWI tool yielding 496 confirmed exploitable instances across 13,392 workflows.
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Infinite agentic loops are a distinct failure mode in LLM agents arising from unbounded feedback paths, and IAL-Scan detects them via framework-independent static analysis with 91.9% precision on 6,549 repositories.
AgentRivet applies commercial LLMs in an autonomous workflow to extract physics details from ATLAS and CMS papers and generate Rivet routines, achieving few syntax errors but occasional physics implementation issues on two test cases.
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
TianJi-Environ is a WRF-Chem-based multi-agent AI framework for autonomous validation of atmospheric chemistry mechanisms through executable experiments and evidence assessment.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
LLM agents progressively recruit deeper layers with stronger long-range dependencies and correction-dominant residual updates during sequential planning, showing a construction-refinement gap unlike static tasks.
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.
SkCC introduces a typed intermediate representation and compiler pipeline to make LLM agent skills portable across frameworks and enforce security constraints before deployment.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.
citing papers explorer
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
-
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
-
Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
LLM agents progressively recruit deeper layers with stronger long-range dependencies and correction-dominant residual updates during sequential planning, showing a construction-refinement gap unlike static tasks.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
Causal state binding predicts action control in language agents
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
-
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
-
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.
-
Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.
-
PACE: A Proxy for Agentic Capability Evaluation
PACE builds proxy benchmarks from non-agentic instances via relevance and global selection plus regression to predict agentic scores with MAE under 4%, Spearman correlation above 0.80, and 85% ranking accuracy at under 1% cost.
-
Socratic agents for autonomous scientific discovery in high-dimensional physical systems
AHOIS is a Socratic multi-agent AI that autonomously discovers and validates a random-interference encoding strategy for multimode fiber optics, achieving 76.97% MNIST and 83.17% Fashion-MNIST accuracy with 16x16 measurements of effective rank 56.9.
-
LLM-as-Code: Agentic Programming for Agent Harness
Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.
-
TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents
TravelEval is a new benchmark with a six-dimensional evaluation framework, realistic data sandbox, and simulation-based global assessment for LLM-powered travel planning agents.
-
GraphMind: From Operational Traces to Self-Evolving Workflow Automation
GraphMind builds and evolves action-centric workflow graphs from traces, navigates them via multi-agent LLM reasoning, and adapts via ATR, outperforming baselines on 93 incidents with 8x less context and 26% lower hallucination in production deployment.
-
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.
-
Atomic Task Graph: A Unified Framework for Agentic Planning and Execution
ATG maintains explicit DAGs of subtasks to enable dependency tracking, parallel execution, and localized repair in LLM agents, outperforming baselines on three benchmarks with 7B-8B models.
-
Linguistic Firewall: Geometry as Defense in Multi-Agent Systems Routing
ANTAP routes tasks by actively testing agent competencies, distilling results into behavioral operators in semantic space, and using non-textual projection to achieve near-zero attack success rate on description-based injections.
-
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
-
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
-
Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)
The paper defines a four-dimensional formal framework for agentic KG affordances and derives the Agentic Affordance Profile (AAP) as a semantic layer above VoID and DCAT for principled KG selection, composition, and failure diagnosis.
-
The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems
Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
JTPRO co-optimizes prompts and tool descriptions via reflection to raise overall success rate by 5-20% over baselines on multi-tool benchmarks.
-
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models
Evaluates multimodal foundation models as agents for power distribution defect detection across perception, reasoning, and tool usage using a custom benchmark.
-
Experiments in Agentic AI for Science
Two agentic AI systems, DeepTS/DeepCollector and DeepScribe, are built on a Local Body Remote Brain setup to automate dataset curation and physics lecture analysis for scientific workflows.