The Rise and Potential of Large Language Model Based Agents: A Survey

Boyang Hong, Wei He, Wenxiang Chen, Xin Guo, Yiwen Ding, Zhiheng Xi · 2023 · cs.AI · arXiv 2309.07864

Canonical reference. 93% of citing Pith papers cite this work as background.

112 Pith papers citing it

Background 93% of classified citations

abstract

For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are artificial entities that sense their environment, make decisions, and take actions. Many efforts have been made to develop intelligent agents, but they mainly focus on advancement in algorithms or training strategies to enhance specific capabilities or performance on particular tasks. Actually, what the community lacks is a general and powerful model to serve as a starting point for designing AI agents that can adapt to diverse scenarios. Due to the versatile capabilities they demonstrate, large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI), offering hope for building general AI agents. Many researchers have leveraged LLMs as the foundation to build AI agents and have achieved significant progress. In this paper, we perform a comprehensive survey on LLM-based agents. We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. Building upon this, we present a general framework for LLM-based agents, comprising three main components: brain, perception, and action, and the framework can be tailored for different applications. Subsequently, we explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation. Following this, we delve into agent societies, exploring the behavior and personality of LLM-based agents, the social phenomena that emerge from an agent society, and the insights they offer for human society. Finally, we discuss several key topics and open problems within the field. A repository of related papers is available at https://github.com/WooooDyy/LLM-Agent-Paper-List.

citation-role summary

background 26 · baseline 1 · method 1

citation-polarity summary

background 26 · baseline 1 · use method 1

representative citing papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Revisable by Design: A Theory of Streaming LLM Agent Execution

cs.LG · 2026-04-25 · unverdicted · novelty 8.0

LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

HECATE generates and validates ten complexity metrics (seven new) for LLM apps by treating prompts as behavioral specifications and filtering against maintenance activity from version history, showing prompt complexity as an independent factor.

Lightfall: An API-first, LLM-addressable control platform for synchrotron beamlines

physics.ins-det · 2026-06-04 · unverdicted · novelty 7.0

Lightfall provides an API-first, LLM-addressable control platform for synchrotron beamlines enabling natural language extensions by scientists through plugin skills.

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

AdaPlanBench introduces a multi-turn benchmark where LLM agents must adapt plans under progressively revealed dual constraints, with top models reaching only 67.75% accuracy.

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Introduces Sentinel Challenge benchmark and CoSaR framework for cooperative spatial reasoning and planning among 3-5 decentralized embodied agents across 14 city-scale scenes.

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

cs.CY · 2026-05-21 · unverdicted · novelty 7.0

Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

A multi-dimensional behavioral scoring system using LLM judges evaluates agentic stock predictors and feeds scores into closed-loop RL to improve one-day MAPE by 11.5% on held-out data.

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Dr.Sai: An agentic AI for real-world physics analysis at BESIII

hep-ex · 2026-04-24 · unverdicted · novelty 7.0

Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.

Feedback-Driven Execution for LLM-Based Binary Analysis

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

cs.CR · 2025-03-30 · unverdicted · novelty 7.0

MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.

Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting

cs.SE · 2026-06-28 · unverdicted · novelty 6.0

Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.

AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming

cs.SE · 2026-06-23 · unverdicted · novelty 6.0

AutoSpec evolves safety rules for LLM agents via ILP-guided counterexample synthesis, raising F1 to 0.98/0.93 on code and embodied domains with 94% false-positive reduction.

Uncertainty Decomposition for Clarification Seeking in LLM Agents

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

A prompt-based uncertainty decomposition separates action confidence from request uncertainty to enable clarification seeking in LLM agents, yielding F1 gains of 73% and 36% over baselines on two new underspecified benchmarks across five models.

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

q-fin.RM · 2026-06-16 · unverdicted · novelty 6.0

Proposes a POMDP framework for validating belief-state, forecast, and policy components of agentic AI and demonstrates it via a portfolio management case study.

LLM-as-Code: Agentic Programming for Agent Harness

cs.AI · 2026-06-14 · unverdicted · novelty 6.0

Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.

The Illusion of Multi-Agent Advantage

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

Automatically generated multi-agent systems underperform CoT-SC on benchmarks and a new diagnostic dataset, exposing architectural bloat that fails to deliver functional utility.

citing papers explorer

Showing 50 of 112 citing papers.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps cs.AI · 2026-05-17 · unverdicted · none · ref 26 · internal anchor
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
Revisable by Design: A Theory of Streaming LLM Agent Execution cs.LG · 2026-04-25 · unverdicted · none · ref 11 · internal anchor
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less completed work.
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents cs.CY · 2026-04-11 · accept · none · ref 54 · internal anchor
This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.
Rethinking Complexity Metrics for LLM-Integrated Applications: Beyond Source Code cs.AI · 2026-07-02 · unverdicted · none · ref 2 · internal anchor
HECATE generates and validates ten complexity metrics (seven new) for LLM apps by treating prompts as behavioral specifications and filtering against maintenance activity from version history, showing prompt complexity as an independent factor.
Lightfall: An API-first, LLM-addressable control platform for synchrotron beamlines physics.ins-det · 2026-06-04 · unverdicted · none · ref 20 · internal anchor
Lightfall provides an API-first, LLM-addressable control platform for synchrotron beamlines enabling natural language extensions by scientists through plugin skills.
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints cs.CL · 2026-06-04 · unverdicted · none · ref 5 · internal anchor
AdaPlanBench introduces a multi-turn benchmark where LLM agents must adapt plans under progressively revealed dual constraints, with top models reaching only 67.75% accuracy.
Sentinel: Embodied Cooperative Spatial Reasoning and Planning cs.CV · 2026-05-25 · unverdicted · none · ref 60 · internal anchor
Introduces Sentinel Challenge benchmark and CoSaR framework for cooperative spatial reasoning and planning among 3-5 decentralized embodied agents across 14 city-scale scenes.
Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good cs.CY · 2026-05-21 · unverdicted · none · ref 139 · internal anchor
Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 49 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 46 · 2 links · internal anchor
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback cs.LG · 2026-05-07 · unverdicted · none · ref 2 · 3 links · internal anchor
A multi-dimensional behavioral scoring system using LLM judges evaluates agentic stock predictors and feeds scores into closed-loop RL to improve one-day MAPE by 11.5% on held-out data.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL · 2026-05-02 · unverdicted · none · ref 35 · internal anchor
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory cs.CL · 2026-04-29 · unverdicted · none · ref 23 · internal anchor
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 9 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Dr.Sai: An agentic AI for real-world physics analysis at BESIII hep-ex · 2026-04-24 · unverdicted · none · ref 8 · internal anchor
Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.
Feedback-Driven Execution for LLM-Based Binary Analysis cs.CR · 2026-04-16 · unverdicted · none · ref 46 · internal anchor
FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.
SAGE: A Service Agent Graph-guided Evaluation Benchmark cs.AI · 2026-04-10 · unverdicted · none · ref 55 · internal anchor
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions cs.CR · 2025-03-30 · unverdicted · none · ref 72 · internal anchor
MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.
Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting cs.SE · 2026-06-28 · unverdicted · none · ref 50 · internal anchor
Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.
AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming cs.SE · 2026-06-23 · unverdicted · none · ref 36 · internal anchor
AutoSpec evolves safety rules for LLM agents via ILP-guided counterexample synthesis, raising F1 to 0.98/0.93 on code and embodied domains with 94% false-positive reduction.
Uncertainty Decomposition for Clarification Seeking in LLM Agents cs.AI · 2026-06-17 · unverdicted · none · ref 7 · internal anchor
A prompt-based uncertainty decomposition separates action confidence from request uncertainty to enable clarification seeking in LLM agents, yielding F1 gains of 73% and 36% over baselines on two new underspecified benchmarks across five models.
Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation q-fin.RM · 2026-06-16 · unverdicted · none · ref 39 · internal anchor
Proposes a POMDP framework for validating belief-state, forecast, and policy components of agentic AI and demonstrates it via a portfolio management case study.
LLM-as-Code: Agentic Programming for Agent Harness cs.AI · 2026-06-14 · unverdicted · none · ref 22 · internal anchor
Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.
The Illusion of Multi-Agent Advantage cs.AI · 2026-06-11 · unverdicted · none · ref 38 · internal anchor
Automatically generated multi-agent systems underperform CoT-SC on benchmarks and a new diagnostic dataset, exposing architectural bloat that fails to deliver functional utility.
Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents cs.CL · 2026-06-10 · unverdicted · none · ref 16 · internal anchor
Autopilot enforces verifiable termination via a gated FSM scheduler and hard floor, proving that termination implies goal achievement under gate soundness, floor enforcement, and plan coverage, while cutting fabrication rates to 0.95% vs. 8-25% in baselines on 3150 paired cells including SWE-bench L
MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents cs.CL · 2026-06-09 · unverdicted · none · ref 10 · internal anchor
A shared polarity-flipping encoding subspace in LLM residual streams supports covert encoding and enables real-time detection of agentic data exfiltration via internal probes.
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification cs.AI · 2026-06-02 · unverdicted · none · ref 23 · internal anchor
The authors introduce a three-part ontology-based verification system for AI agents that generates regulatory and adversarial test scenarios and issues machine-verifiable trust certificates, with pilot results indicating improved coverage over baselines in four industries.
Strengthening Polymorphic Prompt Assembling: Dynamic Separator Generation Against Emerging Prompt Injection Attacks cs.CR · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Dynamic separator generation via domain-separated SHA-256 reduces attack success rate from 0.88 to 0.38 and eliminates leakage exposure in evaluations against 16 payloads on Llama and DeepSeek models.
ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense cs.CR · 2026-05-18 · unverdicted · none · ref 38 · internal anchor
ESLD extracts safety signals directly from the latent space of any guard model to enable faster and more accurate prompt-injection detection without retraining.
CHAL: Council of Hierarchical Agentic Language cs.AI · 2026-05-12 · unverdicted · none · ref 170 · internal anchor
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World cs.AI · 2026-05-11 · unverdicted · none · ref 43 · internal anchor
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 43 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem cs.SE · 2026-05-08 · unverdicted · none · ref 60 · internal anchor
MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
SOD: Step-wise On-policy Distillation for Small Language Model Agents cs.CL · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
LoopTrap: Termination Poisoning Attacks on LLM Agents cs.CR · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Agentic Retrieval-Augmented Generation for Financial Document Question Answering cs.AI · 2026-05-06 · unverdicted · none · ref 34 · internal anchor
FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9.32 points.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 48 · 2 links · internal anchor
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 29 · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
LLM-Steered Power Allocation for Parallel QPSK-AWGN Channels cs.IT · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
LLM interprets natural-language policies to steer a projected-gradient power allocator in 8 parallel QPSK-AWGN channels, producing policy-dependent allocations and 60% lower mutual-information spread after abrupt channel reversals compared with the optimizer alone.
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis cs.AI · 2026-04-17 · unverdicted · none · ref 32 · internal anchor
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models cs.AI · 2026-04-14 · unverdicted · none · ref 2 · internal anchor
Large language models display three universal scale-dependent regimes of behavior—stable, chaotic, and signal-dominated—driven by floating-point rounding errors that produce an avalanche effect in early layers.
Policy-Invisible Violations in LLM-Based Agents cs.AI · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
Generative AI Agent Empowered Power Allocation for HAP Propulsion and Communication Systems cs.NI · 2026-04-10 · unverdicted · none · ref 29 · internal anchor
A generative AI agent creates a realistic HAP propulsion power model including aerodynamic interference and enables a Q3E beamforming algorithm that improves QoS and energy efficiency.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw cs.CR · 2026-04-06 · conditional · none · ref 18 · internal anchor
Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 51 · internal anchor
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Mobile GUI Agents under Real-world Threats: Are We There Yet? cs.CR · 2025-07-06 · conditional · none · ref 31 · internal anchor
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models cs.CL · 2025-05-12 · unverdicted · none · ref 6 · internal anchor
MulDimIF introduces a multi-dimensional constraint framework and generation pipeline that reveals sharp performance drops in LLMs as instruction complexity rises and shows targeted training gains from attention module updates.
AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society cs.SI · 2025-02-12 · unverdicted · none · ref 106 · internal anchor
AgentSociety is a large-scale LLM agent-based social simulator validated on polarization, UBI, disasters, and sustainability issues with alignment to real experiments.
ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents cs.AI · 2024-12-18 · unverdicted · none · ref 26 · internal anchor
ChinaTravel is a benchmark with sandbox, compositional DSL, and 1154-human dataset for testing language agents on open-ended travel planning constraint satisfaction.
Retrieval-Augmented Generation for Natural Language Processing: A Survey cs.CL · 2024-07-18 · accept · none · ref 182 · internal anchor
The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.
