Canonical reference

Title resolution pending

· 2025 · arXiv 2505.08638

Canonical reference. 83% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background 83% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 5 baseline 1

citation-polarity summary

background 5 baseline 1

representative citing papers

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

cs.SE · 2026-04-25 · unverdicted · novelty 8.0

CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

Holistic Evaluation and Failure Diagnosis of AI Agents

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

cs.AI · 2026-04-20 · unverdicted · novelty 7.0

AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

cs.SE · 2026-01-21 · unverdicted · novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

cs.AI · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

Insights Generator is a multi-agent system that generates evidence-backed natural-language insights characterizing systematic patterns across corpora of LLM agent execution traces.

PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

Process-Centric Analysis of Agentic Software Systems

cs.SE · 2025-12-02 · unverdicted · novelty 6.0

Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

cs.MA · 2026-04-19 · unverdicted · novelty 5.0

ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

A pragmatic approach to regulating AI agents

cs.CY · 2026-04-16 · unverdicted · novelty 5.0

AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

A Survey of Context Engineering for Large Language Models

cs.CL · 2025-07-17 · accept · novelty 4.0

The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13

citing papers explorer

Showing 14 of 14 citing papers.

CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend cs.SE · 2026-04-25 · unverdicted · none · ref 38
CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents cs.CL · 2026-05-21 · unverdicted · none · ref 23
Agentic CLEAR automates multi-level evaluation of LLM agents, generating textual insights at system, trace, and node granularity that align with human annotations and predict task success.
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems cs.AI · 2026-05-14 · unverdicted · none · ref 28 · 2 links
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
Holistic Evaluation and Failure Diagnosis of AI Agents cs.AI · 2026-05-14 · unverdicted · none · ref 7
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation cs.AI · 2026-04-20 · unverdicted · none · ref 1
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling cs.SE · 2026-01-21 · unverdicted · none · ref 13
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents cs.AI · 2026-05-20 · unverdicted · none · ref 7 · 2 links
Insights Generator is a multi-agent system that generates evidence-backed natural-language insights characterizing systematic patterns across corpora of LLM agent execution traces.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement cs.AI · 2026-05-11 · unverdicted · none · ref 4
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 23
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 15
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems cs.MA · 2026-04-19 · unverdicted · none · ref 49
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
A pragmatic approach to regulating AI agents cs.CY · 2026-04-16 · unverdicted · none · ref 6
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
A Survey of Context Engineering for Large Language Models cs.CL · 2025-07-17 · accept · none · ref 220
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · unreviewed · ref 9

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer