Evaluation and benchmarking of LLM agents: A survey

Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip · 2025

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.

Stateful Agent Backdoor

cs.CR · 2026-05-07 · unverdicted · novelty 7.0

A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves 80-95% attack success rate on four models.

Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

cs.DC · 2026-04-27 · unverdicted · novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.

Green-Red Watermarking for Recommender Systems

cs.IR · 2026-04-26 · unverdicted · novelty 7.0

GREW uses a secret-key-driven green-red item partition and three ranking-integrated modules to embed verifiable watermarks in recommender systems that resist extraction attacks without data injection.

Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percentage points.

CL-bench Life: Can Language Models Learn from Real-Life Context?

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

citing papers explorer

Showing 6 of 6 citing papers.

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents · cs.CL · 2026-05-07 · unverdicted · none · ref 23
Stateful Agent Backdoor · cs.CR · 2026-05-07 · unverdicted · none · ref 23
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs · cs.DC · 2026-04-27 · unverdicted · none · ref 40
Green-Red Watermarking for Recommender Systems · cs.IR · 2026-04-26 · unverdicted · none · ref 41
Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems · cs.AI · 2026-05-07 · unverdicted · none · ref 13
CL-bench Life: Can Language Models Learn from Real-Life Context? · cs.CL · 2026-04-29 · unverdicted · none · ref 44
