hub

Measuring agents in production

· 2025 · cs.CY · arXiv 2512.04123

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

open full Pith review browse 12 citing papers arXiv PDF

abstract

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

cs.AI · 2026-05-25 · unverdicted · novelty 6.0

Anchor generates consistent long-horizon agent tasks from parametric constraint programs, yielding ERP-Bench of 300 ERP tasks where frontier models reach optimal solutions in 17.4% of trials.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

cs.HC · 2026-04-20 · unverdicted · novelty 6.0

A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy

cs.OS · 2026-04-15 · unverdicted · novelty 6.0

YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.

Security Considerations for Multi-agent Systems

cs.CR · 2026-03-09 · unverdicted · novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

Echo: Learning from Experience Data via User-Driven Refinement

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

cs.AI · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.

A Theory-Guided LLM Pedagogical Agent for STEM+C Scaffolding Without Over-Reliance

cs.MA · 2026-05-28 · unverdicted · novelty 4.0

Copa is a theory-guided multimodal LLM agent that supports high school computational modeling through adaptive feedback, shown in a 33-dyad study to increase student confidence and conceptual verbalization without fostering dependence.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

cs.SE · 2026-04-06 · unverdicted · novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

Riemann-Bench: A Benchmark for Moonshot Mathematics

cs.AI · 2026-04-08

citing papers explorer

Showing 1 of 1 citing paper after filters.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots cs.HC · 2026-04-20 · unverdicted · none · ref 44 · internal anchor
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

Measuring agents in production

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer