Measuring Agents in Production

Alessandro Basile; Alexander Xiong; Daniel Kang; Dawn Song; Emmanuele Lacavalla; Emma Shen; Huanzhi Mao; Ion Stoica; Jared Quincy Davis; Joseph E. Gonzalez

arxiv: 2512.04123 · v4 · pith:S2OTNWDEnew · submitted 2025-12-02 · 💻 cs.CY · cs.AI · cs.LG · cs.SE

Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Koushik Sen, Dawn Song, Joseph E. Gonzalez, Ion Stoica, Matei Zaharia, Marquita Ellis

classification 💻 cs.CY cs.AI cs.LG cs.SE

keywords agents production across build development human measuring practitioners


LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 86 deployed systems practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and underexplored research avenues.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Do Evolutionary Coding Agents Evolve?
cs.NE 2026-05 unverdicted novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
cs.CL 2026-05 unverdicted novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
cs.AI 2026-05 unverdicted novelty 6.0

RAC adds a log-based safety net to AI agents via framework extensions, delivering 1.5-8X better latency and token use than LLM-based recovery on complex problems in τ-bench and REALM-Bench.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
cs.HC 2026-04 unverdicted novelty 6.0

A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

Don't Let AI Agents YOLO Your Files: Shifting Information and Control to Filesystems for Agent Safety and Autonomy
cs.OS 2026-04 unverdicted novelty 6.0

YoloFS is an agent-native filesystem that stages mutations for review, provides snapshots for agent self-correction, and uses progressive permissions to reduce user interruptions while matching baseline task success.

Security Considerations for Multi-agent Systems
cs.CR 2026-03 unverdicted novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

Echo: Learning from Experience Data via User-Driven Refinement
cs.AI 2026-05 unverdicted novelty 5.0

Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in production.

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
cs.AI 2026-05 unverdicted novelty 5.0

RAC is a log-based recovery paradigm implemented as an architectural extension to agent frameworks, achieving 1.5-8X better latency and token economy than LLM-based recovery on τ-bench and REALM-Bench.

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
cs.CV 2026-04 unverdicted novelty 5.0

Supervised fine-tuning with 0.1% labeled data outperforms all 60 tested prompt variants for CLIPSeg cloud segmentation on satellite imagery under domain shift.

Riemann-Bench: A Benchmark for Moonshot Mathematics
cs.AI 2026-04 conditional novelty 5.0

Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
cs.SE 2026-04 unverdicted novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.