How can we assess human-agent interactions? Case studies in software agent design

Aditya Bharat Soni; Ameet Talwalkar; Calvin Smith; Graham Neubig; Hoang H. Tran; Juan Michelini; Rohit Malhotra; Valerie Chen; Xingyao Wang; Xuhui Zhou

arxiv: 2510.09801 · v3 · pith:MU5C35BCnew · submitted 2025-10-10 · 💻 cs.AI

classification 💻 cs.AI

keywords: agent, pulse, design, satisfaction, software, agents, designs, evaluation

While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation and so fail to represent the collaborative nature of real-world use cases. In this paper, we take two major steps towards the rigorous assessment of human-agent interactions. First, we propose PULSE, a framework for more efficient human-centric evaluation of agent designs. It comprises three parts: collecting user feedback, training an ML model to predict user satisfaction, and computing results by combining human satisfaction ratings with model-generated pseudo-labels. Second, we deploy PULSE in software engineering, one of the highest-impact real-world domains for LLM agents, via a large-scale web platform built around the open-source agent OpenHands. Across 15k users, we evaluate how three agent design decisions impact developer satisfaction rates. We also show how PULSE can lead to more robust conclusions about agent design, reducing confidence intervals by 40% compared to a standard A/B test. Finally, we find substantial discrepancies between in-the-wild results and benchmark performance (e.g., the anti-correlation between claude-sonnet-4 and gpt-5), underscoring the limitations of benchmark-driven evaluation. Our framework PULSE provides guidance for future evaluations, and our findings identify opportunities for better software agent designs.
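The abstract does not say how human ratings and pseudo-labels are combined, so the following is only a minimal sketch under a stated assumption: a prediction-powered-style mean estimator, in which the model's average prediction on unrated sessions is corrected by its average error on the rated ones. The function names, the toy data, and the normal-approximation interval are all hypothetical illustrations, not the authors' implementation.

```python
import math
from statistics import mean, variance


def human_only_estimate(human_ratings, z=1.96):
    """Baseline: mean satisfaction from human ratings alone, with a normal-approximation CI."""
    n = len(human_ratings)
    est = mean(human_ratings)
    se = math.sqrt(variance(human_ratings) / n)
    return est, (est - z * se, est + z * se)


def combined_estimate(human_ratings, preds_on_rated, preds_on_unrated, z=1.96):
    """Assumed combination rule (not confirmed by the source):
    mean pseudo-label on unrated sessions + mean (human rating - prediction) on rated sessions.
    """
    n, big_n = len(human_ratings), len(preds_on_unrated)
    if n != len(preds_on_rated):
        raise ValueError("each human rating needs a matching model prediction")
    if n < 2 or big_n < 2:
        raise ValueError("need at least two rated and two unrated sessions")
    # Residuals measure how far the model is off where ground truth exists.
    residuals = [y - p for y, p in zip(human_ratings, preds_on_rated)]
    est = mean(preds_on_unrated) + mean(residuals)
    # Two independent sources of sampling noise: pseudo-labels and residuals.
    se = math.sqrt(variance(preds_on_unrated) / big_n + variance(residuals) / n)
    return est, (est - z * se, est + z * se)


# Toy data (hypothetical): 8 rated sessions, 200 unrated sessions with pseudo-labels.
human = [1, 0, 1, 1, 0, 1, 0, 1]
preds_rated = [0.9, 0.1, 0.8, 0.9, 0.2, 0.8, 0.1, 0.9]
preds_unrated = [0.9, 0.1, 0.8, 0.2] * 50

baseline, baseline_ci = human_only_estimate(human)
combined, combined_ci = combined_estimate(human, preds_rated, preds_unrated)
baseline_width = baseline_ci[1] - baseline_ci[0]
combined_width = combined_ci[1] - combined_ci[0]
print(f"human-only CI width: {baseline_width:.3f}")
print(f"combined CI width:   {combined_width:.3f}")
```

In this toy setup the combined interval is narrower because the predictor tracks the human ratings closely, so the residuals vary less than the ratings themselves. With a poor predictor the residual variance grows and the advantage shrinks or disappears.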

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Characterizing initial human-AI proof formalization workflows
cs.AI · 2026-06 · unverdicted · novelty 5.0

A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.