Recognition: unknown
SWE-chat: Coding Agent Interactions From Real Users in the Wild
Pith reviewed 2026-05-09 23:38 UTC · model grok-4.3
The pith
In real developer interactions, coding agents see only 44% of their code reach final commits and face user pushback in 44% of turns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analysis of the dataset reveals bimodal coding patterns: agents author virtually all committed code in 41 percent of sessions, while humans write all code themselves in 23 percent. Just 44 percent of agent-produced code survives into user commits, agent-written code introduces more security vulnerabilities than human-authored code, and users push back against agent outputs, through corrections, failure reports, and interruptions, in 44 percent of all turns.
What carries the argument
The dataset of real coding agent sessions from public repositories, which includes complete interaction traces and accurate attribution of which code was written by the human versus the agent.
Load-bearing premise
The sessions obtained through automated collection from public repositories form a representative sample of real developer and agent interactions, with reliable attribution of code authorship.
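To make the premise concrete, here is a minimal sketch of what "survival into user commits" could mean once per-line authorship labels exist. The data layout and function are hypothetical illustrations, not the paper's pipeline.

```python
# Hypothetical sketch: estimating what fraction of agent-authored lines
# survives into the user's final commit. Assumes each session exposes the
# set of lines the agent wrote and the set of lines in the committed diff;
# neither structure is taken from the SWE-chat release.

def survival_rate(agent_lines: set[str], committed_lines: set[str]) -> float:
    """Fraction of agent-authored lines that appear in the final commit."""
    if not agent_lines:
        return 0.0
    return len(agent_lines & committed_lines) / len(agent_lines)

# Toy usage: 4 of 6 agent-written lines survive the user's edits.
agent = {"def add(a, b):", "    return a + b", "def sub(a, b):",
         "    return a - b", "print(add(1, 2))", "print(sub(3, 1))"}
commit = {"def add(a, b):", "    return a + b", "def sub(a, b):",
          "    return a - b"}
print(f"survival: {survival_rate(agent, commit):.0%}")  # survival: 67%
```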
What would settle it
A manual review finding that a significant portion of the collected sessions carries incorrect human-versus-agent code labels, or a controlled study of private developer workflows observing substantially different survival rates, would undermine the central claims.
read the original abstract
AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE-chat, a large-scale 'living' dataset of 6,000 real-world coding agent sessions (63k+ user prompts, 355k+ tool calls) automatically collected from public open-source repositories. It reports an empirical characterization of usage patterns, including bimodal authorship (41% of sessions agent-dominated, 23% human-only), that only 44% of agent-produced code survives into user commits, that agent-written code introduces more security vulnerabilities than human code, and that users push back via corrections/failures/interruptions in 44% of turns. The dataset includes interaction traces with human-vs-agent code attribution.
Significance. If the collection pipeline and authorship attribution prove reliable, SWE-chat would be a significant contribution as the first large-scale observational dataset of AI coding agents in natural developer workflows. It moves the field beyond curated benchmarks by enabling evidence-based study of survival rates, failure modes, and pushback behaviors. The automated, ongoing collection pipeline and scale are notable strengths.
major comments (3)
- [Dataset Construction] The central claims (44% agent-code survival rate, higher vulnerabilities in agent code, 44% user pushback) rest entirely on accurate per-line/per-commit human-vs-agent authorship attribution, yet the dataset construction section provides no quantitative validation of this labeling (e.g., no manual audit sample size, inter-annotator agreement, or error-rate estimates). Without such validation, the reported differences could be artifacts of heuristic failures when users interleave edits or refactor.
- [Empirical Analysis / Vulnerability subsection] The vulnerability comparison claim lacks any description of the scanning methodology, tools employed, false-positive handling, or controls for confounding factors such as code complexity or project type. This is load-bearing for the assertion that 'agent-written code introduces more security vulnerabilities than code authored by humans.'
- [Introduction and Dataset Collection] The representativeness of the 6,000 sessions as 'real users in the wild' is asserted but not supported by analysis of selection biases in the automated discovery pipeline from public repositories (e.g., project popularity, language distribution, or session-length filters). This affects generalizability of the bimodal patterns and 44% statistics.
minor comments (2)
- [Abstract] The abstract states precise percentages (44%, 41%, 23%) without cross-references to the specific tables, figures, or sections containing the underlying counts and statistical details.
- [Usage Patterns] Notation for 'vibe coding' and session classification criteria should be defined more explicitly when first introduced to avoid ambiguity in the bimodal usage analysis.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Dataset Construction] The central claims (44% agent-code survival rate, higher vulnerabilities in agent code, 44% user pushback) rest entirely on accurate per-line/per-commit human-vs-agent authorship attribution, yet the dataset construction section provides no quantitative validation of this labeling (e.g., no manual audit sample size, inter-annotator agreement, or error-rate estimates). Without such validation, the reported differences could be artifacts of heuristic failures when users interleave edits or refactor.
Authors: We agree that the absence of quantitative validation for the authorship attribution heuristic is a limitation in the current manuscript. The heuristic relies on git blame and commit history to attribute lines, but interleaved edits could introduce errors. In the revised version, we will add a new subsection under Dataset Construction reporting a manual audit of 200 randomly sampled sessions. Two independent annotators will label authorship per commit, and we will report inter-annotator agreement (Cohen's kappa), per-line error rates, and sensitivity analysis on sessions with high interleaving. This will directly support the reliability of the 44% survival rate and related statistics. revision: yes
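A minimal sketch of the agreement computation described in this response, assuming the audit produces two parallel lists of per-commit authorship labels; the label set and example data are assumptions for illustration.

```python
# Hypothetical audit sketch: Cohen's kappa between two annotators who each
# label audited commits as "agent", "human", or "mixed". The labels and
# sample data are illustrative, not the paper's audit results.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["agent", "agent", "human", "mixed", "agent", "human"]
annotator_b = ["agent", "human", "human", "mixed", "agent", "human"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
disagreements = sum(a != b for a, b in zip(annotator_a, annotator_b))
print(f"Cohen's kappa: {kappa:.2f}, disagreements: {disagreements}/{len(annotator_a)}")
```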
-
Referee: [Empirical Analysis / Vulnerability subsection] The vulnerability comparison claim lacks any description of the scanning methodology, tools employed, false-positive handling, or controls for confounding factors such as code complexity or project type. This is load-bearing for the assertion that 'agent-written code introduces more security vulnerabilities than code authored by humans.'
Authors: We acknowledge that the vulnerability subsection is insufficiently detailed. The manuscript reports the comparative finding but omits the scanner, false-positive mitigation, and controls. In the revision, we will expand this subsection to specify the static analysis tool, the process for handling false positives (including manual review of a 10% sample), and controls such as regression adjustment for cyclomatic complexity, lines of code, and project-level fixed effects. We will also report the distribution of vulnerability categories to allow readers to assess the result. revision: yes
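As a sketch of the regression adjustment outlined in this response, the following fits a Poisson model of vulnerability counts on an agent-authorship indicator with complexity, size, and project fixed-effect controls. The column names and synthetic data are assumptions, not the paper's analysis.

```python
# Hypothetical sketch: vulnerability counts in agent- vs human-authored
# files, adjusted for complexity, lines of code, and project fixed effects.
# All fields and the synthetic outcome are assumed for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "is_agent":   rng.integers(0, 2, n),
    "complexity": rng.integers(1, 20, n),
    "loc":        rng.integers(10, 500, n),
    "project":    rng.choice(["a", "b", "c", "d"], n),
})
# Synthetic outcome with an assumed agent effect, purely for illustration.
rate = np.exp(-3 + 0.4 * df["is_agent"] + 0.05 * df["complexity"] + 0.002 * df["loc"])
df["vuln_count"] = rng.poisson(rate)

model = smf.poisson(
    "vuln_count ~ is_agent + complexity + loc + C(project)", data=df
).fit(disp=False)
print(model.params.round(3))
```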
-
Referee: [Introduction and Dataset Collection] The representativeness of the 6,000 sessions as 'real users in the wild' is asserted but not supported by analysis of selection biases in the automated discovery pipeline from public repositories (e.g., project popularity, language distribution, or session-length filters). This affects generalizability of the bimodal patterns and 44% statistics.
Authors: The collection pipeline discovers sessions from public GitHub repositories via automated search, which necessarily favors projects with visible agent usage and certain languages. The manuscript emphasizes the scale and living nature of the dataset but does not quantify these biases. In the revised version, we will add descriptive statistics on language distribution, project popularity (e.g., star counts), and session-length filters, plus a dedicated limitations paragraph discussing implications for generalizability of the bimodal authorship and 44% figures. We maintain that the observational data from actual developer workflows remains a core contribution despite these selection effects. revision: partial
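A minimal sketch of the proposed descriptive bias statistics, assuming per-session metadata records a primary language, repository star count, and session length; these field names are assumptions, not the released schema.

```python
# Hypothetical sketch: summarizing selection-related covariates of the
# collected sessions (language mix, repository popularity, session length).
# The metadata fields are assumed, not drawn from the SWE-chat release.
import pandas as pd

sessions = pd.DataFrame({
    "language":   ["Python", "TypeScript", "Python", "Go", "Python", "Rust"],
    "repo_stars": [12, 3400, 85, 7, 560, 41],
    "n_turns":    [4, 22, 9, 3, 15, 6],
})

print(sessions["language"].value_counts(normalize=True).round(2))
print(sessions["repo_stars"].describe(percentiles=[0.25, 0.5, 0.75]))
print("median session length (turns):", sessions["n_turns"].median())
```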
Circularity Check
Observational dataset collection and reporting with no derivations or fitted predictions
full rationale
The paper collects sessions from public repositories and reports direct empirical statistics (e.g., 44% agent-code survival, 44% user pushback, bimodal coding patterns). No equations, models, parameters, or predictions are defined or derived; all claims are descriptive observations from the raw interaction traces. Attribution of human vs. agent code is presented as part of the data pipeline without any self-referential fitting or reduction to prior results. This is a standard self-contained observational study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Sessions automatically discovered from public repositories form a representative and unbiased sample of real coding-agent usage.
- domain assumption: Code authorship between human and agent can be attributed accurately from commit histories and interaction traces.
Forward citations
Cited by 4 Pith papers
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization
SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with z...
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
RECAP captures, replays, and analyzes AI-assisted programming sessions by linking prompts, edits, and developer actions in a single timeline.