arxiv: 2605.07073 · v1 · submitted 2026-05-08 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

Yubin Kim , Chanwoo Park , Taehan Kim , Eugene Park , Samuel Schmidgall , Salman Rahman , Chunjong Park , Cynthia Breazeal

show 4 more authors

Xin Liu Hamid Palangi Hae Won Park Daniel McDuff

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent coordinationrole separationbenchmarkoperating system enforcementverifier approvalagent teamshuman-agent interactiontask decomposition

0 comments

The pith

Prompt-only agent teams match enforced-role teams on pass rates but violate role boundaries 3.6 times more often.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to check whether multi-agent systems actually divide labor or whether one agent simply performs multiple roles when separation is described only in prompts. It enforces three distinct roles through operating-system access controls so that the planner cannot edit files, the executor cannot see the full requirements, and the verifier cannot both edit and certify. Across hundreds of tasks the two setups produce statistically equivalent success rates, yet the unenforced version shows far more boundary violations and the verifier approves nearly half of the work that later fails objective tests. Removing the verifier entirely raises average partial scores, and teams improve results mainly on tasks where a lone agent already struggles. A parallel human study under identical rules finds that people paired with agents default to quick approvals rather than sustained coordination.

Core claim

TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles using operating-system controls so that no role can perform all three functions. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates on 931 instances, but prompt-only runs produce 3.6 times more cases of the verifier attempting to edit the executor's code. Verifiers approve 49 percent of submissions that fail the deterministic grader, an ablation without the verifier improves mean partial score, and team value is conditional on single-agent difficulty. A 40-session human study under the same separation reveals interaction styles

What carries the argument

TeamBench, a collection of 851 task templates that assigns Planner, Executor, and Verifier roles with operating-system-enforced limits on what each role can read, write, or certify.

Load-bearing premise

The chosen tasks genuinely require separate contributions from each role and that operating-system enforcement captures the coordination problems that appear in real multi-agent deployments.

What would settle it

Measure pass rates and violation counts on the identical 931 instances once with full operating-system role separation and once with all roles given complete access, then test whether the difference exceeds what random variation would produce.

Figures

Figures reproduced from arXiv: 2605.07073 by Chanwoo Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Eugene Park, Hae Won Park, Hamid Palangi, Salman Rahman, Samuel Schmidgall, Taehan Kim, Xin Liu, Yubin Kim.

**Figure 1.** Figure 1: TEAMBENCH evaluates agents under enforced role separation. The Planner reads the full requirements, the Executor edits the workspace, and the Verifier issues the final verdict. Sandboxes restrict which files each role can read or write. The example shows a role failure, where the Verifier accepts an implementation that violates a Planner constraint. We study this question by enforcing role separation with … view at source ↗

**Figure 2.** Figure 2: TEAMBENCH composition. (a) Source distribution by origin class. (b) Domain distribution across the 851 task templates. (c) Difficulty distribution over the 804 templates with grader-check counts. The remaining 47 templates are Unrated due to the grader-check count was not available. 4. Coordination behavior beyond pass rate. Team value depends on Solo capability, prompt-only roles hide more Verifier code-e… view at source ↗

**Figure 3.** Figure 3: TEAMBENCH leaderboard. Rows are sorted by max(Solo, Team) descending so the row order tracks each model’s best demonstrated capability. The bar shows both Solo (solid family color) and Team (hatched), and the right-side label gives that maximum percentage with the parenthesized delta to the other condition in green (+, team helps) or red (−, team hurts). The shorter of the two bars is drawn on top so both … view at source ↗

**Figure 4.** Figure 4: Cross-provider role-mixing on the 25-task subset. (a)-(c) report the marginal pass rate of each provider in each role, averaged over the 9 configurations holding that slot fixed. (d) plots all 27 configurations on the cost versus pass-rate plane (log x). The stair-step Pareto frontier runs from POEOVO (18.7%, $2.09) to PGEAVA (26.7%, $20.52). Marker color indicates the Executor family. Haiku-4.5 Gemini-3 F… view at source ↗

**Figure 5.** Figure 5: Verifier errors and conditional team value. (a) Verifier false-accept and false-reject rates on role-mixing runs. False-accept rates reach 60% for Haiku-4.5, 77% for Gemini-3 Flash, and 36% for GPT-5.4 Mini, while false-reject rates remain below 19%. (b) Mean team uplift over Solo by per-task Solo-score quintile on the 155-task reference ablation (n = 31 per quintile). Teams help on difficult tasks but hur… view at source ↗

**Figure 6.** Figure 6: Human study results. Each colored dot is one session, and the short black horizontal bar marks the per-mode median. (a) Duration per session by mode, in minutes on a linear scale, capped at 40 minutes (n = 10 Solo, n = 17 Hybrid, n = 11 Team within the cap; two Solo sessions were excluded since they exceeded the cap). (b) Per-role chat turns in Team mode, counting only messages a participant sent in the ch… view at source ↗

**Figure 7.** Figure 7: Human study platform (https://teambench.github.io/human-eval/). Participants enter a profile (a), pick a task from the stratified human-eval subset (b), select Team or Solo Mode with their role (c), and work in a sandboxed workspace with the same graders and role-based file access that agents use (d). The platform issues the same attestation file that closes the agent run. B.1 Pilot Coverage Coverage as o… view at source ↗

**Figure 8.** Figure 8: Per-category mean team uplift on the reference 5-condition ablation. Uplift is full-team [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

read the original abstract

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TeamBench shows OS-enforced role separation produces different agent behaviors than prompt-only setups even when pass rates match, but the results hinge on whether the 851 tasks actually force distinct role contributions.

read the letter

The core finding here is that prompt-only and sandbox teams end up with statistically similar success rates on the tasks, yet the prompt-only version triggers far more verifier attempts to edit executor code, and verifiers still greenlight nearly half the failing submissions. Removing the verifier also lifts partial scores in the ablation. A small human study adds that people paired with agents tend to rubber-stamp quickly while full human teams burn time chasing missing information across roles. That behavioral contrast is the paper's clearest contribution beyond raw pass rates. The benchmark itself is new in its use of operating-system controls to separate specification access, workspace edits, and certification across Planner, Executor, and Verifier, with 851 templates and a seeded set of 931 instances. The conditional team-value result, where teams help when solo agents are weak but hurt when they are already strong, is also worth noting. The work does a decent job of quantifying things that pass-rate metrics usually miss and includes a human baseline under the same constraints. The soft spots are mostly around task construction and reporting. The claim that enforced separation reveals real coordination dynamics only holds if most templates genuinely make each role's information and actions non-redundant; the abstract gives no breakdown of how the 851 templates were chosen or validated to ensure that. Without that, the 3.6-times edit difference and the false-approval rate could partly reflect tasks that one role can still complete from partial specs. The abstract also skips error bars, exact statistical tests, and details on instance seeding, which makes it hard to judge how stable the numbers are. Overall this is a benchmark paper aimed at people building or evaluating multi-agent systems who care about coordination beyond clever prompting. It deserves a serious referee because the setup is concrete and the behavioral observations are falsifiable, even if the authors will need to strengthen the task-validity argument and add more statistical transparency in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating multi-agent coordination under OS-enforced role separation across Planner (spec access), Executor (workspace editing), and Verifier (certification) roles. No single role can perform all three functions. Key empirical findings: prompt-only and sandbox-enforced teams achieve statistically indistinguishable pass rates, yet prompt-only runs exhibit 3.6 times more verifier attempts to edit executor code; verifiers approve 49% of submissions that fail the deterministic grader; removing the verifier improves mean partial scores; team value is conditional (beneficial when single agents struggle, detrimental when they perform well); and a 40-session human study reveals distinct interaction patterns (solo direct work, human-agent quick approvals, human teams coordinating missing info) that pass rates overlook.

Significance. If the central claims hold, the work is significant because it demonstrates that pass-rate metrics alone are insufficient for assessing agent teams and that enforced separation exposes coordination failures (over-editing, false approvals) invisible in prompt-only setups. The conditional team-value result and human study provide actionable insights for designing multi-agent systems. Credit is due for the benchmark scale (851 templates), the ablation study, and the human validation component, which together offer a falsifiable, reproducible framework for coordination evaluation beyond ad-hoc prompting.

major comments (3)

[Experimental results section] Experimental results on pass rates (abstract and main comparison): the claim of 'statistically indistinguishable pass rates' between prompt-only and sandbox conditions is load-bearing for the coordination-dynamics interpretation, yet the manuscript provides no details on the statistical test, p-values, error bars, or variance across the 931 instances, making it impossible to evaluate robustness as noted in the review.
[Benchmark design / task templates] Benchmark construction describing the 851 task templates: the design asserts that roles have non-overlapping information and action sets (no role can read full requirements, edit, and certify), but offers no quantitative check such as single-role success rates or information-overlap metrics. This is load-bearing because if many templates allow an Executor to succeed from partial spec plus workspace state, the 3.6× edit-attempt difference and 49% false-positive approvals could be artifacts of task construction rather than evidence of real coordination problems.
[Ablation study] Ablation removing the verifier and the 49% false-approval statistic: while improved mean partial score is reported, there is no breakdown by task type or analysis of whether noisy verifier approvals (49% of failing submissions) explain the ablation result, which is needed to distinguish coordination overhead from grader-verifier mismatch.

minor comments (2)

[Abstract and results tables] The abstract states concrete numbers (3.6 times, 49%) without accompanying sample sizes or confidence intervals; these should be added to the main text tables or figures for clarity.
[Introduction / benchmark overview] Notation for the three roles and the deterministic grader should be introduced with a dedicated early subsection or table to avoid ambiguity when discussing information flows.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and describe the revisions that will be incorporated to improve clarity and rigor.

read point-by-point responses

Referee: [Experimental results section] Experimental results on pass rates (abstract and main comparison): the claim of 'statistically indistinguishable pass rates' between prompt-only and sandbox conditions is load-bearing for the coordination-dynamics interpretation, yet the manuscript provides no details on the statistical test, p-values, error bars, or variance across the 931 instances, making it impossible to evaluate robustness as noted in the review.

Authors: We agree that the statistical details supporting the indistinguishability claim should have been reported. In the revised manuscript we will add a description of the test used to compare pass rates, the resulting p-value, error bars on the relevant figures, and measures of variance or confidence intervals computed across the 931 instances. These additions will be placed in the Experimental results section to allow readers to evaluate the robustness of the coordination-dynamics interpretation. revision: yes
Referee: [Benchmark design / task templates] Benchmark construction describing the 851 task templates: the design asserts that roles have non-overlapping information and action sets (no role can read full requirements, edit, and certify), but offers no quantitative check such as single-role success rates or information-overlap metrics. This is load-bearing because if many templates allow an Executor to succeed from partial spec plus workspace state, the 3.6× edit-attempt difference and 49% false-positive approvals could be artifacts of task construction rather than evidence of real coordination problems.

Authors: The OS-level enforcement guarantees that no single role can perform all three functions, but we acknowledge that empirical verification would strengthen the claim. In the revision we will report single-role success rates on the 851 templates and include quantitative information-overlap metrics (e.g., how much of the full specification can be inferred by the Executor from workspace state alone). These results will be added to the Benchmark design section to demonstrate that the observed coordination failures are not artifacts of task construction. revision: yes
Referee: [Ablation study] Ablation removing the verifier and the 49% false-approval statistic: while improved mean partial score is reported, there is no breakdown by task type or analysis of whether noisy verifier approvals (49% of failing submissions) explain the ablation result, which is needed to distinguish coordination overhead from grader-verifier mismatch.

Authors: We agree that a finer-grained analysis is needed to interpret the ablation result. In the revised manuscript we will provide a breakdown of the mean partial scores by task type and will quantify the contribution of the 49% false approvals to the observed improvement when the verifier is removed. This analysis will be added to the Ablation study section to separate coordination overhead from potential grader-verifier mismatches. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct observational claims

full rationale

The paper is a benchmark study that reports pass rates, behavioral differences, and ablation results from running agents on 851 task templates under two conditions (prompt-only vs. OS-enforced role separation). All central claims are direct statistical comparisons of measured outcomes (e.g., indistinguishable pass rates, 3.6× edit attempts, 49% false approvals) rather than any derivation, fitted parameter, or self-citation chain. No equations, uniqueness theorems, ansatzes, or renamings appear; the work contains no load-bearing self-referential steps that reduce results to inputs by construction. The design assumptions about role separation are stated explicitly and tested via ablations, leaving the findings self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the chosen tasks require genuine role separation and that the OS sandbox provides a faithful model of coordination constraints; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The benchmark tasks are constructed so that successful completion requires distinct, non-overlapping contributions from Planner, Executor, and Verifier roles.
This premise is required for the role-separation design to expose coordination failures rather than artificial constraints.

pith-pipeline@v0.9.0 · 5563 in / 1311 out tokens · 78994 ms · 2026-05-11T00:50:36.590800+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader
IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We define the Teamwork Necessity Index (TNI) as the fraction of the Solo versus Restricted gap recovered by the team

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 14 canonical work pages · 9 internal anchors

[1]

MetaGPT: Meta program- ming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta program- ming for a multi-agent collaborative framework. InInternational Conference on Learning Representations (ICLR), 2024

2024
[2]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

ChatDev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2024

2024
[4]

CAMEL: Communicative agents for “mind” exploration of large language model society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[5]

Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs.arXiv preprint arXiv:2402.16823, 2024

work page arXiv 2024
[6]

Towards a Science of Scaling Agent Systems

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, et al. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

How we built our multi-agent research system

Anthropic. How we built our multi-agent research system. https://www.anthropic.com/ engineering/multi-agent-research-system , 2025. Anthropic Engineering Blog, June 2025

2025
[8]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei A. Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2025. arXiv:2503.13657

work page internal anchor Pith review arXiv 2025
[9]

arXiv:2507.02825 , year =

Yuxuan Zhu et al. Establishing best practices for building rigorous agentic benchmarks. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2025. arXiv:2507.02825

work page arXiv 2025
[10]

HiddenBench: Assessing collective reasoning in multi-agent LLMs via hidden profile tasks.arXiv preprint arXiv:2505.11556, 2025

Yuxuan Li, Aoi Naito, and Hirokazu Shirado. HiddenBench: Assessing collective reasoning in multi-agent LLMs via hidden profile tasks.arXiv preprint arXiv:2505.11556, 2025

work page internal anchor Pith review arXiv 2025
[11]

SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024

2024
[12]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces, 2026. arXiv:2601.11868

work page internal anchor Pith review arXiv 2026
[13]

UCI machine learning repository, 2019

Dheeru Dua and Casey Graff. UCI machine learning repository, 2019. URLhttps://archive. ics.uci.edu/ml

2019
[14]

Objective communication patterns associated with team member effectiveness in real-world virtual teams.Human Factors, 66(5):1414–1430, 2024

Lisa O’Bryan, Tim Oxendahl, Xu Chen, Daniel McDuff, Santiago Segarra, Matthew Wetter- green, Margaret E Beier, and Ashutosh Sabharwal. Objective communication patterns associated with team member effectiveness in real-world virtual teams.Human Factors, 66(5):1414–1430, 2024

2024
[15]

Multiagentbench: Evaluating the collaboration and competition of llm agents,

Kunlun Zhu, Hongyi Du, Zhe Wang, et al. MultiAgentBench: Evaluating the collaboration and competition of LLM agents. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2025. arXiv:2503.01935

work page arXiv 2025
[16]

Steiner.Group Process and Productivity

Ivan D. Steiner.Group Process and Productivity. Academic Press, New York, 1972. 10

1972
[17]

Daniel M. Wegner. Transactive memory: A contemporary analysis of the group mind. In Brian Mullen and George R. Goethals, editors,Theories of Group Behavior, pages 185–208. Springer, New York, 1987

1987
[18]

Measuring transactive memory systems in the field: Scale development and validation.Journal of Applied Psychology, 88(4):587–604, 2003

Kyle Lewis. Measuring transactive memory systems in the field: Scale development and validation.Journal of Applied Psychology, 88(4):587–604, 2003

2003
[19]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander M ˛ adry. MLE-Bench: Evaluating machine learning agents on machine learning engineering.arXiv preprint arXiv:2410.07095, 2024

work page Pith review arXiv 2024
[20]

Devbench: A comprehensive benchmark for software development

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Xin Cong, Xinyun He, et al. DevBench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 2024

work page arXiv 2024
[21]

GAIA: A benchmark for general AI assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. InInternational Conference on Learning Representations (ICLR), 2024

2024
[22]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review arXiv 2025
[23]

AgentBench: Evaluating LLMs as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. InInternational Conference on Learning Representations (ICLR), 2024

2024
[24]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-agent: Agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793, 2024

work page internal anchor Pith review arXiv 2024
[25]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Yueqi Li, Yufan Song, Frank F. Xu, Hao Tang, Mingchen Zhuge, Jiayi Pan, Yang Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An open platform for AI software d...

work page internal anchor Pith review arXiv 2024
[26]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contam- ination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InInternational Conference on Machine Learning (ICML), 2024

2024
[28]

Pooling of unshared information in group decision making: Biased information sampling during discussion.Journal of Personality and Social Psychology, 48(6):1467–1478, 1985

Garold Stasser and William Titus. Pooling of unshared information in group decision making: Biased information sampling during discussion.Journal of Personality and Social Psychology, 48(6):1467–1478, 1985

1985
[29]

Richard Hackman

J. Richard Hackman. The design of work teams. In Jay W. Lorsch, editor,Handbook of Organizational Behavior, pages 315–342. Prentice-Hall, Englewood Cliffs, NJ, 1987

1987
[30]

stronger X

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. 11 A Benchmark Construction A.1 Role permissions Table 2: Role permissions enforced by Docker bind mounts. No single role has simultaneous access to the full sp...

2021
[31]

Sessions failing either criterion are reported but excluded from outcome comparisons

Eligibility filter.A Hybrid session is analyzable iff overrode_grader≤2 on the 1 to 5 scale and wall-clock ≥5 minutes. Sessions failing either criterion are reported but excluded from outcome comparisons. Solo and Team sessions have no analogous eligibility filter beyond the existing attention check
[32]

Minimum cell.A per-task pass rate is reported only at ≥10 participant-sessions in the relevant cell, and an aggregate is reported only if≥8tasks satisfy the per-task minimum-cell rule
[33]

Pairing.Solo-vs-LLM-Solo and Team-vs-LLM-Team comparisons use McNemar’s exact test on (task id, seed) pairs, with Holm-Bonferroni adjustment across the three pairings (Solo, Hybrid, Team)
[34]

Significance tests are secondary

Effect-size focus.Headline numbers are paired differences in pass rate with Wilson 95% confi- dence intervals. Significance tests are secondary. We do not chasep < .05thresholds
[35]

Open-ended responsesare summarized by two annotators using the same role-collapse rubric as Section 3.3, and inter-annotator agreement is reported as Cohen’sκ
[36]

difficulty mix

Demographics, recruitment, and IRB.As specified in the public protocol document accompany- ing the platform release. Stop condition for collection. The target is ≥10 sessions per mode on each of the 20 stratified target tasks, with the eligibility filter applied. C Leaderboard C.1 Studies and counts C.2 Leaderboard construction The TEAMBENCHleaderboard su...