CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Chelsea Zou; Noah Goodman; Robert D. Hawkins; Selena She; Yiheng Yao

arxiv: 2605.09823 · v2 · pith:ASXOA3TJnew · submitted 2026-05-10 · 💻 cs.MA · cs.AI

Chelsea Zou, Yiheng Yao, Selena She, Noah Goodman, Robert D. Hawkins

Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3

keywords multi-agent coordination · privacy preservation · calendar scheduling · decentralized systems · LLM agents · benchmark · DCOP

The pith

CalBench is a benchmark where agents with private calendars must coordinate meeting schedules without sharing data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CalBench as a controlled setting in which N agents each hold a private calendar of existing commitments and must communicate to schedule a stream of new meetings while minimizing total disruption. No agent can see another agent's calendar entries, yet the group must produce mutually consistent schedules. An oracle computes the optimal-cost solution for each generated scenario, which allows exact measurement of how close the agents come to the optimum, how much they communicate, how evenly disruption costs are shared, and whether they leak sensitive private details tagged on calendar entries. A distributed constraint optimization (DCOP) solver serves as a baseline under the same information limits. Readers would care because the setup isolates the coordination-privacy tension in a verifiable, decentralized way that single-agent substitutes cannot shortcut.
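The "distance from the optimum" measurement can be illustrated with a minimal sketch. It assumes a simplified cost model that is not taken from the paper: `slot_costs[a][t]` is a hypothetical cost for agent `a` to give up slot `t`, each meeting occupies one slot shared by all its participants, and costs add up across participants. Brute-force enumeration stands in here for whatever solver the benchmark actually uses, so it only suits toy instances.

```python
from itertools import product

def schedule_cost(slot_costs, meetings, assignment):
    """Total disruption: each participant pays the cost of the slot its meeting lands on."""
    return sum(slot_costs[a][s] for parts, s in zip(meetings, assignment) for a in parts)

def is_feasible(meetings, assignment):
    """An agent may not attend two meetings in the same slot."""
    seen = set()
    for parts, slot in zip(meetings, assignment):
        for agent in parts:
            if (agent, slot) in seen:
                return False
            seen.add((agent, slot))
    return True

def oracle(slot_costs, meetings):
    """Enumerate every slot assignment and return the cheapest feasible one.
    Raises ValueError if no feasible assignment exists."""
    num_slots = len(slot_costs[0])
    candidates = (
        x for x in product(range(num_slots), repeat=len(meetings))
        if is_feasible(meetings, x)
    )
    return min(candidates, key=lambda x: schedule_cost(slot_costs, meetings, x))

# Hypothetical toy instance: 3 agents, 3 slots, 2 meetings given as participant tuples.
slot_costs = [[0, 2, 5], [3, 0, 1], [0, 4, 0]]
meetings = [(0, 1), (1, 2)]

best = oracle(slot_costs, meetings)                             # (1, 2)
optimal_cost = schedule_cost(slot_costs, meetings, best)        # 3
realized_cost = schedule_cost(slot_costs, meetings, (0, 2))     # 4
excess = realized_cost - optimal_cost                           # 1
```

Only the evaluator sees the full `slot_costs` table in this sketch; the agents being scored would each see just their own row.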

Core claim

CalBench generates decentralized scheduling instances with private calendars, incoming meetings, oracle-optimal solutions, and semantic privacy tags so that realized cost relative to the oracle, communication volume, fairness of cost distribution, and unnecessary private-information disclosure can all be quantified precisely; unlike many multi-agent tests, no single agent holds enough information to solve the problem alone.

What carries the argument

CalBench, a decentralized environment in which agents manage private calendars, receive a stream of meetings to schedule, and must negotiate consistent outcomes using only their own data. An oracle supplies the minimal total disruption cost against which those outcomes are scored.

Load-bearing premise

The specific mechanics of private-calendar scheduling with oracle optima and semantic privacy tags form a representative proxy for general coordination-privacy trade-offs across multi-agent LLM applications.

What would settle it

If agents using current LLMs in CalBench either routinely exceed the DCOP baseline costs by large margins or disclose task-irrelevant sensitive calendar details at high rates, the benchmark would fail to demonstrate useful coordination-privacy measurement.

Figures

Figures reproduced from arXiv: 2605.09823 by Chelsea Zou, Noah Goodman, Robert D. Hawkins, Selena She, Yiheng Yao.

**Figure 1.** Overview of the CalBench environment. (A) Environment setup: Each of N agents maintains a private calendar with T discrete time slots containing free times, errands with private semantic contexts, and scheduled meetings. Meetings must be placed at the same slot across all participants’ calendars. Calendars are initialized by reserving a hidden feasible solution and filling remaining slots to a target densi…

**Figure 2.** Privacy–efficiency plane: uniform VPS leakage (mean per game; §3.2) against excess cost…

**Figure 3.** Communication efficiency for model runs only. Each point is a single game’s messages…

**Figure 4.** Average direct messages by meeting index for uniform and varied tasks. The metric is…

**Figure 5.** Average direct messages by speaker position. We measure speaker-position effects because…

**Figure 6.** Mean number of prior meetings rescheduled per game. Values are averaged across…

Original abstract

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
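The finding that communication volume does not predict lower regret is a claim about the relationship between two per-game quantities. One simple way to probe such a relationship is a Pearson correlation, sketched below. The function name and the toy numbers are illustrative assumptions, not data or analysis code from the paper, and the paper may use a different statistic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up per-game values, for illustration only.
messages_per_game = [4, 9, 15, 22]
regret_per_game = [3, 0, 5, 2]
r = pearson(messages_per_game, regret_per_game)
```

A value of `r` near zero would be consistent with message count carrying little information about regret; a strongly negative value would mean that more talk goes with lower regret.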

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CalBench gives a workable oracle-verified benchmark for decentralized scheduling with privacy tracking, but the claim that it proxies general coordination trade-offs rests on an untested assumption.

CalBench defines an environment where agents each hold private calendars with existing commitments and must negotiate a stream of new meetings. An oracle supplies the global optimum, so you can compute the exact ratio of realized to optimal cost, and the setup runs a DCOP solver under the same information limits for comparison. Semantic sensitivity tags on entries let you measure whether agents leak task-irrelevant private details in their messages. The design keeps every agent blind to the others' calendars, which forces genuine cross-boundary coordination and gives clean signals on communication volume and cost fairness.
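How tagged entries could turn into a leakage count can be sketched under strong simplifying assumptions. Here each agent has a hypothetical list of sensitive terms drawn from its own calendar, and a leak is recorded when one of those terms appears verbatim in a message that agent sent. Case-insensitive substring matching is a crude stand-in chosen for this example; it is not the paper's leakage metric, and it would miss paraphrases.

```python
def leaked_terms(messages, sensitive_terms):
    """Return (sender, term) pairs for each sensitive term a sender disclosed.

    messages: list of (sender, text) tuples in conversation order.
    sensitive_terms: dict mapping sender -> list of that sender's private terms.
    """
    leaks = []
    for sender, text in messages:
        lowered = text.lower()
        for term in sensitive_terms.get(sender, []):
            pair = (sender, term)
            # Count each disclosed term once per sender, however often it is repeated.
            if term.lower() in lowered and pair not in leaks:
                leaks.append(pair)
    return leaks

# Hypothetical transcript and tags, for illustration only.
messages = [
    ("alice", "I can't do 3pm, I have an oncology appointment."),
    ("bob", "3pm is busy for me too; 4pm works."),
    ("alice", "4pm works."),
]
sensitive_terms = {"alice": ["oncology"], "bob": ["job interview"]}

leaks = leaked_terms(messages, sensitive_terms)  # [("alice", "oncology")]
```

In this transcript Bob declines the same slot without giving a reason, which is the kind of disclosure-free alternative such a check is meant to distinguish.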

Referee Report

2 major / 2 minor

Summary. The paper introduces CalBench, a controlled evaluation environment for studying coordination-privacy trade-offs in multi-agent LLMs via a decentralized calendar scheduling task. Each of N agents maintains a private calendar with pre-existing commitments and must negotiate to schedule M incoming meetings while minimizing disruption costs; an oracle provides the globally optimal schedule for exact performance measurement, a DCOP baseline enables comparison under identical private-information constraints, and semantic sensitivity tags on calendar entries allow quantification of privacy leakage during communication.

Significance. If the environment is implemented as described and its metrics prove robust, CalBench could fill a gap in multi-agent LLM evaluation by supplying a verifiable, inherently decentralized testbed where no agent sees others' private data yet global consistency is required. The oracle-based cost ratio, communication-volume tracking, fairness metric, and privacy-leakage measure together enable precise, reproducible quantification of trade-offs that are difficult to isolate in less structured domains.
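A fairness metric over per-agent disruption costs can take many forms, and this report does not say which one the paper uses. As one common option, the sketch below computes the Gini coefficient of the per-agent costs, where 0 means the burden is shared equally and values near 1 mean one agent absorbs almost all of it. The function name and inputs are assumptions made for illustration.

```python
def gini(costs):
    """Gini coefficient of non-negative per-agent costs (0 = perfectly even)."""
    if not costs:
        raise ValueError("need at least one agent")
    if any(c < 0 for c in costs):
        raise ValueError("costs must be non-negative")
    total = sum(costs)
    if total == 0:
        return 0.0  # nobody bears any cost, so the split is trivially even
    n = len(costs)
    ordered = sorted(costs)
    # Standard closed form over values sorted in ascending order, ranks 1..n.
    weighted = sum(rank * c for rank, c in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

even_split = gini([3, 3, 3])        # 0.0
one_bears_all = gini([0, 0, 0, 12]) # 0.75
```

With n agents the maximum attainable value is (n - 1) / n, so comparisons across games with different team sizes may need normalization.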

major comments (2)

[Abstract] Abstract: the claim that CalBench 'provides a practical and verifiable setting for studying coordination protocols, communication efficiency, negotiation strategies, fairness, and privacy leakage in multi-agent systems' rests on the unargued premise that private-calendar scheduling with additive disruption costs and semantic tags is a representative proxy; no cross-domain comparison, sensitivity analysis, or justification is supplied to show that observed trade-offs would not be artifacts of the discrete-slot structure or the existence of a global optimum.
[Environment definition] Environment definition: the precise mechanics for generating scenarios, computing the oracle optimum, and enforcing the DCOP baseline under strictly private calendars are not specified in sufficient detail (e.g., no pseudocode, parameter ranges for N/M, or data-generation procedure), preventing independent assessment of whether the cost functions and privacy metrics are sound or reproducible.
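For orientation, the generation step that the Figure 1 caption describes (reserve a hidden feasible solution, then fill the remaining slots to a target density) could look roughly like the sketch below. Every name, the greedy reservation strategy, and the uniform random fill are assumptions made for this example; the paper's actual procedure is precisely what the comment above says is unspecified.

```python
import random

def generate_scenario(num_agents, num_slots, meetings, density, seed=0):
    """Build private calendars that are guaranteed to admit a conflict-free schedule.

    meetings: list of tuples of participant indices.
    density: target fraction of each calendar occupied by errands.
    Returns (calendars, hidden_slots).
    """
    rng = random.Random(seed)

    # Step 1: reserve one slot per meeting so that no participant is double-booked.
    reserved = {a: set() for a in range(num_agents)}
    hidden_slots = []
    for participants in meetings:
        open_slots = [
            t for t in range(num_slots)
            if all(t not in reserved[a] for a in participants)
        ]
        if not open_slots:
            # Greedy choice can fail even when a feasible reservation exists.
            raise ValueError("could not reserve a slot for every meeting")
        slot = rng.choice(open_slots)
        hidden_slots.append(slot)
        for a in participants:
            reserved[a].add(slot)

    # Step 2: fill non-reserved slots with errands up to the target density.
    calendars = []
    for a in range(num_agents):
        fillable = [t for t in range(num_slots) if t not in reserved[a]]
        num_busy = min(len(fillable), round(density * num_slots))
        busy = set(rng.sample(fillable, num_busy))
        calendars.append(["errand" if t in busy else "free" for t in range(num_slots)])
    return calendars, hidden_slots

meetings = [(0, 1), (1, 2), (0, 2)]
calendars, hidden_slots = generate_scenario(3, 8, meetings, density=0.5, seed=1)
```

Because the reserved slots are never filled, placing each meeting at its hidden slot displaces nothing, so every generated instance is solvable by construction.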

minor comments (2)

[Abstract] The abstract introduces N and M without indicating typical values or ranges used for evaluation scenarios.
[Abstract] DCOP is used without an initial expansion on first appearance, although the acronym is standard in the field.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [Abstract] Abstract: the claim that CalBench 'provides a practical and verifiable setting for studying coordination protocols, communication efficiency, negotiation strategies, fairness, and privacy leakage in multi-agent systems' rests on the unargued premise that private-calendar scheduling with additive disruption costs and semantic tags is a representative proxy; no cross-domain comparison, sensitivity analysis, or justification is supplied to show that observed trade-offs would not be artifacts of the discrete-slot structure or the existence of a global optimum.

Authors: We agree that the abstract claim would benefit from explicit justification. Calendar scheduling was selected as a canonical decentralized task that naturally separates private calendars from shared decisions while admitting an exact oracle optimum and additive costs; these properties enable the precise, reproducible metrics that are the benchmark's primary contribution. Nevertheless, the manuscript does not supply cross-domain comparisons or sensitivity analysis. In the revision we will add a short design-rationale subsection (new Section 2.1) that (i) motivates the choice by reference to real-world meeting coordination, (ii) explains why the discrete-slot and oracle structure are deliberate features rather than artifacts, and (iii) acknowledges that future work should test whether the same trade-off patterns appear in continuous or non-oracle domains. We will also tone down the abstract phrasing to reflect this added discussion.
Revision: yes

Referee: [Environment definition] Environment definition: the precise mechanics for generating scenarios, computing the oracle optimum, and enforcing the DCOP baseline under strictly private calendars are not specified in sufficient detail (e.g., no pseudocode, parameter ranges for N/M, or data-generation procedure), preventing independent assessment of whether the cost functions and privacy metrics are sound or reproducible.

Authors: We accept this criticism. The current text describes the high-level structure but omits the concrete generation procedure, oracle algorithm, and DCOP encoding. In the revised manuscript we will insert a new subsection (Section 3.2) containing: (a) pseudocode for scenario generation (including how private commitments and semantic sensitivity tags are sampled), (b) the exact integer-linear-program formulation used for the oracle optimum, (c) the DCOP variable/constraint encoding that respects private calendars, and (d) the ranges and default values for N, M, and other parameters. These additions will make the cost functions and privacy-leakage metric fully reproducible.
Revision: yes
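As a point of reference for item (b), an integer-linear program for this kind of problem might take the following shape. This is a sketch under assumptions, not the paper's formulation: $x_{m,t}$ is a hypothetical binary variable that places meeting $m$ in slot $t$, $P_m$ is the participant set of meeting $m$, and $c_{a,t}$ is an assumed additive cost for agent $a$ to give up slot $t$. The real benchmark may add terms, for example for rescheduling earlier meetings.

```latex
\begin{align*}
\min_{x} \quad & \sum_{m=1}^{M} \sum_{t=1}^{T} x_{m,t} \sum_{a \in P_m} c_{a,t} \\
\text{s.t.} \quad & \sum_{t=1}^{T} x_{m,t} = 1 && \text{for every meeting } m, \\
& \sum_{m \,:\, a \in P_m} x_{m,t} \le 1 && \text{for every agent } a \text{ and slot } t, \\
& x_{m,t} \in \{0, 1\} && \text{for every } m \text{ and } t.
\end{align*}
```

The first constraint gives each meeting exactly one slot shared by all its participants, and the second prevents any agent from being booked into two meetings at once.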

Circularity Check

0 steps flagged

No circularity: benchmark definition with no derived predictions or self-referential reductions

The paper introduces CalBench as a new evaluation environment for multi-agent coordination and privacy trade-offs. Its central claims describe the environment's design properties (decentralized private calendars, oracle optima, DCOP baseline, semantic sensitivity tags) and the quantities it enables measuring (realized-to-optimal cost, communication volume, privacy leakage). These follow directly from the stated construction rules without any derivation chain, equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The assumption that calendar scheduling forms a representative proxy is presented as a design choice rather than a derived result, so the contribution remains self-contained as an environment definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the paper defines an evaluation environment rather than deriving results from prior assumptions.
