CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3
The pith
CalBench is a benchmark where agents with private calendars must coordinate meeting schedules without sharing data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CalBench generates decentralized scheduling instances with private calendars, incoming meetings, oracle-optimal solutions, and semantic privacy tags so that realized cost relative to the oracle, communication volume, fairness of cost distribution, and unnecessary private-information disclosure can all be quantified precisely; unlike many multi-agent tests, no single agent holds enough information to solve the problem alone.
What carries the argument
CalBench, a decentralized environment in which agents manage private calendars, receive a stream of meetings to schedule, and must negotiate consistent outcomes using only their own data plus an oracle that supplies the minimal total disruption cost.
Load-bearing premise
The specific mechanics of private-calendar scheduling with oracle optima and semantic privacy tags form a representative proxy for general coordination-privacy trade-offs across multi-agent LLM applications.
What would settle it
If agents using current LLMs in CalBench either routinely exceed the DCOP baseline costs by large margins or disclose task-irrelevant sensitive calendar details at high rates, the benchmark would fail to demonstrate useful coordination-privacy measurement.
Figures
read the original abstract
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CalBench, a controlled evaluation environment for studying coordination-privacy trade-offs in multi-agent LLMs via a decentralized calendar scheduling task. Each of N agents maintains a private calendar with pre-existing commitments and must negotiate to schedule M incoming meetings while minimizing disruption costs; an oracle provides the globally optimal schedule for exact performance measurement, a DCOP baseline enables comparison under identical private-information constraints, and semantic sensitivity tags on calendar entries allow quantification of privacy leakage during communication.
Significance. If the environment is implemented as described and its metrics prove robust, CalBench could fill a gap in multi-agent LLM evaluation by supplying a verifiable, inherently decentralized testbed where no agent sees others' private data yet global consistency is required. The oracle-based cost ratio, communication-volume tracking, fairness metric, and privacy-leakage measure together enable precise, reproducible quantification of trade-offs that are difficult to isolate in less structured domains.
major comments (2)
- [Abstract] Abstract: the claim that CalBench 'provides a practical and verifiable setting for studying coordination protocols, communication efficiency, negotiation strategies, fairness, and privacy leakage in multi-agent systems' rests on the unargued premise that private-calendar scheduling with additive disruption costs and semantic tags is a representative proxy; no cross-domain comparison, sensitivity analysis, or justification is supplied to show that observed trade-offs would not be artifacts of the discrete-slot structure or the existence of a global optimum.
- [Environment definition] Environment definition: the precise mechanics for generating scenarios, computing the oracle optimum, and enforcing the DCOP baseline under strictly private calendars are not specified in sufficient detail (e.g., no pseudocode, parameter ranges for N/M, or data-generation procedure), preventing independent assessment of whether the cost functions and privacy metrics are sound or reproducible.
minor comments (2)
- [Abstract] The abstract introduces N and M without indicating typical values or ranges used for evaluation scenarios.
- [Abstract] DCOP is used without an initial expansion on first appearance, although the acronym is standard in the field.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that CalBench 'provides a practical and verifiable setting for studying coordination protocols, communication efficiency, negotiation strategies, fairness, and privacy leakage in multi-agent systems' rests on the unargued premise that private-calendar scheduling with additive disruption costs and semantic tags is a representative proxy; no cross-domain comparison, sensitivity analysis, or justification is supplied to show that observed trade-offs would not be artifacts of the discrete-slot structure or the existence of a global optimum.
Authors: We agree that the abstract claim would benefit from explicit justification. Calendar scheduling was selected as a canonical decentralized task that naturally separates private calendars from shared decisions while admitting an exact oracle optimum and additive costs; these properties enable the precise, reproducible metrics that are the benchmark's primary contribution. Nevertheless, the manuscript does not supply cross-domain comparisons or sensitivity analysis. In the revision we will add a short design-rationale subsection (new Section 2.1) that (i) motivates the choice by reference to real-world meeting coordination, (ii) explains why the discrete-slot and oracle structure are deliberate features rather than artifacts, and (iii) acknowledges that future work should test whether the same trade-off patterns appear in continuous or non-oracle domains. We will also tone down the abstract phrasing to reflect this added discussion. revision: yes
-
Referee: [Environment definition] Environment definition: the precise mechanics for generating scenarios, computing the oracle optimum, and enforcing the DCOP baseline under strictly private calendars are not specified in sufficient detail (e.g., no pseudocode, parameter ranges for N/M, or data-generation procedure), preventing independent assessment of whether the cost functions and privacy metrics are sound or reproducible.
Authors: We accept this criticism. The current text describes the high-level structure but omits the concrete generation procedure, oracle algorithm, and DCOP encoding. In the revised manuscript we will insert a new subsection (Section 3.2) containing: (a) pseudocode for scenario generation (including how private commitments and semantic sensitivity tags are sampled), (b) the exact integer-linear-program formulation used for the oracle optimum, (c) the DCOP variable/constraint encoding that respects private calendars, and (d) the ranges and default values for N, M, and other parameters. These additions will make the cost functions and privacy-leakage metric fully reproducible. revision: yes
Circularity Check
No circularity: benchmark definition with no derived predictions or self-referential reductions
full rationale
The paper introduces CalBench as a new evaluation environment for multi-agent coordination and privacy trade-offs. Its central claims describe the environment's design properties (decentralized private calendars, oracle optima, DCOP baseline, semantic sensitivity tags) and the quantities it enables measuring (realized-to-optimal cost, communication volume, privacy leakage). These follow directly from the stated construction rules without any derivation chain, equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The assumption that calendar scheduling forms a representative proxy is presented as a design choice rather than a derived result, so the contribution remains self-contained as an environment definition.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.