Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task

Aaron Parisi; Alden Hallak; Crystal Qian; Nithum Thain; Vivian Tsai

arxiv: 2605.14097 · v2 · pith:INHJO2QCnew · submitted 2026-05-13 · 💻 cs.HC

Pith reviewed 2026-05-15 01:47 UTC · model grok-4.3

classification 💻 cs.HC

keywords LLM facilitation · group deliberation · algorithmic steering · consensus · participation equity · AI governance · charity allocation · procedural justice

The pith

LLM facilitators in group charity tasks shift specific donation shares by up to 5.5 points without raising overall consensus or participation equity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how LLM facilitators affect real-time, text-based group decisions when groups allocate real money to charities. Facilitation left consensus measures unchanged yet raised participants' preference for the process, mainly because they perceived it as more inclusive. At the same time, the LLMs nudged allocations to particular charities enough to change final payouts, and neither transcript nor survey measures showed a gain in participation equity. These results indicate that subjective approval of an AI process can coexist with unchanged fairness metrics and measurable directional influence on outcomes.

Core claim

In two studies totaling 879 participants who allocated real donation budgets in groups of three, LLM facilitation across frontier models and strategies produced no significant rise in group consensus compared with no-facilitation baselines. Participants nevertheless preferred facilitated sessions and cited inclusivity as the main reason. Facilitators altered select charity-level shares by as much as 5.5 percentage points, directly affecting payouts, while neither survey responses nor transcript analysis detected improvements in participation equity. Reported trust in the process was higher in the very conditions where steering occurred.

What carries the argument

The incentive-compatible charity allocation task, in which groups divide a fixed budget across charities under text-only chat with or without LLM facilitation, with outcomes tracked through consensus scores, per-charity allocation shifts, survey and transcript equity measures, and post-task preference ratings.

If this is right

Facilitators can change final charitable payouts even when aggregate agreement metrics remain flat.
Perceived inclusivity can rise without any corresponding increase in measured participation equity.
Trust in the deliberation process can increase under conditions where directional influence on outcomes is present.
Governance evaluation of AI-mediated groups must track collective outcomes, interaction patterns, and subjective perceptions as separate targets.
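
The first implication above can be illustrated with a toy calculation. The sketch below is not the paper's consensus score (the source does not define it here); it assumes a hypothetical agreement measure, the mean pairwise L1 distance between members' allocation vectors, and uses made-up allocations. If every member moves the same five points from one charity to another, within-group agreement is unchanged while the per-charity shares, and hence the payout, shift.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_distance(group):
    """Average L1 distance between members' allocation vectors (lower = more agreement).

    Hypothetical agreement measure, used only for illustration.
    """
    pairs = list(combinations(group, 2))
    return mean(sum(abs(a - b) for a, b in zip(x, y)) for x, y in pairs)


def mean_shares(group):
    """Per-charity mean allocation across the group's members."""
    return [mean(column) for column in zip(*group)]


# Made-up allocations (percent of budget) over three charities for a group of three.
baseline = [
    [50, 30, 20],
    [40, 40, 20],
    [30, 30, 40],
]

# The same group after every member moves 5 points from the second charity to the first.
steered = [[a + 5, b - 5, c] for a, b, c in baseline]

print("agreement:", mean_pairwise_distance(baseline), mean_pairwise_distance(steered))
print("shares:   ", mean_shares(baseline), mean_shares(steered))
```

Any measure built only from differences between members behaves this way under a uniform shift, which is one reason per-charity shares may need to be tracked alongside aggregate agreement.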

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar steering could occur in other high-stakes text-based deliberations such as workplace budgeting or community planning.
Designers might add explicit limits on directional suggestions to reduce unintended allocation shifts while retaining facilitation benefits.
Testing voice or video interfaces could reveal whether the gap between perceived and actual equity shrinks outside text chat.

Load-bearing premise

The specific charity allocation task, with real financial stakes and text-only chat, generalizes to other group deliberation settings, and the chosen metrics fully capture steering and equity effects.

What would settle it

A replication using a different real-stakes group task, such as ranking policy options, in which LLM facilitation produces neither allocation shifts nor higher preference ratings would falsify the steering and preference findings.

Figures

Figures reproduced from arXiv: 2605.14097 by Aaron Parisi, Alden Hallak, Crystal Qian, Nithum Thain, Vivian Tsai.

**Figure 1.** Experiment design overview. Participants, in groups of three, complete three rounds of group deliberation and budget…

**Figure 2.** The Deliberate Lab experimenter interface.

**Figure 3.** Changes in group consensus score across studies and rounds.

**Figure 4.** Allocation steering by charity (Study 2). Left: Bars show the change (percentage points) in the average post-discussion allocation under each strategy-driven facilitator relative to the human-only baseline for that charity (stars denote statistically significant shifts). Right: AI shift (%) vs. the standard deviation (SD) of human baselines. Despite no significant changes in aggregate consensus scores, we find…

**Figure 5.** Participant preferences by facilitator. Top: Normalized individual responses show consistent preference for LLM facilitation over the unfacilitated control, with differences across facilitation styles and models. Bottom: When we visualize supermajority preferences (≥ 2 participants in group prefer the same treatment), similar trends emerge. Footnote 10: Significance computed using Welch’s two-sided t-test. Footnote 11: This is …

**Figure 6.** Left: Pearson correlation (r) of participant’s self-identified traits vs. change in group consensus outcomes. There is a slight negative correlation between participants self-identifying as being invested in the outcome, and the change in consensus score within their group. Right: Pearson correlation (r) of participants’ traits vs. preference for the human-only baseline. There is a statistically significan…

**Figure 7.** An example of stages within a round (Round 1).

**Figure 8.** Instruction screen on allocations. Participants allocate a fixed total donation across real charities; each slider setting…

**Figure 9.** Instruction screen on incentivized payouts. Each group receives a consensus score; groups are ranked by this score,…

**Figure 10.** Top tokens and themes of prevalent keywords presented by each facilitator in conversations involving the…

**Figure 11.** Top tokens and themes of prevalent keywords presented by each facilitator in conversations involving the…

Original abstract

As large language models (LLMs) evolve from single-user assistants to active participants in civic and workplace deliberation, evaluating their effects on collective decision making becomes a governance challenge. We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD). Groups of three allocate a donation budget under varying LLM facilitation conditions: Study 1 (N=204) compares three frontier models; Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline. Across both studies, LLM facilitation did not significantly improve group consensus in either study, yet participants consistently preferred facilitated discussion. We additionally identify two governance-relevant risks. First, algorithmic steering: facilitators shifted select charity-level allocations by up to 5.5 percentage points -- directly affecting the final charitable payout -- even when aggregate agreement metrics remained unchanged. Second, an illusion of inclusion: participants cited inclusivity as their primary reason for preferring LLM facilitators, yet neither survey nor transcript-based measures of participation equity improved. Notably, participants reported greater trust in the process under the same conditions where facilitators exerted directional influence on outcomes. Together, these findings show that in AI-mediated group deliberation, perceived procedural improvement can coexist with measurable steering and unchanged participation inequality, motivating evaluation practices that treat collective outcomes, interaction dynamics, and participant perceptions as distinct governance targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents two empirical studies (total N=879) on real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD). Study 1 (N=204) compares three frontier LLMs as facilitators; Study 2 (N=675) compares facilitation strategies to a no-facilitation baseline. Central claims are that LLM facilitation produced no significant improvement in group consensus (per aggregate agreement metrics) yet elicited consistent participant preference for facilitated conditions; two governance risks are identified—algorithmic steering (shifts in select charity allocations up to 5.5 pp without aggregate consensus change) and illusion of inclusion (higher perceived inclusivity without gains in survey or transcript equity measures).

Significance. If the results hold under more detailed scrutiny, the work is significant for HCI and AI governance research. It provides concrete evidence that perceived procedural benefits (preference, trust) can coexist with measurable outcome steering and static participation inequality in LLM-mediated groups. The incentive-compatible design with real stakes strengthens ecological validity for civic and workplace applications, and the distinction between collective outcomes, interaction dynamics, and perceptions offers a useful framework for future evaluation practices.

major comments (3)

[Results (Study 2)] Results section (Study 2, algorithmic steering paragraph): The claim of shifts up to 5.5 percentage points in specific charity allocations requires explicit statistical tests (e.g., per-charity t-tests or regression coefficients with p-values and confidence intervals) and a precise definition of how 'select' charities were identified; without these, it is unclear whether the shifts are distinguishable from noise given that aggregate agreement metrics showed no change.
[Methods] Methods section: The operationalization of consensus (e.g., variance, pairwise similarity, or other aggregate metrics) and participation equity (survey items plus transcript coding rules for message volume/turn-taking) must be specified in detail, including inter-rater reliability for transcripts and power analysis for the null consensus result; these metrics are load-bearing for the steering and illusion-of-inclusion conclusions.
[Discussion] Discussion section: The interpretation that unchanged aggregate metrics plus directional shifts constitute 'steering' rather than a form of consensus change needs justification against alternative granular measures (e.g., semantic alignment of contributions or preference polarization indices); if coarser metrics miss these, the governance-risk framing may require qualification.

minor comments (2)

[Abstract] Abstract: The total N=879 is the sum of the two studies with no overlap, but a parenthetical note on this would improve immediate clarity.
[Results] The paper would benefit from reporting effect sizes (e.g., Cohen's d or partial eta-squared) alongside the preference and trust findings to allow readers to assess practical significance.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Results (Study 2)] Results section (Study 2, algorithmic steering paragraph): The claim of shifts up to 5.5 percentage points in specific charity allocations requires explicit statistical tests (e.g., per-charity t-tests or regression coefficients with p-values and confidence intervals) and a precise definition of how 'select' charities were identified; without these, it is unclear whether the shifts are distinguishable from noise given that aggregate agreement metrics showed no change.

Authors: We agree that additional statistical detail is required for transparency. In the revised manuscript, we will report per-charity independent-samples t-tests (facilitated vs. baseline) with p-values, Cohen's d, and 95% confidence intervals for all allocation differences. 'Select' charities will be defined explicitly as those exhibiting a mean shift of at least 3 percentage points that reaches statistical significance (p < 0.05) in at least one facilitated condition. We will also include the full allocation table for all charities so readers can evaluate the pattern against noise. revision: yes
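
The per-charity comparison described in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the data are made up, the function names are hypothetical, and the sketch uses Welch's unequal-variance t statistic (the test named in the Figure 5 footnote) together with a pooled-SD Cohen's d. The standard library has no t-distribution CDF, so the p-value is left to a statistics package; in SciPy, `scipy.stats.ttest_ind(x, y, equal_var=False)` performs Welch's test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled


# Made-up per-group shares (percent of budget) allocated to one charity.
facilitated = [30.0, 35.0, 28.0, 40.0, 33.0, 36.0]
baseline = [27.0, 30.0, 25.0, 31.0, 29.0, 26.0]

shift_pp = mean(facilitated) - mean(baseline)  # shift in percentage points
t_stat, df = welch_t(facilitated, baseline)
d = cohens_d(facilitated, baseline)

print(f"shift = {shift_pp:.2f} pp, t = {t_stat:.2f}, df = {df:.1f}, d = {d:.2f}")
```

Running one such test per charity and condition produces many comparisons, so a reader would reasonably expect some multiple-comparison adjustment to accompany a full allocation table.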
Referee: [Methods] Methods section: The operationalization of consensus (e.g., variance, pairwise similarity, or other aggregate metrics) and participation equity (survey items plus transcript coding rules for message volume/turn-taking) must be specified in detail, including inter-rater reliability for transcripts and power analysis for the null consensus result; these metrics are load-bearing for the steering and illusion-of-inclusion conclusions.

Authors: We will expand the Methods section with precise operational definitions. Consensus is measured by (1) variance of the final allocation proportions across groups and (2) mean pairwise cosine similarity of pre- and post-discussion preference vectors. Participation equity comprises Likert-scale survey items on perceived inclusion/fairness plus transcript coding for message count, total words, and turn-taking Gini coefficient. Two coders will independently code 20% of transcripts; Cohen's kappa will be reported. A post-hoc power analysis for the null consensus results, based on observed effect sizes, will be added to quantify sensitivity to small effects. revision: yes
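
Two of the quantities named in this response, the turn-taking Gini coefficient and Cohen's kappa, have standard definitions that can be sketched briefly. The code below is a minimal illustration, not the paper's pipeline: the message counts, the coding labels, and the function names are hypothetical.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means perfectly equal participation; the maximum for n speakers is (n - 1) / n.
    """
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(counts)
    # Rank-weighted formula for values sorted in ascending order.
    weighted = sum((rank + 1) * value for rank, value in enumerate(ordered))
    return (2 * weighted) / (n * total) - (n + 1) / n


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each coder's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up message counts for a group of three.
print(gini([10, 10, 10]))  # equal turn-taking
print(gini([0, 0, 30]))    # one participant sends every message

# Made-up labels from two coders over four messages.
coder_1 = ["question", "question", "summary", "summary"]
coder_2 = ["question", "question", "summary", "question"]
print(cohens_kappa(coder_1, coder_2))
```

With only three speakers per group the Gini coefficient tops out at 2/3, so values are best compared across groups of the same size.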
Referee: [Discussion] Discussion section: The interpretation that unchanged aggregate metrics plus directional shifts constitute 'steering' rather than a form of consensus change needs justification against alternative granular measures (e.g., semantic alignment of contributions or preference polarization indices); if coarser metrics miss these, the governance-risk framing may require qualification.

Authors: We maintain that the observed pattern qualifies as steering because directional changes in specific allocations occurred without corresponding gains in aggregate agreement, indicating targeted influence rather than broad convergence. In revision we will add explicit justification contrasting our metrics with polarization indices (showing no increase in preference extremity) and acknowledge that semantic alignment or contribution-level measures could reveal subtler dynamics. The governance-risk language will be qualified to note that our standard allocation metrics may not capture every form of influence, while still highlighting the dissociation between perceived and measured outcomes. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivations or self-referential predictions

The paper reports two incentive-compatible experiments (N=879) measuring LLM facilitation effects on group consensus, allocation shifts, and perceived inclusivity via surveys and transcripts. No equations, fitted parameters, or first-principles derivations appear; all results rest on direct statistical comparisons of collected data against baselines. No self-citation chains or ansatzes are invoked to justify core claims, so the reported findings on steering and illusion of inclusion are independent of any internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard assumptions of randomized experimental design and statistical testing rather than new parameters or entities.

axioms (1)

[standard math] Standard assumptions of randomized controlled trials and null-hypothesis significance testing apply to the group allocation task. Invoked when reporting no significant improvement in consensus.

pith-pipeline@v0.9.0 · 5561 in / 1330 out tokens · 39842 ms · 2026-05-15T01:47:15.030741+00:00 · methodology