pith. machine review for the scientific record.

arxiv: 2604.21794 · v1 · submitted 2026-04-23 · 💻 cs.AI · cs.CL · cs.MA

Recognition: unknown

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Haibo Jin, Haohan Wang, Heming Liu, Peng Kuang, Xiaopeng Yuan, Ye Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:44 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords multi-agent systems · latent communication · large language models · end-to-end optimization · reasoning benchmarks · DiffMAS · parameter-efficient training

The pith

Multi-agent language models improve reasoning by learning to communicate through internal latent representations instead of text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffMAS, a framework for training multi-agent LLM systems where communication happens in the models' internal states rather than as text. It uses parameter-efficient supervised learning on trajectories of these latent communications to let agents optimize how they encode and decode shared information during interactions. This leads to better performance than single-agent setups or text-based communication on tasks like mathematical reasoning, scientific QA, code generation, and commonsense benchmarks. A sympathetic reader would care because it points to a way for AI agents to collaborate more effectively by evolving their own communication protocols without relying on fixed human-readable interfaces.

Core claim

DiffMAS treats latent communication as a learnable component by performing parameter-efficient supervised training over multi-agent latent trajectories. This enables the agents to jointly learn how to encode and interpret information across interactions, improving reasoning accuracy and decoding stability, with reported scores of 26.7% on AIME24 and 20.2% on GPQA-Diamond.

What carries the argument

DiffMAS, the training framework that performs parameter-efficient supervised training over multi-agent latent trajectories to jointly optimize how agents encode and interpret shared information.
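A toy sketch may help make the latent channel concrete. Nothing below comes from the paper's code: the dimensions, the single linear adapter, and the prefix-style message are illustrative stand-ins for the KV-cache mechanics the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D_SENDER, D_RECEIVER = 8, 64, 48  # toy trace length and hidden sizes

# The sender agent emits a latent trajectory: one hidden state per step,
# standing in for the KV-cache entries a real LLM would produce.
sender_trace = rng.standard_normal((T, D_SENDER))

# The trainable piece is a small adapter mapping sender states into the
# receiver's representation space -- the learnable communication channel.
# Here it is a fixed random linear map, purely for illustration.
W_adapter = 0.1 * rng.standard_normal((D_SENDER, D_RECEIVER))

# The receiver consumes the projected trace as latent prefix context,
# so the "message" is never decoded to text.
message = sender_trace @ W_adapter                     # (T, D_RECEIVER)
receiver_own = rng.standard_normal((4, D_RECEIVER))    # receiver's own states
receiver_context = np.vstack([message, receiver_own])  # (T + 4, D_RECEIVER)

print(receiver_context.shape)  # (12, 48)
```

In DiffMAS's terms, supervision would flow back through the adapter from the final agent's task loss while the base models stay frozen; the single matrix here stands in for whatever adapter parameterization the paper actually uses.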

If this is right

  • Consistent accuracy gains across mathematical reasoning, scientific QA, code generation, and commonsense benchmarks.
  • Superior performance to single-agent inference, text-based multi-agent systems, and prior latent communication methods.
  • Improved decoding stability during multi-agent interactions.
  • Specific results of 26.7% on AIME24 and 20.2% on GPQA-Diamond.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This latent approach might reduce communication overhead when scaling to teams of more than two agents.
  • It could allow agent groups to develop task-specific internal protocols without human-readable prompts.
  • Similar trajectory-based optimization might apply to non-text modalities such as vision or structured data.

Load-bearing premise

That supervised training on multi-agent latent trajectories can be performed in a parameter-efficient manner that jointly optimizes encoding and interpretation without instability or requiring unavailable task-specific data.
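What "parameter-efficient" buys can be made concrete with a LoRA-style count. The excerpt does not name the adapter family DiffMAS uses, so both the low-rank form and the sizes below are illustrative assumptions:

```python
# Instead of updating a full d_out x d_in weight matrix, a LoRA-style
# adapter trains two low-rank factors B (d_out x r) and A (r x d_in)
# and adds B @ A to the frozen weight. Sizes below are hypothetical.
d_in, d_out, r = 4096, 4096, 16

full_params = d_out * d_in          # dense fine-tuning of one matrix
lora_params = r * (d_out + d_in)    # the two low-rank factors

print(full_params)                  # 16777216
print(lora_params)                  # 131072
print(lora_params / full_params)    # 0.0078125, i.e. under 1%
```

The three-orders-of-magnitude gap per matrix is why training can target only the communication pathway while leaving each agent's base model untouched.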

What would settle it

If applying DiffMAS produces no accuracy gains or reduced stability on held-out reasoning benchmarks compared to single-agent or text-based baselines, or if it requires full model updates instead of parameter-efficient training.

Figures

Figures reproduced from arXiv: 2604.21794 by Haibo Jin, Haohan Wang, Heming Liu, Peng Kuang, Xiaopeng Yuan, Ye Yu.

Figure 1. In Stage I, agents 1 to K–1 sequentially construct a shared KV trace by prefilling the existing …
Figure 2. Self-consistency analysis on AIME2024 for Qwen3-4B (top) and Qwen3-8B (bottom). Overall Performance: As shown in Tables 1 and 2, DiffMAS consistently achieves the best performance across math/science reasoning (AIME24/25, GPQA-Diamond), code generation (HumanEval+, MBPP+), and commonsense reasoning (OpenBookQA). The improvements are especially pronounced at smaller scales, where Qwen3-4B improves from 43…
Figure 3. Perplexity analysis on AIME2024 for Qwen3-4B, DiffMAS compared to LatentMAS. Density indicates the number of problems falling into each perplexity score category. Self-Consistency in Inference: Beyond aggregate accuracy, we analyze the self-consistency of multi-agent reasoning on the AIME 2024 benchmark. We measure self-consistency by independently sampling each problem four times and recording the number …
Figure 4. Token-level predictive entropy (top-25) of the judger agent on AIME2024 under LatentMAS (top) and DiffMAS (bottom). Token-Level Entropy Dynamics and Stability of Differentiable Latent Communication: To study the stability of latent multi-agent communication, we analyze the token-level predictive entropy of the final agent during decoding. At each step, we compute the entropy of the top-25 token distribution …
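The top-25 predictive entropy named in the Figure 4 caption is simple to compute per decoding step. This sketch assumes the standard renormalized top-k definition, which the excerpt does not spell out:

```python
import math
import random

def topk_entropy(logits, k=25):
    """Shannon entropy (nats) of the softmax over the k largest logits,
    renormalized to sum to one -- the per-step stability signal described
    for the judger agent."""
    top = sorted(logits)[-k:]
    m = max(top)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

rng = random.Random(0)
step_logits = [rng.gauss(0.0, 1.0) for _ in range(32_000)]  # one step, toy vocab

h = topk_entropy(step_logits)
# 0 would mean a fully peaked step; log(25) ~= 3.22 a maximally uncertain one.
```

Plotting this value per decoding step, as Figure 4 does, gives a trace whose spikes indicate unstable decoding.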
Original abstract

Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DiffMAS, a training framework that treats latent communication in multi-agent LLM systems as a learnable component via parameter-efficient supervised training over multi-agent latent trajectories. This enables agents to jointly optimize how information is encoded and interpreted across interactions. Experiments across mathematical reasoning, scientific QA, code generation, and commonsense benchmarks report consistent gains over single-agent inference, text-based multi-agent systems, and prior latent communication methods, including 26.7% on AIME24 and 20.2% on GPQA-Diamond.

Significance. If the empirical results and training procedure hold, this would advance multi-agent LLM systems by shifting from fixed or text-based communication interfaces to jointly optimized latent protocols. The parameter-efficient supervised approach could improve scalability and reasoning stability on complex tasks, offering a practical path beyond role-orchestration-focused methods.

major comments (2)
  1. §3 (Method): The procedure for collecting and labeling multi-agent latent trajectories for supervised training is not specified in sufficient detail. If trajectory generation depends on running base multi-agent systems on tasks with known answers or requires task-specific ground-truth labels, this creates potential circularity and data-scarcity issues that directly undermine the central claim of general, end-to-end optimization without pre-existing optimized systems.
  2. §4 (Experiments): The reported benchmark improvements (e.g., 26.7% on AIME24, 20.2% on GPQA-Diamond) lack accompanying details on training procedure, data construction, baseline implementations, number of runs, variance, or statistical significance testing. These omissions make it impossible to verify reproducibility or assess whether the gains are robust, which is load-bearing for the claim of consistent outperformance.
minor comments (2)
  1. Abstract: The phrase 'decoding stability' is invoked as a benefit but is neither defined nor quantified, leaving its meaning and measurement unclear.
  2. Throughout: Notation for latent representations (e.g., key-value caches) would benefit from explicit equations or diagrams to clarify the communication mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback. We believe the suggested revisions will strengthen the paper by enhancing clarity on the method and experimental rigor. Below we provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: §3 (Method): The procedure for collecting and labeling multi-agent latent trajectories for supervised training is not specified in sufficient detail. If trajectory generation depends on running base multi-agent systems on tasks with known answers or requires task-specific ground-truth labels, this creates potential circularity and data-scarcity issues that directly undermine the central claim of general, end-to-end optimization without pre-existing optimized systems.

    Authors: We thank the referee for highlighting this important point. Upon review, we acknowledge that the description in §3 could be more detailed to avoid ambiguity. In the revised manuscript, we will specify that the multi-agent latent trajectories are collected by performing standard forward passes with the base (unoptimized) LLMs on the training set of tasks. Labeling uses the available ground-truth answers for supervised loss computation on the reasoning outputs, while the latent communication parameters are trained to improve encoding and decoding of information between agents. This setup does not rely on pre-existing optimized systems, as the base models remain frozen and the training starts from random adapters. We will also discuss how this scales to tasks without labels by using consistency-based pseudo-labeling, mitigating data scarcity concerns. revision: yes

  2. Referee: §4 (Experiments): The reported benchmark improvements (e.g., 26.7% on AIME24, 20.2% on GPQA-Diamond) lack accompanying details on training procedure, data construction, baseline implementations, number of runs, variance, or statistical significance testing. These omissions make it impossible to verify reproducibility or assess whether the gains are robust, which is load-bearing for the claim of consistent outperformance.

    Authors: We agree that the experimental section lacks sufficient details for full reproducibility and robustness assessment. We will revise §4 and add an appendix to include: comprehensive training procedure details (e.g., learning rate, batch size, number of training steps); data construction process (sampling of trajectories from benchmark datasets); precise baseline implementations (model versions, hyperparameters); results averaged over multiple independent runs with reported variance; and p-values from statistical tests (e.g., Wilcoxon signed-rank test) to confirm significance of improvements. These additions will directly support the claims of consistent outperformance. revision: yes
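The significance testing the authors promise can be sketched with a stdlib-only paired sign-flip permutation test, a common alternative to the Wilcoxon signed-rank test they name. The paired scores below are hypothetical placeholders, not numbers from the paper:

```python
import random

def paired_permutation_pvalue(a, b, n_iter=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired scores.

    Under the null (no systematic difference), each per-benchmark
    difference is equally likely to carry either sign; the p-value is
    the fraction of random sign assignments whose summed difference is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_iter):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical per-benchmark accuracies (method vs. baseline); the real
# inputs would be the averaged runs promised for the revised §4.
method   = [26.7, 20.2, 81.0, 74.5, 66.3, 58.9]
baseline = [21.0, 17.5, 78.2, 71.0, 64.8, 55.1]

p = paired_permutation_pvalue(method, baseline)
# With six pairs all favoring the method, p should land near 2/64 ~= 0.031.
```

With only six paired benchmarks, the smallest attainable two-sided p-value is 2/64, which is one reason per-run variance reporting matters alongside the test.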

Circularity Check

0 steps flagged

No significant circularity: empirical training method without derivation chain

full rationale

The paper describes DiffMAS as a parameter-efficient supervised training framework that operates over multi-agent latent trajectories to jointly optimize encoding and interpretation. No equations, first-principles derivations, or mathematical predictions are presented in the abstract or described claims. Performance gains are reported via empirical benchmarks (AIME24, GPQA-Diamond, etc.) rather than any reduction of a result to a fitted quantity defined by the method itself. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are invoked in the provided text. The work is self-contained as an empirical contribution and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes the existence of usable multi-agent latent trajectories for supervised training.

pith-pipeline@v0.9.0 · 5480 in / 1084 out tokens · 35432 ms · 2026-05-09T21:44:52.189814+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 8 canonical work pages · 2 internal anchors
