Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3
The pith
Modality-native routing in agent networks raises task accuracy by 20 percentage points over text-bottleneck baselines, but only when downstream agents can use the preserved context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preserving multimodal signals across agent boundaries is necessary but not sufficient for accurate cross-modal reasoning. Modality-native routing via MMA2A improves task accuracy by 20 percentage points over text-bottleneck baselines on the CrossModal-CS benchmark, but only when the downstream reasoning agent can exploit the richer context. An ablation with keyword matching shows the gap disappears entirely, confirming that protocol-level native routing must pair with capable reasoning.
What carries the argument
The MMA2A architecture layer, which routes voice, image, and text parts in their native modality by inspecting Agent Card capability declarations, preserving context for downstream reasoning.
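The routing mechanism described above can be sketched in a few lines: inspect a downstream agent's declared capabilities and forward each message part natively when supported, otherwise mark it for text conversion. This is a minimal illustration, not the A2A schema; the field names (`input_modes`, `kind`) are hypothetical.

```python
# Hypothetical sketch of modality-native routing: check a downstream
# agent's capability declarations and split parts into native forwards
# vs. text-converted fallbacks. Field names are illustrative only.

def route_parts(parts, agent_card):
    """Split message parts by whether the target declares native support."""
    supported = set(agent_card.get("input_modes", ["text"]))
    native, fallback = [], []
    for part in parts:
        if part["kind"] in supported:
            native.append(part)      # preserved in native modality
        else:
            fallback.append(part)    # would be flattened to text
    return native, fallback

card = {"input_modes": ["text", "image"]}
parts = [
    {"kind": "image", "data": b"...jpeg bytes..."},
    {"kind": "voice", "data": b"...pcm audio..."},
    {"kind": "text", "data": "customer reports a cracked casing"},
]
native, fallback = route_parts(parts, card)
```

Under this sketch, the image and text parts reach the reasoning agent untouched, while the voice part would be degraded, which is exactly the information loss the paper attributes to text-bottleneck baselines.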
If this is right
- Task completion accuracy rises from 32% to 52%, with larger gains on vision-dependent tasks such as product defect reports.
- The accuracy benefit requires capable LLM-based reasoning and vanishes when replaced by keyword matching.
- Native multimodal processing adds a 1.8× latency cost compared to text-only routing.
- Routing becomes a first-order design variable in multi-agent systems because it determines the information available to reasoning agents.
Where Pith is reading between the lines
- Native routing methods could extend to video or sensor data streams in agent networks for similar context preservation.
- Protocol standards may need to include fallback conversions when native modality support is unavailable.
- The accuracy-latency tradeoff suggests prioritizing native routing for tasks where visual or audio details are decisive.
- Larger-scale tests with many agents could show whether routing overhead increases with network complexity.
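The fallback-conversion idea raised in the bullets above can be made concrete: when a downstream agent does not declare support for a modality, a protocol layer could degrade the part to text rather than drop it. The converter table below is our hypothetical sketch, not part of the A2A specification.

```python
# Sketch of protocol-level fallback conversion: unsupported modalities
# are degraded to a text stand-in instead of being dropped. The
# converters here are placeholders (a real system would caption the
# image or transcribe the audio).

FALLBACKS = {
    "image": lambda part: {"kind": "text",
                           "data": f"[image: {part.get('alt', 'no caption')}]"},
    "voice": lambda part: {"kind": "text",
                           "data": f"[audio part, {len(part['data'])} bytes, no transcript]"},
}

def degrade(part, supported):
    """Return the part unchanged if native support exists, else convert."""
    if part["kind"] in supported:
        return part
    convert = FALLBACKS.get(part["kind"])
    if convert is None:
        raise ValueError(f"no fallback for modality {part['kind']!r}")
    return convert(part)

img = {"kind": "image", "data": b"...", "alt": "cracked casing"}
degraded = degrade(img, supported={"text"})
```

The keyword-matching ablation suggests such fallbacks would cap accuracy at the text-bottleneck level, so they are a graceful-degradation path, not a substitute for native routing.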
Load-bearing premise
The downstream agent must be able to process and reason over native multimodal inputs rather than losing information through forced text conversion.
What would settle it
An independent run of the CrossModal-CS benchmark with the same LLM backend but a reasoning agent that cannot directly process native images or voice, which should eliminate the accuracy gap between MMA2A and the text-bottleneck baseline.
Original abstract
Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $\Delta$TCA: [8, 32] pp; McNemar's exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MMA2A as an architecture layer atop the A2A protocol that routes voice, image, and text modalities natively by inspecting Agent Card capability declarations. On the CrossModal-CS 50-task benchmark with fixed LLM backend and tasks, it reports 52% task completion accuracy for MMA2A versus 32% for the text-bottleneck baseline (20 pp gain; 95% bootstrap CI [8, 32] pp; McNemar's exact p = 0.006). An ablation replacing LLM reasoning with keyword matching eliminates the gap (36% vs. 36%), showing the benefit requires capable downstream reasoning. Gains concentrate on vision-dependent tasks (+38.5 pp for product defect reports, +16.7 pp for visual troubleshooting) at a reported 1.8× latency cost.
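The reported significance figure is easy to sanity-check. McNemar's exact test depends only on the discordant pairs; with 50 tasks and 26 vs. 16 correct, a split of 11 tasks solved only by MMA2A and 1 solved only by the baseline reproduces both the 20 pp gap and the reported p. That split is one hypothetical assignment consistent with the summary statistics; the per-task outcomes are not published here.

```python
# Sanity check of the reported McNemar's exact p = 0.006 on 50 paired
# task outcomes. The discordant split (b=11 MMA2A-only, c=1
# baseline-only) is hypothetical but consistent with 26 vs. 16 correct.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant-pair counts."""
    n = b + c
    tail = sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = mcnemar_exact(b=11, c=1)  # ≈ 0.0063, matching the reported p = 0.006
```

With 15 tasks solved by both and 23 by neither, this split yields exactly 26/50 for MMA2A and 16/50 for the baseline, so the published statistics are internally consistent.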
Significance. If the result holds, the work provides concrete evidence that protocol-level routing decisions are first-order determinants of performance in multimodal multi-agent systems because they control the information available to downstream agents. The controlled comparison (identical tasks and backend, routing as sole variable) together with the keyword-matching ablation that closes the accuracy gap entirely supplies a clear two-layer requirement: native routing is beneficial only when paired with capable agent-level reasoning. This strengthens the case for treating modality preservation as a deliberate design variable rather than an afterthought in A2A networks.
minor comments (3)
- The 1.8× latency cost is stated in the abstract without measurement details, hardware specification, or breakdown of overhead sources (e.g., multimodal encoding vs. transmission).
- The CrossModal-CS benchmark is referenced but not described or cited; a brief task taxonomy or pointer to its definition would help readers evaluate the scope of the reported vision-dependent gains.
- The structure and standardization status of 'Agent Card capability declarations' are not elaborated; clarifying whether they extend an existing schema would aid reproducibility and adoption.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work. The assessment correctly identifies the core contribution: modality-native routing improves task accuracy by 20 pp on the CrossModal-CS benchmark, but only when paired with capable LLM-based reasoning, as demonstrated by the keyword-matching ablation that eliminates the gap. We appreciate the emphasis on the controlled experimental design (identical tasks, backend, and routing as the sole variable) and the recognition that routing decisions are first-order determinants of performance in multimodal A2A systems. In the revised manuscript we will address the three minor comments: measurement details and an overhead breakdown for the 1.8× latency cost, a description and citation for the CrossModal-CS benchmark, and the schema status of Agent Card capability declarations.
Circularity Check
No significant circularity; empirical result on controlled benchmark
full rationale
The paper introduces the MMA2A architecture as an extension to A2A protocols and evaluates it via direct empirical comparison on the CrossModal-CS benchmark. The central claim (20 pp accuracy gain) is measured under controlled conditions with identical tasks, backend, and only routing path as the variable, plus an explicit ablation (keyword matching) that eliminates the gap. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the result is a straightforward statistical comparison (bootstrap CI and McNemar's test) on a fixed 50-task set. This is a self-contained empirical finding without reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Downstream LLM-backed agents can exploit richer multimodal context when it is preserved by native routing
invented entities (1)
- MMA2A architecture layer (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents: Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
Reference graph
Works this paper leans on
- [1] Google. Agent2Agent Protocol (A2A) Specification, v0.2. https://a2a-protocol.org/latest/specification/, 2025.
- [2] Anthropic. Model Context Protocol (MCP). https://modelcontextprotocol.io/, 2024.
- [3] IBM Research. Agent Communication Protocol (ACP): RESTful multipart messaging for agent systems. Technical report, 2025.
- [4] A. Ehtesham, A. Singh, G. K. Gupta, and S. Kumar. A survey of agent interoperability protocols: MCP, ACP, A2A, and ANP. arXiv preprint arXiv:2505.02279, 2025.
- [5] C. C. Liao, D. Liao, and S. S. Gadiraju. AgentMaster: A multi-agent conversational framework using A2A and MCP protocols for multimodal information retrieval and analysis. In Proc. EMNLP System Demonstrations, 2025.
- [6] A. Adimulam, R. Gupta, and S. Kumar. The orchestration of multi-agent systems: Architectures, protocols, and enterprise adoption. arXiv preprint arXiv:2601.13671, 2026.
- [7] M. Habiba and N. I. Khan. Revisiting gossip protocols: A vision for emergent coordination in agentic multi-agent systems. arXiv preprint arXiv:2508.01531, 2025.
- [8] OpenAI. GPT-4o system card. Technical report, 2024.
- [9] Google DeepMind. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024.
- [10]
- [11] Y. Liu et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2024.
- [12] X. Yue et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. CVPR, 2024.
- [13] Q. Wu et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2024.
- [14] J. Moura. CrewAI: Framework for orchestrating role-playing autonomous AI agents. https://github.com/crewAIInc/crewAI, 2024.
discussion (0)