Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3
The pith
Modality-native routing in agent networks raises task accuracy by 20 percentage points over text-bottleneck baselines, but only when downstream agents can use the preserved context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preserving multimodal signals across agent boundaries is necessary but not sufficient for accurate cross-modal reasoning. Modality-native routing via MMA2A improves task accuracy by 20 percentage points over text-bottleneck baselines on the CrossModal-CS benchmark, but only when the downstream reasoning agent can exploit the richer context. An ablation with keyword matching shows the gap disappears entirely, confirming that protocol-level native routing must pair with capable reasoning.
What carries the argument
The MMA2A architecture layer, which routes voice, image, and text parts in their native modality by inspecting Agent Card capability declarations, preserving context for downstream reasoning.
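The routing mechanism described above can be sketched in a few lines: inspect a downstream agent's declared capabilities and forward each message part natively when supported, otherwise mark it for text conversion. This is a minimal illustration, not the A2A schema; the field names (`input_modes`, `kind`) are hypothetical.

```python
# Hypothetical sketch of modality-native routing: check a downstream
# agent's capability declarations and split parts into native forwards
# vs. text-converted fallbacks. Field names are illustrative only.

def route_parts(parts, agent_card):
    """Split message parts by whether the target declares native support."""
    supported = set(agent_card.get("input_modes", ["text"]))
    native, fallback = [], []
    for part in parts:
        if part["kind"] in supported:
            native.append(part)      # preserved in native modality
        else:
            fallback.append(part)    # would be flattened to text
    return native, fallback

card = {"input_modes": ["text", "image"]}
parts = [
    {"kind": "image", "data": b"...jpeg bytes..."},
    {"kind": "voice", "data": b"...pcm audio..."},
    {"kind": "text", "data": "customer reports a cracked casing"},
]
native, fallback = route_parts(parts, card)
```

Under this sketch, the image and text parts reach the reasoning agent untouched, while the voice part would be degraded, which is exactly the information loss the paper attributes to text-bottleneck baselines.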
If this is right
- Task completion accuracy rises from 32% to 52%, with larger gains on vision-dependent tasks such as product defect reports.
- The accuracy benefit requires capable LLM-based reasoning and vanishes when replaced by keyword matching.
- Native multimodal processing adds a 1.8× latency cost compared to text-only routing.
- Routing becomes a first-order design variable in multi-agent systems because it determines the information available to reasoning agents.
Where Pith is reading between the lines
- Native routing methods could extend to video or sensor data streams in agent networks for similar context preservation.
- Protocol standards may need to include fallback conversions when native modality support is unavailable.
- The accuracy-latency tradeoff suggests prioritizing native routing for tasks where visual or audio details are decisive.
- Larger-scale tests with many agents could show whether routing overhead increases with network complexity.
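The fallback-conversion idea raised in the bullets above can be made concrete: when a downstream agent does not declare support for a modality, a protocol layer could degrade the part to text rather than drop it. The converter table below is our hypothetical sketch, not part of the A2A specification.

```python
# Sketch of protocol-level fallback conversion: unsupported modalities
# are degraded to a text stand-in instead of being dropped. The
# converters here are placeholders (a real system would caption the
# image or transcribe the audio).

FALLBACKS = {
    "image": lambda part: {"kind": "text",
                           "data": f"[image: {part.get('alt', 'no caption')}]"},
    "voice": lambda part: {"kind": "text",
                           "data": f"[audio part, {len(part['data'])} bytes, no transcript]"},
}

def degrade(part, supported):
    """Return the part unchanged if native support exists, else convert."""
    if part["kind"] in supported:
        return part
    convert = FALLBACKS.get(part["kind"])
    if convert is None:
        raise ValueError(f"no fallback for modality {part['kind']!r}")
    return convert(part)

img = {"kind": "image", "data": b"...", "alt": "cracked casing"}
degraded = degrade(img, supported={"text"})
```

The keyword-matching ablation suggests such fallbacks would cap accuracy at the text-bottleneck level, so they are a graceful-degradation path, not a substitute for native routing.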
Load-bearing premise
The downstream agent must be able to process and reason over native multimodal inputs rather than losing information through forced text conversion.
What would settle it
An independent run of the CrossModal-CS benchmark with the same LLM backend but a reasoning agent that cannot directly process native images or voice, which should eliminate the accuracy gap between MMA2A and the text-bottleneck baseline.
Original abstract
Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $\Delta$TCA: [8, 32] pp; McNemar's exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MMA2A as an architecture layer atop the A2A protocol that routes voice, image, and text modalities natively by inspecting Agent Card capability declarations. On the CrossModal-CS 50-task benchmark with fixed LLM backend and tasks, it reports 52% task completion accuracy for MMA2A versus 32% for the text-bottleneck baseline (20 pp gain; 95% bootstrap CI [8, 32] pp; McNemar's exact p = 0.006). An ablation replacing LLM reasoning with keyword matching eliminates the gap (36% vs. 36%), showing the benefit requires capable downstream reasoning. Gains concentrate on vision-dependent tasks (+38.5 pp for product defect reports, +16.7 pp for visual troubleshooting) at a reported 1.8× latency cost.
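The reported significance figure is easy to sanity-check. McNemar's exact test depends only on the discordant pairs; with 50 tasks and 26 vs. 16 correct, a split of 11 tasks solved only by MMA2A and 1 solved only by the baseline reproduces both the 20 pp gap and the reported p. That split is one hypothetical assignment consistent with the summary statistics; the per-task outcomes are not published here.

```python
# Sanity check of the reported McNemar's exact p = 0.006 on 50 paired
# task outcomes. The discordant split (b=11 MMA2A-only, c=1
# baseline-only) is hypothetical but consistent with 26 vs. 16 correct.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant-pair counts."""
    n = b + c
    tail = sum(comb(n, k) for k in range(max(b, c), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = mcnemar_exact(b=11, c=1)  # ≈ 0.0063, matching the reported p = 0.006
```

With 15 tasks solved by both and 23 by neither, this split yields exactly 26/50 for MMA2A and 16/50 for the baseline, so the published statistics are internally consistent.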
Significance. If the result holds, the work provides concrete evidence that protocol-level routing decisions are first-order determinants of performance in multimodal multi-agent systems because they control the information available to downstream agents. The controlled comparison (identical tasks and backend, routing as sole variable) together with the keyword-matching ablation that closes the accuracy gap entirely supplies a clear two-layer requirement: native routing is beneficial only when paired with capable agent-level reasoning. This strengthens the case for treating modality preservation as a deliberate design variable rather than an afterthought in A2A networks.
minor comments (3)
- The 1.8× latency cost is stated in the abstract without measurement details, hardware specification, or breakdown of overhead sources (e.g., multimodal encoding vs. transmission).
- The CrossModal-CS benchmark is referenced but not described or cited; a brief task taxonomy or pointer to its definition would help readers evaluate the scope of the reported vision-dependent gains.
- The structure and standardization status of 'Agent Card capability declarations' are not elaborated; clarifying whether they extend an existing schema would aid reproducibility and adoption.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work. The assessment correctly identifies the core contribution: modality-native routing improves task accuracy by 20 pp on the CrossModal-CS benchmark, but only when paired with capable LLM-based reasoning, as demonstrated by the keyword-matching ablation that eliminates the gap. We appreciate the emphasis on the controlled experimental design (identical tasks, backend, and routing as the sole variable) and the recognition that routing decisions are first-order determinants of performance in multimodal A2A systems. In the revised manuscript we will address the three minor comments: measurement details and an overhead breakdown for the 1.8× latency cost, a description and citation for the CrossModal-CS benchmark, and the schema status of Agent Card capability declarations.
Circularity Check
No significant circularity; empirical result on controlled benchmark
full rationale
The paper introduces the MMA2A architecture as an extension to A2A protocols and evaluates it via direct empirical comparison on the CrossModal-CS benchmark. The central claim (20 pp accuracy gain) is measured under controlled conditions with identical tasks, backend, and only routing path as the variable, plus an explicit ablation (keyword matching) that eliminates the gap. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the result is a straightforward statistical comparison (bootstrap CI and McNemar's test) on a fixed 50-task set. This is a self-contained empirical finding without reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Downstream LLM-backed agents can exploit richer multimodal context when it is preserved by native routing
invented entities (1)
- MMA2A architecture layer (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents: Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
Reference graph
Works this paper leans on
- [1] Google. Agent2Agent Protocol (A2A) Specification, v0.2. https://a2a-protocol.org/latest/specification/, 2025.
- [2] Anthropic. Model Context Protocol (MCP). https://modelcontextprotocol.io/, 2024.
- [3] IBM Research. Agent Communication Protocol (ACP): RESTful multipart messaging for agent systems. Technical report, 2025.
- [4] A. Ehtesham, A. Singh, G. K. Gupta, and S. Kumar. A survey of agent interoperability protocols: MCP, ACP, A2A, and ANP. arXiv preprint arXiv:2505.02279, 2025.
- [5] C. C. Liao, D. Liao, and S. S. Gadiraju. AgentMaster: A multi-agent conversational framework using A2A and MCP protocols for multimodal information retrieval and analysis. In Proc. EMNLP System Demonstrations, 2025.
- [6] A. Adimulam, R. Gupta, and S. Kumar. The orchestration of multi-agent systems: Architectures, protocols, and enterprise adoption. arXiv preprint arXiv:2601.13671, 2026.
- [7] M. Habiba and N. I. Khan. Revisiting gossip protocols: A vision for emergent coordination in agentic multi-agent systems. arXiv preprint arXiv:2508.01531, 2025.
- [8] OpenAI. GPT-4o system card. Technical report, 2024.
- [9] Google DeepMind. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2024.
- [10]
- [11] Y. Liu et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2024.
- [12] X. Yue et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. CVPR, 2024.
- [13] Q. Wu et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2024.
- [14] J. Moura. CrewAI: Framework for orchestrating role-playing autonomous AI agents. https://github.com/crewAIInc/crewAI, 2024.
discussion (0)