Latent Cache Flow: Model-to-Model Communication Without Text
Pith reviewed 2026-05-25 05:44 UTC · model grok-4.3
The pith
A compact adapter lets LLMs exchange KV cache summaries instead of text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Cache Flow enables model-to-model communication without text. A small adapter jointly translates and compresses keys and values from the sharer model's KV cache into a summary of the information the receiver does not already have, so the receiver can incorporate it even when the two models hold different contexts.
What carries the argument
The Latent Cache Flow adapter, which jointly translates and compresses KV cache entries to transmit summaries of new information.
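The source gives no architectural detail beyond "jointly translates and compresses". As a rough illustration only, the sketch below shows one shape such an adapter could take: each sharer token's key and value are concatenated and pushed through a shared low-rank projection (joint translation), then a few slot queries pool all tokens into a fixed-size summary (compression). The class name `JointKVAdapter`, the low-rank bottleneck, the slot-query pooling, and every dimension are assumptions made for this example, not the paper's design. The weights are random rather than trained, the sketch covers a single layer and head, and the "new information" filtering described in the paper is not modeled.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix standing in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class JointKVAdapter:
    """Hypothetical adapter: joint K/V translation plus pooling into summary slots."""

    def __init__(self, d_sharer, d_receiver, rank, n_slots, seed=0):
        rng = random.Random(seed)
        # Shared down-projection applied to the concatenated [key; value] vector.
        self.down = rand_matrix(rank, 2 * d_sharer, rng)
        # Small heads map the shared code to receiver-sized keys and values.
        self.up_k = rand_matrix(d_receiver, rank, rng)
        self.up_v = rand_matrix(d_receiver, rank, rng)
        # One query per summary slot; each pools all tokens into one entry.
        self.slot_queries = rand_matrix(n_slots, rank, rng)

    def __call__(self, keys, values):
        # 1) Joint translation: one shared code per sharer token.
        codes = [matvec(self.down, k + v) for k, v in zip(keys, values)]
        summary_k, summary_v = [], []
        for q in self.slot_queries:
            # 2) Compression: attention weights of this slot over all tokens.
            weights = softmax([sum(a * b for a, b in zip(q, c)) for c in codes])
            pooled = [sum(w * c[i] for w, c in zip(weights, codes))
                      for i in range(len(q))]
            summary_k.append(matvec(self.up_k, pooled))
            summary_v.append(matvec(self.up_v, pooled))
        return summary_k, summary_v

# Toy input: 20 sharer tokens with 8-dimensional keys and values.
rng = random.Random(1)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
values = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

adapter = JointKVAdapter(d_sharer=8, d_receiver=6, rank=4, n_slots=3)
sk, sv = adapter(keys, values)
print(len(sk), len(sk[0]))  # 3 summary slots, each of receiver width 6
```

The point of the sketch is that the summary length (`n_slots`) is independent of the sharer's token count, which is one way an adapter could avoid the token-by-token alignment that requires identical contexts.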
If this is right
- A 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings.
- When the two models hold different contexts, LCF is 23 percent more accurate than text-based communication.
- In the same different-context setting, LCF communication runs 8.5 times faster than text-based communication.
- The adapter size is reduced to about 4 percent of that used in Cache-to-Cache approaches.
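The two size figures above are worth checking against each other. A quick calculation using only the numbers quoted in the abstract suggests that 13 MB is well under 4 percent of 956 MB, so the "about 4 percent" figure presumably describes a different configuration or comparison. The source text does not say which.

```python
# Adapter sizes quoted in the abstract, in megabytes.
lcf_mb = 13
c2c_mb = 956

ratio = lcf_mb / c2c_mb            # 13 MB as a fraction of the C2C adapter
four_percent_mb = 0.04 * c2c_mb    # what 4% of the C2C adapter would be

print(f"13 MB / 956 MB = {ratio:.2%}")               # 1.36%
print(f"4% of 956 MB   = {four_percent_mb:.1f} MB")  # 38.2 MB
```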
Where Pith is reading between the lines
- Networks of LLMs could exchange updates continuously without regenerating text at every step.
- The summary approach might extend to cases where models differ in size or architecture.
- Testing whether the same adapter works across entirely new model pairs would show how general the translation is.
Load-bearing premise
A learned summary of new information extracted from the sharer KV cache can be translated by a small adapter into a form the receiver model can usefully incorporate without requiring identical context or losing critical details.
What would settle it
An experiment in which the receiver model, after receiving an LCF summary, shows no improvement on questions that require the new information the sharer held.
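One way to frame that experiment is as a paired comparison: the same questions are answered by the receiver with and without the LCF summary, and only the questions whose outcome changes are informative. The sketch below applies an exact two-sided sign test to those discordant pairs. The choice of test, the function name `paired_sign_test`, and the ten outcomes are illustrative assumptions; none of the data comes from the paper.

```python
from math import comb

def paired_sign_test(without_lcf, with_lcf):
    """Exact two-sided sign test on questions where the two runs disagree."""
    gained = sum(1 for a, b in zip(without_lcf, with_lcf) if not a and b)
    lost = sum(1 for a, b in zip(without_lcf, with_lcf) if a and not b)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0   # no discordant pairs, no evidence either way
    k = min(gained, lost)
    # Chance of a split at least this lopsided if gains and losses were equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return gained, lost, min(1.0, 2 * tail)

# Made-up outcomes for ten questions (True = answered correctly).
without_lcf = [False, False, True, False, False, True, False, False, False, True]
with_lcf    = [True,  True,  True, False, True,  True, True,  True,  False, True]

gained, lost, p_value = paired_sign_test(without_lcf, with_lcf)
print(gained, lost, p_value)  # 5 0 0.0625
```

A result with `gained` close to `lost` on questions that depend on the sharer's information would be the "no improvement" outcome described above.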
Original abstract
LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings; for different contexts, LCF is 23% more accurate and 8.5x faster than text-based communication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Latent Cache Flow (LCF) as an alternative to text-based or Cache-to-Cache (C2C) communication between LLMs. A small adapter, reported as about 4% of C2C's size, jointly compresses and translates KV-cache entries and transmits only a summary of new information, so it can be used when the two models' contexts are not identical. Early experiments are reported to show a 13 MB LCF adapter outperforming a 956 MB C2C adapter on accuracy in shared-context settings, and LCF delivering 23% higher accuracy and an 8.5x speedup over text-based communication when contexts differ.
Significance. If the performance claims are substantiated, LCF would offer a practical route to low-latency, low-loss inter-agent communication that scales to models with mismatched contexts, addressing a clear bottleneck in multi-LLM systems. The size reduction and context-robustness design choices are concrete engineering contributions that could be adopted independently of the specific accuracy numbers.
major comments (1)
- [Abstract] The central performance claims (13 MB LCF more accurate than 956 MB C2C; 23% accuracy gain and 8.5x speedup vs. text for differing contexts) rest entirely on “early experiments” for which no methodology, datasets, model pairs, baseline implementations, number of trials, error bars, or statistical tests are supplied. Without these details the empirical support for the design’s advantages cannot be evaluated.
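To make the request for trial counts and error bars concrete, the sketch below reports a mean and standard error across repeated runs. The function name `summarize` and the per-seed accuracies are placeholders invented for this example; they are not results from the paper, and the standard error is only one of several reasonable ways to report variability.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(accuracies):
    """Return the mean and the standard error of the mean across trials."""
    n = len(accuracies)
    m = mean(accuracies)
    sem = stdev(accuracies) / sqrt(n)   # sample standard deviation / sqrt(n)
    return m, sem

# Placeholder per-seed accuracies, NOT taken from the paper.
lcf_runs = [0.62, 0.60, 0.64, 0.61, 0.63]
text_runs = [0.50, 0.52, 0.49, 0.51, 0.53]

for name, runs in [("LCF", lcf_runs), ("text", text_runs)]:
    m, sem = summarize(runs)
    print(f"{name}: {m:.3f} +/- {sem:.3f} (n={len(runs)})")
```

Reporting in this form lets a reader judge whether a gap between two methods is large relative to run-to-run variation.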
Simulated Author's Rebuttal
We thank the referee for their careful review and for identifying the lack of experimental detail. We agree that the abstract's reference to 'early experiments' requires supporting methodology to allow evaluation of the claims, and we will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The central performance claims (13 MB LCF more accurate than 956 MB C2C; 23% accuracy gain and 8.5x speedup vs. text for differing contexts) rest entirely on “early experiments” for which no methodology, datasets, model pairs, baseline implementations, number of trials, error bars, or statistical tests are supplied. Without these details the empirical support for the design’s advantages cannot be evaluated.
Authors: We acknowledge that this comment is correct and that the current manuscript does not supply the requested details. The work is presented as preliminary, with the abstract summarizing early results. In the revised version we will add a dedicated Experiments section describing the full methodology, datasets and benchmarks, model pairs, baseline implementations (including how the 956 MB C2C adapter was reproduced), number of trials, error bars, and statistical tests. We will also update the abstract to reference this section.
Revision: yes
Circularity Check
No significant circularity identified
The manuscript presents LCF as an empirical design for KV-cache communication, motivated by observations on joint translation/compression and summary transmission for differing contexts. No equations, derivations, fitted parameters, or self-citations appear in the provided text that would reduce any claimed result to its inputs by construction. Accuracy and latency numbers are reported as experimental outcomes rather than predictions forced by parameter fitting or definitional equivalence. The central feasibility claim therefore depends on external benchmarks rather than on its own definitions.
Reference graph
Works this paper leans on
- [1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao.
- [2] Improving Factuality and Reasoning in Language Models through Multiagent Debate. International Conference on Machine Learning.
- [3] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, Chi Wang.
- [4] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou.
- [5] Efficiently Scaling Transformer Inference. Proceedings of Machine Learning and Systems.
- [6] Cache-to-Cache: Direct Semantic Communication Between Large Language Models. International Conference on Learning Representations.
- [7] arXiv preprint arXiv:2405.04434.
- [8] Transferring Linear Features Across Language Models With Model Stitching. Advances in Neural Information Processing Systems.
- [9] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning. 2018.
- [10] Latent Space Communication via K-V Cache Alignment. arXiv preprint arXiv:2601.06123.
- [11] OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. 2023.