pith. sign in

arxiv: 2605.25716 · v1 · pith:MIUJF44Bnew · submitted 2026-05-25 · 💻 cs.CR · cs.DC

An Efficient and Privacy-Preserving Architecture for Cross-Institutional Collaborative RAG

Pith reviewed 2026-06-29 21:50 UTC · model grok-4.3

classification 💻 cs.CR cs.DC
keywords federated RAGprivacy-preserving inferencedistributed attentionsecure LLM servingcross-institutional collaborationtransformer privacy
0
0 comments X

The pith

A scrambled attention protocol lets institutions run federated RAG together without exposing private data or retraining models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FedRAG to overcome the barrier that transformer self-attention creates for distributed inference across separate institutional data stores. Its core mechanism scrambles features and permutes tokens so that attention computations can be delegated between nodes while plaintext remains hidden. The system claims this works without cryptographic primitives, specialized hardware, or model changes, and evaluations show utility loss stays below 0.1 percent while latency drops by as much as 62 times versus prior secure methods. A reader would care because the approach removes a key technical obstacle to collaborative, domain-specific knowledge bases that must obey strict privacy rules.

Core claim

The Scrambled Distributed Attention protocol uses numerically stable feature scrambling and token permutation to dynamically delegate attention computations across nodes, thereby decoupling execution from data localization without exposing plaintext and while resisting intermediate state inversion attacks.

What carries the argument

Scrambled Distributed Attention protocol: it scrambles and permutes token features so that cross-node attention can occur without plaintext exposure or data movement.

If this is right

  • Model utility remains within 0.1 percent of the non-private baseline.
  • End-to-end latency falls by up to 62 times relative to existing secure baselines.
  • Practical throughput for human-scale reading is maintained across institutions.
  • No model retraining or specialized hardware is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scrambling approach might apply to other transformer layers that require cross-device communication.
  • Performance in real multi-institution networks with variable network latency remains to be measured.
  • The method could be combined with differential privacy techniques for stronger guarantees.

Load-bearing premise

Numerically stable feature scrambling combined with token permutation prevents reconstruction of original tokens from any intermediate states passed between institutions.

What would settle it

An experiment in which an attacker recovers original input tokens or KV cache contents from the scrambled delegated states would disprove the privacy claim.

Figures

Figures reproduced from arXiv: 2605.25716 by Chenxin Mao, Fan Wu, Guihai Chen, Jie Wu, Shangyu Liu, Zhenzhe Zheng.

Figure 1
Figure 1. Figure 1: System architecture of the federated RAG system. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Forward pass of a decoder layer and the distributed [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Security evaluation of the scrambling mechanism [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: End-to-end latency of RAG requests [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Decode throughput under different protected-layer counts and network settings. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Throughput vs. bandwidth. less, our throughput remains significantly higher than existing cryptographic solutions, which typically require several min￾utes to generate a single token [12]. The throughput for configurations employing two Compute Nodes is depicted in the second row of [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Impact of intermediate activation quantization on [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) empowers LLMs with external knowledge, making cross-institutional domain-specific knowledge base integration a highly promising deployment paradigm. Despite this potential, strict privacy regulations create severe "data silos" that obstruct such collaboration. Building federated RAG systems requires distributed inference, but the Transformer's self-attention mechanism fundamentally conflicts with this by mandating cross-node access to distributed Key-Value caches. To address this challenge, we present FedRAG, a high-throughput, privacy-preserving federated RAG framework. At its core is a novel Scrambled Distributed Attention protocol that utilizes numerically stable feature scrambling and token permutation. By dynamically delegating scrambled computations to collaborating nodes, our system successfully decouples attention execution from data localization without exposing plaintext. Crucially, our approach requires no specialized hardware or model retraining, circumventing the prohibitive latency and communication overheads of cryptographic solutions while robustly defending against intermediate state inversion attacks. Extensive evaluations demonstrate our framework preserves negligible (<0.1\%) model utility degradation and achieves up to a 62$\times$ latency reduction over existing secure baselines, sustaining practical, human-reading throughput for cross-institutional knowledge synergy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes FedRAG, a federated RAG framework for cross-institutional collaboration under privacy constraints. Its core contribution is the Scrambled Distributed Attention protocol, which applies numerically stable feature scrambling and token permutation to delegate computations across nodes. This is claimed to decouple attention from data localization without exposing plaintext, require no specialized hardware or retraining, defend against intermediate state inversion attacks, and deliver <0.1% utility degradation with up to 62× latency reduction versus secure baselines.

Significance. If the privacy guarantees and performance claims hold, the work would address a practical barrier to collaborative RAG by avoiding the overhead of cryptographic primitives while sustaining usable throughput. The absence of model retraining or specialized hardware is a notable strength relative to many secure-inference baselines.

major comments (1)
  1. [Abstract / §3] Abstract and §3 (Scrambled Distributed Attention protocol): the central privacy claim—that numerically stable feature scrambling plus token permutation 'robustly defends against intermediate state inversion attacks' without exposing plaintext—is asserted without a formal security model, game-based definition, reduction to a hard problem, or any attack experiments (e.g., reconstruction or gradient-inversion attempts on scrambled KV caches). This is load-bearing for the entire privacy premise and the subsequent utility/latency results.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on the privacy analysis of FedRAG. We address the major comment below and commit to revisions that strengthen the presentation without altering the core technical claims or results.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (Scrambled Distributed Attention protocol): the central privacy claim—that numerically stable feature scrambling plus token permutation 'robustly defends against intermediate state inversion attacks' without exposing plaintext—is asserted without a formal security model, game-based definition, reduction to a hard problem, or any attack experiments (e.g., reconstruction or gradient-inversion attempts on scrambled KV caches). This is load-bearing for the entire privacy premise and the subsequent utility/latency results.

    Authors: We agree that the manuscript would be strengthened by a more explicit treatment of the privacy argument. The current version motivates the defense through the protocol design: feature scrambling is constructed to be numerically stable (preserving the exact attention computation) while rendering intermediate states non-invertible without the secret parameters, and token permutation breaks positional correspondence. These choices are intended to prevent plaintext exposure and thwart inversion without requiring cryptographic primitives. We acknowledge the absence of a formal game-based model or reduction in the submitted version. In revision we will expand §3 with an informal security argument detailing why the combination of scrambling and permutation renders reconstruction computationally impractical under standard assumptions on the adversary's knowledge, and we will add targeted experiments in the evaluation section that attempt reconstruction and gradient-inversion attacks on the scrambled KV caches. These additions will be empirical and explanatory rather than a full cryptographic proof, consistent with the paper's focus on practical systems performance. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical evaluation rather than self-referential definitions

full rationale

The provided abstract and context contain no equations, fitted parameters, or derivation steps that reduce performance or security claims to their own inputs by construction. The Scrambled Distributed Attention protocol is introduced as a novel mechanism whose privacy and utility properties are asserted via description and external evaluation, without self-definitional loops, renamed known results, or load-bearing self-citations that collapse the argument. Absence of mathematical derivations in the visible text precludes any of the enumerated circularity patterns. This is the expected outcome for a systems paper whose central contributions are architectural and empirical.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the effectiveness of a newly introduced protocol whose security and performance properties are asserted but not derived from prior literature in the provided abstract.

axioms (1)
  • domain assumption The Transformer's self-attention mechanism fundamentally conflicts with distributed inference by mandating cross-node access to distributed Key-Value caches.
    Stated directly in the abstract as the core technical obstacle.
invented entities (1)
  • Scrambled Distributed Attention protocol no independent evidence
    purpose: Decouples attention execution from data localization using feature scrambling and token permutation while defending against intermediate state inversion attacks.
    New protocol introduced to solve the stated privacy and performance problem.

pith-pipeline@v0.9.1-grok · 5748 in / 1214 out tokens · 28073 ms · 2026-06-29T21:50:47.709077+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration

    cs.MA 2026-06 conditional novelty 7.0

    KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.

Reference graph

Works this paper leans on

62 extracted references · 16 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt- oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    Self-rag: Learning to retrieve, gen- erate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, gen- erate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

  3. [3]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, An- drew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehen- sion dataset.arXiv preprint arXiv:1611.09268, 2016

  4. [4]

    Improving language mod- els by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language mod- els by retrieving from trillions of tokens. InInterna- tional conference on machine learning, pages 2206–

  5. [5]

    M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    Homomorphic encryption for arithmetic of ap- proximate numbers

    Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Homomorphic encryption for arithmetic of ap- proximate numbers. InInternational conference on the theory and application of cryptology and information security, pages 409–437. Springer, 2017

  8. [8]

    Thirty years of graph matching in pattern recognition.International journal of pattern recognition and artificial intelligence, 18(03):265–298, 2004

    Donatello Conte, Pasquale Foggia, Carlo Sansone, and Mario Vento. Thirty years of graph matching in pattern recognition.International journal of pattern recognition and artificial intelligence, 18(03):265–298, 2004. 13

  9. [9]

    Improved achiev- ability and converse bounds for erdos-rényi graph match- ing.ACM SIGMETRICS performance evaluation review, 44(1):63–72, 2016

    Daniel Cullina and Negar Kiyavash. Improved achiev- ability and converse bounds for erdos-rényi graph match- ing.ACM SIGMETRICS performance evaluation review, 44(1):63–72, 2016

  10. [10]

    Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

  11. [11]

    Depth gives a false sense of privacy:{LLM} internal states inversion

    Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, and Haojin Zhu. Depth gives a false sense of privacy:{LLM} internal states inversion. In34th USENIX Security Symposium (USENIX Security 25), pages 1629–1648, 2025

  12. [12]

    Puma: Secure in- ference of llama-7b in five minutes.Security and Safety, 4:2025014, 2025

    Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wen-Guang Chen, et al. Puma: Secure in- ference of llama-7b in five minutes.Security and Safety, 4:2025014, 2025

  13. [13]

    Dp-forward: Fine- tuning and inference on language models with differen- tial privacy in forward pass

    Minxin Du, Xiang Yue, Sherman SM Chow, Tianhao Wang, Chenyu Huang, and Huan Sun. Dp-forward: Fine- tuning and inference on language models with differen- tial privacy in forward pass. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communi- cations Security, pages 2665–2679, 2023

  14. [14]

    Unsplit: Data-oblivious model inversion, model stealing, and label inference attacks against split learning

    Ege Erdo˘gan, Alptekin Küpçü, and A Ercüment Çiçek. Unsplit: Data-oblivious model inversion, model stealing, and label inference attacks against split learning. In Proceedings of the 21st Workshop on Privacy in the Electronic Society, pages 115–124, 2022

  15. [15]

    How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings

    Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 55–65, 2019

  16. [16]

    Simcse: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 6894– 6910, 2021

  17. [17]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  19. [19]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on ma- chine learning, pages 3929–3938. PMLR, 2020

  20. [20]

    Iron: Private infer- ence on transformers.Advances in neural information processing systems, 35:15718–15731, 2022

    Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. Iron: Private infer- ence on transformers.Advances in neural information processing systems, 35:15718–15731, 2022

  21. [21]

    Independent component analysis: algorithms and applications.Neural networks, 13(4-5):411–430, 2000

    Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications.Neural networks, 13(4-5):411–430, 2000

  22. [22]

    Leveraging pas- sage retrieval with generative models for open domain question answering

    Gautier Izacard and Edouard Grave. Leveraging pas- sage retrieval with generative models for open domain question answering. InProceedings of the 16th con- ference of the european chapter of the association for computational linguistics: main volume, pages 874–880, 2021

  23. [23]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704– 2713, 2018

  24. [24]

    Billion- scale similarity search with gpus.IEEE transactions on big data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion- scale similarity search with gpus.IEEE transactions on big data, 7(3):535–547, 2019

  25. [25]

    Generalization through Memorization: Nearest Neighbor Language Models.https://arxiv.org/abs/1911.00172, 2019

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172, 2019

  26. [26]

    Crypten: Secure multi-party computation meets machine learning.Advances in Neural Information Pro- cessing Systems, 34:4961–4973, 2021

    Brian Knott, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. Crypten: Secure multi-party computation meets machine learning.Advances in Neural Information Pro- cessing Systems, 34:4961–4973, 2021

  27. [27]

    PhD thesis, UC Berkeley, 2025

    Woosuk Kwon.vLLM: An Efficient Inference Engine for Large Language Models. PhD thesis, UC Berkeley, 2025

  28. [28]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023. 14

  29. [29]

    Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  30. [30]

    Collaborative inference and learning between edge slms and cloud llms: A survey of algo- rithms, execution, and open challenges.arXiv preprint arXiv:2507.16731, 2025

    Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, and Ruixuan Li. Collaborative inference and learning between edge slms and cloud llms: A survey of algo- rithms, execution, and open challenges.arXiv preprint arXiv:2507.16731, 2025

  31. [31]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subra- manian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexan- dre Gavaudan, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

  32. [32]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

  33. [33]

    Shadow in the cache: Unveiling and mitigating pri- vacy risks of kv-cache in llm inference.arXiv preprint arXiv:2508.09442, 2025

    Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, and Zhan Qin. Shadow in the cache: Unveiling and mitigating pri- vacy risks of kv-cache in llm inference.arXiv preprint arXiv:2508.09442, 2025

  34. [34]

    Ambigqa: Answering ambiguous open-domain questions

    Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 5783–5797, 2020

  35. [35]

    Text embeddings reveal (al- most) as much as text

    John Morris, V olodymyr Kuleshov, Vitaly Shmatikov, and Alexander M Rush. Text embeddings reveal (al- most) as much as text. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12448–12460, 2023

  36. [36]

    Training language models to follow instructions with human feedback.Advances in neural information pro- cessing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information pro- cessing systems, 35:27730–27744, 2022

  37. [37]

    Public-key cryptosystems based on com- posite degree residuosity classes

    Pascal Paillier. Public-key cryptosystems based on com- posite degree residuosity classes. InInternational con- ference on the theory and applications of cryptographic techniques, pages 223–238. Springer, 1999

  38. [38]

    Unifying large language mod- els and knowledge graphs: A roadmap.IEEE Transac- tions on Knowledge and Data Engineering, 36(7):3580– 3599, 2024

    Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. Unifying large language mod- els and knowledge graphs: A roadmap.IEEE Transac- tions on Knowledge and Data Engineering, 36(7):3580– 3599, 2024

  39. [39]

    Bolt: Privacy-preserving, accu- rate and efficient inference for transformers

    Qi Pang, Jinhao Zhu, Helen Möllering, Wenting Zheng, and Thomas Schneider. Bolt: Privacy-preserving, accu- rate and efficient inference for transformers. In2024 IEEE Symposium on Security and Privacy (SP), pages 4753–4771. IEEE, 2024

  40. [40]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 2383–2392, 2016

  41. [41]

    Trusted execution environment: What it is, and what it is not

    Mohamed Sabt, Mohammed Achemlal, and Abdelmad- jid Bouabdallah. Trusted execution environment: What it is, and what it is not. In2015 IEEE Trust- com/BigDataSE/Ispa, volume 1, pages 57–64. IEEE, 2015

  42. [42]

    Large language models encode clinical knowledge.Na- ture, 620(7972):172–180, 2023

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mah- davi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Na- ture, 620(7972):172–180, 2023

  43. [43]

    On the limitations of unsupervised bilingual dictionary in- duction

    Anders Søgaard, Sebastian Ruder, and Ivan Vuli´c. On the limitations of unsupervised bilingual dictionary in- duction. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788, 2018

  44. [44]

    Hidden no more: Attacking and defending private third-party llm inference

    Rahul Krishna Thomas, Louai Zahran, Erica Choi, Ak- ilesh Potti, Micah Goldblum, and Arka Pal. Hidden no more: Attacking and defending private third-party llm inference. InForty-second International Conference on Machine Learning, 2025

  45. [45]

    Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539– 554, 2022

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539– 554, 2022

  46. [46]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  47. [47]

    Milvus: A purpose- built vector data management system

    Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, et al. Milvus: A purpose- built vector data management system. InProceedings of the 2021 international conference on management of data, pages 2614–2627, 2021. 15

  48. [48]

    A survey of model architectures in information retrieval

    Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang, Jimmy Lin, and Vivek Srikumar. A survey of model architectures in information retrieval. arXiv preprint arXiv:2502.14822, 2025

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chen- gen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  50. [50]

    Parallelizing linear transformers with the delta rule over sequence length.Advances in neu- ral information processing systems, 37:115491–115522, 2024

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length.Advances in neu- ral information processing systems, 37:115491–115522, 2024

  51. [51]

    Hotpotqa: A dataset for diverse, ex- plainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- gio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D Manning. Hotpotqa: A dataset for diverse, ex- plainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

  52. [52]

    Pre- trained transformers for text ranking: Bert and beyond

    Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. Pre- trained transformers for text ranking: Bert and beyond. InProceedings of the 14th ACM International Confer- ence on web search and data mining, pages 1154–1156, 2021

  53. [53]

    Scx: Stateless kv-cache encoding for cloud-scale confidential trans- former serving

    Mu Yuan, Lan Zhang, Liekang Zeng, Siyang Jiang, Bu- fang Yang, Di Duan, and Guoliang Xing. Scx: Stateless kv-cache encoding for cloud-scale confidential trans- former serving. InProceedings of the ACM SIGCOMM 2025 Conference, pages 39–54, 2025

  54. [54]

    Pri- vacyrestore: Privacy-preserving inference in large lan- guage models via privacy removal and restoration

    Ziqian Zeng, Jianwei Wang, Junyao Yang, Zhengdong Lu, Haoran Li, Huiping Zhuang, and Cen Chen. Pri- vacyrestore: Privacy-preserving inference in large lan- guage models via privacy removal and restoration. In Proceedings of the 63rd Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), pages 10821–10855, 2025

  55. [55]

    mgte: Generalized long- context text representation and reranking models for multilingual text retrieval

    Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, et al. mgte: Generalized long- context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, 2024

  56. [56]

    Siren’s song in the ai ocean: A survey on hallucination in large language models.Computa- tional Linguistics, 51(4):1373–1418, 2025

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yu- long Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models.Computa- tional Linguistics, 51(4):1373–1418, 2025

  57. [57]

    Retrieval-augmented generation for ai-generated content: A survey.Data Science and Engineering, pages 1–29, 2026

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wen- tao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.Data Science and Engineering, pages 1–29, 2026

  58. [58]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xi- aolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large lan- guage models.arXiv preprint arXiv:2303.18223, 1(2):1– 124, 2023

  59. [59]

    Permllm: Private inference of large lan- guage models within 3 seconds under wan.arXiv preprint arXiv:2405.18744, 2024

    Fei Zheng, Chaochao Chen, Zhongxuan Han, and Xi- aolin Zheng. Permllm: Private inference of large lan- guage models within 3 seconds under wan.arXiv preprint arXiv:2405.18744, 2024

  60. [60]

    A review on edge large language models: Design, execution, and applications

    Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuan- chao Shu, and Jiming Chen. A review on edge large language models: Design, execution, and applications. ACM Computing Surveys, 57(8):1–35, 2025

  61. [61]

    Qmsum: A new bench- mark for query-based multi-domain meeting summa- rization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new bench- mark for query-based multi-domain meeting summa- rization. InProceedings of the 2021 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, ...

  62. [62]

    Training lan- guage models with memory augmentation

    Zexuan Zhong, Tao Lei, and Danqi Chen. Training lan- guage models with memory augmentation. InProceed- ings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5657–5673, 2022. 16 A Numerical Stability Evaluation We evaluate the numerical stability of using a random dense matrix for feature scrambling. Table 4 presents the re...