LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Hanxi Fang; Haocheng Xia; Mihir Pamnani; Supawit Chockchowwat; Yongjoo Park

arxiv: 2606.04302 · v1 · pith:PPJOBWQFnew · submitted 2026-06-03 · 💻 cs.CL · cs.LG


Pith reviewed 2026-06-28 07:02 UTC · model grok-4.3


keywords: KV caching · positional encoding · retrieval-augmented generation · efficient LLM inference · attention mechanism · zero-copy reuse · deferred computation


The pith

LazyAttention defers positional encoding inside attention kernels to enable zero-copy KV cache reuse at arbitrary positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that conventional KV caches embed positional information, which prevents reuse across different positions in long-context tasks like RAG without either restricting reuse or copying memory. LazyAttention kernelizes deferred positional encoding so that positions are adjusted on-the-fly during attention computation. This lets one physical KV cache support multiple logical requests at any positions without materialization. The resulting system reports 1.37 times lower time-to-first-token and 1.40 times higher throughput than Block-Attention under skewed document loads while preserving output quality. Readers would care because KV materialization has been a direct barrier to scaling retrieval and in-context learning in production LLMs.

Core claim

We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements under skewed document distributions while maintaining comparable output quality.

What carries the argument

On-the-fly positional adjustment inside attention kernels that defers encoding until query time

If this is right

A single physical KV copy can serve multiple logical requests at arbitrary positions without duplication.
TTFT drops by 1.37 times under skewed document distributions compared with Block-Attention.
Inference throughput rises by 1.40 times while output quality stays comparable.
Separate kernels for prefilling and decoding become practical because position handling is no longer tied to cache storage.
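
To make the two headline ratios concrete, the short calculation below converts them into percentages. It assumes "1.37 times lower" means the ratio of Block-Attention TTFT to LazyAttention TTFT, which is one reading of the abstract's phrasing rather than something stated explicitly; the symbols $T$ (TTFT) and $\Theta$ (throughput) are introduced only for this illustration.

$$
\frac{T_{\text{Block}}}{T_{\text{Lazy}}} = 1.37
\;\Rightarrow\;
T_{\text{Lazy}} = \frac{T_{\text{Block}}}{1.37} \approx 0.730\,T_{\text{Block}}
\quad (\text{about } 27\% \text{ lower TTFT})
$$

$$
\frac{\Theta_{\text{Lazy}}}{\Theta_{\text{Block}}} = 1.40
\;\Rightarrow\;
\Theta_{\text{Lazy}} = 1.40\,\Theta_{\text{Block}}
\quad (40\% \text{ higher throughput})
$$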

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same deferred-encoding pattern could be applied to other transformer layers that currently bake positional or segment information into stored activations.
Multi-tenant serving systems could maintain far fewer distinct KV buffers when users request overlapping retrieved contexts at different offsets.
If the kernel overhead remains low, the technique may extend to streaming or continuously updated retrieval settings where document order changes frequently.
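
The multi-tenant point can be illustrated with a toy accounting model. The sketch below counts how many distinct KV buffers a server would hold if each cached copy were tied to its starting offset, versus one position-agnostic buffer per document. The function `count_kv_buffers`, the document names, and the lengths are all hypothetical; real systems (including Block-Attention) may re-encode rather than store one copy per offset, so this is only an upper-bound style illustration.

```python
def count_kv_buffers(requests, doc_len):
    """Return (position_tied, position_agnostic) counts of distinct KV buffers.

    requests: list of requests, each an ordered list of document ids.
    doc_len:  mapping from document id to its length in tokens.
    """
    position_tied = set()      # one buffer per (document, starting offset)
    position_agnostic = set()  # one buffer per document, whatever its offset
    for docs in requests:
        offset = 0
        for doc in docs:
            position_tied.add((doc, offset))
            position_agnostic.add(doc)
            offset += doc_len[doc]
    return len(position_tied), len(position_agnostic)


# Hypothetical workload: three documents retrieved in different orders.
doc_len = {"a": 100, "b": 250, "c": 80}
requests = [["a", "b"], ["b", "a"], ["c", "a", "b"], ["a", "c"]]
tied, agnostic = count_kv_buffers(requests, doc_len)
print(tied, agnostic)  # prints: 8 3
```

In this small workload the same three documents land at eight distinct (document, offset) pairs, which is the duplication that position-agnostic reuse is meant to avoid.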

Load-bearing premise

On-the-fly positional adjustment inside attention kernels adds negligible compute and leaves attention accuracy and numerical stability unchanged.

What would settle it

Run the same attention operation on identical KV pairs once with standard pre-materialized positional encodings and once with LazyAttention's on-the-fly adjustment, then compare both wall-clock time and output distribution divergence.
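
A minimal harness for this comparison might look like the sketch below. It is a NumPy stand-in, not the paper's fused GPU kernel: `rope`, `log_softmax`, and `compare_eager_vs_deferred` are hypothetical helpers, standard RoPE with base 10000 is assumed, and the timings it returns say nothing about real kernel performance. It only shows the shape of the experiment: the same keys, once pre-materialized at their target positions and once shifted at attention time, followed by a timing and a KL-divergence comparison.

```python
import time
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to rows of x (shape [n, d], d even)."""
    x = np.asarray(x, dtype=np.float64)
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def log_softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def compare_eager_vs_deferred(n_keys=512, d=64, offset=1000, seed=0):
    rng = np.random.default_rng(seed)
    k_raw = rng.standard_normal((n_keys, d))
    v = rng.standard_normal((n_keys, d))
    q = rope(rng.standard_normal((1, d)), [offset + n_keys])
    scale = 1.0 / np.sqrt(d)

    # Baseline: keys already materialized at their target positions.
    k_eager = rope(k_raw, offset + np.arange(n_keys))
    t0 = time.perf_counter()
    logp_eager = log_softmax(q @ k_eager.T * scale)
    out_eager = np.exp(logp_eager) @ v
    t_eager = time.perf_counter() - t0

    # Deferred: one cache indexed from 0, shifted by `offset` at attention time.
    k_cache = rope(k_raw, np.arange(n_keys))
    t0 = time.perf_counter()
    k_shifted = rope(k_cache, np.full(n_keys, offset))
    logp_lazy = log_softmax(q @ k_shifted.T * scale)
    out_lazy = np.exp(logp_lazy) @ v
    t_lazy = time.perf_counter() - t0

    kl = float(np.sum(np.exp(logp_eager) * (logp_eager - logp_lazy)))
    return {
        "eager_seconds": t_eager,
        "deferred_seconds": t_lazy,
        "kl_divergence": kl,
        "max_abs_output_diff": float(np.max(np.abs(out_eager - out_lazy))),
    }
```

A fair version of the test on real hardware would also charge the baseline for materializing `k_eager` whenever a document recurs at a new offset, since that copy is the cost the paper says it removes.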

Figures

Figures reproduced from arXiv: 2606.04302 by Hanxi Fang, Haocheng Xia, Mihir Pamnani, Supawit Chockchowwat, Yongjoo Park.

**Figure 1.** Comparison of positional encoding strategies within a Transformer block. Existing methods, even the state-of-the-art Block-Attention for KV cache reuse, eagerly apply positions before caching, making KV caches position-dependent and non-shareable. LazyAttention applies positions lazily inside the fused attention kernel, enabling reuse and sharing without duplication.

**Figure 2.** Illustration of Example 3.1. Example 3.1: Suppose a query Q encodes its positional index relative to the document length. Consider two documents, d1 and d2, with independently generated KV caches C1 and C2, both indexed from position 0. To reuse C2 for Q immediately following the processing of d1, we rotate Q backward by |d1| before the attention computation. This adjustment ensures positional consistency…
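
The positional bookkeeping in Example 3.1 can be checked numerically. The sketch below assumes standard RoPE (base 10000, interleaved feature pairs) and covers only the positional part of the example: it ignores any dependence of d2's keys on d1 through earlier layers. The helper names `rope` and `example_3_1`, and the toy sizes, are hypothetical.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to rows of x (shape [n, d], d even)."""
    x = np.asarray(x, dtype=np.float64)
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def example_3_1(len_d1=5, len_d2=7, d=8, seed=0):
    rng = np.random.default_rng(seed)
    k2_raw = rng.standard_normal((len_d2, d))  # unrotated keys of d2
    q_raw = rng.standard_normal((1, d))
    q_pos = len_d1 + len_d2                    # query follows d1, then d2

    # Reference: d2's keys encoded at their true positions after d1.
    k2_true = rope(k2_raw, len_d1 + np.arange(len_d2))
    q_rot = rope(q_raw, [q_pos])
    reference = q_rot @ k2_true.T

    # Reuse: C2 was built independently, indexed from position 0.
    c2 = rope(k2_raw, np.arange(len_d2))
    q_back = rope(q_rot, [-len_d1])            # rotate Q backward by |d1|
    reused = q_back @ c2.T
    return reference, reused


reference, reused = example_3_1()
print(np.allclose(reference, reused))
```

The two score vectors agree because RoPE attention scores depend only on the difference between query and key positions, so moving the query back by |d1| has the same effect as moving every key in C2 forward by |d1|.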

**Figure 3.** Comparison of Rotary Embedding (R) placement in tiled attention. (a) Naive deferred encoding: Rotates keys twice (resetting to zero, then shifting to target) before attention (A), incurring high computational overhead. (b) LazyAttention (prefilling): Rotates only the keys inside the attention kernel on the fly, allowing Keys/Values to be read as-is to avoid materialization overhead (K1/K2 denote the RoPE h…
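
The "rotates keys twice" cost in panel (a) can be related to a basic property of rotations: resetting to zero and then shifting to the target is equivalent to one rotation by the net offset. The sketch below demonstrates that property for standard RoPE (base 10000 assumed). It is not the paper's kernel, and the function names are hypothetical.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to rows of x (shape [n, d], d even)."""
    x = np.asarray(x, dtype=np.float64)
    half = x.shape[1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def naive_double_rotation(k_cached, old_pos, new_pos):
    """Two passes over the keys: undo the cached position, then apply the target."""
    k_zero = rope(k_cached, -np.asarray(old_pos))  # rotation 1: reset to position 0
    return rope(k_zero, new_pos)                   # rotation 2: shift to target


def single_delta_rotation(k_cached, old_pos, new_pos):
    """One pass over the keys: rotate by the net offset only."""
    return rope(k_cached, np.asarray(new_pos) - np.asarray(old_pos))


rng = np.random.default_rng(1)
k_raw = rng.standard_normal((6, 8))
old_pos = np.arange(6)        # positions baked into the cache
new_pos = 40 + np.arange(6)   # positions needed by the current request
k_cached = rope(k_raw, old_pos)
print(np.allclose(naive_double_rotation(k_cached, old_pos, new_pos),
                  single_delta_rotation(k_cached, old_pos, new_pos)))
```

Both paths give the same keys, so the second pass in the naive scheme is redundant work rather than a requirement of the encoding.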

**Figure 4.** TTFT vs. request rate under different document distributions. We vary the request rate (req/s) and plot time-to-first-token (TTFT). (a) Uniform: few documents recur across requests (low reuse). (b) Skewed: a small set of documents is frequently reused (high reuse).

**Figure 5.** Overhead analysis. (a) Breakdown of overhead across phases: ① document processing, ② query prefilling, and ③ decoding. Deferred rotation adds only 0.13% overhead in decoding. (b) Accumulated latency with and without deferred rotation, where the dominant gain comes from document processing rather than token generation.

Original abstract

Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LazyAttention kernelizes deferred positional encoding for zero-copy KV reuse across positions, but the speedups hinge on whether the on-the-fly adjustment stays cheap.


The core idea is to move positional encoding inside the attention kernel so one physical KV tensor can serve multiple logical positions without materialization or prefix restrictions.

The authors improve on Block-Attention by avoiding the copy step when the same documents appear at different offsets. The reported numbers are 1.37× lower TTFT and 1.40× higher throughput under skewed distributions, with output quality staying comparable.

The approach is practical for RAG and ICL serving where cache reuse is the bottleneck. The mechanism itself is a clear step past the two options the abstract cites.

The main uncertainty is the cost of the on-the-fly adjustment. If the kernel rotations or re-indexing add non-trivial FLOPs or affect numerical stability, the net gain shrinks. The abstract gives no kernel sketch, no ablation on that overhead, and no error bars, so the numbers need verification against the actual implementation.

This is for people who run long-context inference at scale. A reader tuning serving systems would get concrete value from trying the kernel if the extra compute is low.

It deserves peer review because the problem is real and the proposed fix is specific, even though the evidence is still thin on the cost side. Send it and request the kernel details plus a direct comparison of adjustment overhead.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding (e.g., via on-the-fly RoPE-style rotations) to enable zero-copy, position-agnostic reuse of a single physical KV cache across multiple logical requests at arbitrary positions. This is claimed to resolve the materialization bottleneck in conventional KV caching for RAG and ICL. The system reports 1.37× lower TTFT and 1.40× higher inference throughput versus Block-Attention under skewed document distributions, with comparable output quality, using tailored kernels for prefilling and decoding.

Significance. If the on-the-fly adjustment proves to incur negligible extra compute while preserving attention scores and numerical stability, the result would meaningfully advance efficient long-context inference by increasing KV-cache reusability without additional memory materialization. The empirical speedups, if reproducible with full experimental details, would represent a practical improvement over existing prefix-restricted or materialization-heavy baselines.

major comments (2)

[Abstract and §4 (Experiments)] The reported 1.37× TTFT and 1.40× throughput gains are presented without baseline implementation details, error bars, hardware specifications, or an ablation isolating the cost of on-the-fly positional adjustment versus pre-materialized encoding. This information is load-bearing for verifying whether the claimed gains over Block-Attention are supported or whether hidden kernel overheads offset them.
[§3 (LazyAttention Mechanism)] The central claim that deferred positional encoding can be applied inside fused attention kernels at query time for arbitrary logical positions while adding only negligible FLOPs and preserving scores to within floating-point tolerance lacks a derivation, kernel pseudocode, or complexity analysis. Without this, it is impossible to confirm that the adjustment does not require per-token rotations or extra matrix multiplies that would shrink the reported efficiency gains.

minor comments (2)

[§3] Notation for the deferred encoding operator and its integration with existing attention kernels (e.g., FlashAttention) should be defined more explicitly to aid reproducibility.
[§3] The manuscript should include a clear statement of the exact RoPE (or equivalent) rotation formula used for on-the-fly adjustment and any assumptions about rotary embedding dimensions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the reproducibility and clarity of the manuscript.


Referee: [Abstract and §4 (Experiments)] The reported 1.37× TTFT and 1.40× throughput gains are presented without baseline implementation details, error bars, hardware specifications, or an ablation isolating the cost of on-the-fly positional adjustment versus pre-materialized encoding. This information is load-bearing for verifying whether the claimed gains over Block-Attention are supported or whether hidden kernel overheads offset them.

Authors: We agree these details are required for verification. In the revised manuscript we will expand §4 with hardware specifications (NVIDIA H100 GPUs), error bars from five independent runs, full baseline implementation details for Block-Attention, and a dedicated ablation that isolates the incremental FLOPs and latency of the on-the-fly positional adjustment versus pre-materialized RoPE. This will confirm the reported speedups are not offset by hidden overhead.

Revision: yes

Referee: [§3 (LazyAttention Mechanism)] The central claim that deferred positional encoding can be applied inside fused attention kernels at query time for arbitrary logical positions while adding only negligible FLOPs and preserving scores to within floating-point tolerance lacks a derivation, kernel pseudocode, or complexity analysis. Without this, it is impossible to confirm that the adjustment does not require per-token rotations or extra matrix multiplies that would shrink the reported efficiency gains.

Authors: We will revise §3 to include a concise derivation demonstrating equivalence to standard RoPE, kernel pseudocode for the fused prefilling and decoding kernels, and a complexity analysis showing the adjustment requires only O(d) additional operations per token (d = head dimension) with no extra matrix multiplies. The analysis will also confirm numerical stability within floating-point tolerance.

Revision: yes
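
For orientation, the identity such a derivation would most plausibly rest on is the composition property of rotary embeddings. The sketch below uses textbook RoPE notation: $R_m$ is the block-diagonal rotation for position $m$, and $n_0$ is the position baked into a cached key. These symbols are assumptions for illustration and are not taken from the paper.

$$
R_m^{\top} R_n = R_{n-m}
\quad\Rightarrow\quad
(R_m q)^{\top} (R_n k) = q^{\top} R_{n-m}\, k
$$

$$
R_n k = R_{n-n_0}\,\bigl(R_{n_0} k\bigr)
$$

A key cached at position $n_0$ can therefore be moved to position $n$ with a single rotation through $n - n_0$. Each rotation acts on $d/2$ independent two-dimensional pairs, at 4 multiplications and 2 additions per pair once the sines and cosines are available, so roughly $3d$ floating-point operations per vector and no matrix multiply. That count is consistent with the O(d) figure quoted in the response above, though whether it stays negligible inside a fused kernel is the empirical question the referee raises.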

Circularity Check

0 steps flagged

No circularity: empirical claims rest on kernel mechanism, not self-referential definitions or fitted predictions.


The paper presents LazyAttention as an engineering innovation that kernelizes deferred positional encoding for zero-copy KV reuse. The abstract and description frame the TTFT/throughput gains as outcomes of on-the-fly adjustment inside attention kernels, without equations, parameter fits to data subsets, or self-citations that reduce the central claim to a tautology or renamed input. No load-bearing step equates a 'prediction' to its own construction; the derivation chain is self-contained as a proposed systems technique evaluated empirically against Block-Attention.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete. No explicit free parameters, mathematical axioms, or independently evidenced invented entities are identifiable from the given text. The LazyAttention mechanism itself is the central new construct introduced by the paper.

invented entities (1)

LazyAttention mechanism (no independent evidence)
Purpose: kernelized deferred positional encoding for position-agnostic KV reuse.
The paper introduces this as the novel attention mechanism that enables the claimed zero-copy reuse.


Reference graph

Works this paper leans on

109 extracted references · 29 canonical work pages · 5 internal anchors

[1]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

Large-scale Evaluation of Notebook Checkpointing with AI Agents , author=. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=
[2]

arXiv preprint arXiv:2507.05403 , year=

PBE Meets LLM: When Few Examples Aren't Few-Shot Enough , author=. arXiv preprint arXiv:2507.05403 , year=

arXiv
[3]

2025 , url=

Junhao Hu and Wenrui Huang and Weidong Wang and Haoyi Wang and tiancheng hu and zhang qin and Hao Feng and Xusheng Chen and Yizhou Shan and Tao Xie , booktitle=. 2025 , url=

2025
[4]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[5]

T urbo RAG : Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

Lu, Songshuo and Wang, Hua and Rong, Yutian and Chen, Zhi and Tang, Yaohua. T urbo RAG : Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.334

work page doi:10.18653/v1/2025.emnlp-main.334 2025
[6]

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic

Wonbeom Lee and Jungi Lee and Junghwan Seo and Jaewoong Sim , editor =. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic. 18th. 2024 , url =

2024
[7]

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention , booktitle =

Jingyang Yuan and Huazuo Gao and Damai Dai and Junyu Luo and Liang Zhao and Zhengyan Zhang and Zhenda Xie and Yuxing Wei and Lean Wang and Zhiping Xiao and Yuqing Wang and Chong Ruan and Ming Zhang and Wenfeng Liang and Wangding Zeng , editor =. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention , booktitle =. 2025 , url =

2025
[8]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[9]

Advances in Neural Information Processing Systems , volume=

Minicache: Kv cache compression in depth dimension for large language models , author=. Advances in Neural Information Processing Systems , volume=
[10]

Model Tells You What to Discard: Adaptive

Suyu Ge and Yunan Zhang and Liyuan Liu and Minjia Zhang and Jiawei Han and Jianfeng Gao , booktitle=. Model Tells You What to Discard: Adaptive. 2024 , url=

2024
[11]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv e-prints , keywords =. doi:10.48550/arXiv.2403.05530 , archivePrefix =. 2403.05530 , primaryClass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.05530
[12]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019
[13]

16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages=

Orca: A distributed serving system for \ Transformer-Based \ generative models , author=. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) , pages=
[14]

Forty-first International Conference on Machine Learning,

Thomas Merth and Qichen Fu and Mohammad Rastegari and Mahyar Najibi , title =. Forty-first International Conference on Machine Learning,. 2024 , url =

2024
[15]

Prompt Cache: Modular Attention Reuse for Low-Latency Inference , booktitle =

In Gim and Guojun Chen and Seung. Prompt Cache: Modular Attention Reuse for Low-Latency Inference , booktitle =. 2024 , url =

2024
[16]

Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys 2025, Rotterdam, March 30 - April 3, 2025 , publisher =

Jiayi Yao and Hanchen Li and Yuhan Liu and Siddhant Ray and Yihua Cheng and Qizheng Zhang and Kuntai Du and Shan Lu and Junchen Jiang , title =. Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys 2025, Rotterdam, March 30 - April 3, 2025 , publisher =. 2025 , timestamp =

2025
[17]

GPT-4 Technical Report

OpenAI , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2303.08774 , eprinttype =. 2303.08774 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[18]

CoRR , volume =

Chao Jin and Zili Zhang and Xuanlin Jiang and Fangyue Liu and Xin Liu and Xuanzhe Liu and Xin Jin , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2404.12457 , eprinttype =. 2404.12457 , timestamp =

work page doi:10.48550/arxiv.2404.12457 2024
[19]

Manning , title =

Parth Sarthi and Salman Abdullah and Aditi Tuli and Shubh Khanna and Anna Goldie and Christopher D. Manning , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024
[20]

Wikipedia (en) embedded with cohere.ai multilingual22-12 encoder , year =
[21]

The Thirteenth International Conference on Learning Representations , year=

Block-Attention for Efficient Prefilling , author=. The Thirteenth International Conference on Learning Representations , year=
[22]

2024 , note =

LangChain , howpublished =. 2024 , note =

2024
[23]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , booktitle =

Patrick Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , booktitle =
[24]

23rd USENIX Conference on File and Storage Technologies (FAST 25) , year =

Ruoyu Qin and Zheming Li and Weiran He and Jialei Cui and Feng Ren and Mingxing Zhang and Yongwei Wu and Weimin Zheng and Xinran Xu , title =. 23rd USENIX Conference on File and Storage Technologies (FAST 25) , year =
[25]

and Zhang, Hao and Stoica, Ion , booktitle =

Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph Gonzalez and Hao Zhang and Ion Stoica , editor =. Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =. 2023 , url =. doi:10.1145/3600006.3613165 , timestamp =

work page doi:10.1145/3600006.3613165 2023
[26]

Shapley Value Approximation Based on Complementary Contribution , year=

Sun, Qiheng and Zhang, Jiayao and Liu, Jinfei and Xiong, Li and Pei, Jian and Ren, Kui , journal=. Shapley Value Approximation Based on Complementary Contribution , year=
[27]

Thirty-seventh Conference on Neural Information Processing Systems , year=

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[28]

The Twelfth International Conference on Learning Representations , year=

Efficient Streaming Language Models with Attention Sinks , author=. The Twelfth International Conference on Learning Representations , year=
[29]

Yuhong Li and Yingbing Huang and Bowen Yang and Bharat Venkitesh and Acyr Locatelli and Hanchen Ye and Tianle Cai and Patrick Lewis and Deming Chen , booktitle=. Snap. 2024 , url=

2024
[30]

2024 , eprint=

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling , author=. 2024 , eprint=

2024
[31]

2025 , eprint=

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference , author=. 2025 , eprint=

2025
[32]

Not All Heads Matter: A Head-Level

Yu Fu and Zefan Cai and Abedelkadir Asi and Wayne Xiong and Yue Dong and Wen Xiao , booktitle=. Not All Heads Matter: A Head-Level. 2025 , url=

2025
[33]

2024 , eprint=

GPT-4 Technical Report , author=. 2024 , eprint=

2024
[34]

2024 , eprint=

The Llama 3 Herd of Models , author =. 2024 , eprint=

2024
[35]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

2023
[36]

2024 , howpublished =

Anthropic , title =. 2024 , howpublished =

2024
[37]

Zhang, Jiayao and Sun, Qiheng and Liu, Jinfei and Xiong, Li and Pei, Jian and Ren, Kui , title =. Proc. ACM Manag. Data , month = may, articleno =. 2023 , issue_date =. doi:10.1145/3588728 , abstract =

work page doi:10.1145/3588728 2023
[38]

LongBench:

Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi. L ong B ench: A Bilingual, Multitask Benchmark for Long Context Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computation...

work page doi:10.18653/v1/2024.acl-long.172 2024
[39]

Contribution to the Theory of Games , volume=

A value for n-person games , author=. Contribution to the Theory of Games , volume=
[40]

Scissorhands: Exploiting the Persistence of Importance Hypothesis for

Zichang Liu and Aditya Desai and Fangshuo Liao and Weitao Wang and Victor Xie and Zhaozhuo Xu and Anastasios Kyrillidis and Anshumali Shrivastava , booktitle=. Scissorhands: Exploiting the Persistence of Importance Hypothesis for. 2023 , url=

2023
[41]

RazorAttention: Efficient

Hanlin Tang and Yang Lin and Jing Lin and Qingsen Han and Shikuan Hong and Danning Ke and Yiwu Yao and Gongyi Wang , booktitle=. RazorAttention: Efficient. 2025 , url=

2025
[42]

DuoAttention: Efficient Long-Context

Guangxuan Xiao and Jiaming Tang and Jingwei Zuo and junxian guo and Shang Yang and Haotian Tang and Yao Fu and Song Han , booktitle=. DuoAttention: Efficient Long-Context. 2025 , url=

2025
[43]

The Thirteenth International Conference on Learning Representations , year=

Retrieval Head Mechanistically Explains Long-Context Factuality , author=. The Thirteenth International Conference on Learning Representations , year=
[44]

2019 , eprint=

Fast Transformer Decoding: One Write-Head is All You Need , author=. 2019 , eprint=

2019
[45]

FlashAttention: Fast and Memory-Efficient Exact Attention with

Tri Dao and Daniel Y Fu and Stefano Ermon and Atri Rudra and Christopher Re , booktitle=. FlashAttention: Fast and Memory-Efficient Exact Attention with. 2022 , url=

2022
[46]

Proceedings of the 29th Symposium on Operating Systems Principles , pages=

Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th Symposium on Operating Systems Principles , pages=
[47]

A Unified Approach to Interpreting Model Predictions , url =

Lundberg, Scott M and Lee, Su-In , booktitle =. A Unified Approach to Interpreting Model Predictions , url =
[48]

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =

Towards Efficient Data Valuation Based on the Shapley Value , author =. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics , pages =. 2019 , editor =

2019
[49]

Journal of Machine Learning Research , year =

Rory Mitchell and Joshua Cooper and Eibe Frank and Geoffrey Holmes , title =. Journal of Machine Learning Research , year =
[50]

Papadimitriou , title =

Xiaotie Deng and Christos H. Papadimitriou , title =. Math. Oper. Res. , volume =. 1994 , url =. doi:10.1287/MOOR.19.2.257 , timestamp =

work page doi:10.1287/moor.19.2.257 1994
[51]

Benchmarking Retrieval-Augmented Generation for Medicine

Xiong, Guangzhi and Jin, Qiao and Lu, Zhiyong and Zhang, Aidong. Benchmarking Retrieval-Augmented Generation for Medicine. Findings of the Association for Computational Linguistics ACL 2024. 2024

2024
[52]

H y PA - RAG : A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications

Kalra, Rishi and Wu, Zekun and Gulley, Ayesha and Hilliard, Airlie and Guan, Xin and Koshiyama, Adriano and Treleaven, Philip Colin. H y PA - RAG : A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Dom...

work page doi:10.18653/v1/2024.customnlp4u-1.18 2024
[53]

Cell , volume=

Empowering biomedical discovery with AI agents , author=. Cell , volume=. 2024 , publisher=

2024
[54]

2024 USENIX Annual Technical Conference (USENIX ATC 24) , year =

Bin Gao and Zhuomin He and Puru Sharma and Qingxuan Kang and Djordje Jevdjic and Junbo Deng and Xingkun Yang and Zhou Yu and Pengfei Zuo , title =. 2024 USENIX Annual Technical Conference (USENIX ATC 24) , year =

2024
[55]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=
[56]

Retrieval Augmented Language Model Pre-Training , booktitle =

Kelvin Guu and Kenton Lee and Zora Tung and Panupong Pasupat and Ming. Retrieval Augmented Language Model Pre-Training , booktitle =. 2020 , url =

2020
[57]

Companion Proceedings of the

Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks , author=. Companion Proceedings of the. 2025 , timestamp =

2025
[58]

Retrieval-based Language Models and Applications , booktitle =

Akari Asai and Sewon Min and Zexuan Zhong and Danqi Chen , editor =. Retrieval-based Language Models and Applications , booktitle =. 2023 , url =. doi:10.18653/V1/2023.ACL-TUTORIALS.6 , timestamp =

work page doi:10.18653/v1/2023.acl-tutorials.6 2023
[59]

Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Jacob Devlin and Kenton Lee and Kristina Toutanova and Llion Jones and Matthew Kelcey and Ming

Ori Ram and Yoav Levine and Itay Dalmedigos and Dor Muhlgay and Amnon Shashua and Kevin Leyton. In-Context Retrieval-Augmented Language Models , journal =. 2023 , url =. doi:10.1162/TACL\_A\_00605 , timestamp =

work page internal anchor Pith review doi:10.1162/tacl 2023
[60]

The Twelfth International Conference on Learning Representations,

Tri Dao , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024
[61]

2002 , url =

Hao Che and Ye Tung and Zhijun Wang , title =. 2002 , url =. doi:10.1109/JSAC.2002.801752 , timestamp =

work page doi:10.1109/jsac.2002.801752 2002
[62]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert and Jacob Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James V. Miranda and Alisa Liu and Nouha Dziri and Shane Lyu and Yuling Gu and Saumya Malik and Victoria Graf and Jena D. Hwang and Jiangjiang Yang and Ronan Le Bras and Oyvind Tafjord and Chris Wilhelm and Luca Soldaini and Noah A. Smi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2411.15124 2024
[63]

2022 , url=

NVIDIA Hopper Architecture In-Depth , author=. 2022 , url=

2022
[64]

Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

Ho, Xanh and Duong Nguyen, Anh-Khoa and Sugawara, Saku and Aizawa, Akiko. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Proceedings of the 28th International Conference on Computational Linguistics. 2020

2020
[65]

Sentence-BERT: Sentence embeddings using siamese BERT- networks

Reimers, Nils and Gurevych, Iryna. Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019
[66]

2025 , eprint=

DeepSeek-V3 Technical Report , author=. 2025 , eprint=

2025
[67]

Fanxu Meng and Pingzhi Tang and Zengwei Yao and Xing Sun and Muhan Zhang , booktitle=. Trans. 2025 , url=

2025
[68]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek. DeepSeek-V2:. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2405.04434 , eprinttype =. 2405.04434 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
[69]

Ermo Hua and Che Jiang and Xingtai Lv and Kaiyan Zhang and Ning Ding and Youbang Sun and Biqing Qi and Yuchen Fan and Xuekai Zhu and Bowen Zhou. CoRR. 2024. doi:10.48550/arXiv.2412.17739
[70]

Emily M. Bender and Timnit Gebru and Angelina McMillan. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 2021. doi:10.1145/3442188.3445922
[71]

Laura Weidinger and John Mellor and Maribeth Rauh and Conor Griffin and Jonathan Uesato and Po. Ethical and social risks of harm from Language Models. 2021. arXiv:2112.04359
[72]

Nikita Kitaev and Lukasz Kaiser and Anselm Levskaya. 8th International Conference on Learning Representations. 2020.
[73]

Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish. ♫ MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics. 2022. doi:10.1162/tacl_a_00475
[74]

Bogdan Gliwa and Iwona Mochol and Maciej Biesek and Aleksander Wawer. CoRR. 2019. arXiv:1911.12237
[75]

Fabbri, Alexander and Li, Irene and She, Tianwei and Li, Suyi and Radev, Dragomir. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1102
[76]

Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer. TriviaQA:. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017. doi:10.18653/v1/P17-1147
[77]

Kočiský, Tomáš and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, Gábor and Grefenstette, Edward. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics. 2018. doi:10.1162/tacl_a_00023
[78]

Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1259
[79]

Jinhyuk Lee and Mujeen Sung and Jaewoo Kang and Danqi Chen. Learning Dense Representations of Phrases at Scale. 2021. doi:10.18653/v1/2021.acl-long.518
[80]

Gautier Izacard and Edouard Grave. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. doi:10.18653/v1/2021.eacl-main.74
