pith. machine review for the scientific record. sign in

arxiv: 2605.08580 · v1 · submitted 2026-05-09 · 💻 cs.MA · cs.AI

Recognition: no theorem link

Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

Ravi Netravali, Rui Pan, Yinwei Dai, Zhuofu Chen

Pith reviewed 2026-05-12 00:58 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords LLM agentscontext compactionasynchronous validationlong-horizon reasoningtrajectory groundingsummary validationagent accuracyparallel execution
0
0 comments X

The pith

Asynchronous compaction runs the summary generator in parallel with the agent on the original trajectory so a judge can check preservation of forward intent and facts independently of the summary itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents accumulate large trajectories that must be compacted into shorter summaries before execution can continue. Synchronous compaction places the summarizer on the critical path and leaves it unaware of what information the agent will need in future steps, so errors can propagate through coherent but incorrect behavior. Slipstream runs compaction asynchronously: the compactor and the agent both continue from the identical pre-compaction state, producing a candidate summary and a continued reasoning trace in parallel. A judge then compares the summary against that independent trace to verify that forward intent and necessary facts remain intact. The method raises task accuracy by up to 8.8 points and lowers end-to-end latency by up to 39.7 percent on SWE-bench Verified and BrowseComp workloads.

Core claim

Slipstream shows that running the compactor in parallel with continued agent execution on the original context creates a validation signal independent of the summary; a judge LLM can therefore inspect the agent's next steps to confirm that the candidate summary preserves both forward intent and the key facts and constraints the agent depends on.

What carries the argument

The trajectory-grounded judge that validates a candidate summary by comparing it to the agent's continued reasoning trace generated from the identical pre-compaction state.

If this is right

  • Task accuracy rises by up to 8.8 percentage points on long-horizon coding and web-browsing benchmarks.
  • End-to-end latency falls by up to 39.7 percent because compaction no longer blocks the agent's critical path.
  • Validation errors are caught before they can propagate through subsequent coherent but incorrect agent steps.
  • The same pre-compaction trajectory supplies both the summary and its independent validation signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the judge works reliably, the same parallel-validation pattern could be applied to other context-reduction methods such as retrieval or memory pruning.
  • Agent designs that surface explicit forward-intent statements might make the judge's check simpler and more robust.
  • Repeated application across multiple compaction steps could preserve coherence over horizons far longer than single-step validation allows.

Load-bearing premise

A judge LLM can reliably determine whether a candidate summary preserves the agent's forward intent and necessary facts solely by inspecting the agent's continued reasoning on the original trajectory, without access to ground-truth future outcomes.

What would settle it

An experiment that measures whether summaries accepted by the judge produce measurably higher final task success rates than summaries rejected by the judge or than randomly chosen summaries on the same long-horizon benchmarks.

Figures

Figures reproduced from arXiv: 2605.08580 by Ravi Netravali, Rui Pan, Yinwei Dai, Zhuofu Chen.

Figure 1
Figure 1. Figure 1: Synchronous compaction vs. Slipstream. (a) Synchronous compaction blocks agent execution and offers no visibility into what future actions require, leading to silent accuracy degradation. (b) Slipstream runs the compactor in parallel with continued agent execution on the original context, hiding compaction latency. The next-k agent actions provide a held-out validation signal covering intent and facts for … view at source ↗
Figure 2
Figure 2. Figure 2: Three illustrative compaction failures from real agent traces on browsing and coding workloads. Each [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Challenges with synchronous compaction. (a) Compaction lies on the agent’s critical path, accounting [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Latency analysis across workloads, models, and compaction thresholds ( [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: [SWE-bench verified, Qwen3.5-9B] Isolating [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Latency breakdown across workloads, models, and compaction thresholds. Slipstream hides summarization overhead within the asynchronous execution window, achieving near-zero net latency overhead. B Detailed Prompts B.1 Trajectory-Grounded Judge Prompt The judge decides whether a candidate compacted state is sufficient to adopt by checking it against the next-k trajectory—the agent’s continued execution on t… view at source ↗
read the original abstract

To cope with the large contexts that long-horizon LLM agents produce, modern frameworks increasingly rely on compaction -- invoking an LLM to rewrite the accumulated trajectory into a shorter summary that the agent resumes from. Today, compaction runs synchronously on the critical path of agent execution but this can unpredictably degrade accuracy due to a structural validation gap: the compactor must condense context but is fundamentally unaware of precisely what information the agent will need later. Further, because post-compaction agent steps are conditioned on the new summary, targeted validation criteria do not exist and errors silently propagate through coherent but incorrect behavior. Our key insight is that asynchronous compaction efficiently addresses this gap: by running the compactor in parallel with continued agent execution on the original context, the candidate summary and the agent's next steps are generated independently from the same pre-compaction state, yielding a validation signal independent of the summary itself. We build Slipstream, a trajectory-grounded compaction system that uses a judge to validate the candidate summary against the agent's continued reasoning, checking that it preserves both the agent's forward intent and the key facts and constraints it depends on. Across long-horizon coding (SWE-bench Verified) and web-browsing (BrowseComp) workloads, Slipstream improves task accuracy by up to 8.8 percentage points while reducing end-to-end latency by up to 39.7%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Slipstream, a system for asynchronous compaction validation in long-horizon LLM agents. It identifies a structural validation gap in synchronous compaction where the compactor lacks knowledge of future agent needs, leading to silent accuracy degradation. The proposed solution runs the compactor in parallel with continued agent execution on the original pre-compaction context, then employs a judge LLM to validate candidate summaries by checking preservation of forward intent and key facts against the independently generated next-step trajectory. Empirical evaluation on SWE-bench Verified (coding) and BrowseComp (web-browsing) reports task accuracy gains of up to 8.8 percentage points and end-to-end latency reductions of up to 39.7%.

Significance. If the validation signal proves reliable and the reported gains hold under rigorous controls, this work could meaningfully advance practical context management for LLM agents by enabling compaction off the critical path without accuracy penalties. The asynchronous design provides a clean separation between summary generation and validation, and the concrete benchmark improvements constitute a practical contribution. The empirical focus on real workloads is a strength, though the absence of detailed protocol information and analysis of long-term dependency capture limits immediate impact assessment.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: The reported accuracy gains of up to 8.8 percentage points and latency reductions of up to 39.7% are presented without specification of the exact baselines (e.g., synchronous compaction variants or alternative summarizers), number of trials, statistical significance tests, error bars, or precise measurement protocol for end-to-end latency and task success. This information is load-bearing for interpreting the magnitude and robustness of the central empirical claims.
  2. [Method] Method section (asynchronous validation description): The validation relies on a judge LLM inspecting the agent's continued reasoning on the original trajectory to confirm preservation of forward intent and facts. No ablation, correlation analysis, or long-horizon experiment is provided to test whether immediate next steps reliably surface all constraints that may only become relevant many steps later, leaving the weakest assumption unexamined and risking false acceptances on tasks where dependencies are delayed.
minor comments (2)
  1. [Method] The manuscript would benefit from a diagram or pseudocode illustrating the parallel execution flow of the compactor and agent to clarify the independence of the validation signal.
  2. [Preliminaries] Notation for the pre- and post-compaction states and the judge input format should be standardized and defined explicitly in the first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for acknowledging the potential practical contribution of our asynchronous compaction approach. We address each major comment below with clarifications and commitments to revisions that improve the manuscript's clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: The reported accuracy gains of up to 8.8 percentage points and latency reductions of up to 39.7% are presented without specification of the exact baselines (e.g., synchronous compaction variants or alternative summarizers), number of trials, statistical significance tests, error bars, or precise measurement protocol for end-to-end latency and task success. This information is load-bearing for interpreting the magnitude and robustness of the central empirical claims.

    Authors: We agree that these details are essential for evaluating the empirical claims. The current manuscript presents high-level results without sufficient protocol information. In the revised version, we will expand the Experiments section with a dedicated 'Experimental Setup' subsection specifying: the exact baselines (synchronous compaction with standard LLM summarizers and alternative methods), number of trials and seeds, statistical tests (e.g., McNemar's test for accuracy differences and paired t-tests for latency) with p-values, error bars as standard error, and the precise end-to-end latency protocol (full task duration including compaction and validation overhead). We will also update the abstract to reference these controlled comparisons. revision: yes

  2. Referee: [Method] Method section (asynchronous validation description): The validation relies on a judge LLM inspecting the agent's continued reasoning on the original trajectory to confirm preservation of forward intent and facts. No ablation, correlation analysis, or long-horizon experiment is provided to test whether immediate next steps reliably surface all constraints that may only become relevant many steps later, leaving the weakest assumption unexamined and risking false acceptances on tasks where dependencies are delayed.

    Authors: We acknowledge the importance of examining whether next-step trajectories capture delayed dependencies. Our design prioritizes an independent validation signal from the pre-compaction state to prevent circularity, and the observed accuracy gains on long-horizon benchmarks provide supporting evidence. To directly address this, the revised manuscript will include a new analysis subsection with: an ablation varying validation trajectory length (1, 3, and 5 steps), a correlation analysis between judge scores and final task success, and an explicit discussion of limitations for very long-term dependencies along with potential mitigations such as periodic re-validation. These will be evaluated on the existing SWE-bench and BrowseComp workloads. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical system design with independent validation signal

full rationale

The paper presents Slipstream as an empirical system for asynchronous compaction in LLM agents. The central insight—that running compaction in parallel with continued execution on the original context yields an independent validation signal—is a design choice, not a derivation from equations or self-referential definitions. No load-bearing steps reduce to fitted parameters, self-citations, or ansatzes by construction. The approach is evaluated on external benchmarks (SWE-bench Verified, BrowseComp) with reported accuracy and latency improvements. The LLM judge assumption is stated but does not create internal circularity in the claimed results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions about LLM summarization behavior and the reliability of judge-based validation; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The compactor is fundamentally unaware of future information needs when summarizing synchronously.
    Stated as the core structural validation gap.
  • domain assumption Parallel execution on the original context produces an independent validation signal usable by a judge.
    Central to the key insight and system design.

pith-pipeline@v0.9.0 · 5547 in / 1377 out tokens · 36197 ms · 2026-05-12T00:58:08.629657+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 14 internal anchors

  1. [1]

    2025 , eprint =

    Qwen3 Technical Report , author =. 2025 , eprint =

  2. [2]

    2025 , howpublished =

    Seed-. 2025 , howpublished =

  3. [4]

    and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R

    Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R. , booktitle =. 2024 , eprint =

  4. [5]

    2023 , eprint =

    Mialon, Gr. 2023 , eprint =

  5. [6]

    and Keutzer, Kurt and Gholami, Amir , booktitle =

    Kim, Sehoon and Moon, Suhong and Tabrizi, Ryan and Lee, Nicholas and Mahoney, Michael W. and Keutzer, Kurt and Gholami, Amir , booktitle =. An. 2024 , eprint =

  6. [7]

    2026 , howpublished =

    Compaction , author =. 2026 , howpublished =

  7. [8]

    2025 , howpublished =

    How to Manage Long Context with Summarization , author =. 2025 , howpublished =

  8. [10]

    and Zhang, Hao and Stoica, Ion , booktitle =

    Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , booktitle =. Efficient Memory Management for Large Language Model Serving with. 2023 , eprint =

  9. [12]

    2024 , month =

    Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding , author =. 2024 , month =

  10. [14]

    Lost in the Middle: How Language Models Use Long Contexts

    Lost in the Middle: How Language Models Use Long Contexts , author =. 2023 , eprint =. doi:10.48550/arXiv.2307.03172 , url =

  11. [15]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    On Context Utilization in Summarization with Large Language Models , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2024 , address =. doi:10.18653/v1/2024.acl-long.153 , url =

  12. [16]

    2025 , month =

    Context Rot: How Increasing Input Tokens Impacts LLM Performance , author =. 2025 , month =

  13. [17]

    2024 , doi =

    Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi , booktitle =. 2024 , doi =

  14. [18]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Hsieh, Cheng-Ping and Sun, Simeng and Kriman, Samuel and Acharya, Shantanu and Rekesh, Dima and Jia, Fei and Zhang, Yang and Ginsburg, Boris , booktitle =. 2024 , eprint =. doi:10.48550/arXiv.2404.06654 , url =

  15. [19]

    doi:10.48550/arXiv.2407.11963 , url =

    Li, Mo and Zhang, Songyang and Zhang, Taolin and Duan, Haodong and Liu, Yunxin and Chen, Kai , year =. doi:10.48550/arXiv.2407.11963 , url =. 2407.11963 , archivePrefix =

  16. [20]

    Rossi, Seunghyun Yoon, and Hinrich Schütze

    Modarressi, Ali and Deilamsalehy, Hanieh and Dernoncourt, Franck and Bui, Trung and Rossi, Ryan A. and Yoon, Seunghyun and Sch. Forty-second International Conference on Machine Learning , year =. doi:10.48550/arXiv.2502.05167 , url =. 2502.05167 , archivePrefix =

  17. [22]

    , author=

    MemGPT: towards LLMs as operating systems. , author=. 2023 , publisher=

  18. [25]

    arXiv preprint arXiv:2503.14499 , year=

    Measuring AI Ability to Complete Long Software Tasks , author =. Advances in Neural Information Processing Systems , year =. doi:10.48550/arXiv.2503.14499 , url =. 2503.14499 , archivePrefix =

  19. [26]

    Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

    Scaling Long-Horizon LLM Agent via Context-Folding , author =. 2025 , eprint =. doi:10.48550/arXiv.2510.11967 , url =

  20. [27]

    and Wutschitz, Lukas and Chen, Yanzhi and Sim, Robert and Rajmohan, Saravan , booktitle =

    Kang, Minki and Chen, Wei-Ning and Han, Dongge and Inan, Huseyin A. and Wutschitz, Lukas and Chen, Yanzhi and Sim, Robert and Rajmohan, Saravan , booktitle =. 2026 , url =

  21. [28]

    2026 , eprint =

    Beyond Static Summarization: Proactive Memory Extraction for LLM Agents , author =. 2026 , eprint =. doi:10.48550/arXiv.2601.04463 , url =

  22. [29]

    2026 , howpublished =

    Automatic Context Compaction , author =. 2026 , howpublished =

  23. [30]

    2026 , howpublished =

    Memory & Context Management with Claude Sonnet 4.6 , author =. 2026 , howpublished =

  24. [31]

    2026 , howpublished =

    Context Engineering: Memory, Compaction, and Tool Clearing , author =. 2026 , howpublished =

  25. [32]

    2025 , month =

    Effective Context Engineering for AI Agents , author =. 2025 , month =

  26. [33]

    2026 , month =

    Context Management for Deep Agents , author =. 2026 , month =

  27. [34]

    2026 , howpublished =

    Sessions , author =. 2026 , howpublished =

  28. [35]

    2025 , month =

    Slash Commands, Summarization, and Improved Agent Terminal , author =. 2025 , month =

  29. [36]

    2026 , howpublished =

    Manage Costs Effectively , author =. 2026 , howpublished =

  30. [37]

    2025 , month =

    Introducing Codex , author =. 2025 , month =

  31. [38]

    2025 , howpublished =

    Context Engineering: Short-Term Memory Management with Sessions , author =. 2025 , howpublished =

  32. [39]

    Advances in Neural Information Processing Systems , volume=

    Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=

  33. [41]

    Advances in neural information processing systems , volume=

    Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=

  34. [43]

    Advances in Neural Information Processing Systems , volume=

    Snapkv: Llm knows what you are looking for before generation , author=. Advances in Neural Information Processing Systems , volume=

  35. [46]

    Proceedings of the AAAI Conference on Artificial Intelligence , year=

    MemoryBank: Enhancing Large Language Models with Long-Term Memory , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

  36. [47]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  37. [50]

    Proceedings of EMNLP , year=

    Evaluating the Factual Consistency of Abstractive Text Summarization , author=. Proceedings of EMNLP , year=

  38. [51]

    Proceedings of NAACL-HLT , year=

    Evaluating Content Selection in Summarization: The Pyramid Method , author=. Proceedings of NAACL-HLT , year=

  39. [52]

    Proceedings of ACL , year=

    FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization , author=. Proceedings of ACL , year=

  40. [53]

    Proceedings of EMNLP , year=

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment , author=. Proceedings of EMNLP , year=

  41. [54]

    Proceedings of NAACL , year=

    Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics , author=. Proceedings of NAACL , year=

  42. [55]

    Proceedings of EMNLP , year=

    QuestEval: Summarization Asks for Fact-based Evaluation , author=. Proceedings of EMNLP , year=

  43. [56]

    Proceedings of ACL , year=

    Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization , author=. Proceedings of ACL , year=

  44. [57]

    Proceedings of COLM , year=

    FABLES: Evaluating Faithfulness and Content Selection in Book-Length Summarization , author=. Proceedings of COLM , year=

  45. [58]

    Text Summarization Branches Out , year=

    ROUGE: A Package for Automatic Evaluation of Summaries , author=. Text Summarization Branches Out , year=

  46. [59]

    Automatic context compaction

    Anthropic . Automatic context compaction. https://platform.claude.com/cookbook/tool-use-automatic-context-compaction, 2026 a . Claude Cookbook. Accessed: 2026-04-30

  47. [60]

    Compaction

    Anthropic . Compaction. https://platform.claude.com/docs/en/build-with-claude/compaction, 2026 b . Claude API documentation (beta). Accessed: 2026-05-05

  48. [61]

    Context engineering: Memory, compaction, and tool clearing

    Anthropic . Context engineering: Memory, compaction, and tool clearing. https://platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools, 2026 c . Claude Cookbook. Accessed: 2026-04-30

  49. [62]

    Rocketkv: Accelerating long-context llm inference via two-stage kv cache compression

    Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, and Alexey Tumanov. Rocketkv: Accelerating long-context llm inference via two-stage kv cache compression. arXiv preprint arXiv:2502.14051, 2025

  50. [63]

    Seed- OSS open-source models

    ByteDance Seed Team . Seed- OSS open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025. Accessed: 2026-05-05

  51. [64]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

  52. [65]

    Re-examining system-level correlations of automatic summarization evaluation metrics

    Daniel Deutsch, Rotem Dror, and Dan Roth. Re-examining system-level correlations of automatic summarization evaluation metrics. In Proceedings of NAACL, 2022

  53. [66]

    Huerta, and Hao Peng

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23281--23298, Suzhou, China, 2025. Association for C...

  54. [67]

    Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization

    Esin Durmus, He He, and Mona Diab. Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of ACL, 2020

  55. [68]

    Cascade inference: Memory-efficient shared prefix batch decoding

    FlashInfer Team . Cascade inference: Memory-efficient shared prefix batch decoding. https://flashinfer.ai/2024/02/02/cascade-inference.html, February 2024. FlashInfer blog post. Accessed: 2026-05-04

  56. [69]

    Context rot: How increasing input tokens impacts llm performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts llm performance. Technical report, Chroma, July 2025. URL https://trychroma.com/research/context-rot

  57. [70]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE -bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06770

  58. [71]

    Mahoney, Kurt Keutzer, and Amir Gholami

    Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM compiler for parallel function calling. In Proceedings of the 41st International Conference on Machine Learning, 2024 a . URL https://arxiv.org/abs/2312.04511

  59. [72]

    Fables: Evaluating faithfulness and content selection in book-length summarization

    Yekyung Kim, Yapei Chang, Marzena Karpinska, Aparna Garimella, Varun Manjunatha, Kyle Lo, Tanya Goyal, and Mohit Iyyer. Fables: Evaluating faithfulness and content selection in book-length summarization. In Proceedings of COLM, 2024 b

  60. [73]

    Evaluating the factual consistency of abstractive text summarization

    Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. In Proceedings of EMNLP, 2020

  61. [74]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention . In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. URL https://arxiv.org/abs/2309.06180

  62. [75]

    How to manage long context with summarization

    LangChain . How to manage long context with summarization. https://langchain-ai.github.io/langmem/guides/summarization/, 2025. LangMem documentation. Accessed: 2026-05-05

  63. [76]

    u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  64. [77]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37: 0 22947--22970, 2024

  65. [78]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004

  66. [79]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023 a . URL https://arxiv.org/abs/2307.03172

  67. [80]

    G-eval: Nlg evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of EMNLP, 2023 b

  68. [81]

    Scaling llm multi-turn rl with end-to-end summarization-based context management

    Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen. Scaling llm multi-turn rl with end-to-end summarization-based context management. arXiv preprint arXiv:2510.06727, 2025

  69. [82]

    GAIA: a benchmark for General AI Assistants

    Gr \'e goire Mialon, Cl \'e mentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA : A benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983

  70. [83]

    Compaction

    Microsoft . Compaction. https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction, 2026. Microsoft Agent Framework documentation. Accessed: 2026-05-05

  71. [84]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  72. [85]

    Evaluating content selection in summarization: The pyramid method

    Ani Nenkova and Rebecca Passonneau. Evaluating content selection in summarization: The pyramid method. In Proceedings of NAACL-HLT, 2004

  73. [86]

    Memgpt: towards llms as operating systems

    Charles Packer, Vivian Fang, Shishir\_G Patil, Kevin Lin, Sarah Wooders, and Joseph\_E Gonzalez. Memgpt: towards llms as operating systems. 2023

  74. [87]

    Specrea- son: Fast and accurate inference-time compute via speculative reasoning

    Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, and Ravi Netravali. Specreason: Fast and accurate inference-time compute via speculative reasoning. arXiv preprint arXiv:2504.07891, 2025

  75. [88]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36: 0 68539--68551, 2023

  76. [89]

    Questeval: Summarization asks for fact-based evaluation

    Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. Questeval: Summarization asks for fact-based evaluation. In Proceedings of EMNLP, 2021

  77. [90]

    Scaling long-horizon LLM agent via context-folding.CoRR, abs/2510.11967, 2025

    Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon llm agent via context-folding, 2025. URL https://arxiv.org/abs/2510.11967

  78. [91]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

  79. [92]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp : A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

  80. [93]

    Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, Timi Adeniran, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, and Aaron Zhao. Combating the memory walls: Optimization pathways for long-context agentic LLM inference, 2025 a . URL htt...

Showing first 80 references.