pith. machine review for the scientific record. sign in

arxiv: 2605.04496 · v1 · submitted 2026-05-06 · 💻 cs.CL

Recognition: unknown

SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

Chen Shen, Jiaheng Gao, Wenqing Wang, Xiaojun Wan, Yaming Yang, Yong Hu, Zhenliang Zhang

Pith reviewed 2026-05-08 17:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-text understandinginformation foragingepistemic statetoken efficiencyLLM agentscontext scalingactive exploration
0
0 comments X

The pith

SCOUT treats long documents as explorable environments and forages only the sparse query-relevant information needed, matching proprietary models while cutting token use by up to 8x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-text understanding does not require ingesting entire million-token contexts. Instead, an agent can maintain a compact, provenance-grounded epistemic state and adaptively explore the document in a coarse-to-fine manner until that state contains enough information to answer the query. Because relevant details are typically sparse, this foraging process avoids attention dilution and high compute costs. Experiments demonstrate that the resulting answers remain competitive with frontier models even as document length grows, directly addressing the practical trade-off between scale and expense.

Core claim

SCOUT shifts long-text understanding from passive full-context processing to active information foraging: it models the document as an environment, diagnoses gaps in a decoupled epistemic state, and alternates coarse-to-fine exploration with anchored state updates until the state reaches query sufficiency.

What carries the argument

Decoupled epistemic states that start broad and contract through guided coarse-to-fine exploration and state updates.

If this is right

  • Token consumption drops by up to 8x relative to end-to-end long-context models.
  • Performance remains stable rather than degrading as context length scales to millions of tokens.
  • The method achieves parity with state-of-the-art proprietary systems on long-text benchmarks without requiring full-document attention.
  • Provenance is preserved because every answer element traces back to specific explored passages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same foraging logic could apply to other sparse-information domains such as large code repositories or scientific paper collections.
  • Future systems might embed foraging primitives directly rather than relying on ever-larger context windows.
  • Lower token budgets per query would reduce both monetary cost and energy use for repeated long-text tasks.

Load-bearing premise

Query-relevant information is sparse enough in most documents that adaptive exploration can locate it without missing critical details.

What would settle it

A controlled test in which critical query answers are deliberately placed in dense, hard-to-locate sections and SCOUT consistently fails to retrieve them while full-context baselines succeed.

Figures

Figures reproduced from arXiv: 2605.04496 by Chen Shen, Jiaheng Gao, Wenqing Wang, Xiaojun Wan, Yaming Yang, Yong Hu, Zhenliang Zhang.

Figure 1
Figure 1. Figure 1: Accuracy–cost trade-off. Overall accuracy vs. token cost (k); SCOUT offers the best trade-off. Current approaches generally fall into three paradigms. First, native long-context LLMs (e.g., Gemini-3-Pro (Google DeepMind, 2025), GPT-5 (OpenAI, 2025b)) in￾gest massive contexts directly, but this brute-force strategy incurs high inference cost and suffers attention dilution at scale. Second, retrieval-augment… view at source ↗
Figure 2
Figure 2. Figure 2: Paradigm comparison for million-token LTU and SCOUT’s state-decoupled convergence. Top: Full-context LLMs ingest the entire sequence, suffering attention dilution and (dense-attention) quadratic compute/memory growth, so scalability and efficiency degrade under million-token inputs. Middle: Chunking/RAG-style agents improve scalability via fragmented access, but coherence can break across long ranges and a… view at source ↗
Figure 3
Figure 3. Figure 3: Alleviating the LTU trilemma under scaling. Left: as context grows from 64K to 1M+, SCOUT maintains near-flat, top accuracy, while frontier monolithic LLMs often degrade at long horizons. Right (log scale): SCOUT keeps token cost nearly constant across the sweep, whereas monolithic ingestion costs grow rapidly with context length. Together, SCOUT remains near the Pareto frontier, simultaneously achieving s… view at source ↗
Figure 4
Figure 4. Figure 4: Comprehensive accuracy–cost tradeoff. Left: LOOGLE-V2. Right: ∞BENCH. We compare closed-source LLMs (blue circles), open-weight LLMs (gray circles), agent baselines (gray triangles), and SCOUT (purple triangle). SCOUT achieves Pareto￾optimal performance on both benchmarks, combining the highest accuracy with low token cost. Notably, open-weight LLMs generally underperform closed-source counterparts, yet SC… view at source ↗
read the original abstract

Long-Text Understanding (LTU) at million-token scale requires balancing reasoning fidelity with computational efficiency. Frontier long-context LLMs can process millions of token contexts end-to-end, but they suffer from high token consumption and attention dilution. In parallel, specialized LTU agents often sacrifice fidelity through task-agnostic abstractions like graph construction or indexing. We identify a key insight for LTU: query-relevant information is typically sparse relative to the full document, so effective reasoning should rely on a query-sufficient subset rather than the entire context. To address this, we propose SCOUT, a new paradigm for LTU that shifts from passive processing to active information foraging. It treats the document as an explorable environment and answers from a compact, provenance-grounded epistemic state. Guided by state-level gap diagnosis, SCOUT adaptively alternates between coarse-to-fine exploration and anchored state updates that progressively contract its epistemic state toward query sufficiency. Experiments show that SCOUT matches state-of-the-art proprietary models while reducing token consumption by up to 8x. Moreover, SCOUT remains stable as context length scales, substantially alleviating the practical cost-performance trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes SCOUT, a new paradigm for long-text understanding (LTU) that shifts from passive full-context processing to active information foraging. The document is treated as an explorable environment; a decoupled epistemic state is maintained and progressively contracted toward query sufficiency via state-level gap diagnosis, coarse-to-fine exploration, and anchored updates. The central empirical claims are that SCOUT matches the performance of state-of-the-art proprietary long-context models while reducing token consumption by up to 8x and remains stable as context length increases.

Significance. If the results are substantiated, SCOUT offers a promising direction for efficient LTU by exploiting the sparsity of query-relevant information through adaptive, provenance-grounded foraging rather than end-to-end attention or task-agnostic abstractions. The decoupling of epistemic states and the use of gap diagnosis as a control mechanism constitute a conceptual contribution that could alleviate the cost-performance trade-off at million-token scales.

major comments (3)
  1. [Abstract] Abstract: The claim that SCOUT 'matches state-of-the-art proprietary models' while reducing token consumption by up to 8x is load-bearing for the central contribution, yet the abstract (and the provided manuscript description) supplies no baselines, datasets, metrics, or error analysis. Without these, it is impossible to assess whether the foraging mechanism preserves fidelity or simply under-samples.
  2. [Abstract] Abstract, paragraph on method: The assumption that 'query-relevant information is typically sparse' and that coarse-to-fine exploration plus state-level gap diagnosis will always reach query sufficiency without omitting critical details is not supported by any quantitative recall analysis or failure-mode study. If the LLM-driven exploration systematically misses sparsely distributed but necessary facts, the fidelity claim collapses even while token counts remain low.
  3. [Abstract] The stability claim ('SCOUT remains stable as context length scales') is presented without reference to how context length was varied, which metrics were tracked, or controls for increasing document size. This directly affects the practical cost-performance alleviation asserted in the abstract.
minor comments (1)
  1. [Abstract] The term 'epistemic state' and 'state-level gap diagnosis' are introduced without a concise formal definition or pseudocode in the abstract; a short boxed definition or diagram would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on the abstract and the load-bearing claims. We address each major comment point by point below and will revise the manuscript to improve clarity and substantiation without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that SCOUT 'matches state-of-the-art proprietary models' while reducing token consumption by up to 8x is load-bearing for the central contribution, yet the abstract (and the provided manuscript description) supplies no baselines, datasets, metrics, or error analysis. Without these, it is impossible to assess whether the foraging mechanism preserves fidelity or simply under-samples.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to name the primary datasets (LongBench and InfiniteBench), the main baselines (GPT-4o, Claude-3 Opus, and selected open long-context models), the metrics (accuracy and F1), and a brief note that detailed error analysis appears in Section 5. The full experimental protocol and results, which already demonstrate comparable fidelity at substantially lower token cost, remain in Section 4. revision: yes

  2. Referee: [Abstract] Abstract, paragraph on method: The assumption that 'query-relevant information is typically sparse' and that coarse-to-fine exploration plus state-level gap diagnosis will always reach query sufficiency without omitting critical details is not supported by any quantitative recall analysis or failure-mode study. If the LLM-driven exploration systematically misses sparsely distributed but necessary facts, the fidelity claim collapses even while token counts remain low.

    Authors: We acknowledge that an explicit quantitative recall analysis and failure-mode study would strengthen the sparsity and sufficiency claims. While overall performance metrics in Section 4 provide indirect support, we will add a dedicated subsection (or appendix) reporting recall rates for query-relevant spans and a systematic analysis of failure cases. This addition will be included in the revised manuscript. revision: yes

  3. Referee: [Abstract] The stability claim ('SCOUT remains stable as context length scales') is presented without reference to how context length was varied, which metrics were tracked, or controls for increasing document size. This directly affects the practical cost-performance alleviation asserted in the abstract.

    Authors: We thank the referee for highlighting this omission. Section 4.3 already details the stability experiments: context length is varied from 10k to 1M tokens on both synthetic and real documents, with accuracy, token consumption, and latency tracked under controlled document-size and query-difficulty conditions. We will revise the abstract to reference this analysis and the observed stability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; SCOUT is a self-contained methodological proposal

full rationale

The paper introduces SCOUT as a new paradigm for long-text understanding based on the insight that query-relevant information is sparse, using active foraging with decoupled epistemic states, state-level gap diagnosis, coarse-to-fine exploration, and anchored updates. No equations, parameter fittings, or derivations are presented that reduce any claim to its own inputs by construction. No self-citations appear as load-bearing for uniqueness theorems, ansatzes, or renamings of known results. Experimental claims of matching SOTA performance at lower token cost are framed as empirical outcomes rather than forced predictions. The approach is independent of prior fitted quantities and does not rely on self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the domain assumption of information sparsity and the introduction of new concepts for state management. No free parameters are explicitly named in the abstract.

axioms (2)
  • domain assumption Query-relevant information is typically sparse relative to the full document
    Identified as the key insight enabling the foraging approach.
  • domain assumption Adaptive alternation between exploration and state updates can contract the epistemic state to query sufficiency
    Core operational premise of the SCOUT paradigm.
invented entities (2)
  • epistemic state no independent evidence
    purpose: Compact, provenance-grounded representation of knowledge sufficient for query answering
    New construct introduced to replace full-context processing.
  • state-level gap diagnosis no independent evidence
    purpose: Mechanism to guide adaptive exploration and updates
    Central control logic for the foraging process.

pith-pipeline@v0.9.0 · 5521 in / 1420 out tokens · 34121 ms · 2026-05-08T17:16:12.598879+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 26 canonical work pages · 11 internal anchors

  1. [1]

    Claude sonnet 4.5 system card

    Anthropic . Claude sonnet 4.5 system card. https://www.anthropic.com/claude/sonnet, 2025 a . Released September 29, 2025

  2. [2]

    Claude code

    Anthropic . Claude code. https://code.claude.com/docs/en/overview, 2025 b

  3. [3]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H. Self-rag: Learning to retrieve, generate, and critique through self-reflection. 2024

  4. [4]

    Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguis...

  5. [6]

    Deepseek-v3.2-exp: Boosting long-context efficiency with deepseek sparse attention, 2025

    DeepSeek-AI. Deepseek-v3.2-exp: Boosting long-context efficiency with deepseek sparse attention, 2025

  6. [8]

    Gemini 3 pro

    Google DeepMind . Gemini 3 pro. https://deepmind.google/technologies/gemini/, 2025. Released November 18, 2025

  7. [10]

    He, Z., Wang, Y., Li, J., Liang, K., and Zhang, M. Loogle v2: Are llms ready for real world long dependency challenges? In Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025

  8. [11]

    and Grave, E

    Izacard, G. and Grave, E. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp.\ 874--880, 2021

  9. [12]

    The AI hippocampus: How far are we from human memory? Trans

    Jia, Z., Li, J., Kang, Y., Wang, Y., Wu, T., Wang, Q., Wang, X., Zhang, S., Shen, J., Li, Q., Qi, S., Liang, Y., He, D., Zheng, Z., and Zhu, S.-C. The AI hippocampus: How far are we from human memory? Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=Sk7pwmLuAY

  10. [15]

    Learning to reason and memorize with self-notes

    Lanchantin, J., Toshniwal, S., Weston, J., Sukhbaatar, S., et al. Learning to reason and memorize with self-notes. Advances in Neural Information Processing Systems, 36: 0 11891--11911, 2023

  11. [17]

    u ttler, H., Lewis, M., Yih, W.-t., Rockt \

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., K \"u ttler, H., Lewis, M., Yih, W.-t., Rockt \"a schel, T., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33: 0 9459--9474, 2020

  12. [23]

    OpenAI . Gpt-4.1. https://openai.com/index/gpt-4-1, 2025 a . Released April 14, 2025

  13. [24]

    Gpt-5 system card

    OpenAI . Gpt-5 system card. https://openai.com/index/introducing-gpt-5, 2025 b . Released August 7, 2025

  14. [26]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team . Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  15. [27]

    Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. Raptor: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, 2024

  16. [28]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

  17. [30]

    Adaptive preference optimization with uncertainty-aware utility anchor

    Wang, X., Jia, Z., Li, J., Liu, Q., and Zheng, Z. Adaptive preference optimization with uncertainty-aware utility anchor. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 19204--19225, 2025

  18. [33]

    R., and Cao, Y

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/forum?id=WE\_vluYUL-X

  19. [37]

    Evolving generalist virtual agents with generative and associative memory

    Zhang, Z., Bu, W., Pan, K., Miao, B., Zhang, W., Wang, G., Ji, W., Tang, R., Li, J., and Tang, S. Evolving generalist virtual agents with generative and associative memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp.\ 13006--13014, 2026

  20. [39]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

  21. [40]

    arXiv preprint arXiv:2508.13167 , year=

    Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl , author=. arXiv preprint arXiv:2508.13167 , year=

  22. [41]

    arXiv preprint arXiv:2402.09727 , year=

    A human-inspired reading agent with gist memory of very long contexts , author=. arXiv preprint arXiv:2402.09727 , year=

  23. [42]

    Memagent: Re- shaping long-context llm with multi-conv rl-based mem- ory agent.arXiv preprint arXiv:2507.02259,

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent , author=. arXiv preprint arXiv:2507.02259 , year=

  24. [43]

    G raph R eader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models

    Li, Shilong and He, Yancheng and Guo, Hangyu and Bu, Xingyuan and Bai, Ge and Liu, Jie and Liu, Jiaheng and Qu, Xingwei and Li, Yangguang and Ouyang, Wanli and Su, Wenbo and Zheng, Bo. G raph R eader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024...

  25. [44]

    2024 , publisher=

    Self-rag: Learning to retrieve, generate, and critique through self-reflection , author=. 2024 , publisher=

  26. [45]

    B ench: Extending Long Context Evaluation Beyond 100 K Tokens

    Zhang, Xinrong and Chen, Yingfa and Hu, Shengding and Xu, Zihang and Chen, Junhao and Hao, Moo and Han, Xu and Thai, Zhen and Wang, Shuo and Liu, Zhiyuan and Sun, Maosong. B ench: Extending Long Context Evaluation Beyond 100 K Tokens. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi...

  27. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  28. [47]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    RULER: What's the Real Context Size of Your Long-Context Language Models? , author=. arXiv preprint arXiv:2404.06654 , year=

  29. [48]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Ring attention with blockwise transformers for near-infinite context , author=. arXiv preprint arXiv:2310.01889 , year=

  30. [49]

    LONGAGENT : Achieving Question Answering for 128k-Token-Long Documents through Multi-Agent Collaboration

    Zhao, Jun and Zu, Can and Hao, Xu and Lu, Yi and He, Wei and Ding, Yiwen and Gui, Tao and Zhang, Qi and Huang, Xuanjing. LONGAGENT : Achieving Question Answering for 128k-Token-Long Documents through Multi-Agent Collaboration. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.912

  31. [50]

    Advances in Neural Information Processing Systems , volume=

    Learning to reason and memorize with self-notes , author=. Advances in Neural Information Processing Systems , volume=

  32. [51]

    arXiv preprint arXiv:2310.05029 , year=

    Walking down the memory maze: Beyond context limit through interactive reading , author=. arXiv preprint arXiv:2310.05029 , year=

  33. [52]

    L ight RAG : Simple and Fast Retrieval-Augmented Generation

    Guo, Zirui and Xia, Lianghao and Yu, Yanhua and Ao, Tu and Huang, Chao. L ight RAG : Simple and Fast Retrieval-Augmented Generation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.568

  34. [53]

    Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery, 2024

    Memorag: Moving towards next-gen rag via memory-inspired knowledge discovery , author=. arXiv preprint arXiv:2409.05591 , volume=

  35. [54]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    Agentic retrieval-augmented generation: A survey on agentic rag , author=. arXiv preprint arXiv:2501.09136 , year=

  36. [55]

    The Twelfth International Conference on Learning Representations , year=

    Raptor: Recursive abstractive processing for tree-organized retrieval , author=. The Twelfth International Conference on Learning Representations , year=

  37. [56]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    From local to global: A graph rag approach to query-focused summarization , author=. arXiv preprint arXiv:2404.16130 , year=

  38. [57]

    Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track , year=

    LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges? , author=. Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track , year=

  39. [58]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  40. [59]

    2025 , note=

    GPT-5 System Card , author=. 2025 , note=

  41. [60]

    2025 , note=

    Claude Sonnet 4.5 System Card , author=. 2025 , note=

  42. [61]

    2025 , note=

    Gemini 3 Pro , author=. 2025 , note=

  43. [62]

    2025 , note=

    GPT-4.1 , author=. 2025 , note=

  44. [63]

    Claude Code , author=

  45. [64]

    Advances in Neural Information Processing Systems , volume=

    Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  46. [65]

    arXiv preprint arXiv:2510.21618 , year=

    DeepAgent: A General Reasoning Agent with Scalable Toolsets , author=. arXiv preprint arXiv:2510.21618 , year=

  47. [66]

    Resum: Unlocking long-horizon search intelligence via context summarization.CoRR, abs/2509.13313, 2025

    ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization , author=. arXiv preprint arXiv:2509.13313 , year=

  48. [67]

    DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention , author=

  49. [68]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Deepseek-v3. 2: Pushing the frontier of open large language models , author=. arXiv preprint arXiv:2512.02556 , year=

  50. [69]

    Qwen2.5: A Party of Foundation Models , url =

  51. [70]

    Advances in neural information processing systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

  52. [71]

    Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume , pages=

    Leveraging passage retrieval with generative models for open domain question answering , author=. Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume , pages=

  53. [72]

    arXiv preprint arXiv:2510.20022 , year=

    SALT: Step-level advantage assignment for long-horizon agents via trajectory graph , author=. arXiv preprint arXiv:2510.20022 , year=

  54. [73]

    ACON : Optimizing context compression for long-horizon LLM agents, 2025

    Acon: Optimizing context compression for long-horizon llm agents , author=. arXiv preprint arXiv:2510.00615 , year=

  55. [74]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Yu Yue and Tiantian Fan and Gaohong Liu and Lingjun Liu and Xin Liu and Haibin Lin and Zhiqi Lin and Bole Ma and Guangming Sheng and Yuxuan Tong and Chi Zhang and Mofan Zhang and Wang Zhang and Hang Zhu and Jinhua Zhu and Jiaze Chen and Jiangjie Chen and Chengyi Wang and Hongli ...

  56. [75]

    Lost in the Middle: How Language Models Use Long Contexts

    Tom. The NarrativeQA Reading Comprehension Challenge , journal =. 2018 , url =. doi:10.1162/TACL\_A\_00023 , timestamp =

  57. [76]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks , booktitle =

    Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li , editor =. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks , booktitle =. 2025 , url =

  58. [77]

    2024 , journal =

    HybridFlow: A Flexible and Efficient RLHF Framework , author =. 2024 , journal =

  59. [78]

    Narasimhan and Yuan Cao , title =

    Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik R. Narasimhan and Yuan Cao , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

  60. [79]

    Corrective Retrieval Augmented Generation

    Corrective retrieval augmented generation , author=. arXiv preprint arXiv:2401.15884 , year=

  61. [80]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Evolving Generalist Virtual Agents with Generative and Associative Memory , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  62. [81]

    arXiv preprint arXiv:2404.12045 , year=

    RAM: Towards an Ever-Improving Memory System by Learning from Communications , author=. arXiv preprint arXiv:2404.12045 , year=

  63. [82]

    Jia, Zixia and Li, Jiaqi and Kang, Yipeng and Wang, Yuxuan and Wu, Tong and Wang, Quansen and Wang, Xiaobo and Zhang, Shuyi and Shen, Junzhe and Li, Qing and Qi, Siyuan and Liang, Yitao and He, Di and Zheng, Zilong and Zhu, Song-Chun , title=. Trans. Mach. Learn. Res. , volume=. 2025 , url=

  64. [83]

    Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

    Adaptive Preference Optimization with Uncertainty-aware Utility Anchor , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=