A Survey of Context Engineering for Large Language Models
Pith reviewed 2026-05-13 20:54 UTC · model grok-4.3
The pith
Context engineering optimizes inputs for LLMs but exposes their weakness in producing sophisticated long-form outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes Context Engineering as a formal discipline that optimizes information payloads for LLMs. It decomposes the discipline into components of context retrieval and generation, processing, and management, then shows their architectural integration in retrieval-augmented generation, memory systems, tool-integrated reasoning, and multi-agent systems. Analysis of the literature reveals a fundamental asymmetry in model capabilities: current LLMs, when augmented by advanced context engineering, show strong proficiency at understanding complex contexts but exhibit pronounced limitations in generating equally sophisticated long-form outputs.
What carries the argument
Context Engineering defined as the systematic optimization of information payloads, organized through a taxonomy of retrieval/generation, processing, and management components and integrated into RAG, memory, and multi-agent architectures; it is this synthesis that surfaces the understanding-generation asymmetry.
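The component split named above can be made concrete with a toy pipeline. Everything below is an illustrative sketch, not anything from the paper: `retrieve` is a bag-of-words ranker standing in for a real retriever, `process` is a crude truncation step, and `ContextManager` is a budget-bounded working memory that evicts its oldest entries.

```python
from dataclasses import dataclass, field

@dataclass
class ContextManager:
    """Management component: bounded working memory under a token budget."""
    budget: int = 200
    entries: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.entries.append(text)
        # Evict oldest entries once the (whitespace-token) budget is exceeded.
        while sum(len(e.split()) for e in self.entries) > self.budget:
            self.entries.pop(0)

def retrieve(query: str, corpus: list) -> list:
    """Retrieval component: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in corpus]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]

def process(docs: list, max_docs: int = 2) -> str:
    """Processing component: keep only the top-ranked documents."""
    return "\n".join(docs[:max_docs])

def assemble_prompt(query: str, corpus: list, memory: ContextManager) -> str:
    """Integration: retrieval -> processing -> management -> final payload."""
    context = process(retrieve(query, corpus))
    memory.add(context)
    history = "\n".join(memory.entries)
    return f"Context:\n{history}\n\nQuestion: {query}"
```

A real system would swap in dense retrieval, learned compression, and persistent memory, but the dataflow — retrieve, process, manage, assemble — is the taxonomy's point.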
If this is right
- Techniques in retrieval-augmented generation and memory systems can reliably improve model handling of complex inputs.
- Multi-agent and tool-integrated systems gain reliability when context management coordinates information across agents.
- Future model development should target the generation side of the asymmetry to enable longer, more coherent outputs.
- The proposed taxonomy supplies a shared structure for designing new context-aware applications.
- Addressing the gap would expand practical uses of LLMs in tasks requiring extended creative or analytical output.
Where Pith is reading between the lines
- Training objectives may need explicit weighting toward output generation quality rather than input comprehension alone.
- Direct benchmarks that score input understanding depth against output sophistication could quantify the asymmetry more precisely.
- The asymmetry may extend to multimodal settings, where models process rich inputs but struggle to generate detailed outputs.
- Hybrid workflows could route generation tasks to humans or specialized modules while models manage context.
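A direct benchmark of the kind suggested above could be as simple as scoring the same model on both sides of the asymmetry and reporting the gap. The harness below is a hypothetical sketch: exact-match QA stands in for real understanding metrics, and the length-based `generation_score` is a deliberately crude proxy for output sophistication (a real benchmark would score coherence, structure, and factuality).

```python
def understanding_score(model, context: str, question: str, answer: str) -> float:
    """Context understanding: exact-match QA over the provided context."""
    return 1.0 if model(f"{context}\nQ: {question}") == answer else 0.0

def generation_score(model, prompt: str, min_words: int = 50) -> float:
    """Output-sophistication proxy: length-normalized score in [0, 1]."""
    words = model(prompt).split()
    return min(len(words) / min_words, 1.0)

def asymmetry(model, qa_items, gen_prompts) -> float:
    """Positive values mean understanding outscores generation."""
    u = sum(understanding_score(model, c, q, a) for c, q, a in qa_items) / len(qa_items)
    g = sum(generation_score(model, p) for p in gen_prompts) / len(gen_prompts)
    return u - g
```

With paired tasks drawn from the same documents, a persistently positive `asymmetry` value would quantify the gap the survey describes.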
Load-bearing premise
The claim that the asymmetry is the defining research priority assumes that the authors' selection and interpretation of over 1400 papers accurately reflect the full state of the field without bias.
What would settle it
A controlled comparison showing that current LLMs, equipped with the best context engineering techniques, produce long-form outputs whose complexity and structure match the depth of their input understanding would falsify the asymmetry claim.
Original abstract
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Context Engineering as a formal discipline for optimizing contextual information payloads for LLMs, extending beyond prompt engineering. It presents a taxonomy of foundational components (context retrieval and generation, processing, and management) and their integrations into architectures such as RAG, memory systems, tool-integrated reasoning, and multi-agent systems. Drawing on a synthesis of over 1400 papers, the central claim is that LLMs show strong proficiency in understanding complex contexts but pronounced limitations in generating sophisticated long-form outputs, making this asymmetry a defining priority for future research.
Significance. If the taxonomy is comprehensive and the asymmetry observation is representative of the literature, the survey supplies a unified technical roadmap that can orient both researchers and practitioners working on context-aware AI systems. The explicit identification of the understanding-generation gap, grounded in the reviewed body of work, offers a clear focal point for subsequent efforts to balance LLM capabilities.
Major comments (2)
- [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.
- [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.
Minor comments (3)
- [taxonomy section] The taxonomy diagram (if present) or its textual description would benefit from explicit labels or arrows clarifying the data-flow relationships among the three foundational components and the three system-level integrations.
- [throughout] Terminology consistency: 'context retrieval and generation' is sometimes written with a slash and sometimes as separate items; standardize phrasing across sections for readability.
- [references] A small number of citations appear to be repeated or listed without distinguishing primary from secondary sources; a brief note on citation selection criteria would help.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our survey. We agree that both points raised can strengthen the manuscript and will incorporate revisions to address them explicitly. The changes will enhance transparency without altering the core contributions or taxonomy.
Point-by-point responses
Referee: [research gap / concluding section] The section discussing the research gap and future priorities: the claim that current models exhibit 'pronounced limitations in generating equally sophisticated, long-form outputs' is presented as emerging directly from the literature synthesis, yet the manuscript does not aggregate or cite specific quantitative benchmarks (e.g., performance deltas on long-form generation tasks versus context-understanding tasks) that would make the asymmetry diagnosis more concrete and testable.
Authors: We appreciate this suggestion. The asymmetry observation synthesizes patterns across the reviewed works, where context-understanding benchmarks (e.g., long-context QA and retrieval) consistently show high performance while long-form generation tasks reveal coherence and consistency challenges. In revision, we will expand the concluding section with targeted citations to representative benchmarks, including performance deltas from papers on Needle-in-a-Haystack tests versus long-form writing or summarization evaluations. This will make the claim more concrete and testable while remaining within survey scope. revision: yes
Referee: [introduction / methodology overview] The description of the paper-selection process (implicit in the >1400-paper claim): without an explicit statement of search strategy, inclusion/exclusion criteria, or coverage across sub-areas (e.g., proportion of papers on generation versus retrieval), the representativeness of the synthesis—and therefore the robustness of the asymmetry diagnosis—remains difficult to evaluate.
Authors: We agree that an explicit methodology description will improve evaluability. We will add a dedicated 'Survey Methodology' subsection (or appendix) detailing the search strategy (keywords across arXiv, ACL, and NeurIPS from 2018–2024), inclusion/exclusion criteria (peer-reviewed empirical or technical papers on context techniques), and approximate coverage breakdowns by category (retrieval, processing, management, and integrated systems). This addition will directly support assessment of the synthesis. revision: yes
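The Needle-in-a-Haystack tests mentioned in the first response follow a simple recipe: hide one fact at a chosen depth in long filler text and check whether the model's answer recovers it. A minimal sketch, with illustrative filler and a substring pass criterion (the published benchmark uses different text and scoring):

```python
def make_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)
    within `n_sentences` copies of the filler sentence."""
    sentences = [filler] * n_sentences
    pos = int(depth * n_sentences)
    sentences.insert(pos, needle)
    return " ".join(sentences)

def passed(model_answer: str, expected: str) -> bool:
    """A run passes if the expected fact appears in the model's answer."""
    return expected.lower() in model_answer.lower()
```

Sweeping `depth` and `n_sentences` yields the familiar retrieval heatmap; pairing those scores with long-form writing evaluations on the same contexts is what would make the asymmetry deltas concrete.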
Circularity Check
No significant circularity
Full rationale
This survey synthesizes over 1400 prior works into a taxonomy of context retrieval, processing, management, and system integrations such as RAG and multi-agent setups. The claimed asymmetry between strong context understanding and weaker long-form generation is presented as an observational conclusion from that literature review, with no equations, fitted parameters, formal derivations, or predictions that reduce to the paper's own inputs by construction. All load-bearing steps are descriptive summaries of external research; no self-citation chains or ansatzes are invoked to force the central gap diagnosis.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 24 Pith papers
- From Context to Skills: Can Language Models Learn from Context Skillfully?
  Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
  SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
- From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
  The Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
- Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
  Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.
- Context Training with Active Information Seeking
  Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
- S^2tory: Story Spine Distillation for Movie Script Summarization
  S^2tory uses narratological theory and a Narrative Expert Agent to identify plot nuclei in movie scripts for high-fidelity summarization at 3.5x compression, with strong zero-shot generalization to books.
- CL-bench Life: Can Language Models Learn from Real-Life Context?
  CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
- From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
  Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...
- AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models
  AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.
- Towards Long-horizon Agentic Multimodal Search
  LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
- Contexty: Capturing and Organizing In-situ Thoughts for Context-Aware AI Support
  Contexty captures users' cognitive traces as editable snippets and organizes them to enable more effective, user-controlled context-aware AI collaboration during complex tasks.
- VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
  VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
- Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs
  A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.
- LightThinker++: From Reasoning Compression to Memory Management
  LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
- ExpressEdit: Fast Editing of Stylized Facial Expressions with Diffusion Models in Photoshop
  ExpressEdit delivers fast, artifact-free stylized facial expression editing inside Photoshop via a diffusion model plugin and an accompanying expression database.
- Reflective Context Learning: Studying the Optimization Primitives of Context Space
  Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
- VIP-COP: Context Optimization for Tabular Foundation Models
  VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...
- Towards Agentic Investigation of Security Alerts
  An agentic LLM workflow with overview queries, query selection, evidence extraction, and verdict generation achieves significantly higher accuracy on security alert investigation than direct LLM use.
- Human-Inspired Context-Selective Multimodal Memory for Social Robots
  A new memory system for social robots selectively stores multimodal memories by emotional salience and novelty, achieving 0.506 Spearman correlation in selectivity and up to 13% better Recall@1 in multimodal retrieval.
- CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning
  CodaRAG improves RAG by using a CLS-inspired three-stage pipeline of knowledge consolidation, multi-dimensional associative navigation, and interference elimination, delivering 7-11% gains on GraphRAG-Bench for factua...
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
  Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...
- Context Collapse: Barriers to Adoption for Generative AI in Workplace Settings
  Expert interviews demonstrate that context in generative AI workplace use collapses or rots over time, limiting tool effectiveness and revealing pitfalls in computational context approaches.
- Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants
  Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.
Reference graph
Works this paper leans on
-
[1]
https:// agent-network-protocol.com/specs/communication.html
Anp-agent communication meta-protocol specification(draft). https:// agent-network-protocol.com/specs/communication.html. [Online; accessed 17- July-2025]
work page 2025
-
[2]
S. A. Automating human evaluation of dialogue systems.North American Chapter of the Association for Computational Linguistics, 2022
work page 2022
-
[3]
Samir Abdaljalil, Hasan Kurban, Khalid A. Qaraqe, and E. Serpedin. Theorem-of-thought: A multi- agent framework for abductive, deductive, and inductive reasoning in language models. arXiv preprint, 2025
work page 2025
-
[4]
Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, and Adam Jatowt. Rankify: A comprehensive python toolkit for retrieval, re-ranking, and retrieval-augmented genera- tion, arXiv preprint arXiv:2502.02464, 2025. URLhttps://arxiv.org/abs/2502.02464v3
-
[5]
Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matt Stallone, Rameswar Panda, Yara Rizk, G. Bhargav, M. Crouse, Chulaka Gunasekara, S. Ikbal, Sachin Joshi, Hima P. Karanam, Vineet Kumar, Asim Munawar, S. Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, P. Venkateswaran, Merve Unuvar, David Cox, S. Roukos, Luis A...
work page 2024
-
[6]
Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya
D. Acharya, Karthigeyan Kuppan, and Divya Bhaskaracharya. Agentic ai: Autonomous intelligence for complex goals—a comprehensive survey.IEEE Access, 2025
work page 2025
-
[7]
Tallyqa: Answering complex counting questions
Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. AAAI Conference on Artificial Intelligence, 2018
work page 2018
-
[8]
Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024
Shantanu Acharya, Fei Jia, and Boris Ginsburg. Star attention: Efficient llm inference over long sequences, arXiv preprint arXiv:2411.17116, 2024. URLhttps://arxiv.org/abs/2411. 17116v3. 59
-
[9]
Emre Can Acikgoz, Jeremy Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tur, and Gokhan Tur. Can a single model master both multi-turn conversations and tool use? coalm: A unified conversational agentic language model, arXiv preprint arXiv:2502.08820, 2025. URLhttps://arxiv.org/abs/2502.08820v3
-
[10]
Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani- Tur, and Gokhan Tur. A desideratum for conversational agents: Capabilities, challenges, and future directions, arXiv preprint arXiv:2504.16939, 2025. URLhttps://arxiv.org/abs/2504. 16939v1
-
[11]
Anum Afzal, Juraj Vladika, Gentrit Fazlija, Andrei Staradubets, and Florian Matthes. Towards opti- mizing a retrieval augmented generation using large language model on academic data.International Conference on Natural Language Processing and Information Retrieval, 2024
work page 2024
-
[12]
Ankush Agarwal, Sakharam Gawade, A. Azad, and P. Bhattacharyya. Kitlm: Domain-specific knowledge integration into language models for question answering.ICON, 2023
work page 2023
-
[13]
Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Large scale knowledge graph based synthetic corpus generation for knowledge-enhanced language model pre-training. arXiv preprint, 2020
work page 2020
-
[14]
Hegselmann, Hunter Lang, Yoon Kim, and D
Monica Agrawal, S. Hegselmann, Hunter Lang, Yoon Kim, and D. Sontag. Large language models are few-shot clinical information extractors.Conference on Empirical Methods in Natural Language Processing, 2022
work page 2022
-
[15]
Sharif, and Yaser Mohammadi Banadaki
Arash Ahmadi, S. Sharif, and Yaser Mohammadi Banadaki. Mcp bridge: A lightweight, llm-agnostic restful proxy for model context protocol servers, arXiv preprint arXiv:2504.08999, 2025. URL https://arxiv.org/abs/2504.08999v1
-
[16]
J. Ainslie, J. Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr’on, and Sumit K. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.Conference on Empirical Methods in Natural Language Processing, 2023
work page 2023
-
[17]
Multi-agent system concepts theory and application phases
Adel Al-Jumaily. Multi-agent system concepts theory and application phases. arXiv preprint, 2006
work page 2006
-
[18]
Position interpolation improves alibi extrapolation
Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, and Joel Hestness. Position interpolation improves alibi extrapolation. arXiv preprint, 2023
work page 2023
-
[19]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, A. Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricar...
work page 2022
-
[20]
Stefano V. Albrecht and P. Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems.Artificial Intelligence, 2017
work page 2017
-
[21]
Understanding the Challenges and Opportunities of Generative AI Apps: An Empirical Study
Buthayna AlMulla, Maram Assi, and Safwat Hassan. Understanding the challenges and promises of developing generative ai apps: An empirical study, arXiv preprint arXiv:2506.16453, 2025. URL https://arxiv.org/abs/2506.16453v2. 60
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Reem S. Alsuhaibani, Christian D. Newman, M. J. Decker, Michael L. Collard, and Jonathan I. Maletic. On the naming of methods: A survey of professional developers.International Conference on Software Engineering, 2021
work page 2021
-
[23]
Francesco Alzetta, P. Giorgini, A. Najjar, M. Schumacher, and Davide Calvaresi. In-time explainability in multi-agent systems: Challenges, opportunities, and roadmap.EXTRAAMAS@AAMAS, 2020
work page 2020
-
[24]
Kenza Amara, Lukas Klein, Carsten T. Lüth, Paul F. Jäger, Hendrik Strobelt, and Mennatallah El- Assady. Why context matters in vqa and reasoning: Semantic interventions for vlm input modalities, arXiv preprint arXiv:2410.01690v1, 2024. URLhttps://arxiv.org/abs/2410.01690v1
-
[25]
Xavier Amatriain. Prompt design and engineering: Introduction and advanced methods, arXiv preprint arXiv:2401.14423, 2024. URLhttps://arxiv.org/abs/2401.14423v4
-
[26]
Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024
Zahra Aminiranjbar, Jianan Tang, Qiudan Wang, Shubha Pant, and Mahesh Viswanathan. Dawn: Designing distributed agents in a worldwide network, arXiv preprint arXiv:2410.22339, 2024. URL https://arxiv.org/abs/2410.22339v3
-
[27]
Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. Why does the effective context length of llms fall short?International Conference on Learning Representations, 2024
work page 2024
-
[28]
Kaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Shuzheng Si, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, et al. Thread: A logic-based data organization paradigm for how-to question answering with retrieval augmented generation.arXiv preprint arXiv:2406.13372, 2024
-
[29]
Nissist: An incident mitigation copilot based on troubleshooting guides
Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, et al. Nissist: An incident mitigation copilot based on troubleshooting guides. In Proceedings of the 27th European Conference on Artificial Intelligence (ECAI 2024), pages 4471–4474, 2024
work page 2024
-
[30]
Ultraif: Advancing instruction following from the wild
Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, and Baobao Chang. Ultraif: Advancing instruction following from the wild. pages 7930–7957, 2025
work page 2025
-
[31]
Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, and Paul Hongsuck Seo. Lcirc: A recurrent compression approach for efficient long-form context and query dependent modeling in llms.North American Chapter of the Association for Computational Linguistics, 2025
work page 2025
-
[32]
Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurélien Lucchi, and Thomas Hof- mann. Dynamic context pruning for efficient and interpretable autoregressive transformers.Neural Information Processing Systems, 2023
work page 2023
-
[33]
John R. Anderson, M. Matessa, and C. Lebiere. Act-r: A theory of higher level cognition and its relation to visual attention.Hum. Comput. Interact., 1997
work page 1997
-
[34]
Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022
Jacob Andreas. Language models as agent models.Conference on Empirical Methods in Natural Language Processing, 2022
work page 2022
-
[35]
Baldoni, and Leonardo Querzoni
Leonardo Aniello, R. Baldoni, and Leonardo Querzoni. Adaptive online scheduling in storm.Dis- tributed Event-Based Systems, 2013. 61
work page 2013
-
[36]
arXiv preprint arXiv:2407.04363 , year =
Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, M. Burtsev, and Evgeny Burnaev. Arigraph: Learning knowledge graph world models with episodic memory for llm agents, arXiv preprint arXiv:2407.04363, 2024. URLhttps://arxiv.org/abs/2407.04363v3
-
[37]
Introducing the model context protocol, November 2024
Anthropic. Introducing the model context protocol, November 2024. URL https://www. anthropic.com/news/model-context-protocol. [Online; accessed 17-July-2025]
work page 2024
-
[38]
RM Aratchige and Dr. Wmks Ilmini. Llms working in harmony: A survey on the technological aspects of building effective llm-based multi agent systems, arXiv preprint arXiv:2504.01963, 2025. URL https://arxiv.org/abs/2504.01963v1
-
[39]
Leo Ardon, Daniel Furelos-Blanco, and A. Russo. Learning reward machines in cooperative multi- agent tasks.AAMAS Workshops, 2023
work page 2023
- [40]
-
[41]
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.International Conference on Learning Representations, 2023
work page 2023
-
[42]
Hikaru Asano, Tadashi Kozuno, and Yukino Baba. Self iterative label refinement via robust unla- beled learning, arXiv preprint arXiv:2502.12565, 2025. URLhttps://arxiv.org/abs/2502. 12565v1
-
[43]
Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, and Bing Xiang. Bifurcated attention: Accelerating massively parallel decoding with shared prefixes in llms, arXiv preprint arXiv:2403.08845, 2024. URLhttps://arxi...
-
[44]
Avinash Ayalasomayajula, Rui Guo, Jingbo Zhou, Sujan Kumar Saha, and Farimah Farahmandi. Lasp: Llm assisted security property generation for soc verification.Workshop on Machine Learning for CAD, 2024
work page 2024
-
[45]
Aytes, Jinheon Baek, and Sung Ju Hwang
Simon A. Aytes, Jinheon Baek, and Sung Ju Hwang. Sketch-of-thought: Efficient llm reasoning with adaptive cognitive-inspired sketching. arXiv preprint, 2025
work page 2025
-
[46]
Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorgpour, A. Kazerouni, I. Rekik, and D. Merhof. Foundational models in medical imaging: A comprehensive survey and future vision, arXiv preprint arXiv:2310.18689, 2023. URLhttps://arxiv.org/abs/2310.18689v1
-
[47]
Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. Transformers for tabular data representation: A survey of models and applications.Transactions of the Association for Computational Linguistics, 2023
work page 2023
-
[48]
Chandrasekaran, Silviu Cucerzan, Allen Herring, and S
Jinheon Baek, N. Chandrasekaran, Silviu Cucerzan, Allen Herring, and S. Jauhar. Knowledge- augmented large language models for personalized contextual query suggestion.The Web Conference, 2023. 62
work page 2023
-
[49]
Tianyi Bai, Hao Liang, Binwang Wan, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Conghui He, Binhang Yuan, and Wentao Zhang. A survey of multimodal large language model from a data-centric perspective, arXiv preprint arXiv:2405.16640v2, 2024. URLhttps://arxiv.org/abs/2405. 16640v2
-
[50]
Citrus: Chunked instruction-aware state eviction for long sequence modeling
Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, and Jackie Chi Kit Cheung. Citrus: Chunked instruction-aware state eviction for long sequence modeling. Conference on Empirical Methods in Natural Language Processing, 2024
work page 2024
-
[51]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[52]
Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marccal Rusinol, O. R. Terrades, and Josep Llad’os. Globaldoc: A cross-modal vision-language framework for real-world document image retrieval and classification.IEEE Workshop/Winter Conference on Applications of Computer Vision, 2023
work page 2023
-
[53]
Purushothaman, and Renuka Sindhgatta
JayachanduBandlamudi,K.Mukherjee,PrernaAgarwal,SampathDechu,SiyuHuo,VatcheIsahagian, Vinod Muthusamy, N. Purushothaman, and Renuka Sindhgatta. Towards hybrid automation by bootstrappingconversationalinterfacesforitoperationtasks. AAAIConferenceonArtificialIntelligence , 2023
work page 2023
-
[54]
Pimplikar, Sampath Dechu, Alex Straley, Anbumunee Ponniah, and Renuka Sindhgatta
Jayachandu Bandlamudi, Kushal Mukherjee, Prerna Agarwal, Ritwik Chaudhuri, R. Pimplikar, Sampath Dechu, Alex Straley, Anbumunee Ponniah, and Renuka Sindhgatta. Building conversational artifacts to enable digital assistant for apis and rpas.AAAI Conference on Artificial Intelligence, 2024
work page 2024
-
[55]
Keqin Bao, Jizhi Zhang, Xinyu Lin, Yang Zhang, Wenjie Wang, and Fuli Feng. Large language models for recommendation: Past, present, and future.Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024
work page 2024
-
[56]
Sara Di Bartolomeo, Giorgio Severi, V. Schetinger, and Cody Dunne. Ask and you shall receive (a graph drawing): Testing chatgpt’s potential to apply graph layout algorithms.Eurographics Conference on Visualization, 2023
work page 2023
-
[57]
Saikat Barua. Exploring autonomous agents through the lens of large language models: A review, arXiv preprint arXiv:2404.04442, 2024. URLhttps://arxiv.org/abs/2404.04442v1
-
[58]
KinjalBasu, IbrahimAbdelaziz, KelseyBradford, M.Crouse, KiranKate, SadhanaKumaravel, Saurabh Goyal,AsimMunawar,YaraRizk,XinWang,LuisA.Lastras,andP.Kapanipathi. Nestful: Abenchmark for evaluating llms on nested sequences of api calls, arXiv preprint arXiv:2409.03797, 2024. URL https://arxiv.org/abs/2409.03797v3. 63
[59]
Amin Beheshti. Natural language-oriented programming (nlop): Towards democratizing software creation. 2024 IEEE International Conference on Software Services Engineering (SSE), 2024
[60]
[61]
Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. International Conference on Learning Representations, 2024
[62]
Assaf Ben-Kish, Itamar Zimerman, M. J. Mirza, James R. Glass, Leonid Karlinsky, and Raja Giryes. Overflow prevention enhances long-context recurrent llms. arXiv preprint, 2025
[63]
M. Benna and Stefano Fusi. Complex synapses as efficient memory systems. BMC Neuroscience, 2015
[64]
M. Benna and Stefano Fusi. Computational principles of biological memory, arXiv preprint arXiv:1507.07580, 2015. URL https://arxiv.org/abs/1507.07580v1
[65]
Shelly Bensal, Umar Jamil, Christopher Bryant, M. Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, and Waseem Alshikh. Reflect, retry, reward: Self-improving llms via reinforcement learning, arXiv preprint arXiv:2505.24726, 2025. URL https://arxiv.org/abs/2505.24726v1
[66]
Idoia Berges, J. Bermúdez, A. Goñi, and A. Illarramendi. Semantic web technology for agent communication protocols. Extended Semantic Web Conference, 2008
[67]
Gaurav Beri and Vaishnavi Srivastava. Advanced techniques in prompt engineering for large language models: A comprehensive study. 2024 IEEE 4th International Conference on ICT in Business Industry & Government (ICTBIG), 2024
[68]
Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew R. Gormley. Unlimiformer: Long-range transformers with unlimited length input. Neural Information Processing Systems, 2023
[69]
Maciej Besta, Nils Blach, Aleš Kubíček, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, H. Niewiadomski, P. Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. AAAI Conference on Artificial Intelligence, 2023
[70]
Gregor Betz and Kyle Richardson. Judgment aggregation, discursive dilemma and reflective equilibrium: Neural language models as self-improving doxastic agents. Frontiers in Artificial Intelligence, 2022
[71]
L. Bezalel, Eyal Orgad, and Amir Globerson. Teaching models to improve on tape. AAAI Conference on Artificial Intelligence, 2024
[72]
Umang Bhatt, Sanyam Kapoor, Mihir Upadhyay, Ilia Sucholutsky, Francesco Quinzan, Katherine M. Collins, Adrian Weller, Andrew Gordon Wilson, and Muhammad Bilal Zafar. When should we orchestrate multiple agents?, arXiv preprint arXiv:2503.13577, 2025. URL https://arxiv.org/abs/2503.13577v1
[73]
Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. ACL 2025, 2024
[74]
Baolong Bi, Shenghua Liu, Lingrui Mei, Yiwei Wang, Pengliang Ji, and Xueqi Cheng. Decoding by contrasting knowledge: Enhancing llms’ confidence on edited facts. ACL 2025, 2024
[75]
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, and Xueqi Cheng. Lpnl: Scalable link prediction with large language models. ACL 2024, 2024
[76]
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Junfeng Fang, and Xueqi Cheng. Struedit: Structured outputs enable the fast and accurate knowledge editing for large language models. 2024
[77]
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Hongcheng Gao, Yilong Xu, and Xueqi Cheng. Adaptive token biaser: Knowledge editing via biasing key entities. EMNLP 2024, 2024
[78]
Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, and Xueqi Cheng. Refinex: Learning to refine pre-training data at scale from expert-guided programs. 2025
[79]
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, Junfeng Fang, Hongcheng Gao, Shiyu Ni, and Xueqi Cheng. Is factuality enhancement a free lunch for llms? better factuality can lead to worse context-faithfulness. ICLR 2025, 2025
[80]
Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, and Xueqi Cheng. Parameters vs. context: Fine-grained control of knowledge reliance in language models. 2025