Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474
Canonical reference. 75% of citing Pith papers cite this work as background.
citing papers explorer
- MemGym: a Long-Horizon Memory Environment for LLM Agents
  MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
- GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
  GroupMemBench is a new benchmark exposing that LLM agent memory systems fail on group conversation properties like speaker-grounded tracking and audience-adapted responses, with top systems at 46% accuracy.
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
  EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
- Latent Abstraction for Retrieval-Augmented Generation
  LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA benchmarks with better efficiency.
- TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
  TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
- Evaluating the Search Agent in a Parallel World
  Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.
- PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
  PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
- Context Memorization for Efficient Long Context Generation
  Attention-state memory externalizes long prefixes into a lightweight lookup table of precomputed attention states, yielding higher accuracy than standard in-context learning at fixed memory budgets and lower latency than full attention.
- RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation
  RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.
- Conservative Flows: A New Paradigm of Generative Models
  Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow model and showing gains on Swiss-roll, ImageNet-256 and Oxford Flowers-102.
- GazeMind: A Gaze-Guided LLM Agent for Personalized Cognitive Load Assessment
  GazeMind encodes gaze data for LLM reasoning to deliver interpretable, personalized cognitive load predictions that generalize across tasks without fine-tuning and outperform baselines by over 20% on a new 152-person dataset.
- Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
  The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
- CL-bench Life: Can Language Models Learn from Real-Life Context?
  CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.
- Towards Long-horizon Agentic Multimodal Search
  LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.
- SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
  SilverTorch replaces standalone ANN indexing and filtering with a unified GPU model using a model-based Bloom index and fused Int8 ANN kernel, delivering up to 23.7x throughput and 13.35x cost efficiency gains on industry data.
- SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents
  SpecHop accelerates multi-hop LLM tool use via continuous multi-threaded speculation with asynchronous verification, approaching oracle latency gains and reducing latency up to 40% on retrieval tasks.
- Beyond Scaling: Agents Are Heading to the Edge
  Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
- LERA: LLM-Enhanced RAG for Ad Auction in Generative Chatbots
  LERA is a retrieve-then-generate auction system that refines ad candidate ranking with LLM logits and applies a threshold-aware critical-value payment rule to maintain truthfulness in chatbot ad insertion.
- Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
  LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
- Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
  MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
- EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
  EHR-RAGp is a retrieval-augmented EHR foundation model that employs prototype-guided retrieval to dynamically integrate relevant historical patient context, outperforming prior models on clinical prediction tasks.
- MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
  MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
- HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing
  HoReN is a parameter-preserving editor that wraps an MLP with a Hopfield codebook memory and scales to 50K sequential edits on ZsRE while maintaining performance above 0.93.
- Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
  Bian Que is an agentic framework using a unified operational paradigm, flexible Skill Arrangement, and self-evolving mechanism to automate O&M tasks, achieving 75% alert reduction and over 50% MTTR cut in production deployment.
- A Simple Plug-in for Improving Eviction-Based KV Cache Compression
  VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.
- Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On
  Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.
- Agentic Reasoning for Large Language Models
  The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
- Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems
  A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.
- Retrieval-Augmented Linguistic Calibration