Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no performance loss.
Title resolution pending
18 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 18roles
background 3polarities
background 3representative citing papers
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
APWA is a distributed multi-agent architecture that decomposes parallelizable agentic workflows into non-interfering subproblems for scalable execution on heterogeneous resources.
PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.
CANTANTE uses contrastive rollouts to attribute system rewards to individual agents, enabling better prompt optimization than prior methods on programming, math, and QA benchmarks.
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
LLM agents struggle to detect and act on implicit memory conflicts, with top models scoring 55.2% on the new STALE benchmark of 400 scenarios; CUPMem prototype strengthens state-aware revision.
Event-Causal RAG segments videos into events represented as SES graphs, merges them into a causal knowledge graph, and uses bidirectional retrieval to supply relevant event chains to a video foundation model for improved long-video question answering.
SSRP separates planning from execution in LLM agents to overcome the Attention Latch, delivering 715X resilience gains over ReAct baselines on MultiWOZ tasks.
Argus generates GPU kernels achieving 99-104% of hand-optimized throughput on key LLM kernels by enforcing compile-time data-flow invariants via a tag-based DSL and an in-context RL planner.
Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improves small VLMs.
CUE-R uses REMOVE, REPLACE, and DUPLICATE interventions on individual evidence items to quantify their per-item utility in RAG along correctness, grounding faithfulness, and confidence axes.
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
AgenticPosesRanker ranks docking poses using six deterministic physical tools and LLM reasoning, achieving 50% best-pose accuracy that matches the Smina baseline on a balanced 10-system, 162-pose benchmark.
Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.
citing papers explorer
-
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
Framing LLM agent loops as a Context Gathering Decision Process POMDP yields a predicate-based belief state that boosts multi-hop reasoning up to 11.4% and an exhaustion gate that cuts token use up to 39% with no performance loss.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Screening Is Enough
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
-
APWA: A Distributed Architecture for Parallelizable Agentic Workflows
APWA is a distributed multi-agent architecture that decomposes parallelizable agentic workflows into non-interfering subproblems for scalable execution on heterogeneous resources.
-
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.
-
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
CANTANTE uses contrastive rollouts to attribute system rewards to individual agents, enabling better prompt optimization than prior methods on programming, math, and QA benchmarks.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
LLM agents struggle to detect and act on implicit memory conflicts, with top models scoring 55.2% on the new STALE benchmark of 400 scenarios; CUPMem prototype strengthens state-aware revision.
-
Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Event-Causal RAG segments videos into events represented as SES graphs, merges them into a causal knowledge graph, and uses bidirectional retrieval to supply relevant event chains to a video foundation model for improved long-video question answering.
-
Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols
SSRP separates planning from execution in LLM agents to overcome the Attention Latch, delivering 715X resilience gains over ReAct baselines on MultiWOZ tasks.
-
ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
Argus generates GPU kernels achieving 99-104% of hand-optimized throughput on key LLM kernels by enforcing compile-time data-flow invariants via a tag-based DSL and an in-context RL planner.
-
Gym-Anything: Turn any Software into an Agent Environment
Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improves small VLMs.
-
CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation
CUE-R uses REMOVE, REPLACE, and DUPLICATE interventions on individual evidence items to quantify their per-item utility in RAG along correctness, grounding faithfulness, and confidence axes.
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and SearchQA while showing some skills remain externally useful.
-
AgenticPosesRanker: An Agentic AI Framework for Physically Grounded Ranking of Protein-Ligand Docking Poses
AgenticPosesRanker ranks docking poses using six deterministic physical tools and LLM reasoning, achieving 50% best-pose accuracy that matches the Smina baseline on a balanced 10-system, 162-pose benchmark.
-
Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction
Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.