Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
31 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
polarities: background 3
representative citing papers
SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.
Rosetta Memory trains two profile-conditioned operators with a minimum-gain sampling curriculum and performance-gap reward to enable memory transfer between LLMs, showing gains on multi-hop QA benchmarks and robustness to unseen models.
CatDT deploys a self-evolving multi-agent system with UniMech and reinforcement learning to build digital twins of heterogeneous catalysts, matching experimental rates within 0.5-2x on seven benchmarks and identifying competitive non-precious candidates for propane dehydrogenation.
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent self-diagnosed bugs and maintained cross-channel context.
MemoPilot trains memory updates for LLM agents via multi-turn GRPO on RPS and poker, achieving top Elo scores and outperforming baselines including DeepSeek-V3.2.
Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.
FluxMem evolves memory as a heterogeneous graph via three refinement stages and reports consistent state-of-the-art results on LoCoMo, Mind2Web, and GAIA benchmarks.
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
EvoMemBench evaluates 15 memory methods for LLM agents and finds long-context baselines competitive with no single memory approach working consistently across settings.
SeqMem-Eval reveals that high final accuracy in sequential LLM memory tasks often coexists with substantial forgetting and negative transfer, exposing stability-adaptability trade-offs hidden by standard aggregate metrics.
Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.
Skill-R1 applies bi-level group-relative policy optimization to evolve skills recurrently from verified outcomes, yielding gains over baselines on multi-step tasks.
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
EvolveR enables LLM agents to self-evolve via a closed loop of distilling interaction trajectories into strategic principles offline and retrieving them to guide online decisions with policy reinforcement, yielding better results on multi-hop QA benchmarks.
Janus is a method-agnostic plug-in that uses a Memory Momentum Trigger and compact hybrid evaluation to selectively accept LLM memory updates, yielding +2.7 to +4.6 accuracy gains over base updaters on six datasets.
CBR integration into R&D-Agent with Gemma 4 31B yields directionally higher accuracy and lower variance than baseline on one of two Kaggle competitions.
MoLEM achieves a 10.40% average accuracy improvement in continual learning tasks across math, science, and code by using dynamic latent memory experts with a frozen base model and stage-specific autoencoders for routing.
citing papers explorer
- Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
- From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).