LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
MIEB: Massive image embedding benchmark. arXiv preprint arXiv:2504.10471
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.