Nebula: Do we evaluate vision-language-action agents correctly?
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
MANGO uses Generator, Assessor, and Judge agents to create reusable atomic tasks and fine-grained oracles from natural language, evaluated on LIBERO_10 and RoboCasa benchmarks for comparable failure detection with better localization.
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
citing papers explorer
-
MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation
MemDelta shows agent memory evaluations are confounded by LLM family and embedding model, with RAG often matching full context and self-memory underperforming basic retrieval.
-
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models
MANGO uses Generator, Assessor, and Judge agents to create reusable atomic tasks and fine-grained oracles from natural language, evaluated on LIBERO_10 and RoboCasa benchmarks for comparable failure detection with better localization.
-
FATE-VLA: Failure-aware test generation for vision-language-action models
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie