Nebula: Do we evaluate vision-language-action agents correctly?

2025 · arXiv 2510.16263

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

MemDelta shows agent memory evaluations are confounded by LLM family and embedding model, with RAG often matching full context and self-memory underperforming basic retrieval.

MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models

cs.SE · 2026-06-23 · unverdicted · novelty 6.0

MANGO uses Generator, Assessor, and Judge agents to create reusable atomic tasks and fine-grained oracles from natural language, evaluated on LIBERO_10 and RoboCasa benchmarks for comparable failure detection with better localization.

FATE-VLA: Failure-aware test generation for vision-language-action models

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.

VISOR: A Vision-Language Model-based Test Oracle for Testing Robots

cs.SE · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie

citing papers explorer

Showing 4 of 4 citing papers.

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation · cs.CL · 2026-06-29 · unverdicted · none · ref 12
MANGO: Automated Multi-Agent Test Oracle Generation for Vision-Language-Action Models · cs.SE · 2026-06-23 · unverdicted · none · ref 24
FATE-VLA: Failure-aware test generation for vision-language-action models · cs.RO · 2026-06-01 · unverdicted · none · ref 16
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots · cs.SE · 2026-05-11 · unverdicted · none · ref 45 · 2 links
