EPIC-Bench is a new fine-grained benchmark that shows leading VLMs struggle with multi-target counting, part-whole relations, and affordance detection in real-world embodied visual grounding tasks.
hub Baseline reference
Bridgedata v2: A dataset for robot learning at scale
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
FineVLA unifies robot datasets into 47k fine-grained trajectories, adds a VLM annotator and benchmark, and shows that mixing fine-grained and goal-level instructions improves steerable control without hurting task success.
Instrumented objects boost diffusion policy success in robotic hanger insertion by 14-25 percentage points over vision-only baselines, and augmenting datasets with instrumented expert rollouts lets a vision-only student match the instrumented expert.
GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation tasks.
Survey organizing world models for robotic manipulation into representation families, a functional taxonomy, and infrastructure roles across pretraining, post-training, and inference, while reviewing 34 datasets and evaluation protocols.
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
JoyAI-Sim provides bidirectional Robot-Simulation-Human pathways for aligned model evaluation and data generation in robotics using the JoySim simulator as an evaluation layer and physical consistency filter.
citing papers explorer
-
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.