Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching the benefits of an explicit world model while retaining the inference efficiency of a direct VLA.
citation dossier
https://arxiv.org/abs/2409.16283
why this work matters in Pith
Pith has found this work cited in 18 reviewed papers. Its strongest current cluster is cs.RO (14 papers), and the largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
citing papers explorer
- Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching the benefits of an explicit world model while retaining the inference efficiency of a direct VLA.
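The summary only names the mechanism, so here is a minimal sketch of one way "dual-branch alignment on latent queries" could be wired: a policy branch and a world branch both read shared learnable query tokens, and an alignment loss pulls the policy latents toward the (detached) world-branch latents so the policy alone is future-aware at test time. Every module, dimension, and loss weight below is a hypothetical illustration, not Being-H0.7's actual architecture.

```python
# Hypothetical sketch of dual-branch alignment on shared latent queries
# (illustration only; not Being-H0.7's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchPolicy(nn.Module):
    def __init__(self, dim=256, n_queries=8, action_dim=7):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.policy_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.world_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.action_head = nn.Linear(dim, action_dim)
        self.future_head = nn.Linear(dim, dim)

    def forward(self, obs_tokens):
        # obs_tokens: (B, T, dim) features from a vision-language backbone.
        q = self.queries.expand(obs_tokens.size(0), -1, -1)
        z_pi, _ = self.policy_attn(q, obs_tokens, obs_tokens)  # policy-branch latents
        z_wm, _ = self.world_attn(q, obs_tokens, obs_tokens)   # world-branch latents
        action = self.action_head(z_pi.mean(dim=1))
        future = self.future_head(z_wm.mean(dim=1))
        return action, future, z_pi, z_wm

def training_loss(model, obs_tokens, expert_action, future_embed):
    action, future, z_pi, z_wm = model(obs_tokens)
    bc = F.mse_loss(action, expert_action)             # imitation term
    wm = F.mse_loss(future, future_embed)              # future-prediction term
    # Alignment: pull policy latents toward (detached) world latents so the
    # cheap policy branch alone carries future-aware structure at test time.
    align = 1 - F.cosine_similarity(z_pi, z_wm.detach(), dim=-1).mean()
    return bc + wm + 0.1 * align
```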
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
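As a rough picture of the joint forecast-and-value idea, the sketch below bolts a scalar value head onto a stand-in generative trunk so one model emits both a predicted future latent and a task-value estimate. The GRU trunk, shapes, and names are assumptions for illustration, not ViVa's components.

```python
# Hypothetical sketch: reuse a video-generation trunk as a value model by
# adding a value head beside the future-latent head (stand-in modules only).
import torch
import torch.nn as nn

class GenerativeValueModel(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        # Stand-in for a pretrained video-generator backbone over frame latents.
        self.trunk = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.future_head = nn.Linear(latent_dim, latent_dim)  # forecast next latent
        self.value_head = nn.Linear(latent_dim, 1)            # task-value estimate

    def forward(self, frame_latents):
        h, _ = self.trunk(frame_latents)       # (B, T, D) temporal features
        last = h[:, -1]                        # summary of the observed clip
        return self.future_head(last), self.value_head(last).squeeze(-1)

model = GenerativeValueModel()
latents = torch.randn(2, 8, 512)               # 2 clips x 8 frames of latents
next_latent, value = model(latents)            # joint forecast and value
```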
- HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
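The summary names an adaptive gate between predictive and reactive control; a generic version of that pattern, with all layers and sizes assumed for illustration (HarmoWAM's actual gating may differ), looks like this:

```python
# Hedged sketch of adaptive gating: a learned gate blends a slow predictive
# (world-model) action with a fast reactive action, per action dimension.
import torch
import torch.nn as nn

class GatedActionHead(nn.Module):
    def __init__(self, feat_dim=256, action_dim=7):
        super().__init__()
        self.predictive = nn.Linear(feat_dim, action_dim)  # plans from imagined futures
        self.reactive = nn.Linear(feat_dim, action_dim)    # reacts to the current frame
        self.gate = nn.Sequential(nn.Linear(feat_dim, action_dim), nn.Sigmoid())

    def forward(self, feat):
        g = self.gate(feat)                                # (B, action_dim) in [0, 1]
        return g * self.predictive(feat) + (1 - g) * self.reactive(feat)

head = GatedActionHead()
action = head(torch.randn(2, 256))                         # (2, 7) blended commands
```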
- SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.
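One plausible reading of "actor masking and directional reweighting on cross-attention maps", sketched with assumed shapes and an illustrative boost factor (the paper's exact reweighting rule may differ):

```python
# Hedged sketch: scale text-to-video cross-attention scores so each actor's
# text tokens dominate inside that actor's spatiotemporal mask and are
# damped elsewhere. Shapes and the boost factor are illustrative.
import torch

def reweight_cross_attention(scores, actor_masks, token_groups, boost=2.0):
    # scores: (heads, pixels, text_tokens) pre-softmax attention logits
    # actor_masks: (num_actors, pixels) binary masks, one per actor
    # token_groups: token_groups[a] = text-token indices describing actor a
    weight = torch.ones_like(scores)
    for actor, tokens in enumerate(token_groups):
        inside = actor_masks[actor].bool()                       # actor's pixels
        factor = inside.float() * boost + (~inside).float() / boost
        weight[..., tokens] = factor[None, :, None]              # per-token scale
    return (scores * weight).softmax(dim=-1)

attn = reweight_cross_attention(
    torch.randn(8, 64, 10),                 # 8 heads, 64 pixels, 10 tokens
    torch.randint(0, 2, (2, 64)).float(),   # masks for 2 actors
    [[2, 3], [5, 6]],                       # each actor's prompt tokens
)
```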
- GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
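The rollback-and-branch loop the summary describes can be pictured with a toy world model and a stand-in human; every interface below is hypothetical, meant only to show where rollback and branching produce dense corrective pairs:

```python
# Hedged sketch of human-in-the-loop correction inside a learned world model.
import copy
import random

class ToyWorldModel:                       # stand-in for a learned dynamics model
    def reset(self): return 0.0
    def step(self, s, a): return s + a

class ToyHuman:                            # stand-in for the human-in-the-loop
    def flags_failure(self, s): return random.random() < 0.1
    def correct(self, s): return -0.5 * s  # nudge the state back toward 0

def collect_corrections(world_model, policy, human, horizon=50, backtrack=5):
    state = world_model.reset()
    history, corrections = [state], []
    for _ in range(horizon):
        state = world_model.step(state, policy(state))
        history.append(state)
        if human.flags_failure(state):
            # Rollback: restore a state from a few steps before the failure.
            state = copy.deepcopy(history[max(0, len(history) - backtrack - 1)])
            # Branch: the human drives an alternate rollout from that point,
            # producing dense corrective (state, action) pairs for post-training.
            for _ in range(backtrack):
                a = human.correct(state)
                corrections.append((state, a))
                state = world_model.step(state, a)
    return corrections

data = collect_corrections(ToyWorldModel(), lambda s: 0.3, ToyHuman())
```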
- WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
- Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
- World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on CALVIN ABC-D and 31.6% higher real-world success rates.
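The core pattern, an action head conditioned on a video predictor's features rather than raw frames, can be sketched as below; the Conv3d stands in for VPP's fine-tuned video diffusion features, and all sizes are assumptions:

```python
# Hedged sketch: read "predictive" spatiotemporal features and decode an
# action chunk from them. The predictor here is a stub, not VPP's model.
import torch
import torch.nn as nn

class PredictiveFeaturePolicy(nn.Module):
    def __init__(self, feat_dim=512, action_dim=7, chunk=8):
        super().__init__()
        # Stand-in for a video prediction model's internal feature extractor.
        self.predictor = nn.Conv3d(3, feat_dim, kernel_size=(2, 8, 8), stride=(2, 8, 8))
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(feat_dim, action_dim * chunk),
        )
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, frames):                    # frames: (B, 3, T, H, W)
        feats = self.predictor(frames)            # predictive spatiotemporal features
        return self.action_head(feats).view(-1, self.chunk, self.action_dim)

policy = PredictiveFeaturePolicy()
actions = policy(torch.randn(1, 3, 4, 64, 64))   # -> (1, 8, 7) action chunk
```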
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
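A minimal DDPM-style version of a conditioned diffusion action head, where the conditioning vector stands in for CogACT's VLM features; the network, noise schedule, and sizes are illustrative assumptions, not the paper's transformer:

```python
# Hedged sketch: train a small network to denoise actions conditioned on a
# vision-language embedding (standard DDPM training step).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionActionHead(nn.Module):
    def __init__(self, cond_dim=512, action_dim=7, steps=100):
        super().__init__()
        self.steps = steps
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("abar", torch.cumprod(1 - betas, dim=0))
        self.net = nn.Sequential(                        # predicts the added noise
            nn.Linear(action_dim + cond_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, action_dim),
        )

    def loss(self, action, cond):
        t = torch.randint(0, self.steps, (action.size(0),), device=action.device)
        a = self.abar[t].unsqueeze(-1)
        noise = torch.randn_like(action)
        noisy = a.sqrt() * action + (1 - a).sqrt() * noise   # forward diffusion
        t_feat = (t.float() / self.steps).unsqueeze(-1)      # scalar timestep feature
        pred = self.net(torch.cat([noisy, cond, t_feat], dim=-1))
        return F.mse_loss(pred, noise)

head = DiffusionActionHead()
loss = head.loss(torch.randn(4, 7), torch.randn(4, 512))    # cond = VLM features
```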
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
The survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
- Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
- World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.