hub Canonical reference

Any-point Trajectory Modeling for Policy Learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao · 2023 · cs.RO · arXiv 2401.00025

Canonical reference. 100% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

Learning from demonstration is a powerful method for teaching robots new skills, and having more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Across over 130 language-conditioned tasks we evaluated in both simulation and the real world, ATM outperforms strong video pre-training baselines by 80% on average. Furthermore, we show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9

citation-polarity summary

background 9

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

cs.RO · 2025-05-19 · unverdicted · novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

EgoInfinity is a modular pipeline that lifts in-the-wild RGB videos into agent-agnostic 4D hand-object data with interaction-aware refinement and retargets motions to diverse robot morphologies for video-to-action learning.

Unified Motion-Action Modeling for Heterogeneous Robot Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

cs.CV · 2026-06-11 · unverdicted · novelty 6.0

MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.

GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

cs.RO · 2026-06-08 · unverdicted · novelty 6.0

GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.

Revisiting Articulated Parts Perception in Robot Manipulation

cs.RO · 2026-06-06 · unverdicted · novelty 6.0

Proposes GPS representation for articulated parts, uses VR to annotate 41K frames across 234 objects, trains an RGB-D model, and achieves 73% success in heuristic manipulation policies on 9 objects.

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.

LACE: Latent Visual Representation for Cross-Embodiment Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.

BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances

cs.RO · 2026-04-25 · unverdicted · novelty 6.0

BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.

GazeVLA: Learning Human Intention for Robotic Manipulation

cs.RO · 2026-04-24 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.

AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation

cs.RO · 2025-10-01 · unverdicted · novelty 6.0

AFFORD2ACT distills a minimal set of affordance-guided 2D keypoints from text and a single image to train a 38-dimensional gated transformer policy that achieves 82% success on unseen objects and scenes.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

cs.CV · 2025-07-06 · unverdicted · novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.

citing papers explorer

Showing 36 of 36 citing papers after filters.

Point Tracking Improves World Action Models cs.RO · 2026-05-22 · unverdicted · none · ref 47 · internal anchor
JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation cs.RO · 2026-05-21 · unverdicted · none · ref 46 · internal anchor
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction cs.CV · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.
Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO · 2026-04-30 · unverdicted · none · ref 97 · internal anchor
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities cs.LG · 2026-04-16 · unverdicted · none · ref 70 · internal anchor
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models cs.RO · 2025-05-19 · unverdicted · none · ref 63 · internal anchor
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots cs.RO · 2026-06-26 · unverdicted · none · ref 55 · internal anchor
A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.
EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning cs.RO · 2026-06-16 · unverdicted · none · ref 7 · internal anchor
EgoInfinity is a modular pipeline that lifts in-the-wild RGB videos into agent-agnostic 4D hand-object data with interaction-aware refinement and retargets motions to diverse robot morphologies for video-to-action learning.
Unified Motion-Action Modeling for Heterogeneous Robot Learning cs.RO · 2026-06-15 · unverdicted · none · ref 13 · internal anchor
UMA treats object motion and robot actions as co-evolving variables under a masked generative objective with hindsight relabeling and contrastive disentanglement to support multi-task pretraining and deployment across heterogeneous robot data.
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models cs.CV · 2026-06-11 · unverdicted · none · ref 51 · internal anchor
MaskWAM unifies mask prompting and prediction in world-action models via Mixture of Transformers to improve robotic policy generalization on language-ambiguous tasks.
GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation cs.RO · 2026-06-08 · unverdicted · none · ref 30 · internal anchor
GHOST improves generalization in robot manipulation via hierarchical factorization into 3D sub-goal prediction from RGB-D views and a goal-conditioned low-level controller, enabling human video integration without action retargeting.
Revisiting Articulated Parts Perception in Robot Manipulation cs.RO · 2026-06-06 · unverdicted · none · ref 40 · internal anchor
Proposes GPS representation for articulated parts, uses VR to annotate 41K frames across 234 objects, trains an RGB-D model, and achieves 73% success in heuristic manipulation policies on 9 objects.
Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations cs.RO · 2026-05-21 · unverdicted · none · ref 32 · internal anchor
A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances cs.RO · 2026-04-25 · unverdicted · none · ref 3 · internal anchor
BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.
GazeVLA: Learning Human Intention for Robotic Manipulation cs.RO · 2026-04-24 · unverdicted · none · ref 65 · internal anchor
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation cs.CV · 2026-04-13 · unverdicted · none · ref 47 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation cs.RO · 2026-01-11 · unverdicted · none · ref 117 · internal anchor
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views cs.CV · 2025-11-17 · unverdicted · none · ref 50 · internal anchor
Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.
AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation cs.RO · 2025-10-01 · unverdicted · none · ref 8 · internal anchor
AFFORD2ACT distills a minimal set of affordance-guided 2D keypoints from text and a single image to train a 38-dimensional gated transformer policy that achieves 82% success on unseen objects and scenes.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge cs.CV · 2025-07-06 · unverdicted · none · ref 48 · internal anchor
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.
FLARE: Robot Learning with Implicit World Modeling cs.RO · 2025-05-21 · unverdicted · none · ref 51 · internal anchor
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets cs.RO · 2025-04-03 · unverdicted · none · ref 44 · internal anchor
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generalizable policies than imitation learning alone.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models cs.CV · 2025-03-27 · unverdicted · none · ref 64 · internal anchor
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
Physically Interpretable World Models via Weakly Supervised Representation Learning cs.LG · 2024-12-17 · unverdicted · none · ref 60 · internal anchor
PIWM aligns latent states in image-based world models with physical variables and constrains their dynamics to known equations via weak distribution supervision, yielding accurate long-horizon predictions and parameter recovery on Cart Pole, Lunar Lander, and Donkey Car.
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning cs.RO · 2024-11-07 · unverdicted · none · ref 55 · internal anchor
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation cs.RO · 2024-10-08 · unverdicted · none · ref 57 · internal anchor
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation cs.RO · 2024-09-24 · unverdicted · none · ref 47 · internal anchor
Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model cs.CV · 2026-06-25 · unverdicted · none · ref 21 · internal anchor
DexAC-WM improves FID, FVD, and PCK in high-DoF action-conditioned video prediction via structured action modeling and semantic grounding on EgoDex and EgoVerse.
OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation cs.RO · 2026-05-25 · unverdicted · none · ref 52 · internal anchor
OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos cs.RO · 2026-05-24 · unverdicted · none · ref 27 · internal anchor
HumanEgo reports 92.5% average success on four real robot tasks using only 15-30 minutes of human video per task and zero robot data, with zero-shot transfer to new robots and cameras.
PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models cs.RO · 2026-05-19 · unverdicted · none · ref 28 · internal anchor
PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.
FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy cs.RO · 2026-05-15 · unverdicted · none · ref 10 · 2 links · internal anchor
FocalPolicy introduces frequency-optimized chunking and locally anchored flow matching with a foresight composite objective to reduce inter-chunk discontinuities in visuomotor policies.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO · 2026-04-20 · unverdicted · none · ref 39 · internal anchor
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
GR-3 Technical Report cs.RO · 2025-07-21 · unverdicted · none · ref 72 · internal anchor
GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

Any-point Trajectory Modeling for Policy Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer