3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Mixed citation behavior. Most common role is background (64%).
abstract
Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizably usually consumes large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the utilization of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations of each task, and shows excellent generalization abilities in diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods which frequently do, necessitating human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available at https://3d-diffusion-policy.github.io.
representative citing papers
DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
See2Act couples action denoising with viewpoint refinement in a diffusion-based imitation learning policy trained on keyframe-anchored camera poses, recovering informative views under occlusion and improving RLBench performance by up to 34% with zero-shot sim-to-real transfer.
FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
A hardware-free dual-camera capture framework with ChArUco spatial unification and receding-horizon state alignment enables decoupled SE(3) manipulation and SE(2) base trajectories for diffusion policies, yielding 83.8% average success on four long-horizon household tasks.
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet-v2, S3DIS, and nuScenes.
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.
ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.
EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
citing papers explorer
-
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies
TAKO demonstrates real-time adversarial takeover of robotic diffusion policies via reusable universal patches on visual inputs, achieving 100% success in steering attacker-chosen trajectories across multiple tasks, encoders, and diffusion methods.
-
DexCompose: Reusing Dexterous Policies for Multi-Task Manipulation with a Single Hand
DexCompose achieves 77.4% average success on 16 composite dexterous tasks by using role-aware residual composition with explicit finger ownership to combine pretrained policies without destructive interference.
-
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
-
Learning to See While Learning to Act: Diffusion Models for Active Perception in Robot Imitation
See2Act couples action denoising with viewpoint refinement in a diffusion-based imitation learning policy trained on keyframe-anchored camera poses, recovering informative views under occlusion and improving RLBench performance by up to 34% with zero-shot sim-to-real transfer.
-
Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation
FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.
-
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.
-
Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation
A hardware-free dual-camera capture framework with ChArUco spatial unification and receding-horizon state alignment enables decoupled SE(3) manipulation and SE(2) base trajectories for diffusion policies, yielding 83.8% average success on four long-horizon household tasks.
-
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
A hypernetwork generates complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.
-
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
-
SkiP: When to Skip and When to Refine for Efficient Robot Manipulation
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
-
DSSP: Diffusion State Space Policy with Full-History Encoding
DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
Information Filtering via Variational Regularization for Robot Manipulation
Variational Regularization imposes an adaptive information bottleneck on noisy intermediate features in DP3-UNet and DP3-DiT policies, consistently raising task success rates on RoboTwin2.0, Adroit, and MetaWorld while achieving new state-of-the-art results.
-
TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance
TouchGuide improves contact-rich robot manipulation by steering diffusion or flow-matching visuomotor policies with tactile feasibility scores from a contrastively trained Contact Physical Model.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
Chronos: A Physics-Informed Full-History Framework for Non-Markovian Long-Horizon Manipulation
Chronos elevates full observation history to the policy's latent state via selective SSM tokens and a Schrödinger-inspired acceleration bridge, achieving large gains on memory-dependent robot tasks with fewer parameters.
-
ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation
ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.
-
Geometric Action Model for Robot Policy Learning
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
-
EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment
EmbodiSteer steers embodiment-agnostic Cartesian diffusion policies into joint space with Jacobian-based collision guidance after each denoising step for zero-shot cross-embodiment deployment.
-
Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization
HOWTransfer recovers 3D hand motion from video, localizes contact intervals via hand-object cues, generates multi-modal grasp hypotheses, and edits trajectories to produce diverse robot-executable motions achieving 86% success.
-
Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video
Video2Sim2Real turns a single human video into a deployable robot manipulation skill by reconstructing a digital twin, anchoring motions to object-centric simulator configurations, and bridging sim-to-real gaps with imitation learning and residual RL.
-
MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model
MotionVLA converts short past video windows into compact trajectory-field tokens to supply motion-consistent evidence for vision-language-action robot policies, improving long-horizon manipulation.
-
Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry
IDP generates one-step robot actions by adaptively weighting a scalar potential objective using conditional expert geometry derived from local variations of observation-similar expert actions, combined with expert-proximal terminal evaluation.
-
Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.
-
PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking
PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.
-
Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal
FGO guides diffusion policy generation via expanding spectral bands on sub-frequency manifolds to improve action smoothness on 15 robotic manipulation tasks.
-
HITL-D: Human In The Loop Diffusion Assisted Shared Control
HITL-D combines diffusion policies with human input for shared robotic control, reducing required joystick axes and improving speed and workload in manipulation tasks per a 12-participant study.
-
HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds
HCLM presents a hierarchical architecture that uses an SE(3)-invariant diffusion policy for coordination and a hybrid whole-body controller with MPC and admittance control for safe closed-chain loco-manipulation on dual quadrupeds.
-
Learning Sim-Grounded Policies for Bimanual Rope Manipulation from Human Teleoperation Data
A simulation-grounded state policy using 3D particle dynamics outperforms an egocentric vision policy by 30.8% in L1 error on unseen rope configurations for bimanual manipulation from limited human data.
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.
-
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
-
ShapeGen: Robotic Data Generation for Category-Level Manipulation
ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
-
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
-
One Hand to Rule Them All: Canonical Representations for Unified Dexterous Manipulation
A unified parameter space and canonical URDF enable cross-embodiment dexterous grasping policies with 81.9% zero-shot success on unseen hands like the 3-finger LEAP Hand.
-
Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation
SCFields fuses semantics and contact data in a sim-to-real pipeline to enable category-level generalization for tactile tool manipulation with diffusion policies.
-
TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
-
ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning
ESPADA uses semantic segmentation from VLMs and LLMs plus DTW to downsample non-critical segments in demonstrations, delivering about 2x faster robot execution in behavior cloning while maintaining task success rates.
-
IGen: Scalable Data Generation for Robot Learning from Open-World Images
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
-
LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.
-
AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
AFFORD2ACT distills a minimal set of affordance-guided 2D keypoints from text and a single image to train a 38-dimensional gated transformer policy that achieves 82% success on unseen objects and scenes.
-
COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy
COMPASS is a manipulation-aware active sensing framework that raises simulated manipulation success rates by 24.25% over information-gain-only baselines in a new four-level confined-space benchmark.
-
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.