arXiv preprint arXiv:2108.12617
5 Pith papers cite this work.
Citing papers by year: 2026 (5)
Citing papers
- Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors
  Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies that achieve 96.7% real-robot success on 392 sampled motions.
- Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks
  The paper constructs the VIBE benchmark and evaluates six visual in-context learning models on 14 datasets, 12 tasks, and 106 combinations under a unified one-shot protocol, revealing limitations and failure modes.
- GKDT: General Keypoint Detection Transformer
  The paper creates the MegaKPT dataset and GKDT, a promptable transformer model for general keypoint detection across diverse objects, and reports high accuracy on 22 test sets.
- SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
  SOCO is a new benchmark for semantic object correspondence that provides a taxonomy, annotations, and language labels to evaluate part-level understanding in vision and multimodal foundation models.
- AnyAct: Towards Human Reenactment of Character Motion From Video
  AnyAct generates editable human reenactments from character videos via conditional motion generation from transferable sparse local 2D articulated cues, with designs for human-only supervision and global-local decoupling.