Recognition: 3 theorem links · Lean Theorem
3D-VLA: A 3D Vision-Language-Action Generative World Model
Pith reviewed 2026-05-13 18:13 UTC · model grok-4.3
The pith
3D-VLA connects 3D perception to robot actions by embedding a generative world model inside a language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
3D-VLA is built on a 3D-based large language model with interaction tokens to engage the environment, and embodied diffusion models aligned to it for predicting goal images and point clouds. This creates a generative world model that links 3D perception, reasoning, and action, trained on a curated 3D embodied instruction dataset from existing robotics data. Experiments show significant improvements in reasoning, multimodal generation, and planning capabilities in embodied environments.
What carries the argument
A 3D large language model augmented with interaction tokens and aligned embodied diffusion models that generate future goal images and point clouds.
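To make that machinery concrete, here is a minimal sketch of how such a pipeline could be wired together, using a generic transformer encoder as a stand-in for the 3D LLM and simple linear heads as stand-ins for the diffusion decoder and action predictor; all module names, token conventions, and shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: an LLM-style encoder reads scene + instruction tokens,
# a goal decoder "imagines" a future observation, and an action head plans toward it.
# Stand-ins: TransformerEncoder for the 3D LLM, a Linear layer for the diffusion decoder.
import torch
import torch.nn as nn

class WorldModelVLA(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_actions=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)        # stand-in for the 3D LLM
        self.goal_decoder = nn.Linear(d_model, 3 * 64 * 64)          # stand-in for a diffusion decoder
        self.action_head = nn.Linear(2 * d_model, n_actions)         # (scene, imagined goal) -> action

    def forward(self, token_ids):
        h = self.llm(self.embed(token_ids))        # contextualized tokens, shape (B, T, d_model)
        scene = h[:, 0]                            # summary of the observed 3D scene
        goal_latent = h[:, -1]                     # latent read out at a special "goal" interaction token
        goal_image = self.goal_decoder(goal_latent)                  # imagined future observation
        action = self.action_head(torch.cat([scene, goal_latent], dim=-1))
        return goal_image, action

model = WorldModelVLA()
tokens = torch.randint(0, 32000, (1, 16))          # fake instruction + scene token ids
goal, act = model(tokens)
print(goal.shape, act.shape)                       # (1, 12288) and (1, 7)
```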
Load-bearing premise
3D information extracted from existing robotics datasets is diverse enough to train a model that generalizes to new environments.
What would settle it
Testing the trained model on a held-out robotics task or physical robot never seen during dataset curation and measuring whether planning success rates exceed those of standard 2D vision-language-action baselines.
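One way to picture that test, as a rough sketch: roll out the trained policy and a 2D VLA baseline on tasks excluded from dataset curation and compare success rates. The rollout function, task names, and rates below are placeholders, not measured results.

```python
# Sketch of the settling experiment: run both policies on tasks never seen during
# dataset curation and compare planning success rates. Rollouts here are stubbed.
import random

ASSUMED_RATES = {"3d_vla": 0.6, "2d_vla_baseline": 0.5}   # placeholders, not reported numbers

def rollout_success(policy, task, seed):
    """Placeholder for executing one episode on a held-out task (simulated or real)."""
    rng = random.Random(hash((policy, task, seed)))
    return rng.random() < ASSUMED_RATES[policy]

held_out_tasks = ["fold_towel", "stack_unseen_blocks", "open_novel_drawer"]

def success_rate(policy, n_episodes=50):
    wins = sum(rollout_success(policy, t, s)
               for t in held_out_tasks for s in range(n_episodes))
    return wins / (len(held_out_tasks) * n_episodes)

for policy in ("3d_vla", "2d_vla_baseline"):
    print(policy, f"{success_rate(policy):.2f}")
# The claim would be supported only if the 3D policy's rate reliably exceeds the baseline's.
```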
Original abstract
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces 3D-VLA, a generative world model for embodied AI that integrates 3D perception, reasoning, and action via a 3D-based LLM augmented with interaction tokens and aligned diffusion models for goal image and point-cloud prediction. A large-scale 3D embodied instruction dataset is curated by extracting 3D information from existing robotics corpora, and experiments on held-in splits are reported to show gains in reasoning, multimodal generation, and planning.
Significance. If the empirical claims are substantiated with quantitative metrics and generalization tests, the work could meaningfully advance embodied foundation models by shifting from direct perception-to-action mappings toward explicit generative world models that support planning via imagined 3D futures. The dataset curation effort is a constructive contribution to the community.
major comments (2)
- [§4] §4 (Experiments): The central claim of 'significant improvements' in reasoning, generation, and planning is supported only by held-in dataset results; no quantitative metrics, baselines, ablation studies, or error analysis are supplied, leaving the magnitude and sources of any gains impossible to assess.
- [§4.3] §4.3 (Evaluation): No out-of-distribution, held-out, or cross-robot tests are described. Because the dataset is extracted from the same robotics sources used for training, observed gains may reflect interpolation within the training support rather than the claimed advantages of the 3D world model for real-world planning under distributional shift.
minor comments (2)
- [Abstract] Abstract: The phrase 'significantly improves' is used without any numerical results or baseline comparisons.
- [§3.2] §3.2: The mechanism by which interaction tokens interface with the embodied environment would benefit from a concrete example or pseudocode.
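Since that comment asks for pseudocode, here is one hedged guess at what an interaction-token interface could look like: the LLM emits tagged spans, and a thin parser dispatches action spans to the environment. The tag names (<scene>, <obj>, <act>) and the Env API are invented for illustration and are not taken from the paper.

```python
# Hypothetical interaction-token interface: the LLM emits tagged spans, and a parser
# turns action spans into structured calls on the environment. Tags and API are assumed.
import re

LLM_OUTPUT = "<scene> kitchen counter </scene> <obj> red mug </obj> <act> pick 0.42 0.10 0.31 </act>"

def parse_interaction_tokens(text):
    """Extract (tag, content) pairs for each <tag>...</tag> span, in order."""
    return re.findall(r"<(scene|obj|act)>\s*(.*?)\s*</\1>", text)

class Env:
    def pick(self, x, y, z):
        print(f"executing pick at ({x:.2f}, {y:.2f}, {z:.2f})")

def execute(text, env):
    for tag, content in parse_interaction_tokens(text):
        if tag == "act":
            verb, *coords = content.split()
            getattr(env, verb)(*map(float, coords))   # dispatch e.g. pick(x, y, z)
        # <scene>/<obj> spans would condition goal generation rather than act directly

execute(LLM_OUTPUT, Env())
```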
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We agree that the experimental evaluation requires more rigorous quantitative support and generalization analysis to substantiate the claims. We have revised the manuscript to address these points and provide point-by-point responses below.
Point-by-point responses
-
Referee: [§4] §4 (Experiments): The central claim of 'significant improvements' in reasoning, generation, and planning is supported only by held-in dataset results; no quantitative metrics, baselines, ablation studies, or error analysis are supplied, leaving the magnitude and sources of any gains impossible to assess.
Authors: We acknowledge that the original submission relied primarily on held-in results and qualitative examples. In the revised manuscript, §4 has been expanded with quantitative metrics (task success rates for planning, accuracy for reasoning, and perceptual quality scores for generation), direct comparisons to baselines including 2D VLA models and non-generative variants, ablation studies on the 3D LLM backbone, interaction tokens, and diffusion alignment modules, and an error analysis subsection that categorizes failure modes and links them to specific model components. revision: yes
-
Referee: [§4.3] §4.3 (Evaluation): No out-of-distribution, held-out, or cross-robot tests are described. Because the dataset is extracted from the same robotics sources used for training, observed gains may reflect interpolation within the training support rather than the claimed advantages of the 3D world model for real-world planning under distributional shift.
Authors: We agree that held-in results alone cannot fully rule out interpolation effects. The revised evaluation now includes a held-out split consisting of novel instruction-object combinations excluded from training but drawn from the same source corpora; 3D-VLA shows consistent gains over baselines on this split, supporting the value of the generative 3D world model. Full cross-robot testing (different hardware platforms) is not feasible within the current revision due to the absence of aligned multi-robot 3D data and would require new collection efforts; we explicitly discuss this limitation and outline it as future work. revision: partial
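For concreteness, a minimal sketch of the kind of held-out split the rebuttal describes: hold out specific instruction-object combinations so that test pairs are novel even though both instructions and objects appear individually in training. The episode records and field names are assumptions.

```python
# Sketch of a combinational hold-out: each (instruction, object) pair appears in
# exactly one split, so test pairs are novel even though the source corpora overlap.
episodes = [
    {"instruction": "pick up", "object": "mug"},
    {"instruction": "pick up", "object": "bowl"},
    {"instruction": "push",    "object": "mug"},
    {"instruction": "push",    "object": "bowl"},
]

held_out_pairs = {("pick up", "bowl"), ("push", "mug")}   # chosen novel combinations

train = [e for e in episodes if (e["instruction"], e["object"]) not in held_out_pairs]
test  = [e for e in episodes if (e["instruction"], e["object"]) in held_out_pairs]

assert not {(e["instruction"], e["object"]) for e in train} & held_out_pairs
print(len(train), "train episodes,", len(test), "held-out episodes")
```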
Circularity Check
No significant circularity: empirical training on external data with no self-referential derivation
full rationale
The paper describes an empirical pipeline: curating a 3D embodied instruction dataset by extracting information from existing robotics corpora, training a 3D-based LLM augmented with interaction tokens, training and aligning embodied diffusion models for goal image/point-cloud prediction, and reporting performance on held-in dataset splits. No equations, uniqueness theorems, or ansatzes are presented that reduce a claimed prediction or result to a quantity defined inside the paper itself. The central claims rest on observed improvements in reasoning/generation/planning metrics rather than any fitted parameter being renamed as a prediction or any self-citation chain substituting for independent justification. This is a standard empirical ML construction whose validity is assessed by external benchmarks, not by internal definitional closure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A 3D-based LLM can be extended with generation abilities by aligning embodied diffusion models for goal image and point cloud prediction.
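To make this assumption concrete, here is a rough sketch of one way "aligning" a diffusion decoder to an LLM could work: learn a projector from the LLM's latent at a generation token into the decoder's conditioning space, supervised by the decoder's reconstruction loss. The stubbed denoising function and the plain MSE objective are assumptions, not the paper's recipe.

```python
# Illustrative alignment step: train a projector from the LLM's latent to the
# conditioning vector expected by a (frozen, stubbed) goal-image diffusion decoder.
import torch
import torch.nn as nn

llm_dim, cond_dim = 512, 256
projector = nn.Linear(llm_dim, cond_dim)                   # the only trained part here

def frozen_diffusion_denoise(noisy_goal, cond):
    """Stub for a pretrained diffusion decoder's denoising step, conditioned on `cond`."""
    return noisy_goal * 0.9 + cond.mean(dim=-1, keepdim=True)

optimizer = torch.optim.Adam(projector.parameters(), lr=1e-4)
for _ in range(3):                                          # toy training loop
    llm_latent = torch.randn(4, llm_dim)                    # hidden state at a goal token (stub)
    goal_target = torch.randn(4, 1)                         # ground-truth goal statistic (stub)
    noisy_goal = goal_target + 0.1 * torch.randn_like(goal_target)
    pred = frozen_diffusion_denoise(noisy_goal, projector(llm_latent))
    loss = nn.functional.mse_loss(pred, goal_target)        # stand-in for the denoising objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    print(float(loss))
```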
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear · "we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment."
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear · "To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities"
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation · unclear · "we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds"
Forward citations
Cited by 27 Pith papers
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
-
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.