{"work":{"id":"12319725-bc7d-4c32-a229-ad270a7460bc","openalex_id":null,"doi":null,"arxiv_id":"2410.07864","raw_key":null,"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","authors":null,"authors_text":"Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang","year":2024,"venue":"cs.RO","abstract":"Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to https://rdt-robotics.github.io/rdt-robotics/ for the code and videos.","external_url":"https://arxiv.org/abs/2410.07864","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T04:32:56.642190+00:00","pith_arxiv_id":"2410.07864","created_at":"2026-05-09T06:35:38.626519+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","render_title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation"},"hub":{"state":{"work_id":"12319725-bc7d-4c32-a229-ad270a7460bc","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":100,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2024-05-23T01:43:54+00:00","last_pith_cited_at":"2026-05-21T16:14:19+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-27T10:05:50.848444+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":37},{"context_role":"baseline","n":4}],"polarity_counts":[{"context_polarity":"background","n":36},{"context_polarity":"baseline","n":4},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","claims":[{"claim_text":"Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"camera configurations, providing valuable empirical references for designing future 3D policies. 2 Related Work 2.1 2D Policy Learning Early visuomotor policy learning primarily relied on 2D visual inputs, focusing on imitation and end-to-end action prediction from RGB observations. Diffusion-based approaches such as Dif- fusion Policy [7], RDT-1B [27], and CF-SDP [46] demonstrate stable behavior cloning through generative modeling in pixel space, while transformer-based architectures like ACT [","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"RDT-1B [113] SigLIP T5-XXL DiT Xattn DDPM (Aggregated Datasets) Real(ALOHA dual-arm): wash, pour, fold, etc. RT-2 [1]♢ViT-4B, ViT-22B PaLI-X, PaLM-E Symbol- tuning Concat BC (disc), Co- fine-tuning Fractal, VQASim: Language-Table;Real: RT-1 evaluation tasks RT-H [114]♢(Model design follows RT-2) BC (disc) Diverse+KitchenReal: Diverse+Kitchen eval tasks RT-X [115]♢(Models from RT-1 and RT-2) BC (disc) [SC: OXE]Real: BridgeV2, RT-1 evaluation tasks, etc. OpenVLA [35]♢DINOv2, SigLIP Prismatic-7B Sy","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In addition, parallel efforts have sought to extend the scaling laws [19, 20] observed in vision and language domains to the embodied setting, collecting large-scale embodied datasets and training generalist agents end-to-end on top of vision-language foundation models [21, 22, 23]. These diverse approaches have led to a rapid proliferation of VLA models in robotic manipulation [24, 25], navigation [26, 27], and autonomous driving [28, 29, 30], demonstrating promising capabilities in multitask l","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The International Journal of Robotics Research , page 02783649241273668, 2023. 3, 5, 7, 8, 9 [91] Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. arXiv preprint arXiv:2503.19757, 2025. [92] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model f","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Early large-scale systems such as RT-2 [1] and OpenVLA [2] suggest that scaling multimodal backbones can translate into broader task coverage in robotics. At the same time, the field has been actively experimenting with different design choices-from diffusion/flow-based action heads that improve continuous control fidelity (e.g., Octo [3], pi0 [4], RDT [5]), to richer multimodal structures and training signals (e.g., GR-1/GR-2 [6], [7], RoboDreamer [8], and RL-augmented variants [9], [10]). Thes","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"trast, foundation models trained on million-scale, multi-robot corpora exhibit strong zero-shot transfer: RT-1 [6] unifies vision, language, and action in a single transformer for real-time kitchen manipulation; RT-2 [5] jointly finetunes large vision-language models on web and robot data to support semantic planning and object reasoning; diffusion-based RDT-1B [28] andπ[3] learn diverse bimanual dynamics from over a million episodes. Vision-language-action systems such as OpenVLA [20] and CogAC","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (37 contexts).","role_counts":[{"n":37,"context_role":"background"},{"n":4,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-25T04:35:34.470475+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"3d89bbd2-c3f9-4c54-970e-738fdb94468e","orcid":null,"display_name":"Songming Liu"},{"id":"6111ebe8-e393-4fef-894f-0b863ecdc8cd","orcid":null,"display_name":"Lingxuan Wu"},{"id":"0489688f-1759-436f-883c-b8e59741c085","orcid":null,"display_name":"Bangguo Li"},{"id":"660d91b3-d236-41b2-97c4-445842319412","orcid":null,"display_name":"Hengkai Tan"},{"id":"aab6b048-3866-4755-b74e-b63f38edd388","orcid":null,"display_name":"Huayu Chen"},{"id":"971a29c3-ad4a-43df-9241-f85ac3fc14c0","orcid":null,"display_name":"Zhengyi Wang"}]},"error":null,"updated_at":"2026-05-25T04:35:36.351398+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:51:45.049033+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":36},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":35},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":22},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":21},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":20},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":20},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":15},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":14},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":14},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":14},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":14},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":13},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation","work_id":"e92c2c13-4330-45fe-8231-34a6002626bd","shared_citers":11},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":10},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":10},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":10},{"title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","work_id":"3d63039f-41b0-4a31-af31-6fc10f5c1b1b","shared_citers":9},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":9},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":8},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":8},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":8}],"time_series":[{"n":2,"year":2024},{"n":3,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T15:01:36.541927+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:51:41.452992+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","claims":[{"claim_text":"Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"camera configurations, providing valuable empirical references for designing future 3D policies. 2 Related Work 2.1 2D Policy Learning Early visuomotor policy learning primarily relied on 2D visual inputs, focusing on imitation and end-to-end action prediction from RGB observations. Diffusion-based approaches such as Dif- fusion Policy [7], RDT-1B [27], and CF-SDP [46] demonstrate stable behavior cloning through generative modeling in pixel space, while transformer-based architectures like ACT [","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"RDT-1B [113] SigLIP T5-XXL DiT Xattn DDPM (Aggregated Datasets) Real(ALOHA dual-arm): wash, pour, fold, etc. RT-2 [1]♢ViT-4B, ViT-22B PaLI-X, PaLM-E Symbol- tuning Concat BC (disc), Co- fine-tuning Fractal, VQASim: Language-Table;Real: RT-1 evaluation tasks RT-H [114]♢(Model design follows RT-2) BC (disc) Diverse+KitchenReal: Diverse+Kitchen eval tasks RT-X [115]♢(Models from RT-1 and RT-2) BC (disc) [SC: OXE]Real: BridgeV2, RT-1 evaluation tasks, etc. OpenVLA [35]♢DINOv2, SigLIP Prismatic-7B Sy","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In addition, parallel efforts have sought to extend the scaling laws [19, 20] observed in vision and language domains to the embodied setting, collecting large-scale embodied datasets and training generalist agents end-to-end on top of vision-language foundation models [21, 22, 23]. These diverse approaches have led to a rapid proliferation of VLA models in robotic manipulation [24, 25], navigation [26, 27], and autonomous driving [28, 29, 30], demonstrating promising capabilities in multitask l","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The International Journal of Robotics Research , page 02783649241273668, 2023. 3, 5, 7, 8, 9 [91] Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. arXiv preprint arXiv:2503.19757, 2025. [92] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model f","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Early large-scale systems such as RT-2 [1] and OpenVLA [2] suggest that scaling multimodal backbones can translate into broader task coverage in robotics. At the same time, the field has been actively experimenting with different design choices-from diffusion/flow-based action heads that improve continuous control fidelity (e.g., Octo [3], pi0 [4], RDT [5]), to richer multimodal structures and training signals (e.g., GR-1/GR-2 [6], [7], RoboDreamer [8], and RL-augmented variants [9], [10]). Thes","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"trast, foundation models trained on million-scale, multi-robot corpora exhibit strong zero-shot transfer: RT-1 [6] unifies vision, language, and action in a single transformer for real-time kitchen manipulation; RT-2 [5] jointly finetunes large vision-language models on web and robot data to support semantic planning and object reasoning; diffusion-based RDT-1B [28] andπ[3] learn diverse bimanual dynamics from over a million episodes. Vision-language-action systems such as OpenVLA [20] and CogAC","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (37 contexts).","role_counts":[{"n":37,"context_role":"background"},{"n":4,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-25T04:35:34.461793+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","claims":[{"claim_text":"Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T15:01:43.041892+00:00"}},"summary":{"title":"RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation","claims":[{"claim_text":"Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":36},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":35},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":22},{"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","shared_citers":21},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":20},{"title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","work_id":"04f46bb3-4346-47e8-bf09-c75d91f96e87","shared_citers":20},{"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","shared_citers":15},{"title":"Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware","work_id":"6fe159e0-fa73-481a-88d4-4719c15140be","shared_citers":14},{"title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","work_id":"9b985126-4a2f-4bdf-b014-2a7524ec634e","shared_citers":14},{"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","shared_citers":14},{"title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","work_id":"0c5e9314-5fa7-4613-ad12-605a71d561d2","shared_citers":14},{"title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","work_id":"4b158d3e-3dff-4412-85cd-baa879465a5e","shared_citers":13},{"title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","work_id":"843ab5eb-2815-4db8-b3bc-890b23fa5ffa","shared_citers":12},{"title":"Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation","work_id":"e92c2c13-4330-45fe-8231-34a6002626bd","shared_citers":11},{"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","shared_citers":10},{"title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","work_id":"83a8f966-6cfa-4f21-81f3-87440aae238f","shared_citers":10},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":10},{"title":"Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning","work_id":"3d63039f-41b0-4a31-af31-6fc10f5c1b1b","shared_citers":9},{"title":"X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model","work_id":"13faca8d-e96d-4e6c-a441-9f2683d11934","shared_citers":9},{"title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","work_id":"aebf924c-e761-437e-9cee-f1ccc2e427bd","shared_citers":8},{"title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","work_id":"f797e9ec-510f-43a7-8a0c-18009ce332e5","shared_citers":8},{"title":"Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets","work_id":"59e728c0-b6ca-4759-a8f4-02b981f2220f","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":8},{"title":"Gemini Robotics: Bringing AI into the Physical World","work_id":"f7c5ce10-8364-4fbe-964f-2802b81c3a98","shared_citers":8}],"time_series":[{"n":2,"year":2024},{"n":3,"year":2025},{"n":42,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"0489688f-1759-436f-883c-b8e59741c085","orcid":null,"display_name":"Bangguo Li","source":"manual","import_confidence":0.72},{"id":"660d91b3-d236-41b2-97c4-445842319412","orcid":null,"display_name":"Hengkai Tan","source":"manual","import_confidence":0.72},{"id":"aab6b048-3866-4755-b74e-b63f38edd388","orcid":null,"display_name":"Huayu Chen","source":"manual","import_confidence":0.72},{"id":"6111ebe8-e393-4fef-894f-0b863ecdc8cd","orcid":null,"display_name":"Lingxuan Wu","source":"manual","import_confidence":0.72},{"id":"3d89bbd2-c3f9-4c54-970e-738fdb94468e","orcid":null,"display_name":"Songming Liu","source":"manual","import_confidence":0.72},{"id":"971a29c3-ad4a-43df-9241-f85ac3fc14c0","orcid":null,"display_name":"Zhengyi Wang","source":"manual","import_confidence":0.72}]}}