{"paper":{"title":"RT-H: Action Hierarchies Using Language","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.","cross_cats":["cs.AI"],"primary_cat":"cs.RO","authors_text":"Debidatta Dwibedi, Dorsa Sadigh, Jonathan Tompson, Pierre Sermanet, Quon Vuong, Suneel Belkhale, Ted Xiao, Tianli Ding, Yevgen Chebotar","submitted_at":"2024-03-04T08:16:11Z","abstract_excerpt":"Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., \"pick coke can\" and \"pick an apple\") in multi-task datasets. However, as tasks become more semantically diverse (e.g., \"pick coke can\" and \"pour cup\"), sharing data between tasks becomes harder, so learning to map high-level tasks to actio"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That fine-grained language motion phrases capture shared low-level structure across semantically diverse tasks sufficiently well that predicting them improves downstream action prediction and enables effective language-based correction.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e4cd9ca7d9f39dc9c49fc7a13fe4adc5f791a2bd56de0da0e2cc8adbec4df1ee"},"source":{"id":"2403.01823","kind":"arxiv","version":2},"verdict":{"id":"8b740000-7a5e-4ec6-9195-f24f2cebc662","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T06:49:12.759689Z","strongest_claim":"Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages.","one_line_summary":"RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That fine-grained language motion phrases capture shared low-level structure across semantically diverse tasks sufficiently well that predicting them improves downstream action prediction and enables effective language-based correction.","pith_extraction_headline":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections."},"references":{"count":63,"sample":[{"doi":"","year":2023,"title":"Do as i can, not as i say: Grounding language in robotic affordances","work_id":"162aa552-ee2f-4cc9-b1b6-f951beeed62a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3568162.3578623","year":2023,"title":"“No, to the Right","work_id":"ab15835c-e9db-4f47-8325-d798c6f35c30","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Correcting robot plans with natural language feedback","work_id":"bff9f09b-3261-4eca-be1a-447a04fcbb45","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"URL https://api.semanticscholar.org/CorpusID: 248085271","work_id":"6d0c3ebd-641a-4481-872f-00df32ae5ec0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","ref_index":5,"cited_arxiv_id":"2307.15818","is_internal_anchor":true}],"resolved_work":63,"snapshot_sha256":"aee9348c11220fac061336eef4a2cd7afb74f430ef6ede64bb6495a16cf2f4c5","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"37f1e93178923e324fa86c6fca24ba794483d79cbc5af82b0566d27c73a78a7c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}