{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:FYN4RFFBVFSWUUUGMZ3R5ZSEP2","short_pith_number":"pith:FYN4RFFB","schema_version":"1.0","canonical_sha256":"2e1bc894a1a9656a528666771ee6447eb469f837ee05bea1f488a7363f760f38","source":{"kind":"arxiv","id":"2403.01823","version":2},"attestation_state":"computed","paper":{"title":"RT-H: Action Hierarchies Using Language","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.","cross_cats":["cs.AI"],"primary_cat":"cs.RO","authors_text":"Debidatta Dwibedi, Dorsa Sadigh, Jonathan Tompson, Pierre Sermanet, Quon Vuong, Suneel Belkhale, Ted Xiao, Tianli Ding, Yevgen Chebotar","submitted_at":"2024-03-04T08:16:11Z","abstract_excerpt":"Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., \"pick coke can\" and \"pick an apple\") in multi-task datasets. However, as tasks become more semantically diverse (e.g., \"pick coke can\" and \"pour cup\"), sharing data between tasks becomes harder, so learning to map high-level tasks to actio"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2403.01823","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2024-03-04T08:16:11Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"6ce889699add44e9d8826eff1f6ba9e286a1dd915a3843ad177b00c773c46637","abstract_canon_sha256":"92b344e6172175696b4c494f461c8d89070242449cfbc5c528c263a6933d5a7a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.779553Z","signature_b64":"c5vtWYFCcHCub3drw/fdCwVuDuouMd+HOlG9feJXBrxB1sUWyDArYg/S82K5VP8UguOHIMfu/HX4SUmdPuNyCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"2e1bc894a1a9656a528666771ee6447eb469f837ee05bea1f488a7363f760f38","last_reissued_at":"2026-05-17T23:38:14.778993Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.778993Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"RT-H: Action Hierarchies Using Language","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.","cross_cats":["cs.AI"],"primary_cat":"cs.RO","authors_text":"Debidatta Dwibedi, Dorsa Sadigh, Jonathan Tompson, Pierre Sermanet, Quon Vuong, Suneel Belkhale, Ted Xiao, Tianli Ding, Yevgen Chebotar","submitted_at":"2024-03-04T08:16:11Z","abstract_excerpt":"Language provides a way to break down complex concepts into digestible pieces. Recent works in robot imitation learning use language-conditioned policies that predict actions given visual observations and the high-level task specified in language. These methods leverage the structure of natural language to share data between semantically similar tasks (e.g., \"pick coke can\" and \"pick an apple\") in multi-task datasets. However, as tasks become more semantically diverse (e.g., \"pick coke can\" and \"pour cup\"), sharing data between tasks becomes harder, so learning to map high-level tasks to actio"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That fine-grained language motion phrases capture shared low-level structure across semantically diverse tasks sufficiently well that predicting them improves downstream action prediction and enables effective language-based correction.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e4cd9ca7d9f39dc9c49fc7a13fe4adc5f791a2bd56de0da0e2cc8adbec4df1ee"},"source":{"id":"2403.01823","kind":"arxiv","version":2},"verdict":{"id":"8b740000-7a5e-4ec6-9195-f24f2cebc662","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T06:49:12.759689Z","strongest_claim":"Our method RT-H builds an action hierarchy using language motions: it first learns to predict language motions, and conditioned on this and the high-level task, it predicts actions, using visual context at all stages.","one_line_summary":"RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That fine-grained language motion phrases capture shared low-level structure across semantically diverse tasks sufficiently well that predicting them improves downstream action prediction and enables effective language-based correction.","pith_extraction_headline":"Predicting fine-grained language descriptions of motions first helps robot policies share structure across diverse tasks and accept language corrections."},"references":{"count":63,"sample":[{"doi":"","year":2023,"title":"Do as i can, not as i say: Grounding language in robotic affordances","work_id":"162aa552-ee2f-4cc9-b1b6-f951beeed62a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3568162.3578623","year":2023,"title":"“No, to the Right","work_id":"ab15835c-e9db-4f47-8325-d798c6f35c30","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Correcting robot plans with natural language feedback","work_id":"bff9f09b-3261-4eca-be1a-447a04fcbb45","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"URL https://api.semanticscholar.org/CorpusID: 248085271","work_id":"6d0c3ebd-641a-4481-872f-00df32ae5ec0","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","ref_index":5,"cited_arxiv_id":"2307.15818","is_internal_anchor":true}],"resolved_work":63,"snapshot_sha256":"aee9348c11220fac061336eef4a2cd7afb74f430ef6ede64bb6495a16cf2f4c5","internal_anchors":6},"formal_canon":{"evidence_count":2,"snapshot_sha256":"37f1e93178923e324fa86c6fca24ba794483d79cbc5af82b0566d27c73a78a7c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2403.01823","created_at":"2026-05-17T23:38:14.779098+00:00"},{"alias_kind":"arxiv_version","alias_value":"2403.01823v2","created_at":"2026-05-17T23:38:14.779098+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2403.01823","created_at":"2026-05-17T23:38:14.779098+00:00"},{"alias_kind":"pith_short_12","alias_value":"FYN4RFFBVFSW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"FYN4RFFBVFSWUUUG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"FYN4RFFB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":21,"internal_anchor_count":21,"sample":[{"citing_arxiv_id":"2510.12710","citing_title":"Reflection-Based Task Adaptation for Self-Improving VLA","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01925","citing_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2511.18960","citing_title":"AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2511.18085","citing_title":"Continually Evolving Skill Knowledge in Vision Language Action Model","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04447","citing_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2601.07060","citing_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2503.15558","citing_title":"Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08392","citing_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2502.19417","citing_title":"Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2412.13877","citing_title":"RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15620","citing_title":"Towards Generalizable Robotic Manipulation in Dynamic Environments","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2502.05855","citing_title":"DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13119","citing_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13632","citing_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12167","citing_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14125","citing_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2501.09747","citing_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09059","citing_title":"Learning Vision-Language-Action World Models for Autonomous Driving","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14125","citing_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18463","citing_title":"Using large language models for embodied planning introduces systematic safety risks","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15938","citing_title":"VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation","ref_index":1,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2","json":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2.json","graph_json":"https://pith.science/api/pith-number/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/graph.json","events_json":"https://pith.science/api/pith-number/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/events.json","paper":"https://pith.science/paper/FYN4RFFB"},"agent_actions":{"view_html":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2","download_json":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2.json","view_paper":"https://pith.science/paper/FYN4RFFB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2403.01823&json=true","fetch_graph":"https://pith.science/api/pith-number/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/graph.json","fetch_events":"https://pith.science/api/pith-number/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/action/storage_attestation","attest_author":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/action/author_attestation","sign_citation":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/action/citation_signature","submit_replication":"https://pith.science/pith/FYN4RFFBVFSWUUUGMZ3R5ZSEP2/action/replication_record"}},"created_at":"2026-05-17T23:38:14.779098+00:00","updated_at":"2026-05-17T23:38:14.779098+00:00"}