{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:Y5LX4DQD7K3VH6DQIYSCJTQ6X6","short_pith_number":"pith:Y5LX4DQD","schema_version":"1.0","canonical_sha256":"c7577e0e03fab753f870462424ce1ebfaa0f2f65a6419301d0bd265c96a85351","source":{"kind":"arxiv","id":"2508.07917","version":4},"attestation_state":"computed","paper":{"title":"MolmoAct: Action Reasoning Models that can Reason in Space","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MolmoAct encodes robot observations into depth-aware tokens, editable trajectory traces, and low-level actions through a three-stage pipeline.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Ali Farhadi, Angelica Wu, Bohan Fang, Boyang Li, Dieter Fox, Eli VanderBilt, Haoquan Fang, Jason Lee, Jiafei Duan, Jieyu Zhang, Karen Farley, Ranjay Krishna, Rose Hendrix, Sangho Lee, Shuo Liu, Wilbert Pumacay, Winson Han, Yi Ru Wang, Yuquan Deng","submitted_at":"2025-08-11T12:32:45Z","abstract_excerpt":"Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and ste"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2508.07917","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2025-08-11T12:32:45Z","cross_cats_sorted":[],"title_canon_sha256":"754cffeb0bf838ee1ab60cfe42a62b9da773132274b7d5fcfe9e000645a4e9ce","abstract_canon_sha256":"56ef2d440f74d2c02bb85414d3dcda76a36e92fba696eda6936c2531f9103018"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:19.843621Z","signature_b64":"29RM79zPmua0M2b6qXjq7f9g9Y+Z1GsxC1SkyyF0gKB8exDBnVIXwtV1phVylcIi249T8dOiBf0ZuKd0Ki1ECg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c7577e0e03fab753f870462424ce1ebfaa0f2f65a6419301d0bd265c96a85351","last_reissued_at":"2026-05-17T23:39:19.842934Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:19.842934Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MolmoAct: Action Reasoning Models that can Reason in Space","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MolmoAct encodes robot observations into depth-aware tokens, editable trajectory traces, and low-level actions through a three-stage pipeline.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Ali Farhadi, Angelica Wu, Bohan Fang, Boyang Li, Dieter Fox, Eli VanderBilt, Haoquan Fang, Jason Lee, Jiafei Duan, Jieyu Zhang, Karen Farley, Ranjay Krishna, Rose Hendrix, Sangho Lee, Shuo Liu, Wilbert Pumacay, Winson Han, Yi Ru Wang, Yuquan Deng","submitted_at":"2025-08-11T12:32:45Z","abstract_excerpt":"Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and ste"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MolmoAct encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the structured three-stage pipeline of depth-aware perception, editable trajectory planning, and low-level control produces meaningfully better adaptability, generalization, and semantic grounding than direct perception-to-action mapping models.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MolmoAct is a 7B robotic foundation model using a three-stage pipeline of depth-aware perception, editable spatial trajectory planning, and low-level action prediction that reports state-of-the-art results on simulation and real-world tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MolmoAct encodes robot observations into depth-aware tokens, editable trajectory traces, and low-level actions through a three-stage pipeline.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d5fe9e754e1eb39f2ac446a481735cc2ed7e791db71ae1d80b1a14d6c765e210"},"source":{"id":"2508.07917","kind":"arxiv","version":4},"verdict":{"id":"b04ffe46-9140-4bc8-aaab-353ebc2bb1c3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T23:30:25.974841Z","strongest_claim":"MolmoAct encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior.","one_line_summary":"MolmoAct is a 7B robotic foundation model using a three-stage pipeline of depth-aware perception, editable spatial trajectory planning, and low-level action prediction that reports state-of-the-art results on simulation and real-world tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the structured three-stage pipeline of depth-aware perception, editable trajectory planning, and low-level control produces meaningfully better adaptability, generalization, and semantic grounding than direct perception-to-action mapping models.","pith_extraction_headline":"MolmoAct encodes robot observations into depth-aware tokens, editable trajectory traces, and low-level actions through a three-stage pipeline."},"references":{"count":13,"sample":[{"doi":"","year":2024,"title":"2.ViT Image Encoder:encodes each crop independently into per-patch features","work_id":"4885fa0b-f534-43cb-a557-f87c08dbd9b4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Layer selection and concatenation:features from the third-to-last (OpenAI CLIP) or fourth-to-last (SigLIP2) and the tenth-from-last ViT layers are concatenated for each patch; this slightly outperform","work_id":"4751badc-4869-4165-a10a-1fddb020dc23","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Attention pooling in2 × 2windows:within each2 × 2patch window, a multi-headed attention layer pools the four patches to a single vector, using the mean of the patches as the query. This pooling reduce","work_id":"161a021d-833b-48b7-866e-0e984bd76332","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Language Description:Put the bowl into the sink","work_id":"b05eb385-d8dc-43bb-b3f8-0d6381ee9ae2","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Language Description:Wipe the table","work_id":"f6e79548-9c86-43d1-8bca-5ca67023720a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":13,"snapshot_sha256":"7219498b8edb83f4b6cd94425d2df8d08c89622758c15f0e12a3976699aa6cdb","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"db6eb7c44b5a85d87e5cbd3d9451bec9b2b9aa51f00ad49c6af7eea1be7c3ea2"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2508.07917","created_at":"2026-05-17T23:39:19.843037+00:00"},{"alias_kind":"arxiv_version","alias_value":"2508.07917v4","created_at":"2026-05-17T23:39:19.843037+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2508.07917","created_at":"2026-05-17T23:39:19.843037+00:00"},{"alias_kind":"pith_short_12","alias_value":"Y5LX4DQD7K3V","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"Y5LX4DQD7K3VH6DQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"Y5LX4DQD","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":45,"internal_anchor_count":45,"sample":[{"citing_arxiv_id":"2606.07107","citing_title":"Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2503.03480","citing_title":"SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21976","citing_title":"TacO: Benchmarking Tactile Sensors for Object Manipulation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2602.10503","citing_title":"Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08167","citing_title":"Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2602.19710","citing_title":"Universal Pose Pretraining for Generalizable Vision-Language-Action Policies","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18532","citing_title":"VLANeXt: Recipes for Building Strong VLA Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21414","citing_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2508.13998","citing_title":"Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2510.19268","citing_title":"Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2511.14148","citing_title":"AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17411","citing_title":"SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2511.16857","citing_title":"BOP-ASK: Object-Interaction Reasoning for Vision-Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2510.03827","citing_title":"LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10941","citing_title":"Mull-Tokens: Modality-Agnostic Latent Thinking","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04476","citing_title":"Vision-aligned Latent Reasoning for Multi-modal Large Language Model","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10611","citing_title":"Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2602.11236","citing_title":"ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13193","citing_title":"Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00110","citing_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2603.02115","citing_title":"Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15956","citing_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.26320","citing_title":"DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13778","citing_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13632","citing_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","ref_index":20,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6","json":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6.json","graph_json":"https://pith.science/api/pith-number/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/graph.json","events_json":"https://pith.science/api/pith-number/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/events.json","paper":"https://pith.science/paper/Y5LX4DQD"},"agent_actions":{"view_html":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6","download_json":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6.json","view_paper":"https://pith.science/paper/Y5LX4DQD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2508.07917&json=true","fetch_graph":"https://pith.science/api/pith-number/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/graph.json","fetch_events":"https://pith.science/api/pith-number/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/action/timestamp_anchor","attest_storage":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/action/storage_attestation","attest_author":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/action/author_attestation","sign_citation":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/action/citation_signature","submit_replication":"https://pith.science/pith/Y5LX4DQD7K3VH6DQIYSCJTQ6X6/action/replication_record"}},"created_at":"2026-05-17T23:39:19.843037+00:00","updated_at":"2026-05-17T23:39:19.843037+00:00"}