MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

Huan Li; Muyao Peng; Xi Wang; Yan Wang; Yinchen Tian; You Yang

arxiv: 2606.10899 · v1 · pith:VG3Q6OV3new · submitted 2026-06-09 · 💻 cs.RO

MV-Actor: Aligning Multi-View Semantics and Spatial Awareness for Bimanual Manipulation

Yinchen Tian , Huan Li , Muyao Peng , Xi Wang , Yan Wang , You Yang This is my paper

Pith reviewed 2026-06-27 12:50 UTC · model grok-4.3

classification 💻 cs.RO

keywords multi-view perceptionbimanual manipulationsemantic-spatial representationrobotic manipulationdepth repairmulti-view interactionspatial awareness

0 comments

The pith

MV-Actor builds a unified semantic-spatial representation from multiple camera views to improve bimanual robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MV-Actor as a multi-view perception framework that addresses how existing bimanual policies either encode each camera view separately or combine their features only at a shallow level. This approach limits how well semantics are shared across viewpoints and how reliably spatial information is recovered. MV-Actor adds three components that interact to produce one combined representation: multi-view semantic interaction shares perception across cameras, semantic-spatial token interaction grounds those semantics using features from a feed-forward reconstruction model, and guided metric depth repair cleans up noisy sensor depth to supply stable metric references. Experiments on the PerAct2 benchmark show an 87.8 percent average success rate in simulation, with further gains over RGB and RGB-D baselines in real-world settings that include frequent viewpoint shifts and consumer-grade depth noise.

Core claim

MV-Actor establishes a unified semantic-spatial representation for bimanual manipulation by first performing Multi-view Semantic Interaction to share semantic perception across views, then applying Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features for reliable spatial awareness, and finally using a Guided Metric Depth Repair module to refine degraded sensor depth under consumer-grade noise.

What carries the argument

The three modules of Multi-view Semantic Interaction, Semantic-Spatial Token Interaction, and Guided Metric Depth Repair that together construct the unified semantic-spatial representation.

If this is right

Bimanual policies gain higher task success rates once semantic information is actively shared across all available camera views.
Spatial awareness remains usable even when depth sensors are unstable or consumer-grade, provided the repair module supplies metric anchors.
Real-world deployment becomes more robust to camera motion because the framework explicitly handles viewpoint changes through the interaction modules.
The same interaction pattern can be applied to other multi-camera manipulation setups that currently suffer from isolated view processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to tasks with more than two arms if the semantic interaction module is scaled to additional views without redesign.
Replacing the reconstruction model inside the token interaction step with a different spatial prior might further reduce reliance on any single depth sensor.
Testing the framework on longer-horizon tasks would reveal whether the unified representation maintains consistency over extended sequences.

Load-bearing premise

The three modules will combine to produce a unified semantic-spatial representation that overcomes the limits of independent view encoding and shallow fusion.

What would settle it

A controlled test on the PerAct2 benchmark in which MV-Actor without one or more of the three modules shows no improvement over independent-view or shallow-fusion baselines under frequent viewpoint changes and noisy depth.

Figures

Figures reproduced from arXiv: 2606.10899 by Huan Li, Muyao Peng, Xi Wang, Yan Wang, Yinchen Tian, You Yang.

**Figure 2.** Figure 2: MV-Actor overview. Given multi-view RGB-D observations, MV-Actor first repairs degraded sensor depth by combining RGB texture and a Pi3X feed-forward depth prior, producing metric depth that can be back-projected into the robot frame. CLIP features provide semantic scene tokens, while Pi3X [16] provides feed-forward spatial tokens. Multi-view Semantic Interaction exchanges semantic evidence across physical… view at source ↗

**Figure 3.** Figure 3: Multi-view and semantic-spatial token interaction. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: ARK-Lift real-robot setup and tasks. The platform uses three calibrated RGB-D cameras. The task examples cover cooperative lifting, stick insertion, and contact-based handover. edges captured by different cameras; in straighten_rope, it enables accurate endpoint reasoning along the full rope configuration observed from multiple angles. On long-horizon tasks such as sweep_dust, lift_tray, handover_easy, t… view at source ↗

**Figure 5.** Figure 5: Multi-view Semantic Interaction visualization. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Semantic-Spatial Token Interaction visualization. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 8.** Figure 8: Qualitative depth repair. Each example shows RGB, degraded sensor depth, feed-forward depth (FFD), ground-truth depth, FFD-only repair, RGB-only repair, and RGB+FFD repair. Joint guidance better balances hole filling, boundary recovery, and global structure consistency. view observations by sharing semantic perception across views and grounding it with reliable spatial awareness from feed-forward reconstr… view at source ↗

**Figure 7.** Figure 7: Typical depth failures. Public depth datasets [45–47] and depth artifact studies [48] show holes and boundary artifacts caused by transparent or reflective surfaces, low texture, and robotarm occlusion [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 9.** Figure 9: Examples of real-robot evaluation stages, ordered consistently with [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

read the original abstract

Robotic manipulation has been widely applied in industrial scenarios. Compared with single-arm manipulation, bimanual manipulation is equipped with multiple cameras to capture information from different viewpoints. However, existing multi-view policies encode each view independently or fuse view features shallowly, resulting in limited sharing semantic perception and unreliable spatial awareness. In this paper, we propose \textbf{MV-Actor}, a multi-view perception framework that builds a unified semantic-spatial representation for bimanual manipulation. First, MV-Actor performs Multi-view Semantic Interaction to share semantic perception across views. Then it uses Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features and acquire reliable spatial awareness. Finally, a Guided Metric Depth Repair module refines degraded sensor depth to provide more reliable metric anchors under consumer-grade depth noise. In simulation experiments conducted on the PerAct2 bimanual benchmark, MV-Actor achieves a state-of-the-art average success rate of 87.8\%. In real-world evaluations with more frequent viewpoint changes and unstable consumer-grade depth, MV-Actor outperforms both RGB and RGB-D baselines, further demonstrating the benefit of sharing semantic perception and reliable spatial awareness for bimanual manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MV-Actor adds three modules for semantic sharing and spatial grounding in multi-view bimanual policies, with reported 87.8% on PerAct2 and real-world gains, but the abstract leaves implementation and ablations opaque.

read the letter

MV-Actor proposes a framework with three modules to build a unified semantic-spatial representation for bimanual manipulation using multiple camera views. The main points are Multi-view Semantic Interaction to share semantics, Semantic-Spatial Token Interaction to ground with reconstruction features, and Guided Metric Depth Repair for better metric depth under noise.

This addresses a real issue in the field where bimanual policies don't fully leverage multiple views for both semantics and reliable spatial info. The paper does well by testing on the PerAct2 benchmark in simulation, hitting 87.8% average success, and showing better performance in real-world tests with viewpoint changes and unstable depth compared to RGB and RGB-D baselines. That gives some evidence that the approach helps in practical scenarios.

Where it is softer is the absence of detailed methodology, specific baselines, ablation studies, or error analysis in the provided abstract. This makes it tough to see exactly how the modules work or confirm that they overcome the limitations as claimed. The central assumption about creating a unified representation from these interactions is plausible but rests on the empirical outcomes, which can't be fully assessed yet.

There are no signs of internal contradictions or circular reasoning in the high-level argument. The claims are framed as empirical results from the experiments.

This paper would be of interest to people working on multi-view robotic policies or bimanual manipulation systems. A reader who wants to see comparisons on standard benchmarks and real setups with consumer sensors could find it useful. Given the specific problem it targets and the reported results, it deserves a serious referee to examine the full implementation and verify the contributions.

I recommend engaging with it through peer review to get a closer look at the methods and data.

Referee Report

1 major / 2 minor

Summary. The paper proposes MV-Actor, a multi-view perception framework for bimanual robotic manipulation. It introduces three modules to build a unified semantic-spatial representation: Multi-view Semantic Interaction to share semantic perception across views, Semantic-Spatial Token Interaction to ground visual semantics with feed-forward reconstruction model features for reliable spatial awareness, and Guided Metric Depth Repair to refine degraded sensor depth under consumer-grade noise. The framework is evaluated on the PerAct2 bimanual benchmark in simulation, achieving a state-of-the-art average success rate of 87.8%, and in real-world experiments with frequent viewpoint changes and unstable depth, where it outperforms both RGB and RGB-D baselines.

Significance. If the empirical results hold after detailed validation, this work would advance multi-view policies for bimanual manipulation by showing that aligned semantic-spatial representations can overcome limitations of independent view encoding and shallow fusion, with particular relevance to robustness under viewpoint variation and noisy depth sensors. The reported performance gains in both simulation and real-world settings with consumer hardware indicate potential practical impact.

major comments (1)

[Abstract] Abstract: The abstract states the 87.8% average success rate and outperformance over baselines but supplies no methodology details, baselines, ablation results, or error analysis, making it impossible to determine whether the data supports the central claim that the three modules produce a unified representation sufficient to overcome the stated limitations.

minor comments (2)

The manuscript would benefit from a diagram or pseudocode clarifying the data flow and interactions among the three proposed modules.
Ensure consistent use of terminology (e.g., 'feed-forward reconstruction model') and provide a reference or brief description for any external models used.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of MV-Actor. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states the 87.8% average success rate and outperformance over baselines but supplies no methodology details, baselines, ablation results, or error analysis, making it impossible to determine whether the data supports the central claim that the three modules produce a unified representation sufficient to overcome the stated limitations.

Authors: We acknowledge that the abstract is concise by design. It does provide a high-level overview of the methodology via the three named modules (Multi-view Semantic Interaction, Semantic-Spatial Token Interaction, Guided Metric Depth Repair) and explicitly states outperformance over RGB and RGB-D baselines. Full methodology details appear in Section 3, experimental setup and baselines in Section 4.1 (with results in Tables 1–2), module ablations in Section 4.3, and error/failure analysis in Section 4.4. These sections supply the evidence that the unified semantic-spatial representation addresses the limitations of independent view encoding and shallow fusion. To further address the concern, we will revise the abstract to include a brief clause on the evaluation protocol and ablation findings, subject to length constraints. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper proposes an empirical multi-view framework (MV-Actor) consisting of three described modules and validates performance via simulation benchmarks (PerAct2, 87.8% success) and real-world trials. No equations, derivations, fitted parameters, or predictions appear in the provided text. Claims rest on experimental outcomes rather than any self-referential definitions, imported uniqueness theorems, or ansatzes smuggled via self-citation. The argument is self-contained against external benchmarks with no load-bearing reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.1-grok · 5752 in / 1315 out tokens · 27028 ms · 2026-06-27T12:50:23.138736+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 2 linked inside Pith

[1]

Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

Aude Billard and Danica Kragic. Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

2019
[2]

Diffusion policy: Visuomotor policy learn- ing via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shu- ran Song. Diffusion policy: Visuomotor policy learn- ing via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[3]

RT-1: Robotics Transformer for real- world control at scale.Robotics: Science and Systems XIX, pages 1–22, 2023

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for real- world control at scale.Robotics: Science and Systems XIX, pages 1–22, 2023

2023
[4]

Learning fine-grained bimanual manip- ulation with low-cost hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manip- ulation with low-cost hardware. InICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, pages 1–22, 2023

2023
[5]

Mobile ALOHA: Learning bimanual mobile manipulation us- ing low-cost whole-body teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation us- ing low-cost whole-body teleoperation. InProceedings 9 of the Conference on Robot Learning (CoRL), pages 4066–4083. PMLR, 2024

2024
[6]

RDT-1B: a diffusion foundation model for bi- manual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a diffusion foundation model for bi- manual manipulation. InProceedings of the Interna- tional Conference on Learning Representations (ICLR), pages 29982–30009, 2025

2025
[7]

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A Vision-Language-Action flow model for general robot control.arXiv preprint arXiv:2410.24164, pages 1–17, 2024

Pith/arXiv arXiv 2024
[8]

BFA: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10(9):8930–8937, 2025

Zihan Lan, Weixin Mao, Haosheng Li, Le Wang, Tian- cai Wang, Haoqiang Fan, and Osamu Yoshie. BFA: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10(9):8930–8937, 2025

2025
[9]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 785–799. PMLR, 2023

2023
[10]

Act3D: 3D feature field transformers for multi-task robotic manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), pages 3949–3965. PMLR, 2023

2023
[11]

3D Diffuser Actor: Policy diffusion with 3D scene representations

Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. InProceedings of the Con- ference on Robot Learning (CoRL), pages 1949–1974. PMLR, 2024

1949
[12]

3D FlowMatch Actor: Unified 3D policy for single- and dual-arm manipulation.arXiv preprint arXiv:2508.11002, pages 1–23, 2025

Nikolaos Gkanatsios, Jiahe Xu, Matthew Bronars, Ar- salan Mousavian, Tsung-Wei Ke, and Katerina Fragki- adaki. 3D FlowMatch Actor: Unified 3D policy for single- and dual-arm manipulation.arXiv preprint arXiv:2508.11002, pages 1–23, 2025

arXiv 2025
[13]

Zheng Li, Pei Qu, Yufei Jia, Shihui Zhou, Haizhou Ge, Jiahang Cao, Jinni Zhou, Guyue Zhou, and Jun Ma. ManiVID-3D: Generalizable view-invariant reinforce- ment learning for robotic manipulation via disentangled 3D representations.IEEE Robotics and Automation Let- ters, 11(4):4235–4242, 2026

2026
[14]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 20697–20709, 2024

2024
[15]

MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, and Yan Wang. MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

arXiv 2026
[16]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Permutation- equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, pages 1–17, 2025

Pith/arXiv arXiv 2025
[17]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. InPro- ceedings of the International Conference on Machine Learning (ICML), pages 30613–30632. PMLR, 2023

2023
[18]

Learning view-invariant world models for visual robotic manipulation

Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, and Yang Yu. Learning view-invariant world models for visual robotic manipulation. InProceedings of the International Conference on Learning Represen- tations (ICLR), pages 1–24, 2025

2025
[19]

3D-MVP: 3D multiview pretraining for manipulation

Shengyi Qian, Kaichun Mo, Valts Blukis, David F Fouhey, Dieter Fox, and Ankit Goyal. 3D-MVP: 3D multiview pretraining for manipulation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22530–22539, 2025

2025
[20]

RVT: Robotic view transformer for 3D object manipulation

Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3D object manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 694–710. PMLR, 2023

2023
[21]

RVT-2: Learning precise ma- nipulation from few demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. RVT-2: Learning precise ma- nipulation from few demonstrations. InRSS Workshop: Data Generation for Robotics, pages 1–11, 2024

2024
[22]

PEAfowl: Perception-enhanced multi-view Vision-Language-Action for bimanual ma- nipulation.arXiv preprint arXiv:2601.17885, pages 1– 11, 2026

Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, and Xun Cao. PEAfowl: Perception-enhanced multi-view Vision-Language-Action for bimanual ma- nipulation.arXiv preprint arXiv:2601.17885, pages 1– 11, 2026

arXiv 2026
[23]

G 2VLM: Geome- try grounded vision language model with unified 3D reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, pages 1–14, 2025

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Li- han Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G 2VLM: Geome- try grounded vision language model with unified 3D reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, pages 1–14, 2025

arXiv 2025
[24]

Dy- nam3D: Dynamic layered 3D tokens empower VLM for vision-and-language navigation.Advances in Neural In- formation Processing Systems (NeurIPS), 38:153522– 153544, 2026

Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dy- nam3D: Dynamic layered 3D tokens empower VLM for vision-and-language navigation.Advances in Neural In- formation Processing Systems (NeurIPS), 38:153522– 153544, 2026. 10

2026
[25]

Evo-0: Vision-Language-Action model with implicit spatial un- derstanding.arXiv preprint arXiv:2507.00416, pages 1–9, 2025

Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision-Language-Action model with implicit spatial un- derstanding.arXiv preprint arXiv:2507.00416, pages 1–9, 2025

arXiv 2025
[26]

PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks

Markus Grotz, Mohit Shridhar, Yu-Wei Chao, Tamim Asfour, and Dieter Fox. PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks. In CoRL Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, pages 1–18, 2024

2024
[27]

V oxAct-B: V oxel-based acting and stabilizing policy for bimanual manipulation

I-Chun Arthur Liu, Sicheng He, Daniel Seita, and Gau- rav S Sukhatme. V oxAct-B: V oxel-based acting and stabilizing policy for bimanual manipulation. InPro- ceedings of the Conference on Robot Learning (CoRL), pages 4354–4370. PMLR, 2025

2025
[28]

InterACT: Inter-dependency aware action chunking with hierarchical attention trans- formers for bimanual manipulation

Andrew Choong-Won Lee, Ian Chuang, Ling-Yuan Chen, and Iman Soltani. InterACT: Inter-dependency aware action chunking with hierarchical attention trans- formers for bimanual manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 1730–
[29]

Denois- ing diffusion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840– 6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denois- ing diffusion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840– 6851, 2020

2020
[30]

De- noising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising Diffusion Implicit Models. InProceedings of the International Conference on Learning Representa- tions (ICLR), pages 1–20, 2021

2021
[31]

Scalable diffu- sion models with transformers

William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 4195–4205, 2023

2023
[32]

RT-2: Vision-Language- Action models transfer web knowledge to robotic con- trol

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-Language- Action models transfer web knowledge to robotic con- trol. InProceedings of the Conference on Robot Learn- ing (CoRL), pages 2165–2183. PMLR, 2023

2023
[33]

Octo: An open-source generalist robot policy

Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. InICRA Workshop on Vision-Language Models for Navigation and Manip- ulation, pages 1–17, 2024

2024
[34]

OpenVLA: An open-source Vision-Language-Action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source Vision-Language-Action model. InProceedings of the Conference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025

2025
[35]

InProceedings of the Conference on Robot Learning (CoRL), pages 17–40

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a Vision-Language-Action model with open- world generalization. InProceedings of the Conference on Robot Learning (CoRL), pages 17–40. PMLR, 2025

2025
[36]

Flow Matching for gen- erative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matthew Le. Flow Matching for gen- erative modeling. InProceedings of the International Conference on Learning Representations (ICLR), pages 1–28, 2023

2023
[37]

Autoregressive ac- tion sequence learning for robotic manipulation.IEEE Robotics and Automation Letters, 10(5):4898–4905, 2025

Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, and Abdeslam Boularias. Autoregressive ac- tion sequence learning for robotic manipulation.IEEE Robotics and Automation Letters, 10(5):4898–4905, 2025

2025
[38]

Learning transferable visual models from natural lan- guage supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InProceedings of the International Conference on Machine Learning (ICML), pages 8748–
[39]

U-Net: Convolutional networks for biomedical im- age segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical im- age segmentation. InProceedings of the Interna- tional Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–
[40]

On the continuity of rotation representations in neural networks

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 5745–5753, 2019

2019
[41]

3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. InRSS Workshop on Dexterous Manip- ulation: Design, Perception and Control, pages 1–17, 2024

2024
[42]

Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manip- ulation

Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, and Liqiang Nie. Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manip- ulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17394–17404, 2025

2025
[43]

Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation.arXiv preprint arXiv:2504.17784, pages 1–20, 2025

Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, and Jiang- miao Pang. Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation.arXiv preprint arXiv:2504.17784, pages 1–20, 2025. 11

arXiv 2025
[44]

Anybimanual: Transferring unimanual policy for general bimanual ma- nipulation

Guanxing Lu, Tengbo Yu, Haoyuan Deng, Season Si Chen, Yansong Tang, and Ziwei Wang. Anybimanual: Transferring unimanual policy for general bimanual ma- nipulation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 13662–13672, 2025

2025
[45]

Booster: a benchmark for depth from images of specular and transparent surfaces.IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 46(1):85–102, 2023

Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Booster: a benchmark for depth from images of specular and transparent surfaces.IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 46(1):85–102, 2023

2023
[46]

Indoor segmentation and support inference from RGBD images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. InProceedings of the European Conference on Computer Vision (ECCV), pages 746–
[47]

A taxonomy and evaluation of dense two-frame stereo correspon- dence algorithms.International Journal of Computer Vision, 47(1):7–42, 2002

Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspon- dence algorithms.International Journal of Computer Vision, 47(1):7–42, 2002

2002
[48]

Depth map artefacts reduction: A review.IET Image Processing, 14(12):2630–2644, 2020

Mostafa Mahmoud Ibrahim, Qiong Liu, Rizwan Khan, Jingyu Yang, Ehsan Adeli, and You Yang. Depth map artefacts reduction: A review.IET Image Processing, 14(12):2630–2644, 2020

2020
[49]

MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xi- ang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025

2025
[50]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770– 778, 2016

2016
[51]

Feature pyra- mid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyra- mid networks for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 12 SUPPLEMENTARY MA TERIAL A. Module-Level Implementation Details Semantic tokens and anchorsFrozen...

2017

[1] [1]

Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

Aude Billard and Danica Kragic. Trends and challenges in robot manipulation.Science, 364(6446):eaat8414, 2019

2019

[2] [2]

Diffusion policy: Visuomotor policy learn- ing via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shu- ran Song. Diffusion policy: Visuomotor policy learn- ing via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[3] [3]

RT-1: Robotics Transformer for real- world control at scale.Robotics: Science and Systems XIX, pages 1–22, 2023

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for real- world control at scale.Robotics: Science and Systems XIX, pages 1–22, 2023

2023

[4] [4]

Learning fine-grained bimanual manip- ulation with low-cost hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manip- ulation with low-cost hardware. InICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, pages 1–22, 2023

2023

[5] [5]

Mobile ALOHA: Learning bimanual mobile manipulation us- ing low-cost whole-body teleoperation

Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation us- ing low-cost whole-body teleoperation. InProceedings 9 of the Conference on Robot Learning (CoRL), pages 4066–4083. PMLR, 2024

2024

[6] [6]

RDT-1B: a diffusion foundation model for bi- manual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: a diffusion foundation model for bi- manual manipulation. InProceedings of the Interna- tional Conference on Learning Representations (ICLR), pages 29982–30009, 2025

2025

[7] [7]

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A Vision-Language-Action flow model for general robot control.arXiv preprint arXiv:2410.24164, pages 1–17, 2024

Pith/arXiv arXiv 2024

[8] [8]

BFA: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10(9):8930–8937, 2025

Zihan Lan, Weixin Mao, Haosheng Li, Le Wang, Tian- cai Wang, Haoqiang Fan, and Osamu Yoshie. BFA: Best-feature-aware fusion for multi-view fine-grained manipulation.IEEE Robotics and Automation Letters, 10(9):8930–8937, 2025

2025

[9] [9]

Perceiver-actor: A multi-task transformer for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 785–799. PMLR, 2023

2023

[10] [10]

Act3D: 3D feature field transformers for multi-task robotic manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3D: 3D feature field transformers for multi-task robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), pages 3949–3965. PMLR, 2023

2023

[11] [11]

3D Diffuser Actor: Policy diffusion with 3D scene representations

Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. InProceedings of the Con- ference on Robot Learning (CoRL), pages 1949–1974. PMLR, 2024

1949

[12] [12]

3D FlowMatch Actor: Unified 3D policy for single- and dual-arm manipulation.arXiv preprint arXiv:2508.11002, pages 1–23, 2025

Nikolaos Gkanatsios, Jiahe Xu, Matthew Bronars, Ar- salan Mousavian, Tsung-Wei Ke, and Katerina Fragki- adaki. 3D FlowMatch Actor: Unified 3D policy for single- and dual-arm manipulation.arXiv preprint arXiv:2508.11002, pages 1–23, 2025

arXiv 2025

[13] [13]

Zheng Li, Pei Qu, Yufei Jia, Shihui Zhou, Haizhou Ge, Jiahang Cao, Jinni Zhou, Guyue Zhou, and Jun Ma. ManiVID-3D: Generalizable view-invariant reinforce- ment learning for robotic manipulation via disentangled 3D representations.IEEE Robotics and Automation Let- ters, 11(4):4235–4242, 2026

2026

[14] [14]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 20697–20709, 2024

2024

[15] [15]

MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, and Yan Wang. MeMix: Writing less, remem- bering more for streaming 3D reconstruction.arXiv preprint arXiv:2603.15330, pages 1–19, 2026

arXiv 2026

[16] [16]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.π 3: Permutation- equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, pages 1–17, 2025

Pith/arXiv arXiv 2025

[17] [17]

Multi-view masked world models for visual robotic manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, and Pieter Abbeel. Multi-view masked world models for visual robotic manipulation. InPro- ceedings of the International Conference on Machine Learning (ICML), pages 30613–30632. PMLR, 2023

2023

[18] [18]

Learning view-invariant world models for visual robotic manipulation

Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, and Yang Yu. Learning view-invariant world models for visual robotic manipulation. InProceedings of the International Conference on Learning Represen- tations (ICLR), pages 1–24, 2025

2025

[19] [19]

3D-MVP: 3D multiview pretraining for manipulation

Shengyi Qian, Kaichun Mo, Valts Blukis, David F Fouhey, Dieter Fox, and Ankit Goyal. 3D-MVP: 3D multiview pretraining for manipulation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22530–22539, 2025

2025

[20] [20]

RVT: Robotic view transformer for 3D object manipulation

Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3D object manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 694–710. PMLR, 2023

2023

[21] [21]

RVT-2: Learning precise ma- nipulation from few demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. RVT-2: Learning precise ma- nipulation from few demonstrations. InRSS Workshop: Data Generation for Robotics, pages 1–11, 2024

2024

[22] [22]

PEAfowl: Perception-enhanced multi-view Vision-Language-Action for bimanual ma- nipulation.arXiv preprint arXiv:2601.17885, pages 1– 11, 2026

Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, and Xun Cao. PEAfowl: Perception-enhanced multi-view Vision-Language-Action for bimanual ma- nipulation.arXiv preprint arXiv:2601.17885, pages 1– 11, 2026

arXiv 2026

[23] [23]

G 2VLM: Geome- try grounded vision language model with unified 3D reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, pages 1–14, 2025

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Li- han Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G 2VLM: Geome- try grounded vision language model with unified 3D reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688, pages 1–14, 2025

arXiv 2025

[24] [24]

Dy- nam3D: Dynamic layered 3D tokens empower VLM for vision-and-language navigation.Advances in Neural In- formation Processing Systems (NeurIPS), 38:153522– 153544, 2026

Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dy- nam3D: Dynamic layered 3D tokens empower VLM for vision-and-language navigation.Advances in Neural In- formation Processing Systems (NeurIPS), 38:153522– 153544, 2026. 10

2026

[25] [25]

Evo-0: Vision-Language-Action model with implicit spatial un- derstanding.arXiv preprint arXiv:2507.00416, pages 1–9, 2025

Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision-Language-Action model with implicit spatial un- derstanding.arXiv preprint arXiv:2507.00416, pages 1–9, 2025

arXiv 2025

[26] [26]

PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks

Markus Grotz, Mohit Shridhar, Yu-Wei Chao, Tamim Asfour, and Dieter Fox. PerAct2: Benchmarking and learning for robotic bimanual manipulation tasks. In CoRL Workshop on Whole-body Control and Bimanual Manipulation: Applications in Humanoids and Beyond, pages 1–18, 2024

2024

[27] [27]

V oxAct-B: V oxel-based acting and stabilizing policy for bimanual manipulation

I-Chun Arthur Liu, Sicheng He, Daniel Seita, and Gau- rav S Sukhatme. V oxAct-B: V oxel-based acting and stabilizing policy for bimanual manipulation. InPro- ceedings of the Conference on Robot Learning (CoRL), pages 4354–4370. PMLR, 2025

2025

[28] [28]

InterACT: Inter-dependency aware action chunking with hierarchical attention trans- formers for bimanual manipulation

Andrew Choong-Won Lee, Ian Chuang, Ling-Yuan Chen, and Iman Soltani. InterACT: Inter-dependency aware action chunking with hierarchical attention trans- formers for bimanual manipulation. InProceedings of the Conference on Robot Learning (CoRL), pages 1730–

[29] [29]

Denois- ing diffusion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840– 6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denois- ing diffusion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840– 6851, 2020

2020

[30] [30]

De- noising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising Diffusion Implicit Models. InProceedings of the International Conference on Learning Representa- tions (ICLR), pages 1–20, 2021

2021

[31] [31]

Scalable diffu- sion models with transformers

William Peebles and Saining Xie. Scalable diffu- sion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 4195–4205, 2023

2023

[32] [32]

RT-2: Vision-Language- Action models transfer web knowledge to robotic con- trol

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-Language- Action models transfer web knowledge to robotic con- trol. InProceedings of the Conference on Robot Learn- ing (CoRL), pages 2165–2183. PMLR, 2023

2023

[33] [33]

Octo: An open-source generalist robot policy

Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. InICRA Workshop on Vision-Language Models for Navigation and Manip- ulation, pages 1–17, 2024

2024

[34] [34]

OpenVLA: An open-source Vision-Language-Action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source Vision-Language-Action model. InProceedings of the Conference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025

2025

[35] [35]

InProceedings of the Conference on Robot Learning (CoRL), pages 17–40

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a Vision-Language-Action model with open- world generalization. InProceedings of the Conference on Robot Learning (CoRL), pages 17–40. PMLR, 2025

2025

[36] [36]

Flow Matching for gen- erative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matthew Le. Flow Matching for gen- erative modeling. InProceedings of the International Conference on Learning Representations (ICLR), pages 1–28, 2023

2023

[37] [37]

Autoregressive ac- tion sequence learning for robotic manipulation.IEEE Robotics and Automation Letters, 10(5):4898–4905, 2025

Xinyu Zhang, Yuhan Liu, Haonan Chang, Liam Schramm, and Abdeslam Boularias. Autoregressive ac- tion sequence learning for robotic manipulation.IEEE Robotics and Automation Letters, 10(5):4898–4905, 2025

2025

[38] [38]

Learning transferable visual models from natural lan- guage supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InProceedings of the International Conference on Machine Learning (ICML), pages 8748–

[39] [39]

U-Net: Convolutional networks for biomedical im- age segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical im- age segmentation. InProceedings of the Interna- tional Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–

[40] [40]

On the continuity of rotation representations in neural networks

Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pages 5745–5753, 2019

2019

[41] [41]

3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D Diffusion Policy: Generalizable visuomotor policy learning via simple 3D representations. InRSS Workshop on Dexterous Manip- ulation: Design, Perception and Control, pages 1–17, 2024

2024

[42] [42]

Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manip- ulation

Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, and Liqiang Nie. Spatial-temporal graph diffusion policy with kinematic modeling for bimanual robotic manip- ulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17394–17404, 2025

2025

[43] [43]

Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation.arXiv preprint arXiv:2504.17784, pages 1–20, 2025

Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, and Jiang- miao Pang. Gripper keypose and object pointflow as interfaces for bimanual robotic manipulation.arXiv preprint arXiv:2504.17784, pages 1–20, 2025. 11

arXiv 2025

[44] [44]

Anybimanual: Transferring unimanual policy for general bimanual ma- nipulation

Guanxing Lu, Tengbo Yu, Haoyuan Deng, Season Si Chen, Yansong Tang, and Ziwei Wang. Anybimanual: Transferring unimanual policy for general bimanual ma- nipulation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 13662–13672, 2025

2025

[45] [45]

Booster: a benchmark for depth from images of specular and transparent surfaces.IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 46(1):85–102, 2023

Pierluigi Zama Ramirez, Alex Costanzino, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Booster: a benchmark for depth from images of specular and transparent surfaces.IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 46(1):85–102, 2023

2023

[46] [46]

Indoor segmentation and support inference from RGBD images

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. InProceedings of the European Conference on Computer Vision (ECCV), pages 746–

[47] [47]

A taxonomy and evaluation of dense two-frame stereo correspon- dence algorithms.International Journal of Computer Vision, 47(1):7–42, 2002

Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspon- dence algorithms.International Journal of Computer Vision, 47(1):7–42, 2002

2002

[48] [48]

Depth map artefacts reduction: A review.IET Image Processing, 14(12):2630–2644, 2020

Mostafa Mahmoud Ibrahim, Qiong Liu, Rizwan Khan, Jingyu Yang, Ehsan Adeli, and You Yang. Depth map artefacts reduction: A review.IET Image Processing, 14(12):2630–2644, 2020

2020

[49] [49]

MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xi- ang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025

2025

[50] [50]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770– 778, 2016

2016

[51] [51]

Feature pyra- mid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyra- mid networks for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017. 12 SUPPLEMENTARY MA TERIAL A. Module-Level Implementation Details Semantic tokens and anchorsFrozen...

2017