pith. sign in

arxiv: 2606.20048 · v1 · pith:HVUBGJDNnew · submitted 2026-06-18 · 💻 cs.RO

MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs

Pith reviewed 2026-06-26 17:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords visuomotor learningbehavior cloningdata augmentationreflection symmetryrobot manipulationdemonstration learningequivariant policies
0
0 comments X

The pith

MirrorDuo generates a mirrored demonstration for every original one, doubling effective data for reflection-symmetric robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MirrorDuo, a reflection-based method that creates a mirrored counterpart for each demonstration consisting of image, proprioception, and 6-DoF end-effector action tuples. This produces a collect-one-get-one-free effect that can be used either as data augmentation inside existing behavior cloning or diffusion policy pipelines or as a structural prior inside reflection-equivariant networks. When demonstrations are spread evenly across both sides of the workspace, the added mirrored data yields higher performance under a fixed collection budget. When all demonstrations are collected on one side only, the same mechanism supports direct transfer to the mirrored workspace using zero or five target-side demonstrations.

Core claim

MirrorDuo formulates visuomotor learning on mirrored demonstration pairs so that the policy respects reflection consistency, thereby improving sample efficiency and enabling cross-side transfer whenever the workspace admits a clean reflection symmetry.

What carries the argument

The reflection mapping that converts an original image-proprioception-6-DoF-action tuple into a valid mirrored counterpart lying on the same task manifold.

If this is right

  • Performance improves significantly under the same data budget when demonstrations are evenly distributed across both sides of the workspace.
  • Skill transfer to the mirrored workspace is possible with as few as zero or five demonstrations collected in the target arrangement.
  • MirrorDuo can be inserted as data augmentation into standard behavior cloning or diffusion policy training.
  • The same pairs can be used as a structural prior inside reflection-equivariant policy networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same symmetry exploitation could be tested on tasks whose dominant symmetry is rotational rather than reflectional.
  • Combining MirrorDuo with other geometric augmentations might further lower the number of real-world demonstrations needed in partially symmetric environments.
  • Measuring the drop in performance when the reflection assumption is deliberately violated would quantify the method's robustness boundary.

Load-bearing premise

The workspace and task admit a clean reflection symmetry such that mirroring images, proprioception, and actions produces valid, collision-free demonstrations.

What would settle it

Observing that mirrored actions produce collisions or task failures on a physical robot whose workspace lacks exact reflection symmetry would falsify the central assumption.

Figures

Figures reproduced from arXiv: 2606.20048 by Danica Kragic, Florian T. Pokorny, Giovanni Luca Marchetti, Ruiyu Wang, Zheyu Zhuang.

Figure 1
Figure 1. Figure 1: Illustration of MirrorDuo (M). Mirroring a source demo to synthesis paired demo in the mirrored arrangement. Behaviour Cloning (BC) from visual demonstrations holds promise for scalable skill acquisition in real￾world environments. Still, it is constrained by the cost of collecting diverse data, particularly in set￾tings with spatial variation of target objects or asym￾metric scene layouts [1, 2], see [PI… view at source ↗
Figure 2
Figure 2. Figure 2: Simulation Setups. Each image shows the environment, averaged over several initial conditions. (a) Close-view with demonstrations confined to one half of the workspace. (b) Wide￾view with the camera moved back. (c) Intermediate-view with demos distributed across the tabletop. Square D0 Coffee D2 Original Mirror Original Mirror MirrorDiffusion (Delta) ✚M 92±1 0±0 69±1 0±0 M 90±0 54±3 67±0 26±4 M, O 92±0 64±… view at source ↗
Figure 3
Figure 3. Figure 3: Visual asymmetry from the robot. In the close view, asym￾metry appears near the wrist and grip￾per, while in the wide view it extends to the elbow and shoulder. mirrored setups introduce background asymmetry ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Wide-view success rate (%), against number of additional opposite-side demos. Number of Demos from Mirrored Setup Added to 200 Demos 0.2 0.0 5 10 40 5 10 40 0.8 0.6 0.4 Success Rate Mirrored Setup with White-to-Wood Mirrored Setup with Wood-to-White Mirrored Setup for Evaluation White-to-Wood Wood-to-White Diff. + Diff. + MirrorDiff. + MirrorAug Rand. Overlay Pretrained MirrorDiff. + [PITH_FULL_IMAGE:figu… view at source ↗
Figure 5
Figure 5. Figure 5: Setup and success rate (%) for asymmetric backgrounds, in the mirrored arrangements, against the number of additional opposite-side demos. Evaluation of Square D0 under mirrored setups with local visual domain shifts: white-to-wood and wood-to-white table textures. in Sec. 3, although its denoising function is per-step equivariant, reflection symmetry is broken over the whole denoising trajectory due to in… view at source ↗
Figure 6
Figure 6. Figure 6: Illstrations of Real Task Setups. (a) Task Distribution: Each image overlay with three task arrangements. (b) Example of start and goal configuration. In-domain # M-Demos 0 5 MirrorDiff. + O 76.7 0.0 73.3 DiffPo. +M, O, P 86.7 20.0 83.3 DiffPo. + O, P 83.3 0.0 3.3 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of two plausible robot configurations and eye-in-hand views that share near [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of Reflection Equivariant Diffusion (MirrorDiffusion) Network Architecture. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Illustration of Conflicting Visual Cues and Trajectories introduced by mirrored demon￾strations in the Square D2 task. Each mirrored demonstration features an approximately co-located square nut relative to its original counterpart (e.g., Fig.(b, c) and Fig.(d, a)), yet exhibits a distinct eye-in-hand view. This discrepancy suggests that while the mirrored and original demonstrations share a similar initia… view at source ↗
Figure 10
Figure 10. Figure 10: Illustration of Conflicting Visual Cues and Trajectories introduced by mirrored demonstrations in the Three Piece Assembly D1 task. Each mirrored demonstration features an approximately co-located T-shaped piece relative to its original counterpart (e.g., (b, c) and (d, a)), yet exhibits a distinct eye-in-hand view, one oriented toward the workspace, the other facing out￾ward. This discrepancy suggests th… view at source ↗
Figure 11
Figure 11. Figure 11: Illustration of Mirroring with an Off-Centered Camera. (a) Original image from an off-centered third-person camera. (b) Roll-centered image with the end-effector aligned to the mirroring axis. (c) Mirrored version of the roll-centered image used by MirrorDuo. (d) Roll-centered image from the actual mirrored setup. To transfer the mirrored skill to the initial configuration based on the given demonstration… view at source ↗
Figure 12
Figure 12. Figure 12: Illustration of Off-centered camera view for Stack Three D1 Following previous setups, we evaluate MirrorDiffusion, Diffusion + MirrorAug, and the Diffusion baseline using re-rendered demonstrations under off-centered cameras. In the one-sided case, we assess performance on mirrored arrangements with 0, 5, and 10 additional demonstrations. For the two-sided case, we directly evaluate in-domain performance… view at source ↗
read the original abstract

Image-based behaviour cloning leverages demonstrations captured from ubiquitous RGB cameras. However, it remains constrained by the cost of collecting diverse demos, especially for generalizing across workspace variations. We propose MirrorDuo, a reflection-based formulation that operates on image, proprioception, and full 6-DoF end-effector action tuples, generating a mirrored counterpart for each original demonstration, effectively achieving "collect one, get one for free". It can be applied as a data augmentation strategy for existing learning pipelines, such as standard behaviour cloning or diffusion policy, or as a structural prior for reflection-equivariant policy networks. By leveraging the overlap between the original and mirrored domains, MirrorDuo achieves significantly improved performance under the same data budget when demonstrations are evenly distributed across both sides of the workspace. When demonstrations are confined to one side, MirrorDuo enables efficient skill transfer to the mirrored workspace with as few as zero or five demos in the target arrangement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MirrorDuo, a reflection-based method that generates a mirrored counterpart for each original demonstration consisting of image, proprioception, and 6-DoF end-effector action tuples. It can be used either as data augmentation for standard behavior cloning or diffusion policies, or as a structural prior to enforce reflection-equivariance in policy networks. The central claims are that this yields significantly better performance under fixed data budgets when demonstrations are distributed across workspace sides, and enables skill transfer to the mirrored workspace with as few as zero or five target-side demonstrations.

Significance. If the reflection symmetry holds without introducing invalid trajectories, the method provides a simple, parameter-free way to effectively double demonstration data for symmetric tasks, directly addressing the high cost of collecting diverse visuomotor demonstrations. The dual use as augmentation and equivariant prior is a practical strength.

major comments (2)
  1. [§3.2] §3.2 (Mirroring Formulation): The procedure for mirroring 6-DoF end-effector poses and actions is presented as producing valid demonstrations on the same task manifold, but no derivation or explicit transformation rules are given for rotations under reflection, nor is there analysis of when this mapping preserves collision-free paths or kinematic feasibility.
  2. [§5] §5 (Experiments): Results claim performance gains and transfer with 0-5 target demos, but the evaluation does not include controlled tests on environments with partial symmetry violations (e.g., asymmetric fixtures or obstacles), which directly tests the load-bearing assumption that mirrored trajectories remain on-manifold.
minor comments (2)
  1. The abstract states performance improvements but the main text should include explicit baseline comparisons and error bars in all reported tables for the distributed-data and transfer settings.
  2. Notation for the reflection operator on images vs. actions could be unified for clarity in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point by point below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Mirroring Formulation): The procedure for mirroring 6-DoF end-effector poses and actions is presented as producing valid demonstrations on the same task manifold, but no derivation or explicit transformation rules are given for rotations under reflection, nor is there analysis of when this mapping preserves collision-free paths or kinematic feasibility.

    Authors: We agree that the current presentation in §3.2 would benefit from greater formality. In the revised manuscript we will insert an explicit derivation of the reflection operator on SE(3), specifying the action on translation (sign flip on the appropriate axis) and on rotation (conjugation by the reflection matrix, or equivalently negating the appropriate quaternion components while preserving the rotation sense). We will also add a short paragraph discussing the kinematic and collision-free conditions under which the mirrored trajectory remains on-manifold, namely that the original demonstration must itself be collision-free and that the workspace symmetry (no asymmetric fixtures) is respected. revision: yes

  2. Referee: [§5] §5 (Experiments): Results claim performance gains and transfer with 0-5 target demos, but the evaluation does not include controlled tests on environments with partial symmetry violations (e.g., asymmetric fixtures or obstacles), which directly tests the load-bearing assumption that mirrored trajectories remain on-manifold.

    Authors: The referee correctly notes that our experiments assume full reflection symmetry. We will add a dedicated limitations paragraph in §5 (and a corresponding sentence in the conclusion) that explicitly states the method’s reliance on workspace symmetry and describes the expected degradation when asymmetric obstacles or fixtures are present. Because constructing and collecting data for controlled partial-symmetry environments would require an entirely new experimental campaign outside the scope of the present study, we do not plan to add such experiments; the added discussion will instead clarify the boundary conditions under which MirrorDuo is guaranteed to produce valid demonstrations. revision: partial

Circularity Check

0 steps flagged

No circularity: method is a data-augmentation prior with empirical validation

full rationale

The paper introduces MirrorDuo as a reflection-based augmentation that generates mirrored (image, proprioception, 6-DoF action) tuples from original demonstrations. No derivation chain, fitted parameters, or predictions are presented that reduce to the inputs by construction. Performance improvements are reported via experiments under fixed data budgets; the symmetry assumption is stated explicitly as a precondition rather than derived. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. This is a standard engineering contribution whose central claims remain independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes workspace reflection symmetry and that mirrored actions remain valid.

pith-pipeline@v0.9.1-grok · 5705 in / 1127 out tokens · 24587 ms · 2026-06-26T17:27:25.760703+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1]

    J. Gao, A. Xie, T. Xiao, C. Finn, and D. Sadigh. Efficient data collection for robotic manipula- tion via compositional generalization. InProceedings of Robotics: Science and Systems (RSS), 2024

  2. [2]

    Z. Xue, S. Deng, Z. Chen, Y . Wang, Z. Yuan, and H. Xu. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning.arXiv preprint arXiv:2502.16932, 2025

  3. [3]

    Eisner, Y

    B. Eisner, Y . Yang, T. Davchev, M. Vecerik, J. Scholz, and D. Held. Deep se (3)-equivariant geometric reasoning for precise placement tasks. InThe Twelfth International Conference on Learning Representations, 2024

  4. [4]

    Ryu, H.-i

    H. Ryu, H.-i. Lee, J.-H. Lee, and J. Choi. Equivariant descriptor fields: Se (3)-equivariant energy-based models for end-to-end visual robotic manipulation learning. InThe Eleventh International Conference on Learning Representations, 2023

  5. [5]

    J. Yang, Z. Cao, C. Deng, R. Antonova, S. Song, and J. Bohg. Equibot: Sim (3)-equivariant diffusion policy for generalizable and data efficient learning. In8th Annual Conference on Robot Learning, 2024

  6. [6]

    D. Wang, R. Walters, and R. Platt.SO(2)-equivariant reinforcement learning. InInternational Conference on Learning Representations, 2022

  7. [7]

    D. Wang, S. Hart, D. Surovik, T. Kelestemur, H. Huang, H. Zhao, M. Yeatman, J. Wang, R. Walters, and R. Platt. Equivariant diffusion policy. In8th Annual Conference on Robot Learning, 2024

  8. [8]

    M. Jia, D. Wang, G. Su, D. Klee, X. Zhu, R. Walters, and R. Platt. Seil: Simulation-augmented equivariant imitation learning. In2023 IEEE International Conference on Robotics and Au- tomation (ICRA), pages 1845–1851. IEEE, 2023

  9. [9]

    D. Wang, J. Y . Park, N. Sortur, L. L. Wong, R. Walters, and R. Platt. The surprising effective- ness of equivariant models in domains with latent symmetry. InInternational Conference on Learning Representations. International Conference on Learning Representations, 2023

  10. [10]

    D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. InConference on Advances in Neural Information Processing Systems (NeurIPS), 1988

  11. [11]

    Rahmatizadeh, P

    R. Rahmatizadeh, P. Abolghasemi, L. B ¨ol¨oni, and S. Levine. Vision-based multi-task manip- ulation for inexpensive robots using end-to-end learning from demonstration.International Conference on Robotics and Automation (ICRA), 2018

  12. [12]

    Mandlekar, D

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021

  13. [13]

    Florence, C

    P. Florence, C. Lynch, A. Zeng, O. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mor- datch, and J. Tompson. Implicit behavioral cloning.Conference on Robot Learning (CoRL), 2021

  14. [14]

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023. 11

  15. [15]

    Consistency policy: Accelerated visuomotor policies via consistency distillation,

    A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation.arXiv preprint arXiv:2405.07503, 2024

  16. [16]

    Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations.Robotics: Science and Systems, 2024

  17. [17]

    Mandlekar, S

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), 2023

  18. [18]

    Hoque, A

    R. Hoque, A. Mandlekar, C. Garrett, K. Goldberg, and D. Fox. Intervengen: Interventional data generation for robust and data-efficient robot imitation learning. In2024 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 2840–2846. IEEE, 2024

  19. [19]

    Skillmimicgen: Automated demonstration gener- ation for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024

    C. Garrett, A. Mandlekar, B. Wen, and D. Fox. Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment.arXiv preprint arXiv:2410.18907, 2024

  20. [20]

    Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,

    Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning.arXiv preprint arXiv:2410.24185, 2024

  21. [21]

    Hoogeboom, V

    E. Hoogeboom, V . G. Satorras, C. Vignac, and M. Welling. Equivariant diffusion for molecule generation in 3d. InInternational conference on machine learning, pages 8867–8887. PMLR, 2022

  22. [22]

    Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. 2019 ieee. InCVF Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2018

  23. [23]

    G. Cesa, L. Lang, and M. Weiler. A program to build e (n)-equivariant steerable cnns. In International conference on learning representations, 2022

  24. [24]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Conference on Neural Information Processing Systems (NeurIPS), 2020

  25. [25]

    T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y . Chen, F. Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024

  26. [26]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InPro- ceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  27. [27]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierar- chical image database. InConference on Computer Vision and Pattern Recognition (CVPR), 2009

  28. [28]

    Burns, Z

    K. Burns, Z. Witzel, J. I. Hamid, T. Yu, C. Finn, and K. Hausman. What makes pre-trained visual representations successful for robust manipulation? In8th Annual Conference on Robot Learning, 2024

  29. [29]

    Hansen and X

    N. Hansen and X. Wang. Generalization in reinforcement learning by soft data augmentation. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611– 13617. IEEE, 2021

  30. [30]

    Zhuang, R

    Z. Zhuang, R. Wang, N. Ingelhag, V . Kyrki, and D. Kragic. Enhancing visual domain robust- ness in behaviour cloning via saliency-guided augmentation. In8th Annual Conference on Robot Learning, 2024. 12

  31. [31]

    bm9A7SfqLlZigWYdavpxjAUKIig=

    M. C. Welle, N. Ingelhag, M. Lippi, M. Wozniak, A. Gasparri, and D. Kragic. Quest2ros: An app to facilitate teleoperating robots. In7th International Workshop on Virtual, Augmented, and Mixed-Reality for Human-Robot Interactions, 2024. 13 AFormulation Derivations Eye-in-hand Local-frame Reparameterization For an eye-in-hand camera setup, let the current c...