pith. sign in

arxiv: 2606.27163 · v1 · pith:D4NNGM3Ynew · submitted 2026-06-25 · 💻 cs.RO · cs.AI· cs.LG

Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline)

Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.LG
keywords bimanual garment foldingvision-language-action policyreinforcement learningself value estimationadvantage estimationsim-to-real transferfailure detectioncompetition robotics
0
0 comments X

The pith

A vision-language-action policy can act as its own value function by predicting success, progress, and future quantities to improve bimanual garment folding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for bimanual garment folding that starts with a vision-language-action policy and adds the ability for the same network to output estimates of success, progress, and task-relevant future quantities. These estimates are then used inside a reinforcement-learning loop to compute advantages, detect failures during execution, and choose among candidate actions. The author combines this self-value approach with distributed training pipelines, inference-time sampling for hyperparameters, and targeted sim-to-real steps such as camera alignment and human-in-the-loop data collection. If the approach holds, robot policies become more self-contained, needing fewer separate networks while still supporting effective learning and deployment in manipulation tasks.

Core claim

The author establishes that the identical network used to predict actions can also predict success, progress, and a few task-relevant future quantities, and that these predictions are sufficient to drive advantage estimation, live failure detection, and candidate selection inside an RL loop for the folding task.

What carries the argument

The policy network serving as its own value function, where its outputs for success and progress replace separate value networks in advantage estimation and failure detection.

If this is right

  • Advantage estimation for policy improvement can be performed directly from the policy outputs without a separate critic.
  • Live failure detection becomes possible by monitoring the policy's own success predictions during rollout.
  • Action candidates can be ranked and selected using the policy's predictions of future task quantities.
  • Inference-time hyperparameter tuning via Thompson sampling can be applied on top of the self-value estimates.
  • Sim-to-real transfer is supported by combining camera alignment, heavy augmentation, and DAgger-style human-in-the-loop collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same self-prediction pattern could be tested on other long-horizon manipulation tasks that currently rely on separate value heads.
  • Removing the need for an explicit value network might lower memory and compute costs when scaling vision-language-action models to new robots.
  • The asynchronous training and hub-based rollout pipeline could be reused in other distributed robot learning setups even without the folding-specific components.

Load-bearing premise

The policy network's own predictions of success, progress, and future quantities are accurate and unbiased enough to support advantage estimation, failure detection, and candidate selection.

What would settle it

Replace the network's predicted success and progress values with random or constant numbers during both training and execution and measure whether the resulting policy still achieves top ranks on the same folding tasks.

Figures

Figures reproduced from arXiv: 2606.27163 by Ilia Larchenko.

Figure 1
Figure 1. Figure 1: The four garment types, seen from the overhead camera in the original simulation behavior-cloning dataset released by the organizers. The challenge ran in NVIDIA Isaac Lab (built on Isaac Sim) on the organizers’ released environment, assets, and success checker [1]. 1.2 Key challenges The task combines several difficulties, and most of the system exists to address one of them: 1. Cloth is deformable and ha… view at source ↗
Figure 2
Figure 2. Figure 2: System overview: the asynchronous training / rollout / DAgger flywheel. The three components share state only through the HuggingFace Hub. — is built around log-probability policy-gradient updates and doesn’t transfer cleanly to flow-matching VLAs. Though there are attempts to adapt PPO-like logic to flow-matching VLAs, some with promising results. Another problem is that valid actions occupy a tiny manifo… view at source ↗
Figure 3
Figure 3. Figure 3: AWR + RECAP toy illustration. (inverse sampling (IS) probability, normalized per episode length), and all auxiliary losses are weighted by wi while the action loss ignores it. The similar machinery is applied to BC and DAgger sources: there the per-frame priority is failure-rate-proportional, P ∝ e 3(1−SRgarment) , so garments the policy still struggles with are over-sampled, and the IS weights again debia… view at source ↗
Figure 4
Figure 4. Figure 4: A mid-episode frame from a rollout debug video. The three camera views combined. The overlay shows the per-frame success probability (S), advantage (A), reward (R), completion (C), and time-to-completion (T), with their traces over the episode. 3.2 Rollout strategies I used multiple strategies to collect rollouts that differ in the garment-sampling logic, augmentation aggressiveness, and starting states ( … view at source ↗
Figure 5
Figure 5. Figure 5: Policy architecture. A frozen SigLIP encoder feeds the Gemma-2B prefix (images, current query, state, garment type, advantage, FAST tokens, FAST query); the Gemma-300M flow-matching action expert reads the cleaned prefix and emits a 30-step action chunk. The prediction heads (orange) read from the prefix; the FM query at the tail of the action expert carries the action-conditional future predictions [PITH… view at source ↗
Figure 6
Figure 6. Figure 6: Attention mask. The garment type is not provided as an input during eval￾uation, so I implemented a very basic System-2-like logic (similar to the stage-prediction logic in the BEHAVIOR-1K solution): • During training the ground-truth garment type is provided as the input. • During evaluation the model predicts the garment type (via the auxiliary prediction head, Section 5) at the very beginning of the epi… view at source ↗
Figure 7
Figure 7. Figure 7: Dense reward on a successful long-sleeve-top episode: overhead frames at five milestones (top) and the cumulative reward (bottom). The first 0.5 is allocated gradually as the primary proximity distance closes — the gradual first checkpoint — and reaching full success brings the total to 1.0. setup. The same αs that attenuates the noisy P(success) term also attenuates the Rcum t term, which is what resolves… view at source ↗
Figure 8
Figure 8. Figure 8: AWR sampling weight as a function of advantage, P ∝ e clip(A,−2,2) . 6.8 Intuitive summary The final formula is over-engineered and could probably be simplified without losing much. The short version — an ac￾tion’s advantage is high when: • it makes objective task progress, measured by the same keypoint distances that define success; • the predicted probability of eventual success increases; • conditioned … view at source ↗
Figure 9
Figure 9. Figure 9: Thompson-sampling posteriors for short pants. Rank Team Long top Short top Long pants Short pants Overall 1 ilya (this work) 74.5% 70.0% 80.5% 93.5% 79.63% 2 Shubham @ Vorwerk 73.0% 62.5% 71.5% 87.0% 73.50% 3 Dum-E 76.5% 62.0% 75.5% 79.5% 73.38% 4 SCUT-Unlimited 65.5% 66.0% 70.0% 91.0% 73.13% 5 GraspYesAI 73.5% 61.0% 69.0% 79.0% 70.63% 6 sZs 70.5% 64.0% 68.5% 75.5% 69.63% 7 ClothFolder50k 77.0% 56.0% 58.5%… view at source ↗
Figure 10
Figure 10. Figure 10: Real-robot episodes per garment type, split by source (organizer BC, home teleop, home DAgger). 9.7 Aligning the environments The alignment half of §9.2 was mostly simple engineering: get the cameras, calibration, units, and motion to match the target as closely as I could. The single most useful tool here was a camera-overlay alignment utility ( [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The four garment types in the organizer’s real-robot BC dataset, seen from the overhead camera. These are the physical garments the real policy folds — the same four types shown in their original simulation form in §1. 9.10 DAgger collection and weighting Recovery data is the one thing the organizer BC set cannot provide — it only ever shows clean, successful folds. To get off-distribution recovery behavi… view at source ↗
Figure 12
Figure 12. Figure 12: The three cameras (top, left wrist, right wrist) for each training bucket: the organizer BC data, my own teleop/DAgger recordings, and the sim success-replays [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The three live robot cameras overlaid on the matching organizer-BC frame after driving the arms to the recorded joint state [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: The same idea across the sim/real boundary: an Isaac Sim render of the same action sequence overlaid on the real cameras, used to tune the sim’s camera offsets against the real rig. 9.11 The units bug (a cautionary tale) I lost a lot of time to a subtle data bug. Between two LeR￾obot 0.4.x releases the default state representation for the SO-101 follower changed — the five arm joints switched from a norma… view at source ↗
Figure 15
Figure 15. Figure 15: The same real camera frame under independent draws of the augmentation stack — color, gain, gamma, blur, noise, crop/rotate/zoom, and cutout. Every training frame is perturbed this way so the model cannot rely on any fixed appearance [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Per-source speed factors [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Per-frame sampling weight across an episode with two human interventions and a return to autonomous control. Human￾correction frames get the highest weight, autonomous frames far from an intervention sit at the lower weight, and the 5-second windows just before each takeover ramp to zero. The two conventions are related by a per-joint rescale, and for most of the joints the full range of motion is around … view at source ↗
read the original abstract

I describe my solution to the LeHome Challenge 2026, an ICRA 2026 competition on bimanual garment folding. The system placed 1st of 62 teams in the online (simulation) round and 2nd in the real-world final. It improves a vision-language-action (VLA) policy with a reinforcement-learning loop. The policy is its own value function: the same network that predicts actions also predicts success, progress, and a few task-relevant future quantities, and those predictions drive advantage estimation, live failure detection, and candidate selection. The work mostly recombines existing RL ideas with engineering and optimization contributions that can be used together as one recipe or individually: AWR + RECAP combined for flow-matching VLA; an asynchronous distributed training / rollout pipeline through HuggingFace Hub; inference-time hyperparameters optimization via Thompson sampling; a sim-to-real recipe with camera-alignment tooling, heavy augmentation and DAgger-like HIL data collection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper describes a 1st-place (online, 62 teams) and 2nd-place (offline) solution to the LeHome Challenge 2026 on bimanual garment folding. It augments a vision-language-action (VLA) policy with an RL loop in which the same network predicts actions together with success, progress, and task-relevant future quantities; these predictions are used for advantage estimation (via AWR+RECAP on a flow-matching VLA), live failure detection, and candidate selection. The work emphasizes engineering contributions—an asynchronous distributed training/rollout pipeline via Hugging Face Hub, Thompson sampling for inference-time hyperparameters, and a sim-to-real recipe with camera alignment, heavy augmentation, and DAgger-style HIL collection—that can be used individually or together.

Significance. If the self-value mechanism is shown to be reliable, the manuscript supplies a practical, modular recipe for applying RL to VLA policies in contact-rich robotics without separate value networks. The competition rankings provide external empirical support for the integrated system, and the listed engineering components (async pipeline, sim-to-real augmentations) are independently reusable. The absence of component-level validation, however, prevents clear attribution of performance gains to the self-predicted quantities versus the surrounding infrastructure.

major comments (2)
  1. [Abstract] Abstract: the claim that the policy network's own auxiliary heads for success, progress, and future quantities 'drive' advantage estimation, live failure detection, and candidate selection is load-bearing, yet the manuscript supplies no calibration metrics, bias analysis, or comparison against ground-truth outcomes for these heads.
  2. [Abstract] Abstract: no ablations, error bars, or controlled comparisons are reported that isolate the contribution of the self-value mechanism from the other listed engineering elements (asynchronous pipeline, Thompson sampling, sim-to-real augmentations).
minor comments (1)
  1. The abstract and title emphasize competition placement; a short methods overview paragraph would help readers locate the technical novelty before the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our competition entry. We address the two major comments below, focusing on validation of the self-predicted quantities and isolation of contributions. The work is primarily an engineering recipe validated through competition rankings rather than controlled lab experiments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the policy network's own auxiliary heads for success, progress, and future quantities 'drive' advantage estimation, live failure detection, and candidate selection is load-bearing, yet the manuscript supplies no calibration metrics, bias analysis, or comparison against ground-truth outcomes for these heads.

    Authors: We acknowledge that the manuscript lacks explicit calibration metrics, bias analysis, or direct ground-truth comparisons for the auxiliary heads. This stems from the paper's focus on the integrated competition system rather than isolated head validation. In revision we will add available post-hoc analysis from rollout logs (e.g., success prediction accuracy against observed outcomes and simple bias checks) where data permits, while noting that full controlled ground-truth collection was outside the competition scope. revision: partial

  2. Referee: [Abstract] Abstract: no ablations, error bars, or controlled comparisons are reported that isolate the contribution of the self-value mechanism from the other listed engineering elements (asynchronous pipeline, Thompson sampling, sim-to-real augmentations).

    Authors: We agree that component-level ablations would aid attribution. However, the manuscript reports an integrated system whose performance was externally validated by competition rankings (1st/62 online, 2nd offline). Conducting new controlled ablations isolating the self-value heads from the pipeline, Thompson sampling, and sim-to-real recipe is not feasible within the post-competition timeline and resource constraints. We will revise the text to more clearly state that the engineering elements are modular and reusable independently, and to frame the competition results as system-level evidence rather than per-component proof. revision: no

Circularity Check

0 steps flagged

No circularity; external competition ranking validates performance independently of internal predictions.

full rationale

The paper describes an applied engineering system (VLA policy with auxiliary heads for success/progress/future quantities driving AWR-style advantage, failure detection, and selection) that recombines existing RL methods plus engineering contributions. Its central claim is competition placement (1st/62 online, 2nd offline), an external benchmark independent of any fitted parameters or self-referential equations. No derivation chain, equations, or load-bearing self-citations are present that reduce to the paper's own inputs by construction. The auxiliary predictions are used as described but their accuracy is not claimed via internal math; success is measured externally. This matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the approach relies on standard RL components from prior literature with no new axioms or invented entities; one free parameter set is the inference-time hyperparameters tuned by Thompson sampling.

free parameters (1)
  • inference-time hyperparameters
    Optimized via Thompson sampling during deployment to select model behavior settings.

pith-pipeline@v0.9.1-grok · 5708 in / 1290 out tokens · 46060 ms · 2026-06-26T04:56:52.489793+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 4 canonical work pages

  1. [1]

    1st Simulation- Driven Competition on Deformable Object Manipula- tion

    R. Wu, Y . Wang, Z. Li, Y . Chen, A. Longhini, et al. (LeHome Challenge Organizers).LeHome Chal- lenge 2026 — Challenge on Garment Manipulation Skill Learning in Household Scenarios(“1st Simulation- Driven Competition on Deformable Object Manipula- tion”; ICRA 2026 competition, Vienna, 1–5 June 2026). Website: https://lehome-challenge.com/ ; code: https:/...

  2. [2]

    Cadene, S

    R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zoui- tine, et al.LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch.2024. https: //github.com/huggingface/lerobot

  3. [3]

    SO-ARM100 / SO-ARM101: Standard Open Arms.2024

    The Robot Studio, in collaboration with Hugging Face. SO-ARM100 / SO-ARM101: Standard Open Arms.2024. https://github.com/TheRobotStudio/SO -ARM100

  4. [4]

    S. Gao, M. Pagnucco, T. Bednarz, Y . Song.NVIDIA Isaac Sim: Enabling Scalable, GPU-Accelerated Simu- lation for Robotics.arXiv:2606.03551, 2026. https: //arxiv.org/abs/2606.03551

  5. [5]

    X. B. Peng, A. Kumar, G. Zhang, S. Levine.Advantage- Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning.arXiv:1910.00177, 2019. ht tps://arxiv.org/abs/1910.00177

  6. [6]

    π∗ 0.6: a VLA That Learns From Experience.arXiv:2511.14759, 2025

    Physical Intelligence et al. π∗ 0.6: a VLA That Learns From Experience.arXiv:2511.14759, 2025. https: //arxiv.org/abs/2511.14759

  7. [7]

    Larchenko, G

    I. Larchenko, G. Zarin, A. Karnatak.Task Adaptation of Vision-Language-Action Model: 1st Place Solution for the 2025 BEHAVIOR Challenge.arXiv:2512.06951, 2025.https://arxiv.org/abs/2512.06951

  8. [8]

    Black et al

    K. Black et al. (Physical Intelligence). π0.5: a Vision- Language-Action Model with Open-World Generaliza- tion.arXiv:2504.16054, 2025. https://arxiv.or g/abs/2504.16054

  9. [9]

    A. Nair, A. Gupta, M. Dalal, S. Levine.AWAC: Ac- celerating Online Reinforcement Learning with Offline Datasets.arXiv:2006.09359, 2020. https://arxiv. org/abs/2006.09359

  10. [10]

    S. Ross, G. Gordon, D. Bagnell.A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAgger).AISTATS 2011, PMLR 15, pp. 627– 635.https://arxiv.org/abs/1011.0686

  11. [11]

    J. Luo, C. Xu, J. Wu, S. Levine.Precise and Dexter- ous Robotic Manipulation via Human-in-the-Loop Rein- forcement Learning (HIL-SERL).Science Robotics, vol. 10, eads5033, 2025. DOI:10.1126/scirobotics.ads5033. arXiv:2410.21845. https://www.science.org/ doi/10.1126/scirobotics.ads5033

  12. [12]

    Schulman, F

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov.Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017. https://arxiv.org/ abs/1707.06347

  13. [13]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, et al. DeepSeekMath: Pushing the Limits of Mathematical Rea- soning in Open Language Models.arXiv:2402.03300, 2024.https://arxiv.org/abs/2402.03300

  14. [14]

    Y . Sun, J. Shen, Y . Wang, T. Chen, Z. Wang, M. Zhou, H. Zhang.Improving Data Efficiency for LLM Reinforce- ment Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay.arXiv:2506.05316, 2025.https://arxiv.org/abs/2506.05316

  15. [15]

    X. Zhai, B. Mustafa, A. Kolesnikov, L. Beyer.Sigmoid Loss for Language Image Pre-Training.ICCV 2023. arXiv:2303.15343. https://arxiv.org/abs/23 03.15343 24

  16. [16]

    arXiv:2403.08295, 2024

    Gemma Team, Google DeepMind.Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295, 2024. https://arxiv.org/ abs/2403.08295

  17. [17]

    Lipman, R

    Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, M. Le.Flow Matching for Generative Modeling.ICLR

  18. [18]

    https://arxiv.org/ab s/2210.02747

    arXiv:2210.02747. https://arxiv.org/ab s/2210.02747

  19. [19]

    Pertsch et al.FAST: Efficient Action Tokenization for Vision-Language-Action Models.arXiv:2501.09747, 2025.https://arxiv.org/abs/2501.09747

    K. Pertsch et al.FAST: Efficient Action Tokenization for Vision-Language-Action Models.arXiv:2501.09747, 2025.https://arxiv.org/abs/2501.09747

  20. [20]

    Zhai.Exclusive Self Attention.arXiv:2603.09078, 2026.https://arxiv.org/abs/2603.09078

    S. Zhai.Exclusive Self Attention.arXiv:2603.09078, 2026.https://arxiv.org/abs/2603.09078

  21. [21]

    arXiv:2507.05331, 2025

    TRI LBM Team.A Careful Examination of Large Be- havior Models for Multitask Dexterous Manipulation. arXiv:2507.05331, 2025. https://arxiv.org/ab s/2507.05331

  22. [22]

    Schulman, P

    J. Schulman, P. Moritz, S. Levine, M. I. Jordan, P. Abbeel.High-Dimensional Continuous Control Us- ing Generalized Advantage Estimation.ICLR 2016. arXiv:1506.02438. https://arxiv.org/abs/ 1506.02438

  23. [23]

    A. Deng, Y . Xu, R. Kohavi, T. Walker.Improving the Sensitivity of Online Controlled Experiments by Utiliz- ing Pre-Experiment Data.WSDM 2013, pp. 123–132. DOI:10.1145/2433396.2433413. https://exp-pla tform.com/Documents/2013-02-CUPED-I mprovingSensitivityOfControlledExper iments.pdf

  24. [24]

    Black, M

    K. Black, M. Y . Galliker, S. Levine (Physical Intelli- gence).Real-Time Execution of Action Chunking Flow Policies.arXiv:2506.07339, 2025. https://arxiv. org/abs/2506.07339

  25. [25]

    J. Ho, T. Salimans.Classifier-Free Diffusion Guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications; arXiv:2207.12598, 2022. https://arxiv.org/abs/2207.12598

  26. [26]

    W. R. Thompson.On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples.Biometrika, vol. 25, no. 3–4, pp. 285– 294, 1933. DOI:10.1093/biomet/25.3-4.285. https: //doi.org/10.1093/biomet/25.3-4.285

  27. [27]

    Russo, B

    D. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen. A Tutorial on Thompson Sampling.Foundations and Trends in Machine Learning, vol. 11, no. 1, pp. 1–96,

  28. [28]

    https://arxiv

    DOI:10.1561/2200000070. https://arxiv. org/abs/1707.02038 25