
arxiv: 1907.00408 · v1 · pith:IOKQ2ET7new · submitted 2019-06-30 · 💻 cs.RO · cs.CV

GarmNet: Improving Global with Local Perception for Robotic Laundry Folding

Pith reviewed 2026-05-25 12:23 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords garment localization · landmark detection · robotic folding · end-to-end deep learning · multi-task perception · CloPeMa dataset · clothing manipulation

The pith

GarmNet performs garment localization and landmark detection together in one network, cutting localization error by 24.7 percent on the CloPeMa dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GarmNet as an end-to-end model that handles both global garment localization for category recognition and local landmark detection for grasping in a single network. Prior approaches treated these tasks separately, which limited how well robots could perceive varied garment states. Training and testing on 3,330 images from the CloPeMa Garment dataset shows that adding the landmark task improves localization accuracy. The result is presented as a scalable, efficient perception solution for robotic laundry folding.

Core claim

GarmNet simultaneously localizes the garment as a whole and detects landmarks for grasping. Localization supplies global information to recognize garment category, while landmark detection supports grasping actions. When landmark detection is included, garment localization error drops by 24.7 percent compared with localization alone.
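The 24.7 percent figure reads most naturally as a relative reduction in localization error, although the abstract's wording ("an error rate of 24.7% lower") does not state this explicitly. A minimal sketch of that reading follows; the two error values are made up so that the arithmetic lands on 24.7 percent and are not numbers from the paper.

```python
def relative_error_reduction(baseline_error: float, joint_error: float) -> float:
    """Return the fractional drop in error relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - joint_error) / baseline_error


# Hypothetical values, NOT taken from the paper.
baseline = 0.200   # localization-only error (assumed)
joint = 0.1506     # localization + landmark error (assumed)

reduction = relative_error_reduction(baseline, joint)
print(f"{reduction:.1%}")  # 24.7%
```

Under the alternative reading (an absolute drop of 24.7 percentage points) this formula would not apply, and the paper's result tables would be needed to tell the two apart.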

What carries the argument

GarmNet, an end-to-end deep learning model that jointly outputs garment localization and landmark detections.

If this is right

  • Robots obtain both category recognition and grasping cues from one forward pass, reducing separate processing steps.
  • The combined representation supports handling a wider range of crumpled garment configurations than single-task models.
  • Memory and compute stay low enough for deployment on robotic platforms that must run multiple domestic tasks.
  • The same joint-perception pattern can be applied to other garment types in the dataset without redesigning separate networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the joint-training benefit generalizes, similar multi-task networks could reduce error in other robotic perception problems that combine global scene understanding with local action points.
  • Real-robot folding trials would be needed to check whether the dataset error reduction produces higher end-to-end success rates under variable lighting and fabric stretch.
  • The approach leaves open whether adding more auxiliary tasks, such as grasp quality prediction, would yield further localization gains.

Load-bearing premise

The reported error reduction is produced by the joint training of localization and landmark detection rather than by differences in model size, training details, or data handling.

What would settle it

Training two models of identical capacity on the identical CloPeMa split, one with only localization and one with both tasks, and finding no meaningful difference in localization error.
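One way to make "no meaningful difference" concrete is a paired test over matched training runs. The sketch below is an assumption about how such a check could be run, not a procedure from the paper: it applies an exact sign-flip permutation test to per-seed error differences, and the error values are hypothetical.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(errors_a, errors_b):
    """Two-sided exact sign-flip test on paired per-seed error differences."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(mean(diffs))
    count = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to carry either sign, so every sign pattern is enumerated.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        if abs(flipped) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical localization errors for five seeds (NOT from the paper).
loc_only = [0.205, 0.198, 0.210, 0.201, 0.207]
joint = [0.152, 0.149, 0.158, 0.150, 0.155]

p_value = paired_sign_flip_pvalue(loc_only, joint)
print(p_value)
```

With five paired runs the smallest two-sided p-value this test can return is 2/32 = 0.0625, so more seeds would be needed before a conventional 0.05 threshold could be met.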

Figures

Figures reproduced from arXiv: 1907.00408 by Daniel Fernandes Gomes, Luis F. Teixeira, Shan Luo.

Figure 1
Figure 1: GarmNet macro view, UML [3] components diagram. The architecture is broken into three blocks (components): Feature Extractor, Landmark Detector and Garment Localizer, which output intermediate features at two depths, landmark classes and localizations, and garment class and localization. Feature extractor: We implement the feature extraction module with a Fully Convolutional Neural Network (FCNN), a 50-layer ResNet…
Figure 2
Figure 2: Landmark detector component, UML [3] representation. After one intermediate branch, two separate branches output 18 × 18 landmark proposals (classification and location). This block, implemented with convolutional layers, can be interpreted as a small fully connected network sliding over the feature extractor output. Garment localizer: To perform the localization of the piece of clothing present in the image,…
Figure 3
Figure 3: Garment localizer component, UML [3] representation. Similar to the landmark detector (Figure 2), yet fully connected layers are used. The intermediate layer outputs a 512-d vector, the classifier a 9-d vector (one-hot encoded classes) and the regressor a 3-d vector (x, y, width and height). Experiments (Section 4): Our implementation was performed using the Keras framework with the TensorFlow back-end. All experiments were carried out on …
Figure 4
Figure 4: Representative cases of the result of applying the spatial constraint loss. Top row: predictions with the composed loss; middle row: predictions without it; bottom row: the ground truth.
Figure 5
Figure 5: GarmNet-B (introduced in Section 4.5) representation using UML [3]. The output emitted by the classifier block from the Landmark Detector branch is concatenated with the Feature Extractor output before being fed into the intermediate layer.
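The Figure 2 caption describes an 18 × 18 grid of landmark proposals. As a rough illustration of how such a grid could be read back into image coordinates, the sketch below maps each grid cell to the centre of the image region it covers and keeps cells above a score threshold. The single score per cell, the 576-pixel image size, the threshold, and the function names are all assumptions made for illustration; the paper's actual decoding (including its location regression and landmark classes) is not reproduced here.

```python
GRID_SIZE = 18  # from the Figure 2 caption: 18 x 18 landmark proposals


def cell_center(row: int, col: int, image_width: int, image_height: int):
    """Map a grid cell to the pixel at the centre of the image region it covers."""
    if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
        raise IndexError("cell outside the 18 x 18 grid")
    cell_w = image_width / GRID_SIZE
    cell_h = image_height / GRID_SIZE
    return ((col + 0.5) * cell_w, (row + 0.5) * cell_h)


def pick_landmarks(scores, threshold, image_width, image_height):
    """Keep cells whose (assumed) landmark score clears the threshold."""
    kept = []
    for row in range(GRID_SIZE):
        for col in range(GRID_SIZE):
            if scores[row][col] >= threshold:
                center = cell_center(row, col, image_width, image_height)
                kept.append((center, scores[row][col]))
    return kept


# Toy score map: one confident cell, everything else zero (illustrative only).
scores = [[0.0] * GRID_SIZE for _ in range(GRID_SIZE)]
scores[4][9] = 0.93
detections = pick_landmarks(scores, threshold=0.5, image_width=576, image_height=576)
print(detections)  # [((304.0, 144.0), 0.93)]
```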
Original abstract

Developing autonomous assistants to help with domestic tasks is a vital topic in robotics research. Among these tasks, garment folding is one of them that is still far from being achieved mainly due to the large number of possible configurations that a crumpled piece of clothing may exhibit. Research has been done on either estimating the pose of the garment as a whole or detecting the landmarks for grasping separately. However, such works constrain the capability of the robots to perceive the states of the garment by limiting the representations for one single task. In this paper, we propose a novel end-to-end deep learning model named GarmNet that is able to simultaneously localize the garment and detect landmarks for grasping. The localization of the garment represents the global information for recognising the category of the garment, whereas the detection of landmarks can facilitate subsequent grasping actions. We train and evaluate our proposed GarmNet model using the CloPeMa Garment dataset that contains 3,330 images of different garment types in different poses. The experiments show that the inclusion of landmark detection (GarmNet-B) can largely improve the garment localization, with an error rate of 24.7% lower. Solutions as ours are important for robotics applications, as these offer scalable to many classes, memory and processing efficient solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes GarmNet, an end-to-end CNN for simultaneous garment localization (global category recognition) and landmark detection (for grasping) in robotic laundry folding. It evaluates the model on the CloPeMa Garment dataset (3,330 images) and reports that adding landmark detection (GarmNet-B) reduces garment localization error by 24.7% compared to localization-only training.

Significance. If the reported improvement is attributable to joint training rather than capacity differences and generalizes beyond the fixed dataset, the multi-task formulation could provide a scalable, efficient perception module for domestic robotics tasks involving deformable objects. The work addresses a practical gap between separate global-pose and local-landmark pipelines.

major comments (3)
  1. [Abstract / Experiments] Abstract and experiments section: The central claim of a 24.7% localization error reduction for GarmNet-B is presented without any description of the baseline architecture (GarmNet-A), parameter counts, backbone depth, training schedule, data augmentation, or loss-weighting scheme. Without an explicit statement that the only difference is the added landmark head and multi-task loss, the improvement cannot be isolated from confounding factors such as increased model capacity.
  2. [Experiments] Experiments: No information is supplied on train/test splits, cross-validation, statistical significance testing, or variance across runs. The reported error reduction is therefore an empirical fit on a single fixed dataset whose robustness to different partitions or hyperparameter choices remains unverified.
  3. [Introduction / Conclusion] Introduction and conclusion: The paper assumes that performance on the CloPeMa dataset will transfer to real robotic folding scenarios, yet no domain-shift, sim-to-real, or physical-robot experiments are described to support this transfer claim.
minor comments (2)
  1. [Abstract] The abstract states the model is 'scalable to many classes, memory and processing efficient' but provides no supporting measurements (e.g., FLOPs, parameter counts, inference time) relative to single-task baselines.
  2. [Methods] The abstract uses the variant name GarmNet-B without defining it, and the Figure 5 caption indicates that it is introduced only in Section 4.5; a clear definition of both variants (GarmNet-A vs. GarmNet-B) and a diagram in the methods section would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our multi-task approach. We address each point below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and experiments section: The central claim of a 24.7% localization error reduction for GarmNet-B is presented without any description of the baseline architecture (GarmNet-A), parameter counts, backbone depth, training schedule, data augmentation, or loss-weighting scheme. Without an explicit statement that the only difference is the added landmark head and multi-task loss, the improvement cannot be isolated from confounding factors such as increased model capacity.

    Authors: We agree that additional architectural and training details are needed to isolate the effect of joint training. In the revised manuscript, we will expand the experiments section with a table comparing GarmNet-A and GarmNet-B, explicitly stating that the backbone, parameter counts (except for the added landmark head), training schedule, data augmentation, and loss weighting remain identical, with the sole difference being the addition of the landmark detection head and its multi-task loss term. revision: yes

  2. Referee: [Experiments] Experiments: No information is supplied on train/test splits, cross-validation, statistical significance testing, or variance across runs. The reported error reduction is therefore an empirical fit on a single fixed dataset whose robustness to different partitions or hyperparameter choices remains unverified.

    Authors: We will add the train/test split details (proportions and any randomization seed) used for the 3,330-image CloPeMa dataset to the experiments section. The original evaluation was performed on a single fixed partition without multiple runs; we will note this limitation explicitly and, where possible, report results from additional runs with varied seeds to provide variance estimates. revision: partial

  3. Referee: [Introduction / Conclusion] Introduction and conclusion: The paper assumes that performance on the CloPeMa dataset will transfer to real robotic folding scenarios, yet no domain-shift, sim-to-real, or physical-robot experiments are described to support this transfer claim.

    Authors: The manuscript evaluates the perception module on the CloPeMa dataset and discusses its relevance to robotic applications. We will revise the introduction and conclusion to remove any implication of direct transfer, instead stating that the results demonstrate improved perception on this dataset and that validation on physical robots or under domain shift remains future work. revision: yes
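The seeded split and variance reporting promised in response 2 could look roughly like the following. This is a sketch under stated assumptions: the 80/20 proportion, the seed value, and the per-run error values are hypothetical, and only the dataset size of 3,330 images comes from the abstract.

```python
import random
from statistics import mean, stdev

NUM_IMAGES = 3330  # dataset size stated in the abstract


def seeded_split(num_items: int, test_fraction: float, seed: int):
    """Shuffle item indices with a fixed seed and cut off a held-out test set."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    num_test = round(num_items * test_fraction)
    return indices[num_test:], indices[:num_test]


def summarize(errors):
    """Mean and sample standard deviation across runs."""
    return mean(errors), stdev(errors)


# The 0.2 test fraction and seed 0 are assumptions, not the paper's settings.
train_idx, test_idx = seeded_split(NUM_IMAGES, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 2664 666

# Hypothetical per-seed localization errors (NOT from the paper).
run_errors = [0.152, 0.149, 0.158, 0.150, 0.155]
avg, spread = summarize(run_errors)
print(f"{avg:.4f} +/- {spread:.4f}")
```

Recording the seed alongside the split proportions lets a reader regenerate the exact partition, which is what the referee's second comment asks for.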

Circularity Check

0 steps flagged

No circularity: empirical performance comparison on fixed dataset with no load-bearing derivations or self-citations

Full rationale

The paper presents an empirical ML model (GarmNet) trained and evaluated on the CloPeMa Garment dataset. The central claim is a measured 24.7% error reduction when adding a landmark detection head, reported directly from experimental results rather than any mathematical derivation, prediction, or first-principles chain. No equations, ansatzes, uniqueness theorems, or self-citations are invoked as load-bearing steps. The result is a standard train/evaluate comparison on a fixed dataset and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on empirical training of a deep network whose many weights are fitted to the CloPeMa images and on the untested assumption that the dataset captures the variability needed for robotic deployment.

free parameters (1)
  • multi-task loss weighting and backbone choice
    Hyperparameters that control how the localization and landmark losses are balanced and which CNN backbone is used; they are chosen by the authors, and the reported improvement may depend on those choices.
axioms (1)
  • Domain assumption: The CloPeMa Garment dataset of 3,330 images is representative of garment configurations encountered in robotic folding.
    All training and evaluation occur on this dataset; generalization claims depend on it.
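The loss-weighting free parameter in the ledger above can be pictured as a scalar weight on the landmark term. The sketch below uses a generic weighted sum, which is an assumption: this page does not give the paper's actual loss (which, per the Figure 4 caption, also involves a spatial constraint term), and the function name and loss values are illustrative.

```python
def joint_loss(localization_loss: float, landmark_loss: float, landmark_weight: float) -> float:
    """Weighted sum of the two task losses.

    A landmark_weight of 0 recovers localization-only training.
    """
    if landmark_weight < 0:
        raise ValueError("landmark_weight must be non-negative")
    return localization_loss + landmark_weight * landmark_loss


# Illustrative per-batch loss values (assumed, not from the paper).
loc_loss, lm_loss = 0.40, 1.20
for weight in (0.0, 0.5, 1.0):
    print(weight, joint_loss(loc_loss, lm_loss, weight))
```

Because the weight changes how strongly landmark gradients shape the shared features, a comparison between variants is easier to interpret when the weight is reported and, ideally, swept.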



Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 6 internal anchors

  1. [1]

    Corona, E., Alenyà, G., Gabas, A., Torras, C.: Active garment recognition and target grasping point detection using deep learning. Pattern Recognition 74, 629–641 (2018). https://doi.org/10.1016/j.patcog.2017.09.042, http://www.sciencedirect.com/science/article/pii/S0031320317303941

  2. [2]

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)

  3. [3]

    Engels, G., Heckel, R., Sauer, S.: UML - a universal modeling language? LNCS (10 2000). https://doi.org/10.1007/3-540-44988-4_3

  4. [4]

    Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (Jun 2010). https://doi.org/10.1007/s11263-009-0275-4, http://dx.doi.org/10.1007/s11263-009-0275-4

  5. [5]

    Girshick, R.B.: Fast R-CNN. CoRR abs/1504.08083 (2015), http://arxiv.org/abs/1504.08083

  6. [6]

    Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013), http://arxiv.org/abs/1311.2524

  7. [7]

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015), http://arxiv.org/abs/1512.03385

  8. [8]

    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)

  9. [9]

    LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. The Handbook of Brain Theory and Neural Networks (01 1995)

  10. [10]

    Lee, J.T., Bollegala, D., Luo, S.: "Touching to See" and "Seeing to Feel": Robotic Cross-modal Sensory Data Generation for Visual-Tactile Perception. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2019)

  11. [11]

    Li, Y., Chen, C.F., Allen, P.K.: Recognition of deformable object category and pose. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2014)

  12. [12]

    Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)

  13. [13]

    Luo, S., Bimbo, J., Dahiya, R., Liu, H.: Robotic tactile perception of object properties: A review. Mechatronics 48, 54–67 (2017)

  14. [14]

    Luo, S., Mou, W., Althoefer, K., Liu, H.: iCLAP: Shape recognition by combining proprioception and touch sensing. Autonomous Robots pp. 1–12 (2018)

  15. [15]

    Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: 2010 IEEE International Conference on Robotics and Automation. pp. 2308–2315 (May 2010). https://doi.org/10.1109/ROBOT.2010.5509439

  16. [16]

    Mariolis, I., Peleka, G., Kargakos, A., Malassiotis, S.: Pose and category recognition of highly deformable objects using deep learning. In: 2015 International Conference on Advanced Robotics (ICAR). pp. 655–662. IEEE (jul 2015). https://doi.org/10.1109/ICAR.2015.7251526, http://ieeexplore.ieee.org/document/7251526/

  17. [17]

    Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. CoRR abs/1506.02640 (2015), http://arxiv.org/abs/1506.02640

  18. [18]

    Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016), http://arxiv.org/abs/1612.08242

  19. [19]

    Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015), http://arxiv.org/abs/1506.01497

  20. [20]

    Seo, Y., Shin, K.-s.: Hierarchical convolutional neural networks for fashion image classification. Expert Systems with Applications 116, 328–339 (2019). https://doi.org/10.1016/j.eswa.2018.09.022, http://www.sciencedirect.com/science/article/pii/S0957417418305992

  21. [21]

    Wagner, L., Krejčová, D., Smutný, V.: CTU color and depth image dataset of spread garments. Tech. Rep. CTU-CMP-2013-25, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic (September 2013)

  22. [22]

    Yamazaki, K.: Instance recognition of clumped clothing using image features focusing on clothing fabrics and wrinkles. 2015 IEEE International Conference on Robotics and Biomimetics, IEEE-ROBIO 2015 pp. 1102–1108 (2016). https://doi.org/10.1109/ROBIO.2015.7418919, http://dx.doi.org/10.1007/s10514-016-9559-z

  23. [23]

    Yang, M., Yu, K.: Real-time clothing recognition in surveillance videos. In: Macq, B., Schelkens, P. (eds.) ICIP. pp. 2937–2940. IEEE (2011), http://dblp.uni-trier.de/db/conf/icip/icip2011.html#YangY11