pith. sign in

arxiv: 2504.19455 · v2 · submitted 2025-04-28 · 💻 cs.CV

Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition

Pith reviewed 2026-05-22 17:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords masked language promptinggenerative data augmentationfew-shot fashion style recognitionlarge language modelstext-to-image generationFashionStyle14data augmentation for recognition
0
0 comments X

The pith

Masked Language Prompting generates diverse yet consistent fashion images to improve few-shot style recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Masked Language Prompting to create better synthetic training data for recognizing fashion styles when only a few real examples are on hand. The idea is to take a caption describing a style, hide some words, and ask a large language model to fill in alternatives that keep the overall meaning but change details like colors or fabrics. These new captions then go to a text-to-image model to make pictures that look varied but still belong to the same style. A reader would care if this works because collecting lots of labeled fashion images is expensive and subjective, so smarter augmentation could let systems perform well with less data.

Core claim

The paper claims that Masked Language Prompting for generative data augmentation consistently outperforms class-name and caption-based baselines on the FashionStyle14 dataset in few-shot fashion style recognition tasks. By masking selected words in reference captions and using large language models to generate completions, the method introduces attribute-level variations while preserving structural semantics and style consistency, allowing effective image synthesis without model fine-tuning.

What carries the argument

Masked Language Prompting (MLP), which masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions that support style-consistent image generation.

If this is right

  • MLP-based augmentation leads to higher accuracy in few-shot fashion style recognition compared to simpler prompting methods.
  • The generated images maintain style semantics while providing more visual diversity.
  • No fine-tuning of the text-to-image model is required for the augmentation to work.
  • The approach balances diversity and consistency better than using full captions or just class names.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method succeeds, it suggests that language model completions can effectively expand limited visual datasets in domains with ambiguous categories.
  • Similar masking strategies might help in other generative augmentation settings where semantic control is needed.
  • Further work could explore how different masking ratios affect the quality of the resulting images.

Load-bearing premise

That masking selected words in a reference caption and prompting an LLM for completions will reliably produce attribute-level variations that preserve the intended style semantics while increasing visual diversity, and that the downstream text-to-image model will render these variations faithfully without any fine-tuning.

What would settle it

Observing that the few-shot recognition model trained on MLP-augmented data performs no better than one trained on standard caption-augmented data on the FashionStyle14 test set would falsify the effectiveness claim.

Figures

Figures reproduced from arXiv: 2504.19455 by Ryotaro Shimizu, Yuki Hirakawa.

Figure 1
Figure 1. Figure 1: c). Recent advances in text-to-image (T2I) mod￾els [37, 38, 40, 53] enable generative augmentation [14, 16, 24, 52, 57] from textual prompts, offering semanti- (a) Original (b) RandAug. [10] (c) AutoAug. [9] (d) Class (e) Caption (f) MLP (ours) [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Masked language prompting. Our augmentation pipeline consists of four steps: (1) captioning, (2) masking, (3) fill-in-the￾masks, and (4) text-to-image generation. This process enables effective synthesis of images that preserve the style semantics of the original reference while introducing attribute-level variations. wearing [CAPTION].”) improves visual fidelity, but often leads to low diversity (Fig. 1e)… view at source ↗
Figure 3
Figure 3. Figure 3: Visualization of image-caption pairs generated using MLP, including both successful and failure cases. The leftmost column shows a reference fairy style image with a GPT-generated caption. Columns 2-5 display image-caption pairs generated by MLP conditioned on the reference, where columns 2-3 successfully capture key characteristics of the fairy style—e.g., aesthetics of cuteness, delicacy, and soft color … view at source ↗
Figure 4
Figure 4. Figure 4: Wordclouds of LLM-generated completions for masked captions. Visualized words are aggregated from com￾pletions generated under the nshot = 16 setting. themed” replaced with “streetwear”) can result in images misaligned with the intended style. Filtering such inconsis￾tencies may further enhance the effectiveness of MLP. 4. Conclusion We propose Masked Language Prompting (MLP), a gen￾erative augmentation fr… view at source ↗
Figure 5
Figure 5. Figure 5: shows the ablation study on the masking ratio used in the Masked Language Prompting (MLP) framework. We evaluate three masking ratios—0.25, 0.5, and 0.75—and observe that a ratio of 0.5 yields the highest relative im￾provement in classification accuracy over the baseline Cap￾tion model. Accordingly, we adopt 0.5 as the default mask￾ing ratio in our main experiments. 6.3. Implementation Details of Image Gen… view at source ↗
Figure 6
Figure 6. Figure 6: Synthetic images generated by MLP. The leftmost column of each row shows reference images along with their captions, while the remaining columns display results generated by the MLP. Words highlighted in red indicate those completed by the MLP. Rows 1–2 show conservative, 3–4 ethnic, 5–6 lolita style, 7-8 street style examples of generated image–caption pairs. 4 [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Synthetic images generated by Caption. The leftmost column of each row shows reference images, while the remaining columns display results generated by the Caption. Rows 1–2 show conservative, 3–4 ethnic, 5–6 lolita style, 7-8 street style examples of generated images. Row indices align with the examples shown in [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Constructing dataset for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose \textbf{Masked Language Prompting (MLP)}, a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Masked Language Prompting (MLP), a technique that selects and masks words in reference captions for fashion images, prompts an LLM to generate completions, and feeds the resulting captions to an off-the-shelf text-to-image model to create augmented training data. The central claim is that this produces style-consistent yet visually diverse samples that improve few-shot recognition accuracy on the FashionStyle14 dataset relative to baselines that use only class names or unmasked reference captions.

Significance. If the core assumption holds—that masked LLM completions reliably preserve style semantics while adding useful visual diversity—the approach offers a lightweight, fine-tuning-free way to augment subjective visual categories. This could be relevant for other domains with ambiguous labels where generative models are available but direct supervision is scarce.

major comments (2)
  1. [Section 4] The manuscript provides no direct quantitative validation that the generated images remain faithful to the original style label. Section 4 (Experiments) reports only downstream few-shot classifier accuracy; it does not include style-classifier accuracy on the synthetic images, CLIP-style similarity scores to reference captions, or human consistency ratings that would confirm the weakest assumption does not introduce label noise.
  2. [Abstract and Section 4.3] The claim of consistent outperformance is stated in the abstract and Section 4.3 but the provided text supplies no numerical results, standard deviations, statistical significance tests, or details on dataset splits and controls for prompt length or LLM choice. This makes it impossible to assess whether the reported gains are robust or attributable to the masking strategy itself.
minor comments (2)
  1. [Section 3] Notation for the masking procedure and the exact prompt template sent to the LLM should be formalized with an equation or pseudocode in Section 3 to improve reproducibility.
  2. [Figures in Section 4] Figure captions and axis labels in the experimental plots are insufficiently descriptive; they should explicitly state the number of shots, the backbone classifier, and the augmentation ratio used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes we will incorporate in the revised manuscript.

read point-by-point responses
  1. Referee: [Section 4] The manuscript provides no direct quantitative validation that the generated images remain faithful to the original style label. Section 4 (Experiments) reports only downstream few-shot classifier accuracy; it does not include style-classifier accuracy on the synthetic images, CLIP-style similarity scores to reference captions, or human consistency ratings that would confirm the weakest assumption does not introduce label noise.

    Authors: We agree that direct validation of style fidelity would strengthen the claims. While downstream accuracy provides an indirect but task-relevant measure, we will add CLIP similarity scores between generated images and reference captions plus a human consistency study in the revised Section 4 to quantify style preservation and rule out label noise. revision: yes

  2. Referee: [Abstract and Section 4.3] The claim of consistent outperformance is stated in the abstract and Section 4.3 but the provided text supplies no numerical results, standard deviations, statistical significance tests, or details on dataset splits and controls for prompt length or LLM choice. This makes it impossible to assess whether the reported gains are robust or attributable to the masking strategy itself.

    Authors: We apologize for the lack of explicit numbers in the abstract and insufficient controls in Section 4.3. The full results appear in tables in Section 4; we will update the abstract with key accuracies and standard deviations, add p-values from significance tests, and expand Section 4.3 with dataset split details plus controls for prompt length and LLM choice. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal validated on external dataset against independent baselines

full rationale

The paper introduces Masked Language Prompting as a prompting technique for generating augmented captions from reference text, then uses an off-the-shelf text-to-image model to synthesize images for few-shot training. Evaluation consists of direct accuracy comparisons on the FashionStyle14 dataset against class-name and caption-based baselines. No equations, derivations, or fitted parameters are defined in terms of the target performance metric. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on experimental outcomes that are falsifiable against external methods and data, making the derivation chain self-contained rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the untested reliability of current LLMs for style-preserving completions and of text-to-image models for faithful rendering; no new entities or fitted parameters are introduced in the abstract.

axioms (2)
  • domain assumption Large language models can generate completions for masked captions that remain semantically aligned with the original style concept.
    Invoked as the mechanism that enables attribute-level variation without breaking style consistency.
  • domain assumption Text-to-image models produce images that faithfully reflect the semantic content of the LLM-generated captions without additional training.
    Required for the augmentation pipeline to deliver usable training images.

pith-pipeline@v0.9.0 · 5672 in / 1324 out tokens · 70477 ms · 2026-05-22T17:57:38.468281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 5 internal anchors

  1. [1]

    Fashion forward: Forecasting visual style in fashion

    Ziad Al-Halah, Rainer Stiefelhagen, and Kristen Grauman. Fashion forward: Forecasting visual style in fashion. In Pro- ceedings of the IEEE International Conference on Computer Vision, pages 388–397, 2017. 1

  2. [2]

    Conceptual framework of hybrid style in fashion im- age datasets for machine learning

    Hyosun An, Kyo Young Lee, Yerim Choi, and Minjung Park. Conceptual framework of hybrid style in fashion im- age datasets for machine learning. Fashion and Textiles, 10 (1):18, 2023. 1

  3. [3]

    Synthetic data from diffusion models improves imagenet classification

    Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mo- hammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023. 1

  4. [4]

    The fashion system

    Roland Barthes. The fashion system . Univ of California Press, 1990. 1

  5. [5]

    One-shot unsupervised do- main adaptation with personalized diffusion models

    Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalo- geiton, and St´ephane Lathuili`ere. One-shot unsupervised do- main adaptation with personalized diffusion models. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 698–708, 2023. 1

  6. [6]

    NLTK: The natural language toolkit

    Steven Bird and Edward Loper. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain,

  7. [7]

    Association for Computational Linguistics. 2

  8. [8]

    Apparel classi- fication with style

    Lukas Bossard, Matthias Dantone, Christian Leistner, Chris- tian Wengert, Till Quack, and Luc Van Gool. Apparel classi- fication with style. In Proceedings of the 11th Asian Confer- ence on Computer Vision - Volume Part IV , page 321–335. Springer-Verlag, 2012. 1

  9. [9]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C. Lawrence Zit- nick. Microsoft coco captions: Data collection and evalu- ation server. CoRR, abs/1504.00325, 2015. 2

  10. [10]

    Autoaugment: Learning augmentation strategies from data

    Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasude- van, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 113–123, 2019. 1, 2, 3

  11. [11]

    Randaugment: Practical automated data augmenta- tion with a reduced search space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmenta- tion with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020. 1, 2, 3

  12. [12]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 248–255, 2009. 1

  13. [13]

    Terrance Devries and Graham W. Taylor. Improved regular- ization of convolutional neural networks with cutout. ArXiv, abs/1708.04552, 2017. 1

  14. [14]

    Gonzalez, Aditi Raghunathan, and Anna Rohrbach

    Lisa Dunlap, Clara Mohri, Devin Guillory, Han Zhang, Trevor Darrell, Joseph E. Gonzalez, Aditi Raghunathan, and Anna Rohrbach. Using language to extend to unseen do- mains. In The Eleventh International Conference on Learn- ing Representations, 2023. 1, 3

  15. [15]

    Gonzalez, and Trevor Darrell

    Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, and Trevor Darrell. Diversify your vi- sion datasets with automatic diffusion-based augmentation. In Thirty-seventh Conference on Neural Information Pro- cessing Systems, 2023. 1, 3

  16. [16]

    rembg, 2025

    Daniel Gatis. rembg, 2025. 2

  17. [17]

    Is synthetic data from generative models ready for image recognition? In The International Conference on Learning Representations ,

    Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? In The International Conference on Learning Representations ,

  18. [18]

    Aug- mix: A simple method to improve robustness and uncertainty under data shift

    Dan Hendrycks*, Norman Mu*, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Aug- mix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020. 2, 3, 1

  19. [19]

    Learning the latent “look”: Unsupervised discovery of a style-coherent embed- ding from fashion images

    Wei-Lin Hsiao and Kristen Grauman. Learning the latent “look”: Unsupervised discovery of a style-coherent embed- ding from fashion images. Proceedings of the IEEE Interna- tional Conference on Computer Vision (ICCV), pages 4213– 4222, 2017. 1

  20. [20]

    Active gen- eration for image classification

    Tao Huang, Jiaqi Liu, Shan You, and Chang Xu. Active gen- eration for image classification. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLVIII , pages 270– 286, 2024. 3

  21. [21]

    Re- thinking fid: Towards a better evaluation metric for image generation

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Re- thinking fid: Towards a better evaluation metric for image generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 9307–9315,

  22. [22]

    Fancy: Human-centered, deep learning-based framework for fashion style analysis

    Youngseung Jeon, Seungwan Jin, and Kyungsik Han. Fancy: Human-centered, deep learning-based framework for fashion style analysis. In Proceedings of the Web Conference 2021, pages 2367–2378. Association for Computing Machinery,

  23. [23]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Sys- tems, pages 26565–26577. Curran Associates, Inc., 2022. 3, 2

  24. [24]

    Berg, and Tamara L

    Mohammad Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Hipster wars: Discovering el- ements of fashion styles. In Proceedings of ECCV , pages 472–488, 2014. 1

  25. [25]

    Datadream: Few-shot guided dataset generation

    Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, and Zeynep Akata. Datadream: Few-shot guided dataset generation. In European Conference on Computer Vision, pages 252–268. Springer, 2024. 1

  26. [26]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 1

  27. [27]

    Image data augmentation approaches: A comprehensive survey and future directions

    Teerath Kumar, Rob Brennan, Alessandra Mileo, and Ma- lika Bendechache. Image data augmentation approaches: A comprehensive survey and future directions. IEEE Access, 12:187536–187571, 2024. 1

  28. [28]

    Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 1 5

  29. [29]

    Semantic-guided generative image augmentation method with diffusion models for image classification

    Bohan Li, Xiao Xu, Xinghao Wang, Yutai Hou, Yunlong Feng, Feng Wang, Xuanliang Zhang, Qingfu Zhu, and Wanx- iang Che. Semantic-guided generative image augmentation method with diffusion models for image classification. In Proceedings of the Thirty-Eighth AAAI Conference on Artifi- cial Intelligence and Thirty-Sixth Conference on Innovative Applications of...

  30. [30]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML,

  31. [31]

    SDXL-Lightning: Progressive Adversarial Diffusion Distillation

    Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl- lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024. 3, 2

  32. [32]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learning Representations, 2019. 3, 2

  33. [33]

    Knowledge enhanced neural fashion trend forecasting

    Yunshan Ma, Yujuan Ding, Xun Yang, Lizi Liao, Wai Ke- ung Wong, and Tat-Seng Chua. Knowledge enhanced neural fashion trend forecasting. In Proceedings of the 2020 Inter- national Conference on Multimedia Retrieval, pages 82–90. Association for Computing Machinery, 2020. 1

  34. [34]

    Geostyle: Discovering fashion trends and events

    Utkarsh Mall, Kevin Matzen, Bharath Hariharan, Noah Snavely, and Kavita Bala. Geostyle: Discovering fashion trends and events. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision , pages 411–420,

  35. [35]

    Automated flower classification over a large number of classes

    Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics and Image Processing, pages 722–729, 2008. 1

  36. [36]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2

  37. [37]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 2, 3

  38. [38]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv, abs/2204.06125, 2022. 1

  39. [39]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1

  40. [40]

    Cap2aug: Caption guided image data aug- mentation

    Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, and Rama Chellappa. Cap2aug: Caption guided image data aug- mentation. In 2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV) , pages 9125–9135,

  41. [41]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 1

  42. [42]

    Fashion-specific ambiguous expression interpretation with partial visual-semantic embedding

    Ryotaro Shimizu, Takuma Nakamura, and Masayuki Goto. Fashion-specific ambiguous expression interpretation with partial visual-semantic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3497–3502, 2023. 1

  43. [43]

    Fashion intelligence system: An outfit in- terpretation utilizing images and rich abstract tags

    Ryotaro Shimizu, Yuki Saito, Megumi Matsutani, and Masayuki Goto. Fashion intelligence system: An outfit in- terpretation utilizing images and rich abstract tags. Expert Systems with Applications, 213:119167, 2023. 1

  44. [44]

    A fashion item recommendation model in hyperbolic space

    Ryotaro Shimizu, Yu Wang, Masanari Kimura, Yuki Hi- rakawa, Takashi Wada, Yuki Saito, and Julian McAuley. A fashion item recommendation model in hyperbolic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , pages 8377–8383, 2024. 1

  45. [45]

    Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction

    Edgar Simo-Serra and Hiroshi Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 298–307, 2016. 1

  46. [46]

    What Makes a Style: Experimental Analysis of Fashion Prediction

    Moeko Takagi, Edgar Simo-Serra, Satoshi Iizuka, and Hi- roshi Ishikawa. What Makes a Style: Experimental Analysis of Fashion Prediction. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), 2017. 1, 2, 3

  47. [47]

    Effective data augmentation with diffu- sion models

    Brandon Trabucco, Kyle Doherty, Max A Gurinas, and Rus- lan Salakhutdinov. Effective data augmentation with diffu- sion models. In The Twelfth International Conference on Learning Representations, 2024. 1

  48. [48]

    Sch ¨onberger, Juan Nunez- Iglesias, Franc ¸ois Boulogne, Joshua D

    St ´efan van der Walt, Johannes L. Sch ¨onberger, Juan Nunez- Iglesias, Franc ¸ois Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: image processing in python. PeerJ, 2:e453, 2014. 3

  49. [49]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Per- ona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 1

  50. [50]

    Attributed synthetic data generation for zero- shot image classification

    Shijian Wang, Linxin Song, Ryotaro Shimizu, Masayuki Goto, et al. Attributed synthetic data generation for zero- shot image classification. In Synthetic Data for Computer Vision Workshop@ CVPR 2024, 2024. 1

  51. [51]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing , 13(4): 600–612, 2004. 4, 2

  52. [52]

    Fashion captioning: Towards generating accurate descrip- tions with semantic rewards

    Xuewen Yang, Heming Zhang, Di Jin, Yingru Liu, Chi-Hao Wu, Jianchao Tan, Dongliang Xie, Jue Wang, and Xin Wang. Fashion captioning: Towards generating accurate descrip- tions with semantic rewards. In ECCV, 2020. 1, 2

  53. [53]

    Using syn- thetic data for data augmentation to improve classification accuracy

    Zhou Yongchao, Sahak Hshmat, and Ba Jimmy. Using syn- thetic data for data augmentation to improve classification accuracy. In Proceedings of the Workshop on Challenges in Deployable Machine Learning, ICML, 2023. 1

  54. [54]

    Scaling autoregressive models for content-rich text-to-image generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, 6 Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learn- ing Research, 2...

  55. [55]

    Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 6023–6032, 2019. 2, 3, 1

  56. [56]

    Dauphin, and David Lopez-Paz

    Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimiza- tion. In International Conference on Learning Representa- tions, 2018. 2, 3, 1

  57. [57]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 4, 2

  58. [58]

    Expanding small-scale datasets with guided imag- ination

    Yifan Zhang, Daquan Zhou, Bryan Hooi, Kai Wang, and Ji- ashi Feng. Expanding small-scale datasets with guided imag- ination. In Advances in Neural Information Processing Sys- tems, 2023. 1

  59. [59]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceed- ings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020. 1 7 Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition Supplementary Material

  60. [60]

    Which of these two outfits looks more bohemian style?

    Related Work 5.1. Fashion Style Recognition Fashion style recognition [2, 7, 18, 21, 23, 44, 45] has emerged as a key task in image understanding, with promis- ing applications in fashion recommendation [42] and trend analysis [1, 32, 33]. Unlike objective attributes such as gar- ment category or color, “style” represents a highly abstract, subjective, an...

  61. [61]

    Prompts Tab

    Experiments 6.1. Prompts Tab. 3 and Tab. 4 show the prompts used in step 1 (caption- ing) and step 3 ( fill-in-the-masks) of our MLP framework, respectively. 1 Captioning step Instruction: Please generate a caption of the provided images within 30 words. Note: • Colors, categories, designs of each piece of fash- ion item MUST be described. • Overall fashi...

  62. [62]

    • Risk of style misalignment : When the masked words are inappropriately completed by the LLMs, the resulting image may deviate from the intended style

    Limitations We acknowledge the following limitations of our work. • Risk of style misalignment : When the masked words are inappropriately completed by the LLMs, the resulting image may deviate from the intended style. In the cur- rent implementation, masking candidates are randomly sampled from nouns and adjectives, regardless of their semantic influence...