Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition
Pith reviewed 2026-05-22 17:57 UTC · model grok-4.3
The pith
Masked Language Prompting generates diverse yet consistent fashion images to improve few-shot style recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Masked Language Prompting for generative data augmentation consistently outperforms class-name and caption-based baselines on the FashionStyle14 dataset in few-shot fashion style recognition tasks. By masking selected words in reference captions and using large language models to generate completions, the method introduces attribute-level variations while preserving structural semantics and style consistency, allowing effective image synthesis without model fine-tuning.
What carries the argument
Masked Language Prompting (MLP), which masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions that support style-consistent image generation.
If this is right
- MLP-based augmentation leads to higher accuracy in few-shot fashion style recognition compared to simpler prompting methods.
- The generated images maintain style semantics while providing more visual diversity.
- No fine-tuning of the text-to-image model is required for the augmentation to work.
- The approach balances diversity and consistency better than using full captions or just class names.
Where Pith is reading between the lines
- If the method succeeds, it suggests that language model completions can effectively expand limited visual datasets in domains with ambiguous categories.
- Similar masking strategies might help in other generative augmentation settings where semantic control is needed.
- Further work could explore how different masking ratios affect the quality of the resulting images.
Load-bearing premise
That masking selected words in a reference caption and prompting an LLM for completions will reliably produce attribute-level variations that preserve the intended style semantics while increasing visual diversity, and that the downstream text-to-image model will render these variations faithfully without any fine-tuning.
What would settle it
Observing that the few-shot recognition model trained on MLP-augmented data performs no better than one trained on standard caption-augmented data on the FashionStyle14 test set would falsify the effectiveness claim.
Figures
read the original abstract
Constructing dataset for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose \textbf{Masked Language Prompting (MLP)}, a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Masked Language Prompting (MLP), a technique that selects and masks words in reference captions for fashion images, prompts an LLM to generate completions, and feeds the resulting captions to an off-the-shelf text-to-image model to create augmented training data. The central claim is that this produces style-consistent yet visually diverse samples that improve few-shot recognition accuracy on the FashionStyle14 dataset relative to baselines that use only class names or unmasked reference captions.
Significance. If the core assumption holds—that masked LLM completions reliably preserve style semantics while adding useful visual diversity—the approach offers a lightweight, fine-tuning-free way to augment subjective visual categories. This could be relevant for other domains with ambiguous labels where generative models are available but direct supervision is scarce.
major comments (2)
- [Section 4] The manuscript provides no direct quantitative validation that the generated images remain faithful to the original style label. Section 4 (Experiments) reports only downstream few-shot classifier accuracy; it does not include style-classifier accuracy on the synthetic images, CLIP-style similarity scores to reference captions, or human consistency ratings that would confirm the weakest assumption does not introduce label noise.
- [Abstract and Section 4.3] The claim of consistent outperformance is stated in the abstract and Section 4.3 but the provided text supplies no numerical results, standard deviations, statistical significance tests, or details on dataset splits and controls for prompt length or LLM choice. This makes it impossible to assess whether the reported gains are robust or attributable to the masking strategy itself.
minor comments (2)
- [Section 3] Notation for the masking procedure and the exact prompt template sent to the LLM should be formalized with an equation or pseudocode in Section 3 to improve reproducibility.
- [Figures in Section 4] Figure captions and axis labels in the experimental plots are insufficiently descriptive; they should explicitly state the number of shots, the backbone classifier, and the augmentation ratio used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the changes we will incorporate in the revised manuscript.
read point-by-point responses
-
Referee: [Section 4] The manuscript provides no direct quantitative validation that the generated images remain faithful to the original style label. Section 4 (Experiments) reports only downstream few-shot classifier accuracy; it does not include style-classifier accuracy on the synthetic images, CLIP-style similarity scores to reference captions, or human consistency ratings that would confirm the weakest assumption does not introduce label noise.
Authors: We agree that direct validation of style fidelity would strengthen the claims. While downstream accuracy provides an indirect but task-relevant measure, we will add CLIP similarity scores between generated images and reference captions plus a human consistency study in the revised Section 4 to quantify style preservation and rule out label noise. revision: yes
-
Referee: [Abstract and Section 4.3] The claim of consistent outperformance is stated in the abstract and Section 4.3 but the provided text supplies no numerical results, standard deviations, statistical significance tests, or details on dataset splits and controls for prompt length or LLM choice. This makes it impossible to assess whether the reported gains are robust or attributable to the masking strategy itself.
Authors: We apologize for the lack of explicit numbers in the abstract and insufficient controls in Section 4.3. The full results appear in tables in Section 4; we will update the abstract with key accuracies and standard deviations, add p-values from significance tests, and expand Section 4.3 with dataset split details plus controls for prompt length and LLM choice. revision: yes
Circularity Check
No circularity: empirical method proposal validated on external dataset against independent baselines
full rationale
The paper introduces Masked Language Prompting as a prompting technique for generating augmented captions from reference text, then uses an off-the-shelf text-to-image model to synthesize images for few-shot training. Evaluation consists of direct accuracy comparisons on the FashionStyle14 dataset against class-name and caption-based baselines. No equations, derivations, or fitted parameters are defined in terms of the target performance metric. No self-citations are invoked to establish uniqueness theorems, ansatzes, or load-bearing premises. The central claim rests on experimental outcomes that are falsifiable against external methods and data, making the derivation chain self-contained rather than self-referential.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Large language models can generate completions for masked captions that remain semantically aligned with the original style concept.
- domain assumption Text-to-image models produce images that faithfully reflect the semantic content of the LLM-generated captions without additional training.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Masked Language Prompting (MLP) ... masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Fashion forward: Forecasting visual style in fashion
Ziad Al-Halah, Rainer Stiefelhagen, and Kristen Grauman. Fashion forward: Forecasting visual style in fashion. In Pro- ceedings of the IEEE International Conference on Computer Vision, pages 388–397, 2017. 1
work page 2017
-
[2]
Conceptual framework of hybrid style in fashion im- age datasets for machine learning
Hyosun An, Kyo Young Lee, Yerim Choi, and Minjung Park. Conceptual framework of hybrid style in fashion im- age datasets for machine learning. Fashion and Textiles, 10 (1):18, 2023. 1
work page 2023
-
[3]
Synthetic data from diffusion models improves imagenet classification
Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mo- hammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023. 1
-
[4]
Roland Barthes. The fashion system . Univ of California Press, 1990. 1
work page 1990
-
[5]
One-shot unsupervised do- main adaptation with personalized diffusion models
Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalo- geiton, and St´ephane Lathuili`ere. One-shot unsupervised do- main adaptation with personalized diffusion models. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 698–708, 2023. 1
work page 2023
-
[6]
NLTK: The natural language toolkit
Steven Bird and Edward Loper. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain,
-
[7]
Association for Computational Linguistics. 2
-
[8]
Apparel classi- fication with style
Lukas Bossard, Matthias Dantone, Christian Leistner, Chris- tian Wengert, Till Quack, and Luc Van Gool. Apparel classi- fication with style. In Proceedings of the 11th Asian Confer- ence on Computer Vision - Volume Part IV , page 321–335. Springer-Verlag, 2012. 1
work page 2012
-
[9]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedan- tam, Saurabh Gupta, Piotr Doll ´ar, and C. Lawrence Zit- nick. Microsoft coco captions: Data collection and evalu- ation server. CoRR, abs/1504.00325, 2015. 2
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[10]
Autoaugment: Learning augmentation strategies from data
Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasude- van, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 113–123, 2019. 1, 2, 3
work page 2019
-
[11]
Randaugment: Practical automated data augmenta- tion with a reduced search space
Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmenta- tion with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020. 1, 2, 3
work page 2020
-
[12]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, pages 248–255, 2009. 1
work page 2009
-
[13]
Terrance Devries and Graham W. Taylor. Improved regular- ization of convolutional neural networks with cutout. ArXiv, abs/1708.04552, 2017. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[14]
Gonzalez, Aditi Raghunathan, and Anna Rohrbach
Lisa Dunlap, Clara Mohri, Devin Guillory, Han Zhang, Trevor Darrell, Joseph E. Gonzalez, Aditi Raghunathan, and Anna Rohrbach. Using language to extend to unseen do- mains. In The Eleventh International Conference on Learn- ing Representations, 2023. 1, 3
work page 2023
-
[15]
Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E. Gonzalez, and Trevor Darrell. Diversify your vi- sion datasets with automatic diffusion-based augmentation. In Thirty-seventh Conference on Neural Information Pro- cessing Systems, 2023. 1, 3
work page 2023
- [16]
-
[17]
Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic data from generative models ready for image recognition? In The International Conference on Learning Representations ,
-
[18]
Aug- mix: A simple method to improve robustness and uncertainty under data shift
Dan Hendrycks*, Norman Mu*, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Aug- mix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations, 2020. 2, 3, 1
work page 2020
-
[19]
Wei-Lin Hsiao and Kristen Grauman. Learning the latent “look”: Unsupervised discovery of a style-coherent embed- ding from fashion images. Proceedings of the IEEE Interna- tional Conference on Computer Vision (ICCV), pages 4213– 4222, 2017. 1
work page 2017
-
[20]
Active gen- eration for image classification
Tao Huang, Jiaqi Liu, Shan You, and Chang Xu. Active gen- eration for image classification. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLVIII , pages 270– 286, 2024. 3
work page 2024
-
[21]
Re- thinking fid: Towards a better evaluation metric for image generation
Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Re- thinking fid: Towards a better evaluation metric for image generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 9307–9315,
work page 2024
-
[22]
Fancy: Human-centered, deep learning-based framework for fashion style analysis
Youngseung Jeon, Seungwan Jin, and Kyungsik Han. Fancy: Human-centered, deep learning-based framework for fashion style analysis. In Proceedings of the Web Conference 2021, pages 2367–2378. Association for Computing Machinery,
work page 2021
-
[23]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Sys- tems, pages 26565–26577. Curran Associates, Inc., 2022. 3, 2
work page 2022
-
[24]
Mohammad Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, and Tamara L. Berg. Hipster wars: Discovering el- ements of fashion styles. In Proceedings of ECCV , pages 472–488, 2014. 1
work page 2014
-
[25]
Datadream: Few-shot guided dataset generation
Jae Myung Kim, Jessica Bader, Stephan Alaniz, Cordelia Schmid, and Zeynep Akata. Datadream: Few-shot guided dataset generation. In European Conference on Computer Vision, pages 252–268. Springer, 2024. 1
work page 2024
-
[26]
Learning multiple layers of features from tiny images
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 1
work page 2009
-
[27]
Image data augmentation approaches: A comprehensive survey and future directions
Teerath Kumar, Rob Brennan, Alessandra Mileo, and Ma- lika Bendechache. Image data augmentation approaches: A comprehensive survey and future directions. IEEE Access, 12:187536–187571, 2024. 1
work page 2024
-
[28]
Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2024. 1 5
work page 2024
-
[29]
Semantic-guided generative image augmentation method with diffusion models for image classification
Bohan Li, Xiao Xu, Xinghao Wang, Yutai Hou, Yunlong Feng, Feng Wang, Xuanliang Zhang, Qingfu Zhu, and Wanx- iang Che. Semantic-guided generative image augmentation method with diffusion models for image classification. In Proceedings of the Thirty-Eighth AAAI Conference on Artifi- cial Intelligence and Thirty-Sixth Conference on Innovative Applications of...
work page 2024
-
[30]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML,
-
[31]
SDXL-Lightning: Progressive Adversarial Diffusion Distillation
Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl- lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024. 3, 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Decoupled weight de- cay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learning Representations, 2019. 3, 2
work page 2019
-
[33]
Knowledge enhanced neural fashion trend forecasting
Yunshan Ma, Yujuan Ding, Xun Yang, Lizi Liao, Wai Ke- ung Wong, and Tat-Seng Chua. Knowledge enhanced neural fashion trend forecasting. In Proceedings of the 2020 Inter- national Conference on Multimedia Retrieval, pages 82–90. Association for Computing Machinery, 2020. 1
work page 2020
-
[34]
Geostyle: Discovering fashion trends and events
Utkarsh Mall, Kevin Matzen, Bharath Hariharan, Noah Snavely, and Kavita Bala. Geostyle: Discovering fashion trends and events. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision , pages 411–420,
-
[35]
Automated flower classification over a large number of classes
Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics and Image Processing, pages 722–729, 2008. 1
work page 2008
-
[36]
OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021. 2, 3
work page 2021
-
[38]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv, abs/2204.06125, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1
work page 2022
-
[40]
Cap2aug: Caption guided image data aug- mentation
Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, and Rama Chellappa. Cap2aug: Caption guided image data aug- mentation. In 2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV) , pages 9125–9135,
work page 2025
-
[41]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 1
work page 2022
-
[42]
Fashion-specific ambiguous expression interpretation with partial visual-semantic embedding
Ryotaro Shimizu, Takuma Nakamura, and Masayuki Goto. Fashion-specific ambiguous expression interpretation with partial visual-semantic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3497–3502, 2023. 1
work page 2023
-
[43]
Fashion intelligence system: An outfit in- terpretation utilizing images and rich abstract tags
Ryotaro Shimizu, Yuki Saito, Megumi Matsutani, and Masayuki Goto. Fashion intelligence system: An outfit in- terpretation utilizing images and rich abstract tags. Expert Systems with Applications, 213:119167, 2023. 1
work page 2023
-
[44]
A fashion item recommendation model in hyperbolic space
Ryotaro Shimizu, Yu Wang, Masanari Kimura, Yuki Hi- rakawa, Takashi Wada, Yuki Saito, and Julian McAuley. A fashion item recommendation model in hyperbolic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , pages 8377–8383, 2024. 1
work page 2024
-
[45]
Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction
Edgar Simo-Serra and Hiroshi Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 298–307, 2016. 1
work page 2016
-
[46]
What Makes a Style: Experimental Analysis of Fashion Prediction
Moeko Takagi, Edgar Simo-Serra, Satoshi Iizuka, and Hi- roshi Ishikawa. What Makes a Style: Experimental Analysis of Fashion Prediction. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW), 2017. 1, 2, 3
work page 2017
-
[47]
Effective data augmentation with diffu- sion models
Brandon Trabucco, Kyle Doherty, Max A Gurinas, and Rus- lan Salakhutdinov. Effective data augmentation with diffu- sion models. In The Twelfth International Conference on Learning Representations, 2024. 1
work page 2024
-
[48]
Sch ¨onberger, Juan Nunez- Iglesias, Franc ¸ois Boulogne, Joshua D
St ´efan van der Walt, Johannes L. Sch ¨onberger, Juan Nunez- Iglesias, Franc ¸ois Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: image processing in python. PeerJ, 2:e453, 2014. 3
work page 2014
-
[49]
The caltech-ucsd birds-200-2011 dataset
Catherine Wah, Steve Branson, Peter Welinder, Pietro Per- ona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 1
work page 2011
-
[50]
Attributed synthetic data generation for zero- shot image classification
Shijian Wang, Linxin Song, Ryotaro Shimizu, Masayuki Goto, et al. Attributed synthetic data generation for zero- shot image classification. In Synthetic Data for Computer Vision Workshop@ CVPR 2024, 2024. 1
work page 2024
-
[51]
Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing , 13(4): 600–612, 2004. 4, 2
work page 2004
-
[52]
Fashion captioning: Towards generating accurate descrip- tions with semantic rewards
Xuewen Yang, Heming Zhang, Di Jin, Yingru Liu, Chi-Hao Wu, Jianchao Tan, Dongliang Xie, Jue Wang, and Xin Wang. Fashion captioning: Towards generating accurate descrip- tions with semantic rewards. In ECCV, 2020. 1, 2
work page 2020
-
[53]
Using syn- thetic data for data augmentation to improve classification accuracy
Zhou Yongchao, Sahak Hshmat, and Ba Jimmy. Using syn- thetic data for data augmentation to improve classification accuracy. In Proceedings of the Workshop on Challenges in Deployable Machine Learning, ICML, 2023. 1
work page 2023
-
[54]
Scaling autoregressive models for content-rich text-to-image generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gun- jan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yin- fei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, 6 Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learn- ing Research, 2...
work page 2022
-
[55]
Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regular- ization strategy to train strong classifiers with localizable fea- tures. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 6023–6032, 2019. 2, 3, 1
work page 2019
-
[56]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimiza- tion. In International Conference on Learning Representa- tions, 2018. 2, 3, 1
work page 2018
-
[57]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 4, 2
work page 2018
-
[58]
Expanding small-scale datasets with guided imag- ination
Yifan Zhang, Daquan Zhou, Bryan Hooi, Kai Wang, and Ji- ashi Feng. Expanding small-scale datasets with guided imag- ination. In Advances in Neural Information Processing Sys- tems, 2023. 1
work page 2023
-
[59]
Random erasing data augmentation
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceed- ings of the AAAI Conference on Artificial Intelligence, pages 13001–13008, 2020. 1 7 Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition Supplementary Material
work page 2020
-
[60]
Which of these two outfits looks more bohemian style?
Related Work 5.1. Fashion Style Recognition Fashion style recognition [2, 7, 18, 21, 23, 44, 45] has emerged as a key task in image understanding, with promis- ing applications in fashion recommendation [42] and trend analysis [1, 32, 33]. Unlike objective attributes such as gar- ment category or color, “style” represents a highly abstract, subjective, an...
-
[61]
Experiments 6.1. Prompts Tab. 3 and Tab. 4 show the prompts used in step 1 (caption- ing) and step 3 ( fill-in-the-masks) of our MLP framework, respectively. 1 Captioning step Instruction: Please generate a caption of the provided images within 30 words. Note: • Colors, categories, designs of each piece of fash- ion item MUST be described. • Overall fashi...
-
[62]
Limitations We acknowledge the following limitations of our work. • Risk of style misalignment : When the masked words are inappropriately completed by the LLMs, the resulting image may deviate from the intended style. In the cur- rent implementation, masking candidates are randomly sampled from nouns and adjectives, regardless of their semantic influence...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.