arxiv: 2606.13275 · v1 · pith:NA2VNYKFnew · submitted 2026-06-11 · 💻 cs.CV

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

Pith reviewed 2026-06-27 07:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot captioning · cultural heritage · Indonesian traditional clothing · retrieval-augmented generation · CLIP · vision-language models · low-resource datasets · province-level evaluation

The pith

A retrieval-augmented CLIP model generates captions for traditional Indonesian clothing from provinces never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Custom ZeroCLIP can produce captions for images of traditional garments from eight unseen Indonesian provinces by retrieving only from captions of the 24 training provinces. The model combines frozen CLIP image and text encoders with BERT and an LSTM decoder, and it reports higher CLIPScore, BLEU-4, and METEOR scores than baselines while recovering more culturally specific vocabulary. A sympathetic reader would care because the approach shows a way to automate descriptions of cultural heritage items without requiring annotations or images from every region, which is useful when expert labeling is scarce.

Core claim

Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859 on images from eight unseen provinces by retrieving captions exclusively from the bank of 24 seen provinces; ablation shows retrieval yields a 19.3 percent METEOR gain, and human raters confirm improved cultural accuracy and fluency compared with non-retrieval baselines.
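
The 19.3 percent METEOR gain can be read as a relative improvement or as an absolute difference in score, and the text here does not say which. The sketch below works through both readings, assuming 0.4859 is the with-retrieval METEOR. The function name `implied_baseline` is hypothetical, and the baselines it prints are back-calculations under those assumptions, not values reported by the paper.

```python
def implied_baseline(with_retrieval: float, gain: float, relative: bool) -> float:
    """Back out the no-retrieval score implied by a reported gain."""
    if relative:
        # Relative reading: with_retrieval = baseline * (1 + gain)
        return with_retrieval / (1.0 + gain)
    # Absolute reading: with_retrieval = baseline + gain
    return with_retrieval - gain

METEOR_WITH_RETRIEVAL = 0.4859  # reported METEOR (assumed to be the with-retrieval score)
GAIN = 0.193                    # reported as 19.3 percent

relative_baseline = implied_baseline(METEOR_WITH_RETRIEVAL, GAIN, relative=True)
absolute_baseline = implied_baseline(METEOR_WITH_RETRIEVAL, GAIN, relative=False)

print(f"relative reading: baseline ~ {relative_baseline:.4f}")
print(f"absolute reading: baseline ~ {absolute_baseline:.4f}")
```

The two readings imply quite different no-retrieval baselines (about 0.41 versus about 0.29), so the ablation table in the paper is the place to check which one is meant.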

What carries the argument

Province-level inductive zero-shot protocol that builds a retrieval bank solely from captions of seen provinces and augments generation for unseen-province images.
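
A minimal sketch of such a protocol, under stated assumptions: provinces are assigned to groups at random with a fixed seed (the source does not say how the 24/6/8 assignment was made), and each sample is a dict with hypothetical `province` and `caption` keys. The province names and captions are placeholders, not data from the paper.

```python
import random

def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition province names into disjoint train/validation/test groups."""
    unique = sorted(set(provinces))
    if len(unique) != n_train + n_val + n_test:
        raise ValueError("province count does not match the requested split")
    random.Random(seed).shuffle(unique)  # assumption: random assignment
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test

def build_retrieval_bank(samples, train_provinces):
    """Keep only samples whose province belongs to the training group."""
    return [s for s in samples if s["province"] in train_provinces]

# Placeholder data: 38 dummy province names with one dummy caption each.
provinces = [f"province_{i:02d}" for i in range(38)]
samples = [{"province": p, "caption": f"caption for {p}"} for p in provinces]

train, val, test = split_provinces(provinces)
bank = build_retrieval_bank(samples, train)

# Leakage check: nothing from validation or test provinces may enter the bank.
bank_provinces = {s["province"] for s in bank}
assert bank_provinces.isdisjoint(val | test)
print(len(train), len(val), len(test), len(bank))
```

Splitting by province rather than by image is what makes the protocol inductive: two images of the same regional garment can never land on opposite sides of the train/test boundary.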

Load-bearing premise

Retrieval of captions from seen provinces is sufficient to produce culturally accurate descriptions for images from completely unseen provinces without any exposure to their images, labels, or captions during training or retrieval bank construction.

What would settle it

Expert raters familiar with the eight unseen provinces judge that the generated captions systematically omit or misrepresent distinctive garment features unique to those provinces.

Figures

Figures reproduced from arXiv: 2606.13275 by Anugrah Aidin Yotolembah, Gembong Edhi Setyawan, Novanto Yudistira.

Figure 1. Overview of Custom ZeroCLIP. Training (left) optimizes a BERT-LSTM decoder using frozen CLIP embeddings from seen provinces. Inference (right) applies cosine-similarity retrieval to generate captions for unseen provinces without labels.
Figure 2. Training pipeline of Custom ZeroCLIP. Image and caption tokens are encoded by frozen CLIP, projected into the LM, and optimized using L_CE and L_CLIP. The BERT encoder, projection layers, and LSTM decoder are trained while CLIP remains frozen.
… computational efficiency, and noise reduction from excessive candidates while preserving relevant cultural context. Given a test image embedding v, cosine similarity …
Figure 3. Representative samples from seen and unseen provinces used in the inductive zero-shot evaluation protocol [26].
Figure 4. Training and validation loss curves showing stable convergence without overfitting.
… rotations up to 15° [14]. Captions are tokenized with the BERT tokenizer (maximum length: 512 tokens), while synonym replacement and back-translation improve robustness to cultural terminology [6], [21], [27]. The CLIP ViT-B/32 encoder remains frozen during training, while the BERT encoder, projection layers, and LSTM dec…
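
The cosine-similarity retrieval mentioned in the figure captions can be sketched in a few lines. This is an illustration rather than the paper's implementation: it assumes the bank holds one precomputed embedding per training caption, uses tiny two-dimensional toy vectors in place of CLIP embeddings, and treats `k` as a free choice because the excerpt does not state the value used. The names `cosine` and `retrieve_top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_top_k(query, bank, k=3):
    """Return the k (score, caption) pairs most similar to the query embedding."""
    scored = [(cosine(query, emb), caption) for emb, caption in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy bank: (embedding, caption) pairs standing in for seen-province captions.
bank = [
    ([1.0, 0.0], "caption A"),
    ([0.0, 1.0], "caption B"),
    ([1.0, 1.0], "caption C"),
]
query = [1.0, 0.1]  # stands in for a test image embedding v

top = retrieve_top_k(query, bank, k=2)
for score, caption in top:
    print(f"{score:.3f}  {caption}")
```

Because the bank contains only seen-province captions, the retrieved text can at best describe garments that look similar to the query; whether that is enough for an unseen province is exactly the load-bearing premise stated above.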
Abstract

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Custom ZeroCLIP, a retrieval-augmented zero-shot captioning framework for traditional Indonesian clothing images. It uses a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM decoder, with retrieval from captions of the 24 seen provinces only. The model is evaluated on 8 unseen provinces under an inductive protocol, reporting CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859, a 19.3% METEOR improvement from the retrieval component in ablation, and favorable human evaluations for cultural accuracy and fluency. The 3,800-image expert-annotated dataset spanning all 38 provinces is released publicly.

Significance. If the central performance and cultural-transfer claims hold, the work would be significant for zero-shot methods in culturally grounded, low-resource heritage domains where exhaustive regional labeling is impractical. The explicit 19.3% METEOR gain from retrieval and the public dataset release are concrete strengths that support reproducibility and further research in domain-adapted vision-language models for cultural heritage.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central performance claims (CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859 and 19.3% METEOR gain) are stated without error bars, standard deviations across runs, or statistical significance tests against the baselines. This information is load-bearing for assessing whether the reported outperformance over existing zero-shot methods is robust.
  2. [Method / Ablation study] Zero-shot protocol description and retrieval ablation: the claim that retrieval from the 24-province caption bank produces culturally accurate descriptions for the 8 unseen provinces rests on the assumption that CLIP embeddings separate province-specific garment features (motifs, colors, construction details) at sufficient granularity and that the retrieved strings supply the necessary cultural vocabulary. The manuscript provides no qualitative retrieval examples, nearest-neighbor analysis, or breakdown showing that the matched captions contain terms unique to the held-out provinces rather than generic visual matches.
minor comments (1)
  1. [Abstract] The abstract lists exact metric values; these should be cross-referenced to the corresponding table or figure in the main text for traceability.
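
The term-level breakdown requested in major comment 2 could take roughly the following shape. This is a sketch under simplifying assumptions: whitespace tokenization, a hand-supplied set of generic words, and placeholder captions with made-up tokens (`garment_x`, `motif_y`) standing in for province-specific vocabulary. The function name `term_breakdown` is hypothetical and is not taken from the paper.

```python
def term_breakdown(retrieved_captions, reference_caption, generic_terms):
    """Split non-generic reference terms into those found in retrieved captions and those missed."""
    reference_terms = set(reference_caption.lower().split()) - generic_terms
    retrieved_terms = set()
    for caption in retrieved_captions:
        retrieved_terms |= set(caption.lower().split())
    covered = reference_terms & retrieved_terms
    missed = reference_terms - retrieved_terms
    return covered, missed

# Placeholder captions; garment_x and motif_y are invented stand-in tokens.
generic_terms = {"a", "woman", "man", "wears", "with"}
reference = "a woman wears garment_x with motif_y"
retrieved = [
    "a woman wears garment_x with a sash",
    "a man wears a headcloth",
]

covered, missed = term_breakdown(retrieved, reference, generic_terms)
print("covered:", sorted(covered))
print("missed:", sorted(missed))
```

Terms that land in `missed` for most unseen-province images would be direct evidence that the seen-province bank cannot supply the vocabulary the captions need, which is the concern the comment raises.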

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify valid opportunities to strengthen the presentation of results and the supporting analysis for the retrieval mechanism. We address each major comment below and will incorporate revisions accordingly.

  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central performance claims (CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859 and 19.3% METEOR gain) are stated without error bars, standard deviations across runs, or statistical significance tests against the baselines. This information is load-bearing for assessing whether the reported outperformance over existing zero-shot methods is robust.

    Authors: We agree that error bars, standard deviations, and statistical significance tests would improve the robustness assessment of the reported metrics. The original experiments used fixed random seeds for reproducibility and were not repeated across multiple runs. In the revised manuscript we will rerun all experiments (including baselines and ablations) with five different random seeds, report means and standard deviations for CLIPScore, BLEU-4, METEOR, and the 19.3% gain, and add paired statistical significance tests against the baselines. revision: yes

  2. Referee: [Method / Ablation study] Zero-shot protocol description and retrieval ablation: the claim that retrieval from the 24-province caption bank produces culturally accurate descriptions for the 8 unseen provinces rests on the assumption that CLIP embeddings separate province-specific garment features (motifs, colors, construction details) at sufficient granularity and that the retrieved strings supply the necessary cultural vocabulary. The manuscript provides no qualitative retrieval examples, nearest-neighbor analysis, or breakdown showing that the matched captions contain terms unique to the held-out provinces rather than generic visual matches.

    Authors: We acknowledge that the current manuscript lacks qualitative retrieval examples and embedding analysis to directly illustrate how province-specific features are captured. The quantitative ablation and human evaluations for cultural accuracy are provided, but additional supporting evidence would strengthen the claim. In the revision we will add (1) qualitative examples of top-3 retrieved captions for sample unseen-province images, (2) nearest-neighbor analysis or similarity heatmaps in CLIP embedding space demonstrating separation of province-specific motifs/colors, and (3) a breakdown of unique cultural terms (e.g., specific garment names or motifs) appearing in the retrieved captions versus generic matches. revision: yes
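
The multi-seed reporting promised in the first response could be sketched as follows. The per-seed scores are made-up placeholders, not results from the paper, and the exact sign-flip permutation test is one possible choice of paired test; the rebuttal does not name the test it will use. The function names are hypothetical.

```python
import statistics
from itertools import product

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped_sum = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped_sum >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Made-up placeholder METEOR scores for five seeds (not from the paper).
model_scores = [0.50, 0.52, 0.51, 0.53, 0.49]
baseline_scores = [0.40, 0.41, 0.42, 0.40, 0.43]

mean_model, std_model = summarize(model_scores)
p_value = paired_sign_flip_pvalue(model_scores, baseline_scores)
print(f"model: {mean_model:.4f} +/- {std_model:.4f}")
print(f"paired sign-flip p-value: {p_value:.4f}")
```

One design point worth noting: with only five paired seeds there are 2^5 = 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625. Pairing at the level of individual test images, or using more seeds, would be needed to reach conventional significance thresholds.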

Circularity Check

0 steps flagged

No circularity; empirical zero-shot protocol on held-out provinces is self-contained

The paper defines a province-level inductive zero-shot protocol that trains exclusively on 24 seen provinces, builds a retrieval bank only from their captions, and evaluates on 8 completely unseen provinces with no exposure to their images/labels/captions. Metrics (CLIPScore, BLEU-4, METEOR) and ablations are computed against external baselines; no equation, parameter fit, or self-citation reduces the central claim to its own inputs by construction. The setup is a standard held-out evaluation and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about pre-trained vision-language models and the transferability of cultural vocabulary via retrieval. The abstract introduces no free parameters or invented entities; of the two axioms listed, one is a domain assumption and one is specific to this paper.

axioms (2)
  • domain assumption: Frozen CLIP embeddings support effective semantic retrieval for caption augmentation across cultural domains.
    The framework relies on this to enable zero-shot transfer without fine-tuning the image encoder.
  • ad hoc to paper: Captions from seen provinces contain transferable cultural vocabulary for unseen provinces.
    This is the core premise of the inductive zero-shot protocol described in the abstract.

pith-pipeline@v0.9.1-grok · 5806 in / 1535 out tokens · 38448 ms · 2026-06-27T07:04:36.628969+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · 2 internal anchors

  [1] S. Andréfouët, M. Paul, and A. R. Farhan, “Indonesia’s 13558 islands: A new census from space and a first step towards a one map for small islands policy,” Marine Policy, vol. 135, pp. 104848, Jan. 2022.
  [2] L. Suryatni and I. D. K. K. Widana, “Perception and appreciation of the Indonesian plural society toward cultural diversity,” Technium Social Science Journal, vol. 43, pp. 466, 2023.
  [3] P. H. Candra, A. Widita, F. H. Maulida, M. Shanti, and Y. B. Kusuma, “Rebranding of Malangan batik as a symbol of Malang’s cultural identity through value chain analysis,” in E3S Web of Conferences, 2023, vol. 426, p. 02129.
  [4] R. G. Tiwari, A. K. Agarwal, V. Jain, and A. Kumar, “Batik classification in Indonesia: Exploring its significance on tourism and economy,” in 2023 International Conference on Sustaining Heritage: Innovative and Digital Approaches (ICSH), Jun. 2023, pp. 119–124.
  [5] Y. Mayuzumi, “Is meeting the needs of tourists through ethnic tourism sustainable? focus on Bali, Indonesia,” Asia-Pacific Journal of Regional Science, vol. 6, no. 1, pp. 423–451, Feb. 2022.
  [6] F. Fitriadi, R. M. Sinaga, and R. R. Muhammad, “A literature review on the cultural perspective study in elementary school education in Indonesia,” Journal of Innovation in Educational and Cultural Research, vol. 5, no. 1, Feb. 2024.
  [7] A. Fitrisia and E. Ernawati, “The forest cultural heritage in the east coast Sumatra,” in Proceedings of the 9th Asbam International Conference (ASBAM 2021), Apr. 2022, pp. 453–457.
  [8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021, arXiv:2103.00020.
  [9] M. Barraco, M. Cornia, S. Cascianelli, L. Baraldi, and R. Cucchiara, “The unreasonable effectiveness of CLIP features for image captioning: An experimental analysis,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2022, pp. 4661–4669.
  [10] Z. Zhang et al., “Reducing bias in AI-based analysis of visual artworks,” IEEE BITS the Information Theory Magazine, vol. 2, no. 1, pp. 36–48, Oct. 2022.
  [11] H. Wang, Y. Zhan, L. Liu, L. Ding, Y. Yang, and J. Yu, “Towards alleviating text-to-image retrieval hallucination for CLIP in zero-shot learning,” 2024, arXiv:2402.18400.
  [12] S. Yu, P. H. Seo, and J. Son, “Zero-shot referring image segmentation with global-local context features,” 2023, arXiv:2303.17811.
  [13] L. Lou, K. Lu, and J. Xue, “Improved transformer with parallel encoders for image captioning,” in 2022 26th International Conference on Pattern Recognition (ICPR), Aug. 2022, pp. 4072–4075.
  [14] A. Mahmoud et al., “Sieve: Multimodal dataset pruning using image captioning models,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2024, pp. 22423–22432.
  [15] K. Jeong, W. Lee, W. Nam, M. Ma, and P. Kang, “Technical report of NICE challenge at CVPR 2024: Caption re-ranking evaluation using ensembled CLIP and consensus scores,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2024, pp. 7366–7372.
  [16] H. Han, M. Bhatti, B. Ali, Y. A. Ali, M. Al-razgan, and Y. Yasid, “Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing,” Big Data Research, vol. 37, pp. 100477, Jan. 2024.
  [17] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt, “Improving multimodal datasets with image captioning,” Advances in Neural Information Processing Systems, vol. 36, pp. 22047–22069, 2023.
  [18] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al., “PaLI: A jointly-scaled multilingual language-image model,” arXiv preprint arXiv:2209.06794, 2022.
  [19] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei, “Grounding multimodal large language models to the world,” in International Conference on Learning Representations, 2024, vol. 2024, pp. 51575–51598.
  [20] N. Yudistira and T. Kurita, “Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning,” EURASIP Journal on Image and Video Processing, vol. 2017, no. 1, pp. 85, 2017.
  [21] Mourad Bahani, Aziza El Ouaazizi, and Khalil Maalmi, “The effectiveness of T5, GPT-2, and BERT on text-to-image generation task,” Pattern Recognition Letters, vol. 173, pp. 57–63, 2023.
  [22] Z. Xu et al., “Benchmarking zero-shot recognition with vision-language models: Challenges on granularity and specificity,” 2024, Amazon Science. [Online]. Available: https://www.amazon.science/publications/benchmarking-zero-shot-recognition-with-vision-language-models-challenges-on-granularity-and-specificity
  [23] S. Sarto, M. Barraco, M. Cornia, L. Baraldi, and R. Cucchiara, “Positive-augmented contrastive learning for image and video captioning evaluation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2023, pp. 6914–6924.
  [24] D. Harisanty, K. L. B. Obille, N. E. V. Anna, E. Purwanti, and F. Retrialisca, “Cultural heritage preservation in the digital age, harnessing artificial intelligence for the future: a bibliometric analysis,” Digital Library Perspectives, vol. 40, no. 4, pp. 609–630, Sep. 2024.
  [25] A. H. Aida, “Cultural heritage preservation in the digital age: Balancing tradition and innovation in Mediterranean smart cities,” in 2024 Mediterranean Smart Cities Conference (MSCC), May 2024, pp. 1–6.
  [26] Apri Subagyo, Mengenal Pakaian Adat Nusantara, CV. Indrajaya, Jakarta Timur, 2017.
  [27] Z. Lai et al., “VeCLIP: Improving CLIP training via visual-enriched captions,” 2024, arXiv:2310.07699.