Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

Anugrah Aidin Yotolembah; Gembong Edhi Setyawan; Novanto Yudistira

arxiv: 2606.13275 · v1 · pith:NA2VNYKFnew · submitted 2026-06-11 · 💻 cs.CV

Pith reviewed 2026-06-27 07:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords zero-shot captioning · cultural heritage · Indonesian traditional clothing · retrieval-augmented generation · CLIP · vision-language models · low-resource datasets · province-level evaluation

The pith

A retrieval-augmented CLIP model generates captions for traditional Indonesian clothing from provinces never seen in training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Custom ZeroCLIP can produce captions for images of traditional garments from eight unseen Indonesian provinces by retrieving only from captions of the 24 training provinces. The model combines frozen CLIP image and text encoders with BERT and an LSTM decoder, and it reports higher CLIPScore, BLEU-4, and METEOR scores than baselines while recovering more culturally specific vocabulary. A sympathetic reader would care because the approach shows a way to automate descriptions of cultural heritage items without requiring annotations or images from every region, which is useful when expert labeling is scarce.
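The retrieval step described above can be sketched as a cosine-similarity top-k lookup over a bank of caption embeddings. This is a minimal illustration, not the authors' implementation: the function name `top_k_captions`, the value of `k`, and the three-dimensional toy vectors standing in for CLIP embeddings are all assumptions.

```python
import numpy as np

def top_k_captions(image_emb, bank_embs, bank_captions, k=3):
    """Return the k bank captions most cosine-similar to the image embedding.

    image_emb: (d,) embedding of the query image.
    bank_embs: (n, d) embeddings of the seen-province captions.
    k is a hypothetical setting; the source does not state the value used.
    """
    v = image_emb / np.linalg.norm(image_emb)
    bank = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = bank @ v                      # cosine similarity to every bank entry
    order = np.argsort(-sims)[:k]        # indices of the k highest similarities
    return [(bank_captions[i], float(sims[i])) for i in order]

# Toy vectors standing in for CLIP embeddings (assumption, for illustration only).
bank_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.7, 0.7, 0.0]])
bank_captions = ["caption A", "caption B", "caption C"]
query = np.array([0.9, 0.1, 0.0])

results = top_k_captions(query, bank_embs, bank_captions, k=2)
```

In a real pipeline the bank would hold embeddings of captions from the 24 training provinces only, so nothing from an unseen province can be returned.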

Core claim

Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859 on images from eight unseen provinces by retrieving captions exclusively from the bank of 24 seen provinces; ablation shows retrieval yields a 19.3 percent METEOR gain, and human raters confirm improved cultural accuracy and fluency compared with non-retrieval baselines.
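For readers unfamiliar with the first metric, the sketch below shows the commonly used reference-free form of CLIPScore: a rescaled cosine similarity between image and caption embeddings, clipped at zero. The scaling weight `w = 2.5` is the usual default; whether the paper uses exactly this variant is not stated in the abstract, so treat it as an assumption.

```python
import numpy as np

def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIPScore: w * max(cos(image, caption), 0).

    The embeddings are assumed to come from the same CLIP model;
    w=2.5 is the conventional default, not a value confirmed by the source.
    """
    cos = float(np.dot(image_emb, caption_emb)
                / (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    return w * max(cos, 0.0)

aligned = clip_score(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
orthogonal = clip_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposed = clip_score(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

Because the score depends only on the image and the generated caption, it can stay high even when a caption is visually plausible but culturally wrong, which is why the reference-based metrics and human ratings matter here.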

What carries the argument

Province-level inductive zero-shot protocol that builds a retrieval bank solely from captions of seen provinces and augments generation for unseen-province images.
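A province-disjoint split of this kind might look like the sketch below. The 24/6/8 counts come from the abstract; the random assignment, the seed, the record layout (`province`, `caption` keys), and the placeholder province names are assumptions, since the source does not say how provinces were allocated to splits.

```python
import random

def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition provinces into disjoint train / validation / unseen-test sets."""
    provinces = sorted(set(provinces))
    if len(provinces) != n_train + n_val + n_test:
        raise ValueError("province count does not match the requested split sizes")
    rng = random.Random(seed)
    rng.shuffle(provinces)
    train = set(provinces[:n_train])
    val = set(provinces[n_train:n_train + n_val])
    test = set(provinces[n_train + n_val:])
    return train, val, test

def build_retrieval_bank(records, train_provinces):
    """Keep only captions whose province is in the training set."""
    return [r["caption"] for r in records if r["province"] in train_provinces]

# Placeholder names; the real dataset covers the 38 Indonesian provinces.
provinces = [f"province_{i:02d}" for i in range(38)]
train, val, test = split_provinces(provinces)

records = [{"province": p, "caption": f"caption from {p}"} for p in provinces]
bank = build_retrieval_bank(records, train)

# Leakage guard: no validation or unseen-test caption may enter the bank.
held_out_captions = {r["caption"] for r in records if r["province"] in (val | test)}
assert not held_out_captions & set(bank)
```

Keeping the leakage guard next to the bank construction makes the inductive claim checkable in code rather than only in prose.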

Load-bearing premise

Retrieval of captions from seen provinces is sufficient to produce culturally accurate descriptions for images from completely unseen provinces without any exposure to their images, labels, or captions during training or retrieval bank construction.

What would settle it

Expert raters familiar with the eight unseen provinces judge that the generated captions systematically omit or misrepresent distinctive garment features unique to those provinces.

Figures

Figures reproduced from arXiv: 2606.13275 by Anugrah Aidin Yotolembah, Gembong Edhi Setyawan, Novanto Yudistira.

**Figure 1.** Overview of Custom ZeroCLIP. Training (left) optimizes a BERT-LSTM decoder using frozen CLIP embeddings from seen provinces. Inference (right) applies cosine-similarity retrieval to generate captions for unseen provinces without labels.

**Figure 2.** Training pipeline of Custom ZeroCLIP. Image and caption tokens are encoded by frozen CLIP, projected into the LM, and optimized using L_CE and L_CLIP. The BERT encoder, projection layers, and LSTM decoder are trained while CLIP remains frozen.

… computational efficiency, and noise reduction from excessive candidates while preserving relevant cultural context. Given a test image embedding v, cosine similarity …

**Figure 3.** Representative samples from seen and unseen provinces used in the inductive zero-shot evaluation protocol [26].

**Figure 4.** Training and validation loss curves showing stable convergence without overfitting.

… rotations up to 15° [14]. Captions are tokenized with the BERT tokenizer (maximum length: 512 tokens), while synonym replacement and back-translation improve robustness to cultural terminology [6], [21], [27]. The CLIP ViT-B/32 encoder remains frozen during training, while the BERT encoder, projection layers, and LSTM dec…

Abstract

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper releases a new 3800-image dataset of Indonesian traditional clothing and shows retrieval helps zero-shot captioning across province splits, but the cultural detail transfer to unseen provinces rests on unexamined assumptions about CLIP granularity.

This paper's main point is a new dataset of 3800 expert-annotated images of traditional Indonesian clothing from 38 provinces, used to test retrieval-augmented zero-shot captioning with a province split.

The model trains on 24 provinces and retrieves captions only from those for the 8 unseen test provinces. It combines frozen CLIP, BERT, and LSTM, and reports better scores than baselines along with a 19.3% METEOR gain from the retrieval component and positive human evaluations for cultural accuracy.

The dataset release is the clearest positive. Making 3800 images with captions public helps anyone who wants to work on this specific cultural domain. The inductive zero-shot protocol by province is a sensible choice that avoids data leakage and tests generalization across regions.

The soft spot is the reliance on CLIP-driven retrieval to handle culturally specific details. If the embeddings do not separate province-unique garment features finely enough, the retrieved captions from seen provinces may not contain the right vocabulary for motifs, colors, or construction in the unseen ones. The ablation and human results suggest improvement, but the abstract lacks examples or failure analysis to confirm the cultural transfer works as claimed.

This work is aimed at people building practical tools for cultural heritage documentation in low-resource settings. It is not advancing new vision-language techniques but applying existing ones to a fresh application area.

A reader focused on applied cultural AI or dataset creation would find it useful. It deserves peer review because the dataset and the clear experimental protocol provide something concrete to assess, even if the central cultural accuracy claim needs more evidence to fully convince.

I would recommend sending it for review. The empirical results are presented with enough structure to allow referees to check the claims against the data.

Referee Report

2 major / 1 minor

Summary. The paper presents Custom ZeroCLIP, a retrieval-augmented zero-shot captioning framework for traditional Indonesian clothing images. It uses a frozen CLIP ViT-B/32 image encoder, CLIP and BERT text encoders, and an LSTM decoder, with retrieval from captions of 24 seen provinces only. The model is evaluated on 8 unseen provinces under an inductive protocol, reporting CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859, a 19.3% METEOR improvement from the retrieval component in ablation, and favorable human evaluations for cultural accuracy and fluency. The 3800-image expert-annotated dataset spanning all 38 provinces is released publicly.

Significance. If the central performance and cultural-transfer claims hold, the work would be significant for zero-shot methods in culturally grounded, low-resource heritage domains where exhaustive regional labeling is impractical. The explicit 19.3% METEOR gain from retrieval and the public dataset release are concrete strengths that support reproducibility and further research in domain-adapted vision-language models for cultural heritage.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the central performance claims (CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859 and 19.3% METEOR gain) are stated without error bars, standard deviations across runs, or statistical significance tests against the baselines. This information is load-bearing for assessing whether the reported outperformance over existing zero-shot methods is robust.
[Method / Ablation study] Zero-shot protocol description and retrieval ablation: the claim that retrieval from the 24-province caption bank produces culturally accurate descriptions for the 8 unseen provinces rests on the assumption that CLIP embeddings separate province-specific garment features (motifs, colors, construction details) at sufficient granularity and that the retrieved strings supply the necessary cultural vocabulary. The manuscript provides no qualitative retrieval examples, nearest-neighbor analysis, or breakdown showing that the matched captions contain terms unique to the held-out provinces rather than generic visual matches.
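The statistics requested in the first major comment could be produced along the lines of the sketch below: a mean and standard deviation across seeds, plus a paired sign-flip permutation test on per-image scores. The function names and all numbers are placeholders invented for illustration; none of them are results from the paper, and the choice of a permutation test rather than another paired test is an assumption.

```python
import random
import statistics

def summarize(scores):
    """Mean and sample standard deviation of one metric across runs."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_permutation_test(a, b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Hypothetical METEOR values from five seeds (placeholders, not reported results).
seed_meteor = [0.47, 0.48, 0.49, 0.48, 0.50]
mean, std = summarize(seed_meteor)

# Hypothetical per-image scores for a baseline and a retrieval-augmented model.
baseline = [0.30 + 0.01 * i for i in range(20)]
with_retrieval = [s + 0.05 for s in baseline]
p_value = paired_permutation_test(with_retrieval, baseline)
```

One design point worth noting: with only five paired values there are 2^5 = 32 sign patterns, so a two-sided sign-flip test cannot produce a p-value below 2/32 ≈ 0.0625. Pairing over test images rather than over seeds avoids that floor.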

minor comments (1)

[Abstract] The abstract lists exact metric values; these should be cross-referenced to the corresponding table or figure in the main text for traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify valid opportunities to strengthen the presentation of results and the supporting analysis for the retrieval mechanism. We address each major comment below and will incorporate revisions accordingly.

Referee: [Abstract / Experiments] Abstract and Experiments section: the central performance claims (CLIPScore 0.8536, BLEU-4 0.3342, METEOR 0.4859 and 19.3% METEOR gain) are stated without error bars, standard deviations across runs, or statistical significance tests against the baselines. This information is load-bearing for assessing whether the reported outperformance over existing zero-shot methods is robust.

Authors: We agree that error bars, standard deviations, and statistical significance tests would improve the robustness assessment of the reported metrics. The original experiments used fixed random seeds for reproducibility and were not repeated across multiple runs. In the revised manuscript we will rerun all experiments (including baselines and ablations) with five different random seeds, report means and standard deviations for CLIPScore, BLEU-4, METEOR, and the 19.3% gain, and add paired statistical significance tests against the baselines. revision: yes
Referee: [Method / Ablation study] Zero-shot protocol description and retrieval ablation: the claim that retrieval from the 24-province caption bank produces culturally accurate descriptions for the 8 unseen provinces rests on the assumption that CLIP embeddings separate province-specific garment features (motifs, colors, construction details) at sufficient granularity and that the retrieved strings supply the necessary cultural vocabulary. The manuscript provides no qualitative retrieval examples, nearest-neighbor analysis, or breakdown showing that the matched captions contain terms unique to the held-out provinces rather than generic visual matches.

Authors: We acknowledge that the current manuscript lacks qualitative retrieval examples and embedding analysis to directly illustrate how province-specific features are captured. The quantitative ablation and human evaluations for cultural accuracy are provided, but additional supporting evidence would strengthen the claim. In the revision we will add (1) qualitative examples of top-3 retrieved captions for sample unseen-province images, (2) nearest-neighbor analysis or similarity heatmaps in CLIP embedding space demonstrating separation of province-specific motifs/colors, and (3) a breakdown of unique cultural terms (e.g., specific garment names or motifs) appearing in the retrieved captions versus generic matches. revision: yes
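The term breakdown promised in item (3) could be approximated as below: for one unseen-province image, split the words of its reference caption into cultural and generic terms, then measure how many of each appear in the retrieved captions. The lexicon, the captions, and the function names are illustrative assumptions; a real analysis would need an expert-curated vocabulary.

```python
import re

def tokenize(text):
    """Lowercase a caption and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def term_recall(reference_caption, retrieved_captions, cultural_lexicon):
    """Recall of cultural vs. generic reference terms within retrieved captions."""
    ref = tokenize(reference_caption)
    retrieved = set()
    for caption in retrieved_captions:
        retrieved |= tokenize(caption)
    cultural = ref & cultural_lexicon
    generic = ref - cultural_lexicon

    def recall(terms):
        # None signals that the reference contains no terms of this kind.
        return len(terms & retrieved) / len(terms) if terms else None

    return {"cultural": recall(cultural), "generic": recall(generic)}

# Illustrative lexicon and captions (assumptions, not taken from the dataset).
lexicon = {"kebaya", "songket", "ulos"}
reference = "a woman wearing a kebaya with a songket sash"
retrieved = ["a woman wearing a songket skirt",
             "a man wearing a woven headcloth"]

result = term_recall(reference, retrieved, lexicon)
```

A term that occurs only in held-out provinces cannot appear in a bank built from seen provinces, so its recall under this measure is zero by construction. Reporting cultural and generic recall separately would show how much of the gain comes from shared cultural vocabulary rather than generic visual matches.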

Circularity Check

0 steps flagged

No circularity; empirical zero-shot protocol on held-out provinces is self-contained

The paper defines a province-level inductive zero-shot protocol that trains exclusively on 24 seen provinces, builds a retrieval bank only from their captions, and evaluates on 8 completely unseen provinces with no exposure to their images/labels/captions. Metrics (CLIPScore, BLEU-4, METEOR) and ablations are computed against external baselines; no equation, parameter fit, or self-citation reduces the central claim to its own inputs by construction. The setup is a standard held-out evaluation and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions about pre-trained vision-language models and the transferability of cultural vocabulary via retrieval. No new free parameters, invented entities, or ad-hoc axioms beyond domain assumptions are introduced in the abstract.

axioms (2)

domain assumption: Frozen CLIP embeddings support effective semantic retrieval for caption augmentation across cultural domains.
The framework relies on this to enable zero-shot transfer without fine-tuning the image encoder.
ad hoc to paper: Captions from seen provinces contain transferable cultural vocabulary for unseen provinces.
This is the core premise of the inductive zero-shot protocol described in the abstract.

Reference graph

Works this paper leans on

27 extracted references · 5 canonical work pages · 2 internal anchors

[1] S. Andréfouët, M. Paul, and A. R. Farhan, “Indonesia’s 13558 islands: A new census from space and a first step towards a one map for small islands policy,” Marine Policy, vol. 135, pp. 104848, Jan. 2022.

[2] L. Suryatni and I. D. K. K. Widana, “Perception and appreciation of the indonesian plural society toward cultural diversity,” Technium Social Science Journal, vol. 43, pp. 466, 2023.

[3] P. H. Candra, A. Widita, F. H. Maulida, M. Shanti, and Y. B. Kusuma, “Rebranding of malangan batik as a symbol of malang’s cultural identity through value chain analysis,” in E3S Web of Conferences, 2023, vol. 426, p. 02129.

[4] R. G. Tiwari, A. K. Agarwal, V. Jain, and A. Kumar, “Batik classification in indonesia: Exploring its significance on tourism and economy,” in 2023 International Conference on Sustaining Heritage: Innovative and Digital Approaches (ICSH), Jun. 2023, pp. 119–124.

[5] Y. Mayuzumi, “Is meeting the needs of tourists through ethnic tourism sustainable? focus on bali, indonesia,” Asia-Pacific Journal of Regional Science, vol. 6, no. 1, pp. 423–451, Feb. 2022.

[6] F. Fitriadi, R. M. Sinaga, and R. R. Muhammad, “A literature review on the cultural perspective study in elementary school education in indonesia,” Journal of Innovation in Educational and Cultural Research, vol. 5, no. 1, Feb. 2024.

[7] A. Fitrisia and E. Ernawati, “The forest cultural heritage in the east coast sumatra,” in Proceedings of the 9th Asbam International Conference (ASBAM 2021), Apr. 2022, pp. 453–457.

[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021, arXiv:2103.00020.

[9] M. Barraco, M. Cornia, S. Cascianelli, L. Baraldi, and R. Cucchiara, “The unreasonable effectiveness of CLIP features for image captioning: An experimental analysis,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2022, pp. 4661–4669.

[10] Z. Zhang et al., “Reducing bias in AI-based analysis of visual artworks,” IEEE BITS the Information Theory Magazine, vol. 2, no. 1, pp. 36–48, Oct. 2022.

[11] H. Wang, Y. Zhan, L. Liu, L. Ding, Y. Yang, and J. Yu, “Towards alleviating text-to-image retrieval hallucination for CLIP in zero-shot learning,” 2024, arXiv:2402.18400.

[12] S. Yu, P. H. Seo, and J. Son, “Zero-shot referring image segmentation with global-local context features,” 2023, arXiv:2303.17811.

[13] L. Lou, K. Lu, and J. Xue, “Improved transformer with parallel encoders for image captioning,” in 2022 26th International Conference on Pattern Recognition (ICPR), Aug. 2022, pp. 4072–4075.

[14] A. Mahmoud et al., “Sieve: Multimodal dataset pruning using image captioning models,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2024, pp. 22423–22432.

[15] K. Jeong, W. Lee, W. Nam, M. Ma, and P. Kang, “Technical report of NICE challenge at CVPR 2024: Caption re-ranking evaluation using ensembled CLIP and consensus scores,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2024, pp. 7366–7372.

[16] H. Han, M. Bhatti, B. Ali, Y. A. Ali, M. Al-razgan, and Y. Yasid, “Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing,” Big Data Research, vol. 37, pp. 100477, Jan. 2024.

[17] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt, “Improving multimodal datasets with image captioning,” Advances in neural information processing systems, vol. 36, pp. 22047–22069, 2023.

[18] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al., “Pali: A jointly-scaled multilingual language-image model,” arXiv preprint arXiv:2209.06794, 2022.

[19] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei, “Grounding multimodal large language models to the world,” in International Conference on Learning Representations, 2024, vol. 2024, pp. 51575–51598.

[20] N. Yudistira and T. Kurita, “Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning,” EURASIP Journal on Image and Video Processing, vol. 2017, no. 1, pp. 85, 2017.

[21] Mourad Bahani, Aziza El Ouaazizi, and Khalil Maalmi, “The effectiveness of t5, gpt-2, and bert on text-to-image generation task,” Pattern recognition letters, vol. 173, pp. 57–63, 2023.

[22] Z. Xu et al., “Benchmarking zero-shot recognition with vision-language models: Challenges on granularity and specificity,” 2024, Amazon Science. [Online]. Available: https://www.amazon.science/publications/benchmarking-zero-shot-recognition-with-vision-language-models-challenges-on-granularity-and-specificity

[23] S. Sarto, M. Barraco, M. Cornia, L. Baraldi, and R. Cucchiara, “Positive-augmented contrastive learning for image and video captioning evaluation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2023, pp. 6914–6924.

[24] D. Harisanty, K. L. B. Obille, N. E. V. Anna, E. Purwanti, and F. Retrialisca, “Cultural heritage preservation in the digital age, harnessing artificial intelligence for the future: a bibliometric analysis,” Digital Library Perspectives, vol. 40, no. 4, pp. 609–630, Sep. 2024.

[25] A. H. Aida, “Cultural heritage preservation in the digital age: Balancing tradition and innovation in mediterranean smart cities,” in 2024 Mediterranean Smart Cities Conference (MSCC), May 2024, pp. 1–6.

[26] Apri Subagyo, Mengenal Pakaian Adat Nusantara, CV. Indrajaya, Jakarta Timur, 2017.

[27] Z. Lai et al., “VeCLIP: Improving CLIP training via visual-enriched captions,” 2024, arXiv:2310.07699.
