pith. machine review for the scientific record.

arxiv: 2104.08718 · v3 · submitted 2021-04-18 · 💻 cs.CV · cs.CL

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Pith reviewed 2026-05-12 22:19 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords image captioning · evaluation metrics · reference-free evaluation · CLIP model · human correlation · multimodal similarity · automatic metrics

The pith

CLIP embeddings can score how well a generated caption matches its image without any human reference captions, and the resulting scores track human judgments better than metrics that require references.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a cross-modal model pretrained on 400M web image-caption pairs can measure caption quality directly through image-text similarity. This reference-free approach removes the need to collect multiple human-written descriptions for comparison. Across several captioning datasets, the resulting CLIPScore correlates more strongly with human ratings than established reference-based metrics. The method focuses tightly on visual-textual fit and is complementary to text-only similarity measures, so combining the two yields an improved hybrid when references are available. This finding matters because reference collection is costly and limits rapid iteration in captioning research.

Core claim

CLIPScore is computed from the cosine similarity between CLIP image and text embeddings and achieves higher correlation with human judgments of caption quality than reference-based metrics such as CIDEr and SPICE on multiple corpora. A reference-augmented variant called RefCLIPScore further improves correlation by incorporating text-text similarity as well. The approach performs strongly on literal description tasks and domains such as clip-art but shows relative weakness on captions that require external contextual knowledge.

What carries the argument

CLIPScore, the direct cosine similarity between a CLIP model's image embedding and caption embedding that quantifies image-text compatibility without references.
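
As a rough illustration of the quantity described above, the following sketch computes a CLIPScore-style value from two embedding vectors that are assumed to have been produced already by a CLIP image encoder and text encoder. The toy 3-dimensional vectors are made up for illustration. The rescaling constant `w = 2.5` and the clipping of negative similarities to zero are assumptions that follow the commonly cited form of the metric; neither is stated on this page.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, caption_emb, w=2.5):
    """Rescaled, zero-clipped cosine similarity.

    w is an assumed rescaling constant, not taken from this page.
    """
    return w * max(cosine_similarity(image_emb, caption_emb), 0.0)

# Toy vectors standing in for real CLIP embeddings (hypothetical values).
image_emb = [1.0, 0.0, 0.0]
good_caption_emb = [0.8, 0.6, 0.0]   # points mostly in the image direction
bad_caption_emb = [-1.0, 0.0, 0.0]   # points the opposite way, clipped to 0

print(clip_score(image_emb, good_caption_emb))
print(clip_score(image_emb, bad_caption_emb))
```

No reference captions appear anywhere in the computation; the only inputs are the image embedding and the candidate caption embedding.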

If this is right

  • Captioning systems can be evaluated automatically in settings where reference captions are unavailable or expensive to collect.
  • Hybrid reference-plus-CLIP metrics become preferable when references exist, as they capture both visual fit and textual fluency.
  • The metric remains reliable on literal visual descriptions but requires caution on tasks that demand world knowledge beyond the image.
  • Evaluation pipelines can now incorporate CLIPScore as a fast, scalable complement to slower human studies.
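
The hybrid metric mentioned in the list above can be sketched as follows. This is a minimal sketch, assuming that the image-text term and the best caption-reference similarity are combined with a harmonic mean; this page only says that text-text similarity is incorporated, so the combination rule, the constant `w = 2.5`, and all toy vectors are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of an image-text term and a text-text term (assumed form)."""
    # Image-text compatibility, rescaled and clipped at zero.
    image_term = w * max(cosine_similarity(image_emb, caption_emb), 0.0)
    # Best match between the candidate caption and any reference caption.
    ref_term = max(
        max(cosine_similarity(caption_emb, r) for r in reference_embs), 0.0
    )
    if image_term == 0.0 or ref_term == 0.0:
        return 0.0  # harmonic mean is zero when either term is zero
    return 2.0 * image_term * ref_term / (image_term + ref_term)

# Toy embeddings (hypothetical values standing in for CLIP outputs).
image_emb = [1.0, 0.0, 0.0]
caption_emb = [0.8, 0.6, 0.0]
reference_embs = [[0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]

score = ref_clip_score(image_emb, caption_emb, reference_embs)
print(score)
```

A harmonic mean is low whenever either term is low, so under this assumed form a caption has to fit both the image and at least one reference to score well.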

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Captioning models could be trained end-to-end by treating CLIPScore as a differentiable reward signal instead of relying solely on cross-entropy or CIDEr optimization.
  • The same reference-free idea may extend to evaluating other image-text outputs such as visual question answering answers or story generation from images.
  • Domains where CLIPScore underperforms, such as news images, point to the need for additional knowledge sources that current web-pretrained embeddings do not supply.

Load-bearing premise

That CLIP's web-pretrained image and text representations already encode a general, transferable signal of caption quality that holds across domains without task-specific retraining.

What would settle it

A new human rating study on a held-out captioning dataset in which CLIPScore shows lower Pearson or Spearman correlation with the ratings than CIDEr or SPICE would falsify the central performance claim.
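
The comparison described above amounts to computing, for each metric, its correlation with the human ratings on the same captions and checking which is higher. The sketch below shows that check with plain-Python Pearson and Spearman implementations. The ratings and the two score lists (`metric_a`, `metric_b`) are invented numbers, not results from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human ratings and scores from two hypothetical metrics.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.2, 0.4, 0.5, 0.9]   # preserves the human ordering
metric_b = [0.3, 0.1, 0.5, 0.2, 0.4]   # scrambles the human ordering

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, pearson(human, scores), spearman(human, scores))
```

A point estimate like this does not settle the question by itself; whether a gap between two metrics is meaningful depends on sample size and on the uncertainty of each correlation.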

Original abstract

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces CLIPScore, a reference-free metric for image captioning evaluation that computes cosine similarity between CLIP embeddings of an image and a generated caption from a fixed, publicly pretrained model. It claims this metric achieves the highest correlation with human judgments across several corpora, outperforming reference-based metrics such as CIDEr and SPICE, while information-gain experiments show complementarity to text-text similarity metrics. A reference-augmented variant (RefCLIPScore) is also presented that yields even higher correlations. Case studies highlight strong performance on clip-art and alt-text tasks but weaker results on news captions requiring external context.

Significance. If the empirical correlations hold, the work would be significant for establishing a simple, parameter-free, reference-free evaluation method that aligns better with human judgments than standard n-gram or scene-graph metrics. The absence of any fitting to evaluation datasets and the explicit complementarity analysis are strengths that could shift evaluation practices in vision-language research toward leveraging large pretrained multimodal models.

major comments (3)
  1. [Abstract] Abstract and experimental results: the claim of consistent outperformance and highest correlation with human judgments lacks reported exact Pearson/Spearman values, confidence intervals, or statistical significance tests comparing CLIPScore to CIDEr and SPICE; without these, the superiority assertion cannot be fully evaluated.
  2. [Case studies] Case studies section: weaker performance on news captions is noted as requiring richer contextual knowledge, but no domain-stratified splits, ablation on CLIP variants, or controls for distribution shift are described; this directly challenges the robustness claim for the web-pretrained embeddings across captioning domains.
  3. [Experiments] Human judgment collection: potential confounds (e.g., annotation instructions, inter-annotator agreement details, or selection bias in the corpora) are not addressed, which is load-bearing for validating that CLIPScore's image-text compatibility signal truly tracks quality rather than artifacts of the judgment process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical claims and limitations. We address each major point below and have revised the manuscript accordingly where feasible.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the claim of consistent outperformance and highest correlation with human judgments lacks reported exact Pearson/Spearman values, confidence intervals, or statistical significance tests comparing CLIPScore to CIDEr and SPICE; without these, the superiority assertion cannot be fully evaluated.

    Authors: The manuscript reports Pearson and Spearman correlations in Tables 2 and 3 for multiple datasets, showing CLIPScore outperforming CIDEr and SPICE. However, we agree that the abstract and main text do not highlight exact values, confidence intervals, or significance tests. In revision, we will update the abstract with key correlation figures and add bootstrap-derived 95% confidence intervals plus paired significance tests (e.g., Williams test) to the experimental section. This will allow direct evaluation of the outperformance claims without altering the underlying results. revision: yes

  2. Referee: [Case studies] Case studies section: weaker performance on news captions is noted as requiring richer contextual knowledge, but no domain-stratified splits, ablation on CLIP variants, or controls for distribution shift are described; this directly challenges the robustness claim for the web-pretrained embeddings across captioning domains.

    Authors: The case studies are qualitative illustrations of domain differences rather than a comprehensive robustness study; the primary claims rest on the aggregate results across standard captioning benchmarks. We explicitly flag the news-caption limitation in the manuscript. To address the concern, the revision will include a short discussion of potential distribution shift between web-pretraining data and news domains, plus a note that future work could explore CLIP variants or fine-tuning. No new ablations or stratified splits are added, as the focus remains on the fixed public model, but the limitation is now stated more prominently. revision: partial

  3. Referee: [Experiments] Human judgment collection: potential confounds (e.g., annotation instructions, inter-annotator agreement details, or selection bias in the corpora) are not addressed, which is load-bearing for validating that CLIPScore's image-text compatibility signal truly tracks quality rather than artifacts of the judgment process.

    Authors: The human judgments are taken from previously published evaluation datasets whose collection protocols are described in the cited source papers. We will expand the experimental setup section in revision to summarize the key details of annotation instructions, reported inter-annotator agreement, and corpus construction from those references. This addition will make explicit that CLIPScore is evaluated against the same human signals used by prior metrics, while acknowledging any known limitations of the original judgment processes. revision: yes
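
The bootstrap-derived 95% confidence intervals mentioned in the first response can be sketched as follows. This is a minimal percentile-bootstrap sketch and not the authors' procedure: the resample count, the seed, the function name `bootstrap_ci`, and the data are all assumptions for illustration, and the paired significance test the rebuttal names (the Williams test) is not implemented here.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def bootstrap_ci(human, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the human-metric correlation."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_resamples):
        # Resample (rating, score) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        m = [metric[i] for i in idx]
        # Skip degenerate resamples where the correlation is undefined.
        if len(set(h)) < 2 or len(set(m)) < 2:
            continue
        stats.append(pearson(h, m))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper

# Hypothetical ratings and metric scores for ten captions.
human = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
metric = [0.2, 0.3, 0.25, 0.5, 0.45, 0.6, 0.7, 0.8, 0.75, 0.9]

lower, upper = bootstrap_ci(human, metric)
print(pearson(human, metric), (lower, upper))
```

Resampling whole (rating, score) pairs keeps each caption's human rating attached to its metric score, which is what makes the interval reflect uncertainty in the correlation rather than in either variable alone.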

Circularity Check

0 steps flagged

No circularity: CLIPScore is a fixed function of an external pretrained model; correlations are measured empirically.

Full rationale

The paper defines CLIPScore as a direct cosine similarity computation in the fixed CLIP embedding space (pretrained on 400M web pairs, no parameters tuned on caption evaluation data). The reported correlations with human judgments are post-hoc empirical measurements on standard corpora, not quantities fitted or defined in terms of the target results. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing for the central claim. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical performance of the fixed pretrained CLIP model applied to caption evaluation; no new free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption CLIP embeddings capture semantic image-text compatibility sufficiently well to serve as a proxy for human caption quality judgments.
    Invoked when defining CLIPScore as the cosine similarity between image and text features without further justification or calibration on the target task.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles

    cs.CR 2026-05 unverdicted novelty 7.0

    A theoretical framework decouples diffusion model generation from watermark decisions, enabling SSB to reach any security-robustness-fidelity regime without model-specific empirical tests.

  2. PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement

    cs.RO 2026-04 unverdicted novelty 7.0

    PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.

  3. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  4. Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

    cs.CR 2026-04 unverdicted novelty 7.0

    HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.

  5. The Indra Representation Hypothesis for Multimodal Alignment

    cs.CV 2026-04 unverdicted novelty 7.0

    Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...

  6. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  7. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  8. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  9. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  10. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  11. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  12. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...

  13. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  14. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  15. Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration

    cs.CV 2026-04 unverdicted novelty 6.0

    TICoE achieves more precise and faithful concept erasure in text-to-image models by collaborating text and image data through a convex manifold and hierarchical learning, outperforming prior methods.

  16. Bias at the End of the Score

    cs.CV 2026-04 unverdicted novelty 6.0

    Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.

  17. Evolutionary Token-Level Prompt Optimization for Diffusion Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.

  18. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  19. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  20. HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    HandDreamer is the first zero-shot text-to-3D method for hands that uses MANO initialization, skeleton-guided diffusion, and corrective shape guidance to produce view-consistent models.

  21. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  22. Aligning Text-to-Image Models using Human Feedback

    cs.LG 2023-02 unverdicted novelty 6.0

    A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

  23. RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...

  24. Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models

    cs.LG 2026-05 unverdicted novelty 5.0

    SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.

  25. ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.

  26. SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

    cs.CV 2026-04 unverdicted novelty 5.0

    SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraini...

  27. AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

    cs.CV 2026-04 unverdicted novelty 5.0

    AutoVQA-G is a self-improving framework that generates VQA-G datasets with higher visual grounding accuracy than leading multimodal LLMs via iterative CoT verification and prompt refinement.

  28. ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation

    cs.RO 2026-04 unverdicted novelty 5.0

    Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 28 Pith papers · 1 internal anchor

  1. [1]

    Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292

  2. [2]

    Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. 2021. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818

  3. [3]

    Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In ECCV. Springer

  4. [4]

    Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597--610

  5. [5]

    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In ACL workshop on Evaluation Measures for MT and Summarization

  6. [6]

    Alexander C. Berg, Tamara L. Berg, Hal Daumé III, Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Aneesh Sood, Karl Stratos, and Kota Yamaguchi. 2012. Understanding and predicting importance in images. In CVPR

  7. [7]

    Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, and Dimosthenis Karatzas. 2019. Good news, everyone! context driven entity-aware captioning for news images. In CVPR

  8. [8]

    John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In COLING

  9. [9]

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In ECCV

  10. [10]

    Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In CVPR

  11. [11]

    Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. In NeurIPS

  12. [12]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL

  13. [13]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR

  14. [14]

    Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In ACL

  15. [15]

    Cole Gleason, Patrick Carrington, Cameron Cassidy, Meredith Ringel Morris, Kris M Kitani, and Jeffrey P Bigham. 2019. "it's almost like they're trying to hide it": How user-provided image descriptions have failed to make twitter accessible. In WWW

  16. [16]

    Cole Gleason, Amy Pavel, Emma McCamey, Christina Low, Patrick Carrington, Kris M Kitani, and Jeffrey P Bigham. 2020. Twitter a11y: A browser extension to make twitter images accessible. In CHI

  17. [17]

    Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 771--787

  18. [18]

    Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853--899

  19. [19]

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML

  20. [20]

    Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. 2019. TIGEr: text-to-image grounding for image caption evaluation. In EMNLP

  21. [21]

    Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In ICLR

  22. [22]

    Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral based interchangeability assessor for text generation. In 1st Workshop on Evaluating NLG Evaluation

  23. [23]

    Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In EACL

  24. [24]

    Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. 2021. UMIC: an unreferenced metric for image captioning via contrastive learning. In ACL

  25. [25]

    Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2020. Vilbertscore: Evaluating image caption using vision-and-language bert. In First Workshop on Evaluation and Comparison of NLP Systems

  26. [26]

    Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In ECCV

  27. [27]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out

  28. [28]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer

  29. [29]

    Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. 2018. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In ECCV

  30. [30]

    Chi-kiu Lo. 2019. Yisi-a unified semantic mt quality evaluation and estimation metric for languages with different levels of available resources. In Fourth Conference on Machine Translation

  31. [31]

    Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267--300

  32. [32]

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS

  33. [33]

    Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR

  34. [34]

    Grace Luo, Trevor Darrell, and Anna Rohrbach. 2021. NewsCLIPpings: automatic generation of out-of-context multimodal media. arXiv preprint arXiv:2104.05893

  35. [35]

    Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In CVPR

  36. [36]

    Haley MacLeod, Cynthia L Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. Understanding blind people's experiences with computer-generated captions of social media images. In CHI

  37. [37]

    Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In ACL

  38. [38]

    Yashar Mehdad, Matteo Negri, and Marcello Federico. 2012. Match without a referee: evaluating mt adequacy without reference translations. In Seventh Workshop on Statistical Machine Translation

  39. [39]

    Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In ACL

  40. [40]

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In FAccT

  41. [41]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

  42. [42]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL

  43. [43]

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. JMLR, 12

  44. [44]

    Maxime Peyrard and Iryna Gurevych. 2018. Objective function learning to match human judgements for optimization-based summarization. In NAACL

  45. [45]

    Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL

  46. [46]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision

  47. [47]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

  48. [48]

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In EMNLP

  49. [49]

    Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. IJCV

  50. [50]

    Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL

  51. [51]

    Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! find one mismatch between image and language caption. In ACL

  52. [52]

    Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2019. Engaging image captioning via personality. In CVPR

  53. [53]

    Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS

  54. [54]

    Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine translation, 24(1):39--50

  55. [55]

    Lucia Specia and Kashif Shah. 2018. Machine translation quality estimation: Applications and future perspectives. In Translation Quality Assessment, pages 201--235. Springer

  56. [56]

    Abigale Stangl, Meredith Ringel Morris, and Danna Gurari. 2020. "person, shoes, tree. is the person naked?" what people with vision impairments want in image descriptions. In CHI

  57. [57]

    Simeng Sun and Ani Nenkova. 2019. The feasibility of embedding based automatic evaluation for single document summarization. In EMNLP

  58. [58]

    Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI

  59. [59]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS

  60. [60]

    Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR

  61. [61]

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. TPAMI, 39(4):652--663

  62. [62]

    Sijin Wang, Ziwei Yao, Ruiping Wang, Zhongqin Wu, and Xilin Chen. 2021. FAIEr: Fidelity and adequacy ensured image caption evaluation. In CVPR

  63. [63]

    Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP

  64. [64]

    Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality estimation and translation metrics via pre-trained word and sentence embeddings. In Fourth Conference on Machine Translation

  65. [65]

    Yanzhi Yi, Hangyu Deng, and Jinglu Hu. 2020. Improving image captioning evaluation by considering inter references variance. In ACL

  66. [66]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67--78

  67. [67]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR

  68. [68]

    Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In ACL

  69. [69]

    C Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In CVPR

  70. [70]

    Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In EMNLP