CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Pith reviewed 2026-05-12 22:19 UTC · model grok-4.3
The pith
CLIP embeddings can score how well a generated caption matches its image without any human reference captions and match human judgments better than metrics that require them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLIPScore is computed from the cosine similarity between CLIP image and text embeddings and achieves higher correlation with human judgments of caption quality than reference-based metrics such as CIDEr and SPICE on multiple corpora. A reference-augmented variant called RefCLIPScore further improves correlation by incorporating text-text similarity as well. The approach performs strongly on literal description tasks and domains such as clip-art but shows relative weakness on captions that require external contextual knowledge.
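For concreteness, the two metrics as the paper defines them (notation lightly adapted), with c the caption embedding, v the image embedding, R the set of reference-caption embeddings, and HM the harmonic mean:

    CLIP-S(c, v) = w · max(cos(c, v), 0),  with w = 2.5
    RefCLIP-S(c, R, v) = HM( CLIP-S(c, v), max( max_{r ∈ R} cos(c, r), 0 ) )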
What carries the argument
CLIPScore, the direct cosine similarity between a CLIP model's image embedding and caption embedding that quantifies image-text compatibility without references.
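A minimal sketch of that computation using the Hugging Face transformers CLIP wrappers. The ViT-B/32 checkpoint, the w = 2.5 rescaling, and the "A photo depicts" prefix follow the paper's reported setup; preprocessing details may differ from the authors' released implementation.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
        # The paper prefixes candidate captions with "A photo depicts".
        inputs = processor(text=["A photo depicts " + caption], images=image,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        # Cosine similarity of the two embeddings, clipped at zero and rescaled.
        cos = torch.nn.functional.cosine_similarity(img, txt).item()
        return w * max(cos, 0.0)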
If this is right
- Captioning systems can be evaluated automatically in settings where reference captions are unavailable or expensive to collect.
- Hybrid reference-plus-CLIP metrics become preferable when references exist, as they capture both visual fit and textual fluency.
- The metric remains reliable on literal visual descriptions but requires caution on tasks that demand world knowledge beyond the image.
- Evaluation pipelines can now incorporate CLIPScore as a fast, scalable complement to slower human studies.
Where Pith is reading between the lines
- Captioning models could be trained end-to-end by treating CLIPScore as a differentiable reward signal instead of relying solely on cross-entropy or CIDEr optimization (see the sketch after this list).
- The same reference-free idea may extend to evaluating other image-text outputs, such as answers in visual question answering or stories generated from images.
- Domains where CLIPScore underperforms, such as news images, point to the need for additional knowledge sources that current web-pretrained embeddings do not supply.
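A hypothetical sketch of the reward idea above, not something the paper implements: the score is differentiable with respect to the embeddings, but sampled captions are discrete tokens, so the simplest wiring uses CLIPScore as a self-critical (REINFORCE-style) reward with the greedy decode as baseline. The clip_scores helper and scst_loss below are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def clip_scores(model, pixel_values, input_ids, attention_mask, w=2.5):
        # Batched CLIPScore-style rewards from a transformers.CLIPModel.
        # Used as a scalar reward, so no gradient flows through CLIP itself.
        with torch.no_grad():
            img = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
            txt = F.normalize(model.get_text_features(input_ids=input_ids,
                                                      attention_mask=attention_mask), dim=-1)
            return w * torch.clamp((img * txt).sum(dim=-1), min=0.0)

    def scst_loss(caption_log_probs, sampled_rewards, greedy_rewards):
        # Self-critical policy gradient: reward advantage times log-probability
        # of the sampled caption under the captioning policy.
        advantage = sampled_rewards - greedy_rewards
        return -(advantage * caption_log_probs).mean()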
Load-bearing premise
That CLIP's web-pretrained image and text representations already encode a general, transferable signal of caption quality that holds across domains without task-specific retraining.
What would settle it
A new human rating study on a held-out captioning dataset in which CLIPScore shows lower Pearson or Spearman correlation with the ratings than CIDEr or SPICE would falsify the central performance claim.
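Running that comparison is mechanical once metric scores are paired with ratings; a minimal sketch with hypothetical toy numbers, using scipy:

    from scipy.stats import pearsonr, spearmanr

    # Hypothetical paired arrays, one entry per (image, candidate caption).
    human     = [4.0, 2.5, 3.0, 5.0, 1.0]      # held-out human quality ratings
    clipscore = [0.71, 0.48, 0.55, 0.80, 0.22]
    cider     = [1.10, 0.40, 0.90, 1.30, 0.15]

    for name, metric in [("CLIPScore", clipscore), ("CIDEr", cider)]:
        r, _ = pearsonr(metric, human)
        rho, _ = spearmanr(metric, human)
        print(f"{name}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")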
Original abstract
Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLIPScore, a reference-free metric for image captioning evaluation that computes cosine similarity between CLIP embeddings of an image and a generated caption, using a fixed, publicly available pretrained model. It claims this metric achieves the highest correlation with human judgments across several corpora, outperforming reference-based metrics such as CIDEr and SPICE, while information-gain experiments show complementarity to text-text similarity metrics. A reference-augmented variant (RefCLIPScore) is also presented that yields even higher correlations. Case studies highlight strong performance on clip-art and alt-text tasks but weaker results on news captions requiring external context.
Significance. If the empirical correlations hold, the work would be significant for establishing a simple, parameter-free, reference-free evaluation method that aligns better with human judgments than standard n-gram or scene-graph metrics. The absence of any fitting to evaluation datasets and the explicit complementarity analysis are strengths that could shift evaluation practices in vision-language research toward leveraging large pretrained multimodal models.
Major comments (3)
- [Abstract] Abstract and experimental results: the claim of consistent outperformance and highest correlation with human judgments lacks reported exact Pearson/Spearman values, confidence intervals, or statistical significance tests comparing CLIPScore to CIDEr and SPICE; without these, the superiority assertion cannot be fully evaluated.
- [Case studies] Case studies section: weaker performance on news captions is noted as requiring richer contextual knowledge, but no domain-stratified splits, ablation on CLIP variants, or controls for distribution shift are described; this directly challenges the robustness claim for the web-pretrained embeddings across captioning domains.
- [Experiments] Human judgment collection: potential confounds (e.g., annotation instructions, inter-annotator agreement details, or selection bias in the corpora) are not addressed, which is load-bearing for validating that CLIPScore's image-text compatibility signal truly tracks quality rather than artifacts of the judgment process.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our empirical claims and limitations. We address each major point below and have revised the manuscript accordingly where feasible.
Point-by-point responses
Referee: [Abstract] Abstract and experimental results: the claim of consistent outperformance and highest correlation with human judgments lacks reported exact Pearson/Spearman values, confidence intervals, or statistical significance tests comparing CLIPScore to CIDEr and SPICE; without these, the superiority assertion cannot be fully evaluated.
Authors: The manuscript reports Pearson and Spearman correlations in Tables 2 and 3 for multiple datasets, showing CLIPScore outperforming CIDEr and SPICE. However, we agree that the abstract and main text do not highlight exact values, confidence intervals, or significance tests. In revision, we will update the abstract with key correlation figures and add bootstrap-derived 95% confidence intervals plus paired significance tests (e.g., Williams test) to the experimental section. This will allow direct evaluation of the outperformance claims without altering the underlying results. revision: yes
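A minimal sketch of the paired bootstrap the rebuttal proposes, with a hypothetical helper name; the Williams test additionally requires the correlation between the two metrics themselves and is omitted here.

    import numpy as np
    from scipy.stats import pearsonr

    def bootstrap_corr_diff(metric_a, metric_b, human, n_boot=10_000, seed=0):
        # 95% CI for the difference in Pearson correlation with human ratings,
        # resampling (image, caption) items with replacement.
        rng = np.random.default_rng(seed)
        a, b, h = (np.asarray(x) for x in (metric_a, metric_b, human))
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, len(h), size=len(h))
            diffs[i] = pearsonr(a[idx], h[idx])[0] - pearsonr(b[idx], h[idx])[0]
        return np.percentile(diffs, [2.5, 97.5])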
Referee: [Case studies] Case studies section: weaker performance on news captions is noted as requiring richer contextual knowledge, but no domain-stratified splits, ablation on CLIP variants, or controls for distribution shift are described; this directly challenges the robustness claim for the web-pretrained embeddings across captioning domains.
Authors: The case studies are qualitative illustrations of domain differences rather than a comprehensive robustness study; the primary claims rest on the aggregate results across standard captioning benchmarks. We explicitly flag the news-caption limitation in the manuscript. To address the concern, the revision will include a short discussion of potential distribution shift between web-pretraining data and news domains, plus a note that future work could explore CLIP variants or fine-tuning. No new ablations or stratified splits are added, as the focus remains on the fixed public model, but the limitation is now stated more prominently. revision: partial
Referee: [Experiments] Human judgment collection: potential confounds (e.g., annotation instructions, inter-annotator agreement details, or selection bias in the corpora) are not addressed, which is load-bearing for validating that CLIPScore's image-text compatibility signal truly tracks quality rather than artifacts of the judgment process.
Authors: The human judgments are taken from previously published evaluation datasets whose collection protocols are described in the cited source papers. We will expand the experimental setup section in revision to summarize the key details of annotation instructions, reported inter-annotator agreement, and corpus construction from those references. This addition will make explicit that CLIPScore is evaluated against the same human signals used by prior metrics, while acknowledging any known limitations of the original judgment processes. revision: yes
Circularity Check
No circularity: CLIPScore is a fixed function of an external pretrained model; correlations are measured empirically.
Full rationale
The paper defines CLIPScore as a direct cosine similarity computation in the fixed CLIP embedding space (pretrained on 400M web pairs, no parameters tuned on caption evaluation data). The reported correlations with human judgments are post-hoc empirical measurements on standard corpora, not quantities fitted or defined in terms of the target results. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing for the central claim. The evaluation is grounded in external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: CLIP embeddings capture semantic image-text compatibility sufficiently well to serve as a proxy for human caption quality judgments.
Forward citations
Cited by 28 Pith papers
- Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles
  A theoretical framework decouples diffusion model generation from watermark decisions, enabling SSB to reach any security-robustness-fidelity regime without model-specific empirical tests.
- PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
  PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
- Long-Text-to-Image Generation via Compositional Prompt Decomposition
  PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
- Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
  HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
- The Indra Representation Hypothesis for Multimodal Alignment
  Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal...
- 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
  1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process
  DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
  Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
- Imagen Video: High Definition Video Generation with Diffusion Models
  Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
- ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
  ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
- BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
  BalCapRL applies balanced multi-objective RL with GDPO-style normalization and length-conditional masking to improve MLLM image captioning, reporting gains of up to +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena on...
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
  D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
- Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration
  TICoE achieves more precise and faithful concept erasure in text-to-image models by collaborating text and image data through a convex manifold and hierarchical learning, outperforming prior methods.
- Bias at the End of the Score
  Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
- Evolutionary Token-Level Prompt Optimization for Diffusion Models
  A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
  Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
- Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
  Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
- HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
  HandDreamer is the first zero-shot text-to-3D method for hands that uses MANO initialization, skeleton-guided diffusion, and corrective shape guidance to produce view-consistent models.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
  IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- Aligning Text-to-Image Models using Human Feedback
  A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
- RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
  RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...
- Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
  SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.
- ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance
  ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.
- SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
  SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraini...
- AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation
  AutoVQA-G is a self-improving framework that generates VQA-G datasets with higher visual grounding accuracy than leading multimodal LLMs via iterative CoT verification and prompt refinement.
- ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
  Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV. Springer.
- [4] Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.
- [5] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Evaluation Measures for MT and Summarization.
- [6] Alexander C. Berg, Tamara L. Berg, Hal Daumé III, Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Aneesh Sood, Karl Stratos, and Kota Yamaguchi. 2012. Understanding and predicting importance in images. In CVPR.
- [7] Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, and Dimosthenis Karatzas. 2019. Good news, everyone! Context driven entity-aware captioning for news images. In CVPR.
- [8] John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In COLING.
- [9] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In ECCV.
- [10] Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In CVPR.
- [11] Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. In NeurIPS.
- [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
- [14] Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In ACL.
- [15] Cole Gleason, Patrick Carrington, Cameron Cassidy, Meredith Ringel Morris, Kris M. Kitani, and Jeffrey P. Bigham. 2019. "It's almost like they're trying to hide it": How user-provided image descriptions have failed to make Twitter accessible. In WWW.
- [16] Cole Gleason, Amy Pavel, Emma McCamey, Christina Low, Patrick Carrington, Kris M. Kitani, and Jeffrey P. Bigham. 2020. Twitter A11y: A browser extension to make Twitter images accessible. In CHI.
- [17] Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also snowboard: Overcoming bias in captioning models. In ECCV, pages 771–787.
- [18] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899.
- [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.
- [20] Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. 2019. TIGEr: Text-to-image grounding for image caption evaluation. In EMNLP.
- [21] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In ICLR.
- [22] Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for text generation. In 1st Workshop on Evaluating NLG Evaluation.
- [23] Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In EACL.
- [24] Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. 2021. UMIC: An unreferenced metric for image captioning via contrastive learning. In ACL.
- [25] Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2020. ViLBERTScore: Evaluating image caption using vision-and-language BERT. In First Workshop on Evaluation and Comparison of NLP Systems.
- [26] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In ECCV.
- [27] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
- [28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer.
- [29] Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. 2018. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In ECCV.
- [30] Chi-kiu Lo. 2019. YiSi: A unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Fourth Conference on Machine Translation.
- [31] Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
- [32] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
- [33] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR.
- [34]
- [35] Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In CVPR.
- [36] Haley MacLeod, Cynthia L. Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. Understanding blind people's experiences with computer-generated captions of social media images. In CHI.
- [37] Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In ACL.
- [38] Yashar Mehdad, Matteo Negri, and Marcello Federico. 2012. Match without a referee: Evaluating MT adequacy without reference translations. In Seventh Workshop on Statistical Machine Translation.
- [39] Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In ACL.
- [40] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In FAccT.
- [41] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- [42] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
- [43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. JMLR, 12.
- [44] Maxime Peyrard and Iryna Gurevych. 2018. Objective function learning to match human judgements for optimization-based summarization. In NAACL.
- [45] Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL.
- [46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.
- [47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- [48] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In EMNLP.
- [49] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. IJCV.
- [50] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
- [51] Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find one mismatch between image and language caption. In ACL.
- [52] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2019. Engaging image captioning via personality. In CVPR.
- [53] Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS.
- [54] Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation, 24(1):39–50.
- [55] Lucia Specia and Kashif Shah. 2018. Machine translation quality estimation: Applications and future perspectives. In Translation Quality Assessment, pages 201–235. Springer.
- [56] Abigale Stangl, Meredith Ringel Morris, and Danna Gurari. 2020. "Person, shoes, tree. Is the person naked?" What people with vision impairments want in image descriptions. In CHI.
- [57] Simeng Sun and Ani Nenkova. 2019. The feasibility of embedding based automatic evaluation for single document summarization. In EMNLP.
- [58] Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI.
- [59] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
- [60] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.
- [61] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. TPAMI, 39(4):652–663.
- [62] Sijin Wang, Ziwei Yao, Ruiping Wang, Zhongqin Wu, and Xilin Chen. 2021. FAIEr: Fidelity and adequacy ensured image caption evaluation. In CVPR.
- [63] Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
- [64] Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality estimation and translation metrics via pre-trained word and sentence embeddings. In Fourth Conference on Machine Translation.
- [65] Yanzhi Yi, Hangyu Deng, and Jinglu Hu. 2020. Improving image captioning evaluation by considering inter references variance. In ACL.
- [66] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78.
- [67] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR.
- [68] Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In ACL.
- [69] C. Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In CVPR.
- [70] Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In EMNLP.