pith. machine review for the scientific record.

arxiv: 1504.00325 · v2 · submitted 2015-04-01 · 💻 cs.CV · cs.CL

Recognition: 3 theorem links


Microsoft COCO Captions: Data Collection and Evaluation Server

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 21:32 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: COCO · image captioning · dataset · evaluation server · BLEU · METEOR · ROUGE · CIDEr

The pith

The Microsoft COCO Captions dataset collects over 1.5 million human captions for more than 330,000 images and supplies an evaluation server that scores submissions with standard metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Microsoft COCO Caption dataset along with a supporting evaluation server. When finished the collection will hold over one and a half million captions across more than 330,000 images, with five independent human captions supplied for every training and validation image. The server accepts candidate captions from automatic systems and returns scores computed with BLEU, METEOR, ROUGE, and CIDEr so that different algorithms can be compared on identical data and identical scoring procedures. A reader would care because the combination of large scale, multiple annotations, and a public scoring service removes one major source of inconsistency that has slowed progress in automatic image description.
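The server's four metrics all compare a candidate caption against the reference set in different ways; BLEU's core idea of reference-clipped n-gram precision with a brevity penalty is the simplest to see. The sketch below is a toy BLEU-1 for illustration only, not the server's implementation, which follows the full BLEU-4 recipe.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Toy BLEU-1: modified unigram precision against multiple
    reference captions, times a brevity penalty. Illustrative only;
    the COCO server uses the full multi-n-gram BLEU."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate word count by its max count in any reference,
    # so repeating a common word cannot inflate precision.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty against the closest reference length: short
    # candidates are penalized, longer ones are not rewarded.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision
```

A perfect match scores 1.0; a candidate shorter than every reference is discounted by the exponential brevity penalty even when every word it contains appears in a reference.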

Core claim

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr.

What carries the argument

The COCO Captions dataset with five independent human captions per training and validation image together with the public evaluation server that scores submitted captions using BLEU, METEOR, ROUGE, and CIDEr.
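The widely used submission format for the COCO caption server is a JSON array with one record per test image, pairing an `image_id` with the system's caption. The snippet below is a sketch under that assumption; the `image_id` values are placeholders, and the server's own instructions govern file naming and upload.

```python
import json

# Sketch of a COCO-style caption results payload: one
# {"image_id", "caption"} record per test image. The ids here are
# hypothetical placeholders, not real COCO image ids.
results = [
    {"image_id": 1, "caption": "a black and white photo of a train on the tracks"},
    {"image_id": 2, "caption": "a group of people standing around a kitchen"},
]

payload = json.dumps(results)
```

Because every system submits the same flat structure and is scored by the same server-side code, metric differences between algorithms cannot be attributed to differing local implementations of BLEU, METEOR, ROUGE, or CIDEr.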

Load-bearing premise

The collected human captions are sufficiently consistent, high-quality, and representative to serve as reliable ground truth for automatic systems.

What would settle it

An experiment in which human judges systematically prefer machine captions that receive low scores from the server on all four metrics would falsify the claim that the server provides a useful proxy for caption quality.
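One concrete way to run that test is pairwise: collect caption pairs, record which caption human judges prefer, and measure how often the metric ranks the preferred caption higher. The helper below is a toy sketch of that agreement check, not part of the paper or the server; `metric_scores` and `human_prefs` are hypothetical inputs.

```python
def pairwise_agreement(metric_scores, human_prefs):
    """Fraction of caption pairs on which a metric agrees with human
    preference. metric_scores is a list of (score_a, score_b) pairs;
    human_prefs marks the human-preferred caption as "a" or "b".
    Agreement near 0.5 means the metric is no better than chance as a
    proxy for human judgment; systematically below 0.5 would support
    the falsification scenario described above."""
    agree = 0
    for (score_a, score_b), preferred in zip(metric_scores, human_prefs):
        picked = "a" if score_a > score_b else "b"
        agree += picked == preferred
    return agree / len(human_prefs)
```

Running this per metric (BLEU, METEOR, ROUGE, CIDEr) on the same judged pairs would show whether any single metric, or only their combination, tracks human preference.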

read the original abstract

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper describes the Microsoft COCO Captions dataset and evaluation server. Upon completion, the dataset will contain over 1.5 million captions describing over 330,000 images, with five independent human-generated captions provided for each training and validation image. The evaluation server receives candidate captions and scores them using standard metrics including BLEU, METEOR, ROUGE, and CIDEr to promote consistent evaluation of automatic caption generation algorithms.

Significance. If the dataset is collected and the server implemented as described, this provides a valuable large-scale resource for image captioning research, enabling training on substantial data volumes and standardized benchmarking via established metrics on a public platform. The scale exceeds prior caption datasets and supports reproducible comparisons across methods.

minor comments (2)
  1. [Abstract] The abstract and description focus on planned scale and metrics but omit any details on the annotation protocol, quality control, or inter-annotator agreement measures, which would strengthen the presentation of the data collection process.
  2. No example captions, sample images, or illustrative server output are provided, which would improve clarity for readers unfamiliar with the dataset style or evaluation format.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of the manuscript and for recommending acceptance. The referee's summary correctly reflects the scope and purpose of the Microsoft COCO Captions dataset and evaluation server.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a purely descriptive release of the COCO Captions dataset and evaluation server. It states planned scale (over 1.5M captions for >330k images, five per training/validation image) and adoption of existing metrics (BLEU, METEOR, ROUGE, CIDEr) via a public server. No derivations, equations, predictions, fitted parameters, or optimality claims appear. Consequently no load-bearing steps exist that could reduce to self-definition, fitted inputs, or self-citation chains. The central content is factual description of data collection and tooling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no mathematical models, free parameters, axioms, or invented entities; it is purely a description of data collection and an evaluation tool.

pith-pipeline@v0.9.0 · 5399 in / 1019 out tokens · 68849 ms · 2026-05-12T21:32:41.651000+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  3. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  4. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  5. Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.

  6. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  7. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  8. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  9. DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

    cs.CV 2026-04 unverdicted novelty 7.0

    DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

  10. Batch Loss Score for Dynamic Data Pruning

    cs.LG 2026-04 unverdicted novelty 7.0

    BLS approximates per-sample loss importance via EMA of batch losses, enabling simple and effective dynamic pruning of 20-50% samples losslessly across many datasets and models.

  11. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  12. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  13. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  14. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  15. MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

    cs.CV 2026-05 unverdicted novelty 6.0

    MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.

  16. Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sentinel2Cap provides human-annotated captions for multimodal Sentinel satellite images, with zero-shot tests showing RGB outperforming SAR and prompts helping performance.

  17. Statistical Consistency and Generalization of Contrastive Representation Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Contrastive representation learning is statistically consistent for optimal retrieval and admits generalization bounds of order O(1/m + 1/sqrt(n)) supervised and O(1/sqrt(m) + 1/sqrt(n)) self-supervised that benefit f...

  18. EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

    cs.NI 2026-05 unverdicted novelty 6.0

    EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.

  19. TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

  20. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  21. LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    LinguDistill recovers approximately 10% of lost performance on language benchmarks in VLMs by selectively distilling from a frozen LM teacher using KV-cache sharing, while preserving vision performance.

  22. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  23. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  24. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  25. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  26. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  27. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  28. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  29. From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

  30. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  31. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  32. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  33. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  34. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  35. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 35 Pith papers

  1. [1]

    Learning the semantics of words and pictures,

    K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in ICCV, vol. 2, 2001, pp. 408–415

  2. [2]

    Matching words and pictures,

    K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” JMLR, vol. 3, pp. 1107–1135, 2003

  3. [3]

    A model for learning the semantics of pictures,

    V. Lavrenko, R. Manmatha, and J. Jeon, “A model for learning the semantics of pictures,” in NIPS, 2003

  4. [4]

    Baby talk: Understanding and generating simple image descriptions,

    G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Baby talk: Understanding and generating simple image descriptions,” in CVPR, 2011

  5. [5]

    Midge: Generating image descriptions from computer vision detections,

    M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III, “Midge: Generating image descriptions from computer vision detections,” in EACL, 2012

  6. [6]

    Every picture tells a story: Generating sentences from images,

    A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in ECCV, 2010

  7. [7]

    Framing image description as a ranking task: Data, models and evaluation metrics,

    M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” JAIR, vol. 47, pp. 853–899, 2013

  8. [8]

    Collective generation of natural image descriptions,

    P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, “Collective generation of natural image descriptions,” in ACL, 2012

  9. [9]

    Corpus-guided sentence generation of natural images,

    Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in EMNLP, 2011

  10. [10]

    Choosing linguistics over vision to describe images

    A. Gupta, Y. Verma, and C. Jawahar, “Choosing linguistics over vision to describe images.” in AAAI, 2012

  11. [11]

    Distributional semantics in technicolor,

    E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran, “Distributional semantics in technicolor,” in ACL, 2012

  12. [12]

    Automatic caption generation for news images,

    Y. Feng and M. Lapata, “Automatic caption generation for news images,” TPAMI, vol. 35, no. 4, pp. 797–812, 2013

  13. [13]

    Image description using visual dependency representations,

    D. Elliott and F. Keller, “Image description using visual dependency representations,” in EMNLP, 2013, pp. 1292–1302

  14. [14]

    Deep fragment embeddings for bidirectional image sentence mapping,

    A. Karpathy, A. Joulin, and F.-F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014

  15. [15]

    Improving image-sentence embeddings using large weakly annotated photo collections,

    Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, “Improving image-sentence embeddings using large weakly annotated photo collections,” in ECCV, 2014, pp. 529–545

  16. [16]

    Nonparametric method for data-driven image captioning,

    R. Mason and E. Charniak, “Nonparametric method for data-driven image captioning,” in ACL, 2014

  17. [17]

    Treetalk: Composition and compression of trees for image descriptions,

    P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, “Treetalk: Composition and compression of trees for image descriptions,” TACL, vol. 2, pp. 351–362, 2014

  18. [18]

    Autocaption: Automatic caption generation for personal photos,

    K. Ramnath, S. Baker, L. Vanderwende, M. El-Saban, S. N. Sinha, A. Kannan, N. Hassan, M. Galley, Y. Yang, D. Ramanan, A. Bergamo, and L. Torresani, “Autocaption: Automatic caption generation for personal photos,” in WACV, 2014

  19. [19]

    Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world,

    A. Lazaridou, E. Bruni, and M. Baroni, “Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world,” in ACL, 2014

  20. [20]

    Multimodal neural language models,

    R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in ICML, 2014

  21. [21]

    Explain images with multimodal recurrent neural networks,

    J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain images with multimodal recurrent neural networks,” arXiv preprint arXiv:1410.1090, 2014

  22. [22]

    Show and tell: A neural image caption generator,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” arXiv preprint arXiv:1411.4555, 2014

  23. [23]

    Deep visual-semantic alignments for generating image descriptions,

    A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” arXiv preprint arXiv:1412.2306, 2014

  24. [24]

    Unifying visual-semantic embeddings with multimodal neural language models,

    R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014

  25. [25]

    Long-term recurrent convolutional networks for visual recognition and description,

    J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” arXiv preprint arXiv:1411.4389, 2014

  26. [26]

    From captions to visual concepts and back,

    H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., “From captions to visual concepts and back,” arXiv preprint arXiv:1411.4952, 2014

  27. [27]

    Learning a recurrent visual representation for image caption generation,

    X. Chen and C. L. Zitnick, “Learning a recurrent visual representation for image caption generation,” arXiv preprint arXiv:1411.5654, 2014

  28. [28]

    Phrase-based image captioning,

    R. Lebret, P. O. Pinheiro, and R. Collobert, “Phrase-based image captioning,” arXiv preprint arXiv:1502.03671, 2015

  29. [29]

    Simple image description generator via a linear phrase-based approach,

    ——, “Simple image description generator via a linear phrase-based approach,” arXiv preprint arXiv:1412.8419, 2014

  30. [30]

    Combining language and vision with a multimodal skip-gram model,

    A. Lazaridou, N. T. Pham, and M. Baroni, “Combining language and vision with a multimodal skip-gram model,” arXiv preprint arXiv:1501.02598, 2015

  31. [31]

    ImageNet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012

  32. [32]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  33. [33]

    ImageNet: A Large-Scale Hierarchical Image Database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR, 2009

  34. [34]

    The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,

    M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,” in LREC Workshop on Language Resources for Content-based Image Retrieval, 2006

  35. [35]

    Im2text: Describing images using 1 million captioned photographs

    V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” in NIPS, 2011

  36. [36]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,

    P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” TACL, vol. 2, pp. 67–78, 2014

  37. [37]

    Déjà image-captions: A corpus of expressive image descriptions in repetition,

    J. Chen, P. Kuznetsova, D. Warren, and Y. Choi, “Déjà image-captions: A corpus of expressive image descriptions in repetition,” in NAACL, 2015

  38. [38]

    Microsoft COCO: Common objects in context,

    T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014

  39. [39]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002

  40. [40]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in ACL Workshop, 2004

  41. [41]

    Meteor universal: Language specific translation evaluation for any target language,

    M. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in EACL Workshop on Statistical Machine Translation, 2014

  42. [42]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” arXiv preprint arXiv:1411.5726, 2014

  43. [43]

    The Stanford CoreNLP natural language processing toolkit,

    C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010

  44. [44]

    Wordnet: a lexical database for english,

    G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995

  45. [45]

    Comparing automatic evaluation measures for image description,

    D. Elliott and F. Keller, “Comparing automatic evaluation measures for image description,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, 2014, pp. 452–457

  46. [46]

    Re-evaluating the role of BLEU in machine translation research

    C. Callison-Burch, M. Osborne, and P. Koehn, “Re-evaluating the role of BLEU in machine translation research,” in EACL, vol. 6, 2006, pp. 249–256