pith. machine review for the scientific record.

arxiv: 1504.00325 · v2 · submitted 2015-04-01 · 💻 cs.CV · cs.CL

Recognition: 3 theorem links


Microsoft COCO Captions: Data Collection and Evaluation Server

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 21:32 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: COCO · image captioning · dataset · evaluation server · BLEU · METEOR · ROUGE · CIDEr

The pith

The Microsoft COCO Captions dataset collects over 1.5 million human captions for more than 330,000 images and supplies an evaluation server that scores submissions with standard metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Microsoft COCO Caption dataset along with a supporting evaluation server. When finished the collection will hold over one and a half million captions across more than 330,000 images, with five independent human captions supplied for every training and validation image. The server accepts candidate captions from automatic systems and returns scores computed with BLEU, METEOR, ROUGE, and CIDEr so that different algorithms can be compared on identical data and identical scoring procedures. A reader would care because the combination of large scale, multiple annotations, and a public scoring service removes one major source of inconsistency that has slowed progress in automatic image description.
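The server's four metrics all compare a candidate caption against the reference set in different ways; BLEU's core idea of reference-clipped n-gram precision with a brevity penalty is the simplest to see. The sketch below is a toy BLEU-1 for illustration only, not the server's implementation, which follows the full BLEU-4 recipe.

```python
from collections import Counter
import math

def bleu1(candidate, references):
    """Toy BLEU-1: modified unigram precision against multiple
    reference captions, times a brevity penalty. Illustrative only;
    the COCO server uses the full multi-n-gram BLEU."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate word count by its max count in any reference,
    # so repeating a common word cannot inflate precision.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(cand).items())
    precision = clipped / max(len(cand), 1)
    # Brevity penalty against the closest reference length: short
    # candidates are penalized, longer ones are not rewarded.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision
```

A perfect match scores 1.0; a candidate shorter than every reference is discounted by the exponential brevity penalty even when every word it contains appears in a reference.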

Core claim

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr.

What carries the argument

The COCO Captions dataset with five independent human captions per training and validation image together with the public evaluation server that scores submitted captions using BLEU, METEOR, ROUGE, and CIDEr.
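The widely used submission format for the COCO caption server is a JSON array with one record per test image, pairing an `image_id` with the system's caption. The snippet below is a sketch under that assumption; the `image_id` values are placeholders, and the server's own instructions govern file naming and upload.

```python
import json

# Sketch of a COCO-style caption results payload: one
# {"image_id", "caption"} record per test image. The ids here are
# hypothetical placeholders, not real COCO image ids.
results = [
    {"image_id": 1, "caption": "a black and white photo of a train on the tracks"},
    {"image_id": 2, "caption": "a group of people standing around a kitchen"},
]

payload = json.dumps(results)
```

Because every system submits the same flat structure and is scored by the same server-side code, metric differences between algorithms cannot be attributed to differing local implementations of BLEU, METEOR, ROUGE, or CIDEr.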

Load-bearing premise

The collected human captions are sufficiently consistent, high-quality, and representative to serve as reliable ground truth for automatic systems.

What would settle it

An experiment in which human judges systematically prefer machine captions that receive low scores from the server on all four metrics would falsify the claim that the server provides a useful proxy for caption quality.
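One concrete way to run that test is pairwise: collect caption pairs, record which caption human judges prefer, and measure how often the metric ranks the preferred caption higher. The helper below is a toy sketch of that agreement check, not part of the paper or the server; `metric_scores` and `human_prefs` are hypothetical inputs.

```python
def pairwise_agreement(metric_scores, human_prefs):
    """Fraction of caption pairs on which a metric agrees with human
    preference. metric_scores is a list of (score_a, score_b) pairs;
    human_prefs marks the human-preferred caption as "a" or "b".
    Agreement near 0.5 means the metric is no better than chance as a
    proxy for human judgment; systematically below 0.5 would support
    the falsification scenario described above."""
    agree = 0
    for (score_a, score_b), preferred in zip(metric_scores, human_prefs):
        picked = "a" if score_a > score_b else "b"
        agree += picked == preferred
    return agree / len(human_prefs)
```

Running this per metric (BLEU, METEOR, ROUGE, CIDEr) on the same judged pairs would show whether any single metric, or only their combination, tracks human preference.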

read the original abstract

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper describes the Microsoft COCO Captions dataset and evaluation server. Upon completion, the dataset will contain over 1.5 million captions describing over 330,000 images, with five independent human-generated captions provided for each training and validation image. The evaluation server receives candidate captions and scores them using standard metrics including BLEU, METEOR, ROUGE, and CIDEr to promote consistent evaluation of automatic caption generation algorithms.

Significance. If the dataset is collected and the server implemented as described, this provides a valuable large-scale resource for image captioning research, enabling training on substantial data volumes and standardized benchmarking via established metrics on a public platform. The scale exceeds prior caption datasets and supports reproducible comparisons across methods.

minor comments (2)
  1. [Abstract] The abstract and description focus on planned scale and metrics but omit any details on the annotation protocol, quality control, or inter-annotator agreement measures, which would strengthen the presentation of the data collection process.
  2. No example captions, sample images, or illustrative server output are provided, which would improve clarity for readers unfamiliar with the dataset style or evaluation format.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of the manuscript and for recommending acceptance. The referee's summary correctly reflects the scope and purpose of the Microsoft COCO Captions dataset and evaluation server.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a purely descriptive release of the COCO Captions dataset and evaluation server. It states planned scale (over 1.5M captions for >330k images, five per training/validation image) and adoption of existing metrics (BLEU, METEOR, ROUGE, CIDEr) via a public server. No derivations, equations, predictions, fitted parameters, or optimality claims appear. Consequently no load-bearing steps exist that could reduce to self-definition, fitted inputs, or self-citation chains. The central content is factual description of data collection and tooling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no mathematical models, free parameters, axioms, or invented entities; it is purely a description of data collection and an evaluation tool.

pith-pipeline@v0.9.0 · 5399 in / 1019 out tokens · 68849 ms · 2026-05-12T21:32:41.651000+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

    cs.CV 2026-04 unverdicted novelty 8.0

    MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

  2. Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

    cs.DC 2026-04 unverdicted novelty 8.0

    Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

  3. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  4. OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems

    cs.DB 2026-05 conditional novelty 7.0

    OxyEcomBench is a unified multimodal benchmark covering 6 capability areas and 29 tasks with authentic e-commerce data to measure how well foundation models handle real platform, merchant, and customer challenges.

  5. Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.

  6. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  7. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  8. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  9. DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

    cs.CV 2026-04 unverdicted novelty 7.0

    DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

  10. Batch Loss Score for Dynamic Data Pruning

    cs.LG 2026-04 unverdicted novelty 7.0

    BLS approximates per-sample loss importance via EMA of batch losses, enabling simple and effective dynamic pruning of 20-50% samples losslessly across many datasets and models.

  11. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  12. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  13. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  14. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  15. MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

    cs.CV 2026-05 unverdicted novelty 6.0

    MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.

  16. Sentinel2Cap: A Human-Annotated Benchmark Dataset for Multimodal Remote Sensing Image Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sentinel2Cap provides human-annotated captions for multimodal Sentinel satellite images, with zero-shot tests showing RGB outperforming SAR and prompts helping performance.

  17. Statistical Consistency and Generalization of Contrastive Representation Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Contrastive representation learning is statistically consistent for optimal retrieval and admits generalization bounds of order O(1/m + 1/sqrt(n)) supervised and O(1/sqrt(m) + 1/sqrt(n)) self-supervised that benefit f...

  18. EASE: Federated Multimodal Unlearning via Entanglement-Aware Anchor Closure

    cs.NI 2026-05 unverdicted novelty 6.0

    EASE closes three residual anchors in federated multimodal unlearning using bilateral displacement, cosine-sine decomposition, and forget lock, achieving near-retrain performance on forget and retain data.

  19. TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.

  20. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  21. LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    LinguDistill recovers approximately 10% of lost performance on language benchmarks in VLMs by selectively distilling from a frozen LM teacher using KV-cache sharing, while preserving vision performance.

  22. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  23. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  24. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  25. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  26. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  27. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  28. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  29. From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

  30. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  31. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  32. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  33. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  34. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  35. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    cs.CV 2023-08 unverdicted novelty 4.0

    OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 35 Pith papers

  1. [1]

    Learning the semantics of words and pictures,

    K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in ICCV, vol. 2, 2001, pp. 408–415

  2. [2]

    Matching words and pictures,

    K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” JMLR, vol. 3, pp. 1107–1135, 2003

  3. [3]

    A model for learning the semantics of pictures,

    V. Lavrenko, R. Manmatha, and J. Jeon, “A model for learning the semantics of pictures,” in NIPS, 2003

  4. [4]

    Baby talk: Understanding and generating simple image descriptions,

    G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Baby talk: Understanding and generating simple image descriptions,” in CVPR, 2011

  5. [5]

    Midge: Generating image descriptions from computer vision detections,

    M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III, “Midge: Generating image descriptions from computer vision detections,” in EACL, 2012

  6. [6]

    Every picture tells a story: Generating sentences from images,

    A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in ECCV, 2010

  7. [7]

    Framing image description as a ranking task: Data, models and evaluation metrics,

    M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” JAIR, vol. 47, pp. 853–899, 2013

  8. [8]

    Collective generation of natural image descriptions,

    P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, “Collective generation of natural image descriptions,” in ACL, 2012

  9. [9]

    Corpus-guided sentence generation of natural images,

    Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, “Corpus-guided sentence generation of natural images,” in EMNLP, 2011

  10. [10]

    Choosing linguistics over vision to describe images

    A. Gupta, Y. Verma, and C. Jawahar, “Choosing linguistics over vision to describe images.” in AAAI, 2012

  11. [11]

    Distributional semantics in technicolor,

    E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran, “Distributional semantics in technicolor,” in ACL, 2012

  12. [12]

    Automatic caption generation for news images,

    Y. Feng and M. Lapata, “Automatic caption generation for news images,” TPAMI, vol. 35, no. 4, pp. 797–812, 2013

  13. [13]

    Image description using visual dependency representations,

    D. Elliott and F. Keller, “Image description using visual dependency representations,” in EMNLP, 2013, pp. 1292–1302

  14. [14]

    Deep fragment embeddings for bidirectional image sentence mapping,

    A. Karpathy, A. Joulin, and F.-F. Li, “Deep fragment embeddings for bidirectional image sentence mapping,” in NIPS, 2014

  15. [15]

    Improving image-sentence embeddings using large weakly annotated photo collections,

    Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, “Improving image-sentence embeddings using large weakly annotated photo collections,” in ECCV, 2014, pp. 529–545

  16. [16]

    Nonparametric method for data-driven image captioning,

    R. Mason and E. Charniak, “Nonparametric method for data-driven image captioning,” in ACL, 2014

  17. [17]

    Treetalk: Composition and compression of trees for image descriptions,

    P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, “Treetalk: Composition and compression of trees for image descriptions,” TACL, vol. 2, pp. 351–362, 2014

  18. [18]

    Autocaption: Automatic caption generation for personal photos,

    K. Ramnath, S. Baker, L. Vanderwende, M. El-Saban, S. N. Sinha, A. Kannan, N. Hassan, M. Galley, Y. Yang, D. Ramanan, A. Bergamo, and L. Torresani, “Autocaption: Automatic caption generation for personal photos,” in WACV, 2014

  19. [19]

    Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world,

    A. Lazaridou, E. Bruni, and M. Baroni, “Is this a wampimuk? cross-modal mapping between distributional semantics and the visual world,” in ACL, 2014

  20. [20]

    Multimodal neural language models,

    R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal neural language models,” in ICML, 2014

  21. [21]

    Explain images with multimodal recurrent neural networks,

    J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Explain images with multimodal recurrent neural networks,” arXiv preprint arXiv:1410.1090, 2014

  22. [22]

    Show and tell: A neural image caption generator,

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” arXiv preprint arXiv:1411.4555, 2014

  23. [23]

    Deep visual-semantic alignments for generating image descriptions,

    A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” arXiv preprint arXiv:1412.2306, 2014

  24. [24]

    Unifying visual-semantic embeddings with multimodal neural language models,

    R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint arXiv:1411.2539, 2014

  25. [25]

    Long-term recurrent convolutional networks for visual recognition and description,

    J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” arXiv preprint arXiv:1411.4389, 2014

  26. [26]

    From captions to visual concepts and back,

    H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., “From captions to visual concepts and back,” arXiv preprint arXiv:1411.4952, 2014

  27. [27]

    Learning a recurrent visual representation for image caption generation,

    X. Chen and C. L. Zitnick, “Learning a recurrent visual representation for image caption generation,” arXiv preprint arXiv:1411.5654, 2014

  28. [28]

    Phrase-based image captioning,

    R. Lebret, P. O. Pinheiro, and R. Collobert, “Phrase-based image captioning,” arXiv preprint arXiv:1502.03671, 2015

  29. [29]

    Simple image description generator via a linear phrase-based approach,

    ——, “Simple image description generator via a linear phrase-based approach,” arXiv preprint arXiv:1412.8419, 2014

  30. [30]

    Combining language and vision with a multimodal skip-gram model,

    A. Lazaridou, N. T. Pham, and M. Baroni, “Combining language and vision with a multimodal skip-gram model,” arXiv preprint arXiv:1501.02598, 2015

  31. [31]

    ImageNet classification with deep convolutional neural networks,

    A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012

  32. [32]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997

  33. [33]

    ImageNet: A Large-Scale Hierarchical Image Database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR, 2009

  34. [34]

    The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,

    M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The IAPR TC-12 benchmark: A new evaluation resource for visual information systems,” in LREC Workshop on Language Resources for Content-based Image Retrieval, 2006

  35. [35]

    Im2text: Describing images using 1 million captioned photographs

    V. Ordonez, G. Kulkarni, and T. Berg, “Im2text: Describing images using 1 million captioned photographs,” in NIPS, 2011

  36. [36]

    From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,

    P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” TACL, vol. 2, pp. 67–78, 2014

  37. [37]

    Déjà image-captions: A corpus of expressive image descriptions in repetition,

    J. Chen, P. Kuznetsova, D. Warren, and Y. Choi, “Déjà image-captions: A corpus of expressive image descriptions in repetition,” in NAACL, 2015

  38. [38]

    Microsoft COCO: Common objects in context,

    T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in ECCV, 2014

  39. [39]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002

  40. [40]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in ACL Workshop, 2004

  41. [41]

    Meteor universal: Language specific translation evaluation for any target language,

    M. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in EACL Workshop on Statistical Machine Translation, 2014

  42. [42]

    Cider: Consensus-based image description evaluation,

    R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” arXiv preprint arXiv:1411.5726, 2014

  43. [43]

    The Stanford CoreNLP natural language processing toolkit,

    C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60. [Online]. Available: http://www.aclweb.org/anthology/P/P14/P14-5010

  44. [44]

    Wordnet: a lexical database for english,

    G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995

  45. [45]

    Comparing automatic evaluation measures for image description,

    D. Elliott and F. Keller, “Comparing automatic evaluation measures for image description,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, 2014, pp. 452–457

  46. [46]

    Re-evaluating the role of BLEU in machine translation research

    C. Callison-Burch, M. Osborne, and P. Koehn, “Re-evaluating the role of BLEU in machine translation research,” in EACL, vol. 6, 2006, pp. 249–256