PathVQA: 30000+ Questions for Medical Visual Question Answering
Pith reviewed 2026-05-15 05:09 UTC · model grok-4.3
The pith
The first pathology visual question answering dataset: 32,799 manually verified questions drawn from 4,998 images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness. To our best knowledge, this is the first dataset for pathology VQA.
What carries the argument
The semi-automated pipeline that pulls images and captions from textbooks, applies NLP to generate open-ended QA pairs, and relies on manual checks for medical correctness.
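The abstract names the steps but not the transformation rules. Given the parsing and question-generation work the paper cites, a template-driven rewrite of captions is a plausible shape for the QA-generation step. A minimal sketch under that assumption (the two templates and the sample caption are invented for illustration):

```python
import re

# Toy stand-in for the paper's NLP question-generation step; the real pipeline
# uses parser-based transformations, and these regex templates are invented.
TEMPLATES = [
    # "X shows Y"  ->  Q: "What does X show?"  A: Y
    (re.compile(r"^(?P<subj>.+?) shows (?P<obj>.+?)\.?$", re.IGNORECASE),
     lambda m: (f"What does {m.group('subj').lower()} show?", m.group("obj"))),
    # "X is present in Y"  ->  Q: "Where is X present?"  A: Y
    (re.compile(r"^(?P<subj>.+?) is present in (?P<loc>.+?)\.?$", re.IGNORECASE),
     lambda m: (f"Where is {m.group('subj').lower()} present?", m.group("loc"))),
]

def caption_to_qa(caption: str):
    """Return (question, answer) pairs generated from one figure caption."""
    pairs = []
    for pattern, build in TEMPLATES:
        match = pattern.match(caption.strip())
        if match:
            pairs.append(build(match))
    return pairs

print(caption_to_qa("Photomicrograph shows diffuse infiltration of plasma cells."))
# -> [('What does photomicrograph show?', 'diffuse infiltration of plasma cells')]
```

In the paper's pipeline, every pair produced this way was then manually checked; the generation step only proposes candidates.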
If this is right
- AI models can be trained and benchmarked on pathology-specific visual question answering tasks using the released dataset.
- The dataset supports progress toward AI systems capable of interpreting pathology images in response to natural-language queries.
- Public availability invites researchers to develop and compare methods for medical VQA without needing to assemble data from scratch.
- The same textbook-based extraction approach could be adapted to create similar datasets in other medical imaging specialties.
Where Pith is reading between the lines
- General-domain VQA models could be evaluated on this dataset to quantify the performance drop when moving from everyday images to medical ones; a toy scoring sketch follows this list.
- The resource might enable hybrid systems that combine image analysis with language models to provide explanations or second opinions in clinical settings.
- Future expansions could link the questions to full case reports or patient outcomes to test deeper diagnostic reasoning.
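On that first point: the paper's reference list includes F-score and BLEU evaluation papers, which suggests open-ended answers are scored by string overlap. A self-contained sketch of exact match and token-level F1 (the prediction/reference pair is invented):

```python
import re
from collections import Counter

def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common score for open-ended VQA answers."""
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    return tokenize(prediction) == tokenize(reference)

# Invented example answers, not taken from the dataset.
pred = "chronic inflammation of the gallbladder"
ref = "chronic cholecystitis with inflammation"
print(exact_match(pred, ref), round(token_f1(pred, ref), 3))  # False 0.444
```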
Load-bearing premise
Questions produced from textbook captions by NLP and manual review match the medically accurate and representative questions that practicing pathologists would ask when examining the same images.
What would settle it
A group of board-certified pathologists reviews a random sample of 500 questions and finds more than 10 percent to be medically inaccurate or unrelated to standard diagnostic practice.
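If that audit were run, the 10 percent threshold would carry sampling uncertainty, so the verdict should come with an interval. A quick sketch of the point estimate and a normal-approximation 95% confidence interval (the flagged count of 62 is invented):

```python
import math

def error_rate_with_ci(n_flagged: int, n_sampled: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% CI for the error rate."""
    p = n_flagged / n_sampled
    half_width = z * math.sqrt(p * (1 - p) / n_sampled)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# Hypothetical audit: 62 of 500 sampled questions judged medically inaccurate.
p, (lo, hi) = error_rate_with_ci(62, 500)
print(f"error rate {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
# error rate 12.4%, 95% CI [9.5%, 15.3%]
```

In this invented case the interval straddles the 10 percent threshold, so even a failing point estimate would not settle the question on its own.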
Original abstract
Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology? To achieve this goal, the first step is to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Our work makes the first attempt to build such a dataset. Different from creating general-domain VQA datasets where the images are widely accessible and there are many crowdsourcing workers available and capable of generating question-answer pairs, developing a medical VQA dataset is much more challenging. First, due to privacy concerns, pathology images are usually not publicly available. Second, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. To address these challenges, we resort to pathology textbooks and online digital libraries. We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images where each question is manually checked to ensure correctness. To our best knowledge, this is the first dataset for pathology VQA. Our dataset will be released publicly to promote research in medical VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PathVQA, the first dataset for pathology visual question answering, consisting of 32,799 open-ended questions paired with 4,998 pathology images extracted from textbooks. The authors describe a semi-automated pipeline that extracts images and captions, generates QA pairs using NLP techniques, and performs manual checks to ensure correctness. The dataset is intended to support the development of AI systems capable of answering pathology-related questions from images.
Significance. This work addresses a significant gap in medical AI by providing the first publicly available VQA dataset focused on pathology, a domain where data access is limited by privacy concerns and the need for expert knowledge. If the manual verification process produces high-quality, accurate questions, the dataset could enable substantial progress in training and evaluating VQA models for pathology, potentially contributing to the long-term goal of AI-assisted pathology diagnosis. The release of the dataset is a positive step for the community.
major comments (1)
- [Dataset Construction] The manual verification step (described in the dataset construction pipeline) states only that 'each question is manually checked to ensure correctness' without reporting checker qualifications (pathologists vs. non-experts), rejection or revision rates, or inter-annotator agreement (a toy agreement computation is sketched below). This is load-bearing for the central claim that the resulting 32,799 questions are medically accurate and representative, especially given the paper's own emphasis that only well-trained pathologists can interpret the images.
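For concreteness, Cohen's kappa over two independent correct/incorrect review passes is one standard way to report the missing agreement number. A sketch on invented labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels: two reviewers judging 100 generated questions.
a = ["correct"] * 90 + ["incorrect"] * 10
b = ["correct"] * 85 + ["incorrect"] * 5 + ["correct"] * 2 + ["incorrect"] * 8
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.66
```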
minor comments (1)
- [Abstract] The abstract and introduction could more explicitly state the split between training/validation/test sets and any baseline model results to help readers assess immediate usability (an image-level split sketch follows).
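On the split question: because many questions derive from a single image's caption, a leakage-safe protocol should partition at the image level, so no image contributes questions to more than one split. A hedged sketch (the 'image_id' field name is assumed, and this is not the authors' released split):

```python
import random

def split_by_image(qa_pairs, seed=0, frac=(0.8, 0.1, 0.1)):
    """Partition QA pairs so all questions about one image share a split.

    qa_pairs: list of dicts assumed to carry an 'image_id' key.
    """
    image_ids = sorted({qa["image_id"] for qa in qa_pairs})
    random.Random(seed).shuffle(image_ids)
    n = len(image_ids)
    cut1, cut2 = int(frac[0] * n), int((frac[0] + frac[1]) * n)
    bucket = {img: "train" for img in image_ids[:cut1]}
    bucket.update({img: "val" for img in image_ids[cut1:cut2]})
    bucket.update({img: "test" for img in image_ids[cut2:]})
    splits = {"train": [], "val": [], "test": []}
    for qa in qa_pairs:
        splits[bucket[qa["image_id"]]].append(qa)
    return splits

# Tiny invented example: two images, three questions.
demo = [{"image_id": 1, "q": "?"}, {"image_id": 1, "q": "??"}, {"image_id": 2, "q": "?"}]
print({k: len(v) for k, v in split_by_image(demo).items()})
```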
Simulated Author's Rebuttal
We thank the referee for the positive assessment of PathVQA's significance and for the constructive feedback on dataset quality verification. We address the single major comment below.
Point-by-point responses
Referee: The manual verification step (described in the dataset construction pipeline) states only that 'each question is manually checked to ensure correctness' without reporting checker qualifications (pathologists vs. non-experts), rejection or revision rates, or inter-annotator agreement. This is load-bearing for the central claim that the resulting 32,799 questions are medically accurate and representative, especially given the paper's own emphasis that only well-trained pathologists can interpret the images.
Authors: We agree that the manuscript would be strengthened by additional details on the manual verification process. The checks were performed internally by the authors, who have expertise in medical imaging and NLP but are not board-certified pathologists. Verification focused on removing obvious generation errors (e.g., grammatical issues or caption mismatches) rather than deep medical interpretation. We did not record rejection/revision rates or compute inter-annotator agreement, as the process was not structured as a multi-annotator annotation task. We will revise the dataset construction section to describe the verification procedure in greater detail and to explicitly note these limitations. However, we cannot add quantitative statistics because they were not collected.
Revision: partial.
Not provided: quantitative statistics on checker qualifications, rejection rates, revision rates, and inter-annotator agreement, as these were not recorded during dataset construction.
Circularity Check
No circularity: a dataset construction paper has no derivation chain
Full rationale
The paper describes a semi-automated pipeline to extract pathology images/captions from textbooks, generate QA pairs via NLP, and apply manual checks, resulting in 32,799 questions. No equations, predictions, fitted parameters, or first-principles derivations exist that could reduce to inputs by construction. The 'first dataset' claim is a factual assertion supported by the external textbook sources and described process, not by any self-citation load-bearing step or self-definitional reduction. This matches the expected honest non-finding for a pure dataset paper.
Axiom & Free-Parameter Ledger
axioms (1)
- [Domain assumption] NLP techniques applied to image captions can produce valid open-ended medical questions after human review
Forward citations
Cited by 22 Pith papers
- DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset
  DALPHIN benchmark finds the pathology-specific AI copilot PathChat+ shows no statistically significant difference from expert pathologists in 4 of 6 tasks, with general models matching in 1-2 tasks, on a diverse open ...
- MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
  MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
- MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
  MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
- MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
  MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate ...
- Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
  PlantInquiryVQA shows multimodal LLMs describe plant symptoms but struggle with clinical reasoning and diagnosis, with structured Chain of Inquiry improving correctness and reducing hallucinations.
- KIRA: Knowledge-Intensive Image Retrieval and Reasoning Architecture for Specialized Visual Domains
  KIRA is a unified architecture for visual RAG that reports 0.97 retrieval precision, 1.0 grounding, and 0.707 domain correctness across medical, circuit, satellite, and histopathology domains via hierarchical chunking...
- Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
  A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
- Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
  Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
  MEDSYN benchmark shows MLLMs match experts on differential diagnosis lists but have much larger gaps to final diagnosis selection than humans, due to text overreliance and cross-modal evidence gaps.
- Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training
  VISTA uses prefix resampling and a vision-aware attention score to address data imbalance and language prior bias in self-improvement training of MLLMs, yielding up to 13.66% gains on reasoning tasks.
- Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
  Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
  RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
- MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
  MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.
- Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA
  DCI unifies backdoor adjustment and instrumental variable learning in MedVQA to extract deconfounded representations, yielding better out-of-distribution performance on SLAKE, VQA-RAD and similar benchmarks.
- Dialectic-Med: Mitigating Diagnostic Hallucinations via Counterfactual Adversarial Multi-Agent Debate
  Dialectic-Med uses proponent-opponent-mediator agents with visual falsification to enforce grounded diagnostic reasoning in MLLMs, achieving SOTA accuracy and reduced hallucinations on MIMIC-CXR-VQA, VQA-RAD, and PathVQA.
- Improving Medical VQA through Trajectory-Aware Process Supervision
  A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
  Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
  LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
- Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
  A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
- Bias-constrained multimodal intelligence for equitable and reliable clinical AI
  BiasCareVL is a bias-aware vision-language framework trained on 3.44 million medical samples that outperforms prior methods on clinical tasks like diagnosis and segmentation while aiming for equitable performance unde...
- MAny: Merge Anything for Multimodal Continual Instruction Tuning
  MAny addresses dual-forgetting in multimodal continual instruction tuning via CPM and LPM merging strategies, delivering up to 8.57% accuracy gains on UCIT benchmarks without additional training.
Reference graph
Works this paper leans on
- [1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015.
- [2] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
- [3] Mengye Ren, Ryan Kiros, and Richard Zemel. Image question answering: A visual semantic embedding model and a new dataset. In NIPS, 2015.
- [4] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
- [5] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- [6] Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Joey Liu, Dina Demner-Fushman, and Henning Müller. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes, 2019.
- [7] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 2018.
- [8] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
- [9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [10] Stanislaw Antol, C Lawrence Zitnick, and Devi Parikh. Zero-shot learning via visual abstraction. In ECCV, 2014.
- [11] C Lawrence Zitnick and Devi Parikh. Bringing semantics into focus using visual abstraction. In CVPR, 2013.
- [12] Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, 2017.
- [13] Dan Klein and Christopher D Manning. Accurate unlexicalized parsing. In ACL, 2003.
- [14] Zhihao Fan, Zhongyu Wei, Piji Li, Yanyan Lan, and Xuanjing Huang. A question type driven framework to diversify visual question generation. 2018.
- [15] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [16] Kristina Toutanova, Chris Brockett, Michael Gamon, Jagadeesh Jagarlamudi, Hisami Suzuki, and Lucy Vanderwende. The PYTHY summarization system: Microsoft Research at DUC 2007. In Proc. of DUC, 2007.
- [17] Bonnie Dorr, David Zajic, and Richard Schwartz. Hedge Trimmer: A parse-and-trim approach to headline generation. In HLT-NAACL Workshop, 2003.
- [18] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL, 2014.
- [19] Michael Heilman and Noah A Smith. Question generation via overgenerating transformations and ranking. Technical report, 2009.
- [20] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In NIPS, 2018.
- [21] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- [22] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [23] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
- [24] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- [25] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.
- [26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2014.
- [28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
- [29] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [30] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
- [31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [32] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [33] Cyril Goutte and Eric Gaussier. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval, 2005.
- [34] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.