BERTScore: Evaluating Text Generation with BERT
Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3
The pith
BERTScore uses BERT embeddings to score text generation quality and correlates better with human judgments than existing metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence using contextual embeddings from BERT. The paper evaluates the metric on the outputs of 363 machine translation and image captioning systems, where it correlates better with human judgments and provides stronger model selection performance than existing metrics. It is also more robust than existing metrics to challenging examples in an adversarial paraphrase detection task.
What carries the argument
BERTScore, which uses cosine similarity of BERT contextual embeddings to measure token-level similarity between candidate and reference texts.
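For reference, the token-level matching can be written out explicitly. The block below is a sketch following the definitions in the BERTScore paper, where $x$ is the reference, $\hat{x}$ is the candidate, and $\mathbf{x}_i$, $\hat{\mathbf{x}}_j$ are their contextual token embeddings. The embeddings are assumed to be pre-normalized to unit length, so the inner product equals cosine similarity. The optional IDF-weighted variant is omitted here.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j,
\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

Recall averages over reference tokens and precision over candidate tokens, so a candidate that omits content loses recall, while one that adds unrelated content loses precision.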
If this is right
- Evaluators can rely on automated scores that better reflect human preferences for text quality.
- Model developers gain a more reliable signal for comparing different generation systems.
- The metric applies across machine translation and image captioning without task-specific changes.
- It provides a defense against certain adversarial inputs that exploit surface form differences.
Where Pith is reading between the lines
- Embedding similarity metrics could be applied to evaluate other text generation tasks such as summarization.
- The success indicates that pre-trained models like BERT capture semantic information relevant to quality assessment.
- Researchers might explore combining BERTScore with other metrics for even better performance.
Load-bearing premise
Cosine similarity on BERT embeddings captures the kind of semantic equivalence that humans rely on when rating text generation quality.
What would settle it
A set of generated texts on which humans consistently prefer the outputs that BERTScore ranks lower, while a competing metric such as BLEU ranks those outputs in line with the human ratings.
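One way to look for such a set is to compare how each metric orders pairs of outputs against the human ordering. The sketch below is illustrative only: the function names `pairwise_agreement` and `counterexample_pairs` are hypothetical, the inputs are assumed to be parallel lists of per-output scores, and nothing here comes from the paper's own evaluation code.

```python
from itertools import combinations


def pairwise_agreement(human, metric):
    """Fraction of output pairs that a metric orders the same way as humans.

    Pairs that are tied under either score are skipped.
    """
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        h = human[i] - human[j]
        m = metric[i] - metric[j]
        if h == 0 or m == 0:
            continue
        total += 1
        agree += (h > 0) == (m > 0)
    return agree / total if total else float("nan")


def counterexample_pairs(human, bertscore, other):
    """Index pairs where humans side with the other metric against BERTScore."""
    found = []
    for i, j in combinations(range(len(human)), 2):
        h = human[i] - human[j]
        b = bertscore[i] - bertscore[j]
        o = other[i] - other[j]
        # Same sign as humans for the other metric, opposite sign for BERTScore.
        if h * o > 0 and h * b < 0:
            found.append((i, j))
    return found
```

A large and consistent set of pairs returned by `counterexample_pairs`, together with a lower `pairwise_agreement` for BERTScore than for the competing metric, would be the kind of evidence described above.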
Original abstract
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes BERTScore, an automatic evaluation metric for text generation that computes token-level similarities between candidate and reference sentences using contextual embeddings from a pre-trained BERT model and cosine similarity, rather than exact matches or n-gram overlaps. Evaluated on outputs from 363 machine translation and image captioning systems, it reports higher correlation with human judgments and stronger model selection performance than baselines such as BLEU, METEOR, and CIDEr. It additionally demonstrates greater robustness on an adversarial paraphrase detection task.
Significance. If the empirical results hold, the work is significant for NLP because reliable automatic metrics are essential for model development and comparison in text generation. BERTScore's use of off-the-shelf contextual embeddings without task-specific training or free parameters provides a practical advance over prior metrics, and the scale of the evaluation (363 systems) lends weight to the correlation and model-selection claims. The approach is falsifiable via human correlation tests and supports reproducible implementations via public BERT checkpoints.
major comments (2)
- [§4.1, Tables 1–2] The central claim of improved correlation with human judgments is supported by reported Pearson and Spearman coefficients, but the manuscript provides no statistical significance tests (e.g., Steiger’s test for dependent correlations or bootstrap confidence intervals) on the differences versus baselines. Without these, it is difficult to assess whether the observed gains are reliable or could arise from sampling variability across the 363 systems.
- [§3.2] The greedy matching procedure used to compute precision, recall, and F1 from the token similarity matrix is described at a high level, but lacks explicit details on tie-breaking, handling of duplicate tokens, and the precise implementation of the similarity matrix (including any normalization or IDF weighting). These choices are load-bearing for exact reproduction of the reported scores.
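The significance check requested in the first major comment could be sketched as a paired percentile bootstrap over the difference between two correlations with human scores. This is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are parallel lists of per-item scores, and the paper's actual data and resampling unit are not known from this page.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_corr_diff(human, metric_a, metric_b,
                        n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r(human, metric_a) - r(human, metric_b).

    Items are resampled jointly, so the dependence between the two
    correlations (they share the same human scores) is preserved.
    """
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [metric_a[i] for i in idx]
        b = [metric_b[i] for i in idx]
        try:
            diffs.append(pearson(h, a) - pearson(h, b))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero variance
    diffs.sort()
    lo = diffs[int((alpha / 2) * len(diffs))]
    hi = diffs[int((1 - alpha / 2) * len(diffs)) - 1]
    return lo, hi
```

If the resulting interval excludes zero, the gain of `metric_a` over `metric_b` is unlikely to be explained by sampling variability alone. The same loop works for Spearman correlation by ranking the scores before calling `pearson`.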
minor comments (3)
- [§4.3, Figure 3] The adversarial paraphrase results would be clearer if the figure or accompanying table listed the raw BERTScore, BLEU, and METEOR values for each example rather than only qualitative statements.
- [§2] The related-work discussion of prior embedding-based metrics (e.g., those using static Word2Vec or ELMo) could be expanded with a direct comparison table to highlight how BERTScore’s contextual and matching approach differs.
- [§4 or Appendix] Provide the exact BERT model variant (e.g., bert-base-uncased), layer choice, and any preprocessing steps used for all experiments to ensure full reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and reproducibility.
- Referee: [§4.1, Tables 1–2] The central claim of improved correlation with human judgments is supported by reported Pearson and Spearman coefficients, but the manuscript provides no statistical significance tests (e.g., Steiger’s test for dependent correlations or bootstrap confidence intervals) on the differences versus baselines. Without these, it is difficult to assess whether the observed gains are reliable or could arise from sampling variability across the 363 systems.
  Authors: We agree that statistical significance tests on the correlation differences would strengthen the central claims. In the revised manuscript, we will add bootstrap confidence intervals (with 1000 resamples) around the reported Pearson and Spearman coefficients and apply Steiger’s test for dependent correlations to evaluate whether BERTScore’s improvements over baselines are statistically significant. These additions will directly address concerns about sampling variability across the 363 systems.
  Revision: yes
- Referee: [§3.2] The greedy matching procedure used to compute precision, recall, and F1 from the token similarity matrix is described at a high level, but lacks explicit details on tie-breaking, handling of duplicate tokens, and the precise implementation of the similarity matrix (including any normalization or IDF weighting). These choices are load-bearing for exact reproduction of the reported scores.
  Authors: We thank the referee for highlighting the need for greater implementation precision. We will expand §3.2 with explicit pseudocode and text describing: (i) the greedy matching algorithm (iteratively selecting the highest similarity pair and removing matched tokens), (ii) tie-breaking (by token index order), (iii) duplicate token handling (each occurrence is treated as distinct and matched independently), and (iv) similarity matrix construction (L2-normalized BERT embeddings with cosine similarity; optional IDF weighting as an ablation). The public code release will be updated to match these details exactly.
  Revision: yes
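The implementation choices discussed in the second response can be made concrete with a small sketch. It assumes the max-based reading of greedy matching used in the BERTScore paper's formulas, where each token takes its most similar counterpart and no tokens are removed. The one-to-one variant with removal that the response describes would need a different loop. The function names are hypothetical, and the embeddings are plain lists of floats standing in for BERT outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(ref_emb, cand_emb):
    """sim[i][j] = cosine similarity of reference token i and candidate token j."""
    ref = [l2_normalize(v) for v in ref_emb]
    cand = [l2_normalize(v) for v in cand_emb]
    return [[sum(a * b for a, b in zip(r, c)) for c in cand] for r in ref]


def greedy_prf(sim, ref_weights=None, cand_weights=None):
    """Precision, recall and F1 from a (reference x candidate) similarity matrix.

    Each token is matched to its most similar token on the other side.
    Only the maximum value is used, so ties do not affect the score and
    duplicate tokens are simply scored once per occurrence.
    Weights (for example IDF values) default to uniform.
    """
    n_ref, n_cand = len(sim), len(sim[0])
    ref_weights = ref_weights or [1.0] * n_ref
    cand_weights = cand_weights or [1.0] * n_cand
    best_for_ref = [max(row) for row in sim]
    best_for_cand = [max(sim[i][j] for i in range(n_ref)) for j in range(n_cand)]
    recall = sum(w * s for w, s in zip(ref_weights, best_for_ref)) / sum(ref_weights)
    precision = sum(w * s for w, s in zip(cand_weights, best_for_cand)) / sum(cand_weights)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this reading, tie-breaking matters only if the matched indices themselves are reported, since the score depends on the maximum similarity and not on which token attains it.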
Circularity Check
No significant circularity identified
full rationale
BERTScore is defined directly from pre-trained BERT embeddings and cosine similarity (external to the paper's evaluation data). Performance claims rest on empirical correlations and model-selection accuracy measured against independent human judgments across 363 MT and captioning systems. No equations reduce the metric to a fit on the test data, no self-citation chain carries the central result, and no uniqueness or ansatz is smuggled in. The derivation is self-contained with external empirical support.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Cosine similarity between contextual embeddings reflects semantic relatedness for evaluation purposes.
Forward citations
Cited by 60 Pith papers
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
-
WearBCI Dataset: Understanding and Benchmarking Real-World Wearable Brain-Computer Interfaces Signals
WearBCI provides the first multimodal dataset of wearable EEG signals under varied motion conditions with benchmarks for artifact removal and behavior analysis.
-
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...
-
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
-
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
-
Dataset Watermarking for Closed LLMs with Provable Detection
A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...
-
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...
-
Identifying and Characterizing Semantic Clones of Solidity Functions
A code-and-comment analysis method detects semantic clones in Solidity functions with 59% overall precision (84% for same-name functions) and 97% recall on 300k contracts, plus LLM summaries for uncommented code.
-
Analysis and Explainability of LLMs Via Evolutionary Methods
Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.
-
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
-
ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation
ArgRE embeds abstract argumentation into multi-agent requirements negotiation to deliver argument-level traceability, higher evaluator-rated justifications, and improved compliance coverage over heuristic baselines.
-
EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...
-
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
-
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
-
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
-
Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification
Re-RIGHT trains a 4B policy model with vocabulary coverage, semantic preservation, and coherence rewards to perform proficiency-aware lexical simplification in four languages without parallel corpora.
-
Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
-
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only b...
-
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
Sanity Checks for Long-Form Hallucination Detection
Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.
-
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
-
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
-
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
-
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.
-
HotComment: A Benchmark for Evaluating Popularity of Online Comments
HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...
-
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
-
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
-
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
-
Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
-
MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model
MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
-
Learning to Control Summaries with Score Ranking
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
-
Long-Term Memory for VLA-based Agents in Open-World Task Execution
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
-
LLMs Corrupt Your Documents When You Delegate
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
-
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.
-
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
-
AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocati...
-
MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator
MuTSE provides an interactive platform for parallel evaluation of LLM text simplifications using a tiered semantic alignment engine with a linearity bias heuristic.
-
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
Presents TR-EduVSum dataset and AutoMUP consensus framework for generating gold-standard summaries from multiple human annotations of Turkish educational videos.
-
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.
-
Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs
A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.
-
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
-
MMP-Refer: Multimodal Path Retrieval-augmented LLMs For Explainable Recommendation
MMP-Refer augments LLMs with multimodal retrieval paths and a trainable collaborative adapter to produce more accurate and explainable recommendations.
-
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
-
Fine-Tuning Models for Automated Code Review Feedback
PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
Rigorous Interpretation Is a Form of Evaluation
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
-
Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition
Introduces POSER and EmbER metrics to assess grammatical and semantic contributions of language model rescoring in ASR systems.
-
Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models
Selective pruning of low-activation neurons in task-specific LLMs preserves accuracy better than random pruning, but removing roughly 10% of highly selective neurons triggers total collapse, with fine-tuning recoverin...
-
SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling
SAGE uses a Next Strategy Classifier and Graph-Aware Attention on a psychologically grounded graph to improve LLM strategy prediction and response quality in online counseling.
-
A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents
MODEE is a multimodal system that integrates graphs with LLM embeddings to outperform prior open-domain event extraction methods on large datasets.
-
CARE: Counselor-Aligned Response Engine for Online Mental-Health Support
CARE fine-tunes LLMs on counselor-validated crisis dialogues to produce responses with stronger semantic and strategic alignment to expert standards than general-purpose models in Hebrew and Arabic.
-
An Explainable Approach to Document-level Translation Evaluation with Topic Modeling
A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insi...
-
Multilingual Training and Evaluation Resources for Vision-Language Models
Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.
-
Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning
A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to cli...
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.