pith. machine review for the scientific record.

arxiv: 1904.09675 · v3 · submitted 2019-04-21 · 💻 cs.CL

Recognition: no theorem link

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang , Varsha Kishore , Felix Wu , Kilian Q. Weinberger , Yoav Artzi


Pith reviewed 2026-05-10 16:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords BERTScore · text generation evaluation · BERT embeddings · automatic metric · machine translation · image captioning · human judgment correlation · model selection

The pith

BERTScore uses BERT embeddings to score text generation quality and correlates better with human judgments than existing metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BERTScore, an evaluation metric for text generation that computes token similarities using contextual embeddings from BERT instead of exact matches. Tested on outputs from 363 machine translation and image captioning systems, it shows higher correlation with human judgments and better model selection than existing metrics. It is also more robust to challenging examples in an adversarial paraphrase detection task. This suggests that embedding-based similarity can provide a more reliable automatic assessment of generated text.

Core claim

BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence using contextual embeddings from BERT. Evaluated on the outputs of 363 machine translation and image captioning systems, it correlates better with human judgments and provides stronger model selection performance than existing metrics. On an adversarial paraphrase detection task, it is also more robust to challenging examples than existing metrics.

What carries the argument

BERTScore, which uses cosine similarity of BERT contextual embeddings to measure token-level similarity between candidate and reference texts.
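
As a minimal sketch of that mechanism, the snippet below computes precision, recall, and F1 from two sets of token embeddings, assuming each token is matched to its most similar token in the other sentence. The function name `bertscore_from_embeddings` and the toy 2-dimensional vectors are illustrative assumptions standing in for real BERT outputs; IDF weighting and any score rescaling are left out.

```python
import numpy as np

def bertscore_from_embeddings(ref_emb, cand_emb):
    """Greedy-matching precision, recall and F1 from token embeddings.

    ref_emb:  (n_ref, d) array, one row per reference token.
    cand_emb: (n_cand, d) array, one row per candidate token.
    Assumes no all-zero rows and a positive precision + recall.
    """
    # L2-normalise rows so that dot products are cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T  # shape (n_ref, n_cand)

    # Recall: each reference token takes its best match in the candidate.
    recall = float(sim.max(axis=1).mean())
    # Precision: each candidate token takes its best match in the reference.
    precision = float(sim.max(axis=0).mean())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy vectors standing in for contextual embeddings (not real BERT output).
ref_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cand_emb = np.array([[2.0, 0.0], [0.0, 3.0]])
precision, recall, f1 = bertscore_from_embeddings(ref_emb, cand_emb)
print(precision, recall, f1)
```

In this toy case every candidate token has an exact directional match in the reference, so precision is 1, while the third reference token is only partially covered, which pulls recall below 1.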

If this is right

  • Evaluators can rely on automated scores that better reflect human preferences for text quality.
  • Model developers gain a more reliable signal for comparing different generation systems.
  • The metric applies across machine translation and image captioning without task-specific changes.
  • It provides a defense against certain adversarial inputs that exploit surface form differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding similarity metrics could be applied to evaluate other text generation tasks such as summarization.
  • The success indicates that pre-trained models like BERT capture semantic information relevant to quality assessment.
  • Researchers might explore combining BERTScore with other metrics for even better performance.

Load-bearing premise

Cosine similarity on BERT embeddings captures the kind of semantic equivalence that humans rely on when rating text generation quality.

What would settle it

A set of generated texts on which human ratings consistently contradict BERTScore's ranking while agreeing with the ranking of a competing metric such as BLEU.

Original abstract

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes BERTScore, an automatic evaluation metric for text generation that computes token-level similarities between candidate and reference sentences using contextual embeddings from a pre-trained BERT model and cosine similarity, rather than exact matches or n-gram overlaps. Evaluated on outputs from 363 machine translation and image captioning systems, it reports higher correlation with human judgments and stronger model selection performance than baselines such as BLEU, METEOR, and CIDEr. It additionally demonstrates greater robustness on an adversarial paraphrase detection task.

Significance. If the empirical results hold, the work is significant for NLP because reliable automatic metrics are essential for model development and comparison in text generation. BERTScore's use of off-the-shelf contextual embeddings without task-specific training or free parameters provides a practical advance over prior metrics, and the scale of the evaluation (363 systems) lends weight to the correlation and model-selection claims. The approach is falsifiable via human correlation tests and supports reproducible implementations via public BERT checkpoints.

major comments (2)
  1. [§4.1, Tables 1–2] The central claim of improved correlation with human judgments is supported by reported Pearson and Spearman coefficients, but the manuscript provides no statistical significance tests (e.g., Steiger’s test for dependent correlations or bootstrap confidence intervals) on the differences versus baselines. Without these, it is difficult to assess whether the observed gains are reliable or could arise from sampling variability across the 363 systems.
  2. [§3.2] The greedy matching procedure used to compute precision, recall, and F1 from the token similarity matrix is described at a high level, but lacks explicit details on tie-breaking, handling of duplicate tokens, and the precise implementation of the similarity matrix (including any normalization or IDF weighting). These choices are load-bearing for exact reproduction of the reported scores.
minor comments (3)
  1. [§4.3, Figure 3] The adversarial paraphrase results would be clearer if the figure or accompanying table listed the raw BERTScore, BLEU, and METEOR values for each example rather than only qualitative statements.
  2. [§2] The related-work discussion of prior embedding-based metrics (e.g., those using static Word2Vec or ELMo) could be expanded with a direct comparison table to highlight how BERTScore’s contextual and matching approach differs.
  3. [§4 or Appendix] Provide the exact BERT model variant (e.g., bert-base-uncased), layer choice, and any preprocessing steps used for all experiments to ensure full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and reproducibility.

  1. Referee: [§4.1, Tables 1–2] The central claim of improved correlation with human judgments is supported by reported Pearson and Spearman coefficients, but the manuscript provides no statistical significance tests (e.g., Steiger’s test for dependent correlations or bootstrap confidence intervals) on the differences versus baselines. Without these, it is difficult to assess whether the observed gains are reliable or could arise from sampling variability across the 363 systems.

    Authors: We agree that statistical significance tests on the correlation differences would strengthen the central claims. In the revised manuscript, we will add bootstrap confidence intervals (with 1000 resamples) around the reported Pearson and Spearman coefficients and apply Steiger’s test for dependent correlations to evaluate whether BERTScore’s improvements over baselines are statistically significant. These additions will directly address concerns about sampling variability across the 363 systems.

    Revision: yes
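
One way such a bootstrap could look is sketched below: a percentile confidence interval for the difference between two metrics' Pearson correlations with human scores, resampling items with replacement. The function name `bootstrap_corr_diff`, the 95% level, and the synthetic scores are assumptions made for illustration; none of the numbers come from the paper.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def bootstrap_corr_diff(human, metric_a, metric_b, n_resamples=1000, seed=0):
    """Percentile bootstrap CI for r(human, metric_a) - r(human, metric_b)."""
    rng = np.random.default_rng(seed)
    n = len(human)
    diffs = np.empty(n_resamples)
    for k in range(n_resamples):
        # Resample the same items for both metrics to keep them paired.
        idx = rng.integers(0, n, size=n)
        diffs[k] = (pearson(human[idx], metric_a[idx])
                    - pearson(human[idx], metric_b[idx]))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)

# Synthetic scores (hypothetical): metric_a tracks the human scores
# more closely than metric_b by construction.
rng = np.random.default_rng(42)
human = rng.normal(size=200)
metric_a = human + rng.normal(scale=0.5, size=200)
metric_b = human + rng.normal(scale=1.5, size=200)

low, high = bootstrap_corr_diff(human, metric_a, metric_b)
print(f"95% CI for the correlation difference: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would indicate that the gap between the two metrics is unlikely to be explained by sampling variability alone. Resampling paired indices matters because both correlations are computed against the same human scores.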

  2. Referee: [§3.2] The greedy matching procedure used to compute precision, recall, and F1 from the token similarity matrix is described at a high level, but lacks explicit details on tie-breaking, handling of duplicate tokens, and the precise implementation of the similarity matrix (including any normalization or IDF weighting). These choices are load-bearing for exact reproduction of the reported scores.

    Authors: We thank the referee for highlighting the need for greater implementation precision. We will expand §3.2 with explicit pseudocode and text describing: (i) the greedy matching algorithm (iteratively selecting the highest similarity pair and removing matched tokens), (ii) tie-breaking (by token index order), (iii) duplicate token handling (each occurrence is treated as distinct and matched independently), and (iv) similarity matrix construction (L2-normalized BERT embeddings with cosine similarity; optional IDF weighting as an ablation). The public code release will be updated to match these details exactly.

    Revision: yes
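
The procedure as worded in this simulated response could be sketched as follows. This follows only the description above (highest-similarity pair first, matched tokens removed, ties broken by index order) and has not been checked against the released code; the function name `greedy_one_to_one` and the choice to prioritise the reference index over the candidate index in ties are assumptions.

```python
import numpy as np

def greedy_one_to_one(sim):
    """Repeatedly pick the highest-similarity unmatched (ref, cand) pair.

    sim: (n_ref, n_cand) matrix of finite similarities.
    Duplicate tokens are distinct rows/columns, so each occurrence is
    matched on its own. Ties go to the lowest reference index, then the
    lowest candidate index (an assumed reading of "token index order").
    Returns a list of (ref_index, cand_index, similarity) tuples.
    """
    sim = np.asarray(sim, dtype=float)
    work = sim.copy()
    pairs = []
    for _ in range(min(sim.shape)):
        # np.argmax returns the first maximum in row-major order,
        # which implements the tie-breaking rule stated above.
        flat = int(np.argmax(work))
        i, j = divmod(flat, work.shape[1])
        pairs.append((i, j, float(sim[i, j])))
        # Remove the matched tokens from further consideration.
        work[i, :] = -np.inf
        work[:, j] = -np.inf
    return pairs

# Hypothetical similarity matrix with a tie in the first row.
sim = np.array([
    [0.9, 0.9, 0.1],
    [0.8, 0.2, 0.3],
])
pairs = greedy_one_to_one(sim)
print(pairs)  # [(0, 0, 0.9), (1, 2, 0.3)]
```

The example shows why the tie-breaking rule affects the result: resolving the 0.9 tie in favour of candidate 0 leaves reference token 1 with only a 0.3 match, whereas the other choice would have left it a 0.8 match.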

Circularity Check

0 steps flagged

No significant circularity identified


BERTScore is defined directly from pre-trained BERT embeddings and cosine similarity (external to the paper's evaluation data). Performance claims rest on empirical correlations and model-selection accuracy measured against independent human judgments across 363 MT and captioning systems. No equations reduce the metric to a fit on the test data, no self-citation chain carries the central result, and no uniqueness or ansatz is smuggled in. The derivation is self-contained with external empirical support.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on the pre-trained BERT model and standard vector similarity; no new entities are postulated and no parameters are fitted to the target evaluation data.

axioms (1)
  • domain assumption: Cosine similarity between contextual embeddings reflects semantic relatedness for evaluation purposes.
    Invoked when replacing exact matches with embedding similarities.

pith-pipeline@v0.9.0 · 5393 in / 1098 out tokens · 45346 ms · 2026-05-10T16:24:31.715261+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  2. WearBCI Dataset: Understanding and Benchmarking Real-World Wearable Brain-Computer Interfaces Signals

    cs.HC 2026-03 accept novelty 8.0

    WearBCI provides the first multimodal dataset of wearable EEG signals under varied motion conditions with benchmarks for artifact removal and behavior analysis.

  3. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    cs.CL 2019-08 unverdicted novelty 8.0

    Sentence-BERT adapts BERT with siamese and triplet networks to produce sentence embeddings for efficient cosine-similarity comparisons, cutting computation time from hours to seconds on similarity search while matchin...

  4. WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language

    cs.NI 2026-05 unverdicted novelty 7.0

    WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.

  5. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 conditional novelty 7.0

    EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.

  6. EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

    cs.AI 2026-05 unverdicted novelty 7.0

    EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.

  7. Dataset Watermarking for Closed LLMs with Provable Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tunin...

  8. Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

    cs.IR 2026-05 unverdicted novelty 7.0

    CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...

  9. Identifying and Characterizing Semantic Clones of Solidity Functions

    cs.SE 2026-04 unverdicted novelty 7.0

    A code-and-comment analysis method detects semantic clones in Solidity functions with 59% overall precision (84% for same-name functions) and 97% recall on 300k contracts, plus LLM summaries for uncommented code.

  10. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  11. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.

  12. ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation

    cs.SE 2026-04 unverdicted novelty 7.0

    ArgRE embeds abstract argumentation into multi-agent requirements negotiation to deliver argument-level traceability, higher evaluator-rated justifications, and improved compliance coverage over heuristic baselines.

  13. EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

    cs.CL 2026-04 unverdicted novelty 7.0

    EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.

  14. Evaluating Remote Sensing Image Captions Beyond Metric Biases

    cs.CV 2026-04 unverdicted novelty 7.0

    Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA pe...

  15. Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

    cs.CV 2026-04 unverdicted novelty 7.0

    A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.

  16. GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  17. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  18. Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

    cs.CL 2026-04 unverdicted novelty 7.0

    Re-RIGHT trains a 4B policy model with vocabulary coverage, semantic preservation, and coherence rewards to perform proficiency-aware lexical simplification in four languages without parallel corpora.

  19. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

  20. PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabled planning.

  21. Adversarial SQL Injection Generation with LLM-Based Architectures

    cs.CR 2026-05 unverdicted novelty 6.0

    RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.

  22. TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

    cs.AI 2026-05 unverdicted novelty 6.0

    TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only b...

  23. ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

    cs.CL 2026-05 unverdicted novelty 6.0

    ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.

  24. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  25. Sanity Checks for Long-Form Hallucination Detection

    cs.CL 2026-05 unverdicted novelty 6.0

    Hallucination detectors on LLM reasoning traces often rely on final-answer artifacts rather than reasoning validity; once controlled, lightweight lexical trajectory features suffice for robust detection.

  26. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  27. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  28. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  29. Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking

    cs.CR 2026-05 unverdicted novelty 6.0

    BREW achieves TPR of 0.965 and FPR of 0.02 under 10% synonym substitution by shifting from ECC decoding to designated verification with block voting and local validation.

  30. HotComment: A Benchmark for Evaluating Popularity of Online Comments

    cs.AI 2026-04 unverdicted novelty 6.0

    HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

  31. CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

  32. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  33. Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

    cs.CY 2026-04 unverdicted novelty 6.0

    Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.

  34. Bangla Key2Text: Text Generation from Keywords for a Low Resource Language

    cs.CL 2026-04 conditional novelty 6.0

    Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.

  35. MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

    cs.CE 2026-04 unverdicted novelty 6.0

    MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.

  36. Learning to Control Summaries with Score Ranking

    cs.CL 2026-04 unverdicted novelty 6.0

    A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

  37. Long-Term Memory for VLA-based Agents in Open-World Task Execution

    cs.RO 2026-04 unverdicted novelty 6.0

    ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

  38. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  39. Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    ESC-RL improves RL for radiology reports via group-wise evidence-aware rewards (GEAR) and LLM-driven self-correcting preference learning (SPL), reaching state-of-the-art on two chest X-ray datasets.

  40. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

    cs.AI 2026-04 unverdicted novelty 6.0

    A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

  41. AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocati...

  42. MuTSE: A Human-in-the-Loop Multi-use Text Simplification Evaluator

    cs.CL 2026-04 unverdicted novelty 6.0

    MuTSE provides an interactive platform for parallel evaluation of LLM text simplifications using a tiered semantic alignment engine with a linearity bias heuristic.

  43. TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

    cs.CL 2026-04 unverdicted novelty 6.0

    Presents TR-EduVSum dataset and AutoMUP consensus framework for generating gold-standard summaries from multiple human annotations of Turkish educational videos.

  44. WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.

  45. Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A small recency window of 3-5 prior ADRs as context produces higher-fidelity LLM-generated Architecture Decision Records than no context, full history, or retrieval-augmented selection in typical sequential workflows.

  46. ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.

  47. MMP-Refer: Multimodal Path Retrieval-augmented LLMs For Explainable Recommendation

    cs.IR 2026-04 conditional novelty 6.0

    MMP-Refer augments LLMs with multimodal retrieval paths and a trainable collaborative adapter to produce more accurate and explainable recommendations.

  48. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    cs.CL 2023-08 conditional novelty 6.0

    Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

  49. Fine-Tuning Models for Automated Code Review Feedback

    cs.SE 2026-05 conditional novelty 5.0

    PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.

  50. ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

    cs.CV 2026-05 unverdicted novelty 5.0

    ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

  51. Rigorous Interpretation Is a Form of Evaluation

    cs.CY 2026-05 unverdicted novelty 5.0

    Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.

  52. Qualitative Evaluation of Language Model Rescoring in Automatic Speech Recognition

    cs.CL 2026-04 unverdicted novelty 5.0

    Introduces POSER and EmbER metrics to assess grammatical and semantic contributions of language model rescoring in ASR systems.

  53. Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Selective pruning of low-activation neurons in task-specific LLMs preserves accuracy better than random pruning, but removing roughly 10% of highly selective neurons triggers total collapse, with fine-tuning recoverin...

  54. SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling

    cs.CL 2026-04 unverdicted novelty 5.0

    SAGE uses a Next Strategy Classifier and Graph-Aware Attention on a psychologically grounded graph to improve LLM strategy prediction and response quality in online counseling.

  55. A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

    cs.CL 2026-04 unverdicted novelty 5.0

    MODEE is a multimodal system that integrates graphs with LLM embeddings to outperform prior open-domain event extraction methods on large datasets.

  56. CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

    cs.CL 2026-04 unverdicted novelty 5.0

    CARE fine-tunes LLMs on counselor-validated crisis dialogues to produce responses with stronger semantic and strategic alignment to expert standards than general-purpose models in Hebrew and Arabic.

  57. An Explainable Approach to Document-level Translation Evaluation with Topic Modeling

    cs.CE 2026-04 unverdicted novelty 5.0

    A topic-modeling framework measures document-level thematic consistency in translations by aligning key tokens across languages with a bilingual dictionary and scoring via cosine similarity, providing explainable insi...

  58. Multilingual Training and Evaluation Resources for Vision-Language Models

    cs.CL 2026-04 conditional novelty 5.0

    Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.

  59. Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning

    cs.CV 2026-04 unverdicted novelty 5.0

    A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to cli...

  60. Calibrating Model-Based Evaluation Metrics for Summarization

    cs.CL 2026-04 unverdicted novelty 5.0

    A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 74 Pith papers

  1. [1]

    x: At the same time Kingfisher is closing 60 B&Q outlets across the country 38 125 530ˆx: At the same time, Kingfisher will close 60 B & Q stores nationwide

  2. [2]

    x: Hewlett-Packard to cut up to 30,000 jobs 119 39 441ˆx: Hewlett-Packard will reduce jobs up to 30.000

  3. [3]

    a safe third country

    x: According to opinion in Hungary, Serbia is “a safe third country". 23 96 465ˆx: According to Hungarian view, Serbia is a “safe third country."

  4. [4]

    73 147 492ˆx: Experts believe that the Black Friday in November has put the brakes on spending

    x: Experts believe November’s Black Friday could be holding back spending. 73 147 492ˆx: Experts believe that the Black Friday in November has put the brakes on spending

  5. [5]

    37 111 414ˆx: And from this perspective, I will see him die

    x: And it’s from this perspective that I will watch him die. 37 111 414ˆx: And from this perspective, I will see him die. BLEU>FBERT

  6. [6]

    500 470 115ˆx: Look at the human dignity of the man injured

    x: In their view the human dignity of the man had been violated. 500 470 115ˆx: Look at the human dignity of the man injured

  7. [7]

    x: For example when he steered a shot from Ideye over the crossbar in the 56th minute.516 524 185ˆx: So, for example, when he steered a shot of Ideye over the latte (56th)

  8. [8]

    495 424 152ˆx: A good prank is funny, but it takes only moments before he becomes a boomerang

    x: A good prank is funny, but takes moments to reverse. 495 424 152ˆx: A good prank is funny, but it takes only moments before he becomes a boomerang

  9. [9]

    x: I will put the pressure on them and onus on them to make a decision.
    x̂: I will exert the pressure on it and her urge to make a decision
    Ranks (Human, F_BERT, BLEU): 507, 471, 220

  10. [10]

    x: Transport for London is not amused by this flyposting "vandalism."
    x̂: Transport for London is the Plaka animal "vandalism" is not funny.
    Ranks (Human, F_BERT, BLEU): 527, 527, 246
    Table 7 group label: F_BERT > Human

  11. [11]

    x: One big obstacle to access to the jobs market is the lack of knowledge of the German language.
    x̂: A major hurdle for access to the labour market are a lack of knowledge of English
    Ranks (Human, F_BERT, BLEU): 558, 131, 313

  12. [12]

    x: On Monday night Hungary closed its 175 km long border with Serbia.
    x̂: Hungary had in the night of Tuesday closed its 175 km long border with Serbia
    Ranks (Human, F_BERT, BLEU): 413, 135, 55

  13. [13]

    x: They got nothing, but they were allowed to keep the clothes.
    x̂: You got nothing, but could keep the clothes
    Ranks (Human, F_BERT, BLEU): 428, 174, 318

  14. [14]

    x: A majority of Republicans don’t see Trump’s temperament as a problem.
    x̂: A majority of Republicans see Trump’s temperament is not a problem
    Ranks (Human, F_BERT, BLEU): 290, 34, 134

  15. [15]

    x: His car was still running in the driveway.
    x̂: His car was still in the driveway.
    Ranks (Human, F_BERT, BLEU): 299, 49, 71
    Table 7 group label: Human > F_BERT

  16. [16]

    x: Currently the majority of staff are men.
    x̂: At the moment the men predominate among the staff
    Ranks (Human, F_BERT, BLEU): 77, 525, 553

  17. [17]

    x: There are, indeed, multiple variables at play.
    x̂: In fact, several variables play a role
    Ranks (Human, F_BERT, BLEU): 30, 446, 552

  18. [18]

    x: One was a man of about 5ft 11in tall.
    x̂: One of the men was about 1,80 metres in size
    Ranks (Human, F_BERT, BLEU): 124, 551, 528

  19. [19]

    x: All that stuff sure does take a toll.
    x̂: All of this certainly exacts its toll
    Ranks (Human, F_BERT, BLEU): 90, 454, 547

  20. [20]

    x: Wage gains have shown signs of picking up.
    x̂: Increases of wages showed signs of a recovery.
    Ranks (Human, F_BERT, BLEU): 140, 464, 514

    Table 7: Example sentences where the similarity ranks assigned by Human, F_BERT, and BLEU differ significantly on the WMT16 German-to-English evaluation task. x: gold reference; x̂: candidate outputs of MT systems. Rankings assigned by Human, F_BERT, and B ...

  21. [21]

    [MNLI] Use a BERT model fine-tuned on MNLI (Williams et al., 2018)

  22. [22]

    [PMEANS] Apply power means (Rücklé et al., 2018) to aggregate the information of different layers.

  23. [23]

    [IDF-L] For reference sentences, instead of computing the idf scores on the 560 sentences in the segment-level data ([IDF-S]), compute the idf scores on the 3,005 sentences in the system-level data

  24. [24]

    [SEP] For candidate sentences, recompute the idf scores on the candidate sentences. The weighting of reference tokens is kept the same as in [IDF-S]

  25. [25]

    [RM] Exclude punctuation marks and sub-word tokens except the first sub-word in each word from the matching. We follow the setup of Zhao et al. (2019) and use their released fine-tuned BERT model to conduct the experiments. Table 9 shows the results of our ablation study. We report correlations for the two variants of WMD Zhao et al. (2019) study: unigram...

  26. [26]

    Segment-level and system-level correlation studies on three years of WMT metric evaluation task (WMT16–18)

  27. [27]

    Model selection study on WMT18 10K hybrid systems

  28. [28]

    System-level correlation study on 2015 COCO captioning challenge

  29. [29]

    Robustness study on PAWS-QQP. Following BERT (Devlin et al., 2019), a variety of Transformer-based (Vaswani et al., 2017) pre-trained contextual embeddings have been proposed and released. We conduct additional experiments with four types of pre-trained embeddings: BERT, XLM (Lample & Conneau, 2019), XLNet (Yang et al., 2019b), and RoBERTa (Liu et al., 2...