GPTScore: Evaluate as You Desire

Jinlan Fu; Pengfei Liu; See-Kiong Ng; Zhengbao Jiang

arxiv: 2302.04166 · v2 · pith:6UG7QNNZnew · submitted 2023-02-08 · 💻 cs.CL

Pith reviewed 2026-05-17 17:06 UTC · model grok-4.3

keywords models · evaluation · generation · gptscore · pre-trained · text · achieve · evaluate

The pith

GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating AI-generated text is difficult because quality depends on many subjective aspects such as coherence, relevance, or creativity, and traditional automatic metrics often fail to capture what humans care about. Collecting human ratings to train a new evaluator for each aspect is expensive and slow. GPTScore instead relies on the instruction-following ability of already-trained large models. A user writes a short description of the desired evaluation, for example 'rate how well this summary captures the main points,' and the model returns a numeric score. The authors ran experiments with 19 different models on text generation tasks including summarization and dialogue response. They covered 22 different evaluation aspects across 37 datasets. The main practical benefit is that new evaluation criteria can be defined and used immediately without any additional labeled training data.
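
As a rough illustration of how a generative model can act as a scorer, the sketch below treats the score as the average log-probability that a model assigns to the candidate text when conditioned on an instruction prompt. This is an assumption about the mechanics, not a reproduction of the authors' code: `likelihood_score`, `toy_logprob`, and the prompt wording are hypothetical, and `toy_logprob` is only a stand-in for a real language model.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: log p(token | prompt, previously scored tokens).
TokenLogProb = Callable[[str, Sequence[str], str], float]


def likelihood_score(prompt: str, candidate: Sequence[str],
                     token_logprob: TokenLogProb) -> float:
    """Average conditional log-probability of the candidate tokens."""
    if not candidate:
        raise ValueError("candidate must contain at least one token")
    total = 0.0
    for i, token in enumerate(candidate):
        total += token_logprob(prompt, candidate[:i], token)
    return total / len(candidate)


def toy_logprob(prompt: str, prefix: Sequence[str], token: str) -> float:
    """Stand-in for a real model: words present in the prompt are likelier."""
    prompt_words = set(prompt.lower().split())
    return math.log(0.5) if token.lower() in prompt_words else math.log(0.05)


# Hypothetical instruction and source text.
prompt = (
    "Generate a summary that captures the main points of the text.\n\n"
    "Text: the cat sat on the mat while the dog slept\n\n"
    "Summary:"
)
faithful = "the cat sat on the mat".split()
unrelated = "stocks fell sharply on monday".split()

faithful_score = likelihood_score(prompt, faithful, toy_logprob)
unrelated_score = likelihood_score(prompt, unrelated, toy_logprob)
print(faithful_score > unrelated_score)
```

With a real model, `toy_logprob` would be replaced by the token log-probabilities the model reports for the candidate; changing the instruction text is then the only step needed to target a different aspect.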

Core claim

Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.

Load-bearing premise

That the emergent zero-shot instruction-following abilities of the tested pre-trained models can produce scores that meaningfully reflect the desired evaluation criteria without task-specific fine-tuning or annotated samples.

Original abstract

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., FLAN-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples. We make our code publicly available at https://github.com/jinlanfu/GPTScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GPTScore shows you can prompt LLMs to score text on custom criteria defined in natural language, with broad tests across models and datasets but thin details on the actual score quality.

Hey, the main thing with this paper is that it demonstrates using off-the-shelf pre-trained models to score generated text based on whatever evaluation criteria you describe in plain instructions. They test this across 19 models from 80M parameters up to 175B, four generation tasks, 22 aspects, and 37 datasets, and release the code publicly. That setup lets you do customized, multi-aspect evaluation without collecting annotated samples for each new need.

The experiments give evidence that zero-shot instruction following can produce usable scores in these cases, which lines up with the claim that emergent abilities handle this without task-specific tuning. The scale of the testing is the strongest part here, covering both smaller models like FLAN-T5 variants and larger ones like GPT-3, and showing the approach generalizes across the reported setups. Releasing code helps with checking the implementation directly.

On the softer side, the summary leaves out specifics like exact correlation values with human judgments, the baselines compared against, or any statistical tests, so it's difficult to judge how reliable or superior the scores actually are in practice. The core assumption that the model outputs track the intended criteria holds in the described results without obvious internal contradictions or circular fitting, but independent checks on edge cases or bias in the scoring would clarify the limits.

This paper is for NLP folks working on generative models who need more flexible evaluation options than fixed metrics provide. Readers focused on reducing annotation costs or exploring LLM uses for assessment tasks would get the most out of the empirical breadth. It deserves peer review because the practical framing and experiment coverage are substantial enough to benefit from referee input on the validation details.

Referee Report

2 major / 2 minor

Summary. The paper proposes GPTScore, a framework that leverages the zero-shot instruction-following abilities of 19 generative pre-trained models (sizes 80M to 175B) to assign scores to generated texts according to arbitrary natural-language criteria. Experiments cover four text-generation tasks, 22 evaluation aspects, and 37 datasets; the central claim is that this yields effective, customized, multi-faceted evaluation without task-specific fine-tuning or annotated samples.

Significance. If the empirical results hold, the work offers a practical route to annotation-free, instruction-driven evaluation that directly addresses long-standing limitations in NLG assessment. The breadth of models and datasets tested, together with the public code release, supplies concrete empirical support for the utility of emergent abilities in this setting.

major comments (2)

[Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.
[§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.
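
For readers unfamiliar with the three coefficients the first major comment asks for, the following is a minimal pure-Python sketch of how metric scores could be compared against human ratings. The score lists are made-up illustrative numbers, not results from the paper, and the function names are hypothetical; in practice a statistics library would normally be used instead.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: pairwise agreement in ordering, adjusted for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # pairs tied in both variables are excluded
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined for constant input")
    return (concordant - discordant) / denom


# Illustrative numbers only: log-likelihood-style metric scores and
# 1-5 human ratings for five hypothetical system outputs.
metric_scores = [-0.9, -0.4, -1.6, -0.2, -1.1]
human_ratings = [3, 5, 1, 4, 2]

for name, fn in [("pearson", pearson), ("spearman", spearman),
                 ("kendall_tau_b", kendall_tau_b)]:
    print(name, round(fn(metric_scores, human_ratings), 3))
```

Pearson measures linear agreement on the raw values, while Spearman and Kendall only look at ordering, which is why reports of this kind usually state which coefficient was used.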

minor comments (2)

[Abstract] Abstract: the four tasks are not named; adding their names would immediately clarify the scope of the evaluation.
[Method] Notation: the prompt template is described in prose but never shown as a boxed example; including one concrete prompt per task would remove ambiguity for readers who wish to replicate.
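
To make the second minor comment concrete, here is a small sketch of what one prompt template per task and aspect could look like. The template wording, the task and aspect keys, and `render_prompt` are all hypothetical placeholders chosen for illustration; they are not the templates used in the paper.

```python
from string import Template

# Hypothetical templates keyed by (task, aspect); wording is illustrative only.
PROMPT_TEMPLATES = {
    ("summarization", "relevance"): Template(
        "Generate a summary that captures the main points of the "
        "following text.\n\nText: $source\n\nSummary: $candidate"
    ),
    ("dialogue", "coherence"): Template(
        "Write a response that is coherent with the following "
        "conversation.\n\nConversation: $source\n\nResponse: $candidate"
    ),
}


def render_prompt(task: str, aspect: str, source: str, candidate: str) -> str:
    """Fill the template registered for one (task, aspect) pair."""
    try:
        template = PROMPT_TEMPLATES[(task, aspect)]
    except KeyError:
        raise KeyError(
            f"no template for task={task!r}, aspect={aspect!r}"
        ) from None
    return template.substitute(source=source, candidate=candidate)


print(render_prompt("summarization", "relevance",
                    "The cat sat on the mat.", "A cat sat on a mat."))
```

Keeping templates in a table like this makes it easy to publish them verbatim, which is what a reader would need in order to replicate a prompt-based metric.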

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Referee: [Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.

Authors: We agree that providing the precise correlation coefficients, details on the human judgment protocol, and statistical significance tests would enhance the quantitative evaluation of our claims. The human-judgment collection protocol is briefly described in the experimental setup, but we will expand this section with more details, including the number of annotators, inter-annotator agreement, and the exact procedure. We will add the specific Pearson, Spearman, and Kendall correlation values for all datasets and aspects to the main tables or a dedicated appendix. Additionally, we will include statistical significance tests (e.g., p-values) comparing GPTScore to baselines. These changes will be incorporated in the revised manuscript.

revision: yes

Referee: [§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.

Authors: We appreciate this point regarding potential selective reporting. In §4.2, we focused on a subset of aspects that are commonly evaluated across tasks to provide a clear comparison within the space constraints of the main paper. To fully address the multi-faceted evaluation claim, we will include results for the complete set of 22 aspects in an extended table or appendix in the revised version. We will also add a statement in the main text explaining the selection of the subset for the primary table and confirming that GPTScore performs competitively across all aspects.

revision: yes
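
The rebuttal promises significance tests without saying which one. One option, sketched below under that caveat, is a paired bootstrap: resample the evaluated items with replacement and count how often one metric fails to correlate better with human ratings than another. All data here are invented for illustration, and `paired_bootstrap_p` is a hypothetical helper, not something taken from the paper or its code.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation, or None when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / sqrt(var_x * var_y)


def paired_bootstrap_p(human, metric_a, metric_b, n_resamples=2000, seed=0):
    """Share of resamples where metric_a does NOT beat metric_b.

    A small value suggests metric_a's higher correlation with the human
    ratings is unlikely to be an artifact of which items were sampled.
    """
    rng = random.Random(seed)
    n = len(human)
    not_better = valid = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        r_a = pearson(h, [metric_a[i] for i in idx])
        r_b = pearson(h, [metric_b[i] for i in idx])
        if r_a is None or r_b is None:
            continue  # degenerate resample: correlation undefined
        valid += 1
        if r_a <= r_b:
            not_better += 1
    if valid == 0:
        raise ValueError("no valid resamples; inputs may be constant")
    return not_better / valid


# Invented data: human ratings plus two hypothetical metrics.
human = [1, 2, 3, 4, 5, 6, 7, 8]
metric_a = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3, 6.9, 8.2]  # tracks the ratings
metric_b = [4.0, 6.5, 1.2, 7.8, 3.3, 2.1, 8.4, 5.0]  # mostly noise

p_value = paired_bootstrap_p(human, metric_a, metric_b)
print(p_value < 0.05)
```

The resampling is paired, meaning both metrics are always scored on the same resampled items, so differences between them are not confounded by item difficulty.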

Circularity Check

0 steps flagged

No significant circularity: method applies off-the-shelf zero-shot prompting without fitted parameters or self-referential reductions

The paper presents GPTScore as a prompting-based evaluation framework that directly invokes the emergent zero-shot instruction-following abilities of existing pre-trained models (19 models from 80M to 175B parameters) to produce scores for generated text according to arbitrary natural-language criteria. No equations, parameter fitting, or derivation steps are described that would reduce the output scores back to the paper's own inputs, training data, or prior results by construction. The central claim is substantiated through direct empirical evaluation on 37 datasets spanning four tasks and 22 aspects, with public code release enabling external verification. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked; the approach treats model capabilities as an external, independently available resource rather than deriving them internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that pre-trained generative models possess reliable zero-shot scoring abilities via instructions; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Generative pre-trained models possess emergent zero-shot instruction-following abilities that can be directly used for text scoring.
Invoked as the foundation for the GPTScore framework in the abstract.

pith-pipeline@v0.9.0 · 5502 in / 1222 out tokens · 71655 ms · 2026-05-17T17:06:06.842651+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear

This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
cs.CL · 2026-06 · unverdicted · novelty 8.0
Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
cs.CL · 2026-06 · unverdicted · novelty 7.0
LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.

Diagnosing Capability Gaps in Fine-Tuning Data
cs.LG · 2026-04 · unverdicted · novelty 7.0
GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
cs.AI · 2026-04 · unverdicted · novelty 7.0
LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge ...

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
cs.IR · 2026-04 · unverdicted · novelty 7.0
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
cs.CL · 2026-03 · unverdicted · novelty 7.0
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

On the Factual Consistency of Text-based Explainable Recommendation Models
cs.IR · 2025-12 · unverdicted · novelty 7.0
A prompting pipeline and statement-level metrics show that six state-of-the-art text-based explainable recommendation models achieve high semantic similarity but very low factual consistency on Amazon review data.

Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC · 2025-09 · conditional · novelty 7.0
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
cs.AI · 2025-08 · unverdicted · novelty 7.0
ChatCLIDS creates a library of expert-validated virtual patients and tests LLM agents using evidence-based persuasive strategies in simulated longitudinal and adversarial health counseling sessions for closed-loop ins...

TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
cs.CL · 2025-04 · unverdicted · novelty 7.0
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
cs.CL · 2024-06 · accept · novelty 7.0
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG · 2023-05 · conditional · novelty 7.0
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

CARE: A Conformal Safety Layer for Medical Summarization
cs.CL · 2026-06 · unverdicted · novelty 6.0
CARE applies conformal risk control to deliver distribution-free guarantees bounding hallucination probability and omission fraction in medical summarization while reducing flagged sentences.

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
cs.CL · 2026-05 · conditional · novelty 6.0
LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
cs.AI · 2026-04 · unverdicted · novelty 6.0
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

The Illusion of Insight in Reasoning Models
cs.AI · 2026-01 · unverdicted · novelty 6.0
Mid-reasoning shifts in reasoning models are rare symptoms of unstable inference that seldom improve accuracy and do not reflect intrinsic self-correction.

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
cs.AI · 2025-10 · unverdicted · novelty 6.0
TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.

Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
cs.SE · 2025-03 · accept · novelty 6.0
Empirical study of 3977 agent trajectories finds Python execution errors correlate with lower success rates on GitHub issues, flags challenging errors, and reports three confirmed bugs in the SWE-Bench platform.

LLM Evaluators Recognize and Favor Their Own Generations
cs.CL · 2024-04 · unverdicted · novelty 6.0
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
cs.CL · 2023-08 · conditional · novelty 6.0
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
cs.CL · 2023-03 · conditional · novelty 6.0
G-Eval uses GPT-4 with chain-of-thought and form-filling to reach 0.514 Spearman correlation with humans on summarization, beating prior NLG metrics while noting a bias toward LLM outputs.

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
cs.CL · 2023-03 · unverdicted · novelty 6.0
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind
cs.CV · 2026-06 · unverdicted · novelty 5.0
A dual-agent closed-loop system integrates Theory of Mind reasoning with multimodal video generation to create social avatars that outperform full-information baselines on dialogue quality under information asymmetry.

Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL · 2026-04 · unverdicted · novelty 5.0
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
cs.CL · 2025-12 · unverdicted · novelty 5.0
LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

Instruction-Following Evaluation for Large Language Models
cs.CL · 2023-11 · unverdicted · novelty 5.0
IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL · 2023-11 · unverdicted · novelty 5.0
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Self-Refine: Iterative Refinement with Self-Feedback
cs.CL · 2023-03 · unverdicted · novelty 5.0
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.

SentTrack: Sentiment-Driven Bottleneck Detection in GitHub Issue Repositories
cs.SE · 2026-06 · unverdicted · novelty 4.0
SentTrack applies LLM summarization, UMAP+HDBSCAN clustering, and the ABCDE interaction framework to GitHub issues, reporting 49% stagnation and 13% resolution rates in one repository as evidence of a dominant resolut...

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV · 2025-02 · unverdicted · novelty 4.0
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL · 2024-12 · accept · novelty 3.0
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
cs.CL · 2026-06 · unverdicted · novelty 2.0
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.