arxiv: 2302.04166 · v2 · pith:6UG7QNNZ · submitted 2023-02-08 · 💻 cs.CL

GPTScore: Evaluate as You Desire

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu

Pith reviewed 2026-05-17 17:06 UTC · model grok-4.3

keywords models · evaluation · generation · gptscore · pre-trained · text · achieve · evaluate

The pith

GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating AI-generated text is difficult because quality depends on many subjective aspects such as coherence, relevance, or creativity, and traditional automatic metrics often fail to capture what humans care about. Collecting human ratings to train a new evaluator for each aspect is expensive and slow. GPTScore instead relies on the instruction-following ability of already-trained large models. A user writes a short instruction describing the desired quality, for example 'generate a summary that captures the main points of the following text,' and the candidate text is scored by how probable the model finds it under that instruction, so that text the model considers more likely receives a higher score. The authors ran experiments with 19 different models on four text generation tasks, including summarization and dialogue response generation, covering 22 evaluation aspects across 37 datasets. The main practical benefit is that new evaluation criteria can be defined and used immediately, without any additional labeled training data.
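One way to turn an instruction-following model into a scorer, and the one this sketch assumes, is to take a weighted sum of the log-probabilities the model assigns to each token of the candidate text given the instruction prompt. The function name `gpt_score`, the uniform default weights, and the sample numbers below are illustrative assumptions; obtaining real per-token log-probabilities from a model is outside the sketch.

```python
import math
from typing import Optional, Sequence


def gpt_score(token_logprobs: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of per-token log-probabilities of a candidate text.

    token_logprobs[t] is assumed to be log p(h_t | h_<t, prompt), as reported
    by some language model. If no weights are given, every token gets weight
    1/m, which makes the score a plain average over the m tokens.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if weights is None:
        weights = [1.0 / len(token_logprobs)] * len(token_logprobs)
    if len(weights) != len(token_logprobs):
        raise ValueError("weights and token_logprobs must have equal length")
    return sum(w * lp for w, lp in zip(weights, token_logprobs))


# Hypothetical log-probabilities for two candidates under the same prompt.
likely_candidate = [math.log(0.5), math.log(0.4), math.log(0.6)]
unlikely_candidate = [math.log(0.05), math.log(0.1), math.log(0.02)]

# The candidate the model finds more probable receives the higher score.
assert gpt_score(likely_candidate) > gpt_score(unlikely_candidate)
```

Averaging rather than summing keeps scores comparable across candidates of different lengths; whether that choice suits a given task is a design decision, not something this sketch settles.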

Core claim

Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.

Load-bearing premise

That the emergent zero-shot instruction-following abilities of the tested pre-trained models can produce scores that meaningfully reflect the desired evaluation criteria without task-specific fine-tuning or annotated samples.

Original abstract

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., FLAN-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples. We make our code publicly available at https://github.com/jinlanfu/GPTScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GPTScore shows you can prompt LLMs to score text on custom criteria defined in natural language, with broad tests across models and datasets but thin details on the actual score quality.

Hey, the main thing with this paper is that it demonstrates using off-the-shelf pre-trained models to score generated text based on whatever evaluation criteria you describe in plain instructions. They test this across 19 models from 80M parameters up to 175B, four generation tasks, 22 aspects, and 37 datasets, and release the code publicly. That setup lets you do customized, multi-aspect evaluation without collecting annotated samples for each new need. The experiments give evidence that zero-shot instruction following can produce usable scores in these cases, which lines up with the claim that emergent abilities handle this without task-specific tuning.

The scale of the testing is the strongest part here, covering both smaller models like FLAN-T5 variants and larger ones like GPT-3, and showing the approach generalizes across the reported setups. Releasing code helps with checking the implementation directly.

On the softer side, the summary leaves out specifics like exact correlation values with human judgments, the baselines compared against, or any statistical tests, so it's difficult to judge how reliable or superior the scores actually are in practice. The core assumption that the model outputs track the intended criteria holds in the described results without obvious internal contradictions or circular fitting, but independent checks on edge cases or bias in the scoring would clarify the limits.

This paper is for NLP folks working on generative models who need more flexible evaluation options than fixed metrics provide. Readers focused on reducing annotation costs or exploring LLM uses for assessment tasks would get the most out of the empirical breadth. It deserves peer review because the practical framing and experiment coverage are substantial enough to benefit from referee input on the validation details.

Referee Report

2 major / 2 minor

Summary. The paper proposes GPTScore, a framework that leverages the zero-shot instruction-following abilities of 19 generative pre-trained models (sizes 80M to 175B) to assign scores to generated texts according to arbitrary natural-language criteria. Experiments cover four text-generation tasks, 22 evaluation aspects, and 37 datasets; the central claim is that this yields effective, customized, multi-faceted evaluation without task-specific fine-tuning or annotated samples.

Significance. If the empirical results hold, the work offers a practical route to annotation-free, instruction-driven evaluation that directly addresses long-standing limitations in NLG assessment. The breadth of models and datasets tested, together with the public code release, supplies concrete empirical support for the utility of emergent abilities in this setting.

major comments (2)

[Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.
[§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.
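The first major comment asks for Pearson, Spearman, and Kendall correlations between metric scores and human judgments. A minimal, dependency-free sketch of those three quantities follows; the function names and the sample scores are assumptions for illustration, and the Kendall variant shown is tau-a, which makes no correction for ties.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)
    return total / (n * (n - 1) / 2)


# Hypothetical data: metric scores (average log-probabilities) and 1-5 human ratings.
metric_scores = [-1.2, -0.8, -2.5, -0.5, -1.9]
human_ratings = [3, 4, 1, 5, 2]

print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
print(kendall_tau_a(metric_scores, human_ratings))
```

The sample data is constructed so that the metric orders the five items exactly as the human ratings do; real evaluation data will rarely agree this cleanly, and a library implementation with tie handling and p-values would normally be preferable to this sketch.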

minor comments (2)

[Abstract] Abstract: the four tasks are not named; adding their names would immediately clarify the scope of the evaluation.
[Method] Notation: the prompt template is described in prose but never shown as a boxed example; including one concrete prompt per task would remove ambiguity for readers who wish to replicate.
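The second minor comment asks for one concrete prompt per task. The sketch below shows what such a template could look like for two tasks; the template wording, the `TEMPLATES` table, and the `build_prompt` helper are hypothetical illustrations, not the prompts used in the paper.

```python
from string import Template

# Hypothetical templates: the wording is illustrative only.
TEMPLATES = {
    "summarization": Template(
        "Generate a $aspect summary for the following text.\n\n"
        "Text: $source\n\nSummary:"
    ),
    "dialogue": Template(
        "Write a $aspect response to the following conversation.\n\n"
        "Conversation: $source\n\nResponse:"
    ),
}


def build_prompt(task: str, aspect: str, source: str) -> str:
    """Fill the task's template with an aspect word and the source text."""
    try:
        template = TEMPLATES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    return template.substitute(aspect=aspect, source=source)


prompt = build_prompt("summarization", "fluent", "The cat sat on the mat.")
print(prompt)
```

The candidate text would then be appended after the final label (`Summary:` or `Response:`) and scored by the model; keeping the aspect as a separate parameter is what lets a new criterion be tried without touching anything else.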

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Referee: [Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.

Authors: We agree that providing the precise correlation coefficients, details on the human judgment protocol, and statistical significance tests would enhance the quantitative evaluation of our claims. The human-judgment collection protocol is briefly described in the experimental setup, but we will expand this section with more details, including the number of annotators, inter-annotator agreement, and the exact procedure. We will add the specific Pearson, Spearman, and Kendall correlation values for all datasets and aspects to the main tables or a dedicated appendix. Additionally, we will include statistical significance tests (e.g., p-values) comparing GPTScore to baselines. These changes will be incorporated in the revised manuscript.
Revision: yes

Referee: [§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.

Authors: We appreciate this point regarding potential selective reporting. In §4.2, we focused on a subset of aspects that are commonly evaluated across tasks to provide a clear comparison within the space constraints of the main paper. To fully address the multi-faceted evaluation claim, we will include results for the complete set of 22 aspects in an extended table or appendix in the revised version. We will also add a statement in the main text explaining the selection of the subset for the primary table and confirming that GPTScore performs competitively across all aspects.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity: method applies off-the-shelf zero-shot prompting without fitted parameters or self-referential reductions

The paper presents GPTScore as a prompting-based evaluation framework that directly invokes the emergent zero-shot instruction-following abilities of existing pre-trained models (19 models from 80M to 175B parameters) to produce scores for generated text according to arbitrary natural-language criteria. No equations, parameter fitting, or derivation steps are described that would reduce the output scores back to the paper's own inputs, training data, or prior results by construction. The central claim is substantiated through direct empirical evaluation on 37 datasets spanning four tasks and 22 aspects, with public code release enabling external verification. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked; the approach treats model capabilities as an external, independently available resource rather than deriving them internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the domain assumption that pre-trained generative models possess reliable zero-shot scoring abilities via instructions; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Generative pre-trained models possess emergent zero-shot instruction-following abilities that can be directly used for text scoring.
Invoked as the foundation for the GPTScore framework in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Diagnosing Capability Gaps in Fine-Tuning Data
cs.LG · 2026-04 · unverdicted · novelty 7.0

GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
cs.AI · 2026-04 · unverdicted · novelty 7.0

LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge ...

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
cs.IR · 2026-04 · unverdicted · novelty 7.0

The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
cs.CL · 2026-03 · unverdicted · novelty 7.0

PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC · 2025-09 · conditional · novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
cs.AI · 2025-08 · unverdicted · novelty 7.0

ChatCLIDS creates a library of expert-validated virtual patients and tests LLM agents using evidence-based persuasive strategies in simulated longitudinal and adversarial health counseling sessions for closed-loop ins...

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
cs.CL · 2024-06 · accept · novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG · 2023-05 · conditional · novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
cs.AI · 2026-04 · unverdicted · novelty 6.0

A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

The Illusion of Insight in Reasoning Models
cs.AI · 2026-01 · unverdicted · novelty 6.0

Mid-reasoning shifts in reasoning models are rare symptoms of unstable inference that seldom improve accuracy and do not reflect intrinsic self-correction.

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
cs.AI · 2025-10 · unverdicted · novelty 6.0

TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
cs.CL · 2023-08 · conditional · novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
cs.CL · 2023-03 · conditional · novelty 6.0

G-Eval uses GPT-4 with chain-of-thought and form-filling to reach 0.514 Spearman correlation with humans on summarization, beating prior NLG metrics while noting a bias toward LLM outputs.

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
cs.CL · 2023-03 · unverdicted · novelty 6.0

SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.

Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL · 2026-04 · unverdicted · novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
cs.CL · 2025-12 · unverdicted · novelty 5.0

LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL · 2023-11 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Self-Refine: Iterative Refinement with Self-Feedback
cs.CL · 2023-03 · unverdicted · novelty 5.0

Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL · 2024-12 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.