GPTScore: Evaluate as You Desire
Pith reviewed 2026-05-17 17:06 UTC · model grok-4.3
The pith
GPTScore uses zero-shot prompting of generative models ranging from 80M to 175B parameters to evaluate text according to arbitrary natural language criteria, tested on 4 tasks, 22 aspects, and 37 datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.
Load-bearing premise
That the emergent zero-shot instruction-following abilities of the tested pre-trained models can produce scores that meaningfully reflect the desired evaluation criteria without task-specific fine-tuning or annotated samples.
Original abstract
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models. Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., FLAN-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions. This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples. We make our code publicly available at https://github.com/jinlanfu/GPTScore.
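The scoring idea in the abstract can be illustrated with a short sketch. It assumes that a generative model has already returned per-token log-probabilities for the text being scored, conditioned on a prompt that contains the instruction. The function name `gpt_score`, the uniform default weights, and all numbers are illustrative assumptions and are not taken from the authors' code.

```python
import math

def gpt_score(token_logprobs, weights=None):
    """Weighted sum of per-token log-probabilities of the text being scored."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if weights is None:
        # Assumption: uniform weights summing to 1, i.e. an average log-likelihood.
        weights = [1.0 / len(token_logprobs)] * len(token_logprobs)
    if len(weights) != len(token_logprobs):
        raise ValueError("weights and token_logprobs must have the same length")
    return sum(w * lp for w, lp in zip(weights, token_logprobs))

# Hypothetical log-probabilities for two candidate texts under the same prompt.
fluent = [math.log(0.5), math.log(0.4), math.log(0.6)]
disfluent = [math.log(0.1), math.log(0.05), math.log(0.2)]

score_fluent = gpt_score(fluent)
score_disfluent = gpt_score(disfluent)
print(score_fluent > score_disfluent)
```

Under this reading, a higher (less negative) score means the model finds the text more likely given the instruction, so changing the instruction changes what is being evaluated.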
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GPTScore, a framework that leverages the zero-shot instruction-following abilities of 19 generative pre-trained models (sizes 80M to 175B) to assign scores to generated texts according to arbitrary natural-language criteria. Experiments cover four text-generation tasks, 22 evaluation aspects, and 37 datasets; the central claim is that this yields effective, customized, multi-faceted evaluation without task-specific fine-tuning or annotated samples.
Significance. If the empirical results hold, the work offers a practical route to annotation-free, instruction-driven evaluation that directly addresses long-standing limitations in NLG assessment. The breadth of models and datasets tested, together with the public code release, supplies concrete empirical support for the utility of emergent abilities in this setting.
major comments (2)
- [Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.
- [§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.
minor comments (2)
- [Abstract] Abstract: the four tasks are not named; adding their names would immediately clarify the scope of the evaluation.
- [Method] Notation: the prompt template is described in prose but never shown as a boxed example; including one concrete prompt per task would remove ambiguity for readers who wish to replicate.
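To make the last minor comment concrete, here is a minimal sketch of what an instruction-style prompt template could look like. The aspect definitions, the template wording, and the helper `build_prompt` are hypothetical and are not the templates used in the paper.

```python
# Hypothetical aspect definitions; the wording is illustrative only.
ASPECT_DEFINITIONS = {
    "fluency": "fluent and grammatical",
    "relevance": "consistent with the important content of the text",
}

def build_prompt(task, aspect, source):
    """Combine a task name, an aspect definition, and the source text."""
    if aspect not in ASPECT_DEFINITIONS:
        raise KeyError(f"unknown aspect: {aspect}")
    definition = ASPECT_DEFINITIONS[aspect]
    return (
        f"Generate a {task} for the following text. "
        f"The {task} should be {definition}.\n"
        f"Text: {source}\n"
        f"{task.capitalize()}:"
    )

prompt = build_prompt(
    "summary", "fluency", "The council approved the budget on Monday."
)
print(prompt)
```

In a setup like this, the candidate text would be appended after the final `Summary:` marker and scored by the model; swapping the aspect definition is the only change needed to evaluate a different criterion.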
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.
- Referee: [Experiments] Experiments section: the manuscript asserts that GPTScore tracks desired criteria across 37 datasets, yet the main text and tables do not report the precise correlation coefficients (Pearson, Spearman, or Kendall), the human-judgment collection protocol, or any statistical significance tests; without these quantities the effectiveness claim cannot be quantitatively evaluated.
Authors: We agree that providing the precise correlation coefficients, details on the human judgment protocol, and statistical significance tests would enhance the quantitative evaluation of our claims. The human-judgment collection protocol is briefly described in the experimental setup, but we will expand this section with more details, including the number of annotators, inter-annotator agreement, and the exact procedure. We will add the specific Pearson, Spearman, and Kendall correlation values for all datasets and aspects to the main tables or a dedicated appendix. Additionally, we will include statistical significance tests (e.g., p-values) comparing GPTScore to baselines. These changes will be incorporated in the revised manuscript.
revision: yes
- Referee: [§4.2] §4.2 and Table 2: the comparison with baselines is presented only for a subset of aspects; the paper must show that GPTScore remains competitive on the full set of 22 aspects or explicitly state which aspects were omitted and why, as selective reporting directly affects the multi-faceted evaluation claim.
Authors: We appreciate this point regarding potential selective reporting. In §4.2, we focused on a subset of aspects that are commonly evaluated across tasks to provide a clear comparison within the space constraints of the main paper. To fully address the multi-faceted evaluation claim, we will include results for the complete set of 22 aspects in an extended table or appendix in the revised version. We will also add a statement in the main text explaining the selection of the subset for the primary table and confirming that GPTScore performs competitively across all aspects.
revision: yes
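The quantities discussed in the first exchange (Pearson, Spearman, and Kendall correlations with human judgments, plus a significance test) can be sketched in pure Python. This is an illustration under stated assumptions, not the authors' evaluation code: the data are made up, `kendall_tau_a` ignores tie corrections, and the permutation test is one of several reasonable choices. In practice a library such as SciPy (`scipy.stats.pearsonr`, `spearmanr`, `kendalltau`) would typically be used.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def permutation_p_value(x, y, stat=spearman, trials=2000, seed=0):
    """Two-sided p-value: share of shuffled pairings with |stat| >= observed."""
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(stat(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)

# Made-up numbers for illustration only.
metric_scores = [-1.2, -0.8, -2.5, -0.3, -1.9, -0.6]
human_ratings = [3, 4, 1, 5, 2, 4]

print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
print(kendall_tau_a(metric_scores, human_ratings))
print(permutation_p_value(metric_scores, human_ratings))
```

Reporting all three coefficients per dataset and aspect, together with a p-value, is what would let a reader check the effectiveness claim quantitatively.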
Circularity Check
No significant circularity: method applies off-the-shelf zero-shot prompting without fitted parameters or self-referential reductions
Full rationale
The paper presents GPTScore as a prompting-based evaluation framework that directly invokes the emergent zero-shot instruction-following abilities of existing pre-trained models (19 models from 80M to 175B parameters) to produce scores for generated text according to arbitrary natural-language criteria. No equations, parameter fitting, or derivation steps are described that would reduce the output scores back to the paper's own inputs, training data, or prior results by construction. The central claim is substantiated through direct empirical evaluation on 37 datasets spanning four tasks and 22 aspects, with public code release enabling external verification. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked; the approach treats model capabilities as an external, independently available resource rather than deriving them internally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Generative pre-trained models possess emergent zero-shot instruction-following abilities that can be directly used for text scoring.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced [unclear]
Relation between the paper passage and the cited Recognition theorem: unclear.
This nature helps us overcome several long-standing challenges in text evaluation--how to achieve customized, multi-faceted evaluation without the need for annotated samples.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- Diagnosing Capability Gaps in Fine-Tuning Data
  GoalCover detects capability gaps in fine-tuning datasets via interactive goal decomposition and LLM-based sample scoring, with experiments showing it distinguishes targeted gaps and improves downstream model rewards.
- Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
  LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge ...
- Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
  The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
  PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
- Evalet: Evaluating Large Language Models through Functional Fragmentation
  Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
- ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
  ChatCLIDS creates a library of expert-validated virtual patients and tests LLM agents using evidence-based persuasive strategies in simulated longitudinal and adversarial health counseling sessions for closed-loop ins...
- The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
  This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
- QLoRA: Efficient Finetuning of Quantized LLMs
  QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
- Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
  A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
- The Illusion of Insight in Reasoning Models
  Mid-reasoning shifts in reasoning models are rare symptoms of unstable inference that seldom improve accuracy and do not reflect intrinsic self-correction.
- Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
  TRACE is a reference-free multi-dimensional evaluation framework for tool-augmented LLM reasoning trajectories that uses an evidence bank and is validated on a new meta-evaluation dataset of flawed trajectories.
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
  Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
  G-Eval uses GPT-4 with chain-of-thought and form-filling to reach 0.514 Spearman correlation with humans on summarization, beating prior NLG metrics while noting a bias toward LLM outputs.
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
  SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
  LLM-PeerReview ensembles LLMs by scoring responses with LLM-as-Judge and selecting the best via averaging or truth inference, beating Smoothie-Global by 6.9-7.3 points on four datasets.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Self-Refine: Iterative Refinement with Self-Feedback
  Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.