hub

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu · 2023 · cs.CL · arXiv 2303.16634

27 Pith papers cite this work. Polarity classification is still indexing.

27 Pith papers citing it

open full Pith review browse 27 citing papers arXiv PDF

abstract

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators. In this work, we present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm, to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin. We also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts. The code is at https://github.com/nlpyang/geval

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 2

citation-polarity summary

use method 2

representative citing papers

Green Shielding: A User-Centric Approach Towards Trustworthy AI

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.

BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

cs.AI · 2026-04-11 · conditional · novelty 7.0

TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.

An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.

SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.

Diversity in Large Language Models under Supervised Fine-Tuning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

DWTSumm: Discrete Wavelet Transform for Document Summarization

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

DWT decomposes sentence- or word-level embeddings into multi-resolution components that preserve semantics for direct or LLM-guided summarization, yielding up to 97% fidelity and gains in BERTScore and semantic metrics over GPT-4o baselines on clinical and legal benchmarks.

Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

An LLM-based topic modeling method with a custom evaluation framework improves topic interpretability, specificity, and polarity consistency over prior approaches when linking corporate review text to external outcomes such as employee morale.

Learning to Control Summaries with Score Ranking

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

cs.SD · 2026-04-10 · unverdicted · novelty 6.0

NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.

ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

ProMedical builds a 50k preference dataset with fine-grained rubrics and a multi-dimensional reward model that disentangles safety from proficiency, yielding 22.3% accuracy and 21.7% safety gains on Qwen3-8B via GRPO while generalizing to UltraMedical.

Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

cs.AI · 2026-04-04 · unverdicted · novelty 6.0

Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

cs.CL · 2023-10-17 · unverdicted · novelty 6.0

Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

AFA: Identity-Aware Memory for Preventing Persona Confusion in Multi-User Dialogue

cs.HC · 2026-04-27 · unverdicted · novelty 5.0

AFA with identity-aware routing raises persona attribution accuracy from 35.7% to 61.3% on a new synthetic multi-user dialogue dataset.

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

Calibrating Model-Based Evaluation Metrics for Summarization

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

cs.CL · 2023-11-09 · unverdicted · novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank

cs.LG · 2026-05-11 · unverdicted · novelty 4.0

Multi-objective LTR combining clicks, VLM labels, and locale boosting improves relevance and local content visibility across five growth markets.

citing papers explorer

Showing 27 of 27 citing papers.

Green Shielding: A User-Centric Approach Towards Trustworthy AI cs.CL · 2026-04-27 · unverdicted · none · ref 53 · internal anchor
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents cs.AI · 2026-04-21 · unverdicted · none · ref 20 · internal anchor
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories cs.CL · 2026-04-18 · unverdicted · none · ref 13 · internal anchor
BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems cs.CL · 2026-04-14 · unverdicted · none · ref 14 · internal anchor
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale cs.AI · 2026-04-11 · conditional · none · ref 26 · internal anchor
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation cs.IR · 2026-04-04 · unverdicted · none · ref 26 · internal anchor
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation cs.CV · 2026-05-07 · unverdicted · none · ref 31 · internal anchor
MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES cs.AI · 2026-05-04 · unverdicted · none · ref 19 · internal anchor
HADES is an agentic AI system that generates mechanistic hypotheses for drug-induced liver injury using molecular, metabolite, and pathway evidence, outperforming prior binary classifiers on the new DILER benchmark while establishing a baseline for hypothesis alignment.
SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA? cs.CV · 2026-05-03 · unverdicted · none · ref 12 · internal anchor
SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.
Diversity in Large Language Models under Supervised Fine-Tuning cs.LG · 2026-04-30 · unverdicted · none · ref 71 · 2 links · internal anchor
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
DWTSumm: Discrete Wavelet Transform for Document Summarization cs.CL · 2026-04-22 · unverdicted · none · ref 21 · internal anchor
DWT decomposes sentence- or word-level embeddings into multi-resolution components that preserve semantics for direct or LLM-guided summarization, yielding up to 97% fidelity and gains in BERTScore and semantic metrics over GPT-4o baselines on clinical and legal benchmarks.
Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data cs.CL · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
An LLM-based topic modeling method with a custom evaluation framework improves topic interpretability, specificity, and polarity consistency over prior approaches when linking corporate review text to external outcomes such as employee morale.
Learning to Control Summaries with Score Ranking cs.CL · 2026-04-19 · unverdicted · none · ref 41 · internal anchor
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval cs.AI · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs cs.SD · 2026-04-10 · unverdicted · none · ref 20 · internal anchor
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
ProMedical: Hierarchical Fine-Grained Criteria Modeling for Medical LLM Alignment via Explicit Injection cs.AI · 2026-04-09 · unverdicted · none · ref 2 · internal anchor
ProMedical builds a 50k preference dataset with fine-grained rubrics and a multi-dimensional reward model that disentangles safety from proficiency, yielding 22.3% accuracy and 21.7% safety gains on Qwen3-8B via GRPO while generalizing to UltraMedical.
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge cs.AI · 2026-04-07 · unverdicted · none · ref 29 · internal anchor
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge cs.AI · 2026-04-04 · unverdicted · none · ref 10 · internal anchor
Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection cs.CL · 2023-10-17 · unverdicted · none · ref 154 · internal anchor
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
AFA: Identity-Aware Memory for Preventing Persona Confusion in Multi-User Dialogue cs.HC · 2026-04-27 · unverdicted · none · ref 11 · internal anchor
AFA with identity-aware routing raises persona attribution accuracy from 35.7% to 61.3% on a new synthetic multi-user dialogue dataset.
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator cs.AI · 2026-04-27 · unverdicted · none · ref 17 · internal anchor
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
Calibrating Model-Based Evaluation Metrics for Summarization cs.CL · 2026-04-19 · unverdicted · none · ref 57 · internal anchor
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 198 · internal anchor
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank cs.LG · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
Multi-objective LTR combining clicks, VLM labels, and locale boosting improves relevance and local content visibility across five growth markets.
A Community-Based Approach for Stance Distribution and Argument Organization cs.CL · 2026-04-18 · unverdicted · none · ref 12 · internal anchor
Unsupervised graph community detection organizes arguments to reveal stance distributions in debates.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 150 · internal anchor
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition cs.AI · 2026-04-06 · unverdicted · none · ref 37 · 2 links · internal anchor
OpenCLAW-P2P v7.0 adds mathematical corrections for dimensional consistency and range constraints plus ecosystem expansions including new 4B/9B parameter models to a decentralized AI peer-review platform that claims >85% accuracy on fabricated citation detection.

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer