A Comparative Study of Quality Evaluation Methods for Text Summarization, June 2024
3 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (3)
Verdicts: UNVERDICTED (3)
Citing papers
- TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
- LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
LLM-ReSum uses LLM self-evaluation in a closed feedback loop to refine summaries, improving factual accuracy by up to 33% and coverage by 39% with 89% human preference.
- LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization
LongSumEval evaluates long-document summaries via answerability and factual alignment of generated QA pairs, yielding stronger human correlation than prior metrics and enabling iterative self-improvement.