CheckEval: A reliable LLM-as-a-judge framework for evaluating text generation using checklists

Yukyung Lee, JoongHoon Kim, Jaehee Kim, Hyowon Cho, Jaewook Kang, Pilsung Kang, Najoung Kim · 2025 · DOI 10.18653/v1/2025.emnlp-main.796

4 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM-judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, the phi coefficient, and the Matthews correlation coefficient all reduce to the same number on non-degenerate data; Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

cs.CL · 2026-07-01 · unverdicted · novelty 5.0

FaithMed applies reinforcement learning with process-level rewards derived from evidence-based medicine rubrics to improve both task performance and reasoning faithfulness in medical LLMs.

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

Controlled experiments on synthetic post-training data show provenance-grounded gating and adaptive recovery improve yield and recall over baselines, with generator scale as the primary driver of downstream fine-tuning quality.

citing papers explorer

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents · cs.CL · 2026-05-27 · unverdicted · none · ref 30
FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning · cs.CL · 2026-07-01 · unverdicted · none · ref 63
Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation · cs.CL · 2026-06-09 · unverdicted · none · ref 3
