IPO-Mine releases a toolkit and a large multimodal dataset for structured analysis of IPO filings, and shows that state-of-the-art models diverge from human judgments of chart quality and misleadingness.
CheckEval: A reliable LLM-as-a-judge framework for evaluating text generation using checklists
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
years: 2026 (3)
representative citing papers
Controlled experiments on synthetic post-training data show provenance-grounded gating and adaptive recovery improve yield and recall over baselines, with generator scale as the primary driver of downstream fine-tuning quality.
citing papers explorer
- Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
For binary LLM-judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation coefficient all reduce to a single number on non-degenerate data, while Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.
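The coincidence of the correlation metrics on binary labels can be checked numerically. The following is a minimal sketch, not code from the cited paper: the function names and the toy `human` and `judge` label vectors are made up for illustration, and the judge is deliberately set to predict the positive class more often than the human to mimic label-rate drift.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's r computed from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(v):
    """Mid-ranks: tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho as Pearson's r on mid-ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b with the standard tie correction."""
    sign = lambda d: (d > 0) - (d < 0)
    s = nx = ny = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        s += sign(dx) * sign(dy)   # concordant minus discordant
        nx += dx != 0              # pairs not tied in x
        ny += dy != 0              # pairs not tied in y
    return s / math.sqrt(nx * ny)

def confusion(x, y):
    """Counts (tp, fp, fn, tn) with x as reference and y as prediction."""
    tp = sum(a == 1 and b == 1 for a, b in zip(x, y))
    fp = sum(a == 0 and b == 1 for a, b in zip(x, y))
    fn = sum(a == 1 and b == 0 for a, b in zip(x, y))
    tn = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return tp, fp, fn, tn

def mcc(x, y):
    """Matthews correlation; identical to phi for a 2x2 table."""
    tp, fp, fn, tn = confusion(x, y)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def cohen_kappa(x, y):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = confusion(x, y)
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical toy labels: the judge over-predicts the positive class.
human = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

correlations = {
    "pearson": pearson(human, judge),
    "spearman": spearman(human, judge),
    "kendall_tau_b": kendall_tau_b(human, judge),
    "phi_mcc": mcc(human, judge),
}
kappa = cohen_kappa(human, judge)

for name, value in correlations.items():
    print(f"{name}: {value:.4f}")
print(f"cohen_kappa: {kappa:.4f}")
```

On these toy labels the four correlation-style metrics agree at about 0.535, while Cohen's κ comes out lower at about 0.444, because the judge's positive rate (0.7) differs from the human's (0.4). Degenerate data, where either rater uses only one label, would make the correlation denominators zero, which is why the claim is restricted to non-degenerate inputs.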