CheckEval: A reliable LLM-as-a-judge framework for evaluating text generation using checklists
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 4
years
2026 4
representative citing papers
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation all reduce to a single number on non-degenerate data, while Cohen's κ supplies the extra signal on label-rate drift. A reporting checklist is also provided.
citing papers explorer
- IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents
  IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.
- FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning
  FaithMed applies reinforcement learning with process-level rewards derived from evidence-based medicine rubrics to improve both task performance and reasoning faithfulness in medical LLMs.
- Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
  Controlled experiments on synthetic post-training data show provenance-grounded gating and adaptive recovery improve yield and recall over baselines, with generator scale as the primary driver of downstream fine-tuning quality.