QA-calibration of language model confidence scores. arXiv preprint arXiv:2410.06615.
3 papers cite this work.
Representative citing papers:
- Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
  Calibrating the full set of LLM judges with labeled data halves calibration error relative to top-5 accuracy selection on RewardBench2 and outperforms it on four benchmarks.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
  A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
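The common thread across these papers is reducing calibration error, i.e. the gap between a model's stated confidence and its empirical accuracy. As a point of reference, a minimal sketch of expected calibration error (ECE), the standard binned metric; the 10-bin equal-width scheme here is a generic illustration, not a detail taken from any of the papers above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |confidence - accuracy| gap over equal-width bins.

    confidences: list of predicted probabilities in [0, 1]
    correct: list of 0/1 outcomes, same length
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; the first bin also includes 0.
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        # Each bin contributes its gap, weighted by its share of samples.
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Toy example: high-confidence answers that are right, one low-confidence
# answer that is wrong -- a small residual gap remains in each bin.
print(expected_calibration_error([0.95, 0.95, 0.05], [1, 1, 0]))
```

Calibration methods like those in the cited papers (binned recalibration, judge-ensemble calibration, progressive confidence estimation) aim to drive this quantity toward zero on held-out labeled data.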