Qa-calibration of language model confidence scores

· 2024 · arXiv 2410.06615

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

stat.ME · 2026-05-10 · unverdicted · novelty 6.0

Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

cs.RO · 2025-11-04 · unverdicted · novelty 6.0

LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.

Calibrating Model-Based Evaluation Metrics for Summarization

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.

Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

cs.AI · 2026-04-07 · unverdicted · novelty 4.0

A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration cs.AI · 2026-04-07 · unverdicted · none · ref 14
A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.

Qa-calibration of language model confidence scores

fields

years

verdicts

representative citing papers

citing papers explorer