Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
Qa-calibration of language model confidence scores
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.
citing papers explorer
-
Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
Calibrating the full set of LLM judges with labeled data halves calibration error versus top-5 accuracy selection on RewardBench2 and outperforms on four benchmarks.
-
LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration
A deep research agent incorporates progressive confidence estimation and calibration to produce trustworthy reports with transparent confidence scores on claims.