4 Pith papers cite this work.
Citing papers
- Online Learning-to-Defer with Varying Experts: Presents the first online learning-to-defer algorithm for multiclass classification with varying experts, with regret bounds of O((n + n_e) T^{2/3}) in the general case and O((n + n_e) sqrt(T)) under a low-noise condition. A hedged sketch of a deferral loop appears after this list.
- Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking: BICR trains a lightweight probe on frozen LVLM hidden states with blind-image contrastive ranking, penalizing confidence on blacked-out inputs; it yields top calibration and discrimination across five models and multiple tasks at low parameter cost. A sketch of the ranking objective appears after this list.
- Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation: Calibrates confidence for reasoning LLMs from a single generation, without labels, by distilling an offline self-consistency proxy; it outperforms baselines on math and QA tasks and improves selective prediction. A sketch of the distillation step appears after this list.
- Language Models (Mostly) Know What They Know: Language models are well calibrated when asked to estimate the probability that their own answers are correct, and calibration improves with model scale. A calibration-check sketch appears after this list.
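
For the learning-to-defer entry, here is a minimal epsilon-greedy sketch of the core decision: at each round the learner either predicts itself or pays a deferral cost to consult one of the n_e experts, tracking empirical losses per arm. This is a generic bandit-style illustration, not the paper's algorithm or its stated regret bounds; model_predict, DEFER_COST, and the toy experts are all assumptions.

```python
import random

DEFER_COST = 0.1        # illustrative fixed cost of consulting an expert

def model_predict(x):   # stand-in classifier for the demo
    return x % 2

def online_l2d(stream, experts, epsilon=0.1):
    arms = len(experts) + 1            # arm 0 = predict; arms 1..n_e = defer
    loss_sum = [0.0] * arms
    pulls = [1e-9] * arms              # avoids division by zero early on
    total = 0.0
    for x, y in stream:
        if random.random() < epsilon:  # explore a uniformly random arm
            a = random.randrange(arms)
        else:                          # exploit the lowest empirical loss
            a = min(range(arms), key=lambda i: loss_sum[i] / pulls[i])
        pred = model_predict(x) if a == 0 else experts[a - 1](x)
        cost = float(pred != y) + (DEFER_COST if a > 0 else 0.0)
        loss_sum[a] += cost
        pulls[a] += 1
        total += cost
    return total

# toy run: one constant expert, one accurate expert
data = [(i, i % 2) for i in range(1000)]
print(online_l2d(iter(data), [lambda x: 0, lambda x: x % 2]))
```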
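For BICR, a sketch of the contrastive-ranking idea as the summary describes it: a small probe scores confidence from frozen LVLM hidden states and is trained so the score on the grounded (real-image) input beats the score on the blind (blacked-out) input by a margin. The probe shape, hidden size, and margin value are assumptions, not BICR's actual configuration.

```python
import torch
import torch.nn as nn

HIDDEN = 4096                          # assumed LVLM hidden size
probe = nn.Sequential(nn.Linear(HIDDEN, 256), nn.ReLU(), nn.Linear(256, 1))
rank_loss = nn.MarginRankingLoss(margin=0.5)   # margin is an assumption
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)

def train_step(h_grounded, h_blind):
    """Hidden states from the frozen LVLM: (batch, HIDDEN) each."""
    s_real = probe(h_grounded).squeeze(-1)    # confidence on the real image
    s_blind = probe(h_blind).squeeze(-1)      # confidence on blacked-out input
    # target +1: grounded score should exceed the blind score by the margin
    loss = rank_loss(s_real, s_blind, torch.ones_like(s_real))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# toy run with random stand-in hidden states
print(train_step(torch.randn(8, HIDDEN), torch.randn(8, HIDDEN)))
```

Only the probe's parameters reach the optimizer, which matches the summary's point about low parameter cost: the LVLM itself stays frozen.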
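For the single-generation calibration entry, a sketch of one plausible reading of the recipe: offline, sample k reasoning chains per question and use the majority-answer agreement rate as a proxy confidence label, then train a head to predict that label from a single generation's features. sample_answer, the feature size, and the regression objective are hypothetical placeholders, not the paper's specifics.

```python
import random
from collections import Counter
import torch
import torch.nn as nn

def sample_answer(question):   # stand-in for sampling one reasoning chain
    return random.choice(["A", "A", "B"])

def proxy_label(question, k=16):
    """Offline proxy: agreement rate of the majority answer over k samples."""
    answers = [sample_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][1] / k

head = nn.Linear(768, 1)       # assumed single-generation feature size
opt = torch.optim.Adam(head.parameters(), lr=1e-4)

def distill_step(feats, labels):
    """feats: (batch, 768); labels: proxy agreement rates in [0, 1]."""
    conf = torch.sigmoid(head(feats)).squeeze(-1)
    loss = nn.functional.binary_cross_entropy(conf, labels)  # regress the rate
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# toy run: random features against proxy labels
labels = torch.tensor([proxy_label("q") for _ in range(8)])
print(distill_step(torch.randn(8, 768), labels))
```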
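Finally, for the self-knowledge entry, a sketch of how the calibration claim can be checked: elicit a self-reported P(correct) per answer, then compute expected calibration error (ECE) over confidence bins. The 10-bin ECE is a standard choice for this kind of evaluation, not a detail taken from the paper.

```python
def expected_calibration_error(confs, correct, bins=10):
    """confs: self-reported P(correct) in [0, 1]; correct: 0/1 outcomes."""
    n = len(confs)
    ece = 0.0
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        in_bin = [i for i, c in enumerate(confs)
                  if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(confs[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(avg_conf - accuracy)   # mass-weighted gap
    return ece

# toy run: near-calibrated confidences give a small ECE
print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 1, 0]))
```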