arXiv preprint arXiv:2511.21140

8 Pith papers cite this work.
abstract

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
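
The bias described in the abstract can be illustrated with a small sketch. If a judge has sensitivity q1 and specificity q0, the expected naive score is theta*q1 + (1 - theta)*(1 - q0), where theta is the true score. Solving for theta gives the classical Rogan-Gladen plug-in correction. The abstract does not spell out the paper's exact estimator, so the code below is an assumption based on that standard form, and the function names `estimate_rates` and `corrected_score` are hypothetical. It covers only the point estimate, not the confidence intervals or the adaptive calibration allocation.

```python
from typing import Iterable, Tuple


def estimate_rates(calibration: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Estimate (sensitivity, specificity) from (human_label, judge_label) pairs.

    Labels are 1 for "correct" and 0 for "incorrect". Hypothetical helper,
    not taken from the paper.
    """
    tp = fn = tn = fp = 0
    for human, judge in calibration:
        if human == 1:
            if judge == 1:
                tp += 1
            else:
                fn += 1
        else:
            if judge == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("calibration set needs both positive and negative human labels")
    return tp / (tp + fn), tn / (tn + fp)


def corrected_score(p_hat: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen style correction of a naive judge score p_hat.

    Inverts E[p_hat] = theta * sensitivity + (1 - theta) * (1 - specificity).
    Assumes the judge is better than chance (sensitivity + specificity > 1).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must satisfy sensitivity + specificity > 1")
    theta = (p_hat + specificity - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the estimate outside.
    return min(1.0, max(0.0, theta))


# Worked example: true score 0.6, sensitivity 0.9, specificity 0.8.
# The naive judge score is 0.6 * 0.9 + 0.4 * 0.2 = 0.62, which overstates 0.6.
naive = 0.6 * 0.9 + 0.4 * 0.2
recovered = corrected_score(naive, sensitivity=0.9, specificity=0.8)
```

In this example the naive score of 0.62 is mapped back to the true score of 0.6. In practice the sensitivity and specificity would themselves be estimated from a human-labeled calibration set, which is the extra source of uncertainty the abstract says the confidence intervals account for.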

representative citing papers

Uncertainty Propagation in LLM-Based Systems

cs.SE · 2026-04-26 · unverdicted · novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

cs.CL · 2026-05-25 · conditional · novelty 6.0

For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation all equal a single number on non-degenerate data; Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.

Open-Ended Task Discovery via Bayesian Optimization

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

Bias and Uncertainty in LLM-as-a-Judge Estimation

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.