Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
arXiv preprint arXiv:2511.21140
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 6
representative citing papers
This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.
Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.
Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.
AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
MaxShapley computes fair document attributions in generative QA by reducing Shapley value calculation to polynomial time via a max-sum utility, matching exact Shapley quality on HotPotQA, MuSiQUE, and MS MARCO while using up to 9x fewer resources.
citing papers explorer
- Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
- Bias and Uncertainty in LLM-as-a-Judge Estimation
Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.
- MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution
MaxShapley computes fair document attributions in generative QA by reducing Shapley value calculation to polynomial time via a max-sum utility, matching exact Shapley quality on HotPotQA, MuSiQUE, and MS MARCO while using up to 9x fewer resources.