pith. machine review for the scientific record.

arxiv: 2605.08558 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links


Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

Haoyang Hong, Huazheng Wang, Muyun Lu, Ying Lin


Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-fidelity bandits · improving proxies · adaptive continuation · regret bounds · LLM evaluation · multi-armed bandits

The pith

In multi-fidelity bandits with improving low-fidelity proxies, adaptive continuation replaces logarithmic high-fidelity sampling with bounded low-fidelity use for intermediate arms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies multi-armed bandit problems where evaluations can come from multiple sources of differing cost and accuracy, with a focus on low-fidelity sources that become more accurate through repeated calibration. It develops a selected-average mismatch bound to convert sequences of improving low-fidelity observations into valid confidence bounds on the underlying high-fidelity means. Using this, the Threshold-Based Adaptive Continuation Companion (TACC) algorithm decides when low-fidelity sampling remains advantageous and when to escalate to high fidelity. The resulting instance-dependent regret bound shows that continuation stays bounded for arms whose value lies between the current best and the continuation threshold. This matters for applications such as using improving LLM judges for policy evaluation, where it can lower total evaluation cost without sacrificing the guarantee.

Core claim

We prove an instance-dependent regret bound showing that, for detected intermediate arms, adaptive continuation replaces logarithmic high-fidelity confirmation with bounded low-fidelity continuation.

What carries the argument

The selected-average mismatch bound, which turns dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target and supports the bounded continuation rule.
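The abstract does not spell out the bound's exact form. A minimal sketch of what an improvement-aware confidence bound could look like, assuming a power-law mismatch model μ_L(n) = μ_H + a·(n + n0)^(−r) consistent with the paper's B_k(n) = O(n^−r) regime, is below; the function name, unit sub-Gaussian noise scale, and Hoeffding-style width are illustrative assumptions, not the paper's construction:

```python
import math

def improvement_aware_ucb(low_obs, a, r, n0=0, delta=0.05):
    """Upper confidence bound on the high-fidelity mean built from a
    sequence of improving low-fidelity observations.

    Hypothetical model: the low-high mismatch of the i-th low-fidelity
    sample is bounded by a * (i + n0) ** (-r), and observation noise is
    sub-Gaussian with unit scale. `low_obs` must be nonempty.
    """
    n = len(low_obs)
    mean = sum(low_obs) / n
    # Average the per-sample mismatch bounds over the selected samples
    # (a "selected-average" style correction for the shrinking bias):
    avg_bias = sum(a * (i + 1 + n0) ** (-r) for i in range(n)) / n
    # Hoeffding-style stochastic half-width at confidence level 1 - delta:
    half_width = math.sqrt(2 * math.log(1 / delta) / n)
    return mean + avg_bias + half_width
```

Because the bias term is averaged rather than taken at its worst-case initial value, the bound tightens as the proxy improves, which is the property the continuation rule exploits.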

If this is right

  • For intermediate arms the total number of low-fidelity samples remains bounded rather than growing logarithmically with time.
  • Cost-weighted regret improves when the proxy's improvement rate satisfies the mismatch condition.
  • The same decision rule applies directly to LLM-as-a-judge evaluation tasks without requiring static bias assumptions.
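These bullets hinge on the continuation budget being horizon-independent. A toy version of the per-arm decision logic shows why; this is a simplified, hypothetical reconstruction (the threshold names, the mismatch form B(n) = a·n^(−r), and the confidence construction are assumptions, not the paper's pseudocode):

```python
import math

def tacc_style_decision(mean_low, n_low, best_lcb, a, r, eta=1e-4, conf=0.1):
    """Simplified continuation rule for one arm (illustrative only).

    Assumes the low-high mismatch after n low-fidelity samples is
    bounded by B(n) = a * n ** (-r).
    """
    n = max(n_low, 1)
    B = a * n ** (-r)                              # current mismatch bound
    width = math.sqrt(2 * math.log(1 / conf) / n)  # stochastic half-width
    ucb = mean_low + B + width
    if ucb < best_lcb:
        return "drop"       # low fidelity already certifies suboptimality
    if B <= eta:
        return "escalate"   # proxy has converged enough; confirm at high fidelity
    # Continuation is bounded: B(n) <= eta once n >= (a / eta) ** (1 / r),
    # a per-arm cap that does not grow with the time horizon.
    return "continue"
```

For example, with a = 1, r = 0.5, and η = 0.1, the rule spends at most (1/0.1)² = 100 low-fidelity samples on an arm before it must either drop or escalate, regardless of the budget.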

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the mismatch bound can be empirically verified for a new proxy, practitioners can safely restrict high-fidelity queries to only the current top contenders.
  • The approach extends to other improving surrogates such as neural-network simulators whose error decreases predictably with calibration data.
  • Proxy designers could target calibration schedules that enlarge the region of arms eligible for bounded continuation.

Load-bearing premise

The low-fidelity source improves with repeated use in a way that can be captured by a selected-average mismatch bound allowing safe bounded continuation decisions instead of high-fidelity escalation.

What would settle it

An experiment or dataset in which low-fidelity mismatch fails to decrease with additional samples as required by the bound, forcing the algorithm to either incur unbounded regret or revert to full high-fidelity confirmation.
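One way to operationalize that test is to estimate the decay exponent of the empirical low-high gap and check that it is bounded away from zero. The diagnostic below is an editorial sketch, not a procedure from the paper:

```python
import math

def mismatch_decay_rate(gaps):
    """Least-squares estimate of r in |gap(n)| ~ a * n ** (-r),
    fit on log-log scale.

    `gaps` holds the absolute low-vs-high-fidelity gap measured after
    1, 2, ..., N low-fidelity samples (all entries positive, N >= 2).
    A clearly positive estimate supports the bounded-continuation
    premise; an estimate near zero or negative is the failure mode
    described above.
    """
    xs = [math.log(i + 1) for i in range(len(gaps))]
    ys = [math.log(g) for g in gaps]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx  # positive: mismatch shrinking with samples
```

On a perfect power-law sequence the estimate recovers the exponent exactly; on a flat (non-improving) gap sequence it returns zero, signaling that continuation offers no safe advantage over high-fidelity escalation.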

Figures

Figures reproduced from arXiv: 2605.08558 by Haoyang Hong, Huazheng Wang, Muyun Lu, Ying Lin.

Figure 1. Synthetic simulation results with r = 0.5. TACC uses bounded low-fidelity continuation to reduce unnecessary high-fidelity queries. view at source ↗
Figure 2. LLM-as-a-judge result under residual low–high mismatch with λ(H) = 500. Curves show mean cost-weighted pseudo-regret over 200 seeds. view at source ↗
Figure 3. Sensitivity of TACC to the decay rate r in the low-fidelity mismatch bound B_k(n) = O(n^−r). Larger r corresponds to faster low-fidelity improvement. view at source ↗
Figure 4. Residual-mismatch regret curves for λ(H) ∈ {200, 500, 1000}. The weak judge improves with use but retains persistent mismatch from the verifier. view at source ↗
Figure 5. Continuation calls in the residual-mismatch experiment with … view at source ↗
Figure 6. Vanishing-mismatch regret curves for λ(H) ∈ {200, 500, 1000}. The low-fidelity source eventually aligns with the high-fidelity target. view at source ↗
Figure 7. Weak-judge prediction accuracy as a function of checkpoint size. view at source ↗
Figure 8. Policy-level low–high gap as a function of weak-judge checkpoint size. view at source ↗
Figure 9. Checkpoint-based bandit experiment with λ(H) = 500 and qmax = 2048. The low-fidelity process is estimated from trained weak-judge checkpoints rather than imposed algebraically from the high-fidelity means. view at source ↗
Figure 10. Additional diagnostics for the checkpoint-based bandit experiment. view at source ↗
read the original abstract

As an extension of the classical multi-armed bandit problem, multi-fidelity multi-armed bandits (MF-MAB) enable individual arms to be evaluated using diverse feedback sources that vary in both cost and accuracy. Prior stochastic models typically assume fixed low-to-high fidelity discrepancies, whereas modern proxy sources, such as learning-based simulators and Large Language Models (LLMs), can be improved using additional calibration. We investigate adaptive MF-MAB with improving proxy sources, and focus on the canonical two-fidelity case in which the low-fidelity source becomes more informative with repeated use. To capture this dynamic, we introduce a selected-average mismatch bound that converts dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target. We propose the Threshold-Based Adaptive Continuation Companion (TACC), an optimistic algorithm that uses a bounded continuation rule to decide when low-fidelity sampling remains cost-effective and when to escalate. We prove an instance-dependent regret bound showing that, for detected intermediate arms, adaptive continuation replaces logarithmic high-fidelity confirmation with bounded low-fidelity continuation. Experiments on synthetic bandits and an LLM-as-a-judge policy-evaluation task examine when continuation improves cost-weighted regret.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper extends multi-fidelity multi-armed bandits to the setting of improving low-fidelity proxies (e.g., calibrated simulators or LLMs). It introduces a selected-average mismatch bound that converts repeated low-fidelity observations into improvement-aware confidence intervals for the high-fidelity target. The Threshold-Based Adaptive Continuation Companion (TACC) algorithm uses an optimistic bounded-continuation rule to decide when low-fidelity sampling remains cost-effective. The central theoretical result is an instance-dependent regret bound showing that, for arms whose empirical gap falls in an intermediate regime, adaptive continuation replaces the usual logarithmic number of high-fidelity pulls with a finite number of low-fidelity samples. Experiments on synthetic instances and an LLM-as-a-judge policy-evaluation task illustrate the resulting cost-weighted regret improvement.

Significance. If the regret analysis holds, the work provides a principled relaxation of the static-bias assumption that has dominated prior MF-MAB literature. The instance-dependent bound and the explicit conversion of proxy improvement into a continuation threshold are technically substantive and directly address practical settings in which low-fidelity sources can be refined. The inclusion of both synthetic verification and a real LLM task strengthens the claim that the mechanism yields measurable savings when high-fidelity evaluations are expensive.

minor comments (3)
  1. §3 (definition of the selected-average mismatch bound): a short paragraph clarifying how the bound is estimated from data and whether it requires any tuning parameter would improve readability for readers unfamiliar with the construction.
  2. Algorithm 1 (TACC pseudocode): the continuation threshold is referenced but not given an explicit line-numbered definition; adding a boxed equation for the threshold would eliminate ambiguity when readers compare the algorithm to the regret proof.
  3. Experiments section: the LLM-as-a-judge task description would benefit from one additional sentence stating how many low-fidelity samples are needed before the mismatch bound stabilizes in the reported runs.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on adaptive multi-fidelity bandits with improving proxies, the recognition of the selected-average mismatch bound and TACC algorithm, and the recommendation for minor revision. We appreciate the assessment that the instance-dependent regret result and LLM experiment strengthen the contribution.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces a selected-average mismatch bound as a modeling assumption that captures dynamic improvement in low-fidelity observations. The TACC algorithm's continuation rule and the instance-dependent regret bound are then derived from this assumption using standard optimistic bandit analysis. No load-bearing step reduces a prediction to a fitted parameter by construction, invokes a self-citation chain, or renames a known result; the derivation remains self-contained against the stated model.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review limited to abstract; ledger reflects only explicitly introduced concepts and standard implicit assumptions of the bandit setting.

axioms (1)
  • domain assumption Low-fidelity observations improve with repeated sampling in a quantifiable manner captured by a mismatch bound.
    This is the core modeling choice enabling the adaptive continuation rule.
invented entities (1)
  • Selected-average mismatch bound no independent evidence
    purpose: Converts dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target.
    Newly proposed mechanism to handle improving proxies.

pith-pipeline@v0.9.0 · 5508 in / 1200 out tokens · 64314 ms · 2026-05-12T01:06:47.141732+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.

  2. [2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

  3. [3] Amir Rezaei Balef, Claire Vernade, and Katharina Eggensperger. Put cash on bandits: A max k-armed problem for automated machine learning. arXiv preprint arXiv:2505.05226.

  4. [4] Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040.

  5. [5] Yuriy Dorn, Aleksandr Katrutsa, Ilgam Latypov, and Anastasiia Soboleva. Functional multi-armed bandit and the best function identification problems. arXiv preprint arXiv:2503.00509.

  6. [6] Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, and Yuriy Dorn. UCB-type algorithm for budget-constrained expert learning. arXiv preprint arXiv:2510.22654.

  7. [7] Marco Mussi, Alessandro Montenegro, Francesco Trovo, Marcello Restelli, and Alberto Maria Metelli. Best arm identification for stochastic rising bandits. arXiv preprint arXiv:2302.07510.

  8. [8] Long Tran-Thanh and Jia Yuan Yu. Functional bandits. arXiv preprint arXiv:1405.2432.

  9. [9] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020.

  10. [10] Paper excerpt (§D.2, LLM-as-a-Judge Experimental Protocol): "The LLM-as-a-judge experiment treats a bandit arm as a prompt-policy rather than as an individual NLI example. A prompt-policy maps an NLI input to a candidate label. The high-fidelity target is the expected verifier correctness of that policy, while the low-fidelity source is a cheaper weak-judge feedback process …"

  11. [11] Paper excerpt (paired-comparison table): "The reported value is TACC minus baseline; negative values favor TACC. Confidence intervals are paired 95% intervals over 200 common seeds."

      Baseline     Mean diff.   95% CI                Favors TACC?
      DNC          −561.1       [−1103.6, −18.6]      Yes
      MF-UCB       −631.4       [−1208.5, −54.4]      Yes
      UCB          −619.4       [−1227.2, −11.6]      Yes
      Weak-Fixed   −2865.0      [−3318.3, −2411.6]    Yes

  12. [12] Paper excerpt (Table 5): "Vanishing-mismatch experiment at final budget Λ = 128000. Values are mean cost-weighted pseudo-regret ± standard error. Lower is better; the best reported method for each high-fidelity cost is bolded."

      λ(H)   TACC          DNC             MF-UCB          UCB             Weak-Fixed
      200    75.7 ± 0.5    677.5 ± 28.6    1503.1 ± 37.1   1506.1 ± 39.4   6503.0 ± 0.0
      500    143.0 ± 0.3   657.7 ± 113.4   2687.8 ± 74.0   2711.1 ± 52.0   6503.0 ± …

  13. [13] Paper excerpt (Figure 8 and Table 7): "For each prompt-policy arm and weak-judge scale q, we compute an empirical low-fidelity policy mean and compare it with the verifier mean. Figure 8 and Table 7 report the resulting policy-level low–high gap. The mean absolute gap decreases from 0.1755 at q = 32 to 0.0452 at q = 512 and 0.0318 at q = 1024, with later checkpoints fluctuating at a lower level. …"
    For each prompt-policy arm and weak-judge scale q, we compute an empirical low-fidelity policy mean and compare it with the verifier mean. Figure 8 and Table 7 report the resulting policy-level low–high gap. The mean absolute gap decreases from 0.1755 at q= 32 to 0.0452 at q= 512 and 0.0318 at q= 1024 , with later checkpoints fluctuating at a lower level....