pith · machine review for the scientific record

arxiv: 2605.06350 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM cascades · model cascades · cost-quality tradeoff · decision-theoretic framework · threshold deferral · shadow prices · routing policies

The pith

LLM cascades achieve their cost-quality frontier as the pointwise envelope of pairwise two-model thresholds, with performance limited by always paying for the cheap model first.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Model cascades route uncertain queries from a cheap LLM to an expensive one using confidence thresholds to manage deployment costs and accuracy. The paper builds a constrained optimization framework that maps out the full cost-quality frontier for any number of models. For two models the frontier is piecewise concave, with shadow prices linking budget and quality constraints. With more models the best deterministic cascade reduces to the envelope of all possible pairs, and longer fixed chains or optimized subsequences add no meaningful improvement. Tests across five benchmarks show a simple pre-generation router often beats cascades because it skips the cheap model's cost on queries sent straight to larger models.
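The threshold-deferral mechanics this summary describes can be sketched in a few lines. Everything below (model costs, accuracy curves, the synthetic confidence scores) is an illustrative stand-in, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-query data (illustrative, not from the paper): the cheap
# model's confidence score, each model's correctness, and per-call costs.
n = 10_000
conf = rng.uniform(0, 1, n)                         # cheap model's confidence s_L
ok_cheap = rng.uniform(0, 1, n) < 0.3 + 0.6 * conf  # higher confidence -> more often correct
ok_exp = rng.uniform(0, 1, n) < 0.85                # expensive model, flat accuracy here
COST_CHEAP, COST_EXP = 1.0, 10.0

def cascade_point(tau):
    """Expected (cost, quality) of a threshold-tau cascade: keep the cheap
    answer when s_L >= tau, otherwise escalate and pay BOTH models."""
    escalate = conf < tau
    # The cheap model is always invoked first -- the structural cost
    # the paper identifies as the cascade's main limitation.
    cost = COST_CHEAP + escalate.mean() * COST_EXP
    quality = np.where(escalate, ok_exp, ok_cheap).mean()
    return cost, quality

# Sweeping tau traces this pair's cost-quality curve.
for tau in (0.0, 0.5, 1.0):
    c, q = cascade_point(tau)
    print(f"tau={tau:.1f}: cost={c:5.2f}, quality={q:.3f}")
```

Sweeping `tau` from 0 ("never escalate", cost 1.0) to 1 ("always escalate", cost 11.0) traces one pair's curve; the envelope result concerns pointwise maxima over many such curves.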

Core claim

The paper develops a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade it establishes piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking budget- and quality-constrained formulations. Given a pool of k models, the frontier achievable by deterministic two-model threshold cascades is the pointwise envelope over all pairwise cascades, with switching points where the optimal pair changes. For k-model cascades, first-order conditions require a single shadow price that equalizes marginal quality-per-cost across stage boundaries. Empirical validation on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers supports these characterizations.
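Read as a constrained program, the two-model formulation plausibly takes the following shape. The notation is ours, a reconstruction from the abstract rather than the paper's exact statement:

```latex
% Budget-constrained two-model cascade (reconstruction; s is the cheap
% model's confidence, \tau the deferral threshold, B the budget):
\max_{\tau}\; Q(\tau)
\quad \text{s.t.} \quad C(\tau) \le B,
\qquad \text{where} \qquad
C(\tau) = c_L + \Pr[s < \tau]\, c_H,
\quad
Q(\tau) = \mathbb{E}\!\left[\mathbf{1}\{s \ge \tau\}\, q_L(s)
        + \mathbf{1}\{s < \tau\}\, q_H(s)\right].
```

On regions where the escalation benefit $q_H(s) - q_L(s)$ is decreasing in $s$, the curve traced by $\tau$ is concave, and the budget constraint's Lagrange multiplier acts as the shadow price; swapping objective and constraint (minimize cost subject to $Q \ge q_0$) would give the reciprocal multiplier, which is consistent with the reciprocal-shadow-price claim.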

What carries the argument

The pointwise envelope over all pairwise two-model threshold cascades, derived via constrained optimization and duality, that characterizes the achievable frontier for any model pool.
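A minimal sketch of that envelope construction, using hypothetical per-pair frontiers and a hypothetical cost grid (the paper's own computation may differ):

```python
import numpy as np

def pointwise_envelope(frontiers, cost_grid):
    """Upper envelope over per-pair cost-quality frontiers.

    frontiers: list of (costs, qualities) arrays, one per model pair,
    each sorted by cost. Returns, for every budget in cost_grid, the best
    quality any single pair achieves within that budget, plus which pair
    achieves it. (Illustrative sketch, not the paper's code.)
    """
    best = np.full(len(cost_grid), -np.inf)
    which_pair = np.full(len(cost_grid), -1)
    for i, (costs, quals) in enumerate(frontiers):
        # Best quality this pair reaches at or under each budget.
        running = np.maximum.accumulate(quals)
        idx = np.searchsorted(costs, cost_grid, side="right") - 1
        feasible = idx >= 0
        q = np.where(feasible, running[np.clip(idx, 0, None)], -np.inf)
        improved = q > best
        best = np.where(improved, q, best)
        which_pair = np.where(improved, i, which_pair)
    return best, which_pair  # switching points: where which_pair changes

# Two hypothetical pairs: one cheap but flat, one costly but strong.
pair_a = (np.array([1.0, 2.0, 3.0]), np.array([0.60, 0.70, 0.72]))
pair_b = (np.array([2.5, 5.0, 8.0]), np.array([0.65, 0.80, 0.88]))
env, pair_id = pointwise_envelope([pair_a, pair_b], np.array([1.0, 3.0, 6.0, 9.0]))
print(env, pair_id)
```

The budgets where `pair_id` changes are the switching points where the optimal pair changes; the claim under review is that this envelope already exhausts what deterministic threshold cascades can do.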

If this is right

  • The optimal deterministic cascade policy is always one of the pairwise threshold cascades, with the best pair selected per region.
  • Full fixed k-model chains lie strictly inside or on the pairwise envelope and therefore cannot improve the frontier.
  • Optimized subsequence cascades deliver no practically meaningful held-out gains beyond the pairwise envelope.
  • A pre-generation router that avoids running the cheap model on queries routed directly to larger models exceeds the best cascade on four of five datasets.
  • Cascade performance is limited primarily by the structural cost of always invoking the cheap model before any escalation decision rather than by a shortage of intermediate stages.
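The last two bullets reduce to simple cost accounting. The numbers here are illustrative, not the paper's measurements:

```python
# Why a pre-generation router can beat a cascade on cost alone
# (illustrative numbers, not the paper's).
COST_CHEAP, COST_EXP = 1.0, 10.0
p_escalate = 0.4   # fraction of queries the cascade defers
p_route_big = 0.4  # fraction the router sends straight to the big model

# Cascade: every query pays the cheap model; escalated ones pay both.
cascade_cost = COST_CHEAP + p_escalate * COST_EXP

# Router: each query pays exactly one model.
router_cost = (1 - p_route_big) * COST_CHEAP + p_route_big * COST_EXP

print(cascade_cost, router_cost)  # 5.0 4.6
# The saving equals the cheap model's cost on queries routed away from it.
```

At equal routing/escalation rates the gap is exactly `p * COST_CHEAP`: the structural cost of always invoking the cheap model first, independent of how good the routing signal is.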

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could explore routers that choose the initial model without first paying for a cheap one, potentially shifting the frontier outward.
  • The shadow-price equalization condition suggests a natural way to set dynamic per-query pricing in multi-model serving systems.
  • The envelope result implies that exhaustive search over pairs is often enough; practitioners need not maintain long fixed chains.
  • If confidence scores are noisy, the framework predicts the frontier will flatten earlier, making early-exit routers even more attractive.

Load-bearing premise

Deterministic threshold-based deferral using confidence scores is sufficient to trace the relevant cost-quality frontier.

What would settle it

A held-out experiment in which a learned or stochastic deferral policy produces a cost-quality curve strictly above the pairwise envelope on any of the five benchmarks.
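Why this is a demanding test can be seen in a toy simulation (ours, not the paper's): randomizing between two deterministic thresholds only interpolates their (cost, quality) points, so on a concave frontier stochastic mixing fills in chords rather than rising above the envelope.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-query data (illustrative): confidence and correctness.
n = 50_000
conf = rng.uniform(0, 1, n)
ok_cheap = rng.uniform(0, 1, n) < 0.3 + 0.6 * conf
ok_exp = rng.uniform(0, 1, n) < 0.85
C_L, C_H = 1.0, 10.0

def point(defer):
    """(cost, quality) of a cascade policy given its per-query deferral mask."""
    return C_L + defer.mean() * C_H, np.where(defer, ok_exp, ok_cheap).mean()

# Two deterministic thresholds and a fair coin-flip mixture of them.
d1, d2 = conf < 0.3, conf < 0.7
mix = np.where(rng.uniform(0, 1, n) < 0.5, d1, d2)
(c1, q1), (c2, q2), (cm, qm) = point(d1), point(d2), point(mix)
# Up to sampling noise, (cm, qm) sits on the chord between the two
# deterministic points; a deferral policy strictly ABOVE the pairwise
# envelope on held-out data would refute the sufficiency premise.
print((c1, q1), (cm, qm), (c2, q2))
```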

Figures

Figures reproduced from arXiv: 2605.06350 by Dylan Bouchard.

Figure 1: Optimal pair, full fixed chain, optimal subsequence, and diagnostic learned router across five …
Figure 2: Escalation benefit mH(s) − mL(s) as a function of the cheap model's confidence score sL, for one representative pair per dataset (the pair dominating the largest cost range on the envelope). Solid curves are medians across 50 random 50/50 splits; shaded bands are 10th–90th percentiles. The dotted line marks zero. Annotations report the expected-dominance fraction (dom) and decreasing-benefit fraction (dec) …
Figure 3: Scorer choice ablation. Points show median pair-level normalized gain over a no-signal …
Figure 4: Descriptive cost-quality Pareto frontiers per dataset. Gray curves are per-pair frontiers …
Figure 5: Escalation benefit curves for all 28 pairs on MMLU.
Figure 6: Escalation benefit curves for all 28 pairs on TriviaQA.
Figure 7: Escalation benefit curves for all 28 pairs on MATH (levels 3–5).
Figure 8: Escalation benefit curves for all 28 pairs on SimpleQA.
Figure 9: Escalation benefit curves for all 21 pairs on LiveCodeBench (GPT-oss-20B excluded; see …
Figure 10: Median quality gap …
Figure 11: Median Pareto frontier (over 50 splits) for NSGA-II and random search versus the pairwise …
Original abstract

Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper develops a decision-theoretic framework for LLM cascades grounded in constrained optimization and duality. For two-model cascades it establishes piecewise concavity of the cost-quality frontier on decreasing-benefit regions together with reciprocal shadow prices between budget- and quality-constrained formulations. For a pool of k models it shows that the frontier achievable by deterministic two-model threshold cascades is the pointwise envelope over all pairwise cascades, with switching points where the optimal pair changes, and derives first-order conditions that equalize marginal quality-per-cost across stage boundaries for k-model cascades. Empirically, on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) with eight models, full fixed chains underperform the pairwise envelope, optimized subsequence cascades yield no practically meaningful held-out gains, and a lightweight pre-generation router outperforms the best cascade policy on four datasets primarily by skipping the cheap model's generation cost.

Significance. If the characterizations hold, the work supplies a precise geometric and duality-based description of cascade frontiers that directly explains why structural cost (mandatory invocation of the cheap model before any deferral decision) dominates performance limits more than the number of intermediate stages. The empirical patterns across multiple benchmarks and model providers lend concrete support to this structural-cost interpretation and suggest that future routing research should prioritize pre-generation decisions over multi-stage escalation. The explicit scoping to the deterministic threshold-cascade class and the use of standard optimization duality are strengths that make the theoretical results falsifiable and extensible.

major comments (2)
  1. [§4] §4 (multi-model characterization): the claim that the k-model frontier is exactly the pointwise envelope over pairwise cascades assumes that no three-or-more-model subsequence can improve upon the best pairwise envelope at any operating point; while the first-order conditions in §5 are consistent with this, the manuscript does not provide a formal proof that the envelope is globally optimal within the deterministic-threshold class, which is load-bearing for the conclusion that 'shortage of intermediate stages' is not the limiting factor.
  2. [Empirical results] Empirical section (results on subsequence cascades): the statement that optimized subsequence cascades 'do not deliver practically meaningful held-out gains' is central to the structural-cost claim, yet the manuscript reports only qualitative patterns without effect sizes, confidence intervals, or statistical tests on the held-out differences; this weakens the ability to rule out that small but consistent gains exist on some datasets.
minor comments (3)
  1. [§3] The abstract and §3 refer to 'decreasing-benefit regions of the confidence support' without a precise definition or example of how these regions are identified from the empirical confidence distribution.
  2. [Figures] Figure captions and axis labels for the cost-quality frontiers should explicitly state whether the plotted points are in-sample or held-out and whether thresholds were chosen by grid search or by the derived first-order conditions.
  3. [Empirical results] The pre-generation router is described as 'lightweight' but the manuscript does not report its training procedure, feature set, or computational overhead relative to the cascade policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The recommendation for minor revision is appreciated, and we address each major comment below with proposed revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (multi-model characterization): the claim that the k-model frontier is exactly the pointwise envelope over pairwise cascades assumes that no three-or-more-model subsequence can improve upon the best pairwise envelope at any operating point; while the first-order conditions in §5 are consistent with this, the manuscript does not provide a formal proof that the envelope is globally optimal within the deterministic-threshold class, which is load-bearing for the conclusion that 'shortage of intermediate stages' is not the limiting factor.

    Authors: We thank the referee for this observation. Section 4 characterizes the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over all pairwise cascades, with switching points where the optimal pair changes. The first-order conditions in Section 5 for k-model cascades equalize marginal quality-per-cost across stage boundaries via a single shadow price. These conditions are consistent with the interpretation that optimal multi-stage policies effectively operate as sequences of pairwise segments. However, we agree that an explicit formal proof establishing global optimality of the pairwise envelope within the full deterministic-threshold class (i.e., that no three-or-more-model subsequence can strictly improve upon it at any operating point) is not provided. In the revised manuscript we will add a concise proof sketch in §4 or an appendix, showing that any k-stage cascade's (cost, quality) points are bounded above by the pairwise envelope using the established piecewise concavity and duality results for pairs. This will directly bolster the claim that shortage of intermediate stages is not the primary limitation. revision: yes

  2. Referee: [Empirical results] Empirical section (results on subsequence cascades): the statement that optimized subsequence cascades 'do not deliver practically meaningful held-out gains' is central to the structural-cost claim, yet the manuscript reports only qualitative patterns without effect sizes, confidence intervals, or statistical tests on the held-out differences; this weakens the ability to rule out that small but consistent gains exist on some datasets.

    Authors: We agree that the current empirical presentation of the subsequence-cascade results relies primarily on qualitative patterns across the five benchmarks. To strengthen this central claim, the revised manuscript will augment the empirical section with quantitative details: absolute and relative differences in accuracy and cost between optimized subsequence cascades and the pairwise envelope, bootstrap-derived 95% confidence intervals on the held-out sets, and results from paired statistical tests (e.g., Wilcoxon signed-rank or McNemar's test) to assess whether observed differences are statistically significant. These additions will allow readers to evaluate the practical magnitude of any gains more rigorously and will reinforce the structural-cost interpretation. revision: yes
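The proposed quantitative additions might look like the following sketch: a paired bootstrap confidence interval on the held-out gain across splits. The split-level accuracies here are synthetic, and the estimator choice is ours (the authors also propose Wilcoxon signed-rank or McNemar's test).

```python
import numpy as np

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a - b) over paired per-split
    results (sketch; the paper may use different estimators)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a, float) - np.asarray(b, float)
    n = len(diffs)
    stats = np.array([diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Hypothetical held-out accuracies across 50 random splits.
rng = np.random.default_rng(1)
subseq = 0.80 + 0.01 * rng.standard_normal(50)               # optimized subsequence cascade
envelope = subseq - 0.001 + 0.002 * rng.standard_normal(50)  # pairwise envelope
mean_gain, (lo, hi) = paired_bootstrap_ci(subseq, envelope)
print(f"mean gain {mean_gain:+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
# A CI straddling zero, or a tiny positive gain, would support the
# 'no practically meaningful held-out gains' reading.
```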

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives its decision-theoretic results (piecewise concavity, reciprocal shadow prices, pointwise envelope over pairwise cascades, and first-order marginal conditions) directly from standard constrained optimization and duality applied to the deterministic threshold-cascade formulation. These characterizations are obtained prior to and independently of any data fitting; the empirical sections optimize thresholds per pair and validate on benchmarks but do not redefine or force the theoretical claims by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no fitted quantity is relabeled as a prediction. The structural-cost conclusion follows from the model structure (always invoking the cheap model first) rather than from circular reduction to the observed performance numbers.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework applies standard constrained optimization and duality to the cascade setting; the main added elements are the empirical fitting of per-pair thresholds on the listed benchmarks and the assumption that deterministic thresholds suffice to achieve the frontier.

free parameters (1)
  • deferral thresholds
    Optimized per model pair to trace the cost-quality frontier in the constrained problems and validation experiments.
axioms (1)
  • Domain assumption: LLM confidence scores are sufficiently calibrated to support meaningful deferral decisions.
    Required for the threshold-based cascade model to produce the described frontier.

pith-pipeline@v0.9.0 · 5594 in / 1404 out tokens · 46609 ms · 2026-05-08T12:51:51.530529+00:00 · methodology


Reference graph

Works this paper leans on

115 extracted references · 8 canonical work pages
