pith. machine review for the scientific record.

arxiv: 2604.12491 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

Calibrated Confidence Estimation for Tabular Question Answering

Pith reviewed 2026-05-10 14:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords tabular question answering · confidence calibration · LLM overconfidence · multi-format agreement · expected calibration error · AUROC · perturbation methods · structured data

The pith

Agreement across lossless table formats such as Markdown and JSON calibrates LLM confidence more accurately than self-ratings, and at lower cost than repeated sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are overconfident when answering questions about tables, with expected calibration error far higher than on plain text. The paper compares five estimation methods across multiple models and benchmarks and finds that perturbation approaches consistently outperform self-evaluation ones. It introduces Multi-Format Agreement, which measures consistency across four deterministic serializations of the same table to produce confidence scores. This method cuts calibration error by 44-63 percent, reaches mean AUROC of 0.80, and works at lower cost than repeated sampling while combining well with it.
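To pin down the two metrics quoted throughout, here is a minimal sketch (not the paper's code) of binned ECE and AUROC over per-question (confidence, correct) pairs. The paper itself reports smooth ECE, which replaces the fixed bins below with kernel smoothing [Błasiok and Nakkiran, 2024].

```python
import numpy as np

def binned_ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    conf    : predicted confidences in [0, 1]
    correct : 1 if the answer was right, else 0
    Note: the paper reports *smooth* (kernel-based) ECE; fixed bins
    are used here only to keep the sketch short.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & ((conf < hi) | (hi == 1.0))
        if mask.any():
            # each bin contributes |mean confidence - accuracy|, weighted by size
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def auroc(conf, correct):
    """P(confidence of a correct answer > confidence of a wrong one); ties count 1/2."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```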

Core claim

The paper shows that Multi-Format Agreement exploits a property unique to structured data, namely that one table admits multiple lossless serializations, and that agreement across these serializations yields reliable confidence estimates: ECE falls by 44-63 percent, AUROC reaches 0.80 on TableBench, and the method generalizes across models while complementing sampling-based approaches.

What carries the argument

Multi-Format Agreement (MFA): confidence derived from agreement among model answers when the input table is serialized in four different lossless formats (Markdown, HTML, JSON, CSV).
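A minimal sketch of how such an agreement score could be computed. The four formats are the paper's; `ask_model`, the answer normalizer, and the single deterministic call per format are hypothetical stand-ins, and the paper's exact prompts and decision rules are not reproduced here.

```python
import csv, io, json

def serializations(header, rows):
    """Four lossless views of the same table (Markdown, HTML, JSON, CSV)."""
    md = "| " + " | ".join(header) + " |\n" \
       + "| " + " | ".join("---" for _ in header) + " |\n" \
       + "\n".join("| " + " | ".join(map(str, r)) + " |" for r in rows)
    html = "<table><tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>" \
         + "".join("<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>" for r in rows) \
         + "</table>"
    js = json.dumps([dict(zip(header, r)) for r in rows])
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(header)
    w.writerows(rows)
    return {"markdown": md, "html": html, "json": js, "csv": buf.getvalue()}

def mfa_confidence(question, header, rows, ask_model, normalize=str.strip):
    """Confidence = share of formats that agree with the majority answer.

    ask_model(question, table_text) is a hypothetical wrapper around one
    LLM call; the paper's prompting may differ.
    """
    answers = [normalize(ask_model(question, text))
               for text in serializations(header, rows).values()]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / len(answers), majority
```

Four formats means four API calls per question, which is where the cost advantage over sampling baselines with more completions comes from.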

If this is right

  • MFA reduces expected calibration error by 44-63% across tested models.
  • It achieves mean AUROC of 0.80 on TableBench and generalizes to four models.
  • An MFA plus self-consistency ensemble raises AUROC from 0.74 to 0.82 at 20% lower API cost; a minimal sketch of one such blend follows this list.
  • Structure-aware recalibration improves AUROC by 10 points over standard post-hoc methods.
  • The performance gap between self-evaluation and perturbation methods holds across both benchmarks and all fully covered models.
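As referenced above, one way the two signals could be blended. The equal-weight average and the majority-vote form of the self-consistency score are assumptions; the paper's exact ensembling rule is not given here.

```python
def ensemble_confidence(mfa_score, sc_answers, mfa_weight=0.5):
    """Blend MFA agreement with self-consistency agreement.

    mfa_score  : agreement fraction from mfa_confidence (earlier sketch)
    sc_answers : answers from k sampled (temperature > 0) generations
    The equal-weight average is an assumption, not the paper's rule.
    """
    majority = max(set(sc_answers), key=sc_answers.count)
    sc_score = sc_answers.count(majority) / len(sc_answers)
    return mfa_weight * mfa_score + (1 - mfa_weight) * sc_score
```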

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production tabular QA systems could adopt format variation as a default low-overhead calibration step.
  • Structured inputs offer natural diversity signals that free-text methods lack, suggesting similar techniques for code or database queries.
  • Optimal weighting or selection among formats could further improve results without additional model calls.
  • The approach may transfer to other domains where data admits multiple canonical representations.

Load-bearing premise

That the different lossless table serializations produce independent, unbiased signals of uncertainty without format-specific artifacts that distort AUROC or ECE measurements.

What would settle it

Finding no ECE reduction or AUROC gain for MFA over single-format baselines on a new tabular QA benchmark with varied table structures would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.12491 by Lukas Voss.

Figure 1. Smooth reliability diagrams [Błasiok and Nakkiran, 2024] with 90% bootstrap confidence …
Figure 2. Five confidence elicitation methods on Llama-3.3-70B. Self-evaluation methods (red) …
Figure 3. P(True) self-evaluation fails consistently across three models. Even after the model is …
Figure 4. Confidence CDFs for correct versus wrong answers (verbalized confidence). GPT-4o …
Figure 5. MFA improvements generalize from WikiTableQuestions to TableBench, a substantially …
Figure 6. Feature importance for structure-aware recalibration on Llama-3.3-70B (scaled logistic …
Figure 7. ECE and Brier score before and after recalibration. Post-hoc methods reduce ECE to below …
Figure 8. Risk-coverage curves for the five elicitation methods on Llama-3.3-70B. Self-evaluation …
Figure 9. MFA agreement distribution for each model on WTQ (blue) versus TableBench (red). WTQ …
original abstract

Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
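A hedged reconstruction of the abstract's statistical controls, reusing the `auroc` helper from the first sketch. Resampling whole questions (the natural pairing unit, since both methods score the same questions) and the one-sided test direction are assumptions about the paper's setup.

```python
import numpy as np

def paired_bootstrap_p(conf_a, conf_b, correct, n_boot=10_000, seed=0):
    """One-sided paired bootstrap p-value for AUROC(A) > AUROC(B).

    Question indices are resampled jointly for both methods (the
    pairing); counts how often the advantage disappears.
    """
    rng = np.random.default_rng(seed)
    n = len(correct)
    losses = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if auroc(conf_a[idx], correct[idx]) <= auroc(conf_b[idx], correct[idx]):
            losses += 1
    return (losses + 1) / (n_boot + 1)

def holm_bonferroni(pvals, alpha=0.001):
    """Holm's step-down correction: which hypotheses are rejected at level alpha."""
    pvals = np.asarray(pvals, dtype=float)
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # threshold tightens as m - rank shrinks
            reject[i] = True
        else:
            break  # step-down: stop at the first failure
    return reject
```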

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first systematic study of confidence calibration for LLMs on tabular question answering. It evaluates five estimation methods across five frontier models and two benchmarks, documenting severe overconfidence (ECE 0.35-0.64). It identifies a consistent gap between self-evaluation methods (AUROC 0.42-0.76) and perturbation methods (AUROC 0.78-0.86), supported by paired bootstrap tests with Holm-Bonferroni correction and low seed variance. The central contribution is Multi-Format Agreement (MFA), which uses agreement across four lossless table serializations (Markdown, HTML, JSON, CSV) to estimate confidence at reduced API cost. MFA is reported to cut ECE by 44-63%, achieve mean AUROC 0.80 on TableBench, generalize across models, and combine complementarily with self-consistency (lifting AUROC from 0.74 to 0.82). A secondary method, structure-aware recalibration, improves AUROC by 10 points over standard post-hoc approaches.

Significance. If the empirical claims hold, the work fills a clear gap in calibration research for structured data, where LLMs are increasingly applied. MFA exploits a property unique to tabular inputs (deterministic lossless serializations) to deliver a lower-cost uncertainty signal than sampling baselines. The consistent self-evaluation vs. perturbation dichotomy across benchmarks and models, together with the reported statistical controls (paired bootstraps, correction, 3-seed variance of 0.006), provides reproducible evidence. The complementarity result and the recalibration technique have direct practical value for reliable deployment of LLMs on tables.

major comments (2)
  1. [MFA definition and experimental results] The load-bearing assumption for MFA—that the four serializations supply sufficiently independent, unbiased signals whose disagreements primarily reflect model uncertainty—is not directly tested. The manuscript does not report per-format accuracy, ECE, or AUROC statistics (e.g., in the experimental results section or any accompanying table), leaving open the possibility that format-specific performance differences (e.g., JSON vs. HTML) contribute to the observed agreement metric. This directly affects the interpretation of the reported AUROC 0.80, ECE reductions of 44-63%, and the complementarity claim with sampling.
  2. [Results on TableBench and recalibration experiments] The generalization claim for MFA (mean AUROC 0.80 across four models on TableBench) and the +10-point AUROC gain from structure-aware recalibration rest on the same untested independence assumption. Adding per-format breakdowns and a controlled ablation (e.g., agreement on random vs. format-matched perturbations) would be required to confirm that the gains are not partly artifacts of format compatibility.
minor comments (2)
  1. [Methods] The abstract and methods could more explicitly state the exact prompt templates and decision rules used for each serialization to enable exact reproduction.
  2. [Figures] Figure captions for the calibration plots should include the exact number of examples per bin and the smoothing parameter for the reported smooth ECE.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and constructive major comments. We address each point below and will revise the manuscript accordingly to strengthen the empirical support for MFA.

point-by-point responses
  1. Referee: [MFA definition and experimental results] The load-bearing assumption for MFA—that the four serializations supply sufficiently independent, unbiased signals whose disagreements primarily reflect model uncertainty—is not directly tested. The manuscript does not report per-format accuracy, ECE, or AUROC statistics (e.g., in the experimental results section or any accompanying table), leaving open the possibility that format-specific performance differences (e.g., JSON vs. HTML) contribute to the observed agreement metric. This directly affects the interpretation of the reported AUROC 0.80, ECE reductions of 44-63%, and the complementarity claim with sampling.

    Authors: We agree that per-format performance statistics were not reported and that they would help validate the independence assumption. In the revised manuscript we will add a new table reporting accuracy, ECE, and AUROC for each individual serialization (Markdown, HTML, JSON, CSV) across all models and both benchmarks. We will also report pairwise agreement rates between formats to show that systematic format biases do not dominate the disagreement signal. These additions will allow readers to assess whether format-specific differences drive the MFA results. The MFA definition itself (majority agreement across lossless serializations) remains unchanged, but the new statistics will clarify its interpretation and support the reported AUROC and ECE gains; a sketch of these diagnostics follows the responses. revision: yes

  2. Referee: [Results on TableBench and recalibration experiments] The generalization claim for MFA (mean AUROC 0.80 across four models on TableBench) and the +10-point AUROC gain from structure-aware recalibration rest on the same untested independence assumption. Adding per-format breakdowns and a controlled ablation (e.g., agreement on random vs. format-matched perturbations) would be required to confirm that the gains are not partly artifacts of format compatibility.

    Authors: We accept that a controlled ablation would further isolate the contribution of deterministic format variation. In the revision we will include the per-format breakdowns noted above. We will also add an ablation comparing MFA (using the four lossless formats) against agreement computed on randomly perturbed versions of a single format (e.g., multiple noisy Markdown serializations); the sketch below illustrates the comparison. This will test whether gains arise from the unique lossless serialization diversity of tabular data rather than generic perturbation effects. The structure-aware recalibration method operates on post-hoc features of the table structure and is independent of MFA; we will clarify this distinction in the text and report its results separately. These changes will strengthen the generalization and complementarity claims. revision: yes
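To make the promised diagnostics concrete, a minimal sketch of both additions under stated assumptions: per-format accuracy with pairwise agreement rates, and the single-format perturbation baseline for the ablation. `ask_model`, the answer normalizer, the data layout, and the row-shuffle perturbation are hypothetical stand-ins, not the authors' protocol.

```python
from itertools import combinations
import random

def per_format_report(answers_by_format, gold, normalize=str.strip):
    """Per-format accuracy and pairwise agreement rates.

    answers_by_format maps a format name ("markdown", "html", "json",
    "csv") to one answer per question; this layout is an assumption.
    """
    norm = {f: [normalize(a) for a in ans] for f, ans in answers_by_format.items()}
    accuracy = {f: sum(a == g for a, g in zip(ans, gold)) / len(gold)
                for f, ans in norm.items()}
    agreement = {(f, g): sum(x == y for x, y in zip(norm[f], norm[g])) / len(gold)
                 for f, g in combinations(norm, 2)}
    return accuracy, agreement

def perturbed_single_format_agreement(question, md_table, ask_model, k=4,
                                      seed=0, normalize=str.strip):
    """Ablation baseline: agreement over k noisy copies of ONE format.

    Shuffling the data rows of a Markdown table is a hypothetical
    answer-preserving perturbation; any label-preserving noise would do.
    """
    rng = random.Random(seed)
    header, sep, *body = md_table.splitlines()
    answers = []
    for _ in range(k):
        rows = body[:]
        rng.shuffle(rows)
        answers.append(normalize(ask_model(question, "\n".join([header, sep] + rows))))
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / k
```

If MFA's agreement score separates right from wrong answers better than this baseline at equal call count, the gain is attributable to format diversity rather than generic perturbation.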

Circularity Check

0 steps flagged

No circularity: purely empirical method comparisons with no derivations or self-referential reductions

full rationale

The paper conducts systematic empirical evaluations of five confidence estimation methods (including the proposed Multi-Format Agreement) across LLMs and tabular QA benchmarks, reporting AUROC, ECE, and bootstrap tests. No equations, derivations, or fitted parameters are defined in terms of the target quantities; MFA is introduced as a practical serialization-variation heuristic and evaluated directly via measurements rather than derived from self-citations or ansatzes. All central claims rest on external experimental outcomes (per-model paired tests, seed variance checks) that do not reduce to the paper's own inputs by construction. The analysis is self-contained against benchmarks with no load-bearing self-citation chains or renaming of known results as novel derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Relies on standard evaluation metrics and statistical tests; introduces one new method (MFA) but no free parameters or invented physical entities.

axioms (2)
  • standard math Bootstrap resampling yields valid p-values for paired AUROC comparisons after multiple-testing correction
    Invoked for the p<0.001 results with Holm-Bonferroni correction.
  • domain assumption AUROC and smooth ECE are appropriate metrics for assessing confidence estimation quality
    Used throughout to compare methods.
invented entities (1)
  • Multi-Format Agreement (MFA): no independent evidence
    purpose: Confidence estimation via agreement across lossless table serializations
    New technique proposed to exploit structured data properties.

pith-pipeline@v0.9.0 · 5620 in / 1442 out tokens · 60738 ms · 2026-05-10T14:56:36.470122+00:00 · methodology

Reference graph

Works this paper leans on

61 extracted references · 12 canonical work pages · 1 internal anchor

  3. [3]

    DiverseAgentEntropy: Knowledge-preserving query reformulation for multi-agent uncertainty estimation

    AWS AI Labs. DiverseAgentEntropy: Knowledge-preserving query reformulation for multi-agent uncertainty estimation. In Findings of EMNLP, 2025

  4. [4]

    Smooth ECE: Principled reliability diagrams via kernel smoothing

    Jarosław Błasiok and Preetum Nakkiran. Smooth ECE: Principled reliability diagrams via kernel smoothing. In Proceedings of ICLR, 2024

  5. [5]

    Elephants never forget: Memorization and learning of tabular data in large language models

    Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, and Rich Caruana. Elephants never forget: Memorization and learning of tabular data in large language models. In Proceedings of COLM, 2024

  6. [6]

    Adaptive abstention for text-to-SQL via conformal prediction on hidden layers

    Wei Chen et al. Adaptive abstention for text-to-SQL via conformal prediction on hidden layers. In Proceedings of SIGMOD, 2025

  7. [7]

    FinQA: A dataset of numerical reasoning over financial data

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Siamak Shakeri, et al. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of EMNLP, 2021

  8. [8]

    Multicalibration of language models

    Gianluca Detommaso et al. Multicalibration of language models. In Proceedings of ICML, 2024

  9. [9]

    Confidence estimation for LLM-based text-to-SQL

    Reza Entezari Maleki et al. Confidence estimation for LLM-based text-to-SQL. In Proceedings of AAAI, 2025

  10. [10]

    SPUQ: Perturbation-based uncertainty quantification for large language models

    Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. In Proceedings of EACL, 2024

  11. [11]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  12. [12]

    A survey of confidence estimation and calibration in large language models

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. arXiv preprint arXiv:2311.08298, 2024

  13. [13]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2017

  14. [14]

    Large language models cannot self-correct reasoning yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In Proceedings of ICLR, 2024

  15. [15]

    Maximizing overall diversity for improved uncertainty estimates in deep ensembles

    Siddhartha Jain, Ge Liu, Jonas Mueller, and David Gifford. Maximizing overall diversity for improved uncertainty estimates in deep ensembles. In Proceedings of AAAI, 2020

  16. [16]

    Sample-dependent adaptive temperature scaling for improved calibration

    Tom Joy et al. Sample-dependent adaptive temperature scaling for improved calibration. In Proceedings of AAAI, 2023

  17. [17]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  18. [18]

    CCPS: Calibrated confidence via perturbed hidden states

    Reza Khanmohammadi et al. CCPS: Calibrated confidence via perturbed hidden states. In Proceedings of EMNLP, 2025

  19. [19]

    Semantic entropy probes: Uncertainty from hidden states

    Jannik Kossen et al. Semantic entropy probes: Uncertainty from hidden states. In Proceedings of ICLR, 2025

  20. [20]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In Proceedings of ICLR, 2023

  21. [21]

    Verified uncertainty calibration

    Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019

  22. [22]

    Conformal prediction with large language models for multi-choice question answering

    Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, and Andrew Beam. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023

  23. [23]

    Evidential semantic entropy for LLM uncertainty quantification

    Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda, and Masahiro Hamasaki. Evidential semantic entropy for LLM uncertainty quantification. In Proceedings of EACL, 2026

  24. [24]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017

  25. [25]

    TrustSQL: A reliability benchmark for text-to-SQL with penalty-based scoring and uncertainty-based abstention

    Gyubok Lee et al. TrustSQL: A reliability benchmark for text-to-SQL with penalty-based scoring and uncertainty-based abstention. In Proceedings of ICLR, 2025

  26. [26]

    ConfTuner: Tokenized Brier score for verbalized confidence fine-tuning

    Hao Li et al. ConfTuner: Tokenized Brier score for verbalized confidence fine-tuning. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  27. [27]

    Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQL

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQL. In Advances in Neural Information Processing Systems (NeurIPS), 2024a

  28. [28]

    Few-shot recalibration of language models

    Xiang Lisa Li, Urvashi Khandelwal, and Kelvin Guu. Few-shot recalibration of language models. arXiv preprint arXiv:2403.18286, 2024b

  29. [29]

    Calibrating LLM-based evaluator for text-to-SQL

    Hansong Liu et al. Calibrating LLM-based evaluator for text-to-SQL. In Proceedings of EMNLP, 2025

  30. [30]

    Semantic energy: Detecting LLM hallucination beyond entropy

    Huan Ma, Jiadong Pan, Jing Liu, Yan Chen, Joey Tianyi Zhou, Guangyu Wang, Qinghua Hu, Hua Wu, Changqing Zhang, and Haifeng Wang. Semantic energy: Detecting LLM hallucination beyond entropy. arXiv preprint arXiv:2508.14496, 2025

  31. [31]

    Confidence scoring for LLM-generated SQL in supply chain data extraction

    Jiekai Ma and Yikai Zhao. Confidence scoring for LLM-generated SQL in supply chain data extraction. In KDD Workshop on AI for Supply Chain, 2025

  32. [32]

    QA-calibration: Conditional calibration of LLM confidence via input-group embeddings

    Pia Manggala et al. QA-calibration: Conditional calibration of LLM confidence via input-group embeddings. In Proceedings of ICLR, 2025

  33. [33]

    Language models with conformal factuality guarantees

    Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of ICML, 2024

  34. [34]

    Benchmarking uncertainty calibration in large language model long-form question answering

    Philip Müller, Nicholas Popović, Michael Färber, and Peter Steinbach. Benchmarking uncertainty calibration in large language model long-form question answering. arXiv preprint arXiv:2602.00279, 2026

  35. [35]

    Beyond semantic entropy: Smooth nearest neighbor entropy for long-form LLM uncertainty

    Hai Nguyen et al. Beyond semantic entropy: Smooth nearest neighbor entropy for long-form LLM uncertainty. In Findings of the Association for Computational Linguistics: ACL, 2025

  36. [36]

    Nemotron-4 340B technical report

    NVIDIA. Nemotron-4 340B technical report. Technical report, NVIDIA, 2024. Lists WikiTableQuestions among supervised fine-tuning datasets

  37. [37]

    Compositional semantic parsing on semi-structured tables

    Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. In Proceedings of ACL, 2015

  38. [38]

    Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods

    John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 1999

  39. [39]

    Strength in numbers: Estimating confidence of large language models via multiple completions

    Glenn Portillo Wightman, Alexandra Delucia, and Mark Dredze. Strength in numbers: Estimating confidence of large language models via multiple completions. In Proceedings of the TrustNLP Workshop at ACL, 2023

  40. [40]

    Text-to-SQL calibration: No need to ask---just rescale model probabilities

    Ashwin Ramachandran and Sunita Sarawagi. Text-to-SQL calibration: No need to ask---just rescale model probabilities. arXiv preprint arXiv:2411.16742, 2024

  41. [41]

    Mapping from meaning: Modeling prompt sensitivity as generalization error

    Yair Reing et al. Mapping from meaning: Modeling prompt sensitivity as generalization error. In Proceedings of AAAI, 2025

  42. [42]

    Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs

    Ananya Singha, José Cambronero, Sumit Gulwani, Vu Le, and Chris Parnin. Tabular representation, noisy operators, and impacts on table structure understanding tasks in LLMs. arXiv preprint arXiv:2310.10358, 2023

  43. [43]

    Selective classification with entropy-based confidence for text-to-SQL error detection

    Alexander Somov and Elena Tutubalina. Selective classification with entropy-based confidence for text-to-SQL error detection. In Proceedings of AAAI, 2025

  44. [44]

    Calibrated interpretation: Confidence estimation in semantic parsing

    Elias Stengel-Eskin and Benjamin Van Durme. Calibrated interpretation: Confidence estimation in semantic parsing. Transactions of the Association for Computational Linguistics (TACL), 2023

  45. [45]

    Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study

    Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. Table meets LLM: Can large language models understand structured table data? A benchmark and empirical study. In Proceedings of WSDM, 2024

  46. [46]

    Exploring generative process reward modeling for semi-structured data: A case study of table question answering

    Lei Tang, Wei Zhou, and Mohsen Mesgar. Exploring generative process reward modeling for semi-structured data: A case study of table question answering. In Proceedings of EACL, 2026. arXiv:2510.20304

  47. [47]

    Confidence improves self-consistency in LLMs

    Amir Taubenfeld, Tom Sheffer, Eran Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, and Gal Yona. Confidence improves self-consistency in LLMs. In Findings of the Association for Computational Linguistics: ACL, 2025

  48. [48]

    Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback

    Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of EMNLP, 2023

  49. [49]

    Bayesian prompt ensembles: Model uncertainty estimation for black-box large language models

    Francesco Tonolini, Thomas Sherborne, and Tom Sherborne. Bayesian prompt ensembles: Model uncertainty estimation for black-box large language models. In Findings of ACL, 2024

  50. [50]

    APRICOT: Calibrating large language models using their generations only

    Dennis Ulmer, Christian Hardmeier, and Jes Frellsen. APRICOT: Calibrating large language models using their generations only. In Proceedings of ACL, 2024

  51. [51]

    LM-Polygraph: A library for uncertainty quantification in language models

    Roman Vashurin et al. LM-Polygraph: A library for uncertainty quantification in language models. Transactions of the Association for Computational Linguistics (TACL), 2025

  52. [52]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In Proceedings of ICLR, 2023

  53. [53]

    Self-consistency sampling outperforms self-evaluation on reasoning tasks

    Xuezhi Wang et al. Self-consistency sampling outperforms self-evaluation on reasoning tasks. In Findings of EMNLP, 2024a

  54. [54]

    Accurate and regret-aware numerical problem solver for tabular question answering

    Yuxiang Wang, Jianzhong Qi, and Junhao Gan. Accurate and regret-aware numerical problem solver for tabular question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  55. [55]

    Chain-of-table: Evolving tables in the reasoning chain for table understanding

    Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Eisenschlos, Vincent Perot, Zifeng Wang, et al. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In Proceedings of ICLR, 2024b

  56. [56]

    TableBench: A comprehensive and complex benchmark for table question answering

    Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, and Zhoujun Li. TableBench: A comprehensive and complex benchmark for table question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

  57. [57]

    Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  58. [58]

    On calibration of large language models: From response to capability

    Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, and Shao-Hua Sun. On calibration of large language models: From response to capability. arXiv preprint arXiv:2602.13540, 2026

  59. [59]

    Benchmarking LLMs via uncertainty quantification

    Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. Benchmarking LLMs via uncertainty quantification. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  60. [60]

    STaR: Towards effective and stable table reasoning via slow-thinking large language models

    Huajian Zhang, Mingyue Cheng, Yucong Luo, and Xiaoyu Tao. STaR: Towards effective and stable table reasoning via slow-thinking large language models. arXiv preprint arXiv:2511.11233, 2025

  61. [61]

    Self-improving code generation via semantic entropy and behavioral consensus

    Huan Zhang, Wei Cheng, and Wei Hu. Self-improving code generation via semantic entropy and behavioral consensus. arXiv preprint arXiv:2603.29292, 2026