The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
Pith reviewed 2026-05-12 03:17 UTC · model grok-4.3
The pith
Aggregating semantic neighborhoods recovers lost probability mass and reduces overconfidence in zero-shot LLM classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard constrained decoding applies softmax only over a small target vocabulary and thereby discards the probability mass that the model originally assigned to semantic synonyms of those labels. The paper terms this discarded mass the Silent Vote and the resulting distortion renormalization bias. Semantic Softmax, an inference-time layer, recovers the lost mass by aggregating the raw scores of the semantic neighborhood around each target label, yielding predictions that are both better calibrated and more discriminative.
What carries the argument
Semantic Softmax, an inference-time aggregation that sums the raw model scores over the semantic neighborhood surrounding each target label.
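As an illustrative sketch (not the paper's exact formulation — the labels, neighborhoods, and logits below are hypothetical), the mechanism amounts to pooling full-vocabulary probability mass over each label's neighborhood before renormalizing, instead of applying softmax to the bare labels:

```python
import math

def softmax(scores):
    """Standard softmax over a dict of raw scores (logits)."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def semantic_softmax(vocab_scores, neighborhoods):
    """Pool full-vocabulary probability mass over each label's semantic
    neighborhood, then renormalize over the labels only."""
    full = softmax(vocab_scores)  # distribution before any label restriction
    pooled = {label: sum(full[t] for t in nbhd if t in full)
              for label, nbhd in neighborhoods.items()}
    z = sum(pooled.values())
    return {label: p / z for label, p in pooled.items()}

# Hypothetical logits: positive sentiment is spread across synonyms.
logits = {"joy": 2.0, "happy": 1.9, "glad": 1.7, "anger": 2.1, "the": 3.0}
neighborhoods = {"joy": {"joy", "happy", "glad"}, "anger": {"anger"}}

# Standard constrained decoding: synonyms of "joy" are silenced.
constrained = softmax({k: logits[k] for k in neighborhoods})
semantic = semantic_softmax(logits, neighborhoods)

print(constrained)  # "anger" wins: only the bare label scores are compared
print(semantic)     # "joy" wins once its synonyms' mass is pooled
```

In this toy setting the prediction flips: constrained decoding favors "anger" because it only sees the single token "joy", while pooling the synonym mass restores the "Silent Vote" and favors "joy".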
If this is right
- Substantially reduces Expected Calibration Error and Brier Score on the tested datasets.
- Increases AUROC and Macro-F1 scores for both Qwen-3 and Phi-4-mini.
- Yields more calibrated outputs while preserving or improving discriminative performance.
- Accounts for linguistic synonyms that standard constrained decoding ignores.
Where Pith is reading between the lines
- The same aggregation step could be applied to any constrained generation setting where output tokens carry overlapping meaning.
- Calibration problems in zero-shot classification may be partly an artifact of vocabulary restriction rather than a pure model defect.
- Automated identification of semantic neighborhoods could extend the method beyond manually chosen label sets.
Load-bearing premise
Semantic neighborhoods around target labels can be reliably identified, and aggregating their raw scores recovers the discarded probability mass without introducing new systematic bias.
What would settle it
Applying Semantic Softmax to a synthetic vocabulary in which target labels have no semantic neighbors produces calibration error equal to or higher than standard softmax.
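This control can be sketched directly: with singleton neighborhoods (no semantic neighbors), pooling adds nothing beyond each label's own mass, so Semantic Softmax should collapse exactly to the standard restricted softmax. The vocabulary and scores below are hypothetical:

```python
import math

def softmax(scores):
    m = max(scores.values())
    e = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(e.values())
    return {k: x / z for k, x in e.items()}

# Hypothetical synthetic vocabulary in which labels have no neighbors.
vocab = {"label_a": 1.2, "label_b": 0.4, "filler_1": 2.5, "filler_2": 0.9}
labels = ["label_a", "label_b"]

# Standard constrained decoding: softmax over the labels only.
constrained = softmax({k: vocab[k] for k in labels})

# Semantic Softmax with singleton neighborhoods: pool each label's own
# full-vocabulary mass, then renormalize over the labels.
full = softmax(vocab)
pooled = {k: full[k] for k in labels}
z = sum(pooled.values())
semantic = {k: p / z for k, p in pooled.items()}

# With nothing to pool, the two distributions coincide, so any calibration
# gap between them on such a vocabulary would falsify the claimed mechanism.
print(constrained, semantic)
```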
Original abstract
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Semantic Softmax, an inference-time technique to mitigate Renormalization Bias in zero-shot LLM classification by aggregating probability scores from semantic neighborhoods around target labels, thereby recovering the 'Silent Vote' discarded by standard softmax. It reports empirical improvements in calibration metrics (ECE, Brier Score) and discriminative performance (AUROC, Macro-F1) on the GoEmotions and Civil Comments datasets using Qwen-3 and Phi-4-mini models.
Significance. Should the method prove robust and free of artifacts like double-counting, it would offer a lightweight, training-free enhancement to zero-shot classification reliability, addressing a common issue in LLM-based labeling. The reported gains across multiple metrics on two datasets and models indicate practical value for applications requiring calibrated predictions, though stronger baselines and statistical validation would strengthen the case.
major comments (2)
- Abstract: The description of Semantic Softmax as 'aggregating the scores of the semantic neighborhood surrounding each target label' supplies neither a definition of how neighborhoods are constructed (e.g., embedding similarity or prompting) nor the aggregation formula. This is load-bearing for the central claim that the procedure recovers discarded probability mass without systematic bias or double-counting, as overlapping tokens (e.g., 'happy' across 'joy' and 'optimism' in GoEmotions) would otherwise inflate recovered mass.
- Evaluation section: No error bars, statistical significance tests, or comparisons to established calibration methods such as temperature scaling or label smoothing are reported, despite the claim of 'substantial' reductions in ECE and Brier Score alongside gains in AUROC and Macro-F1. This prevents assessment of whether the observed improvements exceed what could arise from over-allocation artifacts.
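The paired comparison the report asks for might look like the following sketch, using the standard-library `statistics` module; the per-seed ECE values are hypothetical, not from the paper:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for a paired t-test on per-seed metric values
    (e.g. ECE under standard softmax vs. Semantic Softmax)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical ECE values over five random seeds (illustration only).
ece_standard = [0.182, 0.176, 0.190, 0.185, 0.179]
ece_semantic = [0.121, 0.118, 0.130, 0.125, 0.119]

t = paired_t_statistic(ece_standard, ece_semantic)
print(f"paired t = {t:.2f} on {len(ece_standard) - 1} degrees of freedom")
# A p-value follows from the t distribution with n-1 degrees of freedom
# (e.g. scipy.stats.ttest_rel returns both in one call).
```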
minor comments (2)
- The manuscript introduces 'Renormalization Bias' and 'Silent Vote' without situating them against existing calibration literature (e.g., temperature scaling or post-hoc recalibration techniques).
- Reproducibility would be aided by an explicit pseudocode or equation block detailing neighborhood extraction and score aggregation, currently absent from the provided description.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights opportunities to improve clarity and empirical rigor in our work on Semantic Softmax. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: Abstract: The description of Semantic Softmax as 'aggregating the scores of the semantic neighborhood surrounding each target label' supplies neither a definition of how neighborhoods are constructed (e.g., embedding similarity or prompting) nor the aggregation formula. This is load-bearing for the central claim that the procedure recovers discarded probability mass without systematic bias or double-counting, as overlapping tokens (e.g., 'happy' across 'joy' and 'optimism' in GoEmotions) would otherwise inflate recovered mass.
  Authors: We agree the abstract is overly concise and omits key operational details. Section 3 of the manuscript defines semantic neighborhoods via embedding-based retrieval (a cosine-similarity threshold on token embeddings from a sentence encoder) and specifies the aggregation as a sum of probabilities over the neighborhood tokens, followed by renormalization to the target label set. To prevent double-counting from overlaps, the implementation uses a deduplication step that assigns each token to at most one neighborhood based on maximum similarity. We will expand the abstract with a brief description of neighborhood construction, the aggregation formula, and the overlap-handling mechanism to make the central claim self-contained. Revision: yes.
- Referee: Evaluation section: No error bars, statistical significance tests, or comparisons to established calibration methods such as temperature scaling or label smoothing are reported, despite the claim of 'substantial' reductions in ECE and Brier Score alongside gains in AUROC and Macro-F1. This prevents assessment of whether the observed improvements exceed what could arise from over-allocation artifacts.
  Authors: We acknowledge that the current evaluation lacks these elements, which limits the strength of the claims. We will add error bars by averaging results over multiple random seeds and reporting standard deviations. We will also include statistical significance testing (paired t-tests across runs) for the reported metric improvements. In addition, we will introduce baselines using temperature scaling and label smoothing applied to the standard softmax, allowing direct comparison to show that Semantic Softmax gains exceed those obtainable from simple recalibration. These changes will appear in the revised evaluation section and tables. Revision: yes.
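The overlap-handling step described in the authors' first response (each token assigned to at most one neighborhood by maximum similarity) can be sketched as follows; the token names, labels, and similarity values are hypothetical:

```python
def deduplicate_neighborhoods(similarity):
    """Assign each candidate token to at most one label's neighborhood,
    choosing the label with maximum cosine similarity.
    similarity: token -> {label: similarity score}."""
    neighborhoods = {}
    for token, sims in similarity.items():
        best_label = max(sims, key=sims.get)
        neighborhoods.setdefault(best_label, set()).add(token)
    return neighborhoods

# Hypothetical overlap: "happy" is similar to both GoEmotions labels.
sims = {
    "happy":   {"joy": 0.82, "optimism": 0.71},
    "hopeful": {"joy": 0.44, "optimism": 0.88},
    "joy":     {"joy": 1.00, "optimism": 0.35},
}
print(deduplicate_neighborhoods(sims))
# "happy" counts toward "joy" only, so its mass is never pooled twice.
```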
Circularity Check
No circularity: Semantic Softmax is an independent post-hoc aggregation layer
full rationale
The paper defines Renormalization Bias and Silent Vote descriptively from the known behavior of softmax under label constraints, then introduces Semantic Softmax as a separate inference-time aggregation over semantic neighborhoods. No equations, fitted parameters, or self-citations are shown that would make the claimed ECE/Brier reductions equivalent to the input distribution by construction. The method is presented as recovering information via external neighborhood identification rather than renaming or re-deriving the original probabilities. Empirical gains on GoEmotions and Civil Comments are therefore not forced by the definition itself.