Recognition: 1 theorem link · Lean Theorem
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Pith reviewed 2026-05-15 15:06 UTC · model grok-4.3
The pith
Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SelfCheckGPT is a sampling-based approach for zero-resource hallucination detection in black-box LLMs. It rests on the premise that when an LLM possesses knowledge of a concept, stochastically sampled responses tend to be similar and factually consistent, whereas hallucinated facts produce divergent and contradictory samples. Applied to GPT-3 generations on the WikiBio dataset with human-annotated factuality labels, the method detects non-factual sentences and ranks passages by factuality, delivering higher AUC-PR scores at the sentence level and stronger correlation scores at the passage level than grey-box alternatives.
What carries the argument
Consistency check across multiple independently sampled responses to the same prompt
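A minimal sketch of this check, assuming a hypothetical `sample_responses` call in place of the black-box model API and a toy unigram-overlap similarity standing in for the BERTScore, QA, and n-gram variants the paper evaluates:

```python
import re

def sample_responses(prompt: str, n: int) -> list[str]:
    """Placeholder for n stochastic draws from the black-box LLM (hypothetical API)."""
    raise NotImplementedError("call the target model here with temperature > 0")

def unigram_overlap(sentence: str, reference: str) -> float:
    """Toy stand-in for the paper's BERTScore / QA / n-gram consistency metrics."""
    s = set(re.findall(r"\w+", sentence.lower()))
    r = set(re.findall(r"\w+", reference.lower()))
    return len(s & r) / max(len(s), 1)

def selfcheck_scores(main_sentences: list[str], samples: list[str]) -> list[float]:
    """Per-sentence inconsistency scores: higher means more likely hallucinated."""
    scores = []
    for sent in main_sentences:
        sims = [unigram_overlap(sent, s) for s in samples]
        consistency = sum(sims) / len(sims)   # mean agreement with the N samples
        scores.append(1.0 - consistency)      # low agreement -> high hallucination score
    return scores
```

The paper evaluates the ranking these per-sentence scores induce (AUC-PR), not a fixed decision threshold.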
If this is right
- Allows factuality assessment for closed models such as ChatGPT that expose only generated text.
- Achieves higher AUC-PR in sentence-level hallucination detection than methods that rely on token probabilities.
- Enables ranking of generated passages by overall factuality without any external knowledge source.
- Requires only repeated sampling from the target model, making it applicable to any LLM supporting stochastic generation.
Where Pith is reading between the lines
- The method could be layered with other lightweight signals to catch cases where the model consistently repeats the same error.
- It suggests a general principle that internal model uncertainty may be approximated through output variation alone.
- Extensions to longer-form or multi-turn outputs would require adapting the consistency metric to handle accumulating context.
- In deployment, the approach could lower reliance on curated fact-checking databases for routine verification tasks.
Load-bearing premise
Divergence among sampled responses primarily signals hallucinated facts rather than stylistic differences or partial but consistent knowledge.
What would settle it
A collection of prompts where the model repeats the identical incorrect fact in every sample, causing the consistency metric to score the output as factual despite it being wrong.
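A toy illustration of that failure mode, using the same kind of unigram-overlap consistency as the sketch above; the birth year in the example is deliberately wrong:

```python
# If every sample repeats the same wrong fact, any agreement-based score is maximal.
candidate = "Marie Curie was born in 1900."        # false: she was born in 1867
samples = ["Marie Curie was born in 1900."] * 5    # identical, equally wrong samples

def overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa), 1)

consistency = sum(overlap(candidate, s) for s in samples) / len(samples)
print(consistency)  # 1.0 -> the sentence looks "factual" to the metric despite being wrong
```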
read the original abstract
Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose "SelfCheckGPT", a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that our approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SelfCheckGPT, a zero-resource black-box method for hallucination detection in generative LLMs. It samples multiple responses from the model (e.g., GPT-3 on WikiBio prompts) and measures consistency via metrics such as BERTScore, QA-based overlap, and n-gram overlap; low consistency is taken to indicate hallucinated facts. On manually annotated generations, the method reports higher sentence-level AUC-PR for hallucination detection and higher passage-level correlation with human factuality judgments than grey-box baselines.
Significance. If the consistency signal can be shown to isolate factual errors rather than stylistic or partial-knowledge variation, the approach would offer a practical, external-database-free tool for fact-checking black-box LLMs, addressing a key deployment barrier. The zero-resource design and direct comparison to grey-box methods are clear strengths.
major comments (3)
- [§4] §4 (Experiments): the central AUC-PR and correlation claims rest on human factuality annotations, yet no inter-annotator agreement statistics, annotation guidelines, or controls for annotator bias are reported; this directly affects the reliability of the ground-truth labels used to compute all performance numbers (a sketch of the kind of agreement statistic requested follows this list).
- [§3] §3 (Method): the premise that sample divergence signals hallucination is not isolated from other sources of variation (alternative valid phrasings, differing levels of detail, or stylistic choices). No ablation or control experiment is described that holds factual content fixed while varying only style or completeness, leaving open whether the reported gains are inflated by conflating multiple variation types.
- [§4.2] §4.2 (Evaluation metrics): the exact aggregation formulas for the consistency scores (e.g., how BERTScore or QA overlap is averaged or thresholded across the 5–20 samples) are not fully specified, nor are statistical significance tests for the AUC-PR improvements over baselines provided.
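To make the first comment concrete, here is a minimal sketch, with hypothetical labels, of the kind of agreement statistic being requested: pairwise Cohen's kappa over three annotators assigning binary factual / non-factual labels per sentence.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentence-level labels from three annotators (1 = non-factual, 0 = factual).
annotations = {
    "A": [1, 0, 0, 1, 1, 0, 0, 1],
    "B": [1, 0, 1, 1, 1, 0, 0, 1],
    "C": [1, 0, 0, 1, 0, 0, 0, 1],
}

# Average pairwise Cohen's kappa; Fleiss' kappa is the usual multi-rater alternative.
pairs = list(combinations(annotations, 2))
kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
for (a, b), k in zip(pairs, kappas):
    print(f"kappa({a},{b}) = {k:.3f}")
print(f"mean pairwise kappa = {sum(kappas) / len(kappas):.3f}")
```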
minor comments (2)
- [§3] The number of samples, sampling temperature, and prompt templates used for generation should be stated explicitly in §3 and §4.1 for reproducibility.
- [§4] Figure 2 or the corresponding table should include error bars or confidence intervals on the AUC-PR and correlation values.
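A sketch of one way such intervals could be produced, bootstrapping sentence-level average precision (AUC-PR) over resampled test sentences; the labels and scores below are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_auc_pr(labels, scores, n_boot=10_000, seed=0):
    """Mean and 95% percentile-bootstrap interval for AUC-PR over resampled sentences."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].sum() == 0:   # skip resamples with no positive (non-factual) sentence
            continue
        stats.append(average_precision_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (float(lo), float(hi))

# Placeholder data: 1 = non-factual sentence, score = inconsistency assigned by the detector.
labels = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1, 0.3, 0.95, 0.25, 0.6]
print(bootstrap_auc_pr(labels, scores, n_boot=2000))
```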
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying our approach where possible and committing to revisions that strengthen the presentation of the experimental details and method assumptions.
read point-by-point responses
- Referee: [§4] §4 (Experiments): the central AUC-PR and correlation claims rest on human factuality annotations, yet no inter-annotator agreement statistics, annotation guidelines, or controls for annotator bias are reported; this directly affects the reliability of the ground-truth labels used to compute all performance numbers.
  Authors: We agree that inter-annotator agreement and annotation protocol details are necessary to establish label reliability. Although these details were omitted from the initial submission for brevity, annotations were performed by three independent annotators following explicit guidelines that defined a hallucination as any unsupported factual claim. In the revised manuscript we will add a subsection to §4 reporting inter-annotator agreement statistics, reproduce the full annotation guidelines in an appendix, and describe bias-mitigation steps such as randomized presentation order and independent adjudication of disagreements. revision: yes
- Referee: [§3] §3 (Method): the premise that sample divergence signals hallucination is not isolated from other sources of variation (alternative valid phrasings, differing levels of detail, or stylistic choices). No ablation or control experiment is described that holds factual content fixed while varying only style or completeness, leaving open whether the reported gains are inflated by conflating multiple variation types.
  Authors: We acknowledge the concern that consistency metrics could be influenced by non-factual sources of variation. Our QA-based and entity-overlap metrics are deliberately chosen to emphasize factual content rather than surface form; however, we concede that a controlled ablation isolating style while fixing facts would provide stronger evidence. We will expand the discussion in §3 to articulate why we expect stylistic variation to have limited impact on the reported metrics for the WikiBio task, and we will explicitly list the absence of such an ablation as a limitation. Performing the ablation would require substantial new annotation and is left for future work. revision: partial
- Referee: [§4.2] §4.2 (Evaluation metrics): the exact aggregation formulas for the consistency scores (e.g., how BERTScore or QA overlap is averaged or thresholded across the 5–20 samples) are not fully specified, nor are statistical significance tests for the AUC-PR improvements over baselines provided.
  Authors: We apologize for the incomplete specification. The sentence-level consistency score is the mean of the pairwise similarity values (BERTScore, QA overlap, or n-gram overlap) between the candidate sentence and each of the N sampled responses; no additional thresholding is applied. In the revised §4.2 we will state these aggregation formulas explicitly and will report statistical significance of the AUC-PR gains over baselines via paired bootstrap resampling with 10,000 iterations. revision: yes
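In notation of our own (not necessarily the paper's), the aggregation described in this response amounts to the following, where $r_i$ is the $i$-th sentence of the main response and $S^{1},\dots,S^{N}$ are the stochastic samples:

```latex
\mathrm{Consistency}(r_i) = \frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\!\left(r_i,\, S^{n}\right),
\qquad
\mathrm{Score}(r_i) = 1 - \mathrm{Consistency}(r_i)
```

Here sim is the chosen similarity (BERTScore, QA overlap, or n-gram overlap); sentences are ranked by the score, with higher values indicating likely hallucination, and AUC-PR is computed over that ranking. The $1-\mathrm{Consistency}$ inversion is a reporting convention rather than an extra parameter.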
Circularity Check
No significant circularity in SelfCheckGPT sampling consistency heuristic
full rationale
The paper defines SelfCheckGPT as a direct sampling procedure: generate multiple stochastic responses from a black-box LLM and measure consistency (via BERTScore, QA, n-gram overlap) to flag hallucinations. This is tested against independent human factuality labels on WikiBio passages. No equations reduce a claimed prediction to a fitted input by construction, no self-citation chains justify the core premise, and the consistency assumption is presented as a testable heuristic rather than a self-definition. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: If an LLM has knowledge of a concept, sampled responses are likely to be similar and contain consistent facts; for hallucinated facts, responses are likely to diverge.
Forward citations
Cited by 19 Pith papers
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
  RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- WizardLM: Empowering large pre-trained language models to follow complex instructions
  WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
- Evaluating the False Trust engendered by LLM Explanations
  A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
  Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
- The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
  LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
- Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
  A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
- Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
  Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
- Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
  Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
- Preregistered Belief Revision Contracts
  PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
  HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
  A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.
- Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
  HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
- Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
  FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.