Recognition: 1 theorem link · Lean Theorem
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Pith reviewed 2026-05-15 15:06 UTC · model grok-4.3
The pith
Multiple stochastic samples from a black-box LLM reveal which generated facts are hallucinations by checking their consistency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SelfCheckGPT is a sampling-based approach for zero-resource hallucination detection in black-box LLMs. It rests on the premise that when an LLM possesses knowledge of a concept, stochastically sampled responses tend to be similar and factually consistent, whereas hallucinated facts produce divergent and contradictory samples. Applied to GPT-3 generations on the WikiBio dataset with human-annotated factuality labels, the method detects non-factual sentences and ranks passages by factuality, delivering higher AUC-PR scores at the sentence level and stronger correlation scores at the passage level than grey-box alternatives.
What carries the argument
Consistency check across multiple independently sampled responses to the same prompt
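A minimal sketch of this check, assuming a hypothetical `sample_responses` call in place of the black-box model API and a toy unigram-overlap similarity standing in for the BERTScore, QA, and n-gram variants the paper evaluates:

```python
import re

def sample_responses(prompt: str, n: int) -> list[str]:
    """Placeholder for n stochastic draws from the black-box LLM (hypothetical API)."""
    raise NotImplementedError("call the target model here with temperature > 0")

def unigram_overlap(sentence: str, reference: str) -> float:
    """Toy stand-in for the paper's BERTScore / QA / n-gram consistency metrics."""
    s = set(re.findall(r"\w+", sentence.lower()))
    r = set(re.findall(r"\w+", reference.lower()))
    return len(s & r) / max(len(s), 1)

def selfcheck_scores(main_sentences: list[str], samples: list[str]) -> list[float]:
    """Per-sentence inconsistency scores: higher means more likely hallucinated."""
    scores = []
    for sent in main_sentences:
        sims = [unigram_overlap(sent, s) for s in samples]
        consistency = sum(sims) / len(sims)   # mean agreement with the N samples
        scores.append(1.0 - consistency)      # low agreement -> high hallucination score
    return scores
```

The paper evaluates the ranking these per-sentence scores induce (AUC-PR), not a fixed decision threshold.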
If this is right
- Allows factuality assessment for closed models such as ChatGPT that expose only generated text.
- Achieves higher AUC-PR in sentence-level hallucination detection than methods that rely on token probabilities.
- Enables ranking of generated passages by overall factuality without any external knowledge source.
- Requires only repeated sampling from the target model, making it applicable to any LLM supporting stochastic generation.
Where Pith is reading between the lines
- The method could be layered with other lightweight signals to catch cases where the model consistently repeats the same error.
- It suggests a general principle that internal model uncertainty may be approximated through output variation alone.
- Extensions to longer-form or multi-turn outputs would require adapting the consistency metric to handle accumulating context.
- In deployment, the approach could lower reliance on curated fact-checking databases for routine verification tasks.
Load-bearing premise
Divergence among sampled responses primarily signals hallucinated facts rather than stylistic differences or partial but consistent knowledge.
What would settle it
A collection of prompts where the model repeats the identical incorrect fact in every sample, causing the consistency metric to score the output as factual despite it being wrong.
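A toy illustration of that failure mode, using the same kind of unigram-overlap consistency as the sketch above; the birth year in the example is deliberately wrong:

```python
# If every sample repeats the same wrong fact, any agreement-based score is maximal.
candidate = "Marie Curie was born in 1900."        # false: she was born in 1867
samples = ["Marie Curie was born in 1900."] * 5    # identical, equally wrong samples

def overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa), 1)

consistency = sum(overlap(candidate, s) for s in samples) / len(samples)
print(consistency)  # 1.0 -> the sentence looks "factual" to the metric despite being wrong
```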
read the original abstract
Generative Large Language Models (LLMs) such as GPT-3 are capable of generating highly fluent responses to a wide variety of user prompts. However, LLMs are known to hallucinate facts and make non-factual statements which can undermine trust in their output. Existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as ChatGPT) or external databases that are interfaced via separate, often complex, modules. In this work, we propose "SelfCheckGPT", a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. SelfCheckGPT leverages the simple idea that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset, and manually annotate the factuality of the generated passages. We demonstrate that SelfCheckGPT can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. We compare our approach to several baselines and show that our approach has considerably higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SelfCheckGPT, a zero-resource black-box method for hallucination detection in generative LLMs. It samples multiple responses from the model (e.g., GPT-3 on WikiBio prompts) and measures consistency via metrics such as BERTScore, QA-based overlap, and n-gram overlap; low consistency is taken to indicate hallucinated facts. On manually annotated generations, the method reports higher sentence-level AUC-PR for hallucination detection and higher passage-level correlation with human factuality judgments than grey-box baselines.
Significance. If the consistency signal can be shown to isolate factual errors rather than stylistic or partial-knowledge variation, the approach would offer a practical, external-database-free tool for fact-checking black-box LLMs, addressing a key deployment barrier. The zero-resource design and direct comparison to grey-box methods are clear strengths.
major comments (3)
- [§4] §4 (Experiments): the central AUC-PR and correlation claims rest on human factuality annotations, yet no inter-annotator agreement statistics, annotation guidelines, or controls for annotator bias are reported; this directly affects the reliability of the ground-truth labels used to compute all performance numbers (a sketch of the kind of agreement statistic requested follows this list).
- [§3] §3 (Method): the premise that sample divergence signals hallucination is not isolated from other sources of variation (alternative valid phrasings, differing levels of detail, or stylistic choices). No ablation or control experiment is described that holds factual content fixed while varying only style or completeness, leaving open whether the reported gains are inflated by conflating multiple variation types.
- [§4.2] §4.2 (Evaluation metrics): the exact aggregation formulas for the consistency scores (e.g., how BERTScore or QA overlap is averaged or thresholded across the 5–20 samples) are not fully specified, nor are statistical significance tests for the AUC-PR improvements over baselines provided.
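To make the first comment concrete, here is a minimal sketch, with hypothetical labels, of the kind of agreement statistic being requested: pairwise Cohen's kappa over three annotators assigning binary factual / non-factual labels per sentence.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentence-level labels from three annotators (1 = non-factual, 0 = factual).
annotations = {
    "A": [1, 0, 0, 1, 1, 0, 0, 1],
    "B": [1, 0, 1, 1, 1, 0, 0, 1],
    "C": [1, 0, 0, 1, 0, 0, 0, 1],
}

# Average pairwise Cohen's kappa; Fleiss' kappa is the usual multi-rater alternative.
pairs = list(combinations(annotations, 2))
kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
for (a, b), k in zip(pairs, kappas):
    print(f"kappa({a},{b}) = {k:.3f}")
print(f"mean pairwise kappa = {sum(kappas) / len(kappas):.3f}")
```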
minor comments (2)
- [§3] The number of samples, sampling temperature, and prompt templates used for generation should be stated explicitly in §3 and §4.1 for reproducibility.
- [§4] Figure 2 or the corresponding table should include error bars or confidence intervals on the AUC-PR and correlation values.
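A sketch of one way such intervals could be produced, bootstrapping sentence-level average precision (AUC-PR) over resampled test sentences; the labels and scores below are placeholders, not values from the paper:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_auc_pr(labels, scores, n_boot=10_000, seed=0):
    """Mean and 95% percentile-bootstrap interval for AUC-PR over resampled sentences."""
    rng = np.random.default_rng(seed)
    labels, scores = np.asarray(labels), np.asarray(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels), len(labels))
        if labels[idx].sum() == 0:   # skip resamples with no positive (non-factual) sentence
            continue
        stats.append(average_precision_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return float(np.mean(stats)), (float(lo), float(hi))

# Placeholder data: 1 = non-factual sentence, score = inconsistency assigned by the detector.
labels = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1, 0.3, 0.95, 0.25, 0.6]
print(bootstrap_auc_pr(labels, scores, n_boot=2000))
```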
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, clarifying our approach where possible and committing to revisions that strengthen the presentation of the experimental details and method assumptions.
read point-by-point responses
- Referee: [§4] §4 (Experiments): the central AUC-PR and correlation claims rest on human factuality annotations, yet no inter-annotator agreement statistics, annotation guidelines, or controls for annotator bias are reported; this directly affects the reliability of the ground-truth labels used to compute all performance numbers.
  Authors: We agree that inter-annotator agreement and annotation protocol details are necessary to establish label reliability. Although these details were omitted from the initial submission for brevity, annotations were performed by three independent annotators following explicit guidelines that defined a hallucination as any unsupported factual claim. In the revised manuscript we will add a subsection to §4 reporting inter-annotator agreement statistics, reproduce the full annotation guidelines in an appendix, and describe bias-mitigation steps such as randomized presentation order and independent adjudication of disagreements. revision: yes
- Referee: [§3] §3 (Method): the premise that sample divergence signals hallucination is not isolated from other sources of variation (alternative valid phrasings, differing levels of detail, or stylistic choices). No ablation or control experiment is described that holds factual content fixed while varying only style or completeness, leaving open whether the reported gains are inflated by conflating multiple variation types.
  Authors: We acknowledge the concern that consistency metrics could be influenced by non-factual sources of variation. Our QA-based and entity-overlap metrics are deliberately chosen to emphasize factual content rather than surface form; however, we concede that a controlled ablation isolating style while fixing facts would provide stronger evidence. We will expand the discussion in §3 to articulate why we expect stylistic variation to have limited impact on the reported metrics for the WikiBio task, and we will explicitly list the absence of such an ablation as a limitation. Performing the ablation would require substantial new annotation and is left for future work. revision: partial
- Referee: [§4.2] §4.2 (Evaluation metrics): the exact aggregation formulas for the consistency scores (e.g., how BERTScore or QA overlap is averaged or thresholded across the 5–20 samples) are not fully specified, nor are statistical significance tests for the AUC-PR improvements over baselines provided.
  Authors: We apologize for the incomplete specification. The sentence-level consistency score is the mean of the pairwise similarity values (BERTScore, QA overlap, or n-gram overlap) between the candidate sentence and each of the N sampled responses; no additional thresholding is applied. In the revised §4.2 we will state these aggregation formulas explicitly and will report statistical significance of the AUC-PR gains over baselines via paired bootstrap resampling with 10,000 iterations. revision: yes
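In notation of our own (not necessarily the paper's), the aggregation described in this response amounts to the following, where $r_i$ is the $i$-th sentence of the main response and $S^{1},\dots,S^{N}$ are the stochastic samples:

```latex
\mathrm{Consistency}(r_i) = \frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\!\left(r_i,\, S^{n}\right),
\qquad
\mathrm{Score}(r_i) = 1 - \mathrm{Consistency}(r_i)
```

Here sim is the chosen similarity (BERTScore, QA overlap, or n-gram overlap); sentences are ranked by the score, with higher values indicating likely hallucination, and AUC-PR is computed over that ranking. The $1-\mathrm{Consistency}$ inversion is a reporting convention rather than an extra parameter.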
Circularity Check
No significant circularity in SelfCheckGPT sampling consistency heuristic
full rationale
The paper defines SelfCheckGPT as a direct sampling procedure: generate multiple stochastic responses from a black-box LLM and measure consistency (via BERTScore, QA, n-gram overlap) to flag hallucinations. This is tested against independent human factuality labels on WikiBio passages. No equations reduce a claimed prediction to a fitted input by construction, no self-citation chains justify the core premise, and the consistency assumption is presented as a testable heuristic rather than a self-definition. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: If an LLM has knowledge of a concept, sampled responses are likely to be similar and contain consistent facts; for hallucinated facts, responses are likely to diverge.
Forward citations
Cited by 19 Pith papers
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
  RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- WizardLM: Empowering large pre-trained language models to follow complex instructions
  WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
- Evaluating the False Trust engendered by LLM Explanations
  A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.
- Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation
  Semantic distance on program execution behaviors improves uncertainty estimation for LLM code generation and outperforms prior sample-based methods across benchmarks and models.
- The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
  LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
- Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
  A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
- Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
  Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
- Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
  Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
- Preregistered Belief Revision Contracts
  PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  GraphRAG improves comprehensiveness and diversity of answers to global questions over million-token document sets by constructing entity graphs and hierarchical community summaries before combining partial responses.
- Calibrating Model-Based Evaluation Metrics for Summarization
  A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
- Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
  HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey
  A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.
- Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
  HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
- Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
  FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.