Recognition: no theorem link
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3
The pith
Contrasting logits from later versus earlier transformer layers reduces hallucinations in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that obtaining the next-token distribution by contrasting the logits produced when later versus earlier transformer layers are projected to the vocabulary space better surfaces factual knowledge and reduces the generation of incorrect facts, yielding consistent improvements in truthfulness across tasks.
What carries the argument
Decoding by Contrasting Layers (DoLa), which obtains the next-token distribution by subtracting logits from early layers from those of late layers to isolate factual knowledge.
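The contrast itself can be stated in a few lines. Below is a minimal, illustrative sketch of the layer-contrast step on toy arrays; it implements the subtraction as a log-probability difference, which is one common way to realize a logit contrast, and all names and values here are hypothetical, not the paper's full method.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrast_layers(late_logits, early_logits):
    """Sketch of a DoLa-style layer contrast for one decoding step.

    Inputs are (vocab_size,) arrays: the hidden state of a late and an
    early transformer layer, each projected through the model's output
    head. Subtracting the early layer's log-probabilities from the late
    layer's promotes tokens whose probability grows with depth.
    """
    diff = np.log(softmax(late_logits)) - np.log(softmax(early_logits))
    return softmax(diff)  # next-token distribution

# Toy example with hypothetical logits for a 3-token vocabulary.
p = contrast_layers(np.array([2.0, 0.0, -1.0]), np.array([1.0, 0.5, -1.0]))
```

In the paper's setting both inputs come from the same forward pass of a pretrained LLM; here they are stand-in arrays.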
Load-bearing premise
Factual knowledge is localized to particular transformer layers, and subtracting early-layer logits from late-layer logits reliably surfaces accurate information without introducing new errors.
What would settle it
A test showing no improvement or increased errors when applying the layer contrast to a model where factual knowledge is not localized to specific layers would falsify the central claim.
read the original abstract
Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an LLM has generally been shown to be localized to particular transformer layers. We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. DoLa consistently improves truthfulness across multiple-choice tasks and open-ended generation tasks, for example improving the performance of LLaMA family models on TruthfulQA by 12-17% absolute points, demonstrating its potential in making LLMs reliably generate truthful facts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Decoding by Contrasting Layers (DoLa), a training-free decoding strategy that obtains next-token logits by subtracting the vocabulary projections of earlier transformer layers from those of later layers. It exploits prior observations that factual knowledge tends to localize in deeper layers and reports consistent gains on multiple-choice and open-ended generation benchmarks, including 12-17 absolute point improvements on TruthfulQA for LLaMA-family models.
Significance. If the reported gains prove robust under controlled ablations and the mechanism is isolated from generic logit-shift effects, DoLa would provide a simple, parameter-free method to improve factuality in existing pretrained LLMs without retrieval or fine-tuning. The approach is lightweight and directly applicable at inference time.
major comments (2)
- [§3.1] §3.1, Eq. (2): the subtraction L_late - L_early is presented as surfacing factual knowledge, yet the manuscript provides no direct measurement (e.g., layer-wise factuality probes or knowledge-editing experiments) confirming that the chosen early layer systematically encodes less factual content than a mid-layer or random layer; without this isolation the 12-17% TruthfulQA lift could arise from any distributional contrast.
- [§4.3] §4.3, Table 3: the layer-selection ablation reports gains only for the authors' chosen early layer, but omits controls that replace the early layer with a mid-layer (e.g., layer 12) or a random layer while keeping the late layer fixed; such controls are required to rule out that any logit subtraction improves calibration.
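The control asked for above is mechanical once per-layer vocabulary projections are available. Here is a hedged sketch with toy random logits standing in for real per-layer projections; the layer indices, condition names, and function names are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrast(layer_logits, late, early):
    """Next-token distribution from contrasting two layers' logits.
    layer_logits: (num_layers, vocab_size) projections for one step."""
    lp = np.log(softmax(layer_logits))
    return softmax(lp[late] - lp[early])

def early_layer_controls(layer_logits, late, named_earlies, rng):
    """Hold the late layer fixed and swap the contrasted layer among
    the chosen early layer, a mid layer, and a random premature layer.
    Returns {condition: distribution} for downstream scoring."""
    conditions = dict(named_earlies)
    conditions["random"] = int(rng.integers(0, late))
    return {name: contrast(layer_logits, late, idx)
            for name, idx in conditions.items()}

# Toy run: a hypothetical 32-layer model with a 100-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 100))
dists = early_layer_controls(logits, late=31,
                             named_earlies=[("early", 2), ("mid", 12)],
                             rng=rng)
```

Scoring each condition on a TruthfulQA-style benchmark would show whether the chosen early layer matters or whether any logit subtraction helps.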
minor comments (2)
- [§2] §2: the related-work discussion of logit-contrast methods (e.g., contrastive decoding) is brief; a short paragraph clarifying the precise difference between DoLa and prior logit-difference techniques would help readers.
- [Figure 2] Figure 2: the y-axis label 'Truthfulness' should explicitly state the metric (e.g., % truthful answers on TruthfulQA) and whether error bars represent standard deviation over seeds or prompts.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation. We address each major comment below, indicating where revisions will be made.
read point-by-point responses
-
Referee: [§3.1] §3.1, Eq. (2): the subtraction L_late - L_early is presented as surfacing factual knowledge, yet the manuscript provides no direct measurement (e.g., layer-wise factuality probes or knowledge-editing experiments) confirming that the chosen early layer systematically encodes less factual content than a mid-layer or random layer; without this isolation the 12-17% TruthfulQA lift could arise from any distributional contrast.
Authors: We agree that direct layer-wise probes or editing experiments would provide stronger mechanistic evidence. The current work relies on and cites prior literature establishing that factual knowledge tends to localize in deeper layers of transformer models. The consistent empirical improvements across multiple benchmarks support the utility of the contrast, but we do not claim to have performed new isolation experiments ourselves. In the revision we will expand the discussion in §3.1 to more explicitly reference the supporting literature and clarify the scope of our claims. revision: partial
-
Referee: [§4.3] §4.3, Table 3: the layer-selection ablation reports gains only for the authors' chosen early layer, but omits controls that replace the early layer with a mid-layer (e.g., layer 12) or a random layer while keeping the late layer fixed; such controls are required to rule out that any logit subtraction improves calibration.
Authors: We thank the referee for this suggestion. The existing ablation selects early layers based on preliminary analysis of layer-wise behavior. To address the concern about generic logit-shift effects, we will add the requested controls (mid-layer and random-layer contrasts with the same late layer) and report the results in the revised version of Table 3 and accompanying text. revision: yes
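The layer-wise factuality probes discussed in this exchange can be prototyped with one linear classifier per layer. The following is a self-contained sketch on synthetic data; the probe design, shapes, and names are assumptions for illustration, not the paper's experiments.

```python
import numpy as np

def fit_probe(feats, labels, steps=500, lr=0.5):
    """Logistic-regression probe: predict factual correctness (0/1)
    from one layer's hidden states via gradient descent."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        g = p - labels                       # gradient of log-loss
        w -= lr * feats.T @ g / len(labels)
        b -= lr * g.mean()
    return w, b

def layerwise_probe_accuracy(hidden, labels):
    """hidden: (num_layers, num_examples, dim). Fit one probe per layer
    and report its training accuracy; a gradient of accuracy across
    depth would bear on the localization premise."""
    accs = []
    for h in hidden:
        w, b = fit_probe(h, labels)
        pred = (h @ w + b) > 0
        accs.append((pred == labels).mean())
    return np.array(accs)
```

On synthetic data where only one layer encodes the label, that layer's probe accuracy separates cleanly from the rest, which is the kind of direct evidence the referee requests.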
Circularity Check
No significant circularity: the DoLa decoding rule is a direct, parameter-free contrast of layer logits
full rationale
The DoLa method is defined directly from the forward pass by subtracting early-layer logits from late-layer logits to obtain the next-token distribution. No parameters are fitted to the target task, no self-citation chain is required to state the algorithm, and the localization premise is invoked as a general prior result rather than derived or renamed within the paper. Empirical gains on TruthfulQA and other benchmarks are reported as external measurements, not forced by construction from the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Factual knowledge in LLMs is localized to particular transformer layers
Forward citations
Cited by 17 Pith papers
-
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
-
Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models
Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
-
APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
APCD reduces LLM hallucinations by expanding decoding paths adaptively when entropy signals uncertainty and by contrasting divergent paths to control their interaction.
-
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
-
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
-
Select to Think: Unlocking SLM Potential with Local Sufficiency
Small language models can achieve near large-model reasoning performance by learning to re-rank their own top-K token predictions after distilling selection from the large model.
-
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
-
HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models
HTDC mitigates hallucinations in LVLMs by triggering calibration only at hesitation-prone decoding steps via contrasts with visual-nullification and semantic-nullification probes.
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
-
Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation
DaID mitigates MLLM hallucinations by attention-guided selection of dual layers that calibrate token generation using internal perceptual discrepancies.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
-
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation benchmarks, and analyzes approaches for their detection, explanation, and mitigation.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Improving language models by retrieving from trillions of tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pp. 2206--2240. PMLR, 2022
work page 2022
-
[2]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...
work page 2020
-
[5]
Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
work page 2023
-
[7]
Knowledge neurons in pretrained transformers
Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8493--8502, 2022
work page 2022
-
[9]
Depth-adaptive transformer
Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In ICLR 2020-Eighth International Conference on Learning Representations, pp. 1--14, 2020
work page 2020
-
[10]
Not all models localize linguistic knowledge in the same place: A layer-wise probing on bertoids’ representations
Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. Not all models localize linguistic knowledge in the same place: A layer-wise probing on bertoids’ representations. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 375--388, 2021
work page 2021
-
[11]
Openllama: An open reproduction of llama, May 2023
Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama
work page 2023
-
[13]
Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346--361, 2021
work page 2021
-
[14]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770--778, 2016
work page 2016
-
[16]
Survey of hallucination in natural language generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1--38, 2023
work page 2023
-
[22]
Truthfulqa: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214--3252, 2022
work page 2022
-
[25]
Locating and editing factual associations in GPT
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 2022
work page 2022
-
[26]
Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023
NLP Team MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05
work page 2023
-
[28]
Does bert rediscover a classical nlp pipeline?
Jingcheng Niu, Wenjie Lu, and Gerald Penn. Does bert rediscover a classical nlp pipeline? In Proceedings of the 29th International Conference on Computational Linguistics, pp. 3143--3153, 2022
work page 2022
-
[30]
Introducing chatgpt, November 2022
OpenAI. Introducing chatgpt, November 2022. URL https://openai.com/blog/chatgpt
work page 2022
-
[31]
Gpt-4 technical report
OpenAI. Gpt-4 technical report. 2023. URL https://cdn.openai.com/papers/gpt-4.pdf
work page 2023
-
[32]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022
work page 2022
-
[34]
Introduction to the conll-2003 shared task: Language-independent named entity recognition
Erik Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142--147, 2003
work page 2003
-
[35]
Confident adaptive language modeling
Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456--17472, 2022
work page 2022
-
[37]
Branchynet: Fast inference via early exiting from deep neural networks
Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464--2469. IEEE, 2016
work page 2016
-
[38]
Bert rediscovers the classical nlp pipeline
Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593--4601, 2019
work page 2019
-
[40]
Emergent abilities of large language models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022
work page 2022
-
[43]
Learning to break the loop: Analyzing and mitigating repetitions for neural text generation
Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082--3095, 2022
work page 2022
-
[45]
Scaling learning algorithms towards AI
Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007
-
[46]
A fast learning algorithm for deep belief nets
Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527--1554, 2006
-
[50]
Constitutional AI: Harmlessness from AI Feedback
Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022
work page 2022
-
[51]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023
work page 2023
-
[52]
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023
work page 2023
-
[53]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023
work page 2023
-
[55]
LLaMA: Open and Efficient Foundation Language Models
LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[58]
Reducing Transformer Depth on Demand with Structured Dropout
Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations, 2020
-
[59]
BERT Loses Patience: Fast and Robust Inference with Early Exit
Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33, 2020
-
[60]
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
-
[61]
BERT's Output Layer Recognizes All Hidden Layers? Some Intriguing Phenomena and a Simple Way to Boost BERT
BERT's output layer recognizes all hidden layers? Some intriguing phenomena and a simple way to boost BERT. arXiv preprint arXiv:2001.09309, 2020
-
[65]
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852, 2023
-
[66]
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023
work page 2023
-
[71]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
work page 2022
-
[73]
Stanford Alpaca: An Instruction-following LLaMA Model
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository, 2023
work page 2023
-
[74]
Contrastive Decoding: Open-ended Text Generation as Optimization
Contrastive decoding: Open-ended text generation as optimization. arXiv preprint arXiv:2210.15097, 2022
-
[75]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019
-
[76]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019
-
[78]
Generating Benchmarks for Factuality Evaluation of Language Models
Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908, 2023
-
[80]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page 2021
-
[83]
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022
work page 2022
-
[85]
In-Context Retrieval-Augmented Language Models
In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, 2023
-
[88]
Finetuned Language Models are Zero-Shot Learners
Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022
-
[89]
Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
Trusting your evidence: Hallucinate less with context-aware decoding. arXiv preprint arXiv:2305.14739, 2023
-
[90]
The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers
Ariel Gera, Roni Friedman, Ofir Arviv, Chulaka Gunasekara, Benjamin Sznajder, Noam Slonim, and Eyal Shnarch. The benefits of bad advice: Autocontrastive decoding across model layers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
-
[91]
Contrastive Decoding Improves Reasoning in Large Language Models
Contrastive decoding improves reasoning in large language models. arXiv preprint arXiv:2309.09117, 2023
-
[93]
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023
-
[95]
Can Large Language Models Be an Alternative to Human Evaluations?
Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023
-
[96]
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023
work page 2023
-
[97]
A Closer Look into Automatic Evaluation Using Large Language Models
A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657, 2023
discussion (0)