Recognition: 2 theorem links
Discovering Latent Knowledge in Language Models Without Supervision
Pith reviewed 2026-05-15 20:30 UTC · model grok-4.3
The pith
A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that there is a direction in activation space such that the sign of the projection of a statement's activation vector onto it correctly indicates the statement's truth value. This direction is recovered without labels by searching for the vector that best satisfies logical-consistency constraints between statements and their negations across many unlabeled statements. The resulting classifier recovers diverse knowledge represented inside the model and outperforms zero-shot baselines while remaining robust to prompt variations and to instructions that ask the model to lie.
What carries the argument
The central object is a single linear direction in activation space found by optimizing for logical consistency: the projections of a statement and of its negation must have opposite signs, and the sign of the projection then serves as the yes-no answer.
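To make the machinery concrete, here is a minimal sketch of such a consistency search, written against the description above rather than the paper's code. The sigmoid probe, the added confidence term (which rules out the degenerate solution where every probability sits at 0.5), and all hyperparameters are assumptions, not details taken from the paper.

```python
import torch

def consistency_search(acts_pos, acts_neg, n_steps=1000, lr=1e-3):
    """Search for a direction v (and bias b) such that a statement s and
    its negation ~s receive opposite answers under a linear probe.

    acts_pos, acts_neg: (n, d) tensors of activations for n statements
    and their negations, assumed already normalized.
    """
    d = acts_pos.shape[1]
    v = torch.randn(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([v, b], lr=lr)
    for _ in range(n_steps):
        p_pos = torch.sigmoid(acts_pos @ v + b)  # probe's P(s is true)
        p_neg = torch.sigmoid(acts_neg @ v + b)  # probe's P(~s is true)
        # Logical consistency: P(s) and P(~s) should sum to one.
        consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
        # Confidence (an assumption here): push probabilities away from 0.5
        # so the probe cannot satisfy consistency by being uninformative.
        confidence = torch.minimum(p_pos, p_neg).pow(2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach(), b.detach()
```

The sign of the projection `acts @ v + b > 0` is then read off as the yes-no answer, which is exactly the role the direction plays in the claim above.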
If this is right
- What the model knows internally can be read out separately from the text it generates under a given prompt.
- Prompt engineering becomes less necessary for eliciting truthful answers.
- The technique works even on models trained by imitation learning that may reproduce human errors in their outputs.
- Accuracy holds when models are explicitly instructed to produce incorrect answers, showing the direction tracks internal knowledge rather than surface generation.
- The same consistency-based search can be repeated on new models without task-specific labels or fine-tuning.
Where Pith is reading between the lines
- Similar directions might be recoverable for other abstract properties such as uncertainty or logical consistency itself.
- Model alignment procedures could directly optimize or verify against these latent directions instead of generated text.
- The approach suggests that internal truth representations in language models are often approximately linear and therefore relatively easy to isolate.
Load-bearing premise
The assumption that a single linear direction exists in activation space whose projections are logically consistent and specifically track the model's knowledge of truth rather than some other consistent property.
What would settle it
If the sign of projections onto the recovered direction predicts ground-truth answers on a held-out set of yes-no questions no better than the zero-shot baseline, or if no direction satisfies the consistency constraints across a diverse collection of statements.
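The settling test itself is mechanical once a direction is in hand. A hedged sketch, reusing the tensors from the search above; because the search never sees labels, the sign of v is arbitrary, so a fair evaluation scores the better of the two labelings and only then compares against the zero-shot baseline:

```python
import torch

def evaluate_direction(v, b, acts, labels):
    """Score sign-of-projection predictions on held-out yes/no questions.

    acts: (n, d) activations for the held-out statements.
    labels: (n,) tensor of 0/1 ground-truth answers, used only here.
    """
    preds = (acts @ v + b > 0).long()
    acc = (preds == labels).float().mean().item()
    # The unsupervised search fixes v only up to sign, so take the
    # better of the two possible labelings.
    return max(acc, 1.0 - acc)
```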
original abstract
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an unsupervised method to extract latent knowledge from language models by searching for a linear direction in activation space that satisfies logical consistency (a statement and its negation have opposite projections). This direction is then used to answer yes-no questions from unlabeled activations. Across 6 models and 10 QA datasets, the method reportedly outperforms zero-shot accuracy by 4% on average, halves prompt sensitivity, and maintains accuracy under misleading prompts.
Significance. If the central claim holds, the result is significant: it offers a purely unsupervised route to internal model knowledge that is distinct from generated outputs and less sensitive to prompting. The empirical scope (multiple models and datasets) and the robustness experiments provide concrete evidence that consistency-based directions can recover factual information without ground-truth labels or model outputs.
major comments (3)
- [§3.2] §3.2 (Consistency objective): The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed; a sketch of such an ablation follows this list.
- [§4.3] §4.3 (Robustness to lying prompts): The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.
- [Table 2, §4.1] Table 2 and §4.1: The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.
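A concrete form of the ablation requested in the first major comment, as a hedged sketch: draw random unit directions, record both their consistency loss and their sign-of-projection accuracy, and check whether low loss alone buys accuracy. All names are hypothetical, and `evaluate_direction` refers to the evaluation sketch earlier on this page.

```python
import torch

def consistency_loss(v, b, acts_pos, acts_neg):
    """The selection objective, evaluated at a fixed (v, b)."""
    p_pos = torch.sigmoid(acts_pos @ v + b)
    p_neg = torch.sigmoid(acts_neg @ v + b)
    return (((p_pos - (1.0 - p_neg)) ** 2).mean()
            + torch.minimum(p_pos, p_neg).pow(2).mean()).item()

def random_direction_ablation(acts_pos, acts_neg, acts_eval, labels, n_draws=1000):
    """Consistency loss vs. accuracy for random unit directions.

    If truth is what the objective singles out, the lowest-loss random
    draws should still score near chance on held-out labels.
    """
    d = acts_pos.shape[1]
    results = []
    for _ in range(n_draws):
        v = torch.randn(d)
        v = v / v.norm()
        b = torch.zeros(1)
        results.append((consistency_loss(v, b, acts_pos, acts_neg),
                        evaluate_direction(v, b, acts_eval, labels)))
    return sorted(results)  # inspect accuracy among the lowest-loss draws
```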
minor comments (2)
- [Figure 2] Figure 2: The legend does not distinguish the zero-shot baseline from the consistency direction; add explicit labels or a separate panel.
- [§3.1] §3.1: Notation for the projection operator is introduced without an explicit equation; add Eq. (X) defining proj_v(a) = v·a / ||v||.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications and additional experiments where appropriate.
point-by-point responses
-
Referee: [§3.2] §3.2 (Consistency objective): The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed.
Authors: We agree that logical consistency is a necessary but not sufficient condition for recovering factual truth, as other negating features could in principle satisfy the objective. Our empirical results across multiple models and datasets show that the consistency-optimized direction reliably correlates with ground-truth labels on held-out factual questions, outperforming zero-shot baselines. To directly address the concern, we will add an ablation study in the revised manuscript that compares the consistency-selected direction against random vectors and directions optimized for alternative objectives, confirming that high consistency without factual correlation does not yield comparable accuracy. revision: yes
-
Referee: [§4.3] §4.3 (Robustness to lying prompts): The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.
Authors: The direction v is computed once on the original unlabeled set of activations and held fixed for all robustness experiments, including those with lying prompts. This isolates the latent knowledge encoded in the fixed direction from any prompt-induced shifts in the activation distribution. We will revise §4.3 to explicitly state this procedure and include a brief diagram clarifying the experimental flow. revision: yes
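A usage note on the protocol the authors describe, in terms of the sketches earlier on this page (variable names hypothetical):

```python
# Direction fitted once, on activations collected under the original
# unlabeled prompts, then held fixed for every robustness condition.
v, b = consistency_search(acts_pos, acts_neg)

acc_standard = evaluate_direction(v, b, acts_standard, labels)   # normal prompts
acc_lying = evaluate_direction(v, b, acts_lying_prompt, labels)  # "answer incorrectly" prompts
```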
-
Referee: [Table 2, §4.1] Table 2 and §4.1: The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.
Authors: We acknowledge the per-dataset variance in Table 2. In the revision we will add statistical significance tests for the average improvement (with Bonferroni correction for multiple comparisons) and report per-dataset p-values. We will also include new results in which the direction is selected using only a held-out subset of statements, confirming that the reported gains persist and are not due to using the full unlabeled pool for direction selection. revision: yes
Circularity Check
No significant circularity; unsupervised consistency objective evaluated on external labels
full rationale
The derivation finds a direction v in activation space by maximizing logical consistency (proj(a(s)) ≈ -proj(a(¬s))) over unlabeled statements. This objective uses only model activations and the negation operator; no ground-truth labels enter the optimization. Reported accuracy is measured against 10 external QA datasets whose labels are never used to select or fit v. No equation reduces the final accuracy to a fitted parameter, and no load-bearing step relies on self-citation of an unverified uniqueness result. The evaluation is therefore non-circular with respect to the external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A single linear direction in activation space exists such that projections of a statement and its negation are approximately opposite.
Lean theorems connected to this paper
-
LawOfExistence.defect_zero_iff_one (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values."
-
InevitabilityStructure.inevitability (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
-
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Positive Alignment: Artificial Intelligence for Human Flourishing
Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
-
The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
-
Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability
Geometric deviation of LLM hidden states from an answerable reference centroid provides a pre-generation signal for answerability that works reliably for mathematical prompts (ROC-AUC 0.78-0.84) but not factual ones.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.
-
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.
-
Emergent Manifold Separability during Reasoning in Large Language Models
Reasoning in LLMs produces a transient geometric pulse in which concept manifolds untangle into linearly separable subspaces immediately before computation and compress afterward.
-
The Internal State of an LLM Knows When It's Lying
Hidden activations in LLMs encode detectable information about statement truthfulness, enabling a classifier to identify true versus false content more reliably than the model's assigned probabilities.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Positive Alignment: Artificial Intelligence for Human Flourishing
Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.
Reference graph
Works this paper leans on
-
[1]
A general language assistant as a laboratory for alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. J. Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment.
-
[2]
Training a helpful and harmless assistant with reinforcement learning from human feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Da...
-
[3]
On the dangers of stochastic parrots: Can language models be too big?
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.
-
[4]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...
-
[5]
Language models are few-shot learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...
-
[6]
Deep reinforcement learning from human preferences
Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. ArXiv, abs/1706.03741.
-
[7]
Supervising strong learners by amplifying weak experts
Paul Francis Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. ArXiv, abs/1810.08575.
-
[8]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
-
[9]
Truthful AI: Developing and governing AI that does not lie
Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, and William Saunders. Truthful AI: Developing and governing AI that does not lie. ArXiv, abs/2110.06674.
-
[10]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. ArXiv, abs/2006.03654.
-
[11]
Unsolved problems in ML safety
Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ML safety. ArXiv, abs/2109.13916.
-
[12]
AI safety via debate
Geoffrey Irving, Paul Francis Christiano, and Dario Amodei. AI safety via debate. ArXiv, abs/1805.00899.
-
[13]
Maieutic prompting: Logically consistent reasoning with recursive explanations
Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. ArXiv, abs/2205.11822.
-
[14]
Alignment of language agents
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. ArXiv, abs/2103.14659.
-
[15]
Ground-truth labels matter: A deeper look into input-label demonstrations
Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, Kang Min Yoo, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations. ArXiv, abs/2205.12685.
-
[16]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
-
[17]
Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. ArXiv, abs/1811.07871.
-
[18]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.
-
[19]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ArXiv, abs/1711.05101.
- [20]
-
[21]
Teaching language models to support answers with verified quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nathan McAleese. Teaching language models to support answers with verified quotes. ArXiv, abs/2203.11147.
-
[22]
MetaICL: Learning to learn in context
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. ArXiv, abs/2110.15943, 2022a. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? ArXiv, abs/2202.12837, 2022b. Nasrin Mos...
-
[23]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332.
-
[24]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...
-
[25]
Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nathan McAleese, and Geoffrey Irving. Red teaming language models with language models. ArXiv, abs/2202.03286.
-
[26]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.
-
[27]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
-
[28]
Choice of plausible alternatives: An evaluation of commonsense causal reasoning
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
-
[29]
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...
-
[30]
Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. ArXiv, abs/2009.01325.
-
[31]
FEVER: a large-scale dataset for Fact Extraction and VERification
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and verification. ArXiv, abs/1803.05355.
-
[32]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
-
[33]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language model...
-
[34]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
-
[35]
Calibrate before use: Improving few-shot performance of language models
Tony Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. ArXiv, abs/2102.09690.
-
[36]
Prompt consistency for zero-shot task generalization
Chunting Zhou, Junxian He, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Prompt consistency for zero-shot task generalization. ArXiv, abs/2205.00049.