Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.
Challenges with unsupervised LLM knowledge discovery.arXiv:2312.10029, 2023
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
citing papers explorer
-
Rift: A Conflict Signature for Deception in Language Models
Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.
-
PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
-
ToxiREX: A Dataset on Toxic REasoning in ConteXt
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.