Challenges with unsupervised LLM knowledge discovery.arXiv:2312.10029, 2023

Sebastian Farquhar, Vikrant Varma, et al · 2023 · arXiv 2312.10029

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Rift: A Conflict Signature for Deception in Language Models

cs.LG · 2026-06-15 · conditional · novelty 8.0

Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Rift: A Conflict Signature for Deception in Language Models cs.LG · 2026-06-15 · conditional · none · ref 3
Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.
PRISM: Recovering Instruction Sets from Language Model Activations cs.AI · 2026-06-08 · unverdicted · none · ref 15
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
ToxiREX: A Dataset on Toxic REasoning in ConteXt cs.CL · 2026-06-26 · unverdicted · none · ref 137
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Challenges with unsupervised LLM knowledge discovery.arXiv:2312.10029, 2023

fields

years

verdicts

representative citing papers

citing papers explorer