pith. sign in

Challenges with unsupervised LLM knowledge discovery.arXiv:2312.10029, 2023

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

years

2026 3

clear filters

representative citing papers

Rift: A Conflict Signature for Deception in Language Models

cs.LG · 2026-06-15 · conditional · novelty 8.0

Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • Rift: A Conflict Signature for Deception in Language Models cs.LG · 2026-06-15 · conditional · none · ref 3

    Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.

  • PRISM: Recovering Instruction Sets from Language Model Activations cs.AI · 2026-06-08 · unverdicted · none · ref 15

    PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

  • ToxiREX: A Dataset on Toxic REasoning in ConteXt cs.CL · 2026-06-26 · unverdicted · none · ref 137

    ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.