Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.
and Bau, David , year = 2023, eprint =
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.
citing papers explorer
-
Predicting Future Behaviors in Reasoning Models Enables Better Steering
Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.
-
PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
-
A framework for analyzing concept representations in neural models
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled from speaker info while speaker info resists compact containment.