CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

2026 · cs.CL · arXiv 2604.10031

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. This misalignment between internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically characterizing how internal layers encode fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.
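
The abstract names activation steering within selected layers but gives no implementation details. As a rough illustration only, the sketch below shows one common generic form of the idea: a difference-of-means steering vector added to the hidden state after chosen layers. It is not the CoSToM method. The toy layers, the names `mean_difference` and `run_with_steering`, and the `alpha` scale are all hypothetical, and a real model would operate on tensors inside a transformer rather than on plain lists.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]


def mean_difference(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Hypothetical steering direction: mean of 'positive' activations
    minus mean of 'negative' activations (a generic recipe, assumed here)."""
    dim = len(pos[0])
    pos_mean = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    neg_mean = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def run_with_steering(
    layers: Sequence[Layer],
    hidden: Vector,
    steer: Vector,
    target_layers: Set[int],
    alpha: float = 1.0,
) -> Vector:
    """Run a toy layer stack, adding alpha * steer to the hidden state
    right after each layer whose index is in target_layers."""
    for idx, layer in enumerate(layers):
        hidden = layer(hidden)
        if idx in target_layers:
            hidden = [h + alpha * s for h, s in zip(hidden, steer)]
    return hidden


# Toy stand-ins for model layers (assumptions, not real transformer blocks).
toy_layers: List[Layer] = [
    lambda h: [2.0 * x for x in h],  # layer 0: scale by 2
    lambda h: [x + 1.0 for x in h],  # layer 1: shift by 1
]

steer_vec = mean_difference(
    pos=[[1.0, 0.0], [3.0, 0.0]],
    neg=[[0.0, 1.0], [0.0, 3.0]],
)
baseline = run_with_steering(toy_layers, [1.0, 1.0], steer_vec, set())
steered = run_with_steering(toy_layers, [1.0, 1.0], steer_vec, {0}, alpha=0.5)
```

In this toy setup only layer 0 is steered, so the intervention is confined to the layers chosen beforehand; in the setting the abstract describes, that choice would come from the causal tracing step.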

representative citing papers

Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

Temporal Attractor Steering resolves 29-57% of parametric temporal conflicts in open-weight LLMs while preserving 85-99% accuracy on non-conflict queries.

citing papers explorer

Right Knowledge, Wrong Answer: Test-Time Steering for Temporal Fact Conflicts in Open-Weight Language Models

cs.LG · 2026-06-18 · unverdicted · none · ref 67 · internal anchor

Temporal Attractor Steering resolves 29-57% of parametric temporal conflicts in open-weight LLMs while preserving 85-99% accuracy on non-conflict queries.
