A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models

James Bailey; Siyi Wang; Ting Dang

arxiv: 2607.00946 · v1 · pith:BMTL7K6Gnew · submitted 2026-07-01 · 💻 cs.SD · cs.LG

A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models

Siyi Wang , James Bailey , Ting Dang This is my paper

Pith reviewed 2026-07-02 06:05 UTC · model grok-4.3

classification 💻 cs.SD cs.LG

keywords emotion steeringtext-to-speechspeech language modelsconditional flow matchingactivation steeringrepresentation geometryspeaker-emotion disentanglementmixed-emotion synthesis

0 comments

The pith

SLM modules provide a clean low-dimensional emotion subspace with strong speaker-emotion disentanglement unlike CFM modules in text-to-speech systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares speech language model and conditional flow-matching modules as sites for activation steering of mixed emotions in hybrid text-to-speech systems. It characterizes their emotion representations through linear probing and local intrinsic dimensionality measurements. The analysis finds that SLM maintains a low-dimensional emotion-specific subspace separated from speaker identity while CFM entangles the two. Joint steering across sites raises emotion intensity yet harms proportional control and overall speech quality on familiar data. A reader would care because the geometric differences directly affect how reliably systems can combine emotions without unwanted side effects.

Core claim

SLM offers a clean, low-dimensional emotion-specific subspace with strong speaker--emotion disentanglement, while CFM exhibits poor cross-speaker generalization due to speaker--emotion entanglement. Joint steering increases emotion intensity but degrades proportional control and speech quality on in-distribution data.

What carries the argument

Linear probing and local intrinsic dimensionality measurements applied to SLM and CFM modules as activation steering sites.

Load-bearing premise

Linear probing and local intrinsic dimensionality measurements on the modules accurately reflect the steerability properties relevant to mixed-emotion synthesis in real deployment.

What would settle it

A direct test of mixed-emotion synthesis where SLM steering fails to show superior disentanglement or control compared with CFM would falsify the geometric advantage claim.

Figures

Figures reproduced from arXiv: 2607.00946 by James Bailey, Siyi Wang, Ting Dang.

**Figure 1.** Figure 1: Geometry analysis of SLM (a–c) and CFM (d–f). (a,d) Per-layer emotion discriminability (linear-probe accuracy); blue = within-speaker, red = cross-speaker; shading in (d) shows the accuracy range across denoising steps. (b,e) Pooled LID across layers (CFM shown per denoising step). (c,f) ∆LID = LIDpooled − LIDper-emo across layers. sample 4,000 utterances for both per-emotion and pooled estimates, with k=5… view at source ↗

read the original abstract

While prior work has explored emotion control in hybrid text-to-speech systems, the geometric properties of these modules, and their implications for steerability, remain poorly understood. We present the first comparative study of speech language model (SLM) and conditional flow-matching (CFM) modules as activation steering sites for mixed emotion speech synthesis. We first characterize emotion representations using linear probing and local intrinsic dimensionality (LID), and then evaluate single-site and joint steering for mixed-emotion synthesis. Our results show that SLM offers a clean, low-dimensional emotion-specific subspace with strong speaker--emotion disentanglement, while CFM exhibitspoor cross-speaker generalization due to speaker--emotion entanglement. Joint steering increases emotion intensity but degrades proportional control and speech quality on in-distribution data. These findings provide practical guidance for multi-site activation steering in hybrid TTS systems and highlight the importance of representation geometry in controllable speech generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SLM gives a cleaner low-dimensional emotion subspace with better disentanglement than CFM in hybrid TTS, and joint steering trades intensity for control and quality.

read the letter

SLM gives a cleaner low-dimensional emotion subspace with better disentanglement than CFM in hybrid TTS, and joint steering trades intensity for control and quality.

The paper's real addition is the direct comparison of SLM and CFM as steering sites. It uses linear probing and local intrinsic dimensionality to map the emotion representations, then runs single-site and joint activation steering experiments on mixed-emotion synthesis. That combination of geometry characterization plus downstream steering tests is new for this subfield and supplies the practical guidance the abstract promises.

The work connects the representation properties to the steering outcomes without obvious gaps in the logic. The stress-test note confirms the downstream results support the claims about subspaces and entanglement rather than relying only on the proxies, which keeps the argument grounded.

The main limitation is the focus on in-distribution data. Cross-speaker generalization for CFM is flagged as poor, but without reported effect sizes or out-of-distribution checks it is hard to judge how large the practical difference is. Minor details on metrics and baselines would help, but nothing looks load-bearing.

This is for people already working on controllable hybrid TTS or activation steering in speech models. A reader who needs concrete advice on where to place steering vectors will get usable takeaways.

It deserves a serious referee. The experimental structure is coherent and the question is narrow but well-defined, so the paper is worth the time even if it needs more numbers and generalization tests.

Referee Report

0 major / 2 minor

Summary. The paper conducts the first comparative geometric study of SLM and CFM modules as activation-steering sites for mixed-emotion synthesis in hybrid TTS. It characterizes emotion representations via linear probing and local intrinsic dimensionality (LID), then evaluates single-site and joint steering. Results indicate that SLM yields a clean, low-dimensional emotion-specific subspace with strong speaker–emotion disentanglement, while CFM exhibits speaker–emotion entanglement and poor cross-speaker generalization; joint steering raises emotion intensity at the cost of proportional control and speech quality on in-distribution data.

Significance. If the empirical steering results hold, the work supplies concrete guidance for multi-site activation steering in hybrid TTS and demonstrates that representation geometry (disentanglement, dimensionality) directly predicts steerability trade-offs. The explicit linkage of geometric proxies to downstream mixed-emotion synthesis experiments is a strength.

minor comments (2)

[Abstract] Abstract: 'CFM exhibitspoor' is a typographical error (missing space).
The manuscript would benefit from explicit statements of the TTS backbone architectures, training corpora, and steering hyper-parameters to support reproducibility of the reported intensity/control/quality trade-offs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance in linking geometric properties to steerability, and recommendation for minor revision. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper characterizes SLM and CFM modules via linear probing and LID, then directly evaluates single-site and joint activation steering on mixed-emotion tasks. No equations, derivations, or self-citations are presented that reduce any reported prediction or claim to a fitted input or prior self-result by construction. The central claims (clean subspace, entanglement, intensity-control trade-off) rest on the downstream empirical steering outcomes, which are independent of the geometric proxies. This is the common case of a self-contained empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all claims rest on unstated assumptions about representation geometry and steering validity.

pith-pipeline@v0.9.1-grok · 5682 in / 909 out tokens · 16361 ms · 2026-07-02T06:05:16.722051+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 10 canonical work pages · 8 internal anchors

[1]

Understanding intermediate layers using linear classifier probes

Alain, G. and Bengio, Y . Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Anastassiou, P., Chen, J., Chen, J., Chen, Y ., Chen, Z., Chen, Z., Cong, J., Deng, L., Ding, C., Gao, L., et al. Seed- tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Representation engineering for large-language mod- els: Survey and research challenges.arXiv preprint arXiv:2502.17601,

Bartoszcze, L., Munshi, S., Sukidi, B., Yen, J., Yang, Z., Williams-King, D., Le, L., Asuzu, K., and Maple, C. Representation engineering for large-language mod- els: Survey and research challenges.arXiv preprint arXiv:2502.17601,

work page arXiv
[4]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Du, Z., Wang, Y ., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y ., Gao, C., Wang, H., et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Gao, X., Zhang, C., Chen, Y ., Zhang, H., and Chen, N. F. Emo-dpo: Controllable emotional speech synthe- sis through direct preference optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025
[6]

Prompttts: Controllable text-to-speech with text descriptions

Guo, Z., Leng, Y ., Wu, Y ., Zhao, S., and Tan, X. Prompttts: Controllable text-to-speech with text descriptions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2023
[7]

Flow Matching for Generative Modeling

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

emotion2vec: Self-supervised pre-training for speech emotion representation

Ma, Z., Zheng, Z., Ye, J., Li, J., Gao, Z., Zhang, S., and Chen, X. emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the As- sociation for Computational Linguistics: ACL 2024, pp. 15747–15760,

2024
[9]

Qwen2.5 Technical Report

URL https: //arxiv.org/abs/2412.15115. 5 A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large- scale weak supervision. InInternational conference on machine learning, pp. 28492–28518. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Controlling language and diffusion models by transporting activations

Rodriguez, P., Blaas, A., Klein, M., Zappella, L., Apostoloff, N., Suau, X., et al. Controlling language and diffusion models by transporting activations. InInternational Con- ference on Learning Representations, volume 2025, pp. 89812–89855,

2025
[11]

Steering Language Models With Activation Engineering

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering lan- guage models with activation engineering.arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Wang, S., Tan, S., Liu, S., Jia, H., Huang, G., Bailey, J., and Dang, T. Cocoemo: Composable and controllable human-like emotional tts via activation steering.arXiv preprint arXiv:2602.03420,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

Xie, T., Yang, S., Li, C., Yu, D., and Liu, L. Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

work page arXiv
[14]

Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset

Zhou, K., Sisman, B., Liu, R., and Li, H. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. InICASSP 2021-2021 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 920–924. IEEE,

2021
[15]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Understanding intermediate layers using linear classifier probes

Alain, G. and Bengio, Y . Understanding intermediate layers using linear classifier probes.arXiv preprint arXiv:1610.01644,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Anastassiou, P., Chen, J., Chen, J., Chen, Y ., Chen, Z., Chen, Z., Cong, J., Deng, L., Ding, C., Gao, L., et al. Seed- tts: A family of high-quality versatile speech generation models.arXiv preprint arXiv:2406.02430,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Representation engineering for large-language mod- els: Survey and research challenges.arXiv preprint arXiv:2502.17601,

Bartoszcze, L., Munshi, S., Sukidi, B., Yen, J., Yang, Z., Williams-King, D., Le, L., Asuzu, K., and Maple, C. Representation engineering for large-language mod- els: Survey and research challenges.arXiv preprint arXiv:2502.17601,

work page arXiv

[4] [4]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Du, Z., Wang, Y ., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y ., Gao, C., Wang, H., et al. Cosyvoice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Gao, X., Zhang, C., Chen, Y ., Zhang, H., and Chen, N. F. Emo-dpo: Controllable emotional speech synthe- sis through direct preference optimization. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025

[6] [6]

Prompttts: Controllable text-to-speech with text descriptions

Guo, Z., Leng, Y ., Wu, Y ., Zhao, S., and Tan, X. Prompttts: Controllable text-to-speech with text descriptions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2023

[7] [7]

Flow Matching for Generative Modeling

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

emotion2vec: Self-supervised pre-training for speech emotion representation

Ma, Z., Zheng, Z., Ye, J., Li, J., Gao, Z., Zhang, S., and Chen, X. emotion2vec: Self-supervised pre-training for speech emotion representation. InFindings of the As- sociation for Computational Linguistics: ACL 2024, pp. 15747–15760,

2024

[9] [9]

Qwen2.5 Technical Report

URL https: //arxiv.org/abs/2412.15115. 5 A Geometric Perspective on Composable Emotion Steering in Text-to-Speech Models Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large- scale weak supervision. InInternational conference on machine learning, pp. 28492–28518. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Controlling language and diffusion models by transporting activations

Rodriguez, P., Blaas, A., Klein, M., Zappella, L., Apostoloff, N., Suau, X., et al. Controlling language and diffusion models by transporting activations. InInternational Con- ference on Learning Representations, volume 2025, pp. 89812–89855,

2025

[11] [11]

Steering Language Models With Activation Engineering

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering lan- guage models with activation engineering.arXiv preprint arXiv:2308.10248,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Wang, S., Tan, S., Liu, S., Jia, H., Huang, G., Bailey, J., and Dang, T. Cocoemo: Composable and controllable human-like emotional tts via activation steering.arXiv preprint arXiv:2602.03420,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

Xie, T., Yang, S., Li, C., Yu, D., and Liu, L. Emosteer- tts: Fine-grained and training-free emotion-controllable text-to-speech via activation steering.arXiv preprint arXiv:2508.03543,

work page arXiv

[14] [14]

Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset

Zhou, K., Sisman, B., Liu, R., and Li, H. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset. InICASSP 2021-2021 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 920–924. IEEE,

2021

[15] [15]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405,

work page internal anchor Pith review Pith/arXiv arXiv