
arxiv: 2606.10046 · v2 · pith:2TXMDTWEnew · submitted 2026-06-08 · 💻 cs.SD · cs.AI

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

classification 💻 cs.SD cs.AI
keywords attention, audio, layers, acoustic, dynamics, lsac, probing, quality

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
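The caching idea behind LSAC can be illustrated with a toy sampling loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the names `run_sampler`, `attention_fn`, and `stable_layers` are hypothetical, the "attention" is a stand-in scalar function, and the choice of which layers count as stable is supplied by hand rather than derived from any probing protocol.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical stand-in for a per-layer self-attention computation.
AttentionFn = Callable[[int, int, float], float]


def run_sampler(
    num_steps: int,
    num_layers: int,
    stable_layers: Set[int],
    attention_fn: AttentionFn,
) -> Tuple[float, int]:
    """Toy sampling loop that reuses cached attention in 'stable' layers.

    Returns the final scalar state and the number of attention calls made.
    Assumption: a stable layer's attention is computed once (at the first
    step) and reused unchanged for all later steps.
    """
    cache: Dict[int, float] = {}
    calls = 0
    state = 0.0
    for step in range(num_steps):
        for layer in range(num_layers):
            if layer in stable_layers and layer in cache:
                attn = cache[layer]  # reuse, skipping the computation
            else:
                attn = attention_fn(layer, step, state)
                calls += 1
                if layer in stable_layers:
                    cache[layer] = attn
            state += attn  # stand-in for the residual update
    return state, calls


def toy_attention(layer: int, step: int, state: float) -> float:
    # Deliberately step-independent so that caching is exact in this toy.
    return 0.1 * (layer + 1)


baseline_state, baseline_calls = run_sampler(8, 4, set(), toy_attention)
cached_state, cached_calls = run_sampler(8, 4, {0}, toy_attention)
saved_fraction = 1 - cached_calls / baseline_calls
print(baseline_calls, cached_calls, round(saved_fraction, 3))
```

In this toy setup, caching one of four layers over eight steps removes 7 of 32 attention calls. The figure says nothing about the savings or quality reported in the abstract, since a real layer's attention changes across steps and caching it is an approximation.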
