arxiv: 2606.19940 · v1 · pith:KIIFJAEZnew · submitted 2026-06-18 · 📡 eess.AS

Analyzing Language and Geographical Variation in Speech Representations Across 60 Indic Languages

keywords: language, supervision, classification, district, geographical, language-district, variation, joint

Self-supervised speech encoders are often fine-tuned with language supervision alone, which can overlook geographical variation. To understand how representations learned under joint language and district supervision differ from those learned under language-only supervision, we fine-tune Whisper-base and Wav2Vec2.0-base on two classification tasks: joint language-district classification (386 classes) and language-only classification (60 languages). Language-district supervision improves district discrimination conditioned on language in the embedding space while maintaining strong marginal language classification. We analyze the structure of the learned embeddings using Normalized Conditional Mutual Information (NCMI), showing that language-district supervision produces global language clusters with structured within-language subclusters aligned to district variation, enhancing geographical separability without degrading language-level organization.
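
The abstract does not define NCMI, so the following is only an illustrative sketch of one plausible reading: the conditional mutual information I(C; D | L) between cluster assignments C and district labels D given language L, divided by the conditional entropy H(D | L). The normalization choice, the function names, and the use of discrete cluster assignments in place of continuous embeddings are all assumptions made for illustration, not the authors' stated method.

```python
import math
from collections import Counter


def conditional_entropy(districts, languages):
    """H(D | L) in nats, estimated from label counts."""
    n = len(districts)
    joint = Counter(zip(districts, languages))
    lang = Counter(languages)
    return -sum(k / n * math.log(k / lang[l]) for (_, l), k in joint.items())


def conditional_mutual_information(clusters, districts, languages):
    """I(C; D | L) in nats, estimated from label counts."""
    n = len(clusters)
    cdl = Counter(zip(clusters, districts, languages))
    cl = Counter(zip(clusters, languages))
    dl = Counter(zip(districts, languages))
    lang = Counter(languages)
    total = 0.0
    for (c, d, l), k in cdl.items():
        # p(c,d,l) * log[ p(l) p(c,d,l) / (p(c,l) p(d,l)) ], written with counts
        total += k / n * math.log(k * lang[l] / (cl[(c, l)] * dl[(d, l)]))
    return total


def ncmi(clusters, districts, languages):
    """Assumed normalization: I(C; D | L) / H(D | L), which lies in [0, 1]."""
    h = conditional_entropy(districts, languages)
    if h == 0.0:
        return 0.0  # no district variation within any language
    return conditional_mutual_information(clusters, districts, languages) / h


# Hypothetical toy labels: two languages, two districts each.
languages = ["hi", "hi", "hi", "hi", "ta", "ta", "ta", "ta"]
districts = ["a", "a", "b", "b", "c", "c", "d", "d"]
district_aware = [0, 0, 1, 1, 2, 2, 3, 3]  # clusters split by district
language_only = [0, 0, 0, 0, 1, 1, 1, 1]   # clusters split by language only

score_aware = ncmi(district_aware, districts, languages)
score_lang = ncmi(language_only, districts, languages)
```

Under this reading, clusters that separate districts within each language score close to 1, while clusters that track language only score close to 0, which matches the qualitative contrast the abstract describes between the two supervision schemes.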
