Multimodal In-context Learning for ASR of Low-resource Languages

2026 · cs.CL · arXiv 2601.05707

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.

representative citing papers

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

eess.AS · 2026-05-26 · unverdicted · novelty 4.0

FSA-GRPO applies reinforcement learning with a few-shot-aware reward to auditory LLMs, improving few-shot performance on children's ASR, speech translation, and audio tasks when trained only on adult data.

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.

citing papers explorer

FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations

eess.AS · 2026-05-26 · unverdicted · none · ref 51 · internal anchor

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

cs.CL · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
