
arxiv: 2605.23975 · v1 · pith:XL7GRZNYnew · submitted 2026-05-13 · 💻 cs.CL · cs.SD

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

Pith reviewed 2026-06-30 21:20 UTC · model grok-4.3

keywords code-switching · speech recognition · direct preference optimization · audio large language models · English-Mandarin · mixed error rate

The pith

Direct Preference Optimization aligns Audio LLMs to transcribe English-Mandarin code-switching speech by preferring preservation of mixed-language content over translation or omission.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Audio large language models show systematic failures when transcribing code-switching speech, including language omission, translation instead of transcription, and hallucination. The paper constructs 100K preference pairs from 570 hours of data where chosen responses keep the original mixed English-Mandarin content and rejected responses replicate those failure patterns. Training three Audio LLMs with Direct Preference Optimization produces consistent shifts so that models follow transcription prompts by preserving language composition. This alignment produces mixed error rate reductions up to 89.6 percent in-distribution and 20.0 percent out-of-distribution.

Core claim

Audio LLMs exhibit three failure modes in English-Mandarin code-switching transcription; Direct Preference Optimization on pairs that contrast correct mixed-language preservation against those failure modes elicits correct transcription behavior, yielding MER reductions up to 89.6 percent in-distribution and 20.0 percent out-of-distribution across three models.

What carries the argument

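Direct Preference Optimization applied to 100K preference pairs. Ground-truth transcriptions serve as the chosen responses and reward preservation of mixed-language content; LLM-generated rejected responses mimic the failure modes (language omission, translation, and hallucination) and are penalized.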
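
For readers unfamiliar with the objective, the per-pair DPO loss (Rafailov et al.) can be sketched in plain Python. This is a minimal illustration, not the paper's training code: the function name `dpo_loss`, the example log-probabilities, and the default `beta=0.1` are assumptions, since the abstract does not report hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

When the policy equals the reference model, the loss is log 2; it falls below that as the policy shifts probability toward the verbatim (chosen) transcription and away from the translated or truncated (rejected) one.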

If this is right

  • Models shift from translating or omitting mixed content to preserving language composition when prompted for transcription.
  • The behavioral change appears consistently across three different Audio LLMs after training on the same 100K pairs.
  • Gains occur both on data matching the preference-pair distribution and on out-of-distribution code-switching speech.
  • DPO can be used to align multilingual Audio LLMs for transcription tasks without changing the underlying model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference-pair construction could be applied to other language pairs or additional failure modes in speech transcription.
  • If the preference pairs scale efficiently, DPO may serve as a lightweight method to correct other systematic errors in Audio LLMs beyond code-switching.
  • The out-of-distribution MER reduction of up to 20.0 percent suggests the alignment may transfer to real-world mixed-language conversations not seen during training.

Load-bearing premise

The constructed preference pairs, whose rejected responses are LLM-generated imitations of the failure modes, accurately capture the three identified failure modes and do not introduce new biases or miss other real-world code-switching patterns that would affect generalization.

What would settle it

Measure mixed error rates on a held-out set of English-Mandarin code-switching utterances that exhibit failure patterns outside the three modes used to build the preference pairs; if rates do not drop after DPO training, the claim does not hold.
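
A check like this needs a concrete MER computation. The sketch below assumes the common convention of counting Mandarin at the character level and English at the word level; the paper's exact tokenization and text normalization are not given on this page, so `tokenize`, `mer`, and `relative_reduction` are hypothetical helpers rather than the authors' scoring script.

```python
import re

# Assumption: one token per CJK character, one token per English word.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z0-9']+")


def tokenize(text):
    """Split mixed text into Mandarin characters and lowercased English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def mer(references, hypotheses):
    """Corpus-level mixed error rate: total edits / total reference tokens."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total


def relative_reduction(before, after):
    """Relative MER reduction, e.g. 0.2 for a 20 percent drop."""
    return (before - after) / before


refs = ["我住temasek poly那边"]
print(mer(refs, ["我住temasek poly那边"]))      # verbatim transcription
print(mer(refs, ["I live near Temasek Poly"]))  # translation instead of transcription
```

Running the same function on the pre-DPO and post-DPO outputs for the held-out set, then passing both scores to `relative_reduction`, gives a figure comparable in form to the reductions quoted above.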

Figures

Figures reproduced from arXiv: 2605.23975 by Ai Ti Aw, Cheng Yi Lewis Won, Minh Duc Pham, Shuo Sun, Trung Nguyen Quang, Yingxu He.

Figure 1: Overview of DPO training for code-switching alignment. Ground-truth transcriptions serve as chosen responses (y_w), while an LLM generates rejected responses (y_l) that mimic failure modes via Global Translation (full) and Partial Translation (spans only). DPO trains the model to prefer verbatim code-switching output.
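
The pair layout the caption describes can be sketched as a small record type. This is an illustration under stated assumptions: `PreferencePair`, `build_pair`, and `toy_global_translation` are hypothetical names, the audio path is made up, and the hard-coded lookup table stands in for the LLM that generates rejected responses in the paper. The prompt and the example utterance are taken from the snippets extracted further down this page.

```python
from dataclasses import dataclass
from typing import Callable

PROMPT = "Please transcribe the speech in this audio file"


@dataclass(frozen=True)
class PreferencePair:
    audio_path: str
    prompt: str
    chosen: str    # y_w: ground-truth code-switching transcription
    rejected: str  # y_l: response that mimics a failure mode


def build_pair(audio_path: str, transcript: str,
               corrupt: Callable[[str], str]) -> PreferencePair:
    """Pair a ground-truth transcript with a corrupted version of it."""
    rejected = corrupt(transcript)
    if rejected == transcript:
        raise ValueError("rejected response must differ from the chosen one")
    return PreferencePair(audio_path, PROMPT, transcript, rejected)


# Stand-in for the LLM: a hard-coded, illustrative "global translation".
_TOY_TRANSLATIONS = {"我住temasek poly那边": "I live near Temasek Poly"}


def toy_global_translation(transcript: str) -> str:
    return _TOY_TRANSLATIONS.get(transcript, transcript)


pair = build_pair("utt_0001.wav", "我住temasek poly那边", toy_global_translation)
print(pair.chosen, "|", pair.rejected)
```

Rejecting pairs whose two sides are identical is a design choice of this sketch: such a pair carries no preference signal for DPO.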
Original abstract

Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields MER reductions up to 89.6% (in-distribution) and 20.0% (out-of-distribution). Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript identifies three failure modes in Audio LLMs for English-Mandarin code-switching ASR (language omission, translation-instead-of-transcription, hallucination). It constructs 100K DPO preference pairs (chosen responses preserve mixed-language content; rejected responses mimic the failure modes) from 570 hours of data, trains three Audio LLMs, and reports MER reductions up to 89.6% (in-distribution) and 20.0% (out-of-distribution).

Significance. If the reported gains hold under proper controls, the work demonstrates that targeted DPO can elicit correct code-switching transcription behavior from multilingual Audio LLMs, offering a practical alignment route that avoids full retraining. The inclusion of both in- and out-of-distribution results is a positive feature.

major comments (1)
  1. [Abstract] The claim of MER reductions up to 89.6% and 20.0% is stated without baselines, statistical details, error bars, or a description of how the 100K preference pairs were constructed and validated; without these, the quantitative support for the central claim cannot be assessed.
minor comments (2)
  1. Clarify the exact construction process for the preference pairs (e.g., annotation protocol, inter-annotator agreement) so readers can evaluate whether the pairs fully capture real-world code-switching patterns.
  2. Define all acronyms (MER, DPO, ASR) on first use and ensure consistent terminology for the three failure modes throughout.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract. We address the comment below and will revise accordingly.

  1. Referee: [Abstract] The claim of MER reductions up to 89.6% and 20.0% is stated without baselines, statistical details, error bars, or a description of how the 100K preference pairs were constructed and validated; without these, the quantitative support for the central claim cannot be assessed.

    Authors: We agree the abstract, being a concise summary, omits these supporting details. The full manuscript reports baselines and comparisons in the experimental results section, includes statistical details with error bars in the main results tables and figures, and describes the construction and validation of the 100K preference pairs (including failure mode simulation and quality checks) in the data preparation subsection. To address the concern, we will revise the abstract to briefly reference the evaluation protocol, key baselines used, and the preference pair generation process while remaining within length constraints.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper reports empirical results from applying DPO to three Audio LLMs using 100K preference pairs (ground-truth chosen responses, LLM-generated rejected responses), with MER reductions measured on held-out in-distribution and out-of-distribution test sets. No equations, derivations, or self-citations appear in the provided text that reduce any claimed result to its inputs by construction. The central claims rest on experimental outcomes rather than tautological definitions, fitted parameters renamed as predictions, or load-bearing self-citations. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central result rests on the assumption that the preference data construction faithfully represents desired transcription behavior and that the observed MER drops are caused by the DPO alignment rather than other training factors.

free parameters (1)
  • DPO beta and other training hyperparameters
    Standard DPO training requires choices for learning rate, beta, and batch size that are not reported in the abstract.
axioms (1)
  • Domain assumption: preference pairs correctly encode desired vs. undesired transcription behavior
    This assumption underpins the entire DPO training loop and is stated implicitly by the construction of chosen/rejected responses.

pith-pipeline@v0.9.1-grok · 5683 in / 1273 out tokens · 34233 ms · 2026-06-30T21:20:16.777850+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 8 canonical work pages · 7 internal anchors

  1. [1]

    Introduction. Audio large language models (Audio LLMs) extend large language models with the ability to process and understand audio inputs alongside text, enabling tasks such as speech recognition, audio captioning, and spoken dialogue [1, 2, 3]. Since Whisper [1] established the foundation through large-scale weak supervision, numerous models have ...

  2. [2]

    我住temasek poly那边

    Method. Our approach consists of two main components: (1) constructing preference pairs where ground-truth code-switching transcriptions serve as chosen responses and LLM-generated flawed transcriptions serve as rejected responses, and (2) applying DPO to align model transcription behavior toward the correct code-switching output. Figure 1 provides an ...

  3. [3]

    Please transcribe the speech in this audio file

    Experimental setup. 3.1. Models. To demonstrate the generalizability of our approach, we experiment with three multilingual Audio LLMs, which are already trained in both English and Mandarin. Table 4 summarizes the training configurations. MERaLiON-2-3B [7] is specifically designed for Southeast Asian multilingual speech and includes extensive code-swit...

  4. [4]

    Quantitative analysis Table 6 presents MER scores across all models and bench- marks

    Results. 4.1. Quantitative analysis. Table 6 presents MER scores across all models and benchmarks. These results show that DPO consistently improves code-switching transcription performance across all configurations, though the magnitude of improvement varies considerably by model and benchmark. MERaLiON-2-3B shows modest SEAME improvements (0.7–2.0%),...

  5. [5]

    Discussion. Limitations. While our results are encouraging, several aspects of our approach present opportunities for refinement. First, we focus exclusively on English-Mandarin code-switching; generalization to other language pairs remains untested. Second, our rejected samples are synthetic transformations rather than samples drawn from the model’s actual ...

  6. [6]

    Conclusion. In this work, we applied Direct Preference Optimization to address English-Mandarin code-switching failures in three Audio LLMs: MERaLiON-2-3B, Phi-4-multimodal-instruct, and Qwen2-Audio-7B-Instruct. Starting from the observation that these models exhibit systematic failure modes – language omission, translation-instead-of-transcription, ...

  7. [7]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  8. [8]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  9. [9]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” arXiv preprint arXiv:2310.13289, 2023

  10. [10]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  11. [11]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3-omni technical report,” arXiv...

  12. [12]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” arXiv preprint arXiv:2503.01743, 2025

  13. [13]

    MERaLiON-AudioLLM: Advancing speech and language understanding for Singapore,

    Y. He, Z. Liu, G. Lin, S. Sun, B. Wang, W. Zhang, X. Zou, N. F. Chen, and A. Aw, “MERaLiON-AudioLLM: Advancing speech and language understanding for Singapore,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), P. Mishra, S. Muresan, and T. Yu, Eds. Vienna, Austria: Association fo...

  14. [14]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the twelfth language resources and evaluation conference, 2020, pp. 4218–4222

  15. [15]

    Fleurs: Few-shot learning evaluation of universal representations of speech,

    A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 798–805

  16. [16]

    Seame: a mandarin-english code-switching speech corpus in south-east asia,

    D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, “Seame: a mandarin-english code-switching speech corpus in south-east asia,” in Interspeech 2010, 2010, pp. 1986–1989

  17. [17]

    On the end-to-end solution to mandarin-english code-switching speech recognition,

    Z. Zeng, Y. Khassanov, T. Pham, H. Xu, E. S. Chng, and H. Li, “On the end-to-end solution to mandarin-english code-switching speech recognition,” in Proc. Interspeech 2019, 2019, pp. 2165–2169

  18. [18]

    A first speech recognition system for mandarin-english code-switch conversational speech,

    N. T. Vu, D.-C. Lyu, J. Weiner, D. Telaar, T. Schlippe, F. Blaicher, E.-S. Chng, T. Schultz, and H. Li, “A first speech recognition system for mandarin-english code-switch conversational speech,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4889–4892

  19. [19]

    Combination of recurrent neural networks and factored language models for code-switching language modeling,

    H. Adel, N. T. Vu, and T. Schultz, “Combination of recurrent neural networks and factored language models for code-switching language modeling,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), 2013, pp. 206–211

  20. [20]

    Speech collage: Code-switched audio generation by collaging monolingual corpora,

    A. Hussein, D. Zeinali, O. Klejch, M. Wiesner, B. Yan, S. Chowdhury, A. Ali, S. Watanabe, and S. Khudanpur, “Speech collage: Code-switched audio generation by collaging monolingual corpora,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 12006–12010

  21. [21]

    Can we train ASR systems on code-switch without real code-switch data? Case study for Singapore’s languages,

    T. Nguyen and H. D. Tran, “Can we train ASR systems on code-switch without real code-switch data? Case study for Singapore’s languages,” in Proc. Interspeech, 2025, pp. 753–757

  22. [22]

    Adapting whisper for code-switching through encoding refining and language-aware decoding,

    J. Zhao, H. Shi, C. Cui, T. Wang, H. Liu, Z. Ni, L. Ye, and L. Wang, “Adapting whisper for code-switching through encoding refining and language-aware decoding,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  23. [23]

    Reducing language confusion for code-switching speech recognition with token-level language diarization,

    H. Liu, H. Xu, L. P. Garcia, A. W. H. Khong, Y. He, and S. Khudanpur, “Reducing language confusion for code-switching speech recognition with token-level language diarization,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  24. [24]

    Enhancing code-switching speech recognition with interactive language biases,

    H. Liu, L. P. Garcia, X. Zhang, A. W. H. Khong, and S. Khudanpur, “Enhancing code-switching speech recognition with interactive language biases,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10886–10890

  25. [25]

    Boosting code-switching asr with mixture of experts enhanced speech-conditioned llm,

    F. Zhang, W. Geng, H. Huang, Y. Shan, C. Yi, and H. Qu, “Boosting code-switching asr with mixture of experts enhanced speech-conditioned llm,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  26. [26]

    SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR,

    S. Ye, S. Chen, X. Hu, and X. Xu, “SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR,” in Interspeech 2024, 2024, pp. 3999–4003

  27. [27]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in neural information processing systems, vol. 36, pp. 53728–53741, 2023

  28. [28]

    Speechalign: Aligning speech generation to human preferences,

    D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y. Zhou, and X. Qiu, “Speechalign: Aligning speech generation to human preferences,” Advances in Neural Information Processing Systems, vol. 37, pp. 50343–50360, 2024

  29. [29]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  30. [30]

    CS-Dialogue: A 104-hour dataset of spontaneous Mandarin-English code-switching dialogues for speech recognition,

    J. Zhou, Y. Guo, S. Zhao, H. Sun, H. Wang, J. He, A. Kong, S. Wang, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “CS-Dialogue: A 104-hour dataset of spontaneous Mandarin-English code-switching dialogues for speech recognition,” arXiv preprint arXiv:2502.18913, 2025

  31. [31]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,

    H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 885–890

  32. [32]

    Simpo: Simple preference optimization with a reference-free reward,

    Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” Advances in Neural Information Processing Systems, vol. 37, pp. 124198–124235, 2024

  33. [33]

    mDPO: Conditional preference optimization for multimodal large language models,

    F. Wang, W. Zhou, J. Y. Huang, N. Xu, S. Zhang, H. Poon, and M. Chen, “mDPO: Conditional preference optimization for multimodal large language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, N...