arxiv: 2606.10231 · v2 · pith:AOMELPD6new · submitted 2026-06-08 · 📡 eess.AS · cs.SD

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Pith reviewed 2026-06-27 14:39 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords encoder-free speech LLM · Mel spectrogram · speech-language modeling · automatic speech recognition · text-to-speech · multimodal initialization · linear projection

The pith

An LLM can process Mel spectrograms directly through a linear projection and learn speech-text alignment without any dedicated speech encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can handle raw audio input by reading Mel spectrogram patches straight into their own parameters instead of relying on a separate pre-trained speech encoder. It introduces Mel-LLM, which projects these patches linearly into the LLM and trains the model end-to-end on ASR and TTS tasks. With initialization from a multimodal checkpoint, the encoder-free version stays competitive with encoder-based baselines on public ASR sets and production scaling experiments, showing only limited degradation even when data is scarce. Ablations identify which LLM layers matter less for speech, and early TTS results using a next-token VAE confirm the same architecture can generate speech autoregressively. This matters because it removes the need for a separate encoder component and points toward a single unified model for speech and text.

Core claim

Feeding lightly pre-processed Mel spectrogram patches directly into an LLM via linear projection lets the model learn speech-text alignment through its own parameters; when initialized from a multimodal checkpoint, this encoder-free approach achieves competitive ASR performance with only limited degradation relative to encoder-initialized counterparts on OpenASR benchmarks and scaled experiments, while ablation studies show certain layers contribute less to speech encoding and preliminary next-token VAE results establish feasibility for TTS in the same unified architecture.

What carries the argument

Linear projection of Mel spectrogram patches directly into the LLM input space, allowing internal learning of speech representations without an external encoder.
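As a rough illustration of the mechanism named above, the following sketch groups consecutive Mel frames into flat patches and maps each patch to the model width with a single affine layer. It is a toy under stated assumptions, not the paper's implementation: the non-overlapping patching, the sizes `N_MELS`, `PATCH_FRAMES`, and `D_MODEL`, and the helper names `patchify` and `linear_project` are all hypothetical, and any additional downsampling the authors apply before the projection is left out. A real system would use a tensor library; plain lists are used here only to keep the example dependency-free.

```python
import random


def patchify(mel, patch_frames):
    """Group consecutive frames into flat patches.

    mel: list of T frames, each a list of n_mels floats.
    Returns T // patch_frames patches, each of length patch_frames * n_mels.
    Trailing frames that do not fill a whole patch are dropped.
    """
    n_patches = len(mel) // patch_frames
    patches = []
    for p in range(n_patches):
        flat = []
        for frame in mel[p * patch_frames:(p + 1) * patch_frames]:
            flat.extend(frame)
        patches.append(flat)
    return patches


def linear_project(patches, weight, bias):
    """Apply y = W x + b to every patch.

    weight: d_model rows, each holding patch_dim floats.
    bias: d_model floats.
    """
    return [
        [sum(w * x for w, x in zip(row, patch)) + b
         for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Toy sizes (assumptions for illustration only).
N_MELS, PATCH_FRAMES, D_MODEL, T = 4, 2, 3, 7

rng = random.Random(0)
mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(T)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(N_MELS * PATCH_FRAMES)]
          for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

patches = patchify(mel, PATCH_FRAMES)
speech_embeddings = linear_project(patches, weight, bias)

# One embedding per patch, each as wide as the (toy) model dimension.
print(len(speech_embeddings), len(speech_embeddings[0]))
```

In this picture the resulting vectors would sit in the input sequence alongside text token embeddings, and the projection weights would be trained together with the LLM parameters.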

If this is right

  • Encoder-free models reach near parity with encoder-based ones on ASR when data is limited and multimodal initialization is used.
  • Ablation results indicate specific LLM layers can be deprioritized for speech encoding without major loss.
  • The same linear-projection architecture supports both recognition and generation in an autoregressive speech-text model.
  • Preliminary TTS results confirm a fully unified encoder-free pipeline is possible in both directions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Removing the speech encoder could lower overall model size and inference cost in deployed systems.
  • The approach may generalize to other audio modalities if similar patch projections are applied.
  • Scaling training data beyond the limited regime tested here could close the remaining performance gap without changing the architecture.
  • This setup invites direct comparison of cross-modal alignment learned inside one transformer versus alignment learned across separate encoders.

Load-bearing premise

Initialization from a multimodal checkpoint supplies enough prior knowledge that the LLM can interpret spectrogram patches effectively even when training data is limited.

What would settle it

Train the identical architecture from a text-only or random initialization on the same limited data. Substantially larger ASR accuracy drops on the OpenASR sets would confirm that the multimodal checkpoint, not the linear-projection architecture alone, supplies the needed prior; comparable accuracy would show the premise is not load-bearing.
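The accuracy comparison described above is normally reported as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, assumption-laden version. The function name `word_error_rate` is hypothetical, whitespace tokenization stands in for the text normalization that benchmark evaluations typically apply first, and it scores a single utterance rather than a whole test set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

Running the same scoring on outputs from the multimodal-initialized model and from a text-only or randomly initialized model, over identical test utterances, gives the gap the test above asks about.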

Original abstract

Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrogram directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into an LLM via a linear projection. It claims this allows the LLM to learn speech-text alignment purely through its own parameters, achieving competitive ASR performance on OpenASR public sets and production scaling experiments with only limited degradation relative to encoder-initialized models. The work highlights that multimodal initialization from Phi-4-MM is crucial on limited data, presents ablations on LLM layers for speech encoding, and shows preliminary TTS results via a next-token VAE approach.

Significance. If the empirical results hold under full scrutiny, the work would demonstrate the viability of removing dedicated speech encoders from Speech-LLMs, enabling simpler unified autoregressive speech-text architectures. The ablation findings on layer relevance and the role of multimodal initialization would offer practical guidance for training such models on limited data.

major comments (2)
  1. [Abstract] Abstract: The central claim that the LLM learns speech-text alignment 'purely through its own parameters' from spectrogram patches is directly qualified by the statement that 'when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance' and that without it 'the performance gap to encoder-initialized models widens substantially.' This makes the encoder-free viability conditional on inheriting multimodal capabilities from the specific checkpoint rather than a general property of the linear-projection + LLM architecture.
  2. [Abstract] Abstract: No quantitative ASR metrics, error bars, dataset sizes, training details, or ablation numbers are provided, and full experimental protocols (including data exclusion rules) are absent. This prevents verification of the 'competitive performance with only limited degradation' claim against encoder-initialized baselines.
minor comments (1)
  1. [Abstract] The TTS section describes results as 'preliminary' and 'not yet optimal' without quantitative metrics or comparison baselines; this section could be expanded or moved to supplementary material if the focus is ASR.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting areas where the abstract could be strengthened for clarity and verifiability. We address the two major comments point by point below.

  1. Referee: [Abstract] Abstract: The central claim that the LLM learns speech-text alignment 'purely through its own parameters' from spectrogram patches is directly qualified by the statement that 'when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance' and that without it 'the performance gap to encoder-initialized models widens substantially.' This makes the encoder-free viability conditional on inheriting multimodal capabilities from the specific checkpoint rather than a general property of the linear-projection + LLM architecture.

    Authors: We agree the abstract wording risks overstating independence from initialization. The phrase 'purely through its own parameters' was meant to contrast the absence of any dedicated speech encoder (relying only on linear projection into the LLM) against conventional encoder-based Speech-LLMs. However, the manuscript already notes the critical role of Phi-4-MM initialization on limited data. We will revise the abstract to explicitly state that competitive performance with the encoder-free design is achieved when leveraging multimodal initialization, while still emphasizing the architectural simplification of removing the encoder. revision: yes

  2. Referee: [Abstract] Abstract: No quantitative ASR metrics, error bars, dataset sizes, training details, or ablation numbers are provided, and full experimental protocols (including data exclusion rules) are absent. This prevents verification of the 'competitive performance with only limited degradation' claim against encoder-initialized baselines.

    Authors: The abstract was kept concise per typical length limits, with all quantitative WER results, ablations, dataset sizes, training hyperparameters, and protocols (including any data filtering) provided in Sections 3–5 and the experimental setup. To improve immediate verifiability of the 'competitive with limited degradation' claim, we will add a small number of key ASR metrics (e.g., average WER on OpenASR public sets) and training scale information to the revised abstract. Full protocols remain in the main text. revision: yes

Circularity Check

0 steps flagged

Empirical experiments with no derivation chain or self-referential reductions

The paper presents results from ASR and TTS experiments on an encoder-free architecture that linearly projects Mel spectrogram patches into an LLM. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The claim that the LLM learns alignment 'purely through its own parameters' is an empirical observation conditioned on multimodal initialization when data is limited, but this is not a derivation that reduces to inputs by construction. The work is self-contained as experimental validation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities used in the work.

pith-pipeline@v0.9.1-grok · 5774 in / 1104 out tokens · 17825 ms · 2026-06-27T14:39:07.719668+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 7 linked inside Pith

  1. [1]

    speech encoder

    INTRODUCTION The prevailing paradigm for speech large language models (Speech-LLMs) [1–3] consists of three components: a pre-trained speech encoder, a modality projector, and a large language model (LLM). The speech encoder, typically a Whisper-style [4] or Conformer-based [5] model pre-trained on large-scale ASR data, converts raw audio into high-leve...

  2. [2]

    These systems typically use large pre-trained speech encoders (Whisper [4], HuBERT [10]) to extract features before feeding them to the LLM

    RELATED WORK Recent works [1–3,9,24–27] have established the encoder-projector-LLM paradigm for speech understanding. These systems typically use large pre-trained speech encoders (Whisper [4], HuBERT [10]) to extract features before feeding them to the LLM. Encoder-free multimodal models. In vision, Fuyu [7] first demonstrated that raw image patches can ...

  3. [3]

    Encoder” and “LoRA

    METHOD 3.1. Architecture Overview The Mel-LLM architecture is illustrated in Figure 1. We build upon the standard Speech-LLM framework [1, 2] but systematically simplify the speech encoder component. Our system supports both ASR (speech-to-text) and TTS (text-to-speech) tasks, with a modification on Phi-4-MultiModal (Phi-4-MM) [3]. 3.2. ASR: Speech Inpu...

  4. [4]

    Model Configuration Our model is built upon Phi-4-MM [3]

    EXPERIMENTAL SETTINGS 4.1. Model Configuration Our model is built upon Phi-4-MM [3]. The LLM has a hidden dimension of 3072, 32 layers, 24 attention heads, and 8 KV heads. We use LoRA with r = 320, α = 640 for linear layers in attention and MLP blocks. For ASR, the main Conformer encoder blocks are removed while NeMoConv layers are preserved for downsamp...

  5. [5]

    Freeze Lk–31

    EXPERIMENTAL RESULTS 5.1. ASR: Main Results and Scaling Phi-4-MM initialization is critical at limited data scale. The encoder-free Mel-LLM with Phi-4-MM LoRA initialization achieves 7.12% average WER on OpenASR (Table 1), only 0.15% behind the random-encoder baseline (6.97%) that still uses a trainable encoder. Random initialization of the LLM degrades to...

  6. [6]

    CONCLUSION We present Mel-LLM, demonstrating that large language models can directly learn to read Mel spectrogram without a dedicated speech encoder. On ASR, the encoder-free approach achieves competitive results with only limited performance gap compared to encoder-initialized models, particularly when sufficient training data is available. Phi-4-MM ...

  7. [7]

    Prompting large language models with speech recognition abilities,

    Y. Fathullah et al., “Prompting large language models with speech recognition abilities,” in Proc. ICASSP, 2024, pp. 13351–13355

  8. [8]

    Train short, infer long: Speech-llm enables zero-shot streamable joint asr and diarization on long audio,

    M. Shi et al., “Train short, infer long: Speech-llm enables zero-shot streamable joint asr and diarization on long audio,” in IEEE ICASSP, 2026, pp. 17442–17446

  9. [9]

    Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,

    A. Abouelenin et al., “Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025

  10. [10]

    Robust speech recognition via large-scale weak supervision,

    A. Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023, pp. 28492–28518

  11. [11]

    Conformer: Convolution-augmented Transformer for speech recognition,

    A. Gulati et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040

  12. [12]

    Lora: Low-rank adaptation of large language models,

    E. J. Hu et al., “Lora: Low-rank adaptation of large language models,” in Proc. ICLR, 2022

  13. [13]

    Fuyu-8B: A multimodal architecture for AI agents,

    R. Bavishi et al., “Fuyu-8B: A multimodal architecture for AI agents,” 2023. [Online]. Available: https://www.adept.ai/blog/fuyu-8b

  14. [14]

    Open ASR leaderboard: Towards reproducible and transparent multilingual and long-form speech recognition evaluation,

    V. Srivastav et al., “Open ASR leaderboard: Towards reproducible and transparent multilingual and long-form speech recognition evaluation,” arXiv preprint arXiv:2510.06961, 2026

  15. [15]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang et al., “SALMONN: Towards generic hearing abilities for large language models,” in Proc. ICLR, 2024

  16. [16]

    HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

    W. Hsu et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021

  17. [17]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,

    D. Zhang et al., “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in Proc. EMNLP, 2023

  18. [18]

    VoxtLM: Unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks,

    S. Maiti et al., “VoxtLM: Unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks,” in Proc. ICASSP, 2024

  19. [19]

    Spirit LM: Interleaved spoken and written language model,

    T. A. Nguyen et al., “Spirit LM: Interleaved spoken and written language model,” arXiv preprint arXiv:2402.05755, 2024

  20. [20]

    Autoregressive speech synthesis without vector quantization,

    L. Meng et al., “Autoregressive speech synthesis without vector quantization,” in Proc. ACL, 2025, pp. 1287–1300

  21. [21]

    HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,

    J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020, pp. 17022–17033

  22. [22]

    UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022,

    T. Saeki et al., “UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525

  23. [23]

    Simple and effective VAE training with calibrated decoders,

    O. Rybkin et al., “Simple and effective VAE training with calibrated decoders,” in Proc. ICML, 2021, pp. 9179–9189

  24. [24]

    Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation,

    Z. Liu et al., “Tuna-2: Pixel embeddings beat vision encoders for multimodal understanding and generation,” arXiv preprint arXiv:2604.24763, 2026

  25. [25]

    Representation forcing for bottleneck-free unified multimodal models,

    Y. Wang et al., “Representation forcing for bottleneck-free unified multimodal models,” arXiv preprint arXiv:2605.31604, 2026

  26. [26]

    Toward native multimodal modeling: A roadmap,

    S. An et al., “Toward native multimodal modeling: A roadmap,” arXiv preprint arXiv:2605.25343, 2026

  27. [27]

    MELA-TTS: Joint Transformer-diffusion model with representation alignment for speech synthesis,

    K. An et al., “MELA-TTS: Joint Transformer-diffusion model with representation alignment for speech synthesis,” in Proc. ICASSP, 2026

  28. [28]

    MELD: Mel-spectrogram-based speech language modeling with discrete latent variables,

    S.-L. Yeh et al., “MELD: Mel-spectrogram-based speech language modeling with discrete latent variables,” arXiv preprint arXiv:2605.29859, 2026

  29. [29]

    WavFlow: Audio generation in waveform space,

    F. Zhou et al., “WavFlow: Audio generation in waveform space,” arXiv preprint arXiv:2605.18749, 2026

  30. [30]

    AlignFormer: Modality matching can achieve better zero-shot instruction-following speech-LLM,

    R. Fan, B. Ren, Y. Hu, R. Zhao, S. Liu, and J. Li, “AlignFormer: Modality matching can achieve better zero-shot instruction-following speech-LLM,” IEEE J. Sel. Topics Signal Process., 2025

  31. [31]

    Towards efficient speech-text jointly decoding within one speech language model,

    H. Wu et al., “Towards efficient speech-text jointly decoding within one speech language model,” in Proc. ASRU, 2025

  32. [32]

    SLM-S2ST: A multimodal language model for direct speech-to-speech translation,

    Y. Hu et al., “SLM-S2ST: A multimodal language model for direct speech-to-speech translation,” in Proc. ASRU, 2025

  33. [33]

    Speech LLMs are contextual reasoning transcribers,

    K. Deng, R. Fan, B. Ren, Y. Wang, and J. Li, “Speech LLMs are contextual reasoning transcribers,” arXiv preprint arXiv:2604.00610, 2026

  34. [34]

    LibriSpeech: An ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210

  35. [35]

    GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,

    G. Chen et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” in Proc. Interspeech, 2021, pp. 3670–3674

  36. [36]

    MLS: A large-scale multilingual dataset for speech research,

    V. Pratap et al., “MLS: A large-scale multilingual dataset for speech research,” in Proc. Interspeech, 2020, pp. 2757–2761

  37. [37]

    SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,

    P. K. O’Neill et al., “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” in Proc. Interspeech, 2021, pp. 1434–1438

  38. [38]

    Common Voice: A massively-multilingual speech corpus,

    R. Ardila et al., “Common Voice: A massively-multilingual speech corpus,” in Proc. LREC, 2020, pp. 4218–4222

  39. [39]

    VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,

    C. Wang et al., “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proc. ACL, 2021, pp. 993–1003

  40. [40]

    TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,

    F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in Proc. SPECOM, 2018, pp. 198–208

  41. [41]

    The AMI meeting corpus: A pre-announcement,

    J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in Proc. MLMI Workshop, 2005, pp. 28–39

  42. [42]

    Earnings-22: A practical benchmark for accents in the wild,

    M. Del Rio et al., “Earnings-22: A practical benchmark for accents in the wild,” in Proc. Interspeech, 2022, pp. 4277–4281

  43. [43]

    FLEURS: Few-shot learning evaluation of universal representations of speech,

    A. Conneau et al., “FLEURS: Few-shot learning evaluation of universal representations of speech,” in Proc. SLT, 2023, pp. 798–805