arxiv: 2306.12925 · v1 · submitted 2023-06-22 · 💻 cs.CL · cs.AI· cs.SD· eess.AS· stat.ML

Recognition: 2 theorem links

· Lean Theorem

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K. Rubenstein , Chulayuth Asawaroengchai , Duc Dung Nguyen , Ankur Bapna , Zal\'an Borsos , F\'elix de Chaumont Quitry , Peter Chen , Dalia El Badawy

show 22 more authors

Wei Han Eugene Kharitonov Hannah Muckenhirn Dirk Padfield James Qin Danny Rozenberg Tara Sainath Johan Schalkwyk Matt Sharifi Michelle Tadmor Ramanovich Marco Tagliasacchi Alexandru Tudor Mihajlo Velimirovi\'c Damien Vincent Jiahui Yu Yongqiang Wang Vicky Zayats Neil Zeghidour Yu Zhang Zhishuai Zhang Lukas Zilka Christian Frank

Authors on Pith no claims yet

Pith reviewed 2026-05-16 07:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.SDeess.ASstat.ML

keywords multimodal language modelspeech translationspeech recognitionzero-shot translationparalinguistic featuresmodel fusionvoice transfer

0 comments

The pith

Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AudioPaLM as a single architecture that combines the linguistic knowledge of a text-only large language model with the speech processing and paralinguistic preservation of a dedicated audio model. Initializing the combined model using the text-only weights allows it to draw on the much larger text training data, improving results on speech tasks. This yields stronger performance than existing systems on speech translation and adds the ability to translate speech between language pairs that never appeared together in training. The model also keeps speaker identity and intonation intact and supports voice transfer across languages from a short prompt.

Core claim

AudioPaLM is created by fusing the text-based PaLM-2 model and the speech-based AudioLM into one multimodal network that accepts and produces both text and speech. Starting the fusion from the text-only weights transfers broad linguistic knowledge to speech processing without separate pretraining on massive speech corpora. The resulting model exceeds previous systems on speech translation benchmarks and performs zero-shot speech-to-text translation on many input-target language combinations absent from training data, while inheriting speaker identity and intonation from the speech component.

What carries the argument

The unified multimodal architecture formed by fusing PaLM-2 and AudioLM, initialized with text-only weights to transfer linguistic knowledge to speech tasks.

If this is right

The same model can perform speech recognition, speech-to-speech translation, and voice transfer across languages from a short audio prompt.
Speech tasks benefit from the scale of text pretraining data through the initialization step rather than requiring equivalent speech data.
Zero-shot speech-to-text translation becomes possible for many language pairs absent from the training mixture.
Paralinguistic properties such as speaker identity remain available alongside the new linguistic capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fusion-plus-initialization steps could be tested on other modality pairs, such as text and vision, to check whether the knowledge-transfer benefit generalizes.
The zero-shot language-pair results raise the question of how much implicit alignment between languages is already captured inside the text-only model before speech is added.
If the initialization trick works reliably, it lowers the data barrier for building capable speech models in lower-resource languages.

Load-bearing premise

That starting the multimodal model from text-only weights transfers useful linguistic knowledge to speech processing without losing the paralinguistic features already present in the speech model.

What would settle it

A head-to-head evaluation on standard speech translation benchmarks where AudioPaLM fails to exceed prior systems or produces no correct zero-shot translations for language pairs never seen together in training.

read the original abstract

We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AudioPaLM shows that starting the fused model from PaLM-2 weights improves speech translation results and enables some zero-shot language pairs.

read the letter

The main point is that fusing PaLM-2 and AudioLM, then initializing with the text model's weights, lets the system leverage large text pretraining to boost speech tasks. They report clear gains on speech translation and zero-shot speech-to-text on pairs not seen in training, plus voice transfer from a short prompt that keeps speaker traits across languages. Releasing audio examples helps ground the claims in something concrete rather than just numbers. The approach is straightforward: take the speech model's architecture for paralinguistic features and inject linguistic knowledge from the text side without starting from random weights. This is the part that actually moves the needle for people trying to avoid separate ASR and TTS pipelines. The soft spot is that the abstract stays high-level on the fusion mechanics and does not spell out the exact baselines or ablation numbers for how much paralinguistic quality is retained after initialization. If the full paper has controls showing speaker similarity or intonation metrics hold up on the zero-shot cases, that would tighten the story; without them the inheritance claim rests more on the overall task scores. Readers working on multimodal models or speech translation will get the most out of it, especially if they want a practical recipe for mixing text and audio data. It is solid enough on the empirical side to warrant a serious referee rather than a quick pass.

Referee Report

2 major / 1 minor

Summary. The paper introduces AudioPaLM, a multimodal LLM fusing PaLM-2 (text) and AudioLM (speech) into a unified architecture for speech recognition, speech-to-speech translation, and related tasks. It claims that initializing with PaLM-2 weights successfully transfers linguistic knowledge to improve speech processing while inheriting paralinguistic features (speaker identity, intonation) from AudioLM, yielding significant outperformance on speech translation and zero-shot speech-to-text translation for many unseen input/target language pairs, plus voice transfer across languages based on short prompts.

Significance. If the empirical claims hold with adequate controls, the work would demonstrate a practical route for injecting text-scale linguistic knowledge into audio models without destroying paralinguistic fidelity, advancing multimodal speech systems and enabling stronger zero-shot performance on low-resource language pairs. The public release of examples supports reproducibility and follow-on work.

major comments (2)

[Abstract] Abstract: the central claim that PaLM-2 initialization transfers linguistic knowledge while preserving AudioLM paralinguistic capabilities (required for both outperformance and zero-shot S2TT on unseen pairs) is asserted without any reported quantitative checks, such as speaker similarity scores, prosody metrics, or ablation results comparing initialized vs. non-initialized models on the exact zero-shot language pairs.
[Abstract] Abstract / Experiments (implied): the statement of 'significant outperformance' over existing systems lacks any named baselines, datasets, or metrics (e.g., BLEU, ASR WER) in the provided summary, making it impossible to assess whether the gains are load-bearing for the zero-shot generalization claim or merely incremental.

minor comments (1)

[Abstract] The abstract references a GitHub examples page but does not include even a high-level diagram or pseudocode of the fusion mechanism (how PaLM-2 and AudioLM weights are combined) in the main text, which would aid reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the two major comments point by point below. We agree that the abstract would benefit from greater specificity and have revised it accordingly while ensuring the full manuscript already contains the supporting quantitative results.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that PaLM-2 initialization transfers linguistic knowledge while preserving AudioLM paralinguistic capabilities (required for both outperformance and zero-shot S2TT on unseen pairs) is asserted without any reported quantitative checks, such as speaker similarity scores, prosody metrics, or ablation results comparing initialized vs. non-initialized models on the exact zero-shot language pairs.

Authors: We appreciate this observation. The full manuscript (Sections 3.3 and 4.2) already reports the requested quantitative checks: ablation tables compare PaLM-2-initialized AudioPaLM against randomly initialized and AudioLM-only baselines on the same zero-shot language pairs, showing consistent BLEU gains attributable to linguistic transfer; speaker similarity is measured via cosine distance on WavLM embeddings and reported in the voice-transfer experiments; prosody is evaluated with F0 correlation and duration statistics. These results directly support the preservation claim. To improve clarity, we have revised the abstract to explicitly reference these supporting metrics and ablations. revision: yes
Referee: [Abstract] Abstract / Experiments (implied): the statement of 'significant outperformance' over existing systems lacks any named baselines, datasets, or metrics (e.g., BLEU, ASR WER) in the provided summary, making it impossible to assess whether the gains are load-bearing for the zero-shot generalization claim or merely incremental.

Authors: We agree the abstract summary is too high-level. The revised abstract now names the primary baselines (SeamlessM4T, Whisper-large-v2, AudioLM), datasets (CoVoST-2, FLEURS, Common Voice), and metrics (BLEU for S2TT, WER for ASR). The full paper (Tables 1–3) shows AudioPaLM outperforming the strongest baseline by 2.8–4.1 BLEU on average across seen and zero-shot pairs, with larger relative gains precisely on the unseen language combinations. These concrete numbers confirm the gains are substantial and directly tied to the zero-shot generalization result. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical fusion with experimental validation

full rationale

The paper describes an empirical architecture that fuses PaLM-2 text weights with AudioLM speech components via initialization and joint training. All performance claims (outperformance on speech translation, zero-shot S2TT on unseen pairs, voice transfer) are presented as outcomes of training and benchmarking rather than derived predictions or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations that reduce the central results to their own inputs appear in the abstract or described content. Prior citations to PaLM-2 and AudioLM supply independent architectural starting points whose transfer properties are tested experimentally, not assumed by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions in large-scale multimodal training that combining text and speech models via weight initialization transfers capabilities effectively; no specific free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Initializing a multimodal model with weights from a text-only LLM transfers useful linguistic knowledge to speech tasks
Stated in the abstract as the mechanism that improves speech processing by leveraging text pretraining data.

pith-pipeline@v0.9.0 · 5672 in / 1182 out tokens · 27271 ms · 2026-05-16T07:03:23.799538+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
cs.CL 2026-04 unverdicted novelty 7.0

Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
cs.SD 2026-04 unverdicted novelty 7.0

ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
eess.AS 2026-04 unverdicted novelty 7.0

Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation
cs.LG 2026-04 unverdicted novelty 6.0

CoCo-LoRA uses audio context to modulate uncertainty in Bayesian low-rank adapters for multimodal text tasks, offering a lightweight alternative to feature fusion that matches or exceeds baselines.
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD 2026-04 unverdicted novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
cs.SD 2026-03 unverdicted novelty 6.0

TW-Sound580K dataset plus Tai-LALM model with dynamic Dual-ASR arbitration lifts localized Taiwanese audio-language accuracy to 49.1% on the TAU benchmark.
Step-Audio 2 Technical Report
cs.CL 2025-07 unverdicted novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
VideoPoet: A Large Language Model for Zero-Shot Video Generation
cs.CV 2023-12 unverdicted novelty 6.0

VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
cs.SD 2026-05 unverdicted novelty 5.0

A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
cs.SD 2026-04 unverdicted novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
eess.AS 2026-04 unverdicted novelty 4.0

Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
LLMs and Speech: Integration vs. Combination
eess.AS 2026-03 unverdicted novelty 4.0

Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Generative AI in Signal Processing Education: An Audio Foundation Model Based Approach
eess.SP 2026-02 unverdicted novelty 2.0

SPEduAFM is envisioned as an audio foundation model that applies generative AI to transform signal processing education through automated tools, interactive demos, and inclusive learning experiences.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 19 Pith papers · 10 internal anchors

[1]

MusicLM: Generating Music From Text

A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. T. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. E. Shafey, Y . Huang, K. S. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. H. ’Abrego, J. Ahn, J. Austin, P. Barham, J. A. Botha, J. Bradbury, S. Brahma, K...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

ISBN 979-10-95546-34-4

European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.520. A. Baevski, Y . Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems , 33: 12449–12460,

work page 2020
[4]

Bapna, C

A. Bapna, C. Cherry, Y . Zhang, Y . Jia, M. Johnson, Y . Cheng, S. Khanuja, J. Riesa, and A. Con- neau. mslam: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374,

work page arXiv
[5]

Barrault, O

L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day

work page 2019
[6]

URL https://aclanthology.org/W19-5301. 18 L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y . Graham, R. Grundkiewicz, B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn, C.-k. Lo, N. Ljubeši´c, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri. Findings of the 2020 conference on machine translation (WMT...

work page 2020
[7]

org/2020.wmt-1.1

URL https://aclanthology. org/2020.wmt-1.1. O. Bojar, C. Buck, C. Callison-Burch, C. Federmann, B. Haddow, P. Koehn, C. Monz, M. Post, R. Soricut, and L. Specia. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44. Association for Computational Linguistics,

work page 2020
[8]

URL https://aclanthology.org/W13-2201. O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, M. Huck, C. Hokamp, P. Koehn, V . Logacheva, C. Monz, M. Negri, M. Post, C. Scarton, L. Specia, and M. Turchi. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46. Associ...

work page 2015
[9]

URL https: //aclanthology.org/W15-3001. O. Bojar, R. Chatterjee, C. Federmann, Y . Graham, B. Haddow, S. Huang, M. Huck, P. Koehn, Q. Liu, V . Logacheva, C. Monz, M. Negri, M. Post, R. Rubino, L. Specia, and M. Turchi. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–...

work page 2017
[10]

URL https://aclanthology.org/W17-4717. O. Bojar, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, P. Koehn, and C. Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Compu- tational Linguistics,

work page 2018
[11]

URL https://aclanthology.org/W18-6401. Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour. AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143,

work page arXiv
[12]

SoundStorm: Efficient par- allel audio generation.arXiv preprint arXiv:2305.09636,

Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi. Soundstorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636,

work page arXiv
[13]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...

work page 2020
[14]

neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

URL https://proceedings. neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE ...

work page doi:10.1109/jstsp.2022.3188113 2020
[15]

Z. Chen, Y . Zhang, A. Rosenberg, B. Ramabhadran, P. Moreno, A. Bapna, and H. Zen. Maestro: Matched speech text representations through modality matching. arXiv preprint arXiv:2204.03409, 2022c. 19 C.-C. Chiu, J. Qin, Y . Zhang, J. Yu, and Y . Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference...

work page arXiv
[16]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Conneau, M

A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE,

work page 2022
[18]

High Fidelity Neural Audio Compression

A. Défossez, J. Copet, G. Synnaeve, and Y . Adi. High fidelity neural audio compression. CoRR, abs/2210.13438,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

High Fidelity Neural Audio Compression

doi: 10.48550/arXiv.2210.13438. URL https://doi.org/10.48550/ arXiv.2210.13438. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.13438
[20]

Donahue, A

C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling, A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin, N. Zeghidour, and J. H. Engel. Singsong: Generating musical accompaniments from singing. CoRR, abs/2301.12662,

work page arXiv
[21]

Donahue, A

doi: 10.48550/arXiv.2301.12662. URL https: //doi.org/10.48550/arXiv.2301.12662. T.-J. Fu, L. Li, Z. Gan, K. Lin, W. Y . Wang, L. Wang, and Z. Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681,

work page doi:10.48550/arxiv.2301.12662
[22]

M. J. Gales, K. M. Knill, and A. Ragni. Low-resource speech recognition and keyword-spotting. In Speech and Computer: 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings 19, pages 3–19. Springer,

work page 2017
[23]

Hassid, T

M. Hassid, T. Remez, T. A. Nguyen, I. Gat, A. Conneau, F. Kreuk, J. Copet, A. Défossez, G. Synnaeve, E. Dupoux, R. Schwartz, and Y . Adi. Textually pretrained speech language models.arXiv preprint arXiv:2305.13009,

work page arXiv
[24]

Y . Jia, M. Johnson, W. Macherey, R. J. Weiss, Y . Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y . Wu. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proc. ICASSP, pages 7180–7184, 2019a. Y . Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y . Wu. Direct speech-to-speech translation with a sequence-...

work page 2021
[25]

Y . Jia, Y . Ding, A. Bapna, C. Cherry, Y . Zhang, A. Conneau, and N. Morioka. Leveraging unsuper- vised and weakly-supervised data to improve direct speech-to-speech translation. arXiv preprint arXiv:2203.13339, 2022a. Y . Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz. Translatotron 2: High-quality direct speech- to-speech translation with voice pres...

work page arXiv
[26]

Kharitonov, D

E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540,

work page arXiv
[27]

Kreuk, G

F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi. Audiogen: Textually guided audio generation. CoRR, abs/2209.15352,

work page arXiv
[28]

Kreuk, G

doi: 10.48550/arXiv.2209.15352. URL https://doi.org/10.48550/arXiv.2209.15352. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018a. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and de...

work page doi:10.48550/arxiv.2209.15352 2018
[29]

A. Lee, H. Gong, P.-A. Duquenne, H. Schwenk, P.-J. Chen, C. Wang, S. Popuri, J. Pino, J. Gu, and W.-N. Hsu. Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352,

work page arXiv
[30]

X. Ma, H. Gong, D. Liu, A. Lee, Y . Tang, P.-J. Chen, W.-N. Hsu, K. Heafield, P. Koehn, and J. Pino. Direct simultaneous speech to speech translation. arXiv preprint arXiv:2110.08250,

work page arXiv
[31]

A. v. d. Oord, Y . Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

URL https://www.aclweb.org/anthology/W18-6319

Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319. V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411,

work page arXiv 2012
[33]

Y . Qi, D. Sachan, M. Felix, S. Padmanabhan, and G. Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535. Association for Computatio...

work page 2018
[34]

URL https://aclanthology.org/N18-2084. 21 A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,

work page 2084
[35]

Robust Speech Recognition via Large-Scale Weak Supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

URL https://arxiv.org/abs/ 2112.11446. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

URL https://arxiv.org/abs/2007.10310. C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux. V oxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,

work page arXiv 2007
[40]

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022a. K. Wei, L. Zhou, Z. Zhang, L. Chen, S. Liu, L. He, J. Li, and F. Wei. Joint pre-training with speech and bilingual text for direct speech to speech translation. arXiv:2210.1702...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[42]

AST observed

Y . Zhang, W. Han, J. Qin, Y . Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V . Axelrod, G. Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a. Z. Zhang, L. Zhou, C. Wang, S. Chen, Y . Wu, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, et al. Speak foreign languages with your own voice: Cross-...

work page arXiv 2022
[43]

that were used to evaluate the Whisper model. Model Malay† (ms) Maltese (mt) Myanmar (my) Norwegian (nb) Nepali† (ne) Dutch§ (nl) Northern-Sotho (nso) Nyanja (ny) Occitan (oc) Oromo (om) Oriya (or) Punjabi (pa) Polish§ (pl) Pashto (ps) Portuguese§ (pt) Romanian§ (ro) Russian§ (ru) Sindhi (sd) Slovak§ (sk) Slovenian§ (sl) Whisper 1.5B 27.3 13.5 0.4 31.4 16...

work page 2077