Recognition: 2 theorem links
· Lean TheoremAudioPaLM: A Large Language Model That Can Speak and Listen
Pith reviewed 2026-05-16 07:03 UTC · model grok-4.3
The pith
Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AudioPaLM is created by fusing the text-based PaLM-2 model and the speech-based AudioLM into one multimodal network that accepts and produces both text and speech. Starting the fusion from the text-only weights transfers broad linguistic knowledge to speech processing without separate pretraining on massive speech corpora. The resulting model exceeds previous systems on speech translation benchmarks and performs zero-shot speech-to-text translation on many input-target language combinations absent from training data, while inheriting speaker identity and intonation from the speech component.
What carries the argument
The unified multimodal architecture formed by fusing PaLM-2 and AudioLM, initialized with text-only weights to transfer linguistic knowledge to speech tasks.
If this is right
- The same model can perform speech recognition, speech-to-speech translation, and voice transfer across languages from a short audio prompt.
- Speech tasks benefit from the scale of text pretraining data through the initialization step rather than requiring equivalent speech data.
- Zero-shot speech-to-text translation becomes possible for many language pairs absent from the training mixture.
- Paralinguistic properties such as speaker identity remain available alongside the new linguistic capabilities.
Where Pith is reading between the lines
- Similar fusion-plus-initialization steps could be tested on other modality pairs, such as text and vision, to check whether the knowledge-transfer benefit generalizes.
- The zero-shot language-pair results raise the question of how much implicit alignment between languages is already captured inside the text-only model before speech is added.
- If the initialization trick works reliably, it lowers the data barrier for building capable speech models in lower-resource languages.
Load-bearing premise
That starting the multimodal model from text-only weights transfers useful linguistic knowledge to speech processing without losing the paralinguistic features already present in the speech model.
What would settle it
A head-to-head evaluation on standard speech translation benchmarks where AudioPaLM fails to exceed prior systems or produces no correct zero-shot translations for language pairs never seen together in training.
read the original abstract
We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AudioPaLM, a multimodal LLM fusing PaLM-2 (text) and AudioLM (speech) into a unified architecture for speech recognition, speech-to-speech translation, and related tasks. It claims that initializing with PaLM-2 weights successfully transfers linguistic knowledge to improve speech processing while inheriting paralinguistic features (speaker identity, intonation) from AudioLM, yielding significant outperformance on speech translation and zero-shot speech-to-text translation for many unseen input/target language pairs, plus voice transfer across languages based on short prompts.
Significance. If the empirical claims hold with adequate controls, the work would demonstrate a practical route for injecting text-scale linguistic knowledge into audio models without destroying paralinguistic fidelity, advancing multimodal speech systems and enabling stronger zero-shot performance on low-resource language pairs. The public release of examples supports reproducibility and follow-on work.
major comments (2)
- [Abstract] Abstract: the central claim that PaLM-2 initialization transfers linguistic knowledge while preserving AudioLM paralinguistic capabilities (required for both outperformance and zero-shot S2TT on unseen pairs) is asserted without any reported quantitative checks, such as speaker similarity scores, prosody metrics, or ablation results comparing initialized vs. non-initialized models on the exact zero-shot language pairs.
- [Abstract] Abstract / Experiments (implied): the statement of 'significant outperformance' over existing systems lacks any named baselines, datasets, or metrics (e.g., BLEU, ASR WER) in the provided summary, making it impossible to assess whether the gains are load-bearing for the zero-shot generalization claim or merely incremental.
minor comments (1)
- [Abstract] The abstract references a GitHub examples page but does not include even a high-level diagram or pseudocode of the fusion mechanism (how PaLM-2 and AudioLM weights are combined) in the main text, which would aid reader understanding.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the two major comments point by point below. We agree that the abstract would benefit from greater specificity and have revised it accordingly while ensuring the full manuscript already contains the supporting quantitative results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that PaLM-2 initialization transfers linguistic knowledge while preserving AudioLM paralinguistic capabilities (required for both outperformance and zero-shot S2TT on unseen pairs) is asserted without any reported quantitative checks, such as speaker similarity scores, prosody metrics, or ablation results comparing initialized vs. non-initialized models on the exact zero-shot language pairs.
Authors: We appreciate this observation. The full manuscript (Sections 3.3 and 4.2) already reports the requested quantitative checks: ablation tables compare PaLM-2-initialized AudioPaLM against randomly initialized and AudioLM-only baselines on the same zero-shot language pairs, showing consistent BLEU gains attributable to linguistic transfer; speaker similarity is measured via cosine distance on WavLM embeddings and reported in the voice-transfer experiments; prosody is evaluated with F0 correlation and duration statistics. These results directly support the preservation claim. To improve clarity, we have revised the abstract to explicitly reference these supporting metrics and ablations. revision: yes
-
Referee: [Abstract] Abstract / Experiments (implied): the statement of 'significant outperformance' over existing systems lacks any named baselines, datasets, or metrics (e.g., BLEU, ASR WER) in the provided summary, making it impossible to assess whether the gains are load-bearing for the zero-shot generalization claim or merely incremental.
Authors: We agree the abstract summary is too high-level. The revised abstract now names the primary baselines (SeamlessM4T, Whisper-large-v2, AudioLM), datasets (CoVoST-2, FLEURS, Common Voice), and metrics (BLEU for S2TT, WER for ASR). The full paper (Tables 1–3) shows AudioPaLM outperforming the strongest baseline by 2.8–4.1 BLEU on average across seen and zero-shot pairs, with larger relative gains precisely on the unseen language combinations. These concrete numbers confirm the gains are substantial and directly tied to the zero-shot generalization result. revision: yes
Circularity Check
No circularity: empirical fusion with experimental validation
full rationale
The paper describes an empirical architecture that fuses PaLM-2 text weights with AudioLM speech components via initialization and joint training. All performance claims (outperformance on speech translation, zero-shot S2TT on unseen pairs, voice transfer) are presented as outcomes of training and benchmarking rather than derived predictions or fitted parameters. No equations, self-definitional loops, or load-bearing self-citations that reduce the central results to their own inputs appear in the abstract or described content. Prior citations to PaLM-2 and AudioLM supply independent architectural starting points whose transfer properties are tested experimentally, not assumed by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Initializing a multimodal model with weights from a text-only LLM transfers useful linguistic knowledge to speech tasks
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
-
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
Phoneme-based interfaces match or surpass projector-based ones for LLM ASR, especially in low-resource languages, and a BPE-phoneme hybrid offers additional improvements.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Beyond Feature Fusion: Contextual Bayesian PEFT for Multimodal Uncertainty Estimation
CoCo-LoRA uses audio context to modulate uncertainty in Bayesian low-rank adapters for multimodal text tasks, offering a lightweight alternative to feature fusion that matches or exceeds baselines.
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...
-
TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
TW-Sound580K dataset plus Tai-LALM model with dynamic Dual-ASR arbitration lifts localized Taiwanese audio-language accuracy to 49.1% on the TAU benchmark.
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...
-
VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
-
Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR
A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.
-
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
-
In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.
-
LLMs and Speech: Integration vs. Combination
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
-
Generative AI in Signal Processing Education: An Audio Foundation Model Based Approach
SPEduAFM is envisioned as an audio foundation model that applies generative AI to transform signal processing education through automated tools, interactive demos, and inclusive learning experiences.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
-
[1]
MusicLM: Generating Music From Text
A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. T. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. Clark, L. E. Shafey, Y . Huang, K. S. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y . Tay, K. Xiao, Y . Xu, Y . Zhang, G. H. ’Abrego, J. Ahn, J. Austin, P. Barham, J. A. Botha, J. Bradbury, S. Brahma, K...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.520. A. Baevski, Y . Zhou, A. Mohamed, and M. Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems , 33: 12449–12460,
work page 2020
- [4]
-
[5]
L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day
work page 2019
-
[6]
URL https://aclanthology.org/W19-5301. 18 L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y . Graham, R. Grundkiewicz, B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn, C.-k. Lo, N. Ljubeši´c, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri. Findings of the 2020 conference on machine translation (WMT...
work page 2020
-
[7]
URL https://aclanthology. org/2020.wmt-1.1. O. Bojar, C. Buck, C. Callison-Burch, C. Federmann, B. Haddow, P. Koehn, C. Monz, M. Post, R. Soricut, and L. Specia. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44. Association for Computational Linguistics,
work page 2020
-
[8]
URL https://aclanthology.org/W13-2201. O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, M. Huck, C. Hokamp, P. Koehn, V . Logacheva, C. Monz, M. Negri, M. Post, C. Scarton, L. Specia, and M. Turchi. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46. Associ...
work page 2015
-
[9]
URL https: //aclanthology.org/W15-3001. O. Bojar, R. Chatterjee, C. Federmann, Y . Graham, B. Haddow, S. Huang, M. Huck, P. Koehn, Q. Liu, V . Logacheva, C. Monz, M. Negri, M. Post, R. Rubino, L. Specia, and M. Turchi. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–...
work page 2017
-
[10]
URL https://aclanthology.org/W17-4717. O. Bojar, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, P. Koehn, and C. Monz. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Compu- tational Linguistics,
work page 2018
- [11]
-
[12]
SoundStorm: Efficient par- allel audio generation.arXiv preprint arXiv:2305.09636,
Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi. Soundstorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636,
-
[13]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amo...
work page 2020
-
[14]
neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
URL https://proceedings. neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE ...
-
[15]
Z. Chen, Y . Zhang, A. Rosenberg, B. Ramabhadran, P. Moreno, A. Bapna, and H. Zen. Maestro: Matched speech text representations through modality matching. arXiv preprint arXiv:2204.03409, 2022c. 19 C.-C. Chiu, J. Qin, Y . Zhang, J. Yu, and Y . Wu. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference...
-
[16]
PaLM: Scaling Language Modeling with Pathways
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
A. Conneau, M. Ma, S. Khanuja, Y . Zhang, V . Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE,
work page 2022
-
[18]
High Fidelity Neural Audio Compression
A. Défossez, J. Copet, G. Synnaeve, and Y . Adi. High fidelity neural audio compression. CoRR, abs/2210.13438,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
High Fidelity Neural Audio Compression
doi: 10.48550/arXiv.2210.13438. URL https://doi.org/10.48550/ arXiv.2210.13438. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.13438
-
[20]
C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling, A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin, N. Zeghidour, and J. H. Engel. Singsong: Generating musical accompaniments from singing. CoRR, abs/2301.12662,
-
[21]
doi: 10.48550/arXiv.2301.12662. URL https: //doi.org/10.48550/arXiv.2301.12662. T.-J. Fu, L. Li, Z. Gan, K. Lin, W. Y . Wang, L. Wang, and Z. Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681,
-
[22]
M. J. Gales, K. M. Knill, and A. Ragni. Low-resource speech recognition and keyword-spotting. In Speech and Computer: 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings 19, pages 3–19. Springer,
work page 2017
- [23]
-
[24]
Y . Jia, M. Johnson, W. Macherey, R. J. Weiss, Y . Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y . Wu. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proc. ICASSP, pages 7180–7184, 2019a. Y . Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y . Wu. Direct speech-to-speech translation with a sequence-...
work page 2021
-
[25]
Y . Jia, Y . Ding, A. Bapna, C. Cherry, Y . Zhang, A. Conneau, and N. Morioka. Leveraging unsuper- vised and weakly-supervised data to improve direct speech-to-speech translation. arXiv preprint arXiv:2203.13339, 2022a. Y . Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz. Translatotron 2: High-quality direct speech- to-speech translation with voice pres...
-
[26]
E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540,
- [27]
-
[28]
doi: 10.48550/arXiv.2209.15352. URL https://doi.org/10.48550/arXiv.2209.15352. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018a. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and de...
- [29]
- [30]
-
[31]
A. v. d. Oord, Y . Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
URL https://www.aclweb.org/anthology/W18-6319
Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319. V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411,
-
[33]
Y . Qi, D. Sachan, M. Felix, S. Padmanabhan, and G. Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535. Association for Computatio...
work page 2018
-
[34]
URL https://aclanthology.org/N18-2084. 21 A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR,
work page 2084
-
[35]
Robust Speech Recognition via Large-Scale Weak Supervision
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
URL https://arxiv.org/abs/ 2112.11446. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551,
work page internal anchor Pith review Pith/arXiv arXiv
- [39]
-
[40]
C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111,
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022a. K. Wei, L. Zhou, Z. Zhang, L. Chen, S. Liu, L. He, J. Li, and F. Wei. Joint pre-training with speech and bilingual text for direct speech to speech translation. arXiv:2210.1702...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[42]
Y . Zhang, W. Han, J. Qin, Y . Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V . Axelrod, G. Wang, et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a. Z. Zhang, L. Zhou, C. Wang, S. Chen, Y . Wu, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, et al. Speak foreign languages with your own voice: Cross-...
-
[43]
that were used to evaluate the Whisper model. Model Malay† (ms) Maltese (mt) Myanmar (my) Norwegian (nb) Nepali† (ne) Dutch§ (nl) Northern-Sotho (nso) Nyanja (ny) Occitan (oc) Oromo (om) Oriya (or) Punjabi (pa) Polish§ (pl) Pashto (ps) Portuguese§ (pt) Romanian§ (ro) Russian§ (ru) Sindhi (sd) Slovak§ (sk) Slovenian§ (sl) Whisper 1.5B 27.3 13.5 0.4 31.4 16...
work page 2077
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.