pith. machine review for the scientific record.

arxiv: 2410.00037 · v2 · submitted 2024-09-17 · 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD

Recognition: 2 theorem links

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Amélie Royer, Edouard Grave, Hervé Jégou, Laurent Mazaré, Manu Orsini, Neil Zeghidour, Patrick Pérez

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 08:07 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.SD
keywords spoken dialogue · full-duplex · speech-to-speech · inner monologue · real-time latency · neural audio codec · parallel streams · spoken language model

The pith

Moshi treats spoken dialogue as parallel speech-to-speech generation from a text model backbone to enable real-time full-duplex interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Moshi to overcome the high latency, lost non-verbal information, and rigid turn-taking of pipeline-based spoken dialogue systems. It starts from a text language model and generates speech tokens while tracking user and system speech in separate parallel streams, removing the need for explicit speaker segmentation. An inner monologue step first predicts time-aligned text tokens before the audio tokens, which the authors show improves linguistic quality and supports streaming recognition and synthesis. This matters because it could let AI hold conversations that feel immediate and natural, including overlaps and interruptions, instead of waiting for processed turns. If the approach holds, spoken interaction with machines would no longer require separate components for detection, recognition, dialogue, and synthesis.
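To make the generation loop concrete, here is a minimal sketch of one decoding step; the codebook count and the per-step token ordering are illustrative assumptions, not details stated in the paper.

```python
# Sketch of one decoding step in a Moshi-style multi-stream model.
# Assumed (not stated in the abstract): 8 RVQ codebooks per audio stream
# and a flat per-step ordering [text, system audio, user audio].

from dataclasses import dataclass

@dataclass
class Step:
    text: int                 # time-aligned text token (the "inner monologue" prefix)
    system_audio: list[int]   # RVQ codebook indices for Moshi's own speech
    user_audio: list[int]     # RVQ codebook indices for the user's speech

def flatten(step: Step) -> list[int]:
    """Order tokens so the text prefix is predicted before the audio tokens."""
    return [step.text, *step.system_audio, *step.user_audio]

# At inference the model autoregressively predicts the text and system-audio
# tokens, while the user-audio tokens are filled in from the microphone, so
# both sides can speak at once with no turn segmentation.
```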

Core claim

Moshi is a speech-text foundation model that casts spoken dialogue as direct speech-to-speech generation. Starting from a text language model backbone, it produces speech as tokens from the residual quantizer of a neural audio codec while modeling its own speech and the user's speech in parallel streams. This removes explicit speaker turns and supports arbitrary conversational dynamics such as overlapping speech and interruptions. The model further extends prior hierarchical token generation by first predicting time-aligned text tokens as a prefix to the audio tokens; the authors call this the inner monologue and show that it improves the linguistic quality of generated speech while also providing streaming speech recognition and text-to-speech.
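For readers new to neural audio codecs, the sketch below shows the generic residual-quantization mechanism that produces such token stacks; the sizes are invented for the example, and this is not the paper's codec.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook quantizes the residual
    left by the previous ones, yielding one token index per codebook."""
    residual, indices = x.copy(), []
    for cb in codebooks:                       # cb has shape (codebook_size, dim)
        i = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(i)
        residual = residual - cb[i]            # pass the leftover to the next level
    return indices

# Toy example: 4 codebooks of 256 entries over a 128-dim codec frame,
# giving 4 audio tokens per frame (sizes are illustrative only).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 128)) for _ in range(4)]
print(rvq_encode(rng.normal(size=128), codebooks))
```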

What carries the argument

Parallel streams for user and system speech combined with an inner monologue that predicts time-aligned text tokens as a prefix to audio tokens.

If this is right

  • The system can handle interruptions, interjections, and simultaneous speech without post-processing steps.
  • Non-linguistic signals such as emotion and non-speech sounds remain available to shape the response.
  • Streaming speech recognition and text-to-speech emerge directly from the same token-generation process.
  • Theoretical latency drops to 160 ms (200 ms measured in practice), enabling immediate back-and-forth.
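The latency figures are consistent with a frame-level autoregressive loop. A back-of-envelope reading, under the assumption (not stated in this summary) that the codec emits 80 ms frames and acoustic tokens trail the text tokens by one frame:

```python
# Back-of-envelope for the latency claim. The 80 ms frame and one-frame
# delay are assumptions; only the 160 ms / 200 ms totals come from the paper.
FRAME_MS = 80                  # assumed codec frame duration (12.5 Hz)
DELAY_FRAMES = 1               # assumed lag between text and acoustic tokens

theoretical_ms = FRAME_MS * (1 + DELAY_FRAMES)   # 160 ms
overhead_ms = 200 - theoretical_ms               # ~40 ms of compute/IO, implied
print(theoretical_ms, overhead_ms)               # by the measured 200 ms figure
```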

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The parallel-stream design could extend to three or more participants by adding further independent audio streams.
  • The inner-monologue prefix might transfer to other modalities, such as video tokens, to add visual context to dialogue.
  • Real-world performance with background noise or diverse accents would require separate checks beyond the reported results.
  • Such low-latency full-duplex models could support new uses like live translation or hands-free assistance tools.

Load-bearing premise

That jointly modeling parallel speech streams and prefixing audio tokens with aligned text will maintain coherence and quality across all conversational patterns without needing explicit turn segmentation or later corrections.

What would settle it

A live test in which the model produces incoherent replies or exceeds 200 ms latency during frequent interruptions and overlapping speech would show the parallel-stream plus inner-monologue method does not fully replace segmented pipelines.
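One way such a test could be operationalized, sketched with a hypothetical model wrapper whose feed/next_frame interface stands in for whatever streaming API an implementation exposes:

```python
import time

def barge_in_latency_ms(model, user_frames, interrupt_at):
    """Stream user audio into a full-duplex model and time the first system
    audio frame emitted after a simulated interruption. `model` is a
    hypothetical wrapper exposing feed(frame) and next_frame()."""
    for t, frame in enumerate(user_frames):
        model.feed(frame)        # user channel runs continuously (full duplex)
        model.next_frame()       # system channel keeps generating in parallel
        if t == interrupt_at:    # barge-in: the user talks over the system
            start = time.monotonic()
            model.feed(frame)
            model.next_frame()
            return (time.monotonic() - start) * 1000.0
    return None                  # compare the result against the claimed 200 ms
```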

read the original abstract

We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this "Inner Monologue" method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at https://github.com/kyutai-labs/moshi.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Moshi, a speech-text foundation model for real-time full-duplex spoken dialogue. It replaces conventional pipeline components (VAD, ASR, text LLM, TTS) with a unified speech-to-speech generation approach based on a text LM backbone that produces residual-quantized audio tokens. User and system speech are modeled in parallel streams to remove explicit turn segmentation and handle overlaps/interruptions; an 'Inner Monologue' extension first predicts time-aligned text tokens as a prefix to the audio tokens. The resulting system is claimed to be the first real-time full-duplex spoken LLM, with 160 ms theoretical latency (200 ms measured) and open-source release at https://github.com/kyutai-labs/moshi.

Significance. If the central claims hold, the work is significant because it directly targets the three core limitations of current spoken dialogue systems (multi-second latency, loss of paralinguistic cues, and inability to model unsegmented overlaps). The parallel-stream architecture plus inner-monologue prefix constitute a clean architectural departure from turn-based pipelines. The open GitHub release supplies concrete artifacts that allow independent verification of the reported streaming latency and full-duplex behavior.

major comments (1)
  1. [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.
minor comments (2)
  1. [Abstract] The abstract states that the inner-monologue method 'significantly improves the linguistic quality of generated speech' and 'can provide streaming speech recognition and text-to-speech,' yet supplies neither concrete metrics nor a pointer to the relevant results section or table.
  2. Notation for the residual quantizer and the parallel-stream tokenization should be introduced with a brief equation or diagram reference early in the manuscript to aid readers who are not already familiar with the neural audio codec literature.
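Something like the following notation would address that request; the symbols are assumed for illustration and may not match the paper's definitions.

```latex
% Assumed notation (Q codebooks, frame index t); the paper may define these differently.
\begin{align*}
  a_t &= (a_t^1, \dots, a_t^Q), \quad
  a_t^q = \arg\min_k \lVert r_t^q - c_k^q \rVert, \quad
  r_t^1 = x_t, \quad r_t^{q+1} = r_t^q - c^q_{a_t^q}
  && \text{(residual quantizer)} \\
  p(w, a, a') &= \prod_t
     p(w_t \mid h_{<t})\,
     p(a_t \mid w_t, h_{<t})\,
     p(a'_t \mid w_t, a_t, h_{<t})
  && \text{(parallel streams, text prefix } w_t)
\end{align*}
% w_t: time-aligned text tokens; a_t, a'_t: system and user audio tokens;
% h_{<t}: the joint multi-stream history.
```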

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need to better substantiate the full-duplex claims in the abstract. We address this point directly below and propose a targeted revision.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the parallel user/system speech streams together with the inner-monologue text prefix produce coherent, high-quality responses 'across arbitrary conversational dynamics' without explicit turn segmentation is load-bearing for the 'first real-time full-duplex spoken LLM' assertion. The abstract supplies only high-level illustrations of quality gains and streaming capability; no quantitative ablations, error rates, or targeted metrics are reported for interruption handling, overlap resolution, or coherence degradation when user speech arrives mid-generation.

    Authors: We agree that the abstract, constrained by length, presents the claims at a high level without embedding specific quantitative metrics for interruption handling or coherence under mid-generation user speech. The manuscript body (Sections 3 and 4) provides the supporting architecture details, latency measurements (theoretical 160 ms, measured 200 ms), qualitative demonstrations of overlap and interruption handling via parallel streams, and ablations of the inner-monologue prefix showing improved linguistic quality. No dedicated error-rate metrics (e.g., word-error-rate on overlapped segments or coherence scores under interruption) are reported. We will revise the abstract to (a) explicitly state the measured latency, (b) note that parallel streams enable modeling of arbitrary dynamics without turn segmentation, and (c) reference the evaluation sections for supporting evidence. This constitutes a partial revision focused on clarity rather than new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity in architectural claims or latency derivation

full rationale

The paper's core contribution is an architectural framework that casts spoken dialogue as parallel-stream speech-to-speech generation with an inner-monologue text prefix. Latency bounds (160 ms theoretical, 200 ms practical) and full-duplex capability follow directly from the removal of explicit turn segmentation and the choice of residual quantizer tokens; these are design consequences, not quantities fitted to data and then re-labeled as predictions. No equations, self-definitional loops, or load-bearing self-citations are present in the provided derivation chain. The result is self-contained as an engineering system whose performance claims rest on implementation rather than tautological reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard neural-network training assumptions plus two new architectural elements (parallel speech streams and inner monologue) introduced without independent external validation in the provided text.

axioms (2)
  • domain assumption Neural audio codecs produce faithful discrete representations of speech signals suitable for autoregressive generation
    The model generates speech tokens from the residual quantizer of a neural audio codec.
  • domain assumption Transformer language models can be extended to joint text-audio token prediction without fundamental architectural changes
    The backbone is described as a text language model extended to audio tokens.
invented entities (2)
  • Inner Monologue no independent evidence
    purpose: Time-aligned text token prediction that precedes and conditions audio token generation
    New prefix mechanism introduced to improve linguistic quality and enable streaming ASR/TTS.
  • Parallel speech streams no independent evidence
    purpose: Separate modeling of user and system speech to support full-duplex and overlapping speech without turn segmentation
    Core design choice that removes explicit speaker turns.

pith-pipeline@v0.9.0 · 5628 in / 1551 out tokens · 98120 ms · 2026-05-12T08:07:03.153429+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Privacy Auditing with Zero (0) Training Run

    cs.CR 2026-05 unverdicted novelty 8.0

    Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

  2. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  3. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  4. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  5. LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

    eess.IV 2026-05 unverdicted novelty 7.0

    LiVeAction is a lightweight asymmetric neural codec using an FFT-inspired encoder and variance-based training that outperforms generative tokenizers in rate-distortion while supporting real-time use on resource-constr...

  6. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  7. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  8. SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.

  9. Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

    cs.CL 2026-04 unverdicted novelty 7.0

    Human-1 is the first open full-duplex spoken dialogue system for Hindi, created by adapting Moshi with a custom tokenizer and training on 26,000 hours of real-world conversations to enable natural interruptions and overlaps.

  10. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  11. Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

    eess.AS 2026-04 unverdicted novelty 7.0

    Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.

  12. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  13. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  14. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  15. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  16. Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

    cs.SD 2026-05 unverdicted novelty 6.0

    Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.

  17. Exploring Token-Space Manipulation in Latent Audio Tokenizers

    cs.SD 2026-05 unverdicted novelty 6.0

    LATTE creates a compact latent token bottleneck in audio tokenizers that aggregates global information and enables unsupervised editing of attributes like speaker identity via token swapping.

  18. PoDAR: Power-Disentangled Audio Representation for Generative Modeling

    eess.AS 2026-05 unverdicted novelty 6.0

    PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...

  19. Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

    eess.AS 2026-05 unverdicted novelty 6.0

    L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.

  20. Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

    cs.LG 2026-05 unverdicted novelty 6.0

    An initial continuous autoencoder training phase prevents dimensional collapse in VQ-VAEs and yields lower reconstruction and perceptual losses.

  21. Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

    cs.LG 2026-05 unverdicted novelty 6.0

    A warm-up phase training VQ-VAEs as autoencoders first avoids dimensional collapse and yields better reconstruction and perceptual quality.

  22. VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

    cs.SD 2026-05 unverdicted novelty 6.0

    VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

  23. MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

    cs.SD 2026-05 accept novelty 6.0

    MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.

  24. Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 6.0

    A contrastive LLM fine-tuning method creates joint embeddings for dialogue contexts and backchannel realizations, improving retrieval performance and alignment with human judgments over raw WavLM features.

  25. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  26. Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...

  27. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

  28. PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

    cs.CV 2026-04 unverdicted novelty 6.0

    PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

  29. FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

  30. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  31. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 5.0

    TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

  32. Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

    eess.AS 2026-04 unverdicted novelty 5.0

    A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.

  33. Sema: Semantic Transport for Real-Time Multimodal Agents

    cs.MM 2026-04 unverdicted novelty 5.0

    Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.

  34. Voxtral TTS

    cs.AI 2026-03 unverdicted novelty 5.0

    Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in hu...

  35. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  36. PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

    cs.AI 2026-04 unverdicted novelty 4.0

    PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while fi...

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · cited by 34 Pith papers · 22 internal anchors

  1. [1]

    Watermarking gpt outputs, 2023

    Scott Aaronson and Hendrik Kirchner. Watermarking gpt outputs, 2023. URL https://www.scottaaronson.com/talks/watermark.ppt

  2. [2]

    The Falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867, 2023

  3. [3]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv:2006.11477, 2020

  4. [4]

    Semantic parsing on Freebase from question-answer pairs

    Jonathan Berant, Andrew K. Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Conference on Empirical Methods in Natural Language Processing, 2013

  5. [5]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, 2020

  6. [6]

    Audiolm: A language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  7. [7]

    Soundstorm: Efficient parallel audio generation

    Zalán Borsos, Matthew Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi. Soundstorm: Efficient parallel audio generation. CoRR, abs/2305.09636, 2023. doi:10.48550/ARXIV.2305.09636

  8. [8]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe

    Hervé Bredin. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Proc. INTERSPEECH 2023, 2023

  9. [9]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  10. [10]

    Quantifying Memorization Across Neural Language Models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

  11. [11]

    Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition

    \" O zg \" u r C etin and Elizabeth Shriberg. Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition. In Ninth International Conference on Spoken Language Processing, INTERSPEECH-ICSLP 2006, Pittsburgh, PA, USA, September 17-21, 2006 . ISCA , 2006

  12. [12]

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 2022

  13. [13]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek B Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, ...

  14. [14]

    w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU. IEEE, 2021

  15. [15]

    Fisher english training speech parts 1 and 2

    Christopher Cieri, David Miller, and Kevin Walker. Fisher english training speech parts 1 and 2. https://doi.org/10.35111/da4a-se30, 2004

  16. [16]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  17. [17]

    Fast and accurate deep network learning by exponential linear units (elus)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016

  18. [18]

    Simple and controllable music generation

    Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Proc...

  19. [19]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344--16359, 2022

  20. [20]

    Real time speech enhancement in the waveform domain

    Alexandre Defossez, Gabriel Synnaeve, and Yossi Adi. Real time speech enhancement in the waveform domain. In Interspeech, 2020

  21. [21]

    High fidelity neural audio compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Transactions on Machine Learning Research, 2023

  22. [22]

    The case for 4-bit precision: k-bit inference scaling laws

    Tim Dettmers and Luke Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7750--7774. PM...

  23. [23]

    LLM.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022

  24. [24]

    BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171--4186. Association for Computational Linguistics, 2019. doi...

  25. [25]

    Icassp 2023 deep noise suppression challenge

    Harishchandra Dubey, Ashkan Aazami, Vishak Gopal, Babak Naderi, Sebastian Braun, Ross Cutler, Hannes Gamper, Mehrsa Golestaneh, and Robert Aichner. Icassp 2023 deep noise suppression challenge. In ICASSP, 2023

  26. [26]

    The zero resource speech challenge 2021: Spoken language modelling

    Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, and Emmanuel Dupoux. The zero resource speech challenge 2021: Spoken language modelling. In Interspeech. ISCA, 2021. doi:10.21437/Interspeech.2021-1755

  27. [27]

    Stable audio open

    Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. arXiv preprint arXiv:2407.14358, 2024

  28. [28]

    Watermarking images in self-supervised latent spaces

    Pierre Fernandez, Alexandre Sablayrolles, Teddy Furon, Hervé Jégou, and Matthijs Douze. Watermarking images in self-supervised latent spaces. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  29. [29]

    Three bricks to consolidate watermarks for large language models

    Pierre Fernandez, Antoine Chaffin, Karim Tit, Vivien Chappelier, and Teddy Furon. Three bricks to consolidate watermarks for large language models. In Proc. International Workshop on Information Forensics and Security (WIFS), 2023

  30. [30]

    OPTQ: Accurate post-training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate post-training quantization for generative pre-trained transformers. In ICLR, 2023

  31. [31]

    Gemini: A Family of Highly Capable Multimodal Models

    Team Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  32. [32]

    OLMo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Nai...

  33. [33]

    Textually pretrained speech language models

    Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, and Yossi Adi. Textually pretrained speech language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Informatio...

  34. [34]

    Eben: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones

    Julien Hauret, Thomas Joubaud, Véronique Zimpfer, and Éric Bavu. Eben: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  35. [36]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016b

  36. [37]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  37. [38]

    ViSQOL: an objective speech quality model

    Andrew Hines, Jan Skoglund, Anil C Kokaram, and Naomi Harte. ViSQOL: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015(1):1--18, 2015

  38. [39]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in neural information processing systems, 2020

  39. [40]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  40. [41]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019

  41. [42]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process., 29, 2021

  42. [43]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  43. [44]

    Bag of Tricks for Efficient Text Classification

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016

  44. [45]

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

    Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024

  45. [46]

    Libri-light: A benchmark for ASR with limited or no supervision

    Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, and Emmanuel Dupoux. Libri-light: A benchmark for ASR with limited or no supervision. In IEEE Inte...

  46. [47]

    Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

    Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Trans. Assoc. Comput. Linguistics, 11:1703--1718, 2023. doi:10.1162/tacl_a_00618

  47. [48]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, 2015

  48. [49]

    A watermark for large language models

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning. PMLR, 2023

  49. [50]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018

  50. [51]

    High-fidelity audio compression with improved RVQGAN

    Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36, 2023

  51. [52]

    High-fidelity audio compression with improved rvqgan

    Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved rvqgan. In Advances in Neural Information Processing Systems, 2024

  52. [53]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453--466, 2019

  53. [54]

    On generative spoken language modeling from raw audio

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336--1354, 2021

  54. [55]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022

  55. [56]

    An independence-promoting loss for music generation with language models

    Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, and Alexandre Défossez. An independence-promoting loss for music generation with language models. In ICML, 2024

  56. [57]

    AudioSR: Versatile audio super-resolution at scale

    Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, and Mark D Plumbley. AudioSR: Versatile audio super-resolution at scale. arXiv preprint arXiv:2309.07314, 2023a

  57. [58]

    AudioLDM: Text-to-audio generation with latent diffusion models

    Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. In Proceedings of the International Conference on Machine Learning, 2023b

  58. [59]

    SemantiCodec: An ultra low bitrate semantic audio codec for general sound. arXiv preprint arXiv:2405.00233, 2024

    Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, and Mark D Plumbley. Semanticodec: An ultra low bitrate semantic audio codec for general sound. arXiv preprint arXiv:2405.00233, 2024

  59. [60]

    The llama 3 herd of models

    Team Llama. The llama 3 herd of models. preprint, 2024

  60. [61]

    Mosnet: Deep learning based objective assessment for voice conversion

    Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning based objective assessment for voice conversion. In Proc. Interspeech 2019, 2019

  61. [62]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

  62. [63]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  63. [64]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 , 2019

  64. [65]

    whisper-timestamped

    Jérôme Louradour. whisper-timestamped. https://github.com/linto-ai/whisper-timestamped, 2023

  65. [66]

    Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

    Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, and Shinji Watanabe. Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks. ArXiv, abs/2309.07937, 2023

  66. [67]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018

  67. [68]

    Pslm: Parallel generation of text and speech with llms for low-latency spoken dialogue systems, 2024

    Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, and Kei Sawada. Pslm: Parallel generation of text and speech with llms for low-latency spoken dialogue systems, 2024

  68. [69]

    Spoken question answering and speech continuation using spectrogram-powered LLM

    Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered LLM. In The Twelfth International Conference on Learning Representations, 2024

  69. [70]

    A white paper on neural network quantization, 2021

    Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization, 2021

  70. [71]

    Generative spoken dialogue language modeling

    Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed, and Emmanuel Dupoux. Generative spoken dialogue language modeling. Transactions of the Association for Computational Linguistics, 11:250--266, 2023. doi:10.1162/tacl_a_00545

  71. [72]

    Spirit-lm: Interleaved spoken and written language model

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, and Emmanuel Dupoux. Spirit-lm: Interleaved spoken and written language model. CoRR, abs/2402.05755, 2024. doi:10.48550/ARXIV.2402.05755

  72. [73]

    Stateful conformer with cache-based inference for streaming automatic speech recognition

    Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, and Boris Ginsburg. Stateful conformer with cache-based inference for streaming automatic speech recognition. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12041--12045. IEEE, 2024

  73. [74]

    Librispeech: An ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206--5210. IEEE, 2015. doi:10.1109/ICASSP.2015.7178964

  74. [75]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018

  75. [76]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019

  76. [77]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawai...

  77. [78]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanz...

  78. [79]

    AudioPaLM: A large language model that can speak and listen

    Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara N. Sainath, Johan Schalkwyk, Matthew Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tud...

  79. [80]

    Radioactive data: tracing through training

    Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: tracing through training. In International Conference on Machine Learning, pages 8326--8335. PMLR, 2020

  80. [81]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106, 2021

Showing first 80 references.