Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2 · dataset 1

representative citing papers

Learning When to Think While Listening in Large Audio-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

A wait-think-answer controller for LALMs is trained via SFT followed by six-reward DAPO, raising row-weighted accuracy from 67.6% to 70.3% and cutting post-endpoint thinking length by 14% on synthetic spoken QA while remaining functional on real recorded audio.

Training-Free Multimodal Large Language Model Orchestration

cs.CL · 2025-08-06 · unverdicted · novelty 6.0 · 2 refs

LLM Orchestration integrates modality experts via an LLM controller, cross-modal memory, and interaction layer to enable multimodal input-output without gradient-based training.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

cs.CL · 2024-12-03 · conditional · novelty 6.0

GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175 bps single-codebook speech tokenizer from an ASR model, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

eess.AS · 2026-06-11 · unverdicted · novelty 5.0

ModeratorLM conditions a streaming speech LLM on assigned roles for adaptive turn-taking in multi-party settings, reporting over 40% higher precision and 70% higher recall than non-role baselines on real meetings and a new synthetic dataset.

Beyond Words: Multimodal LLM Knows When to Speak

cs.CV · 2025-05-20 · unverdicted · novelty 5.0

MM-When2Speak reformulates conversational timing as dense response-type prediction and achieves up to 3x better performance by integrating video, audio, and text cues on top of an LLM backbone using a new dyadic conversation dataset.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
