Mumu-llama: Multi-modal music understanding and generation via large language models

Mert: Acoustic music understanding model with large-scale self-supervised training · 2024 · arXiv 2412.06660

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

cs.SD · 2025-09-26 · unverdicted · novelty 6.0

A zero-training VLM framework generates music from images via ABC notation, multi-modal RAG, and self-refinement while providing text and visual explanations for the outputs.

EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement

cs.SD · 2026-06-01 · unverdicted · novelty 4.0

EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.

citing papers explorer


MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs · cs.CL · 2026-05-28 · unverdicted · none · ref 3
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach · cs.SD · 2025-09-26 · unverdicted · none · ref 14
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement · cs.SD · 2026-06-01 · unverdicted · none · ref 19
