MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.
Mumu-llama: Multi- modal music understanding and generation via large language models
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3representative citing papers
A zero-training VLM framework generates music from images via ABC notation, multi-modal RAG, and self-refinement while providing text and visual explanations for the outputs.
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
citing papers explorer
-
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
MusTBENCH evaluates temporal grounding in large audio-language models via five expert-validated tasks, and MusT improves performance through encoder adaptation, LLM adaptation, supervised fine-tuning, and RL optimization.
-
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
A zero-training VLM framework generates music from images via ABC notation, multi-modal RAG, and self-refinement while providing text and visual explanations for the outputs.
-
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.