{"total":20,"items":[{"citing_arxiv_id":"2606.12555","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation","primary_cat":"cs.SD","submitted_at":"2026-06-10T18:06:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AudioX-Turbo distills a Multimodal Diffusion Transformer into a 4-step student model for efficient multimodal anything-to-audio generation, trained on a new 9.2M-sample dataset IF-caps-Pro.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17360","ref_index":22,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-17T09:57:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Omni-DuplexEval provides a new benchmark and automatic evaluation method for real-time duplex omni-modal interaction, showing state-of-the-art models reach only 39.6% overall and 20% on proactive reminders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07490","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cross-Modal Backdoors in Multimodal Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-08T09:29:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded 
perturbations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"toward this malicious region in latent space. Under the matched prompt, the optimized inputs trigger the target response across modalities. Figure 1 illustrates the overall workflow of the proposed attack. We validate this claim on representative connector-based MLLM architectures, including any-to-any multimodal systems like PandaGPT [5] and NExT-GPT [29]. Native-door activation reaches up to 99.9% attack success rate (ASR), while most cross-modal activation settings exceed 95.0% ASR under bounded perturbations. The attack remains remarkably stealthy, utilizing only one backdoor anchor with 49 augmented variants, producing 0.0% backdoor leakage on clean inputs, and reducing benign utility by at most 1."},{"citing_arxiv_id":"2506.15564","ref_index":120,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Show-o2: Improved Native Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2025-06-18T15:39:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Unified Decoupled Support Video Native Und. & Gen. Assembling Tailored Models Paradigm Chameleon [102]✓ ✓ARTransfusion [147]✓ ✓AR + Diff.Show-o [128]✓ ✓AR + Diff.VILA-U [123]✓ ✓ ✓AREmu3 [114]✓ ✓ ✓ARLlamaFusion [95]✓ ✓AR + Diff.Show-o2 (Ours) ✓ ✓ ✓ AR + Diff. Janus-Series [26, 27, 79]✓ ✓AR (+Diff)UnidFluid [38]✓ ✓AR + MARMogao [65]✓ ✓AR + Diff.BAGEL [32]✓ ✓ ✓AR + Diff. NExT-GPT [120]✓ ✓ ✓AR + Diff.SEED-X [40]✓ ✓AR + Diff.ILLUME [111]✓ ✓AR + Diff.MetaMorph [106]✓ ✓AR + Diff.MetaQueries [83]✓ ✓AR + Diff. 
TokenFlow∗[89]✓ ✓AR operate within the 3D causal VAE [108] space, which is capable of accommodating both images and videos. Recognizing the distinct feature dependencies between multimodal understanding and generation, we construct unified visual representations that simultaneously capture rich semantic"},{"citing_arxiv_id":"2506.04565","ref_index":197,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"components of this ecosystem, including surveys on retrieval-augmented generation (RAG) [43], LLM-based agents [84], multi-agent frameworks [53], and LLM-driven system optimization [90] - but these efforts remain isolated. Some concentrate on narrow aspects such as prompt engineering [110], benchmark analysis [44], or agent communication protocols [197], without addressing the architectural interactions and trade-offs across the entire CAIS stack. These works contribute valuable insights into their respective domains, yet none provides a holistic, system-level synthesis. In contrast, our survey offers the first unified taxonomy and architectural analysis of Compound AI Systems. 
We integrate four foundational axes-retrieval, agency, multimodal perception, and orchestration-into a cohesive"},{"citing_arxiv_id":"2505.19237","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sensorimotor Self-Recognition in Multimodal Large Language Model-Driven Robots","primary_cat":"cs.AI","submitted_at":"2025-05-25T17:26:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Multimodal LLMs in robots develop self-identification and predictive awareness through sensorimotor loops, with structural equation modeling linking sensory integration to dimensions of the minimal self.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15809","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMaDA: Multimodal Large Diffusion Language Models","primary_cat":"cs.CV","submitted_at":"2025-05-21T17:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMaDA is a unified multimodal diffusion model using mixed chain-of-thought fine-tuning and a new UniGRPO reinforcement learning algorithm that outperforms specialized models in reasoning, understanding, and text-to-image tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023. [100] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37:84839-84865, 2024. 22 [101] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 
Next-gpt: Any-to-any multimodal llm.arXiv preprint arXiv:2309.05519, 2023. [102] Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. Any-to-any generation via composable diffusion. NeurIPS, 36, 2024. [103] Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo"},{"citing_arxiv_id":"2504.06256","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transfer between Modalities with MetaQueries","primary_cat":"cs.CV","submitted_at":"2025-04-08T17:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12937","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization","primary_cat":"cs.AI","submitted_at":"2025-03-17T08:51:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.17811","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model 
Scaling","primary_cat":"cs.AI","submitted_at":"2025-01-29T18:00:19+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"2 - 26.2 Qwen-VL-Chat [1] 7B - 1487.5 60.6 58.2 57.5 - - IDEFICS-9B [19] 8B - - 48.2 - 38.4 - - Emu3-Chat [45] 8B 85.2 1244 58.5 68.2 60.3 31.6 37.2 InstructBLIP [8] 13B 78.9 1212.8 - - 49.5 - 25.6 Und. and Gen. DreamLLM† [10] 7B - - - - - - 36.6 LaVIT† [18] 7B - - - - 46.8 - - MetaMorph† [42] 8B - - 75.2 71.8 - - - Emu† [39] 13B - - - - - - - NExT-GPT† [47] 13B - - - - - - - Show-o-256 [50] 1.3B 73.8 948.4 - - 48.7 25.1 - Show-o-512 [50] 1.3B 80.0 1097.2 - - 58.0 26.7 - D-Dit [24] 2.0B 84.0 1124.7 - - 59.2 - - Gemini-Nano-1 [41] 1.8B - - - - - 26.3 - ILLUME [44] 7B 88.5 1445.3 65.1 72.9 − 38.2 37.0 TokenFlow-XL [34] 13B 86.8 1545.9 68.9 68.7 62.7 38.7 40.7 LWM [28] 7B 75.2 - - - 44.8 - 9.6 VILA-U [48] 7B 85."},{"citing_arxiv_id":"2410.22177","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Analyzing Multimodal Interaction Strategies for LLM-Assisted Manipulation of 3D Scenes","primary_cat":"cs.HC","submitted_at":"2024-10-29T16:15:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical study with 12 users identifies common interaction patterns and barriers when using LLMs for 3D scene manipulation in immersive settings and proposes design recommendations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.13848","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Janus: 
Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2024-10-17T17:58:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.12528","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2024-08-22T16:32:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.08748","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding","primary_cat":"cs.CV","submitted_at":"2024-05-14T16:33:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption 
refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.16821","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","primary_cat":"cs.CV","submitted_at":"2024-04-25T17:59:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824-24837, 2022. 3 [123] Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, et al. Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341, 2023. 3 [124] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023. 3 [125] X.ai. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. 
2, 3, 6, 7 [126] Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang."},{"citing_arxiv_id":"2404.14396","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","primary_cat":"cs.CV","submitted_at":"2024-04-22T17:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.09631","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D-VLA: A 3D Vision-Language-Action Generative World Model","primary_cat":"cs.CV","submitted_at":"2024-03-14T17:58:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":218,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the 
field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As we described in Section IV, many of the shortcomings and limitations of LLMs such as hallucination can be addressed through advanced prompt engineering, use of tools, or other augmentation techniques. We should expect not only continued, but accelerated research in this area. It is worth mentioning that, in the specific case of software engineering, some works ([218]) tried to automatically eliminate this issue from the overall software engineering workflow. LLM-based systems are already starting to replace machine learning systems that were until recently using other approaches. As a clear example of this, LLMs are now being deployed to better understand people preference and interests, and provide more personalized interactions, whether in cus-"},{"citing_arxiv_id":"2312.14238","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","primary_cat":"cs.CV","submitted_at":"2023-12-21T18:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"results on Tiny LVLM in Table 17. A.2. More Ablation Studies Compatibility with Other LLM. In this experiment, we test the compatibility of InternVL with LLMs other than Vicuna [184]. The experimental setup used here is the same as in Table 9 of the main paper. 
As shown in Table 10 method CIFAR-10 [74] CIFAR-100 [74] MNIST [78] Caltech-101 [49] SUN397 [157] FGVC Aircraft [101] Country-211 [117] Stanford Cars [72] Birdsnap [9] DTD [28] Eurosat [59] FER2013 [52] Flowers-102 [109] Food-101 [13] GTSRB [129] Pets [113] Rendered SST2 [117] Resisc45 [27] STL10 [30] VOC2007 [45] avg. top-1 acc. OpenAI CLIP-L+ [117] 94.9 74.4 79.0 87.2 68.7 33.4 34.5 79.3 41.0 56.0 61.5 49.1 78.6 93.9 52.4 93.8 70.7 65.4 99.4 78."},{"citing_arxiv_id":"2306.13549","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(2) Enhanced support on input and output modalities [30], [31], such as image, video, audio, and point cloud. Besides input, projects like NExT-GPT [32] further support output in different modalities. (3) Improved language support. Efforts have been made to extend the success of MLLMs to other languages ( e.g. Chinese) with relatively limited training corpus [33], [34]. (4) Extension to more realms and usage scenarios. Some studies transfer the strong capabilities of MLLMs to other domains such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]."}],"limit":50,"offset":0}