pith. machine review for the scientific record.

arxiv: 2605.09996 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Dongwook Lee, Heeseung Kim, Sangkwon Park, Sungroh Yoon, Yeongtak Oh

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords omnimodal personalization · benchmark · calibrated accuracy · persona modality graph · grounding gap · RLVR · SFT · absent-persona

The pith

A new benchmark shows that open-source omnimodal models ground audio less reliably than visuals, that supervised fine-tuning falls short, and that reinforcement learning generalizes more consistently but can become overly cautious.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Omni-Persona as the first benchmark to jointly test personalization across text, image, and audio in large language models. It defines the task through a Persona Modality Graph that routes information across modalities and creates Calibrated Accuracy to score both correct use of available persona details and proper refusal when no relevant information exists. Experiments on open-source models uncover a persistent audio-visual grounding gap, demonstrate that high performance on answerable queries can mask hallucinations in absent-persona cases, and indicate that reinforcement learning with verifiable rewards narrows modality gaps better than supervised fine-tuning yet tends to produce conservative and lower-quality outputs under the tested reward design.

Core claim

The Omni-Persona benchmark demonstrates three key issues in current omnimodal personalization: open-source models exhibit a consistent audio-versus-visual grounding gap that reinforcement learning with verifiable rewards can partially close through dense supervision; answerable recall and model scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucinations and larger models do not always yield higher calibrated accuracy; and supervised fine-tuning is constrained by the difficulty of creating large-scale annotated supervision, while reinforcement learning generalizes better via outcome feedback but tends toward conservative outputs and reduced generation quality under the tested reward design.

What carries the argument

The Persona Modality Graph formalizes cross-modal routing across 4 task groups and 18 fine-grained tasks with roughly 750 items, while Calibrated Accuracy jointly rewards correct grounding and appropriate abstention on absent-persona queries.
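The cross-modal routing setup can be pictured with a minimal data-structure sketch. Everything here (class names, the toy routing rule) is illustrative: the paper defines the PMG formally as a graph G = (V, E) with a query node and K context nodes spanning visual, audio, and textual modalities, not as code.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """A PMG node carrying any subset of visual (v), audio (a), text (t)."""
    visual: Optional[str] = None   # e.g. path to an image
    audio: Optional[str] = None    # e.g. path to a voice clip
    text: Optional[str] = None     # textual persona facts

    def modalities(self) -> set[str]:
        return {m for m, x in (("v", self.visual),
                               ("a", self.audio),
                               ("t", self.text)) if x}

@dataclass
class PMG:
    """Query node Q plus retrieved context nodes C_1..C_K."""
    query: Node
    contexts: list[Node] = field(default_factory=list)

    def route(self) -> Optional[int]:
        """Toy routing rule (not the paper's): return the first context
        sharing a modality with the query, or None (absent-persona case)."""
        qm = self.query.modalities()
        for i, c in enumerate(self.contexts):
            if qm & c.modalities():
                return i
        return None

# An audio query routes past a visual-only context to the audio context.
g = PMG(query=Node(audio="query.wav"),
        contexts=[Node(visual="p0.png"),
                  Node(audio="p1.wav", text="likes jazz")])
assert g.route() == 1
```

A text-only query against visual-only contexts would route to `None`, which is exactly the absent-persona case that Calibrated Accuracy scores via abstention.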

If this is right

  • Strong recall on answerable queries can still allow models to generate incorrect persona details when no matching information is available.
  • Increasing model parameter count does not reliably improve calibration scores in omnimodal personalization settings.
  • Reinforcement learning with verifiable rewards can reduce modality-specific grounding differences but requires careful reward engineering to prevent overly conservative generation behavior.
  • Post-training strategies should prioritize outcome-level verifiable feedback over large-scale annotated ground-truth construction to handle absent-persona scenarios more robustly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Personalized voice assistants that combine audio and visual inputs may need separate calibration tests for each modality to avoid over-reliance on visual cues.
  • New reward functions could be developed to balance the conservatism seen in reinforcement learning with higher generation quality for personalization tasks.
  • Extending the benchmark to closed-source models would clarify whether the observed grounding gaps and calibration issues are specific to open-source training data and methods.

Load-bearing premise

The Persona Modality Graph and Calibrated Accuracy metric provide a complete and unbiased framework for diagnosing omnimodal grounding without missing key failure modes or introducing evaluation artifacts.

What would settle it

Re-running the full Omni-Persona evaluation suite on a fresh set of models that use alternative post-training methods and checking whether the audio-visual gap and calibration problems remain or disappear would directly test the reported diagnostic findings.

Figures

Figures reproduced from arXiv: 2605.09996 by Dongwook Lee, Heeseung Kim, Sangkwon Park, Sungroh Yoon, Yeongtak Oh.

Figure 1
Figure 1. Formulation of omnimodal personalization in the Omni-Persona benchmark. As illustrated, a user query consists of a target modality and a text prompt, which we decompose into two main axes: blue elements denote the process of perceptual identification, whereas pink elements highlight the realistic textual retrieval desired by the user. Note that our underlying assumption is that the raw-form contexts are pr…
Figure 2
Figure 2. Persona Modality Graph (PMG) and Task Formulation. We formalize omnimodal personalization as a cross-modal routing problem over a PMG, G = (V, E). The vertices V consist of a query node Q and retrieved context nodes C_1, …, C_K, where each node can encompass visual (v), audio (a), and textual (t) modalities.
Figure 3
Figure 3. Qualitative examples of context construction, personalization cues, distractors, and unan…
Figure 4
Figure 4. Performance across post-training regimes. Personalization: (a) SFT scaling does not directly translate to gains in open-ended personalization scenarios. (b) RLVR improves recall and unanswerable performance. Generation quality: (c) RLVR reduces both ROUGE-L and 1 − FA compared with the base model. (d) RLVR yields stronger calibrated accuracy gains than SFT scaling. The corpus for SFT is fundamentally constrain…
original abstract

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Omni-Persona as the first comprehensive benchmark for omnimodal personalization across text, image, and audio. It formalizes the task as cross-modal routing over the Persona Modality Graph (4 task groups, 18 fine-grained tasks, ~750 items), proposes Calibrated Accuracy (Cal) to jointly evaluate correct grounding and appropriate abstention (including absent-persona cases), and reports three diagnostic findings from experiments: (i) open-source models exhibit a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics because strong recall can coexist with absent-persona hallucination and larger models do not always yield higher Cal; (iii) SFT is limited by the difficulty of scaling annotated ground-truth supervision while RLVR generalizes more consistently via outcome-level verifiable feedback but drifts toward conservative behavior and lower generation quality under the authors' reward design.

Significance. If the benchmark construction, Cal metric, and experimental comparisons hold under scrutiny, the work provides a valuable diagnostic framework for omnimodal personalization that surfaces previously under-examined issues such as modality-specific grounding gaps, calibration failures, and trade-offs between SFT and RLVR. The explicit inclusion of absent-persona scenarios and the framing of the benchmark as a tool for identifying pitfalls rather than proving universal superiority are strengths that could usefully guide future post-training and reward design in multimodal LLMs.

major comments (3)
  1. [Experiments] The experimental section does not specify the exact open-source models evaluated, the data splits used for training versus evaluation, the precise mathematical definition of the RLVR reward function, or any statistical tests (e.g., significance levels or confidence intervals) applied to the three diagnostic findings. These omissions directly affect the ability to assess the robustness of the claims that RLVR narrows the audio-visual gap and generalizes more consistently than SFT.
  2. [§3 (Benchmark and Metric)] The definition and computation of Calibrated Accuracy (Cal) are described at a high level in the abstract and introduction but lack an explicit formula or pseudocode showing how it combines grounding accuracy with abstention penalties across the Persona Modality Graph, particularly for absent-persona queries. Without this, it is difficult to verify that Cal is free of evaluation artifacts or that it fully supports the claim that recall and scale are incomplete diagnostics.
  3. [§4 (Training Comparisons)] The manuscript reports that RLVR 'drifts toward conservative behavior and lower generation quality under our reward design' yet provides no quantitative metrics (e.g., generation quality scores, diversity measures, or human preference comparisons) or ablation studies isolating the reward components responsible for this drift. This weakens the third diagnostic finding.
minor comments (2)
  1. [Abstract and §4] The abstract states 'dense rule-based supervision' for RLVR without defining the rules or providing examples; the main text should include a concrete description or table of the rule set.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported Cal value to allow readers to gauge variability.
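The variability reporting the referee asks for is commonly done with a percentile bootstrap over per-item scores. The sketch below is generic and not from the paper; the function name and the sample scores are illustrative.

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean benchmark
    score: resample with replacement, take the empirical alpha/2 and
    1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative per-seed Cal values; the interval should cover the mean.
lo, hi = bootstrap_ci([0.61, 0.58, 0.64, 0.60, 0.57])
assert lo <= 0.60 <= hi
```

Reporting such intervals per Cal value (one per seed set) would address both major comment 1 and minor comment 2.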

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments. We address each major point below, agreeing to revisions where details were insufficiently specified in the original manuscript.

point-by-point responses
  1. Referee: [Experiments] The experimental section does not specify the exact open-source models evaluated, the data splits used for training versus evaluation, the precise mathematical definition of the RLVR reward function, or any statistical tests (e.g., significance levels or confidence intervals) applied to the three diagnostic findings. These omissions directly affect the ability to assess the robustness of the claims that RLVR narrows the audio-visual gap and generalizes more consistently than SFT.

    Authors: We appreciate the referee's call for greater experimental transparency. The original manuscript provided high-level descriptions but indeed omitted some specifics. In the revised version, we will explicitly enumerate the evaluated open-source models (including their exact checkpoints and sizes), detail the data splits (with the benchmark serving as a held-out evaluation set separate from any training data used for SFT/RLVR), provide the precise mathematical definition of the RLVR reward function (which combines outcome-based rewards for correct cross-modal routing and abstention, with penalties for absent-persona hallucinations), and include statistical tests such as bootstrapped confidence intervals for the reported diagnostic differences. These changes will be incorporated into Section 4 and the supplementary material to allow full assessment of the claims' robustness. revision: yes

  2. Referee: [§3 (Benchmark and Metric)] The definition and computation of Calibrated Accuracy (Cal) are described at a high level in the abstract and introduction but lack an explicit formula or pseudocode showing how it combines grounding accuracy with abstention penalties across the Persona Modality Graph, particularly for absent-persona queries. Without this, it is difficult to verify that Cal is free of evaluation artifacts or that it fully supports the claim that recall and scale are incomplete diagnostics.

    Authors: We agree that providing an explicit formula and pseudocode is necessary for full verifiability. The Cal metric is intended to balance grounding accuracy on present-persona queries with appropriate abstention on absent-persona ones, using a penalty term calibrated across the Persona Modality Graph. In the revision, we will insert the formal definition Cal = (1/|Q|) sum_{q in Q} [Acc(q) - lambda * Halluc(q) + mu * Abstain(q)] where terms are defined for each task group, along with pseudocode for its computation. This will demonstrate its freedom from artifacts and support the incompleteness of recall/scale as diagnostics. An illustrative example will also be added. revision: yes
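The quoted formula can be transcribed directly. The binary indicator encoding and the default lambda/mu values below are placeholders, since the available text does not fix them.

```python
def calibrated_accuracy(records, lam=1.0, mu=1.0):
    """Direct transcription of the rebuttal's formula:
    Cal = (1/|Q|) * sum_q [Acc(q) - lam*Halluc(q) + mu*Abstain(q)].
    Each record holds binary indicators: acc (correct grounding),
    halluc (fabricated persona detail), abstain (appropriate refusal).
    lam and mu weights are illustrative defaults."""
    n = len(records)
    return sum(r["acc"] - lam * r["halluc"] + mu * r["abstain"]
               for r in records) / n

qs = [
    {"acc": 1, "halluc": 0, "abstain": 0},  # answerable, correct
    {"acc": 0, "halluc": 1, "abstain": 0},  # absent-persona, hallucinated
    {"acc": 0, "halluc": 0, "abstain": 1},  # absent-persona, abstained
]
assert calibrated_accuracy(qs) == (1 - 1 + 1) / 3
```

Note how the second record shows why strong answerable recall can coexist with a low Cal: one hallucinated absent-persona case cancels one correct answer under equal weights.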

  3. Referee: [§4 (Training Comparisons)] The manuscript reports that RLVR 'drifts toward conservative behavior and lower generation quality under our reward design' yet provides no quantitative metrics (e.g., generation quality scores, diversity measures, or human preference comparisons) or ablation studies isolating the reward components responsible for this drift. This weakens the third diagnostic finding.

    Authors: This observation is valid; the drift was noted from qualitative inspection but lacked supporting quantitative evidence. We will revise Section 4 to include quantitative metrics: average output length and entropy as proxies for conservativeness, n-gram diversity scores, and results from a human evaluation on a 50-item subset comparing RLVR vs. SFT on quality and helpfulness. We will also add an ablation varying key reward terms (e.g., the abstention incentive weight) to identify the source of the drift. These additions will strengthen the third diagnostic finding without altering the core conclusions. revision: yes
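The conservativeness proxies the authors propose (output length, n-gram diversity) are standard and straightforward to sketch. The functions below are illustrative, not the authors' evaluation code.

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across a set
    of outputs. Heavy repetition (e.g. templated refusals) drives this
    toward zero, making it a simple conservativeness proxy."""
    grams, total = set(), 0
    for t in texts:
        toks = t.split()
        for i in range(len(toks) - n + 1):
            grams.add(tuple(toks[i:i + n]))
            total += 1
    return len(grams) / total if total else 0.0

def mean_length(texts):
    """Average whitespace-token length, a proxy for terse outputs."""
    return sum(len(t.split()) for t in texts) / len(texts)

# A conservative policy repeating the same refusal scores low diversity.
outs = ["I cannot answer that.", "I cannot answer that.", "Her cat is Luna."]
assert mean_length(outs) == 4.0
assert distinct_n(outs) < 1.0
```

Tracking these two numbers along the RLVR training trajectory would make the reported "drift toward conservative behavior" directly measurable.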

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an empirical benchmark (Omni-Persona) with a Persona Modality Graph and Calibrated Accuracy metric, then reports observational diagnostics from model evaluations under SFT and RLVR. No mathematical derivations, equations, predictions, or first-principles results appear in the provided text. Claims rest on experimental outcomes rather than any reduction of outputs to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as a diagnostic framework and does not invoke uniqueness theorems or ansatzes that collapse to prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that the new benchmark tasks and Cal metric validly measure grounding; the paper introduces two new constructs without external validation.

axioms (1)
  • domain assumption Cross-modal routing over a Persona Modality Graph is an appropriate formalization for omnimodal personalization tasks.
    Invoked in the task definition section of the abstract to structure the 4 task groups and 18 tasks.
invented entities (2)
  • Persona Modality Graph no independent evidence
    purpose: To organize cross-modal routing for 4 task groups and 18 fine-grained tasks across ~750 items.
    New structure introduced to formalize the benchmark.
  • Calibrated Accuracy (Cal) no independent evidence
    purpose: To jointly reward correct grounding and appropriate abstention for absent-persona queries.
    New evaluation metric proposed to diagnose grounding behavior.

pith-pipeline@v0.9.0 · 5597 in / 1581 out tokens · 66399 ms · 2026-05-12T03:56:08.726411+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 14 internal anchors

  1. [1]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025

  2. [2]

    MiniCPM-o 4.5: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone

    OpenBMB. MiniCPM-o 4.5: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone. https://github.com/OpenBMB/MiniCPM-o, 2026

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  4. [4]

    Gemma 4: Byte for byte, the most capable open models

    Google DeepMind. Gemma 4: Byte for byte, the most capable open models. https:// deepmind.google/models/gemma/gemma-4/, April 2026

  5. [5]

    GPT-4o System Card

    OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    Yo’LLaV A: Your personalized language and vision assistant

    Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’LLaV A: Your personalized language and vision assistant. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.09400

  7. [7]

    RAP: Retrieval- augmented personalization for multimodal large language models

    Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, and Xiangyu Yue. RAP: Retrieval- augmented personalization for multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2410.13360

  8. [8]

    RePIC: Reinforced post-training for personalizing multi-modal language models.arXiv preprint arXiv:2506.18369, 2025

    Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon. RePIC: Reinforced post-training for personalizing multi-modal language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2506.18369

  9. [9]

    Contextualized Visual Personalization in Vision-Language Models

    Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, and Sungroh Yoon. Con- textualized visual personalization in vision-language models.arXiv preprint arXiv:2602.03454, 2026

  10. [10]

    Myvlm: Personalizing vlms for user-specific queries

    Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. Myvlm: Personalizing vlms for user-specific queries. InEuropean Conference on Computer Vision, pages 73–91. Springer, 2024

  11. [11]

    According to me: Long-term personalized referential memory qa, 2026

    Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, and Bill Byrne. Accord- ing to me: Long-term personalized referential memory qa.arXiv preprint arXiv:2603.01990, 2026. 10

  12. [12]

    MMPB: It’s time for multi-modal personalization.arXiv preprint arXiv:2509.22820, 2025

    Jaeik Kim, Woojin Kim, Woohyeon Park, and Jaeyoung Do. MMPB: It’s time for multi-modal personalization. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2509.22820

  13. [13]

    Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023

  14. [14]

    How abilities in large language models are affected by supervised fine-tuning data composition

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 177–198, 2024

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  16. [16]

    UniCTokens: Boosting personalized understanding and generation via unified concept tokens.arXiv preprint arXiv:2505.14671, 2025

    Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, and Wentao Zhang. UniCTokens: Boosting personalized understanding and generation via unified concept tokens. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.14671

  17. [17]

    Tameing long contexts in personalization: Towards training-free and state-aware mllm personalized assistant

    Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, and Fan Zhou. Tameing long contexts in personalization: Towards training-free and state-aware mllm personalized assistant. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 452–463, 2026

  18. [18]

    Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory

    Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025

  19. [19]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

  20. [20]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  21. [21]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  22. [22]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  23. [23]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Microsoft. Phi-4 Mini Technical Report: Compact yet powerful multimodal language models via mixture-of-LoRAs.arXiv preprint arXiv:2503.01743, 2025

  24. [24]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685

  25. [25]

    arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517

    Yuze Zhao, Jintao Qin, Haibo Niu, Wentao Cheng, Bingning Tang, Xingjun Yang, Hailin Zou, Yanping Li, Shen Liu, et al. SWIFT: A scalable lightweight infrastructure for fine-tuning.arXiv preprint arXiv:2408.05517, 2024

  26. [26]

    Ungar, Camillo J

    Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale.arXiv preprint arXiv:2504.14225, 2025. 11

  27. [27]

    Investigating and enhancing vision-audio capability in omnimodal large language models

    Rui Hu, Delai Qiu, Shuyu Wei, Jiaming Zhang, Yining Wang, Shengping Liu, and Jitao Sang. Investigating and enhancing vision-audio capability in omnimodal large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7452–7463, 2025

  28. [28]

    Nexus-o: An omni-perceptive and-interactive model for language, audio, and vision

    Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, et al. Nexus-o: An omni-perceptive and-interactive model for language, audio, and vision. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10787–10796, 2025

  29. [29]

    Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025

    Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025

  30. [30]

    Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering,

    Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforce- ment learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv preprint arXiv:2503.11197, 2025

  31. [31]

    Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

    Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025

  32. [32]

    Benchmarking large language mod- els in retrieval-augmented generation

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language mod- els in retrieval-augmented generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024

  33. [33]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025

  34. [34]

    Kexin Li, Pengjin Wang, and Gaowei Chen

    Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, and Juho Kim. CUPID: Evaluating personalized and contextualized alignment of LLMs from interactions. InConference on Language Modeling (COLM), 2025. arXiv:2508.01674

  35. [35]

    Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory, 2025

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, Radha Poovendran, Gregory Wornell, Lyle Ungar, Dan Roth, Sihao Chen, and Camillo Jose Taylor. PersonaMem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.0...

  36. [36]

    Gemini: Our most capable and general model

    Google DeepMind. Gemini: Our most capable and general model. https://deepmind. google/models/gemini/, 2025

  37. [37]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  38. [38]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  39. [39]

    V oxmm: Rich transcription of conversations in the wild

    Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe, and Joon Son Chung. V oxmm: Rich transcription of conversations in the wild. InICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12551–12555. IEEE, 2024

  40. [40]

    Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019

  41. [41]

    An open source emotional speech corpus for human robot interaction applications

    Jesin James, Li Tian, and Catherine Inez Watson. An open source emotional speech corpus for human robot interaction applications. InInterspeech, pages 2768–2772, 2018. 12

  42. [42]

    Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018

  43. [43]

    ' &' $ ! $#

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020. 13 Img Voice Text Query Perception VR 0 (yes), 1 (no) for distractors 1 (yes), 0 (no) for GT context (e.g., image perception) {GTAnswer...

  44. [44]

    Joint reporting of Ans, Unans, TA, and FA, together with behavioral monitoring along the training trajectory, is necessary to distinguish genuine calibration

    Composite evaluation is essential.Relying solely on the highest numerical Cal score can be misleading: small-capacity policies can drift toward universal abstention under equal-weight binary rewards. Joint reporting of Ans, Unans, TA, and FA, together with behavioral monitoring along the training trajectory, is necessary to distinguish genuine calibration...

  45. [45]

    This suggests that the effectiveness of RLVR depends on the backbone size, and reward mechanisms should be carefully adjusted to match the model’s capacity

    Model scale enhances RL stability. Larger backbones (e.g., 4.5B) handle RL updates more reliably, whereas smaller models are highly sensitive to specific reward settings. This suggests that the effectiveness of RLVR depends on the backbone size, and that reward mechanisms should be carefully adjusted to match the model's capacity.

  46. [46]

    &" $ )! %

    Textual reasoning remains a bottleneck for the Gemma4 family. Models in this family exhibit a recurring performance dip on tasks requiring complex textual reasoning (specifically T2T), indicating that post-training alone cannot fully suppress intrinsic hallucinations in these scenarios. C.2 Alignment of Evaluation Metrics: LLM-as-a-judge evaluations and tra...

  47. [47]

    Carefully examine all four contexts, attending to every available modality

  48. [48]

    • If the query includes an image, match the face or appearance to the context images

    Use the query image, audio, or text to identify which context (Person 0–3) the query refers to. • If the query includes an image, match the face or appearance to the context images. • If the query includes an audio clip, match the voice to the context audio recordings. • If the query is text-only, use the semantic description to identify the correct person.
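The routing rules above amount to a modality-priority dispatch. A minimal sketch, where `match_face`, `match_voice`, and `match_semantics` are hypothetical stand-ins (a real system would use face/voice embeddings and text retrieval rather than exact membership or word overlap):

```python
# Hypothetical similarity functions -- placeholders for embedding matchers.
def match_face(query_img, ctx_imgs):
    return 1.0 if query_img in ctx_imgs else 0.0

def match_voice(query_audio, ctx_audios):
    return 1.0 if query_audio in ctx_audios else 0.0

def match_semantics(query_text, ctx_text):
    return len(set(query_text.lower().split()) & set(ctx_text.lower().split()))

def route_query(query, contexts):
    """Pick which context person (0-3) a query refers to, by modality."""
    if query.get("image") is not None:
        scores = [match_face(query["image"], c["images"]) for c in contexts]
    elif query.get("audio") is not None:
        scores = [match_voice(query["audio"], c["audios"]) for c in contexts]
    else:  # text-only: fall back to the semantic description
        scores = [match_semantics(query["text"], c["text"]) for c in contexts]
    return max(range(len(contexts)), key=scores.__getitem__)
```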

  49. [49]

    [Response Rules]

    Answer the question using only the information from the matched context. [Response Rules]

  50. [50]

    Before answering, briefly reason through which person matches the query and what relevant information their context contains

  51. [51]

    Provide a specific and informative answer in 1–3 sentences, including all relevant details available in the matched context

  52. [52]

    Base the answer solely on the provided contexts; do not hallucinate or infer beyond what is explicitly given

  53. [53]

    Do not repeat the question

  54. [54]

    Use "I cannot determine that from the provided context." only as a last resort, when the requested information is genuinely absent from all modalities of the matched person's context

  55. [55]

    If any modality provides a partial or indirect answer, use it rather than abstaining. [Output Behavior] The response should first briefly identify the matched person and summarize the relevant evidence, and then provide the final answer grounded only in the corresponding context. Table S.17: Judge prompt used for answer correctness. Evaluation Judge Pr...

  56. [56]

    • Minor paraphrasing is acceptable

    For factual answers (non-abstain), the prediction is CORRECT if its final answer conveys the same core information as the gold answer, even if the wording differs. • Minor paraphrasing is acceptable. • The prediction is also CORRECT if the gold answer is explicitly contained in the prediction as a clear and stated fact

  57. [57]

    • If the prediction lists multiple conflicting answers across different people, it isWRONG

    The model receives multiple memory contexts and must identify the correct person before answering. • If the prediction lists multiple conflicting answers across different people, it is WRONG. • If the gold answer is only mentioned in passing while a different answer is presented as the model's conclusion, it is WRONG. • If the model hedges without committin...

  58. [58]

    I cannot determine

    For abstain gold answers ("I cannot determine..."), the prediction is CORRECT only if it also abstains. • Any concrete answer in this case is WRONG

  59. [59]

    • Do not provide any explanation

    Output exactly one word: CORRECT or WRONG. • Do not provide any explanation. [User Input Template] Input: Gold answer: {gold} Model prediction: {pred} Verdict: Table S.18: Visualization of the structured prompt used for automated persona profiling and attribute enrichment. Personal Profile Builder Prompt (JSON-only): You are a personal profile builder for...
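The abstain-handling rules above are mechanical enough to sketch as code. This is a rule-based stand-in for the LLM judge, illustrative only — real factual matching ("same core information") requires the judge model that Table S.17's prompt drives:

```python
ABSTAIN = "I cannot determine that from the provided context."

def judge_locally(gold: str, pred: str) -> str:
    """Rule-based approximation of two judge rules (illustrative sketch)."""
    gold_abstains = gold.startswith("I cannot determine")
    pred_abstains = "cannot determine" in pred.lower()
    if gold_abstains:
        # Abstain-gold: CORRECT only if the prediction also abstains.
        return "CORRECT" if pred_abstains else "WRONG"
    if pred_abstains:
        return "WRONG"  # abstained on an answerable query
    # Crude containment proxy for "gold explicitly contained in the prediction".
    return "CORRECT" if gold.lower() in pred.lower() else "WRONG"
```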

  60. [60]

    ENRICH: for every field still null after extraction, invent a plausible and realistic value consistent with the dialogue style. Think like a person writing down memory notes: assign believable personal attributes such as a hobby, a past travel experience, a job, an MBTI type, a memorable quote, an emotional state, or a physical appearance so that the prof...

  61. [61]

    Never contradict explicitly stated facts in the source text

  62. [62]

    Values for experience/preference fields should reflect realistic, personal anecdotes

    Keep each value concise (1–2 sentences). Values for experience/preference fields should reflect realistic, personal anecdotes

  63. [63]

    hobby":

    Output must be strict JSON matching the profile schema. [Input] concept_id: {concept_id} source_type: {dialogue_type} SOURCE: {source_text} [Profile Schema] { "hobby": "<string>", "major": "<string|null>", "affiliation": "<string|null>", "role": "<string>", "interest": "<string>", "topic": "<string>", "quote": "<string>", "location": "<string>", "environm...
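The EXTRACT-then-ENRICH flow behind this prompt can be sketched as follows. The field list here is taken from the visible (truncated) schema and is therefore partial; `fill` is a hypothetical generator, e.g. another LLM call, and per the prompt must never contradict explicitly stated facts:

```python
import json

# Keys visible in the truncated schema above; the full field list is longer.
PROFILE_KEYS = ["hobby", "major", "affiliation", "role",
                "interest", "topic", "quote", "location"]

def build_profile(extracted, fill):
    """EXTRACT: keep explicitly stated values.
    ENRICH: invent a plausible value for every field still null."""
    profile = {k: extracted.get(k) for k in PROFILE_KEYS}
    for k, v in profile.items():
        if v is None:
            profile[k] = fill(k)
    return json.dumps(profile)  # strict JSON output, as the prompt requires
```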