Recognition: no theorem link
Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3
The pith
A new benchmark shows open-source omnimodal models ground audio less reliably than visuals and that supervised fine-tuning falls short while reinforcement learning generalizes more consistently but can become overly cautious.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Omni-Persona benchmark demonstrates three key issues in current omnimodal personalization: open-source models exhibit a consistent audio-versus-visual grounding gap that reinforcement learning with verifiable rewards can partially close through dense supervision; answerable recall and model scale do not fully diagnose performance since strong recall can pair with absent-persona hallucinations and larger models do not always yield higher calibrated accuracy; supervised fine-tuning is constrained by the challenge of creating large-scale annotated supervision while reinforcement learning generalizes better via outcome feedback but tends toward conservative outputs and reduced generation质量.
What carries the argument
The Persona Modality Graph formalizes cross-modal routing across 4 task groups and 18 fine-grained tasks with roughly 750 items, while Calibrated Accuracy jointly rewards correct grounding and appropriate abstention on absent-persona queries.
If this is right
- Strong recall on answerable queries can still allow models to generate incorrect persona details when no matching information is available.
- Increasing model parameter count does not reliably improve calibration scores in omnimodal personalization settings.
- Reinforcement learning with verifiable rewards can reduce modality-specific grounding differences but requires careful reward engineering to prevent overly conservative generation behavior.
- Post-training strategies should prioritize outcome-level verifiable feedback over large-scale annotated ground-truth construction to handle absent-persona scenarios more robustly.
Where Pith is reading between the lines
- Personalized voice assistants that combine audio and visual inputs may need separate calibration tests for each modality to avoid over-reliance on visual cues.
- New reward functions could be developed to balance the conservatism seen in reinforcement learning with higher generation quality for personalization tasks.
- Extending the benchmark to closed-source models would clarify whether the observed grounding gaps and calibration issues are specific to open-source training data and methods.
Load-bearing premise
The Persona Modality Graph and Calibrated Accuracy metric provide a complete and unbiased framework for diagnosing omnimodal grounding without missing key failure modes or introducing evaluation artifacts.
What would settle it
Re-running the full Omni-Persona evaluation suite on a fresh set of models that use alternative post-training methods and checking whether the audio-visual gap and calibration problems remain or disappear would directly test the reported diagnostic findings.
Figures
read the original abstract
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni-Persona as the first comprehensive benchmark for omnimodal personalization across text, image, and audio. It formalizes the task as cross-modal routing over the Persona Modality Graph (4 task groups, 18 fine-grained tasks, ~750 items), proposes Calibrated Accuracy (Cal) to jointly evaluate correct grounding and appropriate abstention (including absent-persona cases), and reports three diagnostic findings from experiments: (i) open-source models exhibit a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics because strong recall can coexist with absent-persona hallucination and larger models do not always yield higher Cal; (iii) SFT is limited by the difficulty of scaling annotated ground-truth supervision while RLVR generalizes more consistently via outcome-level verifiable feedback but drifts toward conservative behavior and lower generation quality under the authors' reward design.
Significance. If the benchmark construction, Cal metric, and experimental comparisons hold under scrutiny, the work provides a valuable diagnostic framework for omnimodal personalization that surfaces previously under-examined issues such as modality-specific grounding gaps, calibration failures, and trade-offs between SFT and RLVR. The explicit inclusion of absent-persona scenarios and the framing of the benchmark as a tool for identifying pitfalls rather than proving universal superiority are strengths that could usefully guide future post-training and reward design in multimodal LLMs.
major comments (3)
- [Experiments] The experimental section does not specify the exact open-source models evaluated, the data splits used for training versus evaluation, the precise mathematical definition of the RLVR reward function, or any statistical tests (e.g., significance levels or confidence intervals) applied to the three diagnostic findings. These omissions directly affect the ability to assess the robustness of the claims that RLVR narrows the audio-visual gap and generalizes more consistently than SFT.
- [§3 (Benchmark and Metric)] The definition and computation of Calibrated Accuracy (Cal) are described at a high level in the abstract and introduction but lack an explicit formula or pseudocode showing how it combines grounding accuracy with abstention penalties across the Persona Modality Graph, particularly for absent-persona queries. Without this, it is difficult to verify that Cal is free of evaluation artifacts or that it fully supports the claim that recall and scale are incomplete diagnostics.
- [§4 (Training Comparisons)] The manuscript reports that RLVR 'drifts toward conservative behavior and lower generation quality under our reward design' yet provides no quantitative metrics (e.g., generation quality scores, diversity measures, or human preference comparisons) or ablation studies isolating the reward components responsible for this drift. This weakens the third diagnostic finding.
minor comments (2)
- [Abstract and §4] The abstract states 'dense rule-based supervision' for RLVR without defining the rules or providing examples; the main text should include a concrete description or table of the rule set.
- [Figures and Tables] Figure captions and table headers should explicitly state the number of runs or seeds used for each reported Cal value to allow readers to gauge variability.
Simulated Author's Rebuttal
We thank the referee for their insightful comments. We address each major point below, agreeing to revisions where details were insufficiently specified in the original manuscript.
read point-by-point responses
-
Referee: [Experiments] The experimental section does not specify the exact open-source models evaluated, the data splits used for training versus evaluation, the precise mathematical definition of the RLVR reward function, or any statistical tests (e.g., significance levels or confidence intervals) applied to the three diagnostic findings. These omissions directly affect the ability to assess the robustness of the claims that RLVR narrows the audio-visual gap and generalizes more consistently than SFT.
Authors: We appreciate the referee's call for greater experimental transparency. The original manuscript provided high-level descriptions but indeed omitted some specifics. In the revised version, we will explicitly enumerate the evaluated open-source models (including their exact checkpoints and sizes), detail the data splits (with the benchmark serving as a held-out evaluation set separate from any training data used for SFT/RLVR), provide the precise mathematical definition of the RLVR reward function (which combines outcome-based rewards for correct cross-modal routing and abstention, with penalties for absent-persona hallucinations), and include statistical tests such as bootstrapped confidence intervals for the reported diagnostic differences. These changes will be incorporated into Section 4 and the supplementary material to allow full assessment of the claims' robustness. revision: yes
-
Referee: [§3 (Benchmark and Metric)] The definition and computation of Calibrated Accuracy (Cal) are described at a high level in the abstract and introduction but lack an explicit formula or pseudocode showing how it combines grounding accuracy with abstention penalties across the Persona Modality Graph, particularly for absent-persona queries. Without this, it is difficult to verify that Cal is free of evaluation artifacts or that it fully supports the claim that recall and scale are incomplete diagnostics.
Authors: We agree that providing an explicit formula and pseudocode is necessary for full verifiability. The Cal metric is intended to balance grounding accuracy on present-persona queries with appropriate abstention on absent-persona ones, using a penalty term calibrated across the Persona Modality Graph. In the revision, we will insert the formal definition Cal = (1/|Q|) sum_{q in Q} [Acc(q) - lambda * Halluc(q) + mu * Abstain(q)] where terms are defined for each task group, along with pseudocode for its computation. This will demonstrate its freedom from artifacts and support the incompleteness of recall/scale as diagnostics. An illustrative example will also be added. revision: yes
-
Referee: [§4 (Training Comparisons)] The manuscript reports that RLVR 'drifts toward conservative behavior and lower generation quality under our reward design' yet provides no quantitative metrics (e.g., generation quality scores, diversity measures, or human preference comparisons) or ablation studies isolating the reward components responsible for this drift. This weakens the third diagnostic finding.
Authors: This observation is valid; the drift was noted from qualitative inspection but lacked supporting quantitative evidence. We will revise Section 4 to include quantitative metrics: average output length and entropy as proxies for conservativeness, n-gram diversity scores, and results from a human evaluation on a 50-item subset comparing RLVR vs. SFT on quality and helpfulness. We will also add an ablation varying key reward terms (e.g., the abstention incentive weight) to identify the source of the drift. These additions will strengthen the third diagnostic finding without altering the core conclusions. revision: yes
Circularity Check
No significant circularity
full rationale
The paper introduces an empirical benchmark (Omni-Persona) with a Persona Modality Graph and Calibrated Accuracy metric, then reports observational diagnostics from model evaluations under SFT and RLVR. No mathematical derivations, equations, predictions, or first-principles results appear in the provided text. Claims rest on experimental outcomes rather than any reduction of outputs to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as a diagnostic framework and does not invoke uniqueness theorems or ansatzes that collapse to prior inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Cross-modal routing over a Persona Modality Graph is an appropriate formalization for omnimodal personalization tasks.
invented entities (2)
-
Persona Modality Graph
no independent evidence
-
Calibrated Accuracy (Cal)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report.arXiv preprint arXiv:2503.20215, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
OpenBMB. MiniCPM-o 4.5: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone. https://github.com/OpenBMB/MiniCPM-o, 2026
work page 2026
-
[3]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Gemma 4: Byte for byte, the most capable open models
Google DeepMind. Gemma 4: Byte for byte, the most capable open models. https:// deepmind.google/models/gemma/gemma-4/, April 2026
work page 2026
-
[5]
OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, et al. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Yo’LLaV A: Your personalized language and vision assistant
Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’LLaV A: Your personalized language and vision assistant. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.09400
-
[7]
RAP: Retrieval- augmented personalization for multimodal large language models
Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, and Xiangyu Yue. RAP: Retrieval- augmented personalization for multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2410.13360
-
[8]
Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon. RePIC: Reinforced post-training for personalizing multi-modal language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2506.18369
-
[9]
Contextualized Visual Personalization in Vision-Language Models
Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, and Sungroh Yoon. Con- textualized visual personalization in vision-language models.arXiv preprint arXiv:2602.03454, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Myvlm: Personalizing vlms for user-specific queries
Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. Myvlm: Personalizing vlms for user-specific queries. InEuropean Conference on Computer Vision, pages 73–91. Springer, 2024
work page 2024
-
[11]
According to me: Long-term personalized referential memory qa, 2026
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, and Bill Byrne. Accord- ing to me: Long-term personalized referential memory qa.arXiv preprint arXiv:2603.01990, 2026. 10
-
[12]
MMPB: It’s time for multi-modal personalization.arXiv preprint arXiv:2509.22820, 2025
Jaeik Kim, Woojin Kim, Woohyeon Park, and Jaeyoung Do. MMPB: It’s time for multi-modal personalization. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2509.22820
-
[13]
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023
work page 2023
-
[14]
How abilities in large language models are affected by supervised fine-tuning data composition
Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 177–198, 2024
work page 2024
-
[15]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, and Wentao Zhang. UniCTokens: Boosting personalized understanding and generation via unified concept tokens. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2505.14671
-
[17]
Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, and Fan Zhou. Tameing long contexts in personalization: Towards training-free and state-aware mllm personalized assistant. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 452–463, 2026
work page 2026
-
[18]
Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory
Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025
-
[19]
A-MEM: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026
work page internal anchor Pith review arXiv 2026
-
[22]
Jin Xu, Zhifang Guo, Jinzheng He, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Microsoft. Phi-4 Mini Technical Report: Compact yet powerful multimodal language models via mixture-of-LoRAs.arXiv preprint arXiv:2503.01743, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. arXiv:2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
arXiv preprint arXiv:2408.05517 (2024),https://arxiv.org/abs/ 2408.05517
Yuze Zhao, Jintao Qin, Haibo Niu, Wentao Cheng, Bingning Tang, Xingjun Yang, Hailin Zou, Yanping Li, Shen Liu, et al. SWIFT: A scalable lightweight infrastructure for fine-tuning.arXiv preprint arXiv:2408.05517, 2024
-
[26]
Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J Taylor, and Dan Roth. Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale.arXiv preprint arXiv:2504.14225, 2025. 11
-
[27]
Investigating and enhancing vision-audio capability in omnimodal large language models
Rui Hu, Delai Qiu, Shuyu Wei, Jiaming Zhang, Yining Wang, Shengping Liu, and Jitao Sang. Investigating and enhancing vision-audio capability in omnimodal large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7452–7463, 2025
work page 2025
-
[28]
Nexus-o: An omni-perceptive and-interactive model for language, audio, and vision
Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, et al. Nexus-o: An omni-perceptive and-interactive model for language, audio, and vision. InProceedings of the 33rd ACM International Conference on Multimedia, pages 10787–10796, 2025
work page 2025
-
[29]
Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025
Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-r1: Do you really need audio to fine-tune your audio llm?arXiv preprint arXiv:2505.09439, 2025
-
[30]
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering,
Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforce- ment learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv preprint arXiv:2503.11197, 2025
-
[31]
Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, and Jingren Zhou. Humanomniv2: From understanding to omni-modal reasoning with context.arXiv preprint arXiv:2506.21277, 2025
-
[32]
Benchmarking large language mod- els in retrieval-augmented generation
Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language mod- els in retrieval-augmented generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024
work page 2024
-
[33]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161, 2025
work page internal anchor Pith review arXiv 2025
-
[34]
Kexin Li, Pengjin Wang, and Gaowei Chen
Tae Soo Kim, Yoonjoo Lee, Yoonah Park, Jiho Kim, Young-Ho Kim, and Juho Kim. CUPID: Evaluating personalized and contextualized alignment of LLMs from interactions. InConference on Language Modeling (COLM), 2025. arXiv:2508.01674
-
[35]
Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, Radha Poovendran, Gregory Wornell, Lyle Ungar, Dan Roth, Sihao Chen, and Camillo Jose Taylor. PersonaMem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.0...
-
[36]
Gemini: Our most capable and general model
Google DeepMind. Gemini: Our most capable and general model. https://deepmind. google/models/gemini/, 2025
work page 2025
-
[37]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
V oxmm: Rich transcription of conversations in the wild
Doyeop Kwak, Jaemin Jung, Kihyun Nam, Youngjoon Jang, Jee-Weon Jung, Shinji Watanabe, and Joon Son Chung. V oxmm: Rich transcription of conversations in the wild. InICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12551–12555. IEEE, 2024
work page 2024
-
[40]
Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversa- tions. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 527–536, 2019
work page 2019
-
[41]
An open source emotional speech corpus for human robot interaction applications
Jesin James, Li Tian, and Catherine Inez Watson. An open source emotional speech corpus for human robot interaction applications. InInterspeech, pages 2768–2772, 2018. 12
work page 2018
-
[42]
Steven R Livingstone and Frank A Russo. The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english.PloS one, 13(5):e0196391, 2018
work page 2018
-
[43]
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A frame- work for self-supervised learning of speech representations.Advances in neural information processing systems, 33:12449–12460, 2020. 13 Img Voice Text Query Perception VR 0 (yes), 1 (no) for distractors 1 (yes), 0 (no) for GT context (e.g., image perception) {GTAnswer...
work page 2020
-
[44]
Composite evaluation is essential.Relying solely on the highest numerical Cal score can be misleading: small-capacity policies can drift toward universal abstention under equal-weight binary rewards. Joint reporting of Ans, Unans, TA, and FA, together with behavioral monitoring along the training trajectory, is necessary to distinguish genuine calibration...
-
[45]
Model scale enhances RL stability.Larger backbones (e.g., 4.5B) handle RL updates more reliably, whereas smaller models are highly sensitive to specific reward settings. This suggests that the effectiveness of RLVR depends on the backbone size, and reward mechanisms should be carefully adjusted to match the model’s capacity
-
[46]
Textual reasoning remains a bottleneck for the Gemma4 family.Models in this family exhibit a recurring performance dip on tasks requiring complex textual reasoning (specifically T2T), indicating that post-training alone cannot fully suppress intrinsic hallucinations in these scenarios. C.2 Alignment of Evaluation Metrics LLM-as-a-judge evaluations and tra...
-
[47]
Carefully examine all four contexts, attending to every available modality
-
[48]
• If the query includes an image, match the face or appearance to the context images
Use the query image, audio, or text to identify which context (Person 0–3) the query refers to. • If the query includes an image, match the face or appearance to the context images. • If the query includes an audio clip, match the voice to the context audio recordings. • If the query is text-only, use the semantic description to identify the correct person
-
[49]
Answer the question using only the information from the matched context. [Response Rules]
-
[50]
Before answering, briefly reason through which person matches the query and what relevant information their context contains
-
[51]
Provide a specific and informative answer in1–3 sentences, including all relevant details available in the matched context
-
[52]
Base the answer solely on the provided contexts; do not hallucinate or infer beyond what is explicitly given
-
[53]
Do not repeat the question
-
[54]
UseI cannot determine that from the provided context.only as a last resort, when the requested information is genuinely absent from all modalities of the matched person’s context
-
[55]
If any modality provides a partial or indirect answer, use it rather than abstaining. [Output Behavior] The response should first briefly identify the matched person and summarize the relevant evidence, and then provide the final answer grounded only in the corresponding context. 31 Table S.17: Judge prompt used for answer correctness. Evaluation Judge Pr...
-
[56]
• Minor paraphrasing is acceptable
For factual answers (non-abstain), the prediction isCORRECT if its final answer conveys the same core information as the gold answer, even if the wording differs. • Minor paraphrasing is acceptable. • The prediction is alsoCORRECT if the gold answer is explicitly contained in the prediction as a clear and stated fact
-
[57]
• If the prediction lists multiple conflicting answers across different people, it isWRONG
The model receives multiple memory contexts and must identify the correct person before answering. • If the prediction lists multiple conflicting answers across different people, it isWRONG. • If the gold answer is only mentioned in passing while a different answer is presented as the model’s conclusion, it isWRONG. • If the model hedges without committin...
-
[58]
For abstain gold answers (“I cannot determine...”), the prediction is CORRECT only if it also abstains. • Any concrete answer in this case isWRONG
-
[59]
• Do not provide any explanation
Output exactly one word:CORRECTorWRONG. • Do not provide any explanation. [User Input Template] Input: Gold answer: {gold} Model prediction: {pred} Verdict: 32 Table S.18: Visualization of the structured prompt used for automated persona profiling and attribute enrichment. Personal Profile Builder Prompt (JSON-only): You are a personal profile builder for...
-
[60]
ENRICH: for every field still null after extraction, invent a plausible and realistic value consistent with the dialogue style. Think like a person writing down memory notes: assign believable personal attributes such as a hobby, a past travel experience, a job, an MBTI type, a memorable quote, an emotional state, or a physical appearance so that the prof...
-
[61]
Never contradict explicitly stated facts in the source text
-
[62]
Values for experience/preference fields should reflect realistic, personal anecdotes
Keep each value concise (1–2 sentences). Values for experience/preference fields should reflect realistic, personal anecdotes
-
[63]
Output must be strict JSON matching the profile schema. [Input] concept_id: {concept_id} source_type: {dialogue_type} SOURCE: {source_text} [Profile Schema] { "hobby": "<string>", "major": "<string|null>", "affiliation": "<string|null>", "role": "<string>", "interest": "<string>", "topic": "<string>", "quote": "<string>", "location": "<string>", "environm...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.