Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
36 Pith papers cite this work.
abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality-extension approach leverages LoRA adapters and modality-specific routers to enable multiple inference modes combining various modalities without interference. For example, it currently ranks first on the OpenASR leaderboard, even though the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment with further training Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
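The modality-extension approach in the abstract (a frozen base model plus per-modality low-rank LoRA adapters chosen by a modality router) can be sketched in a few lines. The NumPy toy below is illustrative only, not the Phi-4-Multimodal implementation: the `adapters` dict, the `forward` helper, and the tiny dimensions are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 4  # toy sizes; the real model's dimensions are far larger

# Frozen base projection shared by all modalities (the text-only path).
W_base = rng.standard_normal((d_model, d_model))

# One low-rank (A, B) LoRA pair per modality. B starts at zero, so each
# adapter initially leaves the base model's behavior unchanged.
adapters = {
    m: {"A": rng.standard_normal((rank, d_model)) * 0.01,
        "B": np.zeros((d_model, rank))}
    for m in ("vision", "speech")
}

def forward(x, modality=None, scale=1.0):
    """Base projection, plus the routed modality adapter's low-rank update."""
    y = x @ W_base.T
    if modality is not None:
        a = adapters[modality]
        y = y + scale * (x @ a["A"].T) @ a["B"].T  # rank-r update B @ A
    return y

x = rng.standard_normal((2, d_model))
# With every B still zero, each routed path matches the base path exactly.
assert np.allclose(forward(x), forward(x, "vision"))
assert np.allclose(forward(x), forward(x, "speech"))
```

Because each modality's update lives entirely in its own (A, B) pair, training one adapter cannot perturb the frozen base weights or another modality's path, which is the non-interference property the abstract highlights.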
citing papers explorer
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
  DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
- Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
  Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
- Trust Me, Import This: Dependency Steering Attacks via Malicious Agent Skills
  Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
- How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
  DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
- RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
  RobotEQ is the first benchmark for active intelligence in embodied AI, demonstrating that current models underperform on social norm adherence and spatial grounding tasks.
- Multimodal Data Curation Through Ranked Retrieval
  Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
- SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
  SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
- AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR
  A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
  Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
- MUSCAT: MUltilingual, SCientific ConversATion Benchmark
  MUSCAT is a benchmark of bilingual scientific conversations designed to evaluate ASR systems on code-switching and domain-specific challenges.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
  Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
- GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
  GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
- Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
  Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generation tasks.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
- HumanNet: Scaling Human-centric Video Learning to One Million Hours
  HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
- All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
  Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
  HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
  COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
  Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- GroupDPO: Memory efficient Group-wise Direct Preference Optimization
  GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
- Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
  Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.
- CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
  LLMs reach 52.6% average success on text-based rodent neuroscience tasks, above random agents at 32.1% but below approximate rodent baselines at 78.9%.
- Differences in Text Generated by Diffusion and Autoregressive Language Models
  DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
- Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
  An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
- Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
  Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
- AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
  AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
- UniMesh: Unifying 3D Mesh Understanding and Generation
  UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
- Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
  Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.
- Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
  CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.
- CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
  CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafetyBench, and MedHallu.
- Low-Rank Adaptation Redux for Large Models
  An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.