SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
hub Mixed citations
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Mixed citation behavior. Most common role is background (56%).
abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of
co-cited works
representative citing papers
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.
PreferenceASR is a preference-aware ASR test set built from seven corpora that shows model rankings change when user output-style instructions are considered.
AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.
Fine-tuning Whisper on Swiss German speech with subtitle supervision yields an honest 25.6% WER baseline (13.8% cWER) and demonstrates that prior SOTA claims of 17% WER result from benchmark contamination allowing 13.88% WER with no dialect training.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
OmniPro is the first benchmark jointly evaluating omni-modal perception, proactive responding, and diverse streaming video understanding tasks using a dual-mode protocol on 2700 samples.
PAREDA is a new multi-accent speech dataset of spontaneous NLP paper discussions that shows state-of-the-art ASR models struggle in zero-shot settings but improve after fine-tuning on it.
CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.
MetaBackdoor shows that LLMs can be backdoored using positional triggers like sequence length, enabling stealthy activation on clean inputs to leak system prompts or trigger malicious behavior.
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.
Malicious Skills induce coding agents to hallucinate and import attacker-controlled packages at high rates while evading detection.
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
RobotEQ is a new benchmark dataset and evaluation suite showing that current embodied AI models fall short on active social-norm compliance, especially spatial grounding, though RAG with external knowledge helps.
Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
Introduces MUSCAT benchmark dataset of bilingual scientific discussions to evaluate multilingual ASR performance on code-switching and mixed inputs beyond standard WER.
citing papers explorer
-
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
-
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
-
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
-
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
-
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
-
Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
Common-word acoustic cues and bias-word position prediction in speech LLMs cut rare-word transcription errors by 16.3% versus baselines, including out-of-domain cases.
-
CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
CheeseBench is a benchmark where LLMs act as zero-shot agents in text-rendered versions of classical rodent experiments, with the best model reaching 52.6% success compared to 32.1% random and 78.9% approximate rodent baselines.
-
Differences in Text Generated by Diffusion and Autoregressive Language Models
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
-
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
-
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
-
PAL*M: Property Attestation for Large Generative Models
PAL*M is a property attestation framework for large generative models that combines confidential virtual machines, security-aware GPUs, and incremental multiset hashing to achieve low-overhead integrity tracking with formal security guarantees.
-
Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.
-
From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring
A graph-structured framework fuses 3D perception with rule-based, LLM, and memory reasoning to raise hazard coverage from 57% to 93% across 115 simulated underground mine scenarios.
-
SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors
SALSA adapts speech-aware LLMs via supervised layer-wise steering vectors, reporting up to 46.8% relative gains over zero-shot on out-of-domain speech benchmarks.
-
The Future of Facts: Tracing the Factual Generation-Verification Gap
Empirical tracing across model families shows verification precedes and outlasts generation for facts, with updates producing simultaneous verification of old and new answers.
-
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models
LamPO introduces a pairwise decomposed advantage with confidence-aware weighting to replace scalar group advantages in group-relative policy optimization for reasoning models.
-
Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
Fine-tuned speech representation models with hierarchical classification outperform multimodal LLMs on pediatric speech sound disorder tasks.
-
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
-
UniMesh: Unifying 3D Mesh Understanding and Generation
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
-
Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
Omnimodal models show reduced demographic bias in image and video tasks compared to substantial biases and lower performance in audio tasks.
-
Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
CG-CLIP adds caption-guided memory refinement and token-based spatiotemporal aggregation to CLIP for video person ReID, outperforming SOTA on MARS, iLIDS-VID, SportsVReID and DanceVReID.
-
CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafetyBench, and MedHallu.
-
Different types of syntactic agreement recruit the same units within large language models
Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.
-
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
-
A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems
A novel alignment algorithm using dynamic programming and beam search provides more accurate matching of individual errors between reference and model transcripts for improved speech recognition evaluation.
-
ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation
ChatENV fine-tunes Qwen-2.5-VL on a 177k-image dataset of temporal satellite pairs with sensor metadata to support interactive temporal and what-if reasoning for environmental monitoring.
-
Differentially Private Datastore Generation for Retrieval-Augmented Inference
Hashing-based framework adds DP noise to LSH bucket votes to release private probability distributions for datastores with 2.6% average accuracy loss at epsilon=5.
-
Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction
Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.
-
Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs
DPO on three Audio LLMs using 100K preference pairs yields up to 89.6% in-distribution and 20.0% out-of-distribution MER reduction for code-switching transcription.
-
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
-
LLMs and Speech: Integration vs. Combination
Tight integration of acoustic models with LLMs for ASR is ablated against shallow fusion across label units, fine-tuning strategies, LLM sizes, and joint CTC decoding to mitigate hallucinations.
-
Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies
Systematic comparison of nine text-only and three multimodal LLMs using in-context learning, reasoning prompts, fine-tuning, and multimodal fusion on DementiaBank speech data finds class-centroid demonstrations and token-level fine-tuning most effective, with adapted open models matching or beating
-
Low-Rank Adaptation Redux for Large Models
An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.
-
Multilingual Vision-Language Models, A Survey
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
-
Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
Verbal multiword expressions reduce machine translation quality, with the degradation attributable to the expressions themselves rather than general sentence difficulty.
-
On The Landscape of Spoken Language Models: A Comprehensive Survey
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
- DataComp-VLM: Improved Open Datasets for Vision-Language Models
- The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models