Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
super hub Mixed citations
Qwen2 Technical Report
Mixed citation behavior. Most common role is background (57%).
abstract
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The
authors
co-cited works
representative citing papers
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
Fine-tuning LLMs on an unseen language teaches syntax but fails to transfer semantic competence, leaving Python with up to a 19% performance advantage and no tested intervention closing the gap.
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and stronger out-of-domain robustness than position-based methods across 14 benchmarks.
citing papers explorer
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
-
AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
AgentSocialBench demonstrates that privacy preservation is fundamentally harder in human-centered agentic social networks than in single-agent cases due to cross-domain coordination pressures and an abstraction paradox where privacy instructions increase discussion of sensitive information.
-
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
Adaptive Order Policies for Masked Diffusion
A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.
-
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
-
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
-
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
-
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
-
Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
-
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
-
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language
Fine-tuning LLMs on an unseen language teaches syntax but fails to transfer semantic competence, leaving Python with up to a 19% performance advantage and no tested intervention closing the gap.
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
-
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
-
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
-
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
-
Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and stronger out-of-domain robustness than position-based methods across 14 benchmarks.
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA
GLoRA replaces raw factor averaging with gauge-aware aggregation in a consensus subspace estimated from client projectors, enabling consistent low-rank federated LoRA under heterogeneity.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
A single fused int4 KV cache kernel on Apple Silicon outperforms fp16 in latency with 3x memory compression and near-zero quality loss on tested models.
-
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
-
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.
-
PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
PuzzleMark provides a robust and imperceptible watermarking method for code datasets using adaptive variable name concatenation and statistical verification, achieving perfect detection rates with minimal performance impact.
-
R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.
-
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
A two-agent adversarial rewriting framework achieves 20-40% evasion rates against LLM-based misinformation detectors under strict black-box constraints with binary feedback only, far outperforming prior methods and linking success to specific architectural properties.
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
In LLM feed-forward networks, the top 1% of channels per layer carry a median 58.7% of loss sensitivity, forming supernodes whose protection enables effective 50% sparsity pruning with much lower perplexity than baselines.
-
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
-
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
ProjRes achieves near-100% accuracy in membership inference on FedLLMs by measuring projection residuals of hidden embeddings on gradient subspaces, outperforming prior methods by up to 75.75% even under differential privacy.
-
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
-
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
Introduces the Indic-CodecFake dataset for Indic codec deepfakes and SATYAM, a novel hyperbolic ALM that outperforms baselines through dual-stage semantic-prosodic fusion using Bhattacharya distance.
-
Super Apriel: One Checkpoint, Many Speeds
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
-
SimDiff: Depth Pruning via Similarity and Difference
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
-
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.
-
Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
-
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.