super hub Canonical reference

Mixtral of Experts

Albert Q · 2024 · cs.LG · arXiv 2401.04088

Canonical reference. 80% of citing Pith papers cite this work as background.

245 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 245 citing papers more from Albert Q arXiv PDF

abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 baseline 4 method 3 dataset 2 other 2

citation-polarity summary

background 49 baseline 4 unclear 3 use method 3 use dataset 2

claims ledger

abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tok

authors

Albert Q

co-cited works

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Fine-grained Claim-level RAG Benchmark for Law

cs.CL · 2026-05-20 · unverdicted · novelty 7.0 · 6 refs

ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Evaluation artifacts substantially inflate the measured unsolvability ceiling in multi-LLM routing, leading to distorted router training and overstated headroom.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

cs.DC · 2026-05-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

cs.LG · 2026-05-01 · conditional · novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

Machine Collective Intelligence for Explainable Scientific Discovery

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

Machine collective intelligence uses coordinated AI agents to evolve symbolic hypotheses and recover governing equations from observations in deterministic, stochastic, and uncharacterized systems, achieving up to six orders of magnitude better extrapolation than neural networks with 5-40 parameters

citing papers explorer

Showing 37 of 37 citing papers after filters.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery cs.AI · 2024-08-12 · unverdicted · none · ref 45 · internal anchor
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments cs.HC · 2024-05-13 · conditional · none · ref 9 · internal anchor
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 19 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 19 · internal anchor
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 19 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents cs.CR · 2024-10-03 · unverdicted · none · ref 107 · internal anchor
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
LLM Agents can Autonomously Exploit One-day Vulnerabilities cs.CR · 2024-04-11 · unverdicted · none · ref 8 · internal anchor
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
Jamba: A Hybrid Transformer-Mamba Language Model cs.CL · 2024-03-28 · conditional · none · ref 24 · internal anchor
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? cs.CV · 2024-03-21 · conditional · none · ref 28 · internal anchor
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 128 · internal anchor
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
OLMo: Accelerating the Science of Language Models cs.CL · 2024-02-01 · accept · none · ref 5 · internal anchor
OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries cs.CL · 2024-01-27 · accept · none · ref 12 · internal anchor
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning cs.LG · 2024-11-26 · unverdicted · none · ref 20 · internal anchor
CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.
Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection cs.LG · 2024-11-13 · unverdicted · none · ref 9 · internal anchor
Lynx exploits training-induced batch-level expert activation skews via AffinityBinning to reduce invoked experts per batch, delivering up to 1.30x throughput with under 1% accuracy loss across four model families.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 16 · internal anchor
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models cs.CL · 2024-11-02 · unverdicted · none · ref 13 · internal anchor
DIP interleaves English word translations into non-English prompts to boost multilingual reasoning on synthetic benchmarks spanning 10-200 languages.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 12 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation cs.CV · 2024-09-06 · unverdicted · none · ref 8 · internal anchor
VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 28 · internal anchor
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
RouteLLM: Learning to Route LLMs with Preference Data cs.LG · 2024-06-26 · unverdicted · none · ref 18 · internal anchor
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
LaMSUM: Amplifying Voices Against Harassment through LLM Guided Extractive Summarization of User Incident Reports cs.CL · 2024-06-22 · unverdicted · none · ref 21 · internal anchor
LaMSUM is a novel multi-level LLM framework with voting methods for extractive summarization of large incident report collections that outperforms prior extractive methods.
Mixture-of-Agents Enhances Large Language Model Capabilities cs.CL · 2024-06-07 · unverdicted · none · ref 12 · 2 links · internal anchor
A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 21 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 14 · internal anchor
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.
Argumentative Large Language Models for Explainable and Contestable Claim Verification cs.CL · 2024-05-03 · unverdicted · none · ref 27 · internal anchor
ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 19 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models cs.CR · 2024-03-28 · accept · none · ref 22 · internal anchor
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cs.LG · 2024-01-26 · unverdicted · none · ref 21 · internal anchor
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
Test-Time Alignment via Hypothesis Reweighting cs.LG · 2024-12-11 · unverdicted · none · ref 30 · internal anchor
HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware cs.DC · 2024-09-02 · unverdicted · none · ref 15 · internal anchor
HexiScale enables LLM training on heterogeneous GPUs via asymmetric parallelism and graph partitioning, matching homogeneous performance at equal FLOPS and delivering 1.5-2.4x higher throughput than prior heterogeneous systems.
The Platonic Representation Hypothesis cs.LG · 2024-05-13 · unverdicted · none · ref 258 · internal anchor
Representations learned by large AI models are converging toward a shared statistical model of reality.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 29 · internal anchor
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model cs.CV · 2024-01-29 · unverdicted · none · ref 38 · internal anchor
InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities cs.LG · 2024-08-14 · accept · none · ref 99 · internal anchor
The paper introduces a new taxonomy for model merging methods and reviews their applications in LLMs, MLLMs, continual learning, multi-task learning, and other subfields while outlining open challenges.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs cs.CV · 2024-06-11 · unverdicted · none · ref 21 · internal anchor
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 12 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Entry-level guide to the use of large language models for medical research cs.AI · 2024-10-24 · unverdicted · none · ref 9 · internal anchor
A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.

Mixtral of Experts

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer