GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dehao Chen, Dmitry Lepikhin, HyoukJoong Lee, Orhan Firat, Yanping Huang, Yuanzhong Xu · 2020 · cs.CL · arXiv 2006.16668

Canonical reference: 107 Pith papers cite this work, and 78% of classified citations use it as background.

abstract

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

citation-role summary

background 21 · method 4 · dataset 2

citation-polarity summary

background 21 · use method 4 · use dataset 2

representative citing papers

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

cs.DC · 2026-05-20 · unverdicted · novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.

AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.

Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

cs.DC · 2026-05-05 · unverdicted · novelty 7.0

Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

cs.DC · 2026-04-21 · unverdicted · novelty 7.0

FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Depth Adaptive Efficient Visual Autoregressive Modeling

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.

Path-Constrained Mixture-of-Experts

cs.LG · 2026-03-18 · unverdicted · novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning

cs.LG · 2026-02-13 · unverdicted · novelty 7.0

Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.

The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

cs.LG · 2024-08-28 · conditional · novelty 7.0

Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

cs.AR · 2026-05-22 · unverdicted · novelty 6.0

NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.

Exploiting Multicast for Accelerating Collective Communication

cs.DC · 2026-05-21 · unverdicted · novelty 6.0

MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.

citing papers explorer

Showing 50 of 107 citing papers.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models cs.AR · 2026-05-11 · conditional · none · ref 38 · internal anchor
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling cs.CL · 2020-12-31 · conditional · none · ref 94 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation cs.DC · 2026-05-20 · unverdicted · none · ref 32 · internal anchor
Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error versus prior simulators.
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts cs.LG · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference cs.LG · 2026-05-08 · conditional · none · ref 14 · internal anchor
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation cs.CV · 2026-05-06 · unverdicted · none · ref 22 · internal anchor
BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures cs.LG · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform dense on ViT under cost-matched aggressive approximation.
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs cs.DC · 2026-05-05 · unverdicted · none · ref 24 · internal anchor
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving cs.LG · 2026-05-03 · unverdicted · none · ref 37 · 2 links · internal anchor
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning cs.LG · 2026-04-24 · unverdicted · none · ref 14 · internal anchor
A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training cs.DC · 2026-04-21 · unverdicted · none · ref 5 · internal anchor
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts cs.LG · 2026-04-21 · unverdicted · none · ref 28 · 2 links · internal anchor
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Depth Adaptive Efficient Visual Autoregressive Modeling cs.CV · 2026-04-19 · unverdicted · none · ref 34 · internal anchor
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis cs.LG · 2026-04-07 · unverdicted · none · ref 41 · internal anchor
A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.
Path-Constrained Mixture-of-Experts cs.LG · 2026-03-18 · unverdicted · none · ref 9 · internal anchor
PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning cs.LG · 2026-02-13 · unverdicted · none · ref 25 · internal anchor
Split-MoPE integrates split learning with predefined-expert routing to maximize usable data in vertical federated learning under sample misalignment, delivering state-of-the-art accuracy in one communication round plus built-in robustness and per-sample contribution scores.
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models cs.SD · 2026-01-06 · unverdicted · none · ref 12 · internal anchor
TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts cs.LG · 2024-08-28 · conditional · none · ref 4 · internal anchor
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 78 · internal anchor
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity cs.LG · 2021-01-11 · accept · none · ref 20 · internal anchor
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models cs.LG · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference cs.AR · 2026-05-22 · unverdicted · none · ref 12 · internal anchor
NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.
Exploiting Multicast for Accelerating Collective Communication cs.DC · 2026-05-21 · unverdicted · none · ref 18 · internal anchor
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models cs.AI · 2026-05-20 · unverdicted · none · ref 18 · internal anchor
PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs cs.LG · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
FedCoE proposes a coordinated dual-level MoE framework for federated learning that improves global and personalized accuracy while enabling strong cold-start performance for new clients.
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 23 · internal anchor
HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems cs.DC · 2026-05-19 · unverdicted · none · ref 12 · internal anchor
GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI · 2026-05-19 · unverdicted · none · ref 17 · internal anchor
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates cs.LG · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE cs.AI · 2026-05-14 · conditional · none · ref 15 · internal anchor
BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
Combining pre-trained models via localized model averaging stat.ME · 2026-05-13 · unverdicted · none · ref 178 · internal anchor
Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference cs.LG · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism cs.LG · 2026-05-10 · unverdicted · none · ref 17 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 63 · internal anchor
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression cs.LG · 2026-05-09 · unverdicted · none · ref 62 · internal anchor
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Hierarchical Mixture-of-Experts with Two-Stage Optimization cs.LG · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems cs.AR · 2026-05-07 · unverdicted · none · ref 33 · internal anchor
MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs cs.AR · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
DySHARP accelerates MoE expert parallelism via dynamic multimem addressing and token-centric kernel fusion to cut redundant traffic and deliver up to 1.79x speedup over prior in-switch solutions.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 13 · 2 links · internal anchor
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 51 · internal anchor
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 9 · internal anchor
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs cs.CV · 2026-04-27 · unverdicted · none · ref 27 · internal anchor
SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling cs.CL · 2026-04-23 · unverdicted · none · ref 13 · internal anchor
X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scales using 50% smaller tables.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 30 · internal anchor
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts cs.LG · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
WiFo-MiSAC: A Wireless Foundation Model for Multimodal Sensing and Communication Integration via Synesthesia of Machines (SoM) eess.SP · 2026-04-20 · unverdicted · none · ref 29 · internal anchor
WiFo-MiSAC is a task-agnostic foundation model that unifies multimodal wireless signals via tokenization and self-supervised learning with SS-DMoE to achieve strong few-shot performance on beam prediction and channel estimation.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
