super hub Canonical reference

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Andy Davis, Azalia Mirhoseini, Geoffrey Hinton, Krzysztof Maziarz, Noam Shazeer, Quoc Le · 2017 · cs.LG · arXiv 1701.06538

Canonical reference. 75% of citing Pith papers cite this work as background.

220 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 220 citing papers more from Andy Davis arXiv PDF

abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 39 method 10 baseline 2 dataset 1

citation-polarity summary

background 39 use method 9 baseline 2 support 1 use dataset 1

claims ledger

abstract The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational effic

authors

Andy Davis Azalia Mirhoseini Geoffrey Hinton Krzysztof Maziarz Noam Shazeer Quoc Le

co-cited works

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transformation and configuration.

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

MuteBench evaluates multimodal fusion robustness to modality missing and within-modality missing on 125000 samples from 9 clinical datasets, finding architecture family predicts tolerance better than parameter count.

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

cs.LG · 2026-05-13 · conditional · novelty 7.0

PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

SDG-MoE: Signed Debate Graph Mixture-of-Experts

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SDG-MoE introduces learned signed interaction graphs and disagreement-gated deliberation among experts in MoE architectures, yielding 19.8% better validation perplexity than the strongest baseline.

Approximation-Free Differentiable Oblique Decision Trees

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

DTSemNet gives an exact, invertible neural-network encoding of hard oblique decision trees that supports direct gradient training for both classification and regression without probabilistic softening or quantized estimators.

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

cs.LG · 2026-05-08 · conditional · novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

StrLoRA is a regularized two-stage expert routing method for streaming CVIT that selects experts via textual instructions and applies token-wise cross-modal weighting with historical routing alignment.

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

cs.CV · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

MoE experts in pretrained Transformers exhibit functional decorrelation with near-zero Jacobian alignment yet occupy partially overlapping representation subspaces, with routing sparsity modulating the geometry.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

cs.DC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

citing papers explorer

Showing 50 of 220 citing papers.

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting cs.LG · 2025-09-18 · unverdicted · none · ref 40 · internal anchor
Super-Linear introduces a pretrained MoE architecture using frequency-specialized linear experts and spectral gating for efficient general time series forecasting.
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV · 2025-08-06 · unverdicted · none · ref 110 · internal anchor
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 140 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts cs.LG · 2024-08-28 · conditional · none · ref 8 · internal anchor
Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
Scaling and evaluating sparse autoencoders cs.LG · 2024-06-06 · unverdicted · none · ref 56 · internal anchor
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity cs.LG · 2021-01-11 · accept · none · ref 30 · internal anchor
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 68 · internal anchor
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference cs.AR · 2026-05-22 · unverdicted · none · ref 11 · internal anchor
NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 24 · internal anchor
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs cs.LG · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
FedCoE proposes a coordinated dual-level MoE framework for federated learning that improves global and personalized accuracy while enabling strong cold-start performance for new clients.
Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory cs.CL · 2026-05-20 · unverdicted · none · ref 42 · internal anchor
Memory Grafting improves language-model benchmarks by grafting offline hidden-state memory from a larger model into a recipient model using n-gram lookups and lightweight adapters, outperforming MoE and vanilla Engram baselines at 0.92B and 2.8B scales.
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction cs.CV · 2026-05-20 · unverdicted · none · ref 37 · internal anchor
HDMoE uses hierarchical MoE and RFR modules to address redundant information and fine-grained intra/inter-modality relationships in multimodal cancer survival prediction, with positive results on private liver cancer and TCGA datasets.
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems cs.DC · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.
PRiSE-EEG: A Prior-Guided Foundation Model with Depth-Stratified Experts for Cross-Paradigm EEG Representation Learning eess.SP · 2026-05-18 · unverdicted · none · ref 57 · internal anchor
PRiSE-EEG is a prior-guided EEG foundation model that allocates shared and specialized experts across depth using CKA-derived sigmoid mappings and reports strong cross-paradigm results on 12 benchmarks.
BLAgent: Agentic RAG for File-Level Bug Localization cs.SE · 2026-05-18 · unverdicted · none · ref 45 · internal anchor
BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite for file-level bug localization using agentic RAG, at 18x lower cost than baselines, and boosts end-to-end APR success by over 20%.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models cs.LG · 2026-05-15 · unverdicted · none · ref 43 · internal anchor
UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.
Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates cs.LG · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE cs.AI · 2026-05-14 · conditional · none · ref 17 · internal anchor
BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.
MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data cs.LG · 2026-05-14 · unverdicted · none · ref 26 · 3 links · internal anchor
MoRe identifies modular structure in representations themselves to enable principled reuse, alignment, and expansion of modules during continual adaptation on sequential data.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts cs.CL · 2026-05-13 · unverdicted · none · ref 65 · internal anchor
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning cs.CV · 2026-05-13 · unverdicted · none · ref 35 · 2 links · internal anchor
A3B2 introduces an adaptive asymmetric adapter with uncertainty-aware dampening to reduce branch bias in few-shot vision-language image classification and outperforms standard adapter and prompt methods.
U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics cs.LG · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
U-HNO uses adaptive per-point routing in a U-shaped hybrid architecture to achieve state-of-the-art accuracy on PDE benchmarks with sharp localized features.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction cs.AI · 2026-05-13 · unverdicted · none · ref 37 · internal anchor
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Domain Restriction via Multi SAE Layer Transitions cs.AI · 2026-05-12 · unverdicted · none · ref 25 · internal anchor
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference cs.LG · 2026-05-11 · unverdicted · none · ref 35 · internal anchor
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks cs.CV · 2026-05-11 · unverdicted · none · ref 80 · internal anchor
GenMed uses diffusion models to capture P(X,Y) for medical tasks and performs inference via gradient-based test-time optimization, supporting arbitrary observation combinations without retraining.
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation cs.CV · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning cs.AI · 2026-05-11 · unverdicted · none · ref 21 · 3 links · internal anchor
STAR presents a failure-aware routing framework using a state-conditioned transition policy and an agent routing matrix combining expert routes with learned recoveries from execution traces to improve multi-agent spatiotemporal reasoning.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism cs.LG · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
Test-Time Speculation cs.CL · 2026-05-10 · unverdicted · none · ref 37 · 2 links · internal anchor
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
Sparse Layers are Critical to Scaling Looped Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 5 · internal anchor
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models cs.CL · 2026-05-09 · unverdicted · none · ref 61 · internal anchor
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation cs.DC · 2026-05-08 · unverdicted · none · ref 35 · 2 links · internal anchor
Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 27 · internal anchor
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Hierarchical Mixture-of-Experts with Two-Stage Optimization cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and vision tasks.
Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation cs.CL · 2026-05-08 · unverdicted · none · ref 33 · 2 links · internal anchor
MoLF routes updates between full fine-tuning and LoRA at the optimizer level to match or exceed the better of the two static methods on SQL, medical QA, and counterfactual tasks while an efficient variant outperforms prior adaptive LoRA by up to 20%.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts cs.LG · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models cs.LG · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading cs.LG · 2026-05-07 · unverdicted · none · ref 40 · internal anchor
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism cs.DC · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring cs.CV · 2026-05-06 · unverdicted · none · ref 28 · internal anchor
DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.
Learngene Search Across Multiple Datasets for Building Variable-Sized Models cs.LG · 2026-05-06 · unverdicted · none · ref 21 · internal anchor
LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and training cost.
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking cs.CL · 2026-05-06 · unverdicted · none · ref 45 · internal anchor
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 50 · internal anchor
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Eliminating Hidden Serialization in Multi-Node Megakernel Communication cs.DC · 2026-05-01 · conditional · none · ref 45 · internal anchor
Perseus removes serialization bottlenecks in multi-node megakernel MoE communication via batched per-destination fences and hardware fence flags, delivering up to 10.3x speedup on proxy transports and matching or exceeding GPU-direct RDMA.
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents cs.CL · 2026-05-01 · unverdicted · none · ref 15 · internal anchor
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 8 · internal anchor
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 6 · internal anchor
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched baselines.
QERNEL: a Scalable Large Electron Model cond-mat.str-el · 2026-04-28 · unverdicted · none · ref 24 · internal anchor
QERNEL is a single conditioned neural wavefunction that variationally solves families of many-electron Hamiltonians in moiré heterobilayers and identifies the quantum liquid-crystal phase transition.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer