Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Andy Davis, Azalia Mirhoseini, Geoffrey Hinton, Jeff Dean, Krzysztof Maziarz, Noam Shazeer, Quoc Le · 2017 · cs.LG · arXiv 1701.06538

Canonical reference. 75% of classified citations from Pith papers cite this work as background.

308 Pith papers citing it

Background 75% of classified citations

abstract

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
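
The gating step the abstract describes can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: it assumes the gate is a softmax over the k largest logits of a linear gating layer, and it leaves out details such as gating noise and load balancing. All names (`top_k_gates`, `moe_forward`, `make_expert`) and the toy sizes are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def top_k_gates(logits, k):
    """Return {expert_index: gate} for the k largest logits.

    Gates are a softmax restricted to the selected experts, so they sum to 1.
    Experts that are not selected get no entry at all.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = logits[top[0]]  # subtract the largest logit for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}


def make_expert(rng, dim, hidden):
    """Build a toy one-hidden-layer ReLU network (hypothetical expert)."""
    w_in = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, a) for a in matvec(w_in, x)]
        return matvec(w_out, h)

    return expert


def moe_forward(x, gate_weights, experts, k):
    """Evaluate only the top-k experts and return their gate-weighted sum."""
    gates = top_k_gates(matvec(gate_weights, x), k)
    output = [0.0] * len(x)
    for index, gate in gates.items():
        expert_out = experts[index](x)  # unselected experts are never called
        output = [acc + gate * value for acc, value in zip(output, expert_out)]
    return output, gates


# Toy configuration (assumed values, chosen only for illustration).
rng = random.Random(0)
dim, hidden, num_experts, k = 4, 8, 16, 2
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(rng, dim, hidden) for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(dim)]

y, gates = moe_forward(x, gate_weights, experts, k)
print("selected experts:", sorted(gates))
print("gate sum:", round(sum(gates.values()), 6))
```

Only `k` of the `num_experts` experts run for a given input, which is why the parameter count can grow with the number of experts while per-example compute stays roughly constant.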

citation-role summary

background: 39 · method: 10 · baseline: 2 · dataset: 1
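
The 75% background figure quoted on this page can be reproduced from the role counts. The sketch below assumes that the "classified citations" total is simply the sum of the four role counts (39 + 10 + 2 + 1 = 52); the page does not state how the total is defined, and the `shares` helper is hypothetical.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 39, "method": 10, "baseline": 2, "dataset": 1}


def shares(counts):
    """Return each label's fraction of the total count."""
    total = sum(counts.values())  # assumption: total = sum of listed counts
    return {label: count / total for label, count in counts.items()}


role_shares = shares(role_counts)
for label, share in role_shares.items():
    print(f"{label}: {share:.1%}")
```

Under that assumption the background share is 39 / 52, which is exactly 75%.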

citation-polarity summary

background: 39 · use method: 9 · baseline: 2 · support: 1 · use dataset: 1

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.

Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

GeMoE adaptively sets the number of experts per token via gating entropy, retaining 99.5% of static-routing performance while raising average sparsity by 36.5%.

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

cs.DC · 2026-06-10 · unverdicted · novelty 7.0

ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.

PCCL: Process Group-Aware Scalable and Generic Collective Algorithm Synthesizer

cs.DC · 2026-06-05 · unverdicted · novelty 7.0

PCCL synthesizes near-optimal topology-aware collective algorithms for arbitrary patterns while being process group-aware and scalable to subsets of devices.

Less is MoE: Trimming Experts in Domain-Specialist Language Models

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.

Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

cs.IR · 2026-06-03 · unverdicted · novelty 7.0

Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

cs.DC · 2026-05-30 · unverdicted · novelty 7.0

ViBE co-optimizes expert placement with measured GPU performance variability in MoE inference to cut execution-time imbalance, delivering 14% better SLO attainment and up to 45% lower P90 TTFT.

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

math.DS · 2026-05-27 · unverdicted · novelty 7.0

A mean-field limit of a reinforcement-based softmax router for two experts shows a supercritical pitchfork bifurcation, with an external asymmetry unfolding it into a cusp of fold bifurcations.

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

cs.IR · 2026-05-26 · unverdicted · novelty 7.0

L2Rec introduces dual-view personalized low-rank perturbations via DPMoE to let one LLM backbone produce complementary behavioral and semantic adaptations, with cross-view fusion, outperforming baselines on four datasets and in industrial A/B tests.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing that most models lag human baselines, especially in transformation and configuration.

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Expert specialization in vision MoE models is dominated by a stable animate-inanimate distinction visible from gating to readout, with broader tuning to continuous visual and semantic dimensions rather than narrow categorical preferences.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

Dynamic Chunking for Diffusion Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

MuteBench evaluates the robustness of multimodal fusion to missing modalities and to missing data within a modality on 125,000 samples from 9 clinical datasets, finding that architecture family predicts tolerance better than parameter count.

Vector-Quantized Discrete Latent Factors Meet Financial Priors: Dynamic Cross-Sectional Stock Ranking Prediction for Portfolio Construction

cs.LG · 2026-05-13 · conditional · novelty 7.0

PRISM-VQ integrates vector-quantized latent factors with financial priors and a structure-conditioned mixture-of-experts to deliver improved cross-sectional stock return predictions and portfolio performance on CSI 300 and S&P 500.

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

citing papers explorer

Showing 50 of 308 citing papers.

Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling · cs.LG · 2026-04-21 · unverdicted · none · ref 18 · internal anchor
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation · cs.IR · 2026-04-21 · unverdicted · none · ref 86 · internal anchor
STK-Adapter adds Spatial-Temporal MoE, Event-Aware MoE, and Cross-Modality Alignment MoE to integrate evolving TKG graphs and event chains into LLMs, reducing information loss and improving extrapolation performance over prior methods.
Domain-Specialized Object Detection via Model-Level Mixtures of Experts · cs.CV · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms · cs.CV · 2026-04-16 · unverdicted · none · ref 40 · internal anchor
WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation · cs.LG · 2026-04-14 · unverdicted · none · ref 48 · internal anchor
SubFlow restores full mode coverage in one-step flow matching by conditioning on sub-modes from semantic clustering, yielding higher diversity on ImageNet-256 while preserving FID.
Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence · cs.SE · 2026-03-27 · unverdicted · none · ref 54 · internal anchor
Empirical case study on a flagship Android device profiles energy, latency, and quality trade-offs across eight LLMs, revealing a quantization energy paradox and identifying mid-sized models as practical sweet spots.
Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment · cs.NI · 2026-03-26 · unverdicted · none · ref 3 · internal anchor
Qwen-2.5-3B achieves 0.793 accuracy and 988 ms median latency on six-class task routing but misses the pre-registered viability bar of 0.85 accuracy and 2000 ms P95 latency.
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts · cs.LG · 2025-10-09 · unverdicted · none · ref 18 · internal anchor
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.
Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition · eess.AS · 2025-09-10 · unverdicted · none · ref 78 · internal anchor
Sparse MERIT uses frame-wise sparse mixture-of-experts with task-specific gating on self-supervised speech features to jointly optimize enhancement and emotion recognition, reporting gains over baselines on MSP-Podcast at low SNR.
PiKV: KV Cache Management System for Mixture of Experts · cs.DC · 2025-08-02 · unverdicted · none · ref 18 · internal anchor
PiKV proposes expert-sharded KV storage, PiKV routing, adaptive scheduling, and compression modules to reduce overhead in multi-GPU MoE inference.
Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate · cs.LG · 2025-07-08 · unverdicted · none · ref 4 · internal anchor
Demonstrates that Transformers can continue learning when grown modularly above a frozen minimal token interface under a fixed active-parameter budget, with reported viability in 9-layer and 16-layer experiments.
Test-Time Alignment via Hypothesis Reweighting · cs.LG · 2024-12-11 · unverdicted · none · ref 53 · internal anchor
HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.
Mixtral of Experts · cs.LG · 2024-01-08 · unverdicted · none · ref 28 · internal anchor
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model · cs.CL · 2022-01-28 · unverdicted · none · ref 61 · internal anchor
Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.
4K-Memristor Analog-Grade Passive Crossbar Circuit · cs.ET · 2019-06-27 · conditional · none · ref 32 · internal anchor
Experimental demonstration of a 4K-device passive analog memristor crossbar with high yield and sufficient precision for neuromorphic pattern classification.
Attention Is All You Need · cs.CL · 2017-06-12 · unverdicted · none · ref 32 · internal anchor
A Multi-task Mixture of Experts Framework for Malware Classification, Packing Detection, and Family Attribution · cs.CR · 2026-06-29 · unverdicted · none · ref 10 · internal anchor
A Multi-Gate MoE architecture achieves 0.9744 combined detection rate across three malware tasks with improved robustness to mutations compared to other MoE variants.
Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts? · cs.LG · 2026-06-28 · unverdicted · none · ref 26 · internal anchor
Representation decorrelation regularization in MoE models improves explanation faithfulness on multimodal benchmarks while preserving task performance.
CMSL: Constructive Multi-Sequence Learning for Recommendation Systems · cs.IR · 2026-06-26 · unverdicted · none · ref 96 · internal anchor
CMSL uses a learnable module to disentangle user history into multiple pure sequences modeled with linear attention to improve recommendation performance over single-sequence approaches.
Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI · cs.LG · 2026-06-24 · unverdicted · none · ref 26 · internal anchor
A cycle-based reentry architecture is proposed to guarantee self-model emergence, self-preservation, and prompt-injection immunity in AGI via a D-I loop and a new S-measure of integrated information.
LiMoDE: Rethinking Lifelong Robot Manipulation from a Mixture-of-Dynamic-Experts Perspective · cs.RO · 2026-06-24 · unverdicted · none · ref 52 · internal anchor
LiMoDE uses dynamic MoE pre-training on motion cues followed by lifelong expert addition for continuous robot task adaptation.
Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study · cs.PF · 2026-06-19 · accept · none · ref 23 · internal anchor
Empirical benchmarks show MoE inference cost on edge hardware tracks total parameters rather than active parameters, with OLMoE-1B-7B behind dense baselines especially on the Jetson device.
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale · cs.CL · 2026-06-13 · unverdicted · none · ref 74 · internal anchor
Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling · cs.LG · 2026-06-05 · unverdicted · none · ref 59 · internal anchor
A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition · cs.CV · 2026-05-20 · unverdicted · none · ref 32 · 2 links · internal anchor
Rank-aware selective fusion via attention-based gating and decoupled presence/salience heads with unsupervised domain adaptation outperforms baselines and ranks 2nd on the BlEmoRE challenge for blended emotion recognition.
Tracing the ongoing emergence of human-like reasoning in Large Language Models · cs.CL · 2026-05-20 · unverdicted · none · ref 52 · internal anchor
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources · cs.DC · 2026-05-16 · unverdicted · none · ref 27 · internal anchor
GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE · cs.CV · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics · cs.DC · 2026-05-02 · accept · none · ref 29 · internal anchor
LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.
CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation · cs.CV · 2026-04-28 · unverdicted · none · ref 37 · internal anchor
CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation · cs.AI · 2026-04-23 · unverdicted · none · ref 36 · internal anchor
LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input · cs.RO · 2026-04-21 · unverdicted · none · ref 24 · internal anchor
Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.
A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction · eess.IV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor
A two-stage architecture processes multi-modal MRI data independently before late fusion to classify developmental stages and predict brain age within stages for lifespan assessment.
Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework · cs.LG · 2026-04-14 · unverdicted · none · ref 19 · internal anchor
A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO · cs.DC · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
StableHLO serves as a viable unified representation for cross-architecture performance modeling of distributed ML workloads, preserving relative trends while exposing fidelity trade-offs.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities · cs.CL · 2025-07-07 · unverdicted · none · ref 76 · internal anchor
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics · cs.RO · 2025-06-16 · unverdicted · none · ref 25 · internal anchor
GRaD-Nav++ combines 3D Gaussian Splatting simulation and differentiable RL to train an onboard VLA policy that achieves 50-83% success on language-guided drone navigation tasks in simulation and real hardware.
A 35B Hybrid-Attention Mixture-of-Experts Model on a 6GB 2011 GPU: Hand-Written 4-bit CUDA Inference for Fermi · cond-mat.other · 2026-06-23 · unverdicted · none · ref 4 · internal anchor
Engineers implement hybrid CPU-GPU 4-bit inference for a 35B hybrid-attention MoE on a 2011 Fermi GPU, reporting 34% lower prefill latency and 3x decode throughput via custom kernels and expert pinning.
Hybrid Compression: Integrating Pruning and Quantization for Optimized Neural Networks · cs.CV · 2026-06-22 · unverdicted · none · ref 25 · internal anchor
Hybrid method applies pruning and quantization followed by MoE routing of compressed CNN experts to achieve large reductions in FLOPs and parameters with negligible accuracy loss on benchmarks.
Logit Distillation on Manifolds: Mapping by Learning · cs.LG · 2026-05-30 · unverdicted · none · ref 12 · internal anchor
Presents a layer- and point-wise projection mapping for manifold-based logit distillation combined with LoRA to enable low-parameter student training with reported WER gains.
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent · cs.CL · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
MMoA adds LSTM recurrence to Mixture-of-Agents routing, reaching 58.0% win rate on AlpacaEval 2.0 versus 59.8% for baseline MoA while cutting runtime by up to 4.6%.
Deep Learning for Electricity Price Forecasting: A Review of Day-Ahead, Intraday, and Balancing Electricity Markets · q-fin.CP · 2026-02-10 · unverdicted · none · ref 66 · internal anchor
A structured review organizes deep learning models for electricity price forecasting via a backbone-head-loss taxonomy and identifies gaps in intraday and balancing market applications.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems · cs.LG · 2026-01-20 · unverdicted · none · ref 140 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
A Survey on Large Language Models for Code Generation · cs.CL · 2024-06-01 · unverdicted · none · ref 238 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
Large Language Models: A Survey · cs.CL · 2024-02-09 · accept · none · ref 130 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
The Hitchhiker's Guide to Agentic AI: From Foundations to Systems · cs.AI · 2026-06-22 · unverdicted · none · ref 115 · internal anchor
A comprehensive reference book organizing existing techniques for agentic AI systems across LLM substrate, reasoning, agent design patterns, inter-agent coordination, and production deployment.
Benchmarking PNW Model for MedMNIST to 100% Accuracy · cs.AI · 2026-04-20 · unverdicted · none · ref 13 · internal anchor
A new 'Artificial Special Intelligence' method is claimed to enable error-free training of classification models to 100% accuracy on 15 of 18 MedMNIST biomedical datasets.
Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research · cs.CL · 2024-11-30 · unverdicted · none · ref 127 · internal anchor
This survey paper identifies opportunities for LLMs in low-resource language humanities research along with challenges in data accessibility, model adaptability, and cultural sensitivity.
A Comprehensive Overview of Large Language Models · cs.CL · 2023-07-12 · unverdicted · none · ref 121 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
Hyperloop Transformers · cs.LG · 2026-04-23 · unreviewed · ref 23 · internal anchor