pith. machine review for the scientific record.

arxiv: 2006.16668 · v1 · submitted 2020-06-30 · 💻 cs.CL · cs.LG · stat.ML

Recognition: 2 theorem links


GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML
keywords model scaling · mixture of experts · automatic sharding · conditional computation · neural machine translation · Transformer · multilingual models · model parallelism

The pith

GShard enables scaling of sparsely-gated mixture-of-experts models beyond 600 billion parameters through automatic sharding and minimal code changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GShard as a set of lightweight annotation APIs plus an XLA compiler extension that lets developers express parallel computation patterns without rewriting large parts of their models. It demonstrates this by scaling a multilingual neural machine translation Transformer that uses Sparsely-Gated Mixture-of-Experts layers to more than 600 billion parameters. The authors trained the resulting model on 2048 TPU v3 accelerators in four days and report substantially better translation quality from 100 languages into English than earlier systems. A reader would care because scaling model size has repeatedly improved performance on language tasks, yet the practical barriers of manual parallelization and compute cost have limited how far that scaling can go. If the approach holds, it removes much of the engineering friction that currently stands between researchers and giant conditional-computation models.

Core claim

GShard is a module of lightweight annotation APIs and an extension to the XLA compiler that provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. Using GShard, the authors scaled a multilingual neural machine translation Transformer with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters. The model trained efficiently on 2048 TPU v3 accelerators in four days and delivered far superior quality for translation from 100 languages to English compared with prior art.
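
To make the annotation-plus-compiler split concrete, here is a minimal, purely illustrative Python sketch. The `Mesh` class, the `split` annotation, and the toy `materialize_shards` pass are hypothetical stand-ins invented for this note, not GShard's actual API (which lives in TensorFlow and the XLA compiler); the point is only that the model code changes where a tensor is annotated, while the actual partitioning happens in a separate compilation step.

```python
import numpy as np

class Mesh:
    """Hypothetical 1-D device mesh (a stand-in for a TPU slice)."""
    def __init__(self, num_devices):
        self.num_devices = num_devices

def split(tensor, axis, mesh):
    """Hypothetical annotation: record how a tensor should be partitioned.

    Like GShard's annotations, it does not change the math; it only attaches
    metadata that a later compiler pass can act on.
    """
    return np.asarray(tensor), {"axis": axis, "num_shards": mesh.num_devices}

def materialize_shards(tensor, meta):
    """Toy 'compiler' pass: produce the per-device shards from the annotation."""
    return np.split(tensor, meta["num_shards"], axis=meta["axis"])

mesh = Mesh(num_devices=4)
x = np.random.randn(8, 16)                                     # activations
w, w_meta = split(np.random.randn(16, 32), axis=1, mesh=mesh)  # the only code change

# Each shard computes its slice of the output; concatenation recovers the
# unannotated result, so the annotation is semantically invisible to the model.
shards = materialize_shards(w, w_meta)
y = np.concatenate([x @ s for s in shards], axis=1)
assert np.allclose(y, x @ w)
```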

What carries the argument

The GShard module itself: lightweight annotation APIs plus an XLA compiler extension that together automate sharding for conditional computation patterns such as the Sparsely-Gated Mixture-of-Experts layer.
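
As a rough illustration of the routing pattern that layer implements, the NumPy sketch below does top-2 gating with a hard per-expert capacity. It is a simplification: the real GShard layer adds an auxiliary load-balancing loss, random dispatch for the second expert, and group-wise processing, all omitted here, and every name and shape is illustrative only.

```python
import numpy as np

def top2_gating(logits, capacity):
    """Top-2 gating with a per-expert capacity limit (simplified)."""
    num_tokens, num_experts = logits.shape
    gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
    top2 = np.argsort(-gates, axis=1)[:, :2]          # two best experts per token

    expert_load = np.zeros(num_experts, dtype=int)
    assignments = []                                   # (token, expert, gate weight)
    for t in range(num_tokens):
        kept = [e for e in top2[t] if expert_load[e] < capacity]
        for e in kept:
            expert_load[e] += 1
        total = sum(gates[t, e] for e in kept)
        for e in kept:                                 # renormalize the kept gates
            assignments.append((t, e, gates[t, e] / total))
    return assignments

def moe_layer(x, expert_weights, assignments):
    """Each token's output is the gate-weighted sum of its experts' outputs."""
    y = np.zeros((x.shape[0], expert_weights[0].shape[1]))
    for t, e, w in assignments:
        y[t] += w * (x[t] @ expert_weights[e])
    return y  # tokens dropped for capacity stay zero; the residual path carries them

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))                      # 16 tokens, model dim 8
experts = [rng.normal(size=(8, 8)) for _ in range(4)]  # 4 toy experts (one matmul each)
out = moe_layer(tokens, experts, top2_gating(rng.normal(size=(16, 4)), capacity=8))
print(out.shape)  # (16, 8)
```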

If this is right

  • Models that activate only a small subset of parameters per input can be trained at scales previously limited by manual sharding effort.
  • Training runs for models exceeding 600 billion parameters become feasible on accelerator clusters within days rather than weeks or months.
  • Multilingual neural machine translation quality improves measurably when the number of experts and total parameters increase under the same training budget.
  • Existing Transformer code bases can adopt conditional computation and model parallelism with only localized annotation changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same annotation-plus-compiler pattern could be applied to other sparse architectures in vision or speech models without requiring new hardware primitives.
  • Widespread adoption might shift research focus from hand-tuned parallelism to higher-level decisions about which computations should be conditional.
  • If the overhead remains low, future work could explore even larger numbers of experts or dynamic routing across modalities while keeping code readable.

Load-bearing premise

The automatic sharding and conditional computation can be realized with minimal model-code changes and without introducing correctness or performance problems that would invalidate the reported quality gains or training efficiency.
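
What that premise demands can be shown in miniature: an expert-parallel dispatch/combine path has to reproduce the naive per-token computation. The sketch below is a conceptual NumPy check, not GShard's implementation; the one-hot dispatch tensor and einsum contractions stand in for the sharded, all-to-all version of the same arithmetic, and the capacity is set large enough that nothing is dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, E = 12, 4, 3                         # tokens, model dim, experts
C = T                                      # capacity = T so no token is dropped here
x = rng.normal(size=(T, D))
experts = rng.normal(size=(E, D, D))       # one matmul per expert
assign = rng.integers(0, E, size=T)        # pretend top-1 routing already happened

# Reference semantics: push each token through its expert, one at a time.
ref = np.stack([x[t] @ experts[assign[t]] for t in range(T)])

# Dispatch/combine semantics: one-hot dispatch tensor, per-expert batches, scatter back.
slot = np.zeros(E, dtype=int)
dispatch = np.zeros((E, C, T))             # (expert, capacity slot, token)
for t in range(T):
    e = assign[t]
    dispatch[e, slot[e], t] = 1.0
    slot[e] += 1
expert_in = np.einsum("ect,td->ecd", dispatch, x)           # gather per-expert batches
expert_out = np.einsum("ecd,edf->ecf", expert_in, experts)  # run all experts "in parallel"
combined = np.einsum("ect,ecf->tf", dispatch, expert_out)   # scatter back to token order

# The audit the premise calls for: both paths agree to floating-point tolerance.
print(np.allclose(ref, combined))  # True
```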

What would settle it

Re-implementing the 600-billion-parameter multilingual translation model with GShard, running it on 2048 TPU v3 accelerators, and checking whether training completes in roughly four days while matching or exceeding the claimed BLEU improvements over prior models.
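
For the quality half of that check, corpus-level BLEU for each language pair can be scored with a standard tool such as sacrebleu. The file names below are placeholders for a hypothetical evaluation layout, not artifacts released with the paper.

```python
# pip install sacrebleu
import sacrebleu

# Hypothetical layout: one detokenized hypothesis and one reference per line.
with open("hyps.fr-en.txt") as f:
    hyps = [line.strip() for line in f]
with open("refs.fr-en.txt") as f:
    refs = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(f"fr-en BLEU: {bleu.score:.2f}")

# Repeating this for each of the 100 source languages and comparing against the
# baseline system is what "matching or exceeding the claimed BLEU improvements" means.
```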

read the original abstract

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GShard, a module of lightweight annotation APIs plus an XLA compiler extension that lets users express a wide range of parallel computation patterns (including conditional computation) with minimal changes to existing model code. It demonstrates the approach by scaling a multilingual NMT Transformer that uses a Sparsely-Gated Mixture-of-Experts layer to more than 600 billion parameters, training the model on 2048 TPU v3 chips in four days and reporting substantially better translation quality from 100 languages into English than prior systems.

Significance. If the empirical claims are reproducible and the sharding semantics are preserved, the work is significant because it shows a practical route to training giant conditional-computation models at the 600 B+ scale with only modest code changes. The combination of automatic sharding and MoE routing could lower the barrier to experimenting with models whose size would otherwise be limited by manual partitioning effort.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (MoE scaling results): the headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.
  2. [§3 and §4] §3 (GShard API and XLA extension) and §4 (MoE dispatch/combine): the paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a short table or bullet list of the exact API annotations introduced (@gshard, mesh, etc.) so readers can immediately see the claimed 'minimal code change' surface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving the presentation of results and verification of implementation correctness. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (MoE scaling results): the headline claim that the 600 B+ model achieves 'far superior quality' is presented without any quantitative metrics (BLEU scores, baselines, number of languages evaluated, or statistical significance), so the link between the GShard implementation and the reported quality gain cannot be evaluated from the given text.

    Authors: We agree that the absence of explicit quantitative metrics in the abstract and Section 4 makes it difficult to evaluate the quality claims. In the revised manuscript we have added specific BLEU scores for the 600B model, direct baseline comparisons against prior systems, the exact number of languages evaluated, and notes on evaluation methodology to make the improvements verifiable. revision: yes

  2. Referee: [§3 and §4] §3 (GShard API and XLA extension) and §4 (MoE dispatch/combine): the paper asserts that the automatic sharding of top-k expert routing, capacity-factor dispatch, and all-to-all communication preserves exact semantics and gradient flow, yet supplies neither a machine-checked equivalence argument nor a side-by-side numerical audit of sharded versus unsharded forward/backward passes at the scale used; this is load-bearing for the correctness of the 4-day training result.

    Authors: We acknowledge that the manuscript does not include a machine-checked equivalence proof or a full-scale numerical audit at 600B parameters. A formal machine-checked argument for the XLA extensions is outside the scope of the paper. The GShard annotations are designed to produce an identical computation graph to the unsharded version, with sharding applied as a transparent compiler transformation that preserves dataflow and gradients by construction. In the revision we have added a side-by-side numerical audit on a smaller-scale model (showing forward and backward passes match within floating-point tolerance) in the appendix to provide concrete verification evidence. revision: partial
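
A toy version of the audit described in this response, for a single column-sharded weight rather than the authors' smaller-scale model: the sharded forward pass and its gradient must match the unsharded computation within floating-point tolerance. Shapes and the choice of loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(32, 64))               # batch of activations
W = rng.normal(size=(64, 128))              # weight, column-sharded over 4 "devices"
shards = np.split(W, 4, axis=1)

# Forward audit: per-shard matmuls concatenated vs. the unsharded matmul.
y_full = x @ W
y_shard = np.concatenate([x @ s for s in shards], axis=1)

# Backward audit for loss = 0.5 * sum(y**2): dL/dW = x.T @ y, computed per shard.
g_full = x.T @ y_full
g_shard = np.concatenate([x.T @ (x @ s) for s in shards], axis=1)

print(np.allclose(y_full, y_shard), np.allclose(g_full, g_shard))  # True True
```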

Circularity Check

0 steps flagged

No circularity: empirical systems demonstration without derivation or fitted predictions

full rationale

The paper presents GShard as a set of lightweight annotation APIs plus an XLA compiler extension that enables automatic sharding for conditional computation patterns such as sparsely-gated MoE. Its core claim is an end-to-end empirical result: a 600B-parameter multilingual Transformer was trained on 2048 TPU v3 chips in four days and produced superior BLEU scores. No equations, first-principles derivations, parameter fits, or predictions appear in the abstract or described content. The result is externally falsifiable by re-implementation and re-training rather than being forced by any self-definition, self-citation chain, or renaming of prior results. This is a standard non-circular engineering paper whose validity rests on implementation correctness and experimental reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems and engineering paper; no mathematical free parameters, domain axioms, or new invented entities are introduced.

pith-pipeline@v0.9.0 · 5498 in / 1088 out tokens · 55391 ms · 2026-05-11T02:22:08.183651+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  3. Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 7.0

    Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.

  4. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  5. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  6. Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

    cs.CV 2026-05 unverdicted novelty 7.0

    BatMIL uses hybrid hyperbolic-Euclidean geometry, an S4 state-space backbone, and chunk-level mixture-of-experts to outperform prior multiple-instance learning methods on seven whole-slide image datasets across six cancers.

  7. AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures

    cs.LG 2026-05 unverdicted novelty 7.0

    Approximate multipliers degrade MoE and dense DNNs at different rates; ResNet-20 recovers fully after retraining while VGG models often fail at aggressive approximations except Cluster MoE, and Hard MoE can outperform...

  8. Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.

  9. Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and common...

  10. FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

    cs.DC 2026-04 unverdicted novelty 7.0

    FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.

  11. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

  12. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  13. A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    A mixture-of-experts transformer foundation model pretrained on diverse SEM images enables generalization across materials and outperforms SOTA on unsupervised defocus-to-focus restoration.

  14. Path-Constrained Mixture-of-Experts

    cs.LG 2026-03 unverdicted novelty 7.0

    PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

  15. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  16. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  17. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

    cs.AI 2026-05 conditional novelty 6.0

    BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.

  18. Combining pre-trained models via localized model averaging

    stat.ME 2026-05 unverdicted novelty 6.0

    Localized model averaging with covariate-dependent weights achieves asymptotic optimality and weight consistency for combining pre-trained models under a general loss framework.

  19. Enabling Performant and Flexible Model-Internal Observability for LLM Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

  20. DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

    cs.LG 2026-05 unverdicted novelty 6.0

    DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.

  21. XPERT: Expert Knowledge Transfer for Effective Training of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

  22. Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

  23. DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.

  24. Hierarchical Mixture-of-Experts with Two-Stage Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    Hi-MoE uses two-level hierarchical routing objectives to enforce group-level balance while promoting within-group specialization, yielding better perplexity and expert utilization than prior MoE baselines in NLP and v...

  25. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  26. MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

    cs.AR 2026-05 unverdicted novelty 6.0

    MoE-Hub enables seamless MoE communication overlap via hardware-accelerated destination-agnostic data transmission, delivering 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedups over prior systems.

  27. Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs

    cs.AR 2026-05 unverdicted novelty 6.0

    DySHARP accelerates MoE expert parallelism via dynamic multimem addressing and token-centric kernel fusion to cut redundant traffic and deliver up to 1.79x speedup over prior in-switch solutions.

  28. Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.

  29. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  30. ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeRO-Prefill achieves 1.35-1.59x higher throughput for MoE prefill serving by replacing per-layer activation AllToAll with overlapped asynchronous weight AllGather and prefix-aware routing.

  31. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

  32. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  33. Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval baselines at 0.73B-1.15B scale...

  34. Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.

  35. Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...

  36. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  37. WiFo-MiSAC: A Wireless Foundation Model for Multimodal Sensing and Communication Integration via Synesthesia of Machines (SoM)

    eess.SP 2026-04 unverdicted novelty 6.0

    WiFo-MiSAC is a task-agnostic foundation model that unifies multimodal wireless signals via tokenization and self-supervised learning with SS-DMoE to achieve strong few-shot performance on beam prediction and channel ...

  38. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  39. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  40. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  41. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  42. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  43. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 5.0

    ResiHP improves LLM training throughput by 1.04-4.39x under hardware failures by using a workload-aware execution time predictor to avoid false failure detections and a scheduler that dynamically changes parallelism g...

  44. FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

    cs.DC 2026-04 unverdicted novelty 5.0

    FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.

  45. Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.

  46. PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs

    cs.LG 2026-04 accept novelty 5.0

    PINNACLE is an open-source framework for classical and quantum PINNs that supplies modular training methods and benchmarks showing high sensitivity to architecture choices plus parameter-efficiency gains in some hybri...

  47. M-IDoL: Information Decomposition for Modality-Specific and Diverse Representation Learning in Medical Foundation Model

    cs.CV 2026-04 unverdicted novelty 5.0

    M-IDoL learns modality-specific and diverse representations by maximizing inter-modality entropy and minimizing intra-modality uncertainty through information decomposition in MoE subspaces.

  48. HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.

  49. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

  50. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  51. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  52. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  53. ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

    cs.DC 2026-05 unverdicted novelty 4.0

    ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.

  54. Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance

    cs.AI 2026-05 unverdicted novelty 4.0

    AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.

  55. Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

    cs.DC 2026-05 accept novelty 4.0

    LLM serving requires mathematical optimization and algorithms with provable guarantees rather than generic heuristics that fail unpredictably on LLM workloads.

  56. Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation

    cs.AI 2026-04 unverdicted novelty 4.0

    LLM chain-of-thought rewriting of job postings plus category-aware MoE improves person-job fit AUC by 2.4%, GAUC by 7.5%, and live click-through conversion by 19.4%.

  57. Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input

    cs.RO 2026-04 unverdicted novelty 4.0

    Sparsely gated MoE policies double the success rate of a real Unitree Go2 quadruped on large-obstacle parkour versus matched-active-parameter MLP baselines while cutting inference time compared with a scaled-up MLP.

  58. Efficient Handwriting-Based Alzheimer's Disease Diagnosis Using a Low-Rank Mixture of Experts Deep Learning Framework

    cs.LG 2026-04 unverdicted novelty 4.0

    A low-rank mixture of experts model trained on handwriting data delivers strong Alzheimer's diagnosis performance with substantially reduced parameter activation during inference.

  59. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

  60. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 58 Pith papers · 9 internal anchors

  1. [1]

    On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, June 2018

    Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018

  2. [2]

    The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018

  3. [3]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  4. [4]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  5. [5]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018

  6. [7]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  7. [8]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016

  8. [9]

    Nas-fpn: Learning scalable feature pyramid architecture for object detection

    Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7036–7045, 2019

  9. [10]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017

  10. [11]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

  11. [12]

    Language Models are Few-Shot Learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

  12. [13]

    Unsupervised cross-lingual representation learning at scale, 2019

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale, 2019

  13. [14]

    Massively multilingual neural machine translation in the wild: Findings and challenges, 2019

    Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019

  14. [15]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32, pages 103–112, 2019

  15. [16]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  16. [17]

    High-Dimensional Dynamics of Generalization Error in Neural Networks

    Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks, 2017

  17. [18]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically, 2017

  18. [19]

    Beyond human-level accuracy

    Joel Hestness, Newsha Ardalani, and Gregory Diamos. Beyond human-level accuracy. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, Feb 2019

  19. [20]

    Scaling description of generalization with number of parameters in deep learning

    Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, Feb 2020

  20. [21]

    Tensorflow: a system for large-scale machine learning

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016

  21. [23]

    Mesh-tensorflow: Deep learning for supercomputers

    Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10414–10423, 2018

  22. [24]

    PipeDream: Fast and Efficient Pipeline Parallel DNN Training

    Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018

  23. [25]

    Conditional computation in neural networks for faster models, 2015

    Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models, 2015

  24. [26]

    Depth-Adaptive Transformer

    Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. ArXiv, abs/1910.10073, 2020

  25. [27]

    Controlling computation versus quality for neural sequence models, 2020

    Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Controlling computation versus quality for neural sequence models, 2020

  26. [28]

    XLA: Optimizing Compiler for TensorFlow

    XLA: Optimizing Compiler for TensorFlow. https://www.tensorflow.org/xla, 2019. Online; accessed 1 June 2020

  27. [29]

    Rectified Linear Units Improve Restricted Boltzmann Machines

    Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010

  28. [30]

    Die grundlage der allgemeinen relativitätstheorie

    Albert Einstein. Die grundlage der allgemeinen relativitätstheorie. In Das Relativitätsprinzip, pages 81–124. Springer, 1923

  29. [31]

    Lingvo: a modular and scalable framework for sequence-to-sequence modeling

    Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia Xu Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295, 2019

  30. [32]

    Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs

    Youlong Cheng, HyoukJoong Lee, and Tamas Berghammer. Train ML models on large images and 3D volumes with spatial partitioning on Cloud TPUs. https://cloud.google.com/blog/products/ai-machine-learning/train-ml-models-on-large-images-and-3d-volumes-with-spatial-partitioning-on-cloud-tpus

  31. [33]

    Online; accessed 12 June 2020.

  32. [34]

    ONNX: Open Neural Network Exchange

    ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019. Online; accessed 1 June 2020

  33. [35]

    Relay: a new ir for machine learning frameworks

    Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, and Zachary Tatlock. Relay: a new ir for machine learning frameworks. Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages - MAPL 2018, 2018

  34. [36]

    Glow: Graph lowering compiler techniques for neural networks, 2018

    Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, and Man Wang. Glow: Graph lowering compiler techniques for neural networks, 2018

  35. [37]

    MPI: A Message-Passing Interface Standard

    MPI Forum. MPI: A Message-Passing Interface Standard. Version 2.2, September 4th 2009. available at: http://www.mpi-forum.org (Dec. 2009)

  36. [38]

    BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy

    Minsik Cho, Ulrich Finkler, and David Kung. BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019

  37. [39]

    A Cellular Computer to Implement the Kalman Filter Algorithm

    Lynn Elliot Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, USA, 1969. AAI7010025

  38. [40]

    Multi-way, multilingual neural machine translation with a shared attention mechanism

    Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016

  39. [41]

    Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

    Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, and et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, Dec 2017

  40. [42]

    Massively multilingual neural machine translation

    Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. CoRR, abs/1903.00089, 2019

  41. [43]

    Exploring Massively Multilingual, Massive Neural Machine Translation

    Exploring massively multilingual, massive neural machine translation. https://ai.googleblog.com/2019/10/exploring-massively-multilingual.html. Accessed: 2020-06-05

  42. [44]

    Recent Advances in Google Translate

    Recent advances in google translate. https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. Accessed: 2020-06-05

  43. [45]

    Transfer of training: A review and directions for future research

    Timothy T Baldwin and J Kevin Ford. Transfer of training: A review and directions for future research. Personnel psychology, 41(1):63–105, 1988

  44. [46]

    Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

  45. [47]

    Low-rank approximations for conditional feedforward computation in deep neural networks, 2013

    Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks, 2013

  46. [48]

    Large Scale Parallel Document Mining for Machine Translation

    Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, page 1101–1109, USA, 2010. Association for Computational Linguistics

  47. [49]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002

  48. [50]

    Training deeper neural machine translation models with transparent attention

    Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

  49. [51]

    Language modeling with deep transformers

    Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney. Language modeling with deep transformers. Interspeech 2019, Sep 2019

  50. [52]

    Learning Deep Transformer Models for Machine Translation

    Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

  51. [53]

    The Evolved Transformer

    David R. So, Chen Liang, and Quoc V. Le. The evolved transformer, 2019

  52. [54]

    Using bfloat16 with TensorFlow models

    Using bfloat16 with TensorFlow models. https://cloud.google.com/tpu/docs/bfloat16, 2020. Online; accessed 12 June 2020

  53. [55]

    Wide and deep learning for recommender systems

    Heng-Tze Cheng, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, Hemal Shah, Levent Koc, Jeremiah Harmsen, and et al. Wide and deep learning for recommender systems. Proceedings of the 1st Workshop on Deep Learning for Recommender Systems - DLRS 2016, 2016

  54. [56]

    An Analytic Theory of Generalization Dynamics and Transfer Learning in Deep Linear Networks

    Andrew K. Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks, 2018

  55. [57]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019

  56. [58]

    ImageNet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012

  57. [59]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

  58. [60]

    Sequence to sequence learning with neural networks

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014

  59. [61]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

  60. [62]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016

  61. [63]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

    Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012

  62. [64]

    Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

    William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE, 2016

  63. [65]

    State-of-the-art speech recognition with sequence-to-sequence models

    Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018

  64. [66]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016

  65. [67]

    Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

    Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018

  66. [68]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017

  67. [69]

    Exploring generalization in deep learning, 2017

    Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning, 2017

  68. [70]

    Special-purpose digital hardware for neural networks: An architectural survey

    Paolo Ienne, Thierry Cornu, and Gary Kuhn. Special-purpose digital hardware for neural networks: An architectural survey. Journal of VLSI signal processing systems for signal, image and video technology, 13(1):5–25, 1996

  69. [71]

    Large-scale deep unsupervised learning using graphics processors

    Rajat Raina, Anand Madhavan, and Andrew Y Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning, pages 873–880, 2009

  70. [72]

    Deep, big, simple neural nets for handwritten digit recognition

    Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12):3207–3220, 2010

  71. [73]

    In-datacenter performance analysis of a tensor processing unit

    Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12, 2017

  72. [74]

    2019 Recent Trends in GPU Price per FLOPS

    2019 recent trends in GPU price per FLOPS. https://aiimpacts.org/2019-recent-trends-in-gpu-price-per-flops/. Accessed: 2020-06-05

  73. [75]

    Summarizing cpu and gpu design trends with product data

    Yifan Sun, Nicolas Bohm Agostini, Shi Dong, and David Kaeli. Summarizing cpu and gpu design trends with product data. arXiv preprint arXiv:1911.11313, 2019

  74. [76]

    Large scale distributed deep networks

    Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012

  75. [77]

    Theano: new features and speed improvements

    Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012

  76. [78]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

  77. [79]

    Scalable parallel programming with cuda

    John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, 2008

  78. [80]

    JAX: composable transformations of Python+NumPy programs

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs. 2018

  79. [81]

    Compiling machine learning programs via high-level tracing

    Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. In Machine Learning and Systems (MLSys), 2018

  80. [82]

    Beyond Data and Model Parallelism for Deep Neural Networks

    Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of the Conference on Systems and Machine Learning (SysML), Palo Alto, CA, 2019

Showing first 80 references.