Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
Perez, and Andrew Fitzgib- bon
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Introduces a graphical calculus with nested graded tubes bridging tensor networks and computation graphs for einops, turning equivariance proofs into diagrammatic derivations and enabling efficient sparse attention via mask preprocessing.
Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.
RAVEN pretrains on over one million EHR sequences via recurrence-aware next-visit event prediction, enabling zero-shot disease incidence forecasting that rivals fine-tuned models and generalizes across cohorts.
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
deCIFer trains an autoregressive LM on 2.3 million structures with synthetic PXRD noise to generate CIF files, reporting 94% structural match rate on synthetic inorganic test sets.
Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
citing papers explorer
-
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
-
Graphical einops: bridging tensor networks and computation graphs
Introduces a graphical calculus with nested graded tubes bridging tensor networks and computation graphs for einops, turning equivariance proofs into diagrammatic derivations and enabling efficient sparse attention via mask preprocessing.
-
Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain
Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.
-
Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
RAVEN pretrains on over one million EHR sequences via recurrence-aware next-visit event prediction, enabling zero-shot disease incidence forecasting that rivals fine-tuned models and generalizes across cohorts.
-
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
-
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
InfiniPipe proposes elastic pipeline parallelism and stage-aware chunk-level adaptive checkpointing to achieve 1.69x speedup over state-of-the-art for variable-length long-context LLM training.
-
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
-
deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models
deCIFer trains an autoregressive LM on 2.3 million structures with synthetic PXRD noise to generate CIF files, reporting 94% structural match rate on synthetic inorganic test sets.
-
Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
-
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
-
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
ResiHP introduces a workload-aware failure detector and dynamic scheduler for hybrid-parallel LLM training that achieves 1.04-4.39x higher throughput than prior resilient systems under failures on a 256-GPU cluster.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.