pith. machine review for the scientific record.

A Theory on Adam Instability in Large-Scale Machine Learning

5 Pith papers cite this work. Polarity classification is still in progress.

5 Pith papers citing it

citation summary

  • verdicts: unverdicted 5

  • roles: method 1

  • polarities: use 1

citing papers explorer

Showing 5 of 5 citing papers.

  • Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes cs.LG · 2026-05-07 · unverdicted · none · ref 24 · 2 links

    Slingshot loss spikes arise from floating-point precision limits that round correct-class gradients to zero, breaking zero-sum constraints and driving exponential parameter growth through numerical feature inflation.

  • MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 23

MiniMax-M1 is a 456B-parameter hybrid-attention MoE model trained with CISPO reinforcement learning; it performs comparably to or better than DeepSeek-R1 and Qwen3-235B on reasoning and software-engineering tasks, completing training in three weeks on 512 GPUs.

  • Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 52

BAGEL is a unified decoder-only model that develops emergent complex multimodal reasoning abilities after pretraining on large-scale interleaved data, outperforming prior open-source unified models.

  • Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 21

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.

  • Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers cs.LG · 2026-05-09 · unverdicted · none · ref 25

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
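The precision failure described in the slingshot entry above can be sketched numerically: when the correct-class softmax probability gets close enough to 1, half-precision rounding snaps it to exactly 1.0, so the correct-class cross-entropy gradient rounds to zero while the wrong-class gradients stay positive, and the gradients no longer sum to zero. A minimal stdlib-only sketch (the logit values are assumed purely for illustration; this is not code from the cited paper):

```python
import math
import struct

def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (via struct 'e')."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Softmax cross-entropy gradient on logits: g_i = p_i - y_i.
# In exact arithmetic the gradients sum to zero: sum_i p_i - sum_i y_i = 1 - 1 = 0.
logits = [12.0, 0.0, 0.0]          # assumed values: a confidently-correct prediction
exps = [math.exp(l - max(logits)) for l in logits]
Z = sum(exps)
probs = [e / Z for e in exps]      # probs[0] ~ 1 - 1.2e-5
y = [1.0, 0.0, 0.0]                # one-hot target: class 0 is correct

grad64 = [p - t for p, t in zip(probs, y)]
grad16 = [fp16(p) - fp16(t) for p, t in zip(probs, y)]

# In half precision, probs[0] rounds to exactly 1.0 (spacing below 1 is 2^-11),
# so the correct-class gradient vanishes while the tiny wrong-class
# probabilities survive as subnormals: the zero-sum constraint breaks.
print(sum(grad64))   # essentially 0: zero-sum holds in double precision
print(grad16[0])     # exactly 0.0: correct-class gradient rounded away
print(sum(grad16))   # positive: zero-sum constraint broken
```

The key ingredient is only the rounding step, not the specific logits: any gap to 1 smaller than half the float16 spacing below 1 (about 2.4e-4) produces the same collapse.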