hub Mixed citations

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin · 2016 · cs.LG · arXiv 1604.06174

Mixed citation behavior. Most common role is background (52%).

71 Pith papers citing it

Background 52% of classified citations

open full Pith review browse 71 citing papers arXiv PDF

abstract

We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 method 8 dataset 1

citation-polarity summary

background 11 use method 8 unclear 1 use dataset 1

claims ledger

abstract We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph a

co-cited works

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

A matrix-free, differentiable PyTorch solver for phase-field fracture: Formulation, benchmarks, and inverse analysis

cs.CE · 2026-06-22 · unverdicted · novelty 7.0

A matrix-free, GPU-compatible PyTorch implementation of phase-field fracture with explicit dynamics, custom differentiable implicit damage solve, benchmarks on dynamic and quasi-static cases, and inverse recovery of fracture energy G_c via L-BFGS.

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

cs.LG · 2026-05-15 · unverdicted · novelty 7.0 · 2 refs

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

Efficient and provably convergent end-to-end training of deep neural networks with linear constraints

math.OC · 2026-05-12 · unverdicted · novelty 7.0

An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Data-guided finite-volume PINNs for 2D shallow water equations avoid trivial low-momentum collapse via sparse measurements, achieving up to 22x error reduction on benchmarks and accurate surrogates on real river data.

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.

ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations

cs.DC · 2026-05-07 · conditional · novelty 7.0

ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency than finite differences on models with up to 1.9 million latent variables.

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

cs.CL · 2024-12-22 · unverdicted · novelty 7.0

GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

cs.LG · 2024-03-06 · conditional · novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.

Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations

cs.LG · 2024-02-27 · unverdicted · novelty 7.0

HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.

Moonwalk: Inverse-Forward Differentiation

cs.LG · 2024-02-22 · unverdicted · novelty 7.0

Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop runtime at over twice the depth.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

Ring Attention with Blockwise Transformers for Near-Infinite Context

cs.CL · 2023-10-03 · unverdicted · novelty 7.0

Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.

Efficient Memory Management for Large Language Model Serving with PagedAttention

cs.LG · 2023-09-12 · conditional · novelty 7.0

PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

QLoRA: Efficient Finetuning of Quantized LLMs

cs.LG · 2023-05-23 · conditional · novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

cs.LG · 2022-08-15 · conditional · novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

cs.LG · 2022-05-27 · accept · novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

citing papers explorer

Showing 50 of 71 citing papers.

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing cs.DC · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion cs.CV · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
MIRAGE discovers semantic attacks on online HD map construction via conditional diffusion, enabling boundary removal and injection that degrade AV performance while passing as realistic environmental changes.
Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 4 · internal anchor
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
A matrix-free, differentiable PyTorch solver for phase-field fracture: Formulation, benchmarks, and inverse analysis cs.CE · 2026-06-22 · unverdicted · none · ref 40 · internal anchor
A matrix-free, GPU-compatible PyTorch implementation of phase-field fracture with explicit dynamics, custom differentiable implicit damage solve, benchmarks on dynamic and quasi-static cases, and inverse recovery of fracture energy G_c via L-BFGS.
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking cs.LG · 2026-05-15 · unverdicted · none · ref 3 · 2 links · internal anchor
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
Efficient and provably convergent end-to-end training of deep neural networks with linear constraints math.OC · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
An efficiently computable HS-Jacobian acts as a conservative mapping for projections onto polyhedral sets, supporting provably convergent Adam-based end-to-end training of linearly constrained deep neural networks.
Locking Pretrained Weights via Deep Low-Rank Residual Distillation cs.LG · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance cs.LG · 2026-05-09 · unverdicted · none · ref 40 · internal anchor
Data-guided finite-volume PINNs for 2D shallow water equations avoid trivial low-momentum collapse via sparse measurements, achieving up to 22x error reduction on benchmarks and accurate surrogates on real river data.
Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval cs.CV · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
ADELIA: Automatic Differentiation for Efficient Laplace Inference Approximations cs.DC · 2026-05-07 · conditional · none · ref 30 · internal anchor
ADELIA is the first AD-enabled INLA system that computes exact hyperparameter gradients via a structure-exploiting multi-GPU backward pass, delivering 4.2-7.9x per-gradient speedups and 5-8x better energy efficiency than finite differences on models with up to 1.9 million latent variables.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing cs.SE · 2026-04-21 · unverdicted · none · ref 12 · internal anchor
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization cs.LG · 2026-04-14 · unverdicted · none · ref 13 · internal anchor
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs cs.CL · 2024-12-22 · unverdicted · none · ref 5 · internal anchor
GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection cs.LG · 2024-03-06 · conditional · none · ref 5 · internal anchor
GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations cs.LG · 2024-02-27 · unverdicted · none · ref 16 · internal anchor
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
Moonwalk: Inverse-Forward Differentiation cs.LG · 2024-02-22 · unverdicted · none · ref 9 · internal anchor
Moonwalk enables memory-efficient training of deep networks via mixed-mode gradient computation with vector-inverse-Jacobian products for submersive layers and fragmental checkpointing otherwise, matching backprop runtime at over twice the depth.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 78 · internal anchor
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
Ring Attention with Blockwise Transformers for Near-Infinite Context cs.CL · 2023-10-03 · unverdicted · none · ref 5 · internal anchor
Ring Attention uses blockwise computation and ring communication to let Transformers process sequences up to device-count times longer than prior memory-efficient methods.
Efficient Memory Management for Large Language Model Serving with PagedAttention cs.LG · 2023-09-12 · conditional · none · ref 7 · internal anchor
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
QLoRA: Efficient Finetuning of Quantized LLMs cs.LG · 2023-05-23 · conditional · none · ref 9 · internal anchor
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale cs.LG · 2022-08-15 · conditional · none · ref 75 · internal anchor
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness cs.LG · 2022-05-27 · accept · none · ref 10 · internal anchor
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on sequences up to 64K long.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 138 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Longformer: The Long-Document Transformer cs.CL · 2020-04-10 · accept · none · ref 80 · internal anchor
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models cs.LG · 2019-10-04 · accept · none · ref 7 · internal anchor
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 3 · internal anchor
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism cs.CL · 2019-09-17 · unverdicted · none · ref 4 · internal anchor
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
Generating Long Sequences with Sparse Transformers cs.LG · 2019-04-23 · unverdicted · none · ref 4 · internal anchor
Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning cs.LG · 2026-05-20 · conditional · none · ref 6 · internal anchor
ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
Towards Understanding Self-Pretraining for Sequence Classification cs.LG · 2026-05-20 · unverdicted · none · ref 180 · internal anchor
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving cs.CV · 2026-05-19 · unverdicted · none · ref 60 · internal anchor
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces cs.LG · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
LBI enables tractable parallel backpropagation by reducing inter-region adjoint computation to low-dimensional r x r Jacobians while preserving exact gradients under a bounded-interface model.
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training cs.LG · 2026-05-08 · unverdicted · none · ref 126 · internal anchor
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs cs.CL · 2026-05-01 · unverdicted · none · ref 98 · 2 links · internal anchor
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring cs.CV · 2026-04-28 · conditional · none · ref 7 · 2 links · internal anchor
SIEVES improves selective prediction coverage by up to 3x on OOD VQA benchmarks by training a selector to score the quality of visual evidence produced by reasoner models, generalizing across benchmarks and proprietary models without internal access or per-task retraining.
Quantum Dynamics via Score Matching on Bohmian Trajectories quant-ph · 2026-04-28 · unverdicted · none · ref 19 · internal anchor
Neural networks learn the score of the probability density on Bohmian trajectories to recover exact Schrödinger dynamics via self-consistent minimization for nodeless wave functions, demonstrated on double-well splitting and Morse chain vibrations.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study cs.SE · 2026-04-27 · unverdicted · none · ref 6 · internal anchor
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Streaming Structured Inference with Flash-SemiCRF cs.LG · 2026-04-20 · unverdicted · none · ref 21 · internal anchor
Flash-SemiCRF enables exact semi-CRF inference on long sequences by evaluating edge potentials from compact prefix sums and streaming the forward-backward pass while preserving exact gradients.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 10 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation cs.CV · 2026-04-10 · unverdicted · none · ref 12 · internal anchor
MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning cs.LG · 2026-04-10 · unverdicted · none · ref 92 · internal anchor
MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.
GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training cs.LG · 2026-02-23 · unverdicted · none · ref 6 · internal anchor
GeoPT pre-trains on over one million geometry samples augmented with synthetic dynamics to improve neural physics simulators on fluid and solid mechanics benchmarks while reducing labeled data needs by 20-60% and accelerating convergence by 2x.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 34 · internal anchor
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure cs.LG · 2025-09-23 · unverdicted · none · ref 7 · internal anchor
CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction cs.IR · 2025-09-22 · unverdicted · none · ref 8 · internal anchor
MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
SpikingBrain: Spiking Brain-inspired Large Models cs.LG · 2025-09-05 · unverdicted · none · ref 3 · internal anchor
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation cs.LG · 2025-06-02 · conditional · none · ref 3 · internal anchor
MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps cs.CV · 2025-01-16 · conditional · none · ref 9 · internal anchor
Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.
GWT: Scalable Optimizer State Compression for Large Language Model Training cs.LG · 2025-01-13 · unverdicted · none · ref 5 · internal anchor
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.

Training Deep Nets with Sublinear Memory Cost

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer