{"total":16,"items":[{"citing_arxiv_id":"2605.18404","ref_index":28,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials","primary_cat":"cs.DC","submitted_at":"2026-05-18T13:45:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18083","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\\Delta$ Integration into Upcycled MoE","primary_cat":"cs.CL","submitted_at":"2026-05-18T08:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09922","ref_index":64,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:17:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00539","ref_index":90,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-01T09:39:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00380","ref_index":16,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-01T03:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"identify relative alignment. Let D={R i,t} denote their projection residuals and Q(D, γ) the empirical γ-quantile. We set qlow =Q(D, α), q high =Q(D, β),(15) where (α, β) define a robust range by replacing min/max with quantiles. We then compute a quantile-based min-max normalized residual score with clipping: zi,t = clamp \u0012 Ri,t −q low (qhigh −q low) +ϵ ,0,1 \u0013 ,(16) where ϵ >0 prevents division by zero. Finally, we map zi,t to a token-wise NSR weight in[ξ,1]via ωi,t =ξ+ (1−ξ)z i,t,(17) Table 2.Performance on code reasoning benchmarks using Qwen3- 4B. We report LiveCodeBench Avg/Pass@16, CodeForces Rat- ing/Percentile (Pct.), and HumanEval+ Pass@16. Model LiveCodeBench CodeForces HumanEval+ Avg/Pass@16 Rating (Pct."},{"citing_arxiv_id":"2605.04058","ref_index":143,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T08:00:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":176,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.03387","ref_index":149,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LIMO: Less is More for Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T17:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14164","ref_index":110,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2024-12-18T18:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17891","ref_index":90,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scaling Diffusion Language Models via Adaptation from Autoregressive Models","primary_cat":"cs.CL","submitted_at":"2024-10-23T14:04:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":89,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.17297","ref_index":120,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"InternLM2 Technical Report","primary_cat":"cs.CL","submitted_at":"2024-03-26T00:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":100,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.16867","ref_index":96,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Falcon Series of Open Language Models","primary_cat":"cs.CL","submitted_at":"2023-11-28T15:12:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.14509","ref_index":152,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models","primary_cat":"cs.LG","submitted_at":"2023-09-25T20:15:57+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2208.07339","ref_index":71,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale","primary_cat":"cs.LG","submitted_at":"2022-08-15T17:08:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}