DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Pith reviewed 2026-05-24 03:20 UTC · model grok-4.3
The pith
DeepSeekMath 7B reaches 51.7% on the MATH benchmark by continuing pre-training on 120B curated web math tokens and applying Group Relative Policy Optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepSeekMath 7B shows that continued pre-training on a large volume of curated math tokens from public web data, followed by reinforcement learning with Group Relative Policy Optimization, enables a 7B open model to reach 51.7 percent on the MATH benchmark and approach the level of closed frontier systems without relying on external tools or ensembles.
What carries the argument
Group Relative Policy Optimization (GRPO), a memory-efficient variant of Proximal Policy Optimization that scores groups of responses relative to one another, combined with a data selection pipeline that extracts and filters 120B math-related tokens from Common Crawl.
If this is right
- Self-consistency sampling over 64 responses raises MATH accuracy from 51.7 percent to 60.9 percent.
- Open 7B models can reach performance close to closed models on competition math without tool use or voting.
- GRPO reduces the memory footprint of PPO while still improving reasoning performance.
- Public web data contains enough high-quality math content to support large-scale continued pre-training when filtered carefully.
Where Pith is reading between the lines
- The same data-curation approach could be tested on other structured reasoning domains such as code or physics problem solving.
- GRPO might transfer to reinforcement learning settings outside mathematics where relative scoring within batches is feasible.
- Further increases in the volume of filtered math tokens or model size could narrow the remaining gap to closed frontier systems.
Load-bearing premise
The performance on MATH is driven primarily by the data selection pipeline and the GRPO algorithm rather than by other details of the base model or training setup.
What would settle it
Train an otherwise identical 7B model on the same base checkpoint but without the math-data selection step or without GRPO and measure whether accuracy on MATH stays well below 51.7 percent.
read the original abstract
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DeepSeekMath 7B, obtained by continued pre-training of DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens from Common Crawl plus natural language and code data. It reports 51.7% accuracy on the MATH benchmark (60.9% with 64-sample self-consistency) without external toolkits or voting, approaching Gemini-Ultra and GPT-4. The authors attribute the gains to a meticulously engineered data selection pipeline from web data and the introduction of Group Relative Policy Optimization (GRPO), a PPO variant that improves mathematical reasoning while reducing memory usage.
Significance. If the attribution to the data pipeline and GRPO is substantiated by controls, the result would demonstrate that open 7B models can reach near-closed-model performance on competition math through public-data curation and a memory-efficient RL variant, offering a reproducible route for advancing reasoning capabilities.
major comments (2)
- [Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.
- [Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.
minor comments (2)
- [Abstract] The abstract states “120B math-related tokens” but does not clarify the total token count, the exact mix of math/NL/code, or the filtering criteria used in the pipeline.
- [Method section on GRPO] Notation for GRPO (reward formulation, group size, KL coefficient) should be defined with explicit equations in the method section to allow direct comparison with standard PPO.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns about missing ablations and table information below, and will make appropriate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.
Authors: We acknowledge the importance of ablations to substantiate the claims. In the revised manuscript, we will add the performance of the base model DeepSeek-Coder-Base-v1.5 7B on the MATH benchmark. For comparisons involving standard SFT or PPO, and the pipeline without filtering, these experiments were not conducted due to computational constraints. We will provide additional discussion on the rationale behind GRPO and the data curation process to better support the attribution. revision: partial
-
Referee: [Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.
Authors: We agree that including these baselines will improve clarity. We will update the tables in the results section to include the base model score and clarify the contributions from continued pre-training. revision: yes
- Full ablation studies on the effects of the data filtering pipeline and direct comparisons between GRPO and standard PPO, as these require new experiments not present in the original work.
Circularity Check
No circularity; empirical training results with no self-referential derivation.
full rationale
The paper reports benchmark scores from continued pretraining of an existing base model (DeepSeek-Coder-Base-v1.5 7B) on 120B tokens followed by GRPO fine-tuning. The central claims are measured outcomes (51.7% MATH) and an attribution to two engineering choices (data pipeline + GRPO). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that close a logical loop appear in the abstract or stated claims. Attribution without ablations is a weakness of evidence, not circularity by construction. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
-
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
-
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
SEVerA: Verified Synthesis of Self-Evolving Agents
SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
-
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
-
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
-
ETCHR: Editing To Clarify and Harness Reasoning
A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
-
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
-
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
-
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
-
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
-
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
-
Visual-Advantage On-Policy Distillation for Vision-Language Models
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
-
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% s...
-
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...
-
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
-
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
-
Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
-
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
-
Grounding Driving VLA via Inverse Kinematics
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v...
-
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
-
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
-
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
-
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
-
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
-
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
-
RECIPE: Procedural Planning via Grounding in Instructional Video
RECIPE improves visual procedural planners by rewarding plans according to their grounding quality in ASR transcripts via GRPO, yielding +7–8 in-domain and up to +16 zero-shot macro-accuracy gains over base models and...
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
-
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
A new dual-protocol expert benchmark for image aesthetics is fused into ground truth and used to self-distill a VLM, raising SRCC from 0.504 to 0.709 across categories while matching closed-source performance.
-
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer...
-
Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning
ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
-
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
-
Vision Harnessing Agent for Open Ad-hoc Segmentation
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
-
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
-
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
-
Aurora: Unified Video Editing with a Tool-Using Agent
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
-
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
-
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
-
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
-
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.
-
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
-
Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.
-
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
-
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Autonomous AI agents outperform humans in supply chain simulations but exhibit an inherent agent bullwhip effect of amplified decision unreliability, mitigated by GRPO reinforcement learning post-training.
-
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...
Reference graph
Works this paper leans on
-
[1]
R. Anil, S. Borgeaud, Y. Wu, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. P. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
-
[2]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Llemma: An Open Language Model For Mathematics
Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [5]
-
[6]
Chatglm3 series: Open bilingual chat llms, 2023
ChatGLM3 Team . Chatglm3 series: Open bilingual chat llms, 2023. URL https://github.com/THUDM/ChatGLM3
work page 2023
-
[7]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, abs/2211.12588, 2022. doi:10.48550/ARXIV.2211.12588. URL https://doi.org/10.48550/arXiv.2211.12588
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.12588 2022
-
[9]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [10]
-
[11]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. doi:10.48550/ARXIV.2401.02954. URL https://doi.org/10.48550/arXiv.2401.02954
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.02954 2024
-
[12]
Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022
work page 2022
-
[13]
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: program-aided language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , volume 202 of Proceedings of Machine Learning Research,...
work page 2023
-
[14]
Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452, 2023. doi:10.48550/ARXIV.2309.17452. URL https://doi.org/10.48550/arXiv.2309.17452
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.17452 2023
-
[15]
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024
work page 2024
-
[16]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[17]
Measuring Mathematical Problem Solving With the MATH Dataset
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
High-flyer. Hai-llm: 高效且轻量的大模型训练工具, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm
work page 2023
-
[19]
Inflection AI . Inflection-2, 2023. URL https://inflection.ai/inflection-2
work page 2023
- [20]
-
[21]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
FastText.zip: Compressing text classification models
A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. J \'e gou, and T. Mikolov. Fasttext. zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[23]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[24]
Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274--19286. PMLR, 2023
work page 2023
-
[25]
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35: 0 3843--3857, 2022 a
work page 2022
-
[26]
A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman - Solo, Y. Wu, B. Neyshabur, G. Gur - Ari, and V. Misra. Solving quantitative reasoning problems with language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processi...
work page 2022
-
[27]
H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Decoupled Weight Decay Regularization
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[29]
H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
S. Mishra, M. Finlayson, P. Lu, L. Tang, S. Welleck, C. Baral, T. Rajpurohit, O. Tafjord, A. Sabharwal, P. Clark, and A. Kalyan. LILA: A unified benchmark for mathematical reasoning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab...
-
[31]
X. Nguyen, W. Zhang, X. Li, M. M. Aljunied, Q. Tan, L. Cheng, G. Chen, Y. Deng, S. Yang, C. Liu, H. Zhang, and L. Bing. Seallms - large language models for southeast asia. CoRR, abs/2312.00738, 2023. doi:10.48550/ARXIV.2312.00738. URL https://doi.org/10.48550/arXiv.2312.00738
-
[32]
OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [33]
-
[34]
K. Paster, M. D. Santos, Z. Azerbayev, and J. Ba. Openwebmath: An open dataset of high-quality mathematical web text. CoRR, abs/2310.06786, 2023. doi:10.48550/ARXIV.2310.06786. URL https://doi.org/10.48550/arXiv.2310.06786
-
[35]
L. C. Paulson. Three years of experience with sledgehammer, a practical link between automatic and interactive theorem provers. In R. A. Schmidt, S. Schulz, and B. Konev, editors, Proceedings of the 2nd Workshop on Practical Aspects of Automated Reasoning, PAAR-2010, Edinburgh, Scotland, UK, July 14, 2010, volume 9 of EPiC Series in Computing, pages 1--10...
-
[36]
Generative Language Modeling for Automated Theorem Proving
S. Polu and I. Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[37]
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023
work page 2023
- [38]
-
[39]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[40]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR...
work page 2023
- [42]
-
[43]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
M. Suzgun, N. Scales, N. Sch \"a rli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
T. Tao. Embracing change and resetting expectations, 2023. URL https://unlocked.microsoft.com/ai-anthology/terence-tao/
work page 2023
-
[45]
Llama 2: Open Foundation and Fine-Tuned Chat Models
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton - Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Ko...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
-
[46]
T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 625 0 (7995): 0 476--482, 2024
work page 2024
- [47]
-
[48]
P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023 b
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Z. Wang, R. Xia, and P. Liu. Generative AI for math: Part I - mathpile: A billion-token-scale pretraining corpus for math. CoRR, abs/2312.17120, 2023 c . doi:10.48550/ARXIV.2312.17120. URL https://doi.org/10.48550/arXiv.2312.17120
-
[50]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper\_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
work page 2022
-
[51]
T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023
work page 2023
-
[52]
M. Wenzel, L. C. Paulson, and T. Nipkow. The isabelle framework. In O. A. Mohamed, C. A. Mu \ n oz, and S. Tahar, editors, Theorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume 5170 of Lecture Notes in Computer Science, pages 33--38. Springer, 2008. doi:10.1007/978-3-5...
-
[53]
H. Xia, T. Ge, P. Wang, S.-Q. Chen, F. Wei, and Z. Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909--3925, Singapore, Dec. 2023. Association for Computational Linguistics. doi:10.18...
- [54]
-
[55]
S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. Metamath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284, 2023. doi:10.48550/ARXIV.2309.12284. URL https://doi.org/10.48550/arXiv.2309.12284
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.12284 2023
-
[57]
Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023 a
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [58]
-
[59]
X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653, 2023. doi:10.48550/ARXIV.2309.05653. URL https://doi.org/10.48550/arXiv.2309.05653
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.05653 2023
-
[60]
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
K. Zheng, J. M. Han, and S. Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[61]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval : A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi:10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.06364 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.