DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Daya Guo; Haowei Zhang; Junxiao Song; Mingchuan Zhang; Peiyi Wang; Qihao Zhu; Runxin Xu; Xiao Bi; Y.K. Li; Y. Wu

arXiv: 2402.03300 · v3 · submitted 2024-02-05 · cs.CL · cs.AI · cs.LG

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

Pith reviewed 2026-05-24 03:20 UTC · model grok-4.3

keywords: mathematical reasoning · language models · continued pre-training · reinforcement learning · MATH benchmark · policy optimization · open source models · web data curation

The pith

DeepSeekMath 7B reaches 51.7% on the MATH benchmark by continuing pre-training on 120B curated web math tokens and applying Group Relative Policy Optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepSeekMath 7B, created by taking DeepSeek-Coder-Base-v1.5 7B and continuing its pre-training on 120 billion math-related tokens extracted from Common Crawl along with natural language and code data. It reports 51.7 percent accuracy on the competition-level MATH benchmark without external toolkits or voting, and 60.9 percent when applying self-consistency over 64 samples. The authors present this result as evidence that the combination of a careful web-data selection pipeline and the new Group Relative Policy Optimization method can produce strong mathematical reasoning in an open 7B model.

Core claim

DeepSeekMath 7B shows that continued pre-training on a large volume of curated math tokens from public web data, followed by reinforcement learning with Group Relative Policy Optimization, enables a 7B open model to reach 51.7 percent on the MATH benchmark and approach the level of closed frontier systems without relying on external tools or ensembles.

What carries the argument

Group Relative Policy Optimization (GRPO), a memory-efficient variant of Proximal Policy Optimization that scores groups of responses relative to one another, combined with a data selection pipeline that extracts and filters 120B math-related tokens from Common Crawl.
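
To make "scores groups of responses relative to one another" concrete, here is a minimal sketch of the group-relative advantage as GRPO is generally described: several responses are sampled for the same question, and each reward is normalized by the mean and standard deviation of its own group, so no separate value (critic) model is needed as a baseline. The function name, the `eps` guard, and the choice of population standard deviation are assumptions made for illustration, not details taken from this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt.

    `eps` is an illustrative guard against a zero standard deviation
    (all responses scored the same); it is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled answers to one question:
# 1.0 = judged correct, 0.0 = judged wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and the rest get a negative one. When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.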

If this is right

Self-consistency sampling over 64 responses raises MATH accuracy from 51.7 percent to 60.9 percent.
Open 7B models can reach performance close to closed models on competition math without tool use or voting.
GRPO reduces the memory footprint of PPO while still improving reasoning performance.
Public web data contains enough high-quality math content to support large-scale continued pre-training when filtered carefully.
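
The self-consistency figure above comes from sampling many solutions per problem and keeping the most common final answer. A minimal sketch of that voting step follows. It assumes final answers have already been extracted and normalized to comparable strings, which is a separate step not shown here, and the function name is hypothetical.

```python
from collections import Counter


def self_consistency_answer(final_answers):
    """Return the most frequent final answer among the sampled solutions."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    # most_common(1) yields the (answer, count) pair with the highest count;
    # ties resolve to the answer that was seen first.
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical extracted answers from 5 samples (the paper uses 64 per problem).
sampled = ["12", "12", "15", "12", "7"]
voted = self_consistency_answer(sampled)
```

The 51.7 percent figure corresponds to scoring a single answer per problem; the 60.9 percent figure corresponds to scoring the voted answer.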

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-curation approach could be tested on other structured reasoning domains such as code or physics problem solving.
GRPO might transfer to reinforcement learning settings outside mathematics where relative scoring within batches is feasible.
Further increases in the volume of filtered math tokens or model size could narrow the remaining gap to closed frontier systems.

Load-bearing premise

The performance on MATH is driven primarily by the data selection pipeline and the GRPO algorithm rather than by other details of the base model or training setup.

What would settle it

Train an otherwise identical 7B model on the same base checkpoint but without the math-data selection step or without GRPO and measure whether accuracy on MATH stays well below 51.7 percent.
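
One way to lay out that test is as a grid of training conditions that share the base checkpoint and differ only in the data step and the reinforcement learning step. The sketch below only enumerates the conditions. Every label in it is hypothetical and none of them is a configuration name from the paper.

```python
from itertools import product

# Hypothetical condition labels; none of these names come from the paper.
DATA_OPTIONS = ["no_math_data", "unfiltered_math_data", "filtered_math_data"]
RL_OPTIONS = ["no_rl", "ppo", "grpo"]


def ablation_grid(data_options, rl_options):
    """Cross every data condition with every RL condition."""
    return [{"data": d, "rl": r} for d, r in product(data_options, rl_options)]


grid = ablation_grid(DATA_OPTIONS, RL_OPTIONS)

# The reported recipe is one cell of the grid; the rest are the controls.
full_recipe = {"data": "filtered_math_data", "rl": "grpo"}
controls = [cond for cond in grid if cond != full_recipe]
```

Comparing MATH accuracy for `full_recipe` against each entry in `controls` would separate the contribution of data selection from the contribution of GRPO.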

Original abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeepSeekMath 7B, obtained by continued pre-training of DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens from Common Crawl plus natural language and code data. It reports 51.7% accuracy on the MATH benchmark (60.9% with 64-sample self-consistency) without external toolkits or voting, approaching Gemini-Ultra and GPT-4. The authors attribute the gains to a meticulously engineered data selection pipeline from web data and the introduction of Group Relative Policy Optimization (GRPO), a PPO variant that improves mathematical reasoning while reducing memory usage.

Significance. If the attribution to the data pipeline and GRPO is substantiated by controls, the result would demonstrate that open 7B models can reach near-closed-model performance on competition math through public-data curation and a memory-efficient RL variant, offering a reproducible route for advancing reasoning capabilities.

major comments (2)

[Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.
[Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.

minor comments (2)

[Abstract] The abstract states “120B math-related tokens” but does not clarify the total token count, the exact mix of math/NL/code, or the filtering criteria used in the pipeline.
[Method section on GRPO] Notation for GRPO (reward formulation, group size, KL coefficient) should be defined with explicit equations in the method section to allow direct comparison with standard PPO.
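
For orientation, the quantities named in the second minor comment are usually written as follows when GRPO is described. This is a sketch in common notation and not a quotation of the manuscript: $G$ is the group size, $r_i$ the scalar reward of response $o_i$ to question $q$, $\varepsilon$ the PPO clipping range, $\beta$ the KL coefficient, and $\pi_{\mathrm{ref}}$ a frozen reference policy.

```latex
% Group-relative advantage shared by every token t of response o_i
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}
                     {\operatorname{std}(\{r_1,\dots,r_G\})}

% Per-token probability ratio between the current and the sampling policy
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
                  {\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}

% Clipped surrogate with a KL penalty toward the reference policy
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}
\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\Big( \min\big( \rho_{i,t}\,\hat{A}_{i,t},\
      \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_{i,t} \big)
      - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big] \Big) \right]
```

Compared with standard PPO, the learned value-function baseline is replaced by the group statistics in $\hat{A}_{i,t}$, which is where the memory saving comes from.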

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns about missing ablations and table information below, and will make appropriate revisions to strengthen the manuscript.

Point-by-point responses

Referee: [Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.

Authors: We acknowledge the importance of ablations to substantiate the claims. In the revised manuscript, we will add the performance of the base model DeepSeek-Coder-Base-v1.5 7B on the MATH benchmark. For comparisons involving standard SFT or PPO, and the pipeline without filtering, these experiments were not conducted due to computational constraints. We will provide additional discussion on the rationale behind GRPO and the data curation process to better support the attribution.
revision: partial

Referee: [Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.

Authors: We agree that including these baselines will improve clarity. We will update the tables in the results section to include the base model score and clarify the contributions from continued pre-training.
revision: yes

standing simulated objections not resolved

Full ablation studies on the effects of the data filtering pipeline, and direct comparisons between GRPO and standard PPO, remain unresolved because they require new experiments not present in the original work.

Circularity Check

0 steps flagged

No circularity; empirical training results with no self-referential derivation.

Full rationale

The paper reports benchmark scores from continued pretraining of an existing base model (DeepSeek-Coder-Base-v1.5 7B) on 120B tokens followed by GRPO fine-tuning. The central claims are measured outcomes (51.7% MATH) and an attribution to two engineering choices (data pipeline + GRPO). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that close a logical loop appear in the abstract or stated claims. Attribution without ablations is a weakness of evidence, not circularity by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities beyond the high-level description of GRPO as a variant of PPO.

pith-pipeline@v0.9.0 · 5753 in / 1176 out tokens · 60175 ms · 2026-05-24T03:20:21.561498+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
cs.AI 2026-04 conditional novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
cs.LG 2026-05 conditional novelty 8.0

DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
cs.LG 2026-05 conditional novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
cs.LG 2026-05 conditional novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
cs.CR 2026-05 unverdicted novelty 8.0

STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
cs.SE 2026-04 unverdicted novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...
From Context to Skills: Can Language Models Learn from Context Skillfully?
cs.AI 2026-04 unverdicted novelty 8.0

Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV 2026-04 unverdicted novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
cs.CL 2026-04 unverdicted novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
cs.CV 2026-04 unverdicted novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
GIANTS: Generative Insight Anticipation from Scientific Literature
cs.CL 2026-04 unverdicted novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI 2026-03 conditional novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
SEVerA: Verified Synthesis of Self-Evolving Agents
cs.LG 2026-03 unverdicted novelty 8.0

SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
cs.LG 2026-03 unverdicted novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
cs.CR 2025-09 conditional novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
cs.CL 2025-04 conditional novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
ETCHR: Editing To Clarify and Harness Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
cs.CV 2026-05 unverdicted novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
cs.AI 2026-05 unverdicted novelty 7.0

Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
cs.AI 2026-05 unverdicted novelty 7.0

EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
Visual-Advantage On-Policy Distillation for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV 2026-05 conditional novelty 7.0

CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% s...
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
cs.CV 2026-05 unverdicted novelty 7.0

Seizure-Semiology-Suite provides a new clinically annotated video dataset and hierarchical benchmark that exposes weaknesses in current MLLMs for seizure semiology and demonstrates gains from fine-tuning and a neuro-s...
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
cs.CV 2026-05 unverdicted novelty 7.0

GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
cs.CV 2026-05 conditional novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
cs.CV 2026-05 unverdicted novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
Grounding Driving VLA via Inverse Kinematics
cs.CV 2026-05 conditional novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v...
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
cs.CV 2026-05 unverdicted novelty 7.0

Draw2Think recasts geometric reasoning as agentic interaction with a constraint engine, achieving 95.9% predicate-level construction fidelity and up to 16.4% accuracy gains on solid geometry tasks.
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
cs.LG 2026-05 unverdicted novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
cs.LG 2026-05 unverdicted novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
RECIPE: Procedural Planning via Grounding in Instructional Video
cs.CV 2026-05 unverdicted novelty 7.0

RECIPE improves visual procedural planners by rewarding plans according to their grounding quality in ASR transcripts via GRPO, yielding +7–8 in-domain and up to +16 zero-shot macro-accuracy gains over base models and...
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

A new dual-protocol expert benchmark for image aesthetics is fused into ground truth and used to self-distill a VLM, raising SRCC from 0.504 to 0.709 across categories while matching closed-source performance.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 conditional novelty 7.0

PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer...
Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning
cs.SD 2026-05 unverdicted novelty 7.0

ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
cs.LG 2026-05 conditional novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
cs.CV 2026-05 unverdicted novelty 7.0

LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Aurora: Unified Video Editing with a Tool-Using Agent
cs.CV 2026-05 unverdicted novelty 7.0

Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation
cs.AI 2026-05 unverdicted novelty 7.0

PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generatio...
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE
cs.CL 2026-05 unverdicted novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
cs.CV 2026-05 unverdicted novelty 7.0

IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
cs.LG 2026-05 unverdicted novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
Weak-to-Strong Elicitation via Mismatched Wrong Drafts
cs.CL 2026-05 conditional novelty 7.0

Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
cs.AI 2026-05 unverdicted novelty 7.0

Autonomous AI agents outperform humans in supply chain simulations but exhibit an inherent agent bullwhip effect of amplified decision unreliability, mitigated by GRPO reinforcement learning post-training.
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1265 Pith papers · 29 internal anchors

[1]

R. Anil, S. Borgeaud, Y. Wu, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. P. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805 2023
[2]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Llemma: An Open Language Model For Mathematics

Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Burns, P

C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

work page arXiv 2023
[6]

Chatglm3 series: Open bilingual chat llms, 2023

ChatGLM3 Team . Chatglm3 series: Open bilingual chat llms, 2023. URL https://github.com/THUDM/ChatGLM3

work page 2023
[7]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

arXiv, 2021
[8]

W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, abs/2211.12588, 2022. doi:10.48550/ARXIV.2211.12588. URL https://doi.org/10.48550/arXiv.2211.12588

[9]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[10]

Together Computer. Redpajama: an open dataset for training large language models, Oct. 2023. URL https://github.com/togethercomputer/RedPajama-Data

[11]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. doi:10.48550/ARXIV.2401.02954. URL https://doi.org/10.48550/arXiv.2401.02954

[12]

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022

[13]

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: program-aided language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research,...

[14]

Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452, 2023. doi:10.48550/ARXIV.2309.17452. URL https://doi.org/10.48550/arXiv.2309.17452

[15]

D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024

[16]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[17]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

[18]

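Hai-llm: An efficient and lightweight training tool for large models, 2023

High-flyer. Hai-llm: An efficient and lightweight training tool for large models (高效且轻量的大模型训练工具), 2023. URL https://www.high-flyer.cn/en/blog/hai-llm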

[19]

Inflection-2, 2023

Inflection AI. Inflection-2, 2023. URL https://inflection.ai/inflection-2

[20]

A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022

[21]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

[22]

FastText.zip: Compressing text classification models

A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016

[23]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[24]

Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274--19286. PMLR, 2023

[25]

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843--3857, 2022a

[26]

A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processi...

[27]

Let's Verify Step by Step

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

[28]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[29]

H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023

[30]

S. Mishra, M. Finlayson, P. Lu, L. Tang, S. Welleck, C. Baral, T. Rajpurohit, O. Tafjord, A. Sabharwal, P. Clark, and A. Kalyan. LILA: A unified benchmark for mathematical reasoning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab...

doi:10.18653/v1/2022.emnlp-main.392
[31]


X. Nguyen, W. Zhang, X. Li, M. M. Aljunied, Q. Tan, L. Cheng, G. Chen, Y. Deng, S. Yang, C. Liu, H. Zhang, and L. Bing. Seallms - large language models for southeast asia. CoRR, abs/2312.00738, 2023. doi:10.48550/ARXIV.2312.00738. URL https://doi.org/10.48550/arXiv.2312.00738

[32]

GPT-4 Technical Report

OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

[33]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

[34]

K. Paster, M. D. Santos, Z. Azerbayev, and J. Ba. Openwebmath: An open dataset of high-quality mathematical web text. CoRR, abs/2310.06786, 2023. doi:10.48550/ARXIV.2310.06786. URL https://doi.org/10.48550/arXiv.2310.06786

[35]

L. C. Paulson. Three years of experience with sledgehammer, a practical link between automatic and interactive theorem provers. In R. A. Schmidt, S. Schulz, and B. Konev, editors, Proceedings of the 2nd Workshop on Practical Aspects of Automated Reasoning, PAAR-2010, Edinburgh, Scotland, UK, July 14, 2010, volume 9 of EPiC Series in Computing, pages 1--10...

doi:10.29007/tnfd
[36]

Generative Language Modeling for Automated Theorem Proving

S. Polu and I. Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393

[37]

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

[38]

J. Schulman. Approximating kl divergence, 2020. URL http://joschu.net/blog/kl-approx.html

[39]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

[40]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[41]

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR...

[42]

F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023

[43]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

[44]

T. Tao. Embracing change and resetting expectations, 2023. URL https://unlocked.microsoft.com/ai-anthology/terence-tao/

[45]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Ko...

doi:10.48550/arxiv.2307.09288, 2023
[46]

T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476--482, 2024

[47]

P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144, 2023a

[48]

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023b

[49]

Z. Wang, R. Xia, and P. Liu. Generative AI for math: Part I - mathpile: A billion-token-scale pretraining corpus for math. CoRR, abs/2312.17120, 2023c. doi:10.48550/ARXIV.2312.17120. URL https://doi.org/10.48550/arXiv.2312.17120

[50]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

[51]

T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023

[52]

M. Wenzel, L. C. Paulson, and T. Nipkow. The isabelle framework. In O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors, Theorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume 5170 of Lecture Notes in Computer Science, pages 33--38. Springer, 2008. doi:10.1007/978-3-5...

doi:10.1007/978-3-540-71067-7
[53]

H. Xia, T. Ge, P. Wang, S.-Q. Chen, F. Wei, and Z. Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909--3925, Singapore, Dec. 2023. Association for Computational Linguistics. doi:10.18...

doi:10.18653/v1/2023.findings-emnlp.257
[54]

H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024

[55]

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

[56]

L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. Metamath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284, 2023. doi:10.48550/ARXIV.2309.12284. URL https://doi.org/10.48550/arXiv.2309.12284

[57]

Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023a

[58]

Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023b

[59]

X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653, 2023. doi:10.48550/ARXIV.2309.05653. URL https://doi.org/10.48550/arXiv.2309.05653

[60]

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

K. Zheng, J. M. Han, and S. Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021

[61]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi:10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364

