WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao · 2023 · cs.CL · arXiv 2308.09583

Mixed citation behavior. Most common role is background (67%).

47 Pith papers citing it

Background 67% of classified citations

abstract

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are pre-trained only on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details, refer to https://github.com/nlpxucan/WizardLM

citation-role summary

background 3 baseline 2 method 1

citation-polarity summary

background 4 baseline 2
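
The page does not say how the 67% background share is derived. A minimal sketch, assuming the share is simply a label's count divided by the total number of classified citations and rounded to the nearest percent: under that assumption the polarity counts (4 of 6) reproduce 67%, while the role counts (3 of 6) would give 50%. The function name `label_shares` is hypothetical and used only for illustration.

```python
def label_shares(counts):
    """Return each label's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total) for label, n in counts.items()}


# Counts copied from the two summaries on this page.
role_counts = {"background": 3, "baseline": 2, "method": 1}
polarity_counts = {"background": 4, "baseline": 2}

role_shares = label_shares(role_counts)
polarity_shares = label_shares(polarity_counts)

print("role shares:", role_shares)
print("polarity shares:", polarity_shares)
```

Only six citations carry a label in either summary, so these percentages describe a small classified subset of the citing papers rather than all of them.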

representative citing papers

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

CRPS synthesizes reasoning paths by contrasting high- and low-quality MCTS trajectories, enabling models trained on 60K examples to match or exceed those trained on 590K standard examples with better out-of-domain generalization.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · novelty 7.0

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

cs.CR · 2025-09-30 · unverdicted · novelty 7.0

SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

cs.CV · 2024-03-21 · conditional · novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

CodeMind: Evaluating Large Language Models for Code Reasoning

cs.SE · 2024-02-15 · unverdicted · novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.

Learning Process Rewards via Success Visitation Matching for Efficient RL

cs.LG · 2026-06-22 · unverdicted · novelty 6.0

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance

cs.IR · 2026-05-29 · unverdicted · novelty 6.0

Graph-GRPO builds a dependency graph over CoT steps and propagates outcome rewards to enable finer credit assignment in generative relevance modeling for e-commerce search.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

cs.CL · 2026-05-19 · conditional · novelty 6.0

DEL is a new loss for LLM numerical learning that applies supervised digit entropy optimization and extends to floating-point numbers, showing improved accuracy and distance metrics over prior methods on math benchmarks.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.

Distribution Corrected Offline Data Distillation for Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

cs.AI · 2025-11-17 · unverdicted · novelty 6.0

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

cs.CL · 2025-08-21 · unverdicted · novelty 6.0

Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

citing papers explorer

Showing 38 of 38 citing papers after filters.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

cs.CL · 2026-06-11 · unverdicted · none · ref 10 · internal anchor

SENTINEL generates targeted tasks from model failures in a Controller-Proposer-Solver loop, raising Pass^1 from 66.4 to 74.9 on Tau2-Bench Retail and outperforming standard RL.

HARP: Efficient Data Selection for Finetuning Large Language Models

cs.LG · 2026-06-05 · unverdicted · none · ref 38 · internal anchor

HARP is a train-based data selector for LLM finetuning that uses hierarchical active region pruning and empirical Bayes posteriors to achieve up to 8.9 point gains with roughly 7 times fewer training examples.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · none · ref 34 · internal anchor

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

cs.LG · 2026-05-12 · unverdicted · none · ref 13 · internal anchor

Semantic consensus on model outputs for public prompts enables federated LLM fine-tuning that matches parameter-aggregation baselines with orders-of-magnitude lower communication.

Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories

cs.AI · 2026-04-13 · unverdicted · none · ref 3 · internal anchor

CRPS synthesizes reasoning paths by contrasting high- and low-quality MCTS trajectories, enabling models trained on 60K examples to match or exceed those trained on 590K standard examples with better out-of-domain generalization.

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

cs.AI · 2025-12-21 · unverdicted · none · ref 13 · internal anchor

CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · none · ref 31 · internal anchor

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From

cs.CR · 2025-09-30 · unverdicted · none · ref 4 · internal anchor

SeedPrints fingerprints LLMs using persistent biases from initialization seeds for lineage verification across pretraining and adaptation stages.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · none · ref 69 · internal anchor

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

CodeMind: Evaluating Large Language Models for Code Reasoning

cs.SE · 2024-02-15 · unverdicted · none · ref 11 · internal anchor

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

cs.CL · 2026-06-30 · unverdicted · none · ref 98 · internal anchor

CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.

Learning Process Rewards via Success Visitation Matching for Efficient RL

cs.LG · 2026-06-22 · unverdicted · none · ref 47 · internal anchor

Success Visitation Matching uses a discriminator to turn sparse outcome rewards into dense process rewards by matching visitations of successful episodes, provably preserving the optimal policy and speeding up robotic RL finetuning.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

cs.LG · 2026-06-05 · unverdicted · none · ref 17 · internal anchor

RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.

Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance

cs.IR · 2026-05-29 · unverdicted · none · ref 17 · internal anchor

Graph-GRPO builds a dependency graph over CoT steps and propagates outcome rewards to enable finer credit assignment in generative relevance modeling for e-commerce search.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · none · ref 71 · internal anchor

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

Firefly: Illuminating Large-Scale Verified Tool-Call Data Generation from Real APIs

cs.SE · 2026-05-17 · unverdicted · none · ref 8 · internal anchor

FireFly inverts task synthesis by exploring real MCP servers first via pairwise tool graphs and sub-DAG sampling, then generates 5,144 verified tasks backward from outcomes to train a 4B model that matches Claude Sonnet 4.6 on tool-calling benchmarks.

Distribution Corrected Offline Data Distillation for Large Language Models

cs.CL · 2026-05-13 · unverdicted · none · ref 25 · internal anchor

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

cs.CV · 2026-05-09 · unverdicted · none · ref 5 · internal anchor

CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

cs.AI · 2026-05-07 · unverdicted · none · ref 5 · 3 links · internal anchor

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · none · ref 8 · internal anchor

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

cs.AI · 2025-11-17 · unverdicted · none · ref 21 · internal anchor

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

cs.CL · 2025-08-21 · unverdicted · none · ref 14 · internal anchor

Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · none · ref 30 · internal anchor

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

cs.CL · 2024-12-25 · unverdicted · none · ref 69 · internal anchor

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

cs.CL · 2024-02-05 · unverdicted · none · ref 29 · internal anchor

DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · none · ref 160 · internal anchor

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

cs.LG · 2026-06-17 · unverdicted · none · ref 24 · internal anchor

STARE applies surprisal-guided token-level advantage reweighting plus a target-entropy gate to stabilize entropy in GRPO RL for LLMs, yielding stable training and 4-8% gains on AIME24/25 over baselines.

Self-evolving LLM agents with in-distribution Optimization

cs.LG · 2026-06-05 · unverdicted · none · ref 74 · internal anchor

Q-Evolve unifies automatic process-reward labeling via advantage estimation and behavior-proximal policy optimization inside an in-distribution RL loop to enable self-evolving LLM agents on interactive tasks.

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

cs.AI · 2026-05-29 · unverdicted · none · ref 18 · internal anchor

CAST adds non-privileged self-teacher scoring and bidirectional advantage flipping to GRPO so that zero-variance groups still produce verifier-signed token gradients.

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

cs.AI · 2026-05-27 · unverdicted · none · ref 26 · internal anchor

JTS trains reasoning models via supervised warm-up and missing-premise RL to make an explicit answerability commitment that triggers early termination on unanswerable inputs, raising Abstention@Detection near saturation.

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

cs.CV · 2026-05-10 · unverdicted · none · ref 16 · internal anchor

A neuro-symbolic engine generates GeoSym127K, a 127K-question dataset with symbolic ground truths and verified CoT pairs, yielding +22.21% gains on MathVerse Vision-Only after SFT on Qwen3-VL-8B.

ARMove: Learning to Predict Human Mobility through Agentic Reasoning

cs.MA · 2026-04-19 · unverdicted · none · ref 39 · internal anchor

ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interpretability and robustness.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · none · ref 155 · internal anchor

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

A Survey on LLM-as-a-Judge

cs.CL · 2024-11-23 · unverdicted · none · ref 102 · internal anchor

A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

cs.CL · 2024-01-05 · unverdicted · none · ref 80 · internal anchor

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

cs.AI · 2026-06-07 · unverdicted · none · ref 167 · internal anchor

An integrated survey organizing AI mathematical reasoning into informal, formal, discovery, and technique axes while cataloging benchmarks and assessing failure modes.

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

cs.AI · 2025-01-16 · unverdicted · none · ref 85 · internal anchor

The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · none · ref 121 · internal anchor

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.
