Qwen2.5 Technical Report

Baosong Yang; Beichen Zhang; Binyuan Hui; Bowen Yu; Bo Zheng; Chengyuan Li; Dayiheng Liu; Fei Huang; Haoran Wei; Huan Lin

arxiv: 2412.15115 · v2 · submitted 2024-12-19 · 💻 cs.CL

Qwen2.5 Technical Report

Qwen: An Yang , Baosong Yang , Beichen Zhang , Binyuan Hui , Bo Zheng , Bowen Yu , Chengyuan Li , Dayiheng Liu

show 34 more authors

Fei Huang Haoran Wei Huan Lin Jian Yang Jianhong Tu Jianwei Zhang Jianxin Yang Jiaxi Yang Jingren Zhou Junyang Lin Kai Dang Keming Lu Keqin Bao Kexin Yang Le Yu Mei Li Mingfeng Xue Pei Zhang Qin Zhu Rui Men Runji Lin Tianhao Li Tianyi Tang Tingyu Xia Xingzhang Ren Xuancheng Ren Yang Fan Yang Su Yichang Zhang Yu Wan Yuqiong Liu Zeyu Cui Zhenru Zhang Zihan Qiu (additional authors not shown)

This is my paper

Pith reviewed 2026-05-23 06:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords large language modelsQwen2.5pre-trainingpost-trainingreinforcement learninginstruction tuningbenchmarksmixture of experts

0 comments

The pith

Qwen2.5-72B-Instruct matches the performance of Llama-3-405B-Instruct on language and reasoning benchmarks despite being roughly five times smaller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The report presents Qwen2.5 as a family of large language models whose performance stems from scaling pre-training data to 18 trillion tokens and applying over one million samples of supervised fine-tuning plus multistage reinforcement learning. These steps produce strong results in common sense, expert knowledge, long-context generation, structural data handling, and instruction following across multiple model sizes. The flagship 72B instruction-tuned model is shown to outperform many open and closed models while staying competitive with a much larger open-weight baseline. Proprietary mixture-of-experts variants are positioned as cost-effective alternatives to GPT-4o-mini and GPT-4o. The models also serve as the base for specialized follow-on systems in mathematics, coding, and multimodal tasks.

Core claim

Qwen2.5 scales high-quality pre-training data from 7 trillion to 18 trillion tokens to build foundations in knowledge and reasoning, then applies intricate supervised fine-tuning on more than one million samples together with multistage reinforcement learning to improve human preference alignment, long text generation, structural data analysis, and instruction following; the resulting 72B-Instruct model outperforms numerous open and proprietary systems and remains competitive with Llama-3-405B-Instruct while the Turbo and Plus mixture-of-experts variants match or exceed the cost-effectiveness of GPT-4o-mini and GPT-4o.

What carries the argument

Scaling of high-quality pre-training datasets to 18 trillion tokens combined with multistage reinforcement learning applied after supervised fine-tuning on over one million samples.

If this is right

The 72B model can serve as a drop-in replacement for larger models in many language, reasoning, mathematics, and coding tasks.
Mixture-of-experts variants deliver GPT-4o-level results at lower inference cost for hosted use.
Qwen2.5 models provide a stronger starting point for training specialized systems in mathematics, coding, and multimodal domains.
Open-weight releases in multiple sizes and quantization levels broaden access to high-performing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data volume at this scale may reduce the performance gap that size alone previously created between open and closed models.
The approach implies that continued investment in curated pre-training corpora can yield efficiency gains even without architectural breakthroughs.
If the post-training pipeline generalizes, similar multistage reinforcement learning could improve alignment in other model families.

Load-bearing premise

The 18 trillion tokens of pre-training data are sufficiently high-quality and free of major contamination or bias to deliver reliable gains in common sense, knowledge, and reasoning.

What would settle it

Performance of Qwen2.5-72B-Instruct falling below Llama-3-405B-Instruct on a fresh set of contamination-free benchmarks would falsify the claim of competitive capability at smaller scale.

Figures

Figures reproduced from arXiv: 2412.15115 by Baosong Yang, Beichen Zhang, Binyuan Hui, Bowen Yu, Bo Zheng, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Qwen: An Yang, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yuqiong Liu, Yu Wan, Zeyu Cui, Zhenru Zhang, Zihan Qiu (additional authors not shown).

**Figure 2.** Figure 2: Performance of Qwen2.5-Turbo on Passkey Retrieval Task with 1M Token Lengths. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: TTFT (Time To First Token) of Qwen2.5-Turbo and Qwen2.5-7B with Full Attention and Our Method. 6 Conclusion Qwen2.5 represents a significant advancement in large language models (LLMs), with enhanced pretraining on 18 trillion tokens and sophisticated post-training techniques, including supervised fine-tuning and multi-stage reinforcement learning. These improvements boost human preference alignment, long… view at source ↗

read the original abstract

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Qwen2.5 is a straightforward model update that scales pretraining data and adds multistage RL to hit competitive benchmarks, but the lack of decontamination evidence leaves the performance claims open to doubt.

read the letter

The main thing to know is that the 72B instruct model reportedly matches or approaches Llama-3-405B on standard benchmarks after they grew the pretraining corpus from 7T to 18T tokens and layered in multistage RL on top of supervised finetuning. They also ship a full range of sizes plus two MoE hosted variants. That is the core of the report. What is actually new is the data scale and the multistage RL setup; the rest follows the usual post-training playbook. The paper does a service by open-weighting the models and showing concrete numbers on language understanding, math, coding, and preference alignment. Releasing the weights and the quantized versions is the part that matters most for downstream work. The post-training section at least claims gains on long-context and structured data tasks, which aligns with what people actually use these models for. The soft spot is exactly the one the stress-test flags. The abstract gives no quantitative decontamination numbers, overlap statistics, or membership-inference results against the usual eval suites. Without that, the headline claim that the smaller model competes with a five-times-larger one rests on an untested assumption about data cleanliness. The methodology description is also thin, which is typical for these reports but still limits how much anyone can learn or replicate the gains. This is not a methods paper; it is a release note. It is useful for anyone who tracks open-weight LLMs, builds on existing bases, or needs a strong 72B-class model right now. A reader who wants to cite current performance numbers or start fine-tuning from a competitive checkpoint will find it worth their time. It deserves a serious referee because these releases set the practical baseline the field works from, and the community should have a chance to examine the data and training details even if the underlying ideas are incremental.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Qwen2.5 series of LLMs, reporting that pre-training data was scaled from 7T to 18T high-quality tokens and that post-training used over 1M SFT samples plus multistage RL. It claims the open-weight Qwen2.5-72B-Instruct achieves top-tier results across language, reasoning, math, and coding benchmarks, outperforming several open/proprietary models and remaining competitive with the much larger Llama-3-405B-Instruct; hosted MoE variants (Turbo, Plus) are also presented as cost-effective alternatives to GPT-4o-mini/o.

Significance. If the reported benchmark numbers are shown to be free of test-set leakage, the work supplies concrete evidence that aggressive high-quality pre-training scaling combined with targeted post-training can produce open-weight models that match or approach the performance of models five times larger, strengthening the case for efficient open-source LLM development.

major comments (2)

[Pre-training description] Pre-training description (abstract and § on pre-training): the claim that the 18T-token corpus supplies a strong foundation for reasoning without significant contamination is unsupported by any quantitative decontamination analysis, n-gram overlap statistics, or membership-inference results against the cited evaluation suites (MMLU, GSM8K, HumanEval, etc.). This directly bears on the flagship competitiveness claim.
[Evaluation section] Evaluation section / abstract: benchmark results are presented without error bars, run-to-run variance, or explicit data-exclusion criteria, and without a table or appendix listing per-benchmark scores for Qwen2.5-72B-Instruct versus Llama-3-405B-Instruct. The absence of these details prevents independent verification of the central performance assertion.

minor comments (2)

The post-training paragraph refers to “intricate supervised finetuning with over 1 million samples” and “multistage reinforcement learning” without enumerating data sources, filtering steps, or the reward-model training procedure.
Figure or table summarizing model sizes, parameter counts, and key benchmark scores would improve readability and allow direct comparison with Llama-3-405B.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on the Qwen2.5 Technical Report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Pre-training description] Pre-training description (abstract and § on pre-training): the claim that the 18T-token corpus supplies a strong foundation for reasoning without significant contamination is unsupported by any quantitative decontamination analysis, n-gram overlap statistics, or membership-inference results against the cited evaluation suites (MMLU, GSM8K, HumanEval, etc.). This directly bears on the flagship competitiveness claim.

Authors: We acknowledge the referee's concern that the absence of quantitative decontamination metrics leaves the contamination claim unsupported. Due to the proprietary nature of the 18T corpus, we cannot release n-gram overlap statistics or membership-inference results. We will revise the pre-training section to describe our internal data curation pipeline, quality filtering, and decontamination procedures in greater detail while explicitly noting that specific overlap metrics are withheld for confidentiality reasons. This constitutes a partial revision. revision: partial
Referee: [Evaluation section] Evaluation section / abstract: benchmark results are presented without error bars, run-to-run variance, or explicit data-exclusion criteria, and without a table or appendix listing per-benchmark scores for Qwen2.5-72B-Instruct versus Llama-3-405B-Instruct. The absence of these details prevents independent verification of the central performance assertion.

Authors: We agree that the current presentation lacks sufficient methodological transparency. In the revised manuscript we will add error bars or reported variance for applicable benchmarks, state explicit data-exclusion criteria, and include a new appendix table with per-benchmark scores directly comparing Qwen2.5-72B-Instruct and Llama-3-405B-Instruct. These changes will enable independent verification of the performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The document is a technical report on model scaling and benchmark results with no mathematical derivations, fitted parameters presented as predictions, or self-referential definitions. Performance claims rest on external benchmarks (MMLU, GSM8K, etc.) and data scaling statements that do not reduce to the reported outcomes by construction. No load-bearing self-citations or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The report relies on standard assumptions in LLM training such as the benefit of scaling data and the effectiveness of RLHF-like methods, but introduces no new free parameters, axioms, or entities.

pith-pipeline@v0.9.0 · 6053 in / 1129 out tokens · 29825 ms · 2026-05-23T06:23:03.326026+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
econ.EM 2026-05 accept novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
cs.CR 2026-05 unverdicted novelty 8.0

Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...
FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
cs.AI 2026-05 conditional novelty 8.0

FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
cs.LG 2026-05 unverdicted novelty 8.0

OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
cs.CV 2026-05 unverdicted novelty 8.0

Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.
HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
cs.LG 2026-05 conditional novelty 8.0

HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
cs.CV 2026-05 unverdicted novelty 8.0

VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 8.0

Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
cs.CR 2025-09 conditional novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs
cs.DC 2026-05 unverdicted novelty 7.0

HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.
Instance-Optimal Estimation with Multiple LLM Judges on a Budget
cs.LG 2026-05 unverdicted novelty 7.0

Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints o...
WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents
cs.LG 2026-05 unverdicted novelty 7.0

WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.
Brain-LLM Alignment Tracks Training Data, Not Typology
cs.CL 2026-05 unverdicted novelty 7.0

Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic...
Test-Time Training Undermines Safety Guardrails
cs.LG 2026-05 unverdicted novelty 7.0

Test-time training enables three new threat models that raise jailbreak attack success rates on language models to averages of 95% and 93% ASR@10 under LoRA for few-shot and generation-phase attacks across model families.
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...
Self-Policy Distillation via Capability-Selective Subspace Projection
cs.CL 2026-05 unverdicted novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
cs.LG 2026-05 unverdicted novelty 7.0

GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
cs.CL 2026-05 unverdicted novelty 7.0

IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Grounding Driving VLA via Inverse Kinematics
cs.CV 2026-05 conditional novelty 7.0

By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v...
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
cs.LG 2026-05 unverdicted novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
cs.CR 2026-05 conditional novelty 7.0

Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
cs.LG 2026-05 unverdicted novelty 7.0

In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes
cs.LG 2026-05 conditional novelty 7.0

CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexit...
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LEAP replaces intractable categorical mask parameterization with a differentiable per-weight Bernoulli relaxation, delivering +2.59 average zero-shot accuracy gain over the best layer-wise baseline across 0.5B-8B LLMs...
The Unlearnability Phenomenon in RLVR for Language Models
cs.LG 2026-05 unverdicted novelty 7.0

RLVR training for language models exhibits an unlearnability phenomenon where certain hard examples stay unlearnable due to low gradient similarity and ungeneralizable reasoning patterns.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
Artificial Aphasias in Lesioned Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Lesioning parameters in large language models produces aphasia-like symptoms whose distributions vary by attention versus feed-forward components and by layer depth, but differ qualitatively from human clinical profiles.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
cs.CL 2026-05 unverdicted novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
cs.CL 2026-05 conditional novelty 7.0

MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical...
MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Do Language Models Align with Brains? Prediction Scores Are Not Enough
q-bio.NC 2026-05 unverdicted novelty 7.0

Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
cs.LG 2026-05 unverdicted novelty 7.0

GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.
What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
cs.CL 2026-05 accept novelty 7.0

Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
cs.AI 2026-05 unverdicted novelty 7.0

PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
cs.LG 2026-05 conditional novelty 7.0

AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
cs.AI 2026-05 unverdicted novelty 7.0

Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 7.0

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
Deep Minds and Shallow Probes
cs.LG 2026-05 unverdicted novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods
cs.LG 2026-05 unverdicted novelty 7.0

gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best amo...
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
cs.AI 2026-05 unverdicted novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization
cs.LG 2026-05 unverdicted novelty 7.0

CAQ-ZO aligns ZO query stencils to compander grids, eliminating query-time residual error and improving NF4 fine-tuning performance on Qwen and Llama models compared to standard quantized baselines.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
cs.CL 2026-05 unverdicted novelty 7.0

GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
cs.CV 2026-05 unverdicted novelty 7.0

SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
Unsupervised Process Reward Models
cs.LG 2026-05 unverdicted novelty 7.0

Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
cs.AI 2026-05 unverdicted novelty 7.0

Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
cs.AI 2026-05 unverdicted novelty 7.0

Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
cs.CL 2026-05 unverdicted novelty 7.0

LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 722 Pith papers · 29 internal anchors

[1]

Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S´ebastien Bubeck, Martin Cai, Caio C´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Rone...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert H...

work page arXiv
[3]

The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Maz- zotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language models. CoRR, abs/2311.16867,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Training-free long-context scaling of large language models

Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. CoRR, abs/2402.17463,

work page arXiv
[5]

Program Synthesis with Large Language Models

URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model Card Claud e 3.pdf. Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv
[7]

The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884,

work page arXiv
[8]

Towards Scalable Automated Alignment of LLMs: A Survey,

Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, and Bowen Yu. Towards scalable automated alignment of LLMs: A survey. CoRR, abs/2406.01252,

work page arXiv
[9]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond´e de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In EMNLP, pp. 7889–7901. Association for Computational Linguistics, 2023a. Zhihong Chen, Shuo Yan, Juhao Liang, Feng Jiang, Xiangbo Wu, Fei Yu, Guiming Hardy Chen, Junying Chen, Hongbo Zhang, Li Jianqua...

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

20 Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Self-play with execution feedback: Improving instruction-following capabilities of large language models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. CoRR, abs/2406.13542,

work page arXiv
[14]

Multi-programming language sandbox for llms

Shihan Dou, Jiazheng Zhang, Jianxiang Zang, Yunbo Tao, Haoxiang Jia, Shichun Liu, Yuming Yang, Shenxi Wu, Shaoqing Zhang, Muling Wu, et al. Multi-programming language sandbox for llms. CoRR, abs/2410.23074,

work page arXiv
[15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur´elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi`ere, Betha...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton A. Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Ka- terina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, and Sergey Markov. MERA: A...

work page arXiv
[17]

Athene-70b: Redefining the boundaries of post-training for open models, July 2024a

Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, and Banghua Zhu. Athene-70b: Redefining the boundaries of post-training for open models, July 2024a. URL https://nexusflow.ai/b logs/athene. Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, and Ion...

work page arXiv
[18]

Gemma 2: Improving Open Language Models at a Practical Size

URL https://storage.googleapis.com/deepmind-media/gemini/gemi ni v1 5 report.pdf. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L´eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram´e, et al. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Training Compute-Optimal Large Language Models

21 Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MA...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?CoRR, abs/2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language ...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.0682...

work page internal anchor Pith review Pith/arXiv arXiv 2001
[25]

Smith, and Hanna Hajishirzi

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. RewardBench: Evaluating reward models for language modeling. CoRR, abs/2403.13787,

work page arXiv
[26]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. CoRR, abs/2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[27]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

22 Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Online merging optimizers for boosting rewards and mitigating tax in alignment

Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, and Chang Zhou. Online merging optimizers for boosting rewards and mitigating tax in alignment. CoRR, abs/2405.17931, 2024a. Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. CoRR, abs/2401.124...

work page arXiv
[29]

BLEnD: A Benchmark for LLMs on Ev- eryday Knowledge in Diverse Cultures and Languages,

Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla P ´erez-Almendros, Abinew Ali Ayele, V ´ıctor Guti´errez-Basulto, Yazm´ın Ib´a˜nez- Garc´ıa, Hwaran Lee, Shamsuddeen Hassan Muhammad, Ki-Woong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidho...

work page arXiv
[30]

GPT-4 Technical Report

OpenAI. GPT4 technical report. CoRR, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Language models can self-lengthen to generate long texts

Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, and Junyang Lin. Language models can self-lengthen to generate long texts. CoRR, abs/2410.23933,

work page arXiv
[33]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Sto...

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Secrets of RLHF in large language models part II: Reward modeling

Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, et al. Secrets of RLHF in large language models part II: Reward modeling. CoRR, abs/2401.06080, 2024a. Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords. In AAAI, pp. 9154–9160. AAAI Press,

work page arXiv
[37]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024b. Zhilin Wang, Alexander Bukharin,...

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Aligning large language models via self-steering optimization

Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, and Junyang Lin. Aligning large language models via self-steering optimization. CoRR, abs/2410.17131,

work page arXiv
[39]

A., Oguz, B., et al

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoR...

work page arXiv
[40]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong D...

work page internal anchor Pith review Pith/arXiv arXiv
[42]

LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K

Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K. CoRR, abs/2402.05136,

work page arXiv
[43]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs

Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, and Jingren Zhou. P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs. CoRR, abs/2411.09116,

work page arXiv
[45]

RMB: Comprehensively benchmarking reward models in LLM alignment

Enyu Zhou, Guodong Zheng, Bing Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. RMB: Comprehensively benchmarking reward models in LLM alignment. CoRR, abs/2410.09893,

work page arXiv
[46]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv
[47]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

25 Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. CoRR, abs/2202.08906,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S´ebastien Bubeck, Martin Cai, Caio C´esar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Rone...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert H...

work page arXiv

[3] [3]

The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, M´erouane Debbah, ´Etienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Maz- zotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language models. CoRR, abs/2311.16867,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Training-free long-context scaling of large language models

Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. CoRR, abs/2402.17463,

work page arXiv

[5] [5]

Program Synthesis with Large Language Models

URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model Card Claud e 3.pdf. Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V . Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884,

work page arXiv

[8] [8]

Towards Scalable Automated Alignment of LLMs: A Survey,

Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, and Bowen Yu. Towards scalable automated alignment of LLMs: A survey. CoRR, abs/2406.01252,

work page arXiv

[9] [9]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond´e de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In EMNLP, pp. 7889–7901. Association for Computational Linguistics, 2023a. Zhihong Chen, Shuo Yan, Juhao Liang, Feng Jiang, Xiangbo Wu, Fei Yu, Guiming Hardy Chen, Junying Chen, Hongbo Zhang, Li Jianqua...

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

20 Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Self-play with execution feedback: Improving instruction-following capabilities of large language models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. CoRR, abs/2406.13542,

work page arXiv

[14] [14]

Multi-programming language sandbox for llms

Shihan Dou, Jiazheng Zhang, Jianxiang Zang, Yunbo Tao, Haoxiang Jia, Shichun Liu, Yuming Yang, Shenxi Wu, Shaoqing Zhang, Muling Wu, et al. Multi-programming language sandbox for llms. CoRR, abs/2410.23074,

work page arXiv

[15] [15]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aur´elien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozi`ere, Betha...

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton A. Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Ka- terina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, and Sergey Markov. MERA: A...

work page arXiv

[17] [17]

Athene-70b: Redefining the boundaries of post-training for open models, July 2024a

Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, and Banghua Zhu. Athene-70b: Redefining the boundaries of post-training for open models, July 2024a. URL https://nexusflow.ai/b logs/athene. Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, and Ion...

work page arXiv

[18] [18]

Gemma 2: Improving Open Language Models at a Practical Size

URL https://storage.googleapis.com/deepmind-media/gemini/gemi ni v1 5 report.pdf. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L´eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram´e, et al. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Training Compute-Optimal Large Language Models

21 Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MA...

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?CoRR, abs/2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language ...

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth´ee Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.0682...

work page internal anchor Pith review Pith/arXiv arXiv 2001

[25] [25]

Smith, and Hanna Hajishirzi

Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. RewardBench: Evaluating reward models for language modeling. CoRR, abs/2403.13787,

work page arXiv

[26] [26]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. CoRR, abs/2006.16668,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[27] [27]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

22 Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Online merging optimizers for boosting rewards and mitigating tax in alignment

Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, and Chang Zhou. Online merging optimizers for boosting rewards and mitigating tax in alignment. CoRR, abs/2405.17931, 2024a. Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. CoRR, abs/2401.124...

work page arXiv

[29] [29]

BLEnD: A Benchmark for LLMs on Ev- eryday Knowledge in Diverse Cultures and Languages,

Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla P ´erez-Almendros, Abinew Ali Ayele, V ´ıctor Guti´errez-Basulto, Yazm´ın Ib´a˜nez- Garc´ıa, Hwaran Lee, Shamsuddeen Hassan Muhammad, Ki-Woong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidho...

work page arXiv

[30] [30]

GPT-4 Technical Report

OpenAI. GPT4 technical report. CoRR, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

Language models can self-lengthen to generate long texts

Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, and Junyang Lin. Language models can self-lengthen to generate long texts. CoRR, abs/2410.23933,

work page arXiv

[33] [33]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[35] [35]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Sto...

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

Secrets of RLHF in large language models part II: Reward modeling

Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, et al. Secrets of RLHF in large language models part II: Reward modeling. CoRR, abs/2401.06080, 2024a. Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords. In AAAI, pp. 9154–9160. AAAI Press,

work page arXiv

[37] [37]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024b. Zhilin Wang, Alexander Bukharin,...

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Aligning large language models via self-steering optimization

Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, and Junyang Lin. Aligning large language models via self-steering optimization. CoRR, abs/2410.17131,

work page arXiv

[39] [39]

A., Oguz, B., et al

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoR...

work page arXiv

[40] [40]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Yi: Open Foundation Models by 01.AI

Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong D...

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K

Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K. CoRR, abs/2402.05136,

work page arXiv

[43] [43]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825,

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs

Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, and Jingren Zhou. P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs. CoRR, abs/2411.09116,

work page arXiv

[45] [45]

RMB: Comprehensively benchmarking reward models in LLM alignment

Enyu Zhou, Guodong Zheng, Bing Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. RMB: Comprehensively benchmarking reward models in LLM alignment. CoRR, abs/2410.09893,

work page arXiv

[46] [46]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

25 Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. CoRR, abs/2202.08906,

work page internal anchor Pith review Pith/arXiv arXiv