
arxiv: 2412.15115 · v2 · submitted 2024-12-19 · 💻 cs.CL

Qwen2.5 Technical Report

Pith reviewed 2026-05-23 06:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · Qwen2.5 · pre-training · post-training · reinforcement learning · instruction tuning · benchmarks · mixture of experts

The pith

Qwen2.5-72B-Instruct performs competitively with Llama-3-405B-Instruct on language and reasoning benchmarks despite being roughly five times smaller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The report presents Qwen2.5 as a family of large language models whose performance stems from scaling pre-training data to 18 trillion tokens and applying over one million samples of supervised fine-tuning plus multistage reinforcement learning. These steps produce strong results in common sense, expert knowledge, long-context generation, structural data handling, and instruction following across multiple model sizes. The flagship 72B instruction-tuned model is shown to outperform many open and closed models while staying competitive with a much larger open-weight baseline. Proprietary mixture-of-experts variants are positioned as cost-effective alternatives to GPT-4o-mini and GPT-4o. The models also serve as the base for specialized follow-on systems in mathematics, coding, and multimodal tasks.

Core claim

Qwen2.5 scales high-quality pre-training data from 7 trillion to 18 trillion tokens to build foundations in knowledge and reasoning. It then applies supervised fine-tuning on more than one million samples together with multistage reinforcement learning to improve human preference alignment, long text generation, structural data analysis, and instruction following. The resulting 72B-Instruct model outperforms numerous open and proprietary systems and remains competitive with Llama-3-405B-Instruct, while the Turbo and Plus mixture-of-experts variants offer better cost-effectiveness than GPT-4o-mini and GPT-4o, respectively, at competitive performance.
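As a quick sanity check on the two scale figures in the core claim, the ratios can be worked out directly. This assumes the nominal sizes in the model names (72B and 405B parameters) stand in for the exact parameter counts, which the text here does not state.

```latex
\[
\frac{405\,\text{B parameters}}{72\,\text{B parameters}} \approx 5.6,
\qquad
\frac{18\,\text{T tokens}}{7\,\text{T tokens}} \approx 2.6
\]
```

Under that assumption, "around 5 times larger" is a slightly conservative rounding, and the pre-training corpus grew by a factor of about 2.6 over the previous generation.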

What carries the argument

Scaling of high-quality pre-training datasets to 18 trillion tokens combined with multistage reinforcement learning applied after supervised fine-tuning on over one million samples.

If this is right

  • The 72B model can serve as a drop-in replacement for larger models in many language, reasoning, mathematics, and coding tasks.
  • Mixture-of-experts variants deliver GPT-4o-level results at lower inference cost for hosted use.
  • Qwen2.5 models provide a stronger starting point for training specialized systems in mathematics, coding, and multimodal domains.
  • Open-weight releases in multiple sizes and quantization levels broaden access to high-performing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data volume at this scale may reduce the performance gap that size alone previously created between open and closed models.
  • The approach implies that continued investment in curated pre-training corpora can yield efficiency gains even without architectural breakthroughs.
  • If the post-training pipeline generalizes, similar multistage reinforcement learning could improve alignment in other model families.

Load-bearing premise

The 18 trillion tokens of pre-training data are sufficiently high-quality and free of major contamination or bias to deliver reliable gains in common sense, knowledge, and reasoning.

What would settle it

Performance of Qwen2.5-72B-Instruct falling below Llama-3-405B-Instruct on a fresh set of contamination-free benchmarks would falsify the claim of competitive capability at smaller scale.

Figures

Figures reproduced from arXiv: 2412.15115 by Baosong Yang, Beichen Zhang, Binyuan Hui, Bowen Yu, Bo Zheng, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jian Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Qwen: An Yang, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yuqiong Liu, Yu Wan, Zeyu Cui, Zhenru Zhang, Zihan Qiu (additional authors not shown).

Figure 1: In the iterative development of the Qwen series, data scaling has played a crucial role. Qwen 2.5, …
Figure 2: Performance of Qwen2.5-Turbo on Passkey Retrieval Task with 1M Token Lengths.
Figure 3: TTFT (Time To First Token) of Qwen2.5-Turbo and Qwen2.5-7B with Full Attention and Our Method.

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Qwen2.5 series of LLMs, reporting that pre-training data was scaled from 7T to 18T high-quality tokens and that post-training used over 1M SFT samples plus multistage RL. It claims the open-weight Qwen2.5-72B-Instruct achieves top-tier results across language, reasoning, math, and coding benchmarks, outperforming several open and proprietary models and remaining competitive with the much larger Llama-3-405B-Instruct. The hosted MoE variants (Turbo, Plus) are presented as cost-effective alternatives to GPT-4o-mini and GPT-4o, respectively.

Significance. If the reported benchmark numbers are shown to be free of test-set leakage, the work supplies concrete evidence that aggressive high-quality pre-training scaling combined with targeted post-training can produce open-weight models that match or approach the performance of models five times larger, strengthening the case for efficient open-source LLM development.

major comments (2)
  1. [Pre-training description] Pre-training description (abstract and § on pre-training): the claim that the 18T-token corpus supplies a strong foundation for reasoning without significant contamination is unsupported by any quantitative decontamination analysis, n-gram overlap statistics, or membership-inference results against the cited evaluation suites (MMLU, GSM8K, HumanEval, etc.). This directly bears on the flagship competitiveness claim.
  2. [Evaluation section] Evaluation section / abstract: benchmark results are presented without error bars, run-to-run variance, or explicit data-exclusion criteria, and without a table or appendix listing per-benchmark scores for Qwen2.5-72B-Instruct versus Llama-3-405B-Instruct. The absence of these details prevents independent verification of the central performance assertion.
minor comments (2)
  1. The post-training paragraph refers to “intricate supervised finetuning with over 1 million samples” and “multistage reinforcement learning” without enumerating data sources, filtering steps, or the reward-model training procedure.
  2. A figure or table summarizing model sizes, parameter counts, and key benchmark scores would improve readability and allow direct comparison with Llama-3-405B.
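To make major comment 1 concrete, the following is a minimal sketch of the kind of n-gram overlap statistic the referee asks for. It is not the authors' decontamination procedure, which the report does not describe here. The function names, the whitespace tokenizer, and the default window `n = 13` are illustrative assumptions.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(benchmark_items: List[str],
                 corpus_docs: Iterable[str],
                 n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus.

    Hypothetical helper: a real pipeline would stream the corpus and use a
    hashed or approximate index rather than an in-memory set.
    """
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)
```

Reporting this rate per evaluation suite, alongside scores recomputed on the unflagged subset, would let readers judge how much of the headline result could be attributable to leakage.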

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on the Qwen2.5 Technical Report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

  1. Referee: [Pre-training description] Pre-training description (abstract and § on pre-training): the claim that the 18T-token corpus supplies a strong foundation for reasoning without significant contamination is unsupported by any quantitative decontamination analysis, n-gram overlap statistics, or membership-inference results against the cited evaluation suites (MMLU, GSM8K, HumanEval, etc.). This directly bears on the flagship competitiveness claim.

    Authors: We acknowledge the referee's concern that the absence of quantitative decontamination metrics leaves the contamination claim unsupported. Due to the proprietary nature of the 18T corpus, we cannot release n-gram overlap statistics or membership-inference results. We will revise the pre-training section to describe our internal data curation pipeline, quality filtering, and decontamination procedures in greater detail while explicitly noting that specific overlap metrics are withheld for confidentiality reasons. revision: partial

  2. Referee: [Evaluation section] Evaluation section / abstract: benchmark results are presented without error bars, run-to-run variance, or explicit data-exclusion criteria, and without a table or appendix listing per-benchmark scores for Qwen2.5-72B-Instruct versus Llama-3-405B-Instruct. The absence of these details prevents independent verification of the central performance assertion.

    Authors: We agree that the current presentation lacks sufficient methodological transparency. In the revised manuscript we will add error bars or reported variance for applicable benchmarks, state explicit data-exclusion criteria, and include a new appendix table with per-benchmark scores directly comparing Qwen2.5-72B-Instruct and Llama-3-405B-Instruct. These changes will enable independent verification of the performance claims. revision: yes
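One way the error bars requested in major comment 2 could be produced is a percentile bootstrap over test items. The sketch below is illustrative only and is not a method taken from the report. The function name, the resample count, and the seed are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(correct: List[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds one 0/1 outcome per benchmark item for a single model.
    """
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    # Resample items with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper
```

This interval reflects only the sampling uncertainty from a finite test set. Run-to-run variance from decoding temperature or prompt formatting would have to be measured separately by repeating the evaluation.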

Circularity Check

0 steps flagged

No circularity in derivation chain


The document is a technical report on model scaling and benchmark results with no mathematical derivations, fitted parameters presented as predictions, or self-referential definitions. Performance claims rest on external benchmarks (MMLU, GSM8K, etc.) and data scaling statements that do not reduce to the reported outcomes by construction. No load-bearing self-citations or ansatzes are invoked in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The report relies on standard assumptions in LLM training such as the benefit of scaling data and the effectiveness of RLHF-like methods, but introduces no new free parameters, axioms, or entities.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

    econ.EM 2026-05 accept novelty 8.0

    EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

  2. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  3. Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

    cs.CR 2026-05 unverdicted novelty 8.0

    Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this pro...

  4. FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

    cs.AI 2026-05 conditional novelty 8.0

    FormalRewardBench is the first benchmark for reward models in formal theorem proving, consisting of 250 Lean 4 preference pairs that show frontier LLMs scoring 59.8% while specialized provers score only 24.4%.

  5. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  6. Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

    cs.CV 2026-05 unverdicted novelty 8.0

    Creates the first benchmark dataset integrating papers, slides, videos, and presentations for evaluating AI models on fine-grained multimodal correspondences in science.

  7. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  8. VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

    cs.CV 2026-05 unverdicted novelty 8.0

    VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.

  9. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  10. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

    cs.CL 2026-04 unverdicted novelty 8.0

    ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

  11. RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

    cs.CR 2025-09 conditional novelty 8.0

    RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

  12. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  13. HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

    cs.DC 2026-05 unverdicted novelty 7.0

    HyperParallel-MoE achieves up to 1.58x lower Dispatch-to-Combine MoE-FFN latency on Ascend A3 clusters via tile-level heterogeneous scheduling of AIC and AIV resources.

  14. Instance-Optimal Estimation with Multiple LLM Judges on a Budget

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces budgeted heteroskedastic multi-judge estimation and proves instance-optimality of an adaptive inverse-variance weighted estimator via matching upper and lower bounds.

  15. Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints o...

  16. WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    WMAttack automates finite-budget attack search for world-model agents via SCAS and RGAR, reporting higher normalized reward drops than baselines on Atari and DMC tasks.

  17. Brain-LLM Alignment Tracks Training Data, Not Typology

    cs.CL 2026-05 unverdicted novelty 7.0

    Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic...

  18. Test-Time Training Undermines Safety Guardrails

    cs.LG 2026-05 unverdicted novelty 7.0

    Test-time training enables three new threat models that raise jailbreak attack success rates on language models to averages of 95% and 93% ASR@10 under LoRA for few-shot and generation-phase attacks across model families.

  19. Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Learned Relay Representations enable masked diffusion models to propagate useful latent information across denoising steps, scaling to Fast-dLLM v2 to outperform supervised finetuning on coding tasks while cutting inf...

  20. Self-Policy Distillation via Capability-Selective Subspace Projection

    cs.CL 2026-05 unverdicted novelty 7.0

    Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines...

  21. GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

    cs.LG 2026-05 unverdicted novelty 7.0

    GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.

  22. IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

    cs.CL 2026-05 unverdicted novelty 7.0

    IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.

  23. Grounding Driving VLA via Inverse Kinematics

    cs.CV 2026-05 conditional novelty 7.0

    By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v...

  24. Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

    cs.LG 2026-05 unverdicted novelty 7.0

    Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

  25. Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

    cs.CR 2026-05 conditional novelty 7.0

    Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.

  26. The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.

  27. CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

    cs.CL 2026-05 unverdicted novelty 7.0

    CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...

  28. Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

    cs.LG 2026-05 conditional novelty 7.0

    CPD applies CUSUM change-point detection to standardized next-token entropy streams to identify and localize optimization-based adversarial suffixes, achieving higher F1 and better localization than windowed-perplexit...

  29. LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...

  30. LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAP replaces intractable categorical mask parameterization with a differentiable per-weight Bernoulli relaxation, delivering +2.59 average zero-shot accuracy gain over the best layer-wise baseline across 0.5B-8B LLMs...

  31. The Unlearnability Phenomenon in RLVR for Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    RLVR training for language models exhibits an unlearnability phenomenon where certain hard examples stay unlearnable due to low gradient similarity and ungeneralizable reasoning patterns.

  32. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.

  33. Artificial Aphasias in Lesioned Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Lesioning parameters in large language models produces aphasia-like symptoms whose distributions vary by attention versus feed-forward components and by layer depth, but differ qualitatively from human clinical profiles.

  34. PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...

  35. MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical...

  36. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  37. RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

    cs.LG 2026-05 unverdicted novelty 7.0

    RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.

  38. Do Language Models Align with Brains? Prediction Scores Are Not Enough

    q-bio.NC 2026-05 unverdicted novelty 7.0

    Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.

  39. GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

    cs.LG 2026-05 unverdicted novelty 7.0

    GHGbench is a new multi-entity benchmark for company- and building-level carbon emission prediction that shows building tasks are harder, out-of-distribution gaps dominate, and multimodal data aids generalization.

  40. What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation

    cs.CL 2026-05 accept novelty 7.0

    Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.

  41. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

    cs.AI 2026-05 unverdicted novelty 7.0

    PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.

  42. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

  43. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  44. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Tool-use agents suffer large accuracy drops from reward and transition perturbations but domain-randomized RL on static perturbations closes about 27% of the unseen transition gap while retaining most clean performance.

  45. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  46. CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

    cs.AI 2026-05 unverdicted novelty 7.0

    CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

  47. Deep Minds and Shallow Probes

    cs.LG 2026-05 unverdicted novelty 7.0

    Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

  48. gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

    cs.LG 2026-05 unverdicted novelty 7.0

    gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best amo...

  49. Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

    cs.AI 2026-05 unverdicted novelty 7.0

    VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

  50. Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

    cs.LG 2026-05 unverdicted novelty 7.0

    CAQ-ZO aligns ZO query stencils to compander grids, eliminating query-time residual error and improving NF4 fine-tuning performance on Qwen and Llama models compared to standard quantized baselines.

  51. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

    cs.CL 2026-05 unverdicted novelty 7.0

    GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

  52. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  53. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  54. Unsupervised Process Reward Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Unsupervised PRMs derived from LLM probabilities achieve up to 15% better error detection than LLM judges and match supervised PRMs in verification and RL tasks.

  55. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.

  56. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  57. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  58. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  59. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  60. LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

    cs.CL 2026-05 unverdicted novelty 7.0

    LEAF-SQL uses level-wise exploration with adaptive fine-graining and dual agents to generate diverse SQL skeletons, reaching 71.6% execution accuracy on the BIRD benchmark and outperforming prior search- and skeleton-...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 722 Pith papers · 29 internal anchors

  1. [1]

    Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Rone...

  2. [2]

    Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert H...

  3. [3]

    The Falcon Series of Open Language Models

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. The Falcon series of open language models. CoRR, abs/2311.16867,

  4. [4]

    Training-free long-context scaling of large language models

    Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. CoRR, abs/2402.17463,

  5. [5]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732,

  6. [6]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  7. [7]

    The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884,

  8. [8]

    Towards Scalable Automated Alignment of LLMs: A Survey

    Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han, Le Sun, Hongyu Lin, and Bowen Yu. Towards scalable automated alignment of LLMs: A survey. CoRR, abs/2406.01252,

  9. [9]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bava...

  10. [10]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. In EMNLP, pp. 7889–7901. Association for Computational Linguistics, 2023a. Zhihong Chen, Shuo Yan, Juhao Liang, Feng Jiang, Xiangbo Wu, Fei Yu, Guiming Hardy Chen, Junying Chen, Hongbo Zhang, Li Jianqua...

  11. [11]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168,

  12. [12]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. CoRR, abs/2401.06066,

  13. [13]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. CoRR, abs/2406.13542,

  14. [14]

    Multi-programming language sandbox for llms

    Shihan Dou, Jiazheng Zhang, Jianxiang Zang, Yunbo Tao, Haoxiang Jia, Shichun Liu, Yuming Yang, Shenxi Wu, Shaoqing Zhang, Muling Wu, et al. Multi-programming language sandbox for llms. CoRR, abs/2410.23074,

  15. [15]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Betha...

  16. [16]

    Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton A. Emelyanov, Denis Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva, Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova, Denis Dimitrov, Alexander Panchenko, and Sergey Markov. MERA: A...

  17. [17]

    Athene-70b: Redefining the boundaries of post-training for open models, July 2024a

    Evan Frick, Peter Jin, Tianle Li, Karthik Ganesan, Jian Zhang, Jiantao Jiao, and Banghua Zhu. Athene-70b: Redefining the boundaries of post-training for open models, July 2024a. URL https://nexusflow.ai/blogs/athene. Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, and Ion...

  18. [18]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118,

  19. [19]

    Training Compute-Optimal Large Language Models

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MA...

  20. [20]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? CoRR, abs/2404.06654,

  21. [21]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language ...

  22. [22]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186,

  23. [23]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974,

  24. [24]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. CoRR, abs/2310.0682...

  25. [25]

    RewardBench: Evaluating Reward Models for Language Modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Daniel Morrison, Lester James Validad Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hanna Hajishirzi. RewardBench: Evaluating reward models for language modeling. CoRR, abs/2403.13787,

  26. [26]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. CoRR, abs/2006.16668,

  27. [27]

    From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. CoRR, abs/2406.11939,

  28. [28]

    Online merging optimizers for boosting rewards and mitigating tax in alignment

    Keming Lu, Bowen Yu, Fei Huang, Yang Fan, Runji Lin, and Chang Zhou. Online merging optimizers for boosting rewards and mitigating tax in alignment. CoRR, abs/2405.17931, 2024a. Keming Lu, Bowen Yu, Chang Zhou, and Jingren Zhou. Large language models are superpositions of all characters: Attaining arbitrary role-play via self-alignment. CoRR, abs/2401.124...

  29. [29]

    BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

    Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Pérez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Ki-Woong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidho...

  30. [30]

    GPT-4 Technical Report

    OpenAI. GPT4 technical report. CoRR, abs/2303.08774,

  31. [31]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071,

  32. [32]

    Language models can self-lengthen to generate long texts

    Shanghaoran Quan, Tianyi Tang, Bowen Yu, An Yang, Dayiheng Liu, Bofei Gao, Jianhong Tu, Yichang Zhang, Jingren Zhou, and Junyang Lin. Language models can self-lengthen to generate long texts. CoRR, abs/2410.23933,

  33. [33]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. CoRR, abs/2311.12022,

  34. [34]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300,

  35. [35]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Sto...

  36. [36]

    Secrets of RLHF in large language models part II: Reward modeling

    Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, et al. Secrets of RLHF in large language models part II: Reward modeling. CoRR, abs/2401.06080, 2024a. Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords. In AAAI, pp. 9154–9160. AAAI Press,

  37. [37]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024b. Zhilin Wang, Alexander Bukharin,...

  38. [38]

    Aligning large language models via self-steering optimization

    Hao Xiang, Bowen Yu, Hongyu Lin, Keming Lu, Yaojie Lu, Xianpei Han, Le Sun, Jingren Zhou, and Junyang Lin. Aligning large language models via self-steering optimization. CoRR, abs/2410.17131,

  39. [39]

    Effective Long-Context Scaling of Foundation Models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. CoR...

  40. [40]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

  41. [41]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong D...

  42. [42]

    LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K

    Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, and Yu Wang. LV-Eval: A balanced long-context benchmark with 5 length levels up to 256K. CoRR, abs/2402.05136,

  43. [43]

    Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825,

  44. [44]

    P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs

    Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, and Jingren Zhou. P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of LLMs. CoRR, abs/2411.09116,

  45. [45]

    RMB: Comprehensively benchmarking reward models in LLM alignment

    Enyu Zhou, Guodong Zheng, Bing Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. RMB: Comprehensively benchmarking reward models in LLM alignment. CoRR, abs/2410.09893,

  46. [46]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911,

  47. [47]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. CoRR, abs/2202.08906,