super hub Mixed citations

Qwen2.5-Coder Technical Report

Binyuan Hui, Dayiheng Liu, Jian Yang, Jiaxi Yang, Lei Zhang, Zeyu Cui · 2024 · cs.CL · arXiv 2409.12186

Mixed citation behavior. Most common role is background (66%).

312 Pith papers citing it

Background 66% of classified citations

open full Pith review browse 312 citing papers more from Binyuan Hui arXiv PDF

abstract

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 44 baseline 7 method 5 dataset 3 other 2

citation-polarity summary

background 40 baseline 7 unclear 5 use method 5 use dataset 3 support 1

claims ledger

abstract In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-rel

authors

Binyuan Hui Dayiheng Liu Jian Yang Jiaxi Yang Lei Zhang Zeyu Cui

co-cited works

representative citing papers

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

cs.AI · 2026-06-06 · unverdicted · novelty 8.0

UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

cs.CL · 2026-06-04 · accept · novelty 8.0

HKJudge is a new ~290k-sentence expert-annotated corpus of Hong Kong criminal judgments with 26 rhetorical roles and 3 sentencing elements, plus benchmarks on classification and extraction tasks.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

Regression Accumulation in Multi-Turn LLM Programming Conversations

cs.SE · 2026-07-02 · conditional · novelty 7.0

Regression accumulation affects 40-73% of 8-turn LLM coding tasks on extended HumanEval+/MBPP+ benchmarks, with verification gates improving final-turn pass rates on prior tests.

The Illusion of Safety: Multi-Tier Verification of AI vs. Human C++ Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Multi-tier verification on VULBENCH-CPP shows AI-generated C++ code triggers confirmed runtime violations roughly twice as often as human code, while static analysis misleadingly indicates parity due to code length.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

cs.SE · 2026-06-24 · unverdicted · novelty 7.0

CodeChat-Eval shows LLMs lose 19.2% to 69.2% functional correctness over multi-turn refinement dialogues, with largest drops on logic-level and additive changes.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning

cs.SE · 2026-06-18 · unverdicted · novelty 7.0

Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

ModSleuth reconstructs dependency graphs from public artifacts for four LLM releases, recovering 1,060 source-verified dependencies and exposing license issues, train-evaluation coupling, and documentation gaps.

CODEBLOCK: Learning to Supervise Code at the Right Granularity

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

CodeBlock partitions code responses into syntactically coherent blocks, scores them with generalized cross-entropy and data-flow signals, and applies sparse supervision to achieve higher pass@1 than full SFT using 1.9% of tokens on six benchmarks.

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

INFRAMIND is an infrastructure-aware multi-agent orchestration framework that uses RL on a hierarchical constrained MDP to jointly optimize topology, model selection, and scheduling under dynamic load.

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

OpenRTLSet supplies 131k+ Verilog samples with AI-generated descriptions to enable fine-tuning of LLMs for hardware module design.

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.

The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation

cs.SE · 2026-06-07 · unverdicted · novelty 7.0

First empirical study shows crate hallucination in Rust LLMs has consistent rates across models insensitive to parameters and tests prompt-based mitigation.

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

cs.SD · 2026-06-05 · unverdicted · novelty 7.0

UniSinger unifies speaker-cloned song generation and accompaniment co-generation SVC in one multimodal diffusion transformer model trained with curriculum learning via task-specific modality masking.

SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

ACE-SQL jointly optimizes schema linking and SQL generation via RL with empirical credit assignment from execution-correct rollouts, achieving 65.3% greedy execution accuracy on BIRD Dev using 0.93k output tokens.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation cs.CV · 2026-05-05 · unverdicted · none · ref 15 · internal anchor
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation cs.SE · 2026-04-03 · unverdicted · none · ref 17 · internal anchor
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
PARM: Pipeline-Adapted Reward Model cs.AI · 2026-04-20 · unverdicted · none · ref 42 · internal anchor
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs cs.LG · 2026-05-08 · unverdicted · none · ref 43 · internal anchor
Graph representation learning plus iterative augmented Lagrangian optimization creates stronger, harder-to-detect model manipulation attacks on federated LLM fine-tuning, cutting global accuracy by up to 26%.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL · 2026-04-03 · unverdicted · none · ref 32 · internal anchor
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.

Qwen2.5-Coder Technical Report

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer