CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang · 2020 · cs.SE · arXiv 2009.10297

Canonical reference. 89% of citing Pith papers cite this work as background.

46 Pith papers citing it

Background: 89% of classified citations

abstract

Evaluation metrics play a vital role in the growth of an area because they define the standard for distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code: BLEU was originally designed to evaluate natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict and therefore underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU achieves a better correlation with programmer-assigned scores than BLEU and accuracy.
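The idea in the abstract (n-gram match plus syntax and data-flow signals) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngram_precision` and `combined_score` are hypothetical, the equal weights are an assumption chosen for illustration, and the syntax and data-flow scores are passed in as plain numbers rather than computed from a real AST or data-flow graph.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate tokens against reference tokens."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def combined_score(ngram, weighted_ngram, syntax, dataflow,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four component scores (weights are an illustrative assumption)."""
    parts = (ngram, weighted_ngram, syntax, dataflow)
    return sum(w * p for w, p in zip(weights, parts))


# Two snippets with the same meaning but a different operand order.
candidate = "return a + b".split()
reference = "return b + a".split()

unigram = ngram_precision(candidate, reference, 1)  # every token appears in the reference
bigram = ngram_precision(candidate, reference, 2)   # no bigram is shared

# Hypothetical component scores: surface overlap is middling, while the
# (assumed) syntax and data-flow matches are perfect.
score = combined_score(0.5, 0.5, 1.0, 1.0)
```

In this toy case the bigram precision drops to zero even though the two snippets compute the same value, which is the kind of surface-level penalty the abstract attributes to BLEU. Adding structure-aware components lifts the combined score above what token overlap alone would give.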

citation-role summary

background: 8 · method: 1

citation-polarity summary

background: 8 · use method: 1

representative citing papers

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

cs.SE · 2026-05-04 · accept · novelty 7.0 · 2 refs

LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

cs.SE · 2025-08-22 · unverdicted · novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

cs.SE · 2025-08-21 · accept · novelty 7.0 · 2 refs

The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.

Contextualized Code Pretraining for Code Generation

cs.SE · 2026-05-18 · unverdicted · novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

cs.SE · 2026-05-17 · conditional · novelty 6.0

MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.

Beyond BLEU: A Semantic Evaluation Method for Code Translation

cs.PL · 2026-05-06 · unverdicted · novelty 6.0

A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

Hallucination Inspector: A Fact-Checking Judge for API Migration

cs.SE · 2026-04-22 · unverdicted · novelty 6.0

Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.

Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.

ARuleCon: Agentic Security Rule Conversion

cs.CR · 2026-04-08 · unverdicted · novelty 6.0

ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

cs.SE · 2026-03-28 · unverdicted · novelty 6.0

Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

Improving Code Comprehension through Cognitive-Load Aware Automated Refactoring for Novice Programmers

cs.SE · 2026-03-17 · conditional · novelty 6.0

CDDRefactorER constrains AI-driven refactoring using Cognitive-Driven Development rules to cut failures by 54-71% and raise novice comprehension scores by 22-31%.

AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

cs.AI · 2026-02-05 · unverdicted · novelty 6.0

AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.

Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap

cs.CR · 2026-01-30 · unverdicted · novelty 6.0

Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

cs.CL · 2025-11-26 · unverdicted · novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.

Project-Level C-to-Rust Translation via Pointer Knowledge Graphs

cs.SE · 2025-10-13 · unverdicted · novelty 6.0

PtrTrans builds a Pointer Knowledge Graph with points-to flows, struct abstractions, and Rust annotations to guide LLMs toward project-level C-to-Rust translations that cut unsafe code by 99.9% and raise functional correctness by 29.3%.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

citing papers explorer

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

cs.RO · 2025-06-22 · unverdicted · none · ref 39 · internal anchor

RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
