CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang · 2020 · cs.SE · arXiv 2009.10297

Canonical reference. 89% of citing Pith papers cite this work as background.

58 Pith papers citing it

Background 89% of classified citations

abstract

Evaluation metrics play a vital role in the growth of an area, as they define the standard for distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code: BLEU was originally designed to evaluate natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict and therefore underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in n-gram matching and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by programmers on three code synthesis tasks: text-to-code, code translation, and code refinement. Experimental results show that CodeBLEU achieves a better correlation with programmer-assigned scores than BLEU and accuracy.
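To make the idea in the abstract concrete, here is a minimal sketch of how an n-gram signal and a syntax signal might be blended into one score. It is not the authors' implementation: the function names, the whitespace tokenizer, the node-type overlap (a crude stand-in for AST subtree matching), and the equal weights are all assumptions chosen for illustration, and the data-flow component is omitted entirely.

```python
import ast
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision over whitespace tokens (simplified BLEU-style term)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it appears in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


def ast_node_overlap(candidate: str, reference: str) -> float:
    """Share of reference AST node types (with multiplicity) found in the candidate.

    Hypothetical simplification: real syntax matching compares subtrees,
    not bags of node-type names.
    """
    def node_types(source: str) -> Counter:
        return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

    ref_types = node_types(reference)
    cand_types = node_types(candidate)
    total = sum(ref_types.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand_types[name]) for name, count in ref_types.items())
    return matched / total


def combined_score(scores: dict, weights: dict) -> float:
    """Weighted sum of component scores; the weights here are illustrative assumptions."""
    return sum(weights[name] * scores[name] for name in weights)


reference = "def add(a, b):\n    return a + b"
candidate = "def add(x, y):\n    return x + y"  # same logic, renamed variables

scores = {
    "ngram": ngram_precision(candidate, reference),
    "ast": ast_node_overlap(candidate, reference),
}
score = combined_score(scores, {"ngram": 0.5, "ast": 0.5})
print(scores, score)
```

In this toy case the renamed variables pull the token-level score down while the syntax score stays at its maximum, which is the kind of gap between surface overlap and structural similarity that the abstract describes.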

citation-role summary

background: 8 · method: 1

citation-polarity summary

background: 8 · use method: 1

representative citing papers

Detecting Functional Memorization in Code Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

Authors demonstrate functional memorization in code LLMs via counterfactual midtraining comparison on functional equivalence metrics beyond textual overlap.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

cs.SE · 2026-05-04 · accept · novelty 7.0 · 2 refs

LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

cs.SE · 2025-08-22 · unverdicted · novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

Guidelines for Empirical Studies in Software Engineering involving Large Language Models

cs.SE · 2025-08-21 · accept · novelty 7.0 · 2 refs

The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.

Semantics-Aware Bilevel Co-Evolution: Towards Automated Multicomponent Algorithm Design

cs.NE · 2026-06-29 · unverdicted · novelty 6.0

STABLE introduces semantics-aware bilevel co-evolution for automated multicomponent algorithm design and reports outperformance over human and prior LES baselines.

Humanizing Automatically Generated Unit Test Suites with LLM-Based Refactoring

cs.SE · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

TestHumanizer uses LLMs as refactoring layers on EvoSuite suites to reach 88-98% compilation rates and better readability on 350 classes from Defects4J and SF110 while preserving coverage.

When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

cs.SE · 2026-06-26 · unverdicted · novelty 6.0

Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.

Acoda: Adversarial Code Obfuscation for Defending against LLM-based Analysis

cs.SE · 2026-06-10 · unverdicted · novelty 6.0

Acoda uses a genetic algorithm to optimize eight obfuscation methods that reduce LLM code analysis success rates to as low as 30% while preserving original semantics.

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

cs.SE · 2026-06-04 · unverdicted · novelty 6.0

Code2LoRA generates repo-specific LoRA adapters via hypernetwork for code LMs, matching per-repo LoRA on static tasks and exceeding shared LoRA by 5.2 pp on evolving code in a 604-repo benchmark.

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.

Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.

Contextualized Code Pretraining for Code Generation

cs.SE · 2026-05-18 · unverdicted · novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

cs.SE · 2026-05-17 · conditional · novelty 6.0

MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.

Beyond BLEU: A Semantic Evaluation Method for Code Translation

cs.PL · 2026-05-06 · unverdicted · novelty 6.0

A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

Hallucination Inspector: A Fact-Checking Judge for API Migration

cs.SE · 2026-04-22 · unverdicted · novelty 6.0

Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.

Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.
