Authors demonstrate functional memorization in code LLMs via counterfactual midtraining comparison on functional equivalence metrics beyond textual overlap.
hub Canonical reference
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Multilingual execution-grounded benchmark finds top open code LLM at 23.64% correctness versus 57.2% human baseline, with compile errors dominating 63% of failures.
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
STABLE introduces semantics-aware bilevel co-evolution for automated multicomponent algorithm design and reports outperformance over human and prior LES baselines.
TestHumanizer uses LLMs as refactoring layers on EvoSuite suites to reach 88-98% compilation rates and better readability on 350 classes from Defects4J and SF110 while preserving coverage.
Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.
Acoda uses a genetic algorithm to optimize eight obfuscation methods that reduce LLM code analysis success rates to as low as 30% while preserving original semantics.
Code2LoRA generates repo-specific LoRA adapters via hypernetwork for code LMs, matching per-repo LoRA on static tasks and exceeding shared LoRA by 5.2 pp on evolving code in a 604-repo benchmark.
Introduces functional equivalence methods and functional entropy to predict functional correctness of LLM-generated code via uncertainty quantification, outperforming NLI-based baselines in most tested settings.
A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.
citing papers explorer
No citing papers match the current filters.