A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
hub Canonical reference
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Canonical reference. 89% of citing Pith papers cite this work as background.
abstract
Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
A hybrid agentic architecture integrates knowledge-based physical verification tools into LLM-driven CAD design loops, producing more complex and functionally valid designs than prior agentic baselines.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
ContraFix couples differential runtime evidence from execution variants with reusable repair skills to achieve 84.0% resolution on SEC-Bench and 73.8% on PatchEval using GPT-5-mini, outperforming baselines at lower cost.
MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refinement memories with iterative feedback.
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
CDDRefactorER constrains AI-driven refactoring using Cognitive-Driven Development rules to cut failures by 54-71% and raise novice comprehension scores by 22-31%.
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.
PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.
PtrTrans builds a Pointer Knowledge Graph with points-to flows, struct abstractions, and Rust annotations to guide LLMs toward project-level C-to-Rust translations that cut unsafe code by 99.9% and raise functional correctness by 29.3%.
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
citing papers explorer
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.