CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
20 Pith papers cite this work.
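The metric itself is a weighted sum of four component scores: standard n-gram BLEU, a keyword-weighted n-gram match, a syntactic AST-subtree match, and a semantic data-flow match, with equal 0.25 weights by default. A minimal sketch of the combination, assuming the four component scores (each in [0, 1]) are already computed; the function name is illustrative:

```python
def codebleu(bleu, weighted_ngram, ast_match, dataflow_match,
             alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted combination of CodeBLEU's four components.

    Computing the components (tokenization, AST subtree matching,
    data-flow extraction) is the substantive part and is not shown.
    """
    return (alpha * bleu + beta * weighted_ngram
            + gamma * ast_match + delta * dataflow_match)

# Example: a candidate that is syntactically close to the reference
# but diverges in data flow.
score = codebleu(0.62, 0.68, 0.80, 0.40)  # 0.625
```

Because the syntactic and semantic components look past surface tokens, two snippets with different identifiers but identical structure score higher than plain BLEU would suggest.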
representative citing papers
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
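A hypothetical sketch of such an execution-matching score, assuming both the reference and the lifted candidate can be called on shared test inputs and that runtime errors count as mismatches (names are illustrative, not the paper's API):

```python
def execution_match(reference_fn, candidate_fn, test_inputs):
    """Fraction of test inputs on which the candidate's output
    equals the reference's: a crude execution-based semantic score."""
    matches = 0
    for args in test_inputs:
        try:
            if candidate_fn(*args) == reference_fn(*args):
                matches += 1
        except Exception:
            pass  # crashes and exceptions count as mismatches
    return matches / len(test_inputs) if test_inputs else 0.0

# Toy example: a decompiled function that mishandles negatives.
ref = lambda x: abs(x)
cand = lambda x: x  # wrong for x < 0
score = execution_match(ref, cand, [(-2,), (0,), (3,)])  # 2/3
```

A textual metric like BLEU could rate `cand` highly despite the semantic bug, which is exactly the divergence the paper reports.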
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.
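The symbol-verification idea can be illustrated with Python's `ast` module, even though the tool itself targets Android migration code: extract the names and attributes referenced in generated code, then flag any that are absent from a documentation-derived symbol set. All identifiers below are hypothetical examples, not the tool's actual API:

```python
import ast

def undocumented_symbols(code, documented_api):
    """Return symbols referenced in generated code that do not
    appear in a documentation-derived knowledge base."""
    referenced = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Attribute):
            referenced.add(node.attr)   # method/attribute accesses
        elif isinstance(node, ast.Name):
            referenced.add(node.id)     # plain identifiers
    return referenced - documented_api

# 'notifyAll' is not in the documented set, so it gets flagged
# as a potential hallucinated API.
snippet = "mgr = NotificationManager()\nmgr.notifyAll(channel)"
known = {"NotificationManager", "notify", "channel", "mgr"}
flagged = undocumented_symbols(snippet, known)  # {'notifyAll'}
```

Checking symbols structurally rather than via string similarity is what lets this approach separate genuine hallucinations from harmless scaffolding differences.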
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.
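One plausible reading of this mechanism, sketched with NumPy: accumulate gradient magnitudes online, mark the top fraction of parameters as task-critical, and exclude them from subsequent updates. The scoring rule, top-k threshold, and hard freezing step below are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def update_isolation_mask(grad_history, protect_fraction=0.1):
    """Mark parameters with the largest accumulated gradient
    magnitude as task-critical (True = protected)."""
    scores = np.abs(grad_history).sum(axis=0)
    k = max(1, int(protect_fraction * scores.size))
    threshold = np.partition(scores, -k)[-k]
    return scores >= threshold

def masked_sgd_step(params, grad, mask, lr=1e-2):
    """Update only unprotected parameters, isolating task-critical
    weights from interference by later gradients."""
    return params - lr * grad * ~mask

# One step: the largest-gradient parameter (index 9) is protected.
grad_history = np.arange(10.0)[None, :]
mask = update_isolation_mask(grad_history, protect_fraction=0.1)
params = masked_sgd_step(np.zeros(10), grad_history[0], mask)
```

Recomputing the mask periodically during fine-tuning, rather than fixing it up front, is what lets newly emerging task-critical parameters gain protection.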
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
The APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, suggesting that machine-learning models are beginning to acquire basic coding ability.
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.
A systematic literature review organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
citing papers explorer

- StarCoder: may the source be with you!
  StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
- A Survey on Large Language Models for Code Generation
  A systematic literature review organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.