citation dossier

Codebleu: a method for automatic evaluation of code synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma · 2020 · arXiv 2009.10297

19Pith papers citing it

20reference links

cs.SEtop field · 13 papers

UNVERDICTEDtop verdict bucket · 15 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 19 reviewed papers. Its strongest current cluster is cs.SE (13 papers). The largest review-status bucket among citing papers is UNVERDICTED (15 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

cs.SE · 2026-05-04 · accept · novelty 7.0 · 2 refs

LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.

SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

Beyond BLEU: A Semantic Evaluation Method for Code Translation

cs.PL · 2026-05-06 · unverdicted · novelty 6.0

A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

Hallucination Inspector: A Fact-Checking Judge for API Migration

cs.SE · 2026-04-22 · unverdicted · novelty 6.0

Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.

Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.

ARuleCon: Agentic Security Rule Conversion

cs.CR · 2026-04-08 · unverdicted · novelty 6.0

ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

cs.RO · 2025-06-22 · unverdicted · novelty 6.0

RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

Measuring Coding Challenge Competence With APPS

cs.SE · 2021-05-20 · unverdicted · novelty 6.0

APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.

On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies

cs.SE · 2026-05-07 · unverdicted · novelty 5.0

Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.

How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

cs.SE · 2026-05-06 · conditional · novelty 5.0

Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

cs.SE · 2026-04-30 · unverdicted · novelty 5.0

LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment

cs.SE · 2026-04-03 · unverdicted · novelty 5.0

DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Can Code Evaluation Metrics Detect Code Plagiarism?

cs.SE · 2026-04-28 · unverdicted · novelty 4.0

Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

citing papers explorer

Showing 19 of 19 citing papers.

Deep Graph-Language Fusion for Structure-Aware Code Generation cs.SE · 2026-05-05 · unverdicted · none · ref 22
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair cs.SE · 2026-05-04 · accept · none · ref 31 · 2 links
LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing cs.SE · 2026-04-21 · unverdicted · none · ref 45
A cascaded large-small model system generates edit sketches with the large model and applies them with the small model to make code editing both accurate and token-efficient.
SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair cs.SE · 2026-04-19 · unverdicted · none · ref 32
SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.
Beyond BLEU: A Semantic Evaluation Method for Code Translation cs.PL · 2026-05-06 · unverdicted · none · ref 10
A semantic correctness score based on execution matching shows LLM decompilers outperform heuristics for binary lifting while BLEU correlates poorly with functional accuracy.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study cs.SE · 2026-04-27 · unverdicted · none · ref 34
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Hallucination Inspector: A Fact-Checking Judge for API Migration cs.SE · 2026-04-22 · unverdicted · none · ref 20
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.
Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning cs.LG · 2026-04-15 · unverdicted · none · ref 2
Evolving Parameter Isolation (EPI) periodically updates parameter isolation masks using online gradient signals during supervised fine-tuning to protect emerging task-critical parameters and reduce interference and forgetting.
ARuleCon: Agentic Security Rule Conversion cs.CR · 2026-04-08 · unverdicted · none · ref 25
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation cs.RO · 2025-06-22 · unverdicted · none · ref 39
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 261
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Measuring Coding Challenge Competence With APPS cs.SE · 2021-05-20 · unverdicted · none · ref 13
APPS benchmark shows models like GPT-Neo pass roughly 20% of test cases on introductory problems, indicating machine learning is beginning to learn basic coding.
On Fixing Insecure AI-Generated Code through Model Fine-Tuning and Prompting Strategies cs.SE · 2026-05-07 · unverdicted · none · ref 54
Fine-tuning and prompting reduce some CWEs in AI-generated code but frequently introduce new weaknesses, with no strategy working reliably across models or languages.
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study cs.SE · 2026-05-06 · conditional · none · ref 17
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding cs.SE · 2026-04-30 · unverdicted · none · ref 66
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment cs.SE · 2026-04-03 · unverdicted · none · ref 37
DepTrans translates entire C repositories to Rust at 60.7% compilation success and 43.5% functional accuracy by combining reinforcement-aligned syntax training with dependency-guided iterative refinement.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 186
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Can Code Evaluation Metrics Detect Code Plagiarism? cs.SE · 2026-04-28 · unverdicted · none · ref 22
Code evaluation metrics like CrystalBLEU perform comparably to dedicated tools such as Dolos and JPlag when ranking plagiarized code pairs across modification levels on open datasets.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 226
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

Codebleu: a method for automatic evaluation of code synthesis

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer