GraphCodeBERT: Pre-training Code Representations with Data Flow
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, and code summarization. However, existing pre-trained models regard a code snippet as a sequence of tokens and ignore the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking a syntactic-level structure of code such as the abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and avoids the unnecessarily deep hierarchy of an AST, which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks: code search, clone detection, code translation, and code refinement. Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.
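The "where-the-value-comes-from" relation described in the abstract can be illustrated with a small sketch. The code below is not GraphCodeBERT's extraction pipeline, which the abstract does not detail. It is a minimal illustration, assuming straight-line Python assignments only, and it uses the standard-library `ast` module. The function name `data_flow_edges` and the `(variable, line)` node format are hypothetical choices made for this example.

```python
import ast


def data_flow_edges(source):
    """Return data-flow edges ((src_var, src_line), (dst_var, dst_line)).

    Assumption: only top-level simple assignments are handled; branches,
    loops, function calls, and attribute or subscript targets are ignored.
    """
    tree = ast.parse(source)
    last_def = {}  # variable name -> line of its most recent assignment
    edges = []
    for stmt in tree.body:
        if not isinstance(stmt, ast.Assign):
            continue  # outside the scope of this sketch
        # Variables read on the right-hand side of the assignment.
        reads = [n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)]
        # Plain-name targets on the left-hand side.
        targets = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
        for tgt in targets:
            for name in reads:
                if name in last_def:
                    # The value of `tgt` comes from the latest definition of `name`.
                    edges.append(((name, last_def[name]), (tgt, stmt.lineno)))
        for tgt in targets:
            last_def[tgt] = stmt.lineno
    return edges


code = "a = 1\nb = a + 2\nc = a * b\n"
for src, dst in data_flow_edges(code):
    print(src, "->", dst)
```

Each edge points from the definition a value comes from to the variable that receives it, so the resulting graph stays flat regardless of how deeply the expressions are nested in the syntax tree.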
citation-role summary
citation-polarity summary
roles
background 5
polarities
background 5
representative citing papers
Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep MCP repository pairs manually verified as true clones.
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
AI coding agents produce pull requests with substantially more commits and slightly higher description-to-diff similarity than human developers, based on analysis of 29,095 merged PRs.
Fisher information selects task-relevant parts of graph features to fuse with pretrained code models, improving vulnerability detection F1 by up to 6.3 points on BigVul, Devign, and ReVeal.
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
Replication of TCS strategies on 17 LLM instances across three code tasks shows only partial generalization from vision DNN results, with uncertainty features aiding early failure discovery and representation features aiding accuracy estimation.
XSearch achieves explainable code search by breaking queries into functional concepts and matching them directly to code statements, delivering large gains on out-of-distribution benchmarks.
NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.
VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.
DiffHLS predicts HLS QoR via differential learning: separate GNN+LLM models for kernel baseline and design delta are composed to yield the final estimate, showing lower MAPE than GNN baselines on PolyBench.
Sliceformer reformulates static program slicing as seq2seq using CodeT5+ with dataflow-aware pretraining via DFG permutation and span corruption plus constrained decoding, yielding up to 22% ExactMatch gains on Java and Python benchmarks.
AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
Student models distilled from code language models often fail to deeply mimic teachers, showing up to 62% behavioral discrepancies and 285% worse drops under attacks that accuracy metrics miss.
Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
PseudoBridge uses LLM-synthesized pseudo-code to bridge NL semantics and PL logic plus logic-invariant style augmentation to boost robustness and generalization in code retrieval.
Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
citing papers explorer
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
Evaluating Tool Cloning in Agentic-AI Ecosystems
Tool cloning is pervasive in agentic AI ecosystems, with 60% of high-Jaccard and 85% of high-ssdeep MCP repository pairs manually verified as true clones.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
-
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
-
How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests
AI coding agents produce pull requests with substantially more commits and slightly higher description-to-diff similarity than human developers, based on analysis of 29,095 merged PRs.
-
Focus on What Matters: Fisher-Guided Adaptive Multimodal Fusion for Vulnerability Detection
Fisher information selects task-relevant parts of graph features to fuse with pretrained code models, improving vulnerability detection F1 by up to 6.3 points on BigVul, Devign, and ReVeal.
-
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
-
Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs
LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.
-
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
-
Test Case Selection for Deep Neural Networks: A Replication Study on LLMs for Code
Replication of TCS strategies on 17 LLM instances across three code tasks shows only partial generalization from vision DNN results, with uncertainty features aiding early failure discovery and representation features aiding accuracy estimation.
-
XSearch: Explainable Code Search via Concept-to-Code Alignment
XSearch achieves explainable code search by breaking queries into functional concepts and matching them directly to code statements, delivering large gains on out-of-distribution benchmarks.
-
NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
NeuroFlake integrates discriminative token mining into LLMs to classify flaky tests, raising F1-score to 69.34% on FlakeBench while showing greater robustness to semantic-preserving perturbations than prior methods.
-
VulStyle: A Multi-Modal Pre-Training for Code Stylometry-Augmented Vulnerability Detection
VulStyle pre-trains on 4.9M functions using code, non-terminal ASTs, and stylometry features, then fine-tunes to achieve SOTA F1 gains of 4-48% on BigVul and VulDeePecker.
-
Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach
Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.
-
DiffHLS: Differential Learning for High-Level Synthesis QoR Prediction with GNNs and LLM Code Embeddings
DiffHLS predicts HLS QoR via differential learning: separate GNN+LLM models for kernel baseline and design delta are composed to yield the final estimate, showing lower MAPE than GNN baselines on PolyBench.
-
Static Program Slicing Using Language Models With Dataflow-Aware Pretraining and Constrained Decoding
Sliceformer reformulates static program slicing as seq2seq using CodeT5+ with dataflow-aware pretraining via DFG permutation and span corruption plus constrained decoding, yielding up to 22% ExactMatch gains on Java and Python benchmarks.
-
AFGNN: API Misuse Detection using Graph Neural Networks and Clustering
AFGNN detects API misuses in Java code more effectively than prior methods by representing usage as graphs and clustering learned embeddings from self-supervised training.
-
On the Role of Fault Localization Context for LLM-Based Program Repair
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
-
A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?
Student models distilled from code language models often fail to deeply mimic teachers, showing up to 62% behavioral discrepancies and 285% worse drops under attacks that accuracy metrics miss.
-
Multi Language Models for On-the-Fly Syntax Highlighting
Unified multi-language deep learning model for on-the-fly syntax highlighting using normalization and few-shot learning to support six languages with lower deployment cost.
-
PseudoBridge: Pseudo Code as the Bridge for Better Semantic and Logic Alignment in Code Retrieval
PseudoBridge uses LLM-synthesized pseudo-code to bridge NL semantics and PL logic plus logic-invariant style augmentation to boost robustness and generalization in code retrieval.
-
Fine-Tuning Code Language Models to Detect Cross-Language Bugs
Fine-tuning 13 CodeLMs on a constructed CLB dataset with nine interaction types improves detection, with UniXcoder-base reaching F1 0.7407 and small models outperforming large ones.
-
UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection Models
UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.
-
XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.
-
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
-
Prompt Optimization for LLM Code Generation via Reinforcement Learning
A PPO agent with hybrid actions and test-driven rewards optimizes prompts for code LLMs, raising strict Pass@1 scores on MBPP+, HumanEval+, and APPS over prior methods.
-
DCVD: Dual-Channel Cross-Modal Fusion for Joint Vulnerability Detection and Localization
DCVD performs joint function-level vulnerability detection and statement-level localization by extracting control-dependency and semantic features in parallel branches, fusing them with contrastive alignment and bidirectional cross-attention, and applying explicit supervision at both granularities.
-
Learning Generalizable Multimodal Representations for Software Vulnerability Detection
MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.
-
PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection
Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.
-
Social Life of Code: Modeling Evolution through Code Embedding and Opinion Dynamics
Code embeddings combined with the Expressed-Private Opinion model produce trajectories that quantify developer influence and consensus formation across three open-source repositories.
-
SEER: Spectral Entropy Encoding of Roles for Context-Aware Attention-Based Design Pattern Detection
SEER adds spectral-entropy role encoding from Laplacian spectra and empirically calibrated time-weighted calling contexts to raise macro-F1 from 92.47% to 93.20% and accuracy from 92.52% to 93.98% on PyDesignNet for 23 GoF patterns.
-
Context-Guided Decompilation: A Step Towards Re-executability
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
-
Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code
Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.
-
DeepFWI: Identifying Bug-Sensitive Warnings with Multi-Modal Code-Warning Semantics
DeepFWI is a multi-modal LSTM model with cross-attention that identifies bug-sensitive warnings at warning granularity, reaching 67.06% F1 on a 280k-warning dataset and surfacing 25 confirmed bugs in four open-source projects.
-
Prompt-Driven Code Summarization: A Systematic Literature Review
A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.
-
LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
LoRA-MME ensembles LoRA-adapted UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa with learned weights to reach 0.7906 weighted F1 and 0.6867 macro F1 on code comment classification.
-
HYDRA: A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions
HYDRA is a hybrid model that uses heuristics plus deep embeddings and a VAE to predict latent zero-day vulnerabilities in patched functions from Chrome, Android, and ImageMagick.
-
Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
Fine-tuned decoder-only LLMs achieve up to 40.4% higher MAP than UniXcoder on CoSQA+ for code search, with non-monotonic size scaling and data composition sensitivity.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
- FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics