The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
hub
Unixcoder: Unified cross-modal pre-training for code representation
15 Pith papers cite this work, alongside 459 external citations. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.
MARGIN reduces geometric distortions in imbalanced vulnerability embeddings by dynamically regularizing margins with von Mises-Fisher concentration estimates and hyperspherical prototypes.
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
Text fine-tuning of 8B LLMs on C/C++ vulnerability data inflates cross-language false-positive rates through surface-cue memorization, which an AST inference probe can partially reverse while direct AST fine-tuning cannot.
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
LoRA-MME ensembles LoRA-adapted UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa with learned weights to reach 0.7906 weighted F1 and 0.6867 macro F1 on code comment classification.
MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.
citing papers explorer
-
Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
Deep Graph-Language Fusion for Structure-Aware Code Generation
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
-
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
-
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
-
Do not copy and paste! Rewriting strategies for code retrieval
Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.
-
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
-
UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection Models
UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.
-
MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection
MARGIN reduces geometric distortions in imbalanced vulnerability embeddings by dynamically regularizing margins with von Mises-Fisher concentration estimates and hyperspherical prototypes.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection
Text fine-tuning of 8B LLMs on C/C++ vulnerability data inflates cross-language false-positive rates through surface-cue memorization, which an AST inference probe can partially reverse while direct AST fine-tuning cannot.
-
Towards General Text Embeddings with Multi-stage Contrastive Learning
GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.
-
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
-
LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
LoRA-MME ensembles LoRA-adapted UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa with learned weights to reach 0.7906 weighted F1 and 0.6867 macro F1 on code comment classification.
-
MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation
MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.