CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation
25 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles: background 4
representative citing papers
citing papers explorer
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
LLM-based vs. Search-based Merge Conflict Resolution: An Empirical Study of Competing Paradigms
LLM-based merge conflict resolution performs well on imbalanced conflicts but struggles with large or non-English inputs, while search-based methods show better generalization and strength on balanced conflicts.
-
VulKey: Automated Vulnerability Repair Guided by Domain-Specific Repair Patterns
VulKey introduces hierarchical expert knowledge abstractions to guide LLMs in vulnerability repair, reporting 31.5% accuracy on PrimeVul (7.6% above best baseline) and strong results on Vul4J.
-
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
-
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL trains LLMs with RL on software evolution data, reaching 41% on SWE-bench Verified and generalizing to other reasoning tasks.
-
AdaTrans: Automated C to Rust Transformation via Error-Adaptive Repair
AdaTrans uses strategy-driven RAG, error-stratified transformation, and multi-stage validation to reach 95.51% mean compilation pass rate and 81.09% solve rate on 104 algorithmic problems with only 1.19% unsafe files.
-
Context-Aware Distillation and Ablation for Text2DSL
Context-aware distillation with BNF+API+vocabulary scales PolkitBench to 10,073 pairs at 99.7% runtime pass rate; ablation on GigaChat-10B shows vocabulary adds +0.198 combined score while API/BNF add 22-25pp structural validity.
-
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.
-
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.
-
Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks
An empirical study finds that instruction tuning of CodeLLMs improves instruction following at the expense of infilling performance, a trade-off termed the Instruction-Tuning Tax.
-
MailoHLS: Multi-Adapter Structure-Aware Learning for Pareto-Driven HLS Pragma Optimization
MailoHLS combines LLM semantic reasoning and GNN structural modeling with multi-adapter PEFT and Pareto optimization to produce near-Pareto-optimal HLS pragma configurations, reporting up to 12.42x latency speedup on seen kernels and 10.2x on unseen ones.
-
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
QD-LLM applies neuroevolution to prompt embeddings within a quality-diversity framework, producing 46% higher coverage and 41% higher QD-score than QDAIF on HumanEval, MBPP, and creative writing benchmarks.
-
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
-
Understanding Robustness of Model Editing in Code LLMs
A controlled benchmark on 2040 problems reveals poor generalization and high interference in model editing for API updates in code LLMs, with many successes being workarounds rather than true migrations.
-
ReCode: Reinforcing Code Generation with Reasoning-Process Rewards
ReCode is a new RL framework combining contrastive reasoning-process reward learning with consistency-gated GRPO to improve code generation, yielding a 16.1% gain for a 7B model to match GPT-4-Turbo levels on benchmarks.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation
Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.
-
UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning
UNICS pre-trains on a pseudocode dataset for cross-lingual logic then applies multi-task transfer learning with hard-positive mining and dynamic hard-negative sampling to reach claimed SOTA on multilingual code-search benchmarks.
-
SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents
SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-bench Verified for 4B and 30B models, respectively, while beating baselines on performance and token use.
-
MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection
MARGIN uses von Mises-Fisher concentration to dynamically adjust geometric regularization, aligning embedding distributions with Voronoi cells for more stable decision boundaries in imbalanced vulnerability detection.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
BPE tokenization creates gibberish bias in CLLMs, causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.