Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
Large language models for software engineering: Survey and open problems
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
LLMs generated 615 vulnerable code snippets aligned with CAPEC and CWE frameworks across three languages, with 0.98 cosine similarity between model outputs.
citing papers explorer
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
A Taxonomy of Programming Languages for Code Generation
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
-
From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks
LLMs generated 615 vulnerable code snippets aligned with CAPEC and CWE frameworks across three languages, with 0.98 cosine similarity between model outputs.