{"total":13,"items":[{"citing_arxiv_id":"2606.29124","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CornerCase: Automated Extremal Testing of Protocol Implementations using LLMs","primary_cat":"cs.NI","submitted_at":"2026-06-28T00:25:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CornerCase automates extremal testing of protocol implementations by using LLMs to extract validity constraints from specs and generating boundary test cases, uncovering 42 anomalies (26 acknowledged as bugs) in HTTP, DNS, BGP, SMTP, and QUIC implementations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16790","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering","primary_cat":"cs.SE","submitted_at":"2026-04-18T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This broader pattern supports the view that prompt or presentation cues may act as confounders in automated evaluation pipelines, including code evaluation. Finally, extensive work on social and demographic biases reports that LLM outputs can exhibit systematic stereotypes and representational harms across protected attributes, including gender- and race-related effects [7, 14, 26]. 
Beyond overt stereotypes, this line of research emphasizes that bias can appear as disparate treatment (e.g., different assumptions about competence, trustworthiness, or intent), unequal exposure to harmful or toxic content, and skewed coverage of social groups in both generation and evaluation settings [7]. Importantly, these concerns extend to SE-facing ap-"},{"citing_arxiv_id":"2604.13725","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation","primary_cat":"cs.SE","submitted_at":"2026-04-15T11:00:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"context compression as a viable approach for repository-level code intelligence and provide guidance for paradigm selection under different deployment constraints. ∗Corresponding author 1 Introduction Large language models (LLMs) have revolutionized many software engineering tasks, including code generation, program repair, code summarization, and automated testing [6, 18, 27, 30, 38]. 
State-of-the-art code LLMs such as DeepSeek-Coder [13] and Qwen2.5-Coder [15] are increasingly applied to repository-level tasks including cross-file completion, API-aware generation, and codebase question answering, which require reasoning over long, heterogeneous contexts spanning API definitions, dependency files, and historical"},{"citing_arxiv_id":"2604.02548","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks","primary_cat":"cs.CR","submitted_at":"2026-04-02T21:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs generated 615 vulnerable code snippets aligned with CAPEC and CWE frameworks across three languages, with 0.98 cosine similarity between model outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Several studies discuss the integration of LLMs into different aspects of the software development process, highlighting both the potential and challenges of this technology [16], [17]. Contemporary research emphasizes thorough investigations into the effectiveness and challenges of LLMs in software security tasks. GPT-3.5 and GPT-4 can achieve competitive results with appropriate prompts [18]. Zhang et al. [19] studied how ChatGPT detects software vulnerabilities with different prompts and found that ChatGPT identifies vulnerabilities in Java programs better than in C/C++ programs with a basic prompt but lacks a comprehensive understanding of vulnerabilities. 
Evaluations of various LLMs highlight their promise despite performance gaps when compared to static analysis tools [22]-"},{"citing_arxiv_id":"2604.00239","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Taxonomy of Programming Languages for Code Generation","primary_cat":"cs.CL","submitted_at":"2026-03-31T21:05:39+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00989","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sustainable Code Generation Using Large Language Models: A Systematic Literature Review","primary_cat":"cs.SE","submitted_at":"2026-03-01T08:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As a first step, we searched for existing systematic literature reviews and surveys related to LLMs and code generation in digital libraries such as Compendex, Inspec, Scopus, and Google Scholar. 
We identified several recent reviews covering related areas, including general LLM-based code generation [51]-[58], LLMs in software engineering [59]-[61], code completion [62], [63], automated program repair [64]-[66], parameter-efficient fine-tuning for large code models [67], [68], domain-specific and low-resource code generation [69], [70], LLMs in programming education [71]-[74], developer productivity [75], [76], and quality and efficiency of LLM-generated code [77]-[80]. In addition, systematic mapping studies have examined the use of LLMs for generating and"},{"citing_arxiv_id":"2411.10656","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models","primary_cat":"cs.SE","submitted_at":"2024-11-16T01:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Smaller LLMs produce functional but limited Python code with variable quantization effects and quality/maintainability concerns that require validation before use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.09916","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"\"Should I Give Up Now?\" Investigating LLM Pitfalls in Software Engineering","primary_cat":"cs.SE","submitted_at":"2024-11-15T03:29:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"User study reveals nine LLM failure categories in SE tasks and quantifies abandonment factors from 26 participants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.13501","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on the Memory Mechanism of Large Language Model based 
Agents","primary_cat":"cs.AI","submitted_at":"2024-04-21T01:49:46+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(IR) and summarize studies on LLM-based query processes. Xu et al. [38] pay more attention to information extraction (IE) and provide comprehensive taxonomies for LLM-based models in this field. Li et al. [50], Lin et al. [51] and Wang et al. [52] discuss the applications of LLMs in the field of recommender system, where they utilize agents to generate data and provide recommendations. Fan et al. [39], Wang et al. [40], and Zheng et al. [41] concentrate on how LLMs can benefit software engineering (SE) in terms of software design, development, and testing. Zeng et al. [42] summarize LLM-based methods in the field of robotics. Cui et al. [43] and Yang et al. 
[44] focus on the application of autonomous driving and summarize models in this domain based on LLMs from"},{"citing_arxiv_id":"2404.01535","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation","primary_cat":"cs.SE","submitted_at":"2024-04-01T23:55:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.07974","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2024-03-12T17:58:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.19173","ref_index":198,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StarCoder 2 and The Stack v2: The Next Generation","primary_cat":"cs.SE","submitted_at":"2024-02-29T13:53:35+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers 
released.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.12138","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs","primary_cat":"cs.SE","submitted_at":"2023-05-20T08:43:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs achieve strong results on syntax parsing tasks but show limited and variable performance on dynamic reasoning, with a clear performance hierarchy across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}