{"total":10,"items":[{"citing_arxiv_id":"2605.17957","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Contextualized Code Pretraining for Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-18T07:12:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05267","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code","primary_cat":"cs.SE","submitted_at":"2026-05-06T09:38:31+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM-generated code with multiple quality defects. However, despite their remarkable utility, LLMs frequently produce code outputs with critical quality issues that hinder practical adoption [88]. Empirical studies confirm that unfiltered LLM-generated code suffers from widespread syntax and logical errors, numerous security vulnerabilities, and significant maintainability flaws [13]. 
For example, Figure 1 illustrates a typical LLM-generated user registration endpoint, which exhibits multiple overlapping quality defects: an unparameterized SQL query (Line 11) that exposes critical SQL injection vulnerabilities, use of the deprecated pandas.append() method (Line 13) that breaks compatibility with modern pandas versions, and complete lack of"},{"citing_arxiv_id":"2604.27001","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code","primary_cat":"cs.CR","submitted_at":"2026-04-29T03:58:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-generated cryptographic Rust code compiles successfully only 23% of the time and contains detectable vulnerabilities in 57% of the cases that do compile.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24678","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study","primary_cat":"cs.SE","submitted_at":"2026-04-27T16:38:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tic correctness [30]. Code-specific variants (e.g., CodeBLEU) incorporate syntax/semantic signals, but are primarily designed for GPLs [33]. 
For DSL generation, functional unit-test execution (e.g., HumanEval-style evaluation) is often not directly applicable because the DSL artifacts are typically inputs to generators rather than executable programs [4]. Accordingly, DSL generation evaluation commonly combines: (i) syntactic validity (parse/grammar conformance), (ii) toolchain acceptance (generator/build success), and (iii) expert review of structural fidelity and maintainability [31]. 3 Related Work Early work on translating NL into formal representations is rooted in semantic parsing and program synthesis [2, 11]."},{"citing_arxiv_id":"2604.24621","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions","primary_cat":"cs.SE","submitted_at":"2026-04-27T15:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09515","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-10T17:37:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"the differences (diffs) between 
related coding tasks, mimicking human analogical learning [51]. At the repository level, Zhang et al. introduced CODEAGENT, an LLM-based agent framework that integrates external tools to support complex, multi-file code generation tasks [8]. To systematically evaluate these models and approaches, a range of benchmark datasets have been developed [4]. Representative function-level benchmarks include HumanEval [25], HumanEval+ [28], MBPP [2], EvalPlus [29], and APPS [15]. For repository-level evaluation, commonly used benchmarks include SWE-bench [18], RepoBench [31], CrossCodeEval [7], and RepoFuse [27]. Despite these advances, existing studies either draw knowledge from models' parametric knowledge or predominantly assume that"},{"citing_arxiv_id":"2604.06742","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios","primary_cat":"cs.SE","submitted_at":"2026-04-08T07:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00989","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sustainable Code Generation Using Large Language Models: A Systematic Literature Review","primary_cat":"cs.SE","submitted_at":"2026-03-01T08:32:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or 
benchmarking.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Overview As a first step, we searched for existing systematic literature reviews and surveys related to LLMs and code generation in digital libraries such as Compendex, Inspec, Scopus, and Google Scholar. We identified several recent reviews covering related areas, including general LLM-based code generation [51]-[58], LLMs in software engineering [59]-[61], code completion [62], [63], automated program repair [64]-[66], parameter-efficient fine-tuning for large code models [67], [68], domain-specific and low-resource code generation [69], [70], LLMs in programming education [71]-[74], developer productivity [75], [76], and quality and efficiency of LLM-generated code [77]-[80]. In addition, systematic mapping"},{"citing_arxiv_id":"2509.22202","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries","primary_cat":"cs.SE","submitted_at":"2025-09-26T11:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.17181","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Study of LLMs' Preferences for Libraries and Programming Languages","primary_cat":"cs.SE","submitted_at":"2025-03-21T14:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical study of eight LLMs 
finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}