pith. machine review for the scientific record.

arxiv: 2402.19173 · v1 · submitted 2024-02-29 · 💻 cs.SE · cs.AI

Recognition: no theorem link

StarCoder 2 and The Stack v2: The Next Generation

Alex Gu, Anton Lozhkov, Ao Tang, Arjun Guha, Armel Zebaze, Arthur Zucker, Binyuan Hui, Canwen Xu, Carlos Muñoz Ferrandis, Carolyn Jane Anderson, Chenghao Mou, Christopher Akiki, Denis Kocetkov, Dmitry Abulkhanov, Dmytro Pykhtar, Edoardo Abati, Evgenii Zheltonozhskii, Federico Cassano, Han Hu, Harm de Vries, Indraneil Paul, Jennifer Robinson, Jia Li, Jian Zhu, Jiawei Liu, Joel Lamy-Poirier, Julian McAuley, Leandro Von Werra, Lingming Zhang, Loubna Ben Allal, Lucas Krauß, Manan Dey, Marc Marone, Max Tian, Mayank Mishra, Megan Risdal, Mostofa Patwary, Muhtasham Oblokulov, Naman Jain, Nicolas Chapados, Nicolas Patry, Nii Osae Osae Dade, Niklas Muennighoff, Nima Tajbakhsh, Nouamane Tazi, Olivier Dehaene, Qian Liu, Raymond Li, Sean Hughes, Sebastien Paquet, Terry Yue Zhuo, Thomas Wolf, Tianyang Liu, Torsten Scholak, Tri Dao, Wen-Ding Li, Wenhao Yu, Xiangru Tang, Xuanli He, Yacine Jernite, Yekun Chai, Yixuan Su, Younes Belkada, Yuxiang Wei, Zhuang Li, Zijian Wang

Pith reviewed 2026-05-12 17:22 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code language models · large language models for code · training datasets · model performance benchmarks · data curation · code generation · programming languages

The pith

StarCoder2's 3B model outperforms the prior 15B StarCoderBase, while its 15B model matches or exceeds models more than twice its size on code tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new family of code language models trained on an expanded dataset four times larger than the previous version, compiled from a broad digital archive of source code across hundreds of languages plus selected additional sources. Through evaluation on a comprehensive set of benchmarks, it shows that the 3B parameter model surpasses other models of similar size and even the authors' own prior 15B model, while the 15B model significantly outperforms peers of comparable scale and matches or exceeds a model more than twice as large. The work also reports advantages over competitors on math and code reasoning tasks plus several low-resource languages. A sympathetic reader would care because these results indicate that careful expansion and curation of training data can produce more efficient and capable code models without requiring proportional increases in model size or compute.

Core claim

The authors train models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens from the expanded dataset. They find that the 3B model outperforms other code language models of similar size on most benchmarks and also outperforms the prior 15B base model. The 15B model significantly outperforms other models of comparable size. In addition, it matches or outperforms a model more than twice its size. It further outperforms competing models on math and code reasoning benchmarks as well as several low-resource languages.
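Benchmarks such as HumanEval and MBPP typically report pass@k estimated from n sampled generations per problem. The scores behind claims like these are usually computed with the standard unbiased estimator from Chen et al. (2021); this is a sketch of that estimator, not the paper's evaluation harness:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the tests. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: some sample passes
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g. 200 samples per problem, 53 correct: pass@1 is simply c/n
print(round(pass_at_k(200, 53, 1), 3))  # 0.265
```

For k=1 the estimator reduces to the raw accuracy c/n; larger k rewards models whose samples are diverse enough that at least one succeeds.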

What carries the argument

The Stack v2, the four-times-larger curated training dataset spanning hundreds of programming languages and additional high-quality sources that supplies the tokens for training the models.
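Curation pipelines in the Stack lineage are known to include near-deduplication of source files. A minimal MinHash sketch of that idea follows; the shingle size, the salted-hash trick standing in for permutations, and the toy snippets are illustrative assumptions, not the paper's exact pipeline:

```python
import hashlib
import re

def shingles(code: str, n: int = 5) -> set[str]:
    """Token n-grams of a source file; tokenizer and n are illustrative."""
    toks = re.findall(r"\w+", code)
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def minhash(sh: set[str], num_perm: int = 64) -> list[int]:
    """One min-hash per 'permutation', simulated by salting MD5 with a seed."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_perm)
    ]

def est_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching min-hashes estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

f1 = "def add(a, b):\n    return a + b"
f2 = "def add(x, y):\n    return x + y"
sim = est_jaccard(minhash(shingles(f1)), minhash(shingles(f2)))
# a pair would be dropped as near-duplicate if sim exceeds a chosen threshold
```

In production systems the signatures are bucketed with locality-sensitive hashing so that candidate pairs are found without comparing every file to every other file.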

If this is right

  • Smaller models achieving strong results lowers the resources needed to deploy capable code assistants.
  • Stronger performance on low-resource languages expands the reach of automated code tools to more programming contexts.
  • Gains on reasoning benchmarks suggest the models can handle hybrid coding and mathematical tasks more effectively.
  • Open release of weights and data identifiers enables independent verification and building on the results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If data quality drives the gains, future code models may achieve more by refining curation rather than scaling parameters alone.
  • The efficiency improvements could support wider use of code models in settings with limited compute or on personal devices.
  • Direct comparisons on tasks drawn from actual developer projects would test whether benchmark gains translate outside controlled evaluations.

Load-bearing premise

That the chosen benchmarks and data curation rules produce results that generalize to real developer workflows, and that no significant contamination or overlap exists between the training corpus and the evaluation sets.

What would settle it

Running the models on a new set of code completion and reasoning tasks drawn exclusively from private or post-training-cutoff sources with no possible overlap. If the reported advantages over prior models persist there, the generalization claim stands; if they disappear, contamination becomes the likelier explanation.

read the original abstract

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
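The SWHIDs named in the abstract are content-addressed: per the SWHID specification, the identifier of a file's contents reuses Git's blob hashing. A minimal sketch of computing one:

```python
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Content SWHID: SHA-1 over the Git blob header b"blob <len>\\0"
    followed by the raw bytes, prefixed with the swh:1:cnt: scheme."""
    header = f"blob {len(data)}\0".encode()
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

print(swhid_for_content(b"hello world\n"))
# the hex digest matches `git hash-object` run on the same bytes
```

Because the identifier is derived purely from the bytes, anyone holding the released SWHIDs can verify whether a given file was part of the training corpus.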

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces StarCoder2, a family of Code LLMs (3B, 7B, 15B parameters) trained on The Stack v2, a dataset 4x larger than the original Stack, constructed from Software Heritage archives across 619 languages plus curated GitHub PRs, Kaggle notebooks, and documentation. It reports that StarCoder2-3B outperforms other Code LLMs of similar size and even StarCoderBase-15B on most benchmarks, while StarCoder2-15B significantly outperforms comparable models and matches or exceeds CodeLlama-34B (more than twice its size), with additional strengths on math, reasoning, and low-resource languages. Model weights are released under OpenRAIL and training data transparency is provided via SWHIDs.

Significance. If the results hold, this advances open Code LLMs by showing that scaled, high-quality data curation enables smaller models to outperform larger predecessors, with full data transparency as a key strength. The release of SWHIDs and model weights supports reproducibility, and the consistent cross-benchmark outperformance (including against models like DeepSeekCoder-33B on specific tasks) provides concrete evidence for the value of the expanded corpus and training approach.

major comments (1)
  1. [Data Curation and Evaluation] The manuscript describes careful selection of additional data sources (GitHub PRs, Kaggle notebooks, documentation) and releases SWHIDs for the Software Heritage portion, but provides no explicit decontamination protocol, overlap statistics, or verification that benchmark problems (HumanEval, MBPP, DS-1000, etc.) are absent from the 4x larger training corpus. This is load-bearing for the central performance claims, such as StarCoder2-3B outperforming StarCoderBase-15B and StarCoder2-15B matching CodeLlama-34B, because even modest leakage could undermine the generalization interpretation.
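One common decontamination protocol of the kind asked for here is an exact n-gram overlap check between training documents and benchmark solutions. A minimal sketch follows; the 10-gram threshold and the toy data are illustrative assumptions, not the paper's (undocumented) procedure:

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams of a document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_solutions: list[str],
                    n: int = 10) -> bool:
    """Flag a training document sharing any n-gram with a benchmark
    solution; n=10 is an illustrative threshold."""
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(sol, n) for sol in benchmark_solutions)

# a training file containing a verbatim benchmark-style solution is flagged
bench = ["def has_close_elements ( numbers , threshold ) : "
         "for i , a in enumerate ( numbers )"]
train = "x = 1\n" + bench[0]
assert is_contaminated(train, bench)
assert not is_contaminated("print('hello')", bench)
```

Reporting the fraction of training documents flagged by such a check, per benchmark, would give exactly the overlap statistics the referee requests.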

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address the major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Data Curation and Evaluation] The manuscript describes careful selection of additional data sources (GitHub PRs, Kaggle notebooks, documentation) and releases SWHIDs for the Software Heritage portion, but provides no explicit decontamination protocol, overlap statistics, or verification that benchmark problems (HumanEval, MBPP, DS-1000, etc.) are absent from the 4x larger training corpus. This is load-bearing for the central performance claims, such as StarCoder2-3B outperforming StarCoderBase-15B and StarCoder2-15B matching CodeLlama-34B, because even modest leakage could undermine the generalization interpretation.

    Authors: We agree that explicit documentation of the decontamination protocol is essential to substantiate the generalization claims. The current manuscript prioritizes transparency via SWHIDs for the Software Heritage data, enabling external verification, but does not detail the steps taken to exclude benchmark contamination. We will add a dedicated subsection under Data Curation describing our decontamination procedure (including the methods used to detect and remove overlaps with HumanEval, MBPP, DS-1000, and related benchmarks) along with the resulting overlap statistics. This revision will directly address the load-bearing nature of the concern. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical dataset construction, training, and benchmarking

full rationale

The paper describes building The Stack v2 from Software Heritage archives plus selected sources, training 3B/7B/15B models on 3.3-4.3T tokens, and reporting benchmark scores on HumanEval, MBPP, DS-1000 and similar suites. No equations, derivations, or 'predictions' are claimed. Performance statements are direct outcomes of large-scale training runs evaluated on external benchmarks, not reductions of fitted parameters or self-citations. The central claims rest on independent empirical measurement rather than any self-referential logic or ansatz smuggled via prior work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Claims rest on the effectiveness of data selection from Software Heritage and other sources plus the validity of standard code benchmarks as proxies for capability.

free parameters (2)
  • Model parameter counts (3B, 7B, 15B)
    Chosen scales for exploring performance across sizes; not derived from data.
  • Training token volume (3.3-4.3 trillion)
    Determined by available curated data and compute budget.
axioms (1)
  • domain assumption The curated mix of Software Heritage repositories, GitHub pull requests, Kaggle notebooks, and documentation constitutes high-quality training data representative of real code.
    Stated as careful selection without quantitative validation of quality or contamination in the abstract.

pith-pipeline@v0.9.0 · 5900 in / 1279 out tokens · 92439 ms · 2026-05-12T17:22:30.549372+00:00 · methodology

discussion (0)


Forward citations

Cited by 30 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

    cs.AI 2026-04 unverdicted novelty 8.0

    HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

  2. Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks

    cs.SE 2026-04 conditional novelty 8.0

    The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.

  3. An Empirical Study of Speculative Decoding on Software Engineering Tasks

    cs.SE 2026-04 unverdicted novelty 7.0

    Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

  4. When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

  5. SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair

    cs.SE 2026-04 unverdicted novelty 7.0

    SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

  6. From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

    cs.CL 2026-04 unverdicted novelty 7.0

    SA-BPE regularizes standard BPE training for code by incorporating source diversity to skip problematic merges, substantially cutting unused tokens without altering inference.

  7. AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

    cs.SE 2026-04 unverdicted novelty 7.0

    AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

  8. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  9. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  10. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  11. SynConfRoute: Syntax-Aware Routing for Efficient Code Completion with Small CodeLLMs

    cs.SE 2026-05 unverdicted novelty 6.0

    SynConfRoute routes code completions using syntax validation and token confidence, improving pass@1 by up to 31% on hard tasks and reducing accelerator usage by 58% versus always using the largest model.

  12. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE 2026-05 accept novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

  13. Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

    cs.SE 2026-04 conditional novelty 6.0

    SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...

  14. A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

    cs.SE 2026-04 unverdicted novelty 6.0

    Metamorphic testing on Defects4J and GitBug-Java reveals substantial performance drops in seven LLMs that correlate with NLL, indicating data leakage in LLM-based program repair.

  15. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  16. LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL

    cs.AI 2026-04 unverdicted novelty 6.0

    Modular curriculum learning with tier-specific adapters outperforms standard fine-tuning on complex Text-to-SQL queries in Spider and BIRD benchmarks by avoiding catastrophic forgetting.

  17. CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.

  18. DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency

    cs.DB 2026-04 unverdicted novelty 6.0

    DPC selects correct text-to-SQL outputs by enforcing execution consistency between SQL and Python on an adversarially constructed minimal distinguishing database.

  19. Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.

  20. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  21. Automated Attention Pattern Discovery at Scale in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

  22. TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning

    cs.SE 2026-04 unverdicted novelty 6.0

    By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.

  23. A Taxonomy of Programming Languages for Code Generation

    cs.CL 2026-03 accept novelty 6.0

    The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.

  24. Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair

    cs.SE 2026-05 unverdicted novelty 5.0

    Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.

  25. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  26. Learning Generalizable Multimodal Representations for Software Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.

  27. PLMGH: What Matters in PLM-GNN Hybrids for Code Classification and Vulnerability Detection

    cs.SE 2026-04 unverdicted novelty 5.0

    Controlled experiments show PLM-GNN hybrids improve code tasks over GNN-only baselines, with PLM source having larger impact than GNN backbone.

  28. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  29. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  30. A Survey on Large Language Models for Code Generation

    cs.CL 2024-06 unverdicted novelty 3.0

    A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...

Reference graph

Works this paper leans on

232 extracted references · 232 canonical work pages · cited by 30 Pith papers · 17 internal anchors

  1. [1] Open science is a research accelerator. Nature Chemistry.

  2. [3] Unsupervised cross-lingual representation learning at scale.

  3. [9] Quantifying the Carbon Emissions of Machine Learning. arXiv preprint.

  4. [10] Referencing Source Code Artifacts: A Separate Concern in Software Citation. Computing in Science & Engineering, 22(2).

  5. [11] Multitask Prompted Training Enables Zero-Shot Task Generalization. International Conference on Learning Representations.

  6. [17] Crosslingual Generalization through Multitask Finetuning. arXiv preprint, 2022.

  7. [18] What language model to train if you have one million GPU hours?

  8. [20] BLOOM+1: Adding language support to BLOOM for zero-shot prompting.

  9. [23] PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 24(240).

  10. [26] The Stack: 3… Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Yacine Jernite, Margaret Mitchell, Carlos Mu… Transactions on Machine Learning Research.

  11. [28] Helping Code Reviewer Prioritize: Pinpointing Personal Data and Its Processing. Frontiers in Artificial Intelligence and Applications.

  12. [32] WizardCoder: Empowering Code Large Language Models with Evol-Instruct.

  13. [35] Auditing Large Language Models: A Three-Layered Approach. AI Ethics.

  14. [36] Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, Shayne Longpre. 2024.

  15. [39] Gemini: A family of highly capable multimodal models. arXiv preprint.

  16. [40] Red teaming… Terry Yue Zhuo, Yujin Huang, Chunyang Chen, Zhenchang Xing. 2023. arXiv:2301.12867.

  17. [41] Magicoder: Source Code Is All You Need.

  18. [45] Source Code Data Augmentation for Deep Learning: A Survey. arXiv:2305.19915.

  19. [61] Scaling Data-Constrained Language Models. Thirty-seventh Conference on Neural Information Processing Systems.

  20. [62] Generative Representational Instruction Tuning. arXiv preprint, 2024.

  21. [63] Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. arXiv preprint, 2024.

  22. [65] Measuring… Ziegler, Kalliamvakou, Li, Rice, Rifkin, Simister, Sittampalam, Aftandilian. Communications of the ACM, 2024. doi:10.1145/3633453.

  23. [72] Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models. arXiv preprint.

  24. [74] Identifying and filtering near-duplicate documents. Annual Symposium on Combinatorial Pattern Matching.

  25. [75] Lattner and Adve, 2004.

  26. [76] Adam:… Diederik P. Kingma and Jimmy Ba, 2015.

  27. [77] A comparative study of programming languages in… Nanz and Furia, 2015.

  28. [78] Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems.

  29. [79] Software Heritage: Why and How to Preserve Software Source Code. iPRES 2017: 14th International Conference on Digital Preservation.

  30. [81] Analysing Mathematical Reasoning Abilities of Neural Models. International Conference on Learning Representations.

  31. [88] Measuring Coding Challenge Competence With APPS. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

  32. [89] Jiang, Li, Han, and Wu, 2021.

  33. [90] Puri, Kung, Janssen, Zhang, Domeniconi, Zolotov, Dolby, Chen, Choudhury, Decker, Thost, Buratti, Pujar, Ramji, Finkler, Malaika, Reiss, 2021.

  34. [94] Akiki, Pistilli, Mieskes, Gall… Workshop on Broadening Research Collaborations 2022.

  35. [95] Dao, Fu, Ermon, Rudra, Ré. Advances in Neural Information Processing Systems.

  36. [96] Finetuned Language Models are Zero-Shot Learners. International Conference on Learning Representations.

  37. [97] Towards Openness Beyond Open Access: User Journeys through 3 Open… Ding, Akiki, Jernite, Steele, Popo, 2022.

  38. [98] Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. 2022 IEEE Symposium on Security and Privacy (SP).

  39. [99] Transformer Language Models without Positional Encodings Still Learn Positional Information. Findings of the Association for Computational Linguistics: EMNLP 2022. arXiv:2203.16634.

  40. [100] The Curious Case of Absolute Position Embeddings. Findings of the Association for Computational Linguistics: EMNLP 2022. arXiv:2210.12574.

  41. [101] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. Proceedings of the 40th International Conference on Machine Learning.

  42. [102] Nijkamp, Pang, Hayashi, Tu, Wang, Zhou, Savarese, Xiong, 2023.

  43. [103] Gao, Madaan, Zhou, Alon, Liu, Yang, Callan, Neubig, 2023. Proceedings of the 40th International Conference on Machine Learning.

  44. [104] The… Penedo, Malartic, Hesslow, Cojocaru, Alobeidli, Cappelli, Pannier, Almazrouei, Launay, 2023.

  45. [105] Lai, Li, Wang, Zhang, Zhong, Zettlemoyer, Yih, Fried, Wang, Yu, 2023. Proceedings of the 40th International Conference on Machine Learning.

  46. [106] Is Your Code Generated by Chat… Liu, Xia, Wang, Zhang, 2023.

  47. [107] Data Portraits: Recording Foundation Model Training Data. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  48. [108] Code Translation with Compiler Representations. The Eleventh International Conference on Learning Representations.

  49. [109] Ding, Wang, Ahmad, Ding, Tan, Jain, Ramanathan, Nallapati, Bhatia, Roth, Xiang, 2023.

  50. [114] Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions. The First International Workshop on Large Language Model for Code.

  51. [115] Tri Dao, 2024.

  52. [116] Lemur: Harmonizing Natural Language and Code for Language Agents. The Twelfth International Conference on Learning Representations.

  53. [117] Llemma: An Open Language Model for Mathematics. The Twelfth International Conference on Learning Representations.

  54. [118] Caballero, OpenAI, and Sutskever, 2016.

  55. [119] Codeforces: Results of 2020 [Annual Report].

  56. [121] doi:10.57967/hf/0003.

  57. [122] A framework for the evaluation of code generation models. GitHub repository.

  58. [123] A Hazard Analysis Framework for Code Synthesis Large Language Models. arXiv:2207.14157.

  59. [124] BigCode Model License Agreement.

  60. [125] Archival of Software Metadata.

  61. [126] SWH Statement on LLM for Code.

  62. [127] Big Code Models Leaderboard.

  63. [128] Go smol or go home.

  64. [129] Language Models as a Service: Overview of a New Paradigm and its Challenges.

  65. [130] Polymorphic Virus.

  66. [131] Chatting Our Way Into Creating a Polymorphic Malware.

  67. [132] Secure by Design.

  68. [133] The Generative World Order:…

  69. [134] Open Sourcing Highly Capable Foundation Models.

  70. [135] Introducing StarChat Alpha: A New Milestone in Conversational AI.

  71. [136] Summarization LLM: Enhancing Document Summarization with Large Language Models.

  72. [137] Bulk Access Terms of Use.

  73. [138] How LLM Adoption Has Impacted AI Job Roles.

  74. [139] Jobs of Tomorrow: Large Language Models and Jobs.

  75. [140] The Stack V2.

  76. [141] Models by BigCode, 2024.

  77. [145] Stable Code. Pinnaparaju, Adithyan, Phung, Tow, Baicoianu, Cooper, 2024.

  78. [146] Software Heritage Community.

  79. [147] New Frontiers: The Origins and Content of New Work, 1940–2018. doi:10.3386/w30389.

  80. [148] A digital signature based on a conventional encryption function. Conference on the Theory and Application of Cryptographic Techniques, 1987.

Showing first 80 references.