StarCoder: may the source be with you!
Pith reviewed 2026-05-10 23:27 UTC · model grok-4.3
The pith
A 15.5 billion parameter code model trained on a trillion tokens outperforms open multilingual alternatives and matches proprietary performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StarCoderBase is a 15.5B parameter model with 8K context length, trained on 1 trillion tokens sourced from The Stack, that outperforms every open Code LLM supporting multiple programming languages and matches or outperforms OpenAI's code-cushman-001. StarCoder, produced by fine-tuning StarCoderBase on 35B Python tokens, outperforms every Python-fine-tuned model, can be prompted to 40% pass@1 on HumanEval, and retains performance across other languages.
What carries the argument
The 15.5B parameter StarCoderBase transformer with multi-query attention and 8K context, trained on the large collection of permissively licensed repositories, which supplies the scale and data quality needed for strong cross-language code generation and infilling.
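Infilling here refers to fill-in-the-middle (FIM) prompting, where the model generates the span that belongs between a given prefix and suffix. A minimal sketch of how such a prompt is assembled; the sentinel token names follow the StarCoder tokenizer convention and should be treated as an assumption for any other model:

```python
# Sketch of fill-in-the-middle (FIM) prompt assembly for StarCoder-family
# models. The sentinel token strings below follow the StarCoder tokenizer
# convention; verify them against the tokenizer of the model you use.

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble an infilling prompt: the model is asked to generate the
    code that belongs between `prefix` and `suffix`."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    return ",
    suffix=" / len(xs)\n",
)
print(prompt)
```

The model's completion is then inserted between the prefix and suffix to form the final document.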
If this is right
- Developers can access and run a competitive code model locally or adapt it without API fees or usage limits.
- Specializing the model on one language through fine-tuning preserves capability on the remaining languages.
- Responsible release practices such as improved personal data removal and origin tracing become feasible at this scale.
- Prompt engineering can raise performance on targeted tasks like HumanEval without sacrificing breadth.
Where Pith is reading between the lines
- Larger open code datasets may become easier to assemble if more projects adopt permissive licenses.
- Integration into actual software development environments could expose practical limits not visible in isolated benchmarks.
- The same training approach might apply to other domains that produce large volumes of structured, publicly licensed text such as mathematical proofs or configuration files.
Load-bearing premise
The chosen benchmarks and prompting methods accurately reflect real code-generation utility across languages and tasks without hidden advantages from training data overlap or evaluation setup.
What would settle it
Evaluating the models on a fresh collection of coding problems assembled after the training data period and directly comparing pass rates against other open models and code-cushman-001 would confirm or refute the performance claims.
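Any such head-to-head pass-rate comparison would presumably use the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), which estimates the probability that at least one of k samples passes given n generations per problem. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples passes, given that c of the n generated samples
    pass all unit tests."""
    if n - c < k:
        # Fewer failing samples than k draws: some draw must succeed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 80 passing -> pass@1 = 80/200
print(pass_at_k(200, 80, 1))  # → 0.4
```

For k = 1 this reduces to the fraction of passing samples; for larger k it corrects the bias of naively taking the best of k draws.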
Original abstract
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40\% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StarCoderBase, a 15.5B parameter Code LLM trained on 1 trillion tokens from The Stack (permissively licensed GitHub data with opt-out tools), and StarCoder, obtained by fine-tuning StarCoderBase on 35B Python tokens. The central claims are that StarCoderBase outperforms every open multi-language Code LLM and matches or exceeds OpenAI's code-cushman-001, while StarCoder reaches 40% pass@1 on HumanEval, retains cross-language performance, and supports infilling with 8K context. The work also describes responsible release steps including improved PII redaction and an attribution tracing tool, with public model availability under a commercially viable open license.
Significance. If the performance claims hold under the reported evaluation protocol, this constitutes a substantial contribution by delivering a strong, openly available multi-language Code LLM that rivals proprietary models while emphasizing data transparency, reproducibility, and safety measures. The public release of models, data inspection tools, and the comprehensive benchmark suite can serve as a reference point for future open Code LLM research and responsible AI practices.
Minor comments (4)
- [Abstract] The claim of 'the most comprehensive evaluation of Code LLMs to date' would be strengthened by a short explicit comparison (in the introduction or the evaluation section) to the scope of prior Code LLM benchmarks such as those in the CodeGen or InCoder papers.
- [Model Architecture] Multi-query attention is credited with fast large-batch inference, but the description lacks implementation specifics (e.g., head grouping factor or kernel details) that would aid exact reproduction of the reported inference speeds.
- [Evaluation] While benchmark numbers are provided, variance estimates, number of runs, or statistical significance tests for key comparisons (e.g., vs. code-cushman-001 on HumanEval) would improve the robustness of the outperformance claims.
- [Data] The version or commit hash of The Stack used for the 1T-token training run should be stated explicitly to support exact reproducibility of the pre-training corpus.
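On the multi-query attention point: the idea is that all query heads share a single key/value head, which shrinks the KV cache and per-token memory traffic at inference time. A toy sketch of the mechanism; shapes and names are illustrative only, not the paper's implementation:

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, num_heads):
    """Toy multi-query attention: `num_heads` query heads attend over a
    single shared key/value head, shrinking the KV cache by a factor of
    num_heads. Shapes: x (T, d_model), Wq (d_model, num_heads*d_head),
    Wk and Wv (d_model, d_head)."""
    T, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(T, num_heads, d_head)   # per-head queries
    k = x @ Wk                                   # one shared key head
    v = x @ Wv                                   # one shared value head
    outs = []
    for h in range(num_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        outs.append(weights @ v)                 # every head reuses k, v
    return np.concatenate(outs, axis=-1)         # (T, num_heads*d_head)

rng = np.random.default_rng(0)
T, d_model, num_heads, d_head = 4, 8, 4, 2
x = rng.standard_normal((T, d_model))
out = multi_query_attention(
    x,
    rng.standard_normal((d_model, num_heads * d_head)),
    rng.standard_normal((d_model, d_head)),
    rng.standard_normal((d_model, d_head)),
    num_heads,
)
print(out.shape)  # → (4, 8)
```

Standard multi-head attention would instead give each head its own Wk and Wv, multiplying KV-cache size by num_heads.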
Simulated Author's Rebuttal
We thank the referee for their positive review, recognition of the significance of the work, and recommendation to accept the manuscript. We are pleased that the contributions around multi-language code modeling, responsible release practices, and comprehensive evaluation have been acknowledged.
Circularity Check
No significant circularity detected
Full rationale
The paper reports empirical training of StarCoderBase on 1T tokens from The Stack and fine-tuning to StarCoder, followed by benchmark evaluations on HumanEval and other standard code tasks. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or self-referential definitions appear in the abstract or described methodology. Performance claims rest on direct experimental results and external benchmarks rather than any reduction to inputs by construction. Self-citations, if present, are not load-bearing for any central claim and do not create circularity under the specified criteria.
Forward citations
Cited by 45 Pith papers
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
-
Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
-
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
-
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Frontier LLMs pass unit tests over 76% of the time on debugging tasks but achieve edit precision below 45%, indicating regeneration rather than precise debugging.
-
AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
-
CodeComp: Structural KV Cache Compression for Agentic Coding
CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...
-
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
Programmatic Context Augmentation for LLM-based Symbolic Regression
Programmatic context augmentation lets LLM-based symbolic regression perform code-driven data analysis during search, yielding superior efficiency and accuracy over baselines on LLM-SRBench.
-
DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion
DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.
-
BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis
BlenderRAG improves LLM-generated Blender code for 3D objects by retrieving semantically similar examples from a curated multimodal dataset of 500 expert-validated cases.
-
REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)
REBench is a new benchmark that consolidates existing datasets into a large collection of binaries with knowledge-base-driven ground truth to enable fair LLM evaluation on stripped-binary type and name recovery.
-
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
Optimas deploys a multi-agent LLM workflow to convert performance diagnostics into correct code transformations, delivering 100% valid code and performance gains in 98.82% of 3,410 experiments across benchmarks and HP...
-
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
-
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness...
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Runtime Execution Traces Guided Automated Program Repair with Multi-Agent Debate
TraceRepair deploys a probe agent for runtime snapshots and a committee of agents for cross-verification to fix 392 defects on Defects4J, outperforming prior LLM-based automated program repair methods.
-
A Taxonomy of Programming Languages for Code Generation
The researchers provide a systematic 4-tier classification of 646 programming languages, quantifying the extreme data scarcity facing over 70% of the world's programming languages in the age of LLMs.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
-
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
-
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
-
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
DeepSeek-Coder open-source models trained on 2T code tokens with fill-in-the-blank pretraining achieve SOTA results among open models and surpass closed-source Codex and GPT-3.5 on code benchmarks.
-
Prompt-Driven Code Summarization: A Systematic Literature Review
A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
-
Qwen2.5-Coder Technical Report
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark...
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Unified pre-training for program understanding and generation
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. In Proceedings of NAACL, 2021. URL https://aclanthology.org/2021.naacl-main.211
2021
-
[4]
Andersen et al. v. Stability AI et al. 3:23-cv-00201 N.D. Cal. 2023
2023
-
[7]
A maximum likelihood approach to continuous speech recognition
Lalit Bahl, Frederick Jelinek, and Robert Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5:179--190, 1983. doi:10.1109/TPAMI.1983.4767370
-
[9]
ChatGPT accessible again in Italy
BBC. ChatGPT accessible again in Italy. https://www.bbc.com/news/technology-65431914, 2023
2023
-
[10]
A framework for the evaluation of code generation models
Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro Von Werra. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, December 2022
2022
-
[11]
SantaCoder: don't reach for the stars!
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García ...
2023
-
[12]
A neural probabilistic language model
Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. In T. Leen, T. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper_files/paper/2000/hash/728f206c2a01bf572b5940d7d9a8fa4c-Abstract.html
2000
-
[14]
BLOOM (revision 4ab0472), 2022
BigScience Workshop . BLOOM (revision 4ab0472), 2022. URL https://huggingface.co/bigscience/bloom
2022
-
[15]
GPT-NeoX-20B: An open-source autoregressive language model
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: an open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022
-
[18]
Large language models in machine translation
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858--867, Prague, Czech Republic, June 2007. Association for Computati...
2007
-
[19]
Identifying and filtering near-duplicate documents
Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Annual Symposium on Combinatorial Pattern Matching, pp. 1--10. Springer, 2000
2000
-
[21]
N-gram counts and language models from the Common Crawl
Christian Buck, Kenneth Heafield, and Bas van Ooyen. N-gram counts and language models from the Common Crawl. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 3579--3584, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec...
2014
-
[22]
This CoPilot is stupid and wants to kill me
Matthew Butterick. This CoPilot is stupid and wants to kill me. https://matthewbutterick.com/chron/this-copilot-is-stupid-and-wants-to-kill-me.html, 2022
2022
-
[24]
Evaluating large language models trained on code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
2021
-
[27]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022
2022
-
[29]
DOE 1 v. GitHub, Inc.
DOE 1 v. GitHub, Inc. 4:22-cv-06823 N.D. Cal. 2022
2022
-
[30]
GPTs are GPTs: An early look at the labor market impact potential of large language models
Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130, 2023
-
[31]
Microsoft attracting users to its code-writing, generative AI software
Euronews. Microsoft attracting users to its code-writing, generative AI software. https://www.euronews.com/next/2023/01/25/microsoft-results-ai, 2023
2023
-
[32]
The general data protection regulation
European Council. The general data protection regulation. https://www.consilium.europa.eu/en/policies/data-protection/data-protection-regulation/, 2018
2018
-
[37]
PAL: Program-aided Language Models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022
2022
-
[39]
Scalable modified Kneser-Ney language model estimation
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 690--696, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL https://aclantholo...
2013
-
[42]
On the naturalness of software
Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In 2012 34th International Conference on Software Engineering (ICSE), pp. 837--847. IEEE, 2012
2012
-
[44]
The curious case of neural text degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH
2020
-
[45]
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019
2019
-
[48]
Learning and evaluating contextual embedding of source code
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020
2020
-
[51]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980
2015
-
[52]
The Stack: 3 TB of permissively licensed source code
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The Stack: 3 TB of permissively licensed source code. Preprint, 2022. URL https://arxiv.org/abs/2211.15533
-
[53]
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022
2022
-
[54]
If software is my copilot, who programmed my software?
Bradley M. Kuhn. If software is my copilot, who programmed my software? https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/, 2022
2022
-
[57]
DS-1000: a natural and reliable benchmark for data science code generation
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: a natural and reliable benchmark for data science code generation. ArXiv, abs/2211.11501, 2022
-
[58]
Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, number 2, pp. 896, 2013
2013
-
[59]
Comparing code explanations created by students and large language models, 2023
Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. Comparing code explanations created by students and large language models, 2023
2023
-
[60]
Fair learning
Mark A. Lemley and Bryan Casey. Fair learning. Tex. L. Rev., 99:743, 2020. URL https://texaslawreview.org/fair-learning/
2020
-
[61]
How copyright law can fix artificial intelligence's implicit bias problem
Amanda Levendowski. How copyright law can fix artificial intelligence's implicit bias problem. Wash. L. Rev., 93:579, 2018
2018
-
[64]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019
2019
-
[65]
Unpicking the rules shaping generative AI
Natasha Lomas. Unpicking the rules shaping generative AI. https://techcrunch.com/2023/04/13/generative-ai-gdpr-enforcement/, 2023
2023
-
[66]
CodeXGLUE: A machine learning benchmark dataset for code understanding and generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. CodeXGLUE: A machine learning benchmark dataset for code understanding ...
-
[70]
Using in-context learning to improve dialogue safety, February 2023
Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Hakkani-Tür. Using in-context learning to improve dialogue safety, February 2023. URL http://arxiv.org/abs/2302.00871. arXiv:2302.00871 [cs]
-
[71]
Recurrent neural network based language model
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Takao Kobayashi, Keikichi Hirose, and Satoshi Nakamura (eds.), INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010...
2010
-
[78]
CodeGen: an open large language model for code with multi-turn program synthesis
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: an open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_
2023
-
[79]
In-context learning and induction heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...
2022
-
[81]
GPT-4 system card
OpenAI. GPT-4 system card. https://cdn.openai.com/papers/gpt-4-system-card.pdf, 2023
2023
-
[83]
Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions
Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In IEEE Symposium on Security and Privacy, San Francisco, CA, 2022. URL https://arxiv.org/abs/2108.09293
-
[87]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
2019
[88] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
[89] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[91] John A. Rothchild and Daniel Rothchild. Copyright implications of the use of code repositories to train a machine learning model. https://www.fsf.org/licensing/copilot/copyright-implications-of-the-use-of-code-repositories-to-train-a-machine-learning-model, 2022.
[92] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. Lost at C: A user study on the security implications of large language model code assistants, 2023.
[95] Arfon Smith. Making open source data more available. https://github.blog/2016-06-29-making-open-source-data-more-available/, 2016.
[96] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
[97] Irene Solaiman. The gradient of generative AI release: Methods and considerations. arXiv preprint arXiv:2302.04844, 2023.
[99] Clive Thompson. How an AI became my code-writing genie, March 2022. URL https://www.wired.com/story/openai-copilot-autocomplete-for-code/
[101] Julian Togelius and Georgios N. Yannakakis. Choose your weapon: Survival strategies for depressed AI academics. arXiv preprint arXiv:2304.06035, 2023.
[102] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[103] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
[105] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning, 2023.
[106] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021.
[109] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
[111] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020.
[112] World Economic Forum. Future of jobs report. https://www3.weforum.org/docs/WEF_Future_of_Jobs_2023.pdf, 2023.
[114] Ming-Ho Yee and Arjun Guha. Do machine learning models produce TypeScript types that type check? In European Conference on Object-Oriented Programming (ECOOP), 2023.
[115] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[116] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[117] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568, 2023. doi:10.48550/arXiv.2303.17568.
[118] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[119] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. doi:10.18653/v1/N19-1423.
[120] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[121] Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
[122] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
[123] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[124] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023.
[125] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568, 2023. doi:10.48550/arXiv.2303.17568.
[126] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
[127] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
[128] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
[129] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
[130] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2022. doi:10.48550/arXiv.2204.05999.
[131] Fenia Christopoulou, Gerasimos Lampouras, Milan Gritta, Guchun Zhang, Yinpeng Guo, Zhongqi Li, Qi Zhang, Meng Xiao, Bo Shen, Lin Li, Hao Yu, Li Yan, Pingyi Zhou, Xin Wang, Yuchi Ma, Ignacio Iacobacci, Yasheng Wang, Guangtai Liang, Jiansheng Wei, Xin Jiang, Qianxiang Wang, et al. PanGu-Coder: Program synthesis with function-level language modeling. arXiv preprint arXiv:2207.11280, 2022.
[132] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, et al. Competition-level code generation with AlphaCode. arXiv preprint arXiv:2203.07814, 2022.
[133] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.