KernelBench: Can LLMs Write Efficient GPU Kernels?
Pith reviewed 2026-05-15 16:50 UTC · model grok-4.3
The pith
Language models match PyTorch GPU kernel performance in fewer than 20 percent of cases
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KernelBench evaluates language models on writing efficient GPU kernels for 250 real PyTorch ML workloads in a setting that mirrors production engineering needs. Frontier reasoning models achieve the highest out-of-the-box success rates but still produce kernels that are correct and faster than the PyTorch baseline in under 20 percent of cases. Iterative refinement that incorporates runtime execution and profiling feedback improves results, yet the benchmark becomes substantially harder as the required speedup threshold p is raised.
What carries the argument
KernelBench, a suite of 250 PyTorch workloads, together with the fast_p metric, which reports the share of generated kernels that are functionally correct and exceed a speedup threshold p over the PyTorch baseline.
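For concreteness, here is a minimal sketch of how fast_p could be computed from per-task results. The KernelResult record, the harness, and the example numbers are illustrative assumptions, not the benchmark's released code:

```python
from dataclasses import dataclass

@dataclass
class KernelResult:
    # Hypothetical per-task record: did the generated kernel pass the
    # functional-correctness checks, and what speedup did it achieve
    # over the PyTorch baseline (baseline_time / kernel_time)?
    correct: bool
    speedup: float

def fast_p(results: list[KernelResult], p: float) -> float:
    """Fraction of tasks whose kernel is correct AND faster than p x baseline."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r.correct and r.speedup > p)
    return wins / len(results)

# Example: fast_0 counts any correct kernel; fast_1 requires beating PyTorch.
results = [KernelResult(True, 1.4), KernelResult(True, 0.7), KernelResult(False, 2.0)]
print(fast_p(results, p=0.0))  # 2/3: correct kernels regardless of speed
print(fast_p(results, p=1.0))  # 1/3: correct and faster than the baseline
```

Raising p tightens the win condition, which is why the benchmark's reported difficulty grows with the threshold.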
If this is right
- Progress on KernelBench directly translates into faster practical kernels for machine learning systems
- Iterative refinement that uses execution and profiling feedback raises the number of successful kernels (see the loop sketch after this list)
- Raising the speedup threshold p increases the difficulty of the benchmark for all tested models
- Frontier reasoning models achieve the best performance when generating kernels without extra techniques
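A minimal sketch of the iterative-refinement loop referenced above. The generate and evaluate callables and the feedback format are stand-ins for whatever harness a user supplies, not KernelBench's actual implementation:

```python
from typing import Callable, Optional

def refine_kernel(
    generate: Callable[[Optional[dict]], str],  # feedback -> new kernel source
    evaluate: Callable[[str], dict],            # source -> {"correct", "speedup", "log"}
    max_iters: int = 5,
) -> Optional[dict]:
    """Keep the best correct kernel seen; feed each attempt's execution and
    profiling log back into the next generation call."""
    feedback: Optional[dict] = None
    best: Optional[dict] = None
    for _ in range(max_iters):
        src = generate(feedback)
        result = evaluate(src)  # compile, run correctness tests, profile
        if result["correct"] and (best is None or result["speedup"] > best["speedup"]):
            best = result
        # Fold runtime signals into the next attempt's prompt.
        feedback = {"source": src, "log": result["log"]}
    return best
```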
Where Pith is reading between the lines
- Models may close more of the gap if trained on larger corpora of low-level GPU code
- The benchmark could test whether new test-time search methods outperform simple iterative feedback
- Wider use of such evaluation suites might reduce dependence on manual kernel tuning in ML development
Load-bearing premise
The 250 selected workloads are representative of the kernels that matter most in current and near-future ML systems
What would settle it
An LLM that produces functionally correct kernels with a speedup greater than 1x over PyTorch (i.e., fast_1) on more than 30 percent of the 250 workloads would contradict the reported overall shortfall
Original abstract
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KernelBench, an open-source evaluation framework consisting of 250 carefully selected PyTorch ML workloads to measure LLMs' ability to generate functionally correct and high-performance GPU kernels. It defines the fast_p metric (percentage of kernels that are correct and exceed a tunable speedup threshold p over a PyTorch baseline) and reports that frontier reasoning models achieve the highest out-of-the-box success but still match the baseline in fewer than 20% of cases, with modest gains from iterative refinement that incorporates execution and profiling feedback.
Significance. If the workload suite is representative, the results establish a clear, reproducible baseline showing that current LLMs remain far from replacing expert kernel engineering for real ML systems. The open release of the benchmark, the fast_p metric, and the empirical comparison across multiple models and test-time strategies constitute a concrete contribution that can guide subsequent work on execution-aware code generation.
Major comments (1)
- [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.
Minor comments (2)
- [Abstract and Evaluation] Abstract and §4: the description of functional-correctness verification for fast_p should explicitly state the test harness, numerical tolerance, and failure modes considered; a sketch of such a tolerance check follows this list.
- [Results] Figure and table captions: ensure every speedup plot and table reports the exact value of p used and the number of samples per model.
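For concreteness, a tolerance-based correctness check of the kind the first minor comment asks the authors to document might look like the following. The tolerance values, trial count, and random-input scheme here are illustrative assumptions, not the paper's specified harness:

```python
import torch

def kernels_match(candidate, reference, input_shapes, tol=1e-2, trials=5) -> bool:
    """Compare a generated kernel against the PyTorch reference on random
    inputs, within an explicitly stated numerical tolerance."""
    for _ in range(trials):
        inputs = [torch.randn(*shape, device="cuda") for shape in input_shapes]
        out_ref = reference(*inputs)
        out_new = candidate(*inputs)
        # The atol/rtol values are exactly the parameters the review comment
        # asks to be reported alongside the results.
        if not torch.allclose(out_ref, out_new, atol=tol, rtol=tol):
            return False
    return True

# Usage sketch: verify a candidate elementwise kernel against torch.relu.
# ok = kernels_match(my_generated_relu, torch.relu, input_shapes=[(1024, 1024)])
```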
Simulated Author's Rebuttal
Thank you for the constructive review. We address the major comment on workload curation below and have updated the manuscript to include additional quantitative details on the benchmark suite.
Point-by-point responses
-
Referee: [Workload curation] Workload curation section: the claim that the 250 workloads are 'carefully selected' from PyTorch ML code and that progress on KernelBench 'directly translates to faster practical kernels' is not supported by any quantitative breakdown (operation-class distribution, model-family coverage, or comparison against production traces or MLPerf). Without this evidence the headline result (<20% baseline-matching rate) cannot be read as a general statement about LLM performance on kernels that matter in current systems.
Authors: We agree that a quantitative breakdown strengthens the claims. In the revised manuscript we have expanded the Workload Curation section with: (1) an operation-class distribution table (e.g., GEMM 38%, convolution 27%, elementwise 18%, reduction 12%, other 5%), (2) model-family coverage (ResNet/VGG 22%, Transformer 35%, diffusion 15%, other vision/language 28%), and (3) a brief comparison to MLPerf and public production traces showing substantial overlap in dominant operations. We have also revised the abstract and introduction to state that progress on KernelBench is expected to translate to practical kernels for workloads of similar structure rather than claiming universal applicability. These additions allow readers to assess representativeness directly.
Revision: yes
Circularity Check
No circularity; purely empirical benchmark with external baseline comparison.
Full rationale
The paper introduces KernelBench as an empirical evaluation suite of 250 PyTorch workloads, measuring LLM-generated kernels against a fixed external PyTorch baseline using the fast_p metric. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The central result (frontier models match baseline in <20% of cases) is a direct count from execution, not reduced by construction to any input definition or prior self-citation. Representativeness of workloads is an external-validity issue, not a circularity flaw in any derivation.
Forward citations
Cited by 21 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBench-X benchmark shows task category predicts LLM kernel correctness better than method choice, iterative refinement trades performance for higher success rates, and correctness does not ensure efficiency gains...
-
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBenchX benchmark shows task category explains nearly three times more variance in LLM kernel correctness than method choice, iterative refinement boosts correctness but reduces performance, and quantization rema...
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kerncap automatically extracts isolated, reproducible GPU kernels from large HIP and Triton applications on AMD GPUs by capturing HSA dispatches and producing self-contained reproducer projects that preserve virtual-a...
-
Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs
Kerncap automates extraction of faithful, self-contained GPU kernel reproducers from AMD HIP and Triton workloads via HSA interception and address-space closure, delivering 13.6x faster isolated tuning.
-
FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow
FACT is a three-stage agent-driven system that synthesizes and composes CUTLASS kernels from PyTorch modules, achieving up to 2.03x speedup on transformer blocks over PyTorch and competing optimizers.
-
Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.
-
SkillEvolver: Skill Learning as a Meta-Skill
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
-
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
Optimas deploys a multi-agent LLM workflow to convert performance diagnostics into correct code transformations, delivering 100% valid code and performance gains in 98.82% of 3,410 experiments across benchmarks and HP...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation
AdaExplore improves correctness and speed of Triton kernel generation by converting recurring failures into a memory of rules and organizing search as a tree that mixes local refinements with larger regenerations, yie...
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
AI-Driven Research for Databases
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
-
Benchmarking Compound AI Applications for Hardware-Software Co-Design
Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.
Reference graph
Works this paper leans on
- [1] Apple. Apple ML compute framework (MLX), 2020. URL https://developer.apple.com/metal/
- [2] Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. International Conference on Machine Learning, 2024.
- [3] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787
- [4] Cerebras. Cerebras wafer-scale engine (WSE) architecture. Online. https://cerebras.ai/product-chip/
- [5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, et al. Evaluating large language models trained on code, 2021.
- [6] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. International Conference on Learning Representations, 2024.
- [7] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. International Conference on Machine Learning (ICML), 2024.
- [8] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- [9] DeepSeek-AI. DeepSeek-V3 technical report, 2025. URL https://github.com/deepseek-ai/DeepSeek-V3
- [10] Graphcore. Graphcore IPU architecture. Online. https://www.graphcore.ai/products/ipu
- [11]
- [12] Dejan Grubisic, Chris Cummins, Volker Seeker, and Hugh Leather. Priority sampling of large language models for compilers, 2024. URL https://arxiv.org/abs/2402.18734
- [13] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs), 2023. URL https://arxiv.org/abs/1606.08415
- [14] Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and David Patterson. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings, 2023. URL https://arxiv.org/abs/2304.01433
- [15] Peter Kim. FlashAttention minimal. Online, 2024. https://github.com/tspeterkim/flash-attention-minimal
- [16] Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, and Harm de Vries. The Stack: 3 TB of permissively licensed source code, 2022. URL https://arxiv.org/abs/2211.15533
- [17] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. DS-1000: A natural and reliable benchmark for data science code generation, 2022. URL https://arxiv.org/abs/2211.11501
- [18] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, et al. StarCoder: may the source be with you!, 2023.
- [19] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, et al. Competition-level code generation with AlphaCode. Science, 2022.
- [20] Christian J. Mills. CUDA MODE notes - lecture 004. Online, 2024. https://christianjmills.com/posts/cuda-mode-notes/lecture-004/
- [21] Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. Performance-aligned LLMs for generating fast code, 2024. URL https://arxiv.org/abs/2404.18864
- [22] NVIDIA. cuDNN: GPU-accelerated library for deep neural networks, 2014. URL https://developer.nvidia.com/cudnn
- [23] NVIDIA. CUDA templates for linear algebra subroutines, 2017. URL https://github.com/NVIDIA/cutlass
- [24] NVIDIA. NVIDIA Tesla V100 GPU architecture, 2017.
- [25] NVIDIA. NVIDIA A100 Tensor Core GPU architecture, 2020.
- [26] NVIDIA. NVIDIA H100 Tensor Core GPU architecture, 2022.
- [27]
- [28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
- [29] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, et al. RWKV: Reinventing RNNs for the transformer era. Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [30] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024. URL https://arxiv.org/abs/2407.08608
- [31] Quan Shi, Michael Tang, Karthik Narasimhan, and Shunyu Yao. Can language models solve olympiad programming?, 2024. URL https://arxiv.org/abs/2404.10952
- [32] Benjamin Spector, Simran Arora, Aaryan Singhal, Daniel Fu, and Christopher Ré. ThunderKittens: Simple, fast, and adorable AI kernels. International Conference on Learning Representations (ICLR), 2024.
- [33] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
- [34] Team PyTorch, Horace He, Driss Guessous, Yanbo Liang, and Joy Dong. FlexAttention: The flexibility of PyTorch with the performance of FlashAttention, 2024. URL https://pytorch.org/blog/flexattention/
- [35] Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed, Amir Yazdanbakhsh, and Ali Jannesari. CodeRosetta: Pushing the boundaries of unsupervised code translation for parallel programming. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=V6hrg4O9gg
- [36] Philippe Tillet, H. T. Kung, and David Cox. Triton: An intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019.
- [37] Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42):230–265, 1936. URL http://www.cs.helsinki.fi/u/gionis/cc05/OnComputableNumbers.pdf
- [38] Pedro Valero-Lara, Alexis Huante, Mustafa Al Lail, William F. Godoy, Keita Teranishi, Prasanna Balaprakash, and Jeffrey S. Vetter. Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation, 2023. URL https://arxiv.org/abs/2309.07103
- [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
- [40] Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, and Daniel Fried. ECCO: Can we improve model-generated code efficiency without sacrificing functional correctness? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15362–15376, Miami, Florida, USA, 2024.
- [41] Yuanbo Wen, Qi Guo, Qiang Fu, Xiaqing Li, Jianxing Xu, Yanlin Tang, Yongwei Zhao, Xing Hu, Zidong Du, Ling Li, Chao Wang, Xuehai Zhou, and Yunji Chen. BabelTower: Learning to auto-parallelized program translation. In Proceedings of the 39th International Conference on Machine Learning, 2022.
- [42] Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, and Elizabeth Barnes. RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts, 2024.
- [43] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv:2405.15793, 2024.
- [44] John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. SWE-bench Multimodal: Do AI systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859
- [45] Songlin Yang and Yu Zhang. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/sustcsonglin/flash-linear-attention
- [46] Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, and Charles Sutton. Natural language to code generation in interactive data science notebooks, 2022. URL https://arxiv.org/abs/2212.09248