{"work":{"id":"a671e43f-ceab-49e7-adc3-473d802a97ca","openalex_id":null,"doi":null,"arxiv_id":"2410.07095","raw_key":null,"title":"MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering","authors":null,"authors_text":"Chan, Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, et al","year":2024,"venue":"cs.CL","abstract":"We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.","external_url":"https://arxiv.org/abs/2410.07095","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T22:23:15.654946+00:00","pith_arxiv_id":"2410.07095","created_at":"2026-05-09T05:55:29.693713+00:00","updated_at":"2026-05-14T22:23:15.654946+00:00","title_quality_ok":true,"display_title":"Mle-bench: Evaluating machine learning agents on machine learning engineering","render_title":"Mle-bench: Evaluating machine learning agents on machine learning engineering"},"hub":{"state":{"work_id":"a671e43f-ceab-49e7-adc3-473d802a97ca","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":23,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2025-01-24T05:27:46+00:00","last_pith_cited_at":"2026-05-13T15:00:29+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T00:16:14.730832+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":1},{"context_polarity":"unclear","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}