arxiv: 2501.14249 · v10 · submitted 2025-01-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

Humanity's Last Exam

Aakaash Nattanmai, Aaron Kirtland, Aarush Sinha, Abdallah Galal, Abdelkader Dendane, Abdurrahim Yilmaz, Abhijeet Saha, Abhishek Shukla, Abram Jackson, Adam Bouyamourn, Adam Jones, Adam Khoja, Adam Wecker, Adam Zweiger, Adithya Shenoy, Aditya Malusare, Adrian Cosma, Advaith Avadhanam, Ahmad Sakor, Ahmed Elkhanany, Ahmed Menshawy, Aidan Wu, Alan Givré, Alan Goldfarb, Alan Zhou, Alejandro José Moyano, Aleksandar Mikov, Aleksandr Maksapetyan, Aleksey Kuchkin, Alena Friedrich, Alesia Yakimchyk, Alessandro Stolfo, Alessandro Tomasiello, Alexander Ivanov, Alexander Piperski, Alexander Pondaven, Alexander Shen, Alexandra Rodriguez-Romero, Alexandre Oliveira Arrais, Alexandr Wang, Alexei Kopylov, Alexey Pronin, Alex Hoover, Alexis C Garretson, Alex Meiburg, Alex Slen, Alex Zhang, Alham Fikri Aji, Ali Anil Demircali, Alice Bizeul, Alice Gatti, Ali Dasouqi, Ali Dehghan, Ali ElSheikh, Ali Karakoc, Ali Khajegili Mirabadi, Ali M. R. Minissi, Alina Borisovna Zhidkovskaya, Aline Menezes, Allen Baranov, Allen G Hart, Allen Zang, Allison Tee, Alon Amit, Alon Ragoler, Alun Cennyth Stokes, Alvaro Sanchez, Alvin Jin, Ameya Prabhu, Amin Shabani, Andrea Achilleos, Andrea Caciolai, Andres Algaba, Andres M Bran, Andrew Favre D.O., Andrew Gritsevskiy, Andrew Ho, Andrew Le, Andrew Redenti, Andrew R. Tawfeek, Andrey Pupasov Maksimov, Andy Zou, Angela Hammon, Angel Ramirez-Trinidad, Anh N. Nhu, Anil Radhakrishnan, Anish Agrawal, Anish Cheraku, Anjiang Wei, Anji Zhang, Anka Reuel, Ankit Agrawal, Ankit Singh, Anmol Sahu, Anna-Katharina Dick, Anna Liakhovitskaia, Anna Plassart, Anna Sztyber-Betley, Anthony Gitter, Antoine Jallon, Antoine Moulin, Antonella Pinto, Antonio A. W. L. Wong, Antonio Franca, Antonio Terpin, Anton Peristyy, Antrell Cheatom, Anupam Nayak, Anwith Telluri, Aras Bacho, Archan Sen, Archimedes Apronti, Ariel Ghislain Kemogne Kamdoum, Arif Engin Demircali, Arina Kharlamova, Arkil Patel, Armel Randy Zebaze, Arnav Chopra, Arshad Anil Fasiludeen, Artem Gazizov, Artem Lukoianov, Arunim Agarwal, Arun Rao, Aryan Singh, Asankhaya Sharma, Ashley Aaron, Ashley Cartwright, Ashley Zhang, Asim Suhail, Assaf Brown, Atak Talay Yücel, Atharv Singh Patlan, Avi Semler, Avishy Carmi, Aymeric Dieuleveut, Barbara Dworakowska, Behzad Ansarinejad, Benedito Alves de Oliveira Junior, Benjámin Borbás, Benjamin Myklebust, Ben McCarty, Ben Pageler, Ben Racz, Ben Rank, Ben Segev, Ben Wu, Ben Zhao, Bikun Li, Bingchen Zhao, Bingsen Chen, Biró Bálint, Bita Golshani, Blake Sims, Bonan Pu, Boyi Wei, Brad Ma, Brad Raynor, Brandon Christof, Brecht Verbeken, Brian Amaro, Brian P Coppola, Brian Rabern, Brian Weber, Bruno Hebling Vieira, Bryan Johnson, Carl J Fossum, Carlo Bosio, Caroline Geirhos, Carter Harris, Cary Friday, Cedegao E. Zhang, Cesare Giulio Ardito, Changhao Li, Chao Zhuang, Chelsea Zou, Chen Bo Calvin Zhang, Chenguang Wang, Chenkai Sun, Chiara Ceconello, Chidozie Agu, Chris G. Willcocks, Chris Harjadi, Christian Schroeder de Witt, Christian Stump, Christoph Demian, Christopher R. Scotese, Christopher Toukmaji, Christopher W. Bartlett, Chuanyang Jin, Ciprian Manolescu, Claas Beger, Claire Sparrow, Clark Peng, Claudio Di Fratta, Colin Ni, Colin Tang, Colin White, Core Francisco Park, Costin Cozianu, Daattavya Aggarwal, Dae Hyun Kim, Dakotah Martinez, Damien Sileo, Dan Bar Hava, Dan Hendrycks, Dan Hoyer, Daniel Bugas, Daniel Espinosa Gonzalez, Dániel Kondor, Daniel Munro, Daniel Pyda, Daniel Tordera, Daniil Orel, Daniil S. Antonenko, Danyelle Ferreira, Daofeng Li, Daphiny Pottmaier, Dario Abbondanza, Dario Bezzi, Darling Duclosel, Daron Anderson, Daryl Echeazu, Dashiell Stander, Dave Hulbert, David Aldous, David Anugraha, David Avagian, Davide Manini, Davide Scaramuzza, David Holmes, David K. Zhang, David M. Cunningham, David Noever, David Outevsky, David Perrella, David (Quod) Soler Bartomeu, David Stap, David Sun, David Zhang, Dawn Song, Declan Grabb, Deepakkumar Patil, Deepayan Banik, Demosthenes Patramanis, Denis Efremov, Denis Peskoff, Derek Lim, Diana T. Pham, Dianzhuo Wang, Dimitri Zvonkine, Dingsu Wang, Diogo M. Caetano, Dmitry Dodonov, Dmitry Kazakov, Dmitry Malishev, Dominic Williamson, Donato Crisostomi, Don Clarke, Doru Cojoc, D.P. Shinde, Duarte V. Gonçalves, Dustin Wehr, Dylan Ler, Earth Anderson, Ed Chalstrey, Edoardo M. Ponti, Edson Oliveira, Edward Vendrow, Edwin Taylor, Eeshaan Jain, Egor Kretov, Eli Meril, Elizabeth Kelley, Elliott Thornley, Emanuele Rodolà, Emilien Duc, Emil Verkama, Emily de Oliveira Santos, Emma Rodman, Erica Weng, Eric Chu, Eric Hallman, Eric Singer, Eric Vergo, Eric Zheng, Erik Maung, Erik Y. Wang, Eshawn Jessica Scipio, Ethan Delaney, Ethan D. L. Brown, Ethan Luo, Eunmi Yu, Evan Chen, Evan Fu, Evan Kim, Eve J. Y. Lo, Evgenii Zheltonozhskii, Fabian Giska, Fanfei Li, Faraz Farhidi, Farzad Habibi, Fatimah Adesanya, Felipe Meneguitti Dias, Felix Juefei-Xu, Ferenc Jeanplong, Fereshteh Kazemi, Filippo Bigi, Fiona Feng, Firuz Kamalov, Florencia de la Rosa, Forough Mohammadzadeh, Fortuna Samuele, Francesco Fournier-Facio, Francesco Pinto, Francisco-Javier Rodrigo-Ginés, Franck Dernoncourt, Frank Reidegeld, Frank Sommerhage, Freddie Martin, Freddie Vargus, Fredrik Ekström, Gabe Maayan, Gabriele Sarti, Gabriel Loiseau, Gabriel Poesia Reis e Silva, Gabriel Recchia, Gaël Gendron, Gang Zhang, Gaoxiang Luo, Gashaw M. Goshu, Gautier Abou Loume, Gavin Wang, Gbenga Daniel Obikoya, G. Bruno De Luca, Genghan Zhang, Genta Indra Winata, Geoff Galgon, George Balabanian, George Medley, Gerben Sewuster, Gerol Petruzella, Glen Sherman, Glib Briia, Gongbo Sun, Gordon McKellips, Gözdenur Demir, Greg Bateman, Grzegorz Luczyna, Guangyao Zheng, Guglielmo Albani, Guilherme Maximiano, Guillaume Douville, Guillaume Malod, Gunjan Chhablani, Haile Kassahun, Hailey Schoelkopf, Haline Heidinger, Hamid Mostaghimi, Hanchen Li, Handoko, Hangrui Cao, Han Lin, Hanmeng Xu, Hannah Szlyk, Hans Gundlach, Haocheng Xi, Hao He, Haon Park, Hao Qi, Haoqin Tu, Haoran Qiu, Haoran Zhao, Haoxuan Chen, Hao-Yu Sun, Harrison K Wang, Harsh Kumar, Hassan Shapourian, Ha Thi Hoang, Hector Haffenden, Henry Tang, Hew Wolff, Hieu Hoang, Hieu Nguyen, Hieu Tran, Himanshu Gupta, Himanshu Narayan, Hodjat Mariji, Honglu Fan, Hongsen Qin, Hongzheng Chen, Hossam Elgnainy, Hossein Shahrtash, Hsiaoyun Milliron, Huanxu (Quinn) Liu, Hubert Yang, Hubeyb Gurdogan, Hugh Zhang, Hugo Lunn, Hunar Batra, Hu Shiyu, Hyunjun Kim, Hyun Kyu Park, Hyunwoo Park, Ida Bosio, Ido Akov, Ignacio D. Lopez-Miguel, Ignat Soroko, Igor Chernyavsky, Ilias Magoulas, Ilia Sucholutsky, Ilya Gusev, Imad Ali Shah, I.M.J. McInnis, Immo Klose, Innocent Enyekwe, Ioannis Pantidis, Isaac C. McAlister, Isaac Park, Isha Gupta, Ismail Alarab, Ivan Dewerpe, Ivan Fosin, Ivan Rannev, Ivar Ängquist, Jacek Karwowski, Jack Lindsey, Jack Stade, Jack Wei Lun Shi, Jacob Drori, Jacob Loader, Jacob Platnick, Jacob Votava, Jaeho Lee, Jaehyeok Jin, Jae-Won Chung, Jainam Shah, Jakob Hauser, Jakob Zsambok, Jakub Łucki, James Bailey, James Koppel, James Leung, Jamie Tucker-Foltz, Jan Hendrik Kirchner, Jasdeep Sidhu, Jason Gross, Jason Hausenloy, Jason Luo, Jason O. Matos, Jason Poulos, Javier Gimenez, Jay Paek, Jayson Lynch, Jean-Christophe Mourrat, Jean Kaddour, Jeffery Li, Jeff J. Ma, Jennifer Sandlin, Jennifer Zampese, Jenny Reddish, Jeremiah Milbauer, Jérémy Andréoletti, Jeremy Nguyen, Jessica P. Wang, Jesus Colino, Jesyin Lai, Jiachen Liu, Jiale Chen, Jiang Muzhen, Jiangnan Xu, Jianxin Wang, Jianzhu Yao, Jiaqi Cai, Jiaqi Deng, Jiaqi Wang, Jiawei Shen, Jiaxin Ge, Jiaxuan Wu, Jiayi Pan, Jichao Fang, Jing Fan, Jingxuan Fan, Jinzhou Yang, Joanna Tam, Joan of Arc Xavier, Johan Ferret, Johannes Lengler, Johannes Schmitt, Johannes Veith, John Arnold Ambay, Johnathan Morris, John B. Wydallis, John-Clark Levin, John Lai, John Ling, John Maar, Jonathan Crozier, Jonathan Eicher, Jonathan Roberts, Jonathon Kean, Jongee Park, Jorge Chamorro-Padial, Jorge Pretel Villanueva, Jorge Sanz-Ros, Josef Tkadlec, Jose Hernandez-Orallo, Josephina Hu, Joseph Marvin Imperial, Joseph M Cavanagh, Joseph McGowan, Joseph W. Jackson, Josh Ducey, Joshua Cole, Joshua Duersch, Joshua Jaeger, Joshua Lass, Joshua Mak, Joshua Newbould, Joshua Robinson, Joshua Vendrow, JP Heimonen, Juan Carlos Gonzalez, Juan Gonzalez, Juehang Qin, Jules Kreuer, Jules Robins, Julia Chernyavsky, Julian Noah Leser, Julian Salazar, Julian Wykowski, Julia Yoon, Julien Degorre, Julien Guillod, Julien Laurendeau, Julien Portier, Julien Wist, Juncheng Wu, Junda Chen, Jungbae Nam, Jun Jin, Junwoo Ha, Junyi Guan, Jun Yuan, Junyu Luo, Justine Leon Uro, Justin Tan, Justin Xu, Kaiqu Liang, Kaivalya Rawal, Kaixin Wang, Kalon J. Overholt, Kalyan Ramakrishnan, Kang Yong Loh, Kaniuar Bacho, Kanu Priya Agarwal, Károly Zsolnai-Fehér, Kasper Halevy, Katarzyna Olszewska, Kaushik Bar, Kaustubh Dhole, Kaustubh Ponkshe, Kaustubh Sridhar, Kavin Jindel, Kaylie Hausknecht, Kazuki Matsumoto, Keith Krenek, Keith Schneider, Kelin Zhu, Kelsey Van den Houte, Kenchi Okutsu, Kengo Zenitani, Ketan Jha, Kevin Chen, Kevin Joseph Scaria, Kevin Zhou, Khalida Meer, Koen Sponselee, Kostiantyn Dobarskyi, Krishnamurthy Iyer, Kristof Meding, Krzysztof Burdzy, Kumar Shridhar, Kunal Pai, Kunvar Thaman, Kunyang Sun, Kushal Thaman, Kushin Mukherjee, Kutay Tire, Kwok Hao Lee, Kyle Montgomery, Laasya Nagumalli, Laila Bashmal, Laila Yacar, Lavr Vetoshkin, Lawrence Hollom, Laxman Prasad Goswami, Lennart Finke, Leon Lang, Leon Nguyen, Leonor Brito-Santana, Leo Smucker, Letitia Parcalabescu, Liam Do, Lianghui Li, Liangti Dai, Lina Brüssel, Ling Zhang, Linh Ho, Linjie Dai, Linwei Xin, Lisa Schut, Li S. Yifei, Lixin Zhang, Longke Tang, Long Le, Long Phan, Long (Tony) Lian, Lorenzo Vaquero, Loukmane Karim, Luca Arnaboldi, Lukas Lewark, Lukas S. Huber, Luke Askew, Luke Basler, Luk Gloor, Luther Yap, Lynna Kvistad, Lynn Van Der Sypt, Madellene Peñaflor, Maja Somrak, Maksim Radionov, Maksym Ovchynnikov, Mantas Mazeika, Manuel Schottdorf, Mao Mao, Maosen Tang, Mara Popescu, Marc Carauleanu, Marcin Briański, Marco Lukas, Marco Piccardo, Marc Roth, Marc Sperzel, Marcus Abramovitch, Maria del Rio-Chanona, Maria Inês S. Nunes, Mariana Costa, Mark H Inlow, Mark Nandor, Martin Lackner, Martino Maggetti, Martin Q. Ma, Martin Stehberger, Martí Oller, Martyna Plomecka, Marvin Deng, Matheus Piza, Matthew Brooks, Mátyás Vincze, Max Bartolo, Max Lamparth, Maxwell Shepherd, M.C. Boscá, Mengze Tang, Micah Carroll, Michael Chen, Michael Choi, Michael Foster, Michael K. Cohen, Michael Kirchhof, Michael Krause, Michael Liu, Michael P. Brenner, Michael Richmond, Michael Wang, Michael Yu, Michał Perełkiewicz, Michelle X Yuan, Mickaël Noyé, Miguel Orbegozo Rodriguez, Mikalai Uzhou, Mike Battaglia, Mike He, Mike Peterson, Mike Zhang, Mikhail Doroshenko, Mikhail Kalinin, Milind Jagota, Mingfang Zhang, Minghao Yan, Ming Yin, Mobeen Mahmood, Mohamed Sayed, Mohamed Shaaban, Mohamed Zekry, Mohammad Maghsoudimehrabani, Mohammadreza Mofayezi, Mohammad Safdari, Mohammed Berkani, Mohammed Mahfoud, Mohanad Mohamed, Mohinder Maheshbhai Naiya, Mohsen Bahaloohoreh, Moon Twayana, Moritz Firsching, M Saiful Bari, Mstyslav Kazakov, Mu Cai, Muhammad Fayez Aziz, Muhammad Rehan Siddiqi, Mukhwinder Singh, Murat Eron, Murat Islam, Murat Tiryakioglu, Mustafa Mehkary, Muthu Chidambaram, Muyan Jiang, My Chiffon Nguyen, Namkyu Park, Nasser Heydari, Natanael Wildner Fraga, Nate Resman, Nate Stambaugh, Nathan Cho, Nathaniel Li, Ngefor Mildred Tanwie, Ng Ze-An, Nicholas Farina, Nick Crispino, Nick Winter, Nicolas Daans, Nicolas Remy, Niels Mündler, Nihar Shah, Nikita Shulga, Niklas Muennighoff, Nikola Zubić, Nils Gustafsson, Ning Tang, Nitin Chandok, Niv Cohen, Noah Burns, Noam Kolt, Nurdin Kaparov, Oam Patel, Oleg Iskra, Oleg Shumar, Oleksandr Pokutnyi, Oliver Zhang, Olle Häggström, Omer Faruk Bodur, Omid Taheri, Omkar Dhamane, Ondrej Bohdal, Orion Weller, Ori Press, Orr Paradise, Orr Zohar, Pablo Hernández-Cámara, Paolo Faraboschi, Paolo Giordano, Paolo Rissone, Parker Whitfill, Pascal Lauer, Patrick Tser Jern Kon, Paul Rosu, Pavel Arkhipov, Pavel Zhelnov, Pawan Kumar, Peter Bradshaw, Peter E. Chen, Peter Turchin, Petr Spelda, Peyman Kassani, Philipp D. Siedler, Philippe Schwaller, Philipp Petersen, Phuong M. Cao, Pierre Clavier, Pierre Marion, Pierrot Arsene, Pieter Francois, Piotr Padlewski, Prajvi Saxena, Prashant Joshi, Priti Shukla, Qiaochu Yuan, Qijia Chen, Qiutong Men, Qiuyu Ren, Qizheng Zhang, Rafael Sayous, Rafał Poświata, Ragavendran P V, Raghav Singhal, Rai (Michael Pokorny), Rajat Maheshwari, Rami Aly, Rasoul Pouriamanesh, Raúl Adrián Huerta Rodríguez, Rayner Hernandez Perez, Rebeka Plecnik, Renas Bacho, Ricardo Lorena, Richard Moat, Richard Ren, Richard Stanley, Richard Wheeler, Rickard Brüel Gabrielsson, Rishab Kumar Jain, Rishit Agrawal, Ritesh Kasamsetty, Ritwik Mishra, Robert Geirhos, Robert Gerbicz, Robert Lauff, Roberto Pereira, Robin Riblet, Robin Zhang, Rodrigo De Oliveira Pena, Rohan Pandey, Roman Leventov, Romano De Maddalena, Roman Pflugfelder, Ronak Pradeep, Ronald Clark, Rongwu Xu, Roselynn Grace Montecillo, Ross Finocchio, Roy Yue, Ruicheng Xian, Ruiji Sun, Rui Li, Rui Pan, Runjia Li, Russell Campbell, Ryan G. Hoerr, Ryan Kim, Ryan Stendall, Ryan Yang, Rynaa Grover, Saeed Soori, Sai Prajwal Reddy, Saiteja Utpala, Samaksh Gulati, Sam Ali, Samir Shamseldeen, Sam Lee, Samuel Albanie, Samuele Sala, Samuel Perry, Sandra Mendoza, Sandy Zhao, Sangwon Lee, Sanxing Chen, Sara Fish, Sarah Hoback, Sarah-Jane Crowson, Sarah Martinson, Sara Vera Marjanović, S Ashwin Hebbar, Satyapriya Krishna, Scott Creighton, Scott Sauers, Sean Li, Sean R. Green, Sean Shi, Sejong Kim, Sergei Bogdanov, Sergey Bogdanik, Sergey Ivanov, Serguei Popov, Seri Khoury, Shadab Khan, Shailesh Shah, Shaipranesh Senthilkuma, Shalev Ben-David, Shankar Sivarajan, Shannon Coleman, Shashank Agnihotri, Shaul Barkan, Shaun Phillips, Sheeshram Siddh, Shehzaad Dhuliawala, Sherwin Abdoli, Sherwin Lai, Shikhar Dhingra, Shiqi Wang, Shiv Halasyamani, Shi-Zhuo Looi, Shoubin Yu, Shreen Gul, Shreyas Subramanian, Shreyas Verma, Shuyu Wu, Siddharth Suresh, Sihan Xu, Simon Weber, Sina Mollaei, Sina Rismanchian, Siranut Usawasutsakorn, Siriphan Arthornthurasuk, Sivakanth Gopi, Sk Md Salauddin, Soham Sachin Purohit, Soham Samal, Song Bian, Songyang Zhang, Sören Möller, Søren Riis, Spandan Patel, Sreekar Chigurupati, Srikar Yalam, Stanislaw Barzowski, Stanley Stepanic, Stefan Ciobâcă, Stefan Ivanov, Stefano Cavalleri, Stefano Ermon, Stefan Steinerberger, Stefan Todoran, Steffi Chern, Stephane Durand, Stephen Ebert, Stephen Malina, Stephen Mensah, Steven Dillmann, Steven Y. Feng, Subhashini Venugopalan, Subrata Mishra, Suchandra Datta, Sumeet Motwani, Summer Yue, Sunny Sun, Surya Sunkari, Sybille Rosset, Syed M. Shahid, Tad Hogg, Tania C. B. Santos, Taom Sakal, Taozhi Wang, Taylor D. Hartman, Tejal Patwardhan, Tejas Kalpathi, Tej Shah, Thai-Hoa Nguyen, Theo Knights, Thomas C.H. Lux, Thomas Preu, Thom Kamphuis, Thorben Jansen, Tianbo Qi, Tianchi Zhang, Tianneng Shi, Tilen Medved, Tim Gehrunger, Timothy Kang, Timothy Manik, Timothy Wu, Tim Santens, Tim Tarver, Ting Sun, Ting Wang, Tobias Garcia Vilchis, Tobias Kreiman, Tomek Korbak, Tom Goertzen, Tong Jiang, Tong Yang, Tony CY Pang, Tony Fruhauff, Tran Đuc Huy, Tran Quoc Khánh, Truong An Nguyen, T. Ryan Rogers, Tung Nguyen, Tyler Osbey, Tyler Xiao, Ujjwala Anantheswaran, Usman Qazi, Václav Rozhoň, Vage Taamazyan, Vaidehi Patil, Varun Gangal, Vasilios Mavroudis, Vassilis Kostakos, Veerupaksh Singla, Veit Elser, Victor Efren Guadarrama Vilchis, Victor Souza, Vidhi Kulkarni, Vijaykaarti Sundarapandiyan, Vilém Zouhar, Ville Heilala, Vincent Cheng, Vincent Ginis, Vinh-Kha Le, Violet Ai, Virendra Singh, Vishruth Bharath, Vit Stritecky, Vivek Sanker, Vivek Vajipey, Vivien Rossbach, Vladimir Goryachev, Vladimir Vinnikov, Vladislav Poritski, Vladyslav Kuchkin, Volodymyr Nevirkovets, Wanyoung Kim, Warren S. Vaz, Wei Hao, Wei Hu, Weizhi Zhang, Wenchao Dong, Wen-Ding Li, Wenjie Ma, Wenjin Zhang, Wentao Wu, Wiktor Morak, Will Cai, William Alley, William Held, William Merrill, Will Yeadon, Woongyeong Yeo, Xavier Alapont, Xianjun Yang, Xiaohan Wang, Xiaoxiang Zhou, Xi Jiang, Xi Lin, Xilin Jiang, Xing Han Lù, Xingyu Qu, Xinlu Zhang, Xinyao Han, Xinyu Zhang, Xiuyu Li, Xuandong Zhao, Xue Wang, Yana Malysheva, Yan Carlos Leyva Labrador, Yanxu Chen, Yaowen Chang, Yashaswini Jain, Yasin Sonmez, Yewen Sun, Yibo Jiang, Yifan Xiong, Yifan Yin, Yihao Liang, Yingheng Wang, Yinuo Ren, Yinwei Dai, Yiğit Yalın, Yiyang Fan, Yizhuo Liang, Yongki Lee, Yosi Kratish, Yotam Perlitz, Younesse Kaddar, Yuanli Wang, Yuchen Anna Zhou, Yuexuan Zu, Yueying Liu, Yuhui Zhang, Yunze Xiao, Yuqi Li, Yury Makarychev, Yushun Chen, Yuval Kansal, Yuyin Zhou, Yuzheng Hu, Yuzhou Nie, Yuzhou Wang, Zachary Berger, Zachary Brown, Zachary Giboney, Zafir Nasim, Zahra Adoul, Zakayo Kazibwe, Zaki Hossain, Zechen Zhang, Zerui Cheng, Zewen Shen, Zhanda Zhu, Zhehang Du, Zhengxiang Wang, Zheng-Xin Yong, Zhe Wang, Zhe Ye, Zhibai Jia, Zhiyi Sun, Zhun Wang, Zhuo Cheng, Zienab EL-Wasif, Zihan Wang, Zihao Wang, Zijian Song, Ziqiao Ma, Ziqi Liu, Ziqi Xu, Zishun Yu, Ziwen Han, Zixuan Wang, Ziye Chen, Ziyi Zhang

Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords LLM benchmark · AI evaluation · expert human performance · academic questions · model calibration · frontier knowledge · closed-ended questions · multi-modal benchmark

The pith

A benchmark of 2500 expert-level questions shows state-of-the-art LLMs still perform poorly on hard academic problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a collection of 2500 closed-ended questions spanning mathematics, humanities, natural sciences and other fields, each with a definite answer that experts can check but that resists quick web lookup. These questions were assembled by subject specialists worldwide to sit at the current limits of human knowledge. When tested, leading language models record low accuracy and weak calibration on the set, in contrast to their high scores on easier existing tests. This gap indicates that current systems have not yet reached expert human performance on demanding closed-ended tasks. If the results hold, the benchmark offers a stable reference point for tracking future progress toward that level.

Core claim

The authors assembled 2500 multi-modal questions across dozens of subjects, each carrying a known, unambiguous solution that is easily verified yet not quickly retrievable from the internet. State-of-the-art LLMs achieve low accuracy and poor calibration on this collection, in contrast to their near-ceiling performance on saturated earlier benchmarks, thereby exposing a measurable distance between present model abilities and the expert human frontier on closed-ended academic questions.

What carries the argument

The Humanity's Last Exam benchmark itself, a fixed set of 2500 expert-developed questions with verifiable answers that resist rapid retrieval.

If this is right

The benchmark supplies a durable yardstick for measuring gains in reasoning and knowledge on genuinely difficult problems.
Model developers gain a concrete signal that current approaches leave substantial headroom before expert-level closed-ended performance.
Policymakers receive a clearer view of the distance between deployed systems and human-expert capability on academic tasks.
Subsequent evaluation efforts can adopt the same global-expert, verifiable-answer design for other domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Strong performance on this set may correlate with competence on complex real-world expert workflows that mix facts and reasoning.
The multi-modal format points to a need for joint advances in text and visual understanding at frontier difficulty.
Repeated use of the same questions over time will let researchers quantify whether gains are genuine or partly due to data leakage.
Similar coordinated expert efforts could produce parallel tests for fields where knowledge moves faster than static benchmarks allow.

Load-bearing premise

The questions have clear solutions that cannot be quickly found through internet searches and sit at the current edge of what human experts know.

What would settle it

An independent check that shows many of the questions can be answered correctly by standard web search or that top LLMs reach above 60 percent accuracy on the full set without additional training.
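The accuracy half of that test can be made concrete. The sketch below is an illustration only, not a procedure from the paper: it uses a Wilson score interval to ask whether an observed accuracy on the 2,500-question set is clearly above the 60 percent mark named above, rather than merely touching it. The function names, the 95% level, and the example counts are all assumptions introduced here.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def clearly_above(correct: int, total: int, threshold: float = 0.60) -> bool:
    """True only if the whole interval lies above the threshold."""
    low, _ = wilson_interval(correct, total)
    return low > threshold


# Hypothetical counts, for illustration only (not results from the paper).
print(clearly_above(1500, 2500))  # exactly 60% observed: interval straddles the threshold
print(clearly_above(1600, 2500))  # 64% observed: interval sits above the threshold
```

With 2,500 questions the interval is roughly two percentage points wide on each side, so a score would need to land a little above 60 percent before it counts as settling the question.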

Original abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Humanity's Last Exam (HLE), a multi-modal benchmark of 2,500 closed-ended questions (multiple-choice and short-answer) spanning mathematics, humanities, and natural sciences. Questions were developed globally by subject-matter experts and are asserted to have unambiguous, verifiable solutions that cannot be quickly answered via internet retrieval. The paper claims that existing benchmarks like MMLU are saturated (>90% LLM accuracy) and positions HLE as a frontier benchmark on which state-of-the-art LLMs exhibit low accuracy and poor calibration, revealing a substantial gap to expert human performance. The benchmark is released publicly at lastexam.ai.

Significance. If the questions are rigorously validated as non-retrievable and frontier-level, HLE would be a valuable contribution by supplying a non-saturated, broad-coverage benchmark for tracking LLM progress on expert academic tasks. The global expert curation and multi-modal design are strengths, and the public release supports reproducibility. However, the claimed significance of the LLM capability gap rests on unshown validation evidence, limiting its current impact for research and policy.

major comments (2)

[Abstract and question development section] Abstract and the section describing question development: The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. No concrete methodology is supplied to support this central claim (e.g., expert search audits, originality checks, or quantitative retrievability tests).
[Results and evaluation sections] Results and evaluation sections: The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.

minor comments (1)

[Abstract] Abstract: Including one or two concrete accuracy figures (with model names) would make the 'low accuracy' claim more precise and informative for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing Humanity's Last Exam. We address each major comment point by point below, with clear indications of planned revisions to improve clarity and rigor.

Point-by-point responses

Referee: [Abstract and question development section] Abstract and the section describing question development: The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. No concrete methodology is supplied to support this central claim (e.g., expert search audits, originality checks, or quantitative retrievability tests).

Authors: We agree that explicit validation details are essential to support the non-retrievability claim and distinguish capability gaps from data leakage. The manuscript describes global expert curation and the requirement for verifiable solutions, but we acknowledge the need for greater specificity. In the revised version, we will add a dedicated subsection under question development that outlines the concrete procedures: expert-conducted web searches for each question, checks against academic databases and prior benchmarks for originality, and any quantitative thresholds or audit logs used to confirm that solutions cannot be quickly retrieved. Examples of such checks for representative questions will be included where feasible without compromising the benchmark.

revision: yes

Referee: [Results and evaluation sections] Results and evaluation sections: The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.

Authors: We apologize that the quantitative results were not presented with sufficient prominence or completeness in the version under review. The manuscript does contain an evaluation section reporting model performance, but we will revise it to include explicit tables with per-model accuracies (e.g., for GPT-4o, Claude 3.5 Sonnet, and others), direct comparisons to human expert baselines, calibration metrics such as expected calibration error, and basic statistical details including confidence intervals or variance across question subsets. This will enable readers to evaluate the scale and reliability of the observed gap.

revision: yes
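For readers unfamiliar with the calibration metric named in that response, the following is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins. It is a generic textbook formulation and may not be the metric the paper itself reports. The function name, the choice of ten bins, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| per bin.

    confidences: stated probabilities in [0, 1], one per answer.
    correct:     booleans, True where the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy input (not data from the paper): 90% stated confidence, 1 of 4 correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
```

A model that states 90% confidence while answering a quarter of the questions correctly gets a large ECE, which is the pattern the abstract's phrase "low accuracy and calibration" describes: wrong answers given with high confidence.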

Circularity Check

0 steps flagged

No circularity: benchmark dataset release without derivations or fits

Full rationale

The paper introduces Humanity's Last Exam as a new multi-modal benchmark consisting of 2,500 expert-authored questions. It contains no mathematical derivations, model equations, parameter fittings, or predictions derived from internal computations. The central claims—that questions are unambiguous, verifiable, and not quickly retrievable via internet, and that current LLMs show low accuracy—rest on the empirical construction and release of the dataset itself rather than any self-referential reduction of outputs to inputs. No self-citation chains, ansatzes, or renamings of known results are used to justify load-bearing steps. The work is therefore self-contained as a benchmark contribution with no derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation of LLM capabilities on HLE depends on the assumption that the questions accurately reflect the frontier of human knowledge without being solvable through non-expert means.

axioms (1)

domain assumption: Questions have known, unambiguous, and easily verifiable solutions that cannot be quickly answered via internet retrieval.
This is presented as a core design principle in the abstract.

pith-pipeline@v0.9.0 · 10825 in / 1194 out tokens · 71822 ms · 2026-05-10T18:36:09.526569+00:00 · methodology

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
cs.CL 2026-05 unverdicted novelty 8.0

Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
cs.CV 2026-04 unverdicted novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
q-fin.CP 2026-04 conditional novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
cs.AI 2026-05 unverdicted novelty 7.0

TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
cs.LG 2026-05 unverdicted novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL 2026-05 unverdicted novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
MaD Physics: Evaluating information seeking under constraints in physical environments
cs.AI 2026-05 unverdicted novelty 7.0

MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
cs.AI 2026-05 unverdicted novelty 7.0

TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
cs.AI 2026-05 unverdicted novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
AcademiClaw: When Students Set Challenges for AI Agents
cs.AI 2026-05 unverdicted novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
cs.LG 2026-05 unverdicted novelty 7.0

The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
Super Apriel: One Checkpoint, Many Speeds
cs.LG 2026-04 unverdicted novelty 7.0

A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
cs.LG 2026-04 unverdicted novelty 7.0

Stargazer benchmarks AI agents on physics-constrained model fitting for astrophysical radial velocity data, revealing that agents achieve statistical fits but often fail to recover the correct physical parameters of planetary systems.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
cs.CL 2026-04 unverdicted novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
The limits of bio-molecular modeling with large language models: a cross-scale evaluation
cs.LG 2026-04 unverdicted novelty 7.0

LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
cs.LG 2026-05 unverdicted novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
The Generalized Turing Test: A Foundation for Comparing Intelligence
cs.AI 2026-05 unverdicted novelty 6.0

The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
cs.CL 2026-05 unverdicted novelty 6.0

Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
cs.CL 2026-05 unverdicted novelty 6.0

MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
Learning Agent Routing From Early Experience
cs.CL 2026-05 unverdicted novelty 6.0

BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
Cripping AI: Reimagining AI Through Lived Disability Experiences
cs.HC 2026-05 unverdicted novelty 6.0

Cripping AI is a proposed framework that dismantles ableist assumptions in AI by centering disabled ways of knowing and respecting disabled labor in co-creation.
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
cs.LG 2026-04 unverdicted novelty 6.0

ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
cs.AI 2026-04 unverdicted novelty 6.0

Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
Large Language Models Decide Early and Explain Later
cs.CL 2026-04 unverdicted novelty 6.0

LLMs settle on their answer after a minority of CoT tokens and produce an average of 760 more tokens as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
cs.AI 2026-04 unverdicted novelty 6.0

ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
cs.LG 2026-04 unverdicted novelty 6.0

PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
cs.AI 2026-04 unverdicted novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
Towards Knowledgeable Deep Research: Framework and Benchmark
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
cs.CL 2026-04 conditional novelty 6.0

A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
cs.LG 2026-04 conditional novelty 6.0

LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL 2025-06 unverdicted novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
LLMs Get Lost In Multi-Turn Conversation
cs.CL 2025-05 unverdicted novelty 6.0

LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
Instructions Shape Production of Language, not Processing
cs.CL 2026-05 unverdicted novelty 5.0

Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
cs.IR 2026-05 conditional novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
pAI/MSc: ML Theory Research with Humans on the Loop
cs.AI 2026-04 unverdicted novelty 5.0

pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
cs.AI 2026-04 unverdicted novelty 5.0

EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
Toward Human-AI Complementarity Across Diverse Tasks
cs.HC 2026-04 unverdicted novelty 5.0

Human-AI hybrids achieve only +0.4pp over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.
COMPOSITE-Stem
cs.AI 2026-04 conditional novelty 5.0

COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
GLM-5: from Vibe Coding to Agentic Engineering
cs.LG 2026-02 unverdicted novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
MiMo-V2-Flash Technical Report
cs.CL 2026-01 unverdicted novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
cs.CL 2025-12 unverdicted novelty 5.0

DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
gpt-oss-120b & gpt-oss-20b Model Card
cs.CL 2025-08 unverdicted novelty 5.0

OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Kimi K2: Open Agentic Intelligence
cs.LG 2025-07 unverdicted novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Measuring AI Reasoning: A Guide for Researchers
cs.AI 2026-05 unverdicted novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
Supplement Generation Training for Enhancing Agentic Task Performance
cs.LG 2026-04 unverdicted novelty 4.0

SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
cs.CL 2025-08 unverdicted novelty 4.0

GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
cs.CL 2026-05 unverdicted novelty 3.0

EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
cs.CV 2026-04 unverdicted novelty 3.0

Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...

Reference graph

Works this paper leans on

300 extracted references · 200 canonical work pages · cited by 53 Pith papers · 20 internal anchors

[1]

C. Alberti, K. Lee, and M. Collins. A bert baseline for the natural questions, 2019. URL https://arxiv.org/abs/1901.08634

work page arXiv 2019
[2]

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y. Gal, and X. Davies. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. URL https://arxiv.org/abs/2410.09024

work page internal anchor Pith review arXiv 2024
[3]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://api.semanticscholar.org/CorpusID:268232499

2024
[4]

Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024

Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 sonnet, 2024. URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf

2024
[5]

Responsible scaling policy updates, 2024

Anthropic. Responsible scaling policy updates, 2024. URL https://www.anthropic.com/rsp-updates

2024
[6]

R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775

work page internal anchor Pith review arXiv 2025
[7]

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

2021
[8]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. Ms marco: A human generated machine reading comprehension dataset, 2018. URL https://arxiv.org/abs/1611.09268

work page internal anchor Pith review arXiv 2018
[10]

Purple llama CyberSecEval: A secure coding benchmark for language models

M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y. Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V. Vontimitta, S. Whitman, and J. Saxe. Purple llama cyberseceval: A secure coding benchmark for language models, 2023. URL https://arxiv...

work page arXiv 2023
[11]

J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Mądry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095

work page arXiv 2024
[12]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Arc prize 2024: Technical report

F. Chollet, M. Knoop, G. Kamradt, and B. Landers. Arc prize 2024: Technical report, 2024. URL https://arxiv.org/abs/2412.04604

work page arXiv 2024
[14]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[15]

Deepseek-v3 technical report, 2024

DeepSeek-AI. Deepseek-v3 technical report, 2024. URL https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

2024
[16]

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903.00161

2019
[17]

The Llama 3 Herd of Models

A. Dubey et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

B. Gao, F. Song, Z. Yang, Z. Cai, Y. Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y. Zhang, X. Ren, T. Liu, and B. Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07985

work page arXiv 2024
[19]

Frontiermath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI. arXiv preprint arXiv:2411.04872, 2024

E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, O. Järviniemi, M. Barnett, R. Sandler, J. Sevilla, Q. Ren, E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, and S. V. Enugandla. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai,...

work page arXiv 2024
[20]

C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URL https://arxiv.org/abs/2402.14008

work page internal anchor Pith review arXiv 2024
[21]

Measuring Coding Challenge Competence With APPS

D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021. URL https://arxiv.org/abs/2105.09938

work page internal anchor Pith review arXiv 2021
[22]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874

2021
[24]

D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt. Pixmix: Dreamlike pictures comprehensively improve safety measures, 2022. URL https://arxiv.org/abs/2112.05135

work page arXiv 2022
[25]

A. Hosseini, A. Sordoni, D. Toyama, A. Courville, and R. Agarwal. Not all llm reasoners are created equal, 2024. URL https://arxiv.org/abs/2410.01748

work page arXiv
[27]

A. Jacovi, A. Wang, C. Alberti, C. Tao, J. Lipovetz, K. Olszewska, L. Haas, M. Liu, N. Keating, A. Bloniarz, C. Saroufim, C. Fry, D. Marcus, D. Kukliansky, G. S. Tomar, J. Swirhun, J. Xing, L. W. and Madhu Gurumurthy, M. Aaron, M. Ambar, R. Fellinger, R. Wang, R. Sims, Z. Zhang, S. Goldshtein, and D. Das. Facts leaderboard. https://kaggle.com/facts-leaderb...

2024
[28]

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking benchmarking in nlp, 2021. URL https://arxiv.org/abs/2104.14337

work page arXiv 2021
[30]

Refusal-trained llms are easily jailbroken as browser agents. arXiv preprint arXiv:2410.13886

P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, S. R. Team, E. Chang, V. Robinson, S. Hendryx, S. Zhou, M. Fredrikson, S. Yue, and Z. Wang. Refusal-trained llms are easily jailbroken as browser agents, 2024. URL https://arxiv.org/abs/2410.13886

work page arXiv 2024
[31]

J. M. Laurent, J. D. Janizek, M. Ruzo, M. M. Hinks, M. J. Hammerling, S. Narayanan, M. Ponnapati, A. D. White, and S. G. Rodriques. Lab-bench: Measuring capabilities of language models for biology research, 2024. URL https://arxiv.org/abs/2407.10362

work page arXiv
[33]

N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi, A. Khoja, Z. Zhao, A. Herbert-Voss, C. B. Breuer, S. Marks, O. Patel, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Lin, A. A. Hunt,...

work page arXiv 2024
[34]

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

work page internal anchor Pith review arXiv 2024
[35]

T. R. McIntosh, T. Susnjak, N. Arachchilage, T. Liu, P. Watters, and M. N. Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence, 2024. URL https://arxiv.org/abs/2402.09880

work page arXiv 2024
[36]

Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https://arxiv.org/abs/1910.14599

work page arXiv 2020
[37]

Openai o1 system card, 2024

OpenAI. Openai o1 system card, 2024. URL https://cdn.openai.com/o1-system-card-20240917.pdf

2024
[38]

Openai and los alamos national laboratory announce bioscience research partnership, 2024

OpenAI. Openai and los alamos national laboratory announce bioscience research partnership, 2024. URL https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/

2024
[39]

Introducing swe-bench verified, 2024

OpenAI. Introducing swe-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

2024
[40]

GPT-4 Technical Report

OpenAI et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022

2022
[42]

D. Owen. How predictable is language model benchmark performance?, 2024. URL https://arxiv.org/abs/2401.04757

work page arXiv 2024
[43]

E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L. ...

work page arXiv 2022
[44]

M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabil...

2024
[45]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250

work page internal anchor Pith review arXiv 2016
[46]

Know What You Don't Know: Unanswerable Questions for SQuAD

P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822

work page Pith review arXiv 2018
[47]

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review arXiv 2023
[48]

K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

2023
[49]

M. Skarlinski, J. Laurent, A. Bou, and A. White. About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/research-announcements/hle-exam

2025
[50]

V. K. Srinivasan, Z. Dong, B. Zhu, B. Yu, H. Mao, D. Mosk-Aoyama, K. Keutzer, J. Jiao, and J. Zhang. Nexusraven: A commercially-permissive language model for function calling. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=5lcPe6DqfI

2023
[51]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Slone, A. Rahane, A. S. Iyer, A. Andreassen, A. Madotto, A. S...

work page internal anchor Pith review arXiv 2023
[52]

S. A. Taghanaki, A. Khani, and A. Khasahmadi. Mmlu-pro+: Evaluating higher-order reasoning and shortcut learning in llms, 2024. URL https://arxiv.org/abs/2409.02257

work page arXiv 2024
[53]

G. Team et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530

work page internal anchor Pith review Pith/arXiv arXiv
[55]

PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition. arXiv preprint arXiv:2407.11214

G. Tsoukalas, J. Lee, J. Jennings, J. Xin, M. Ding, M. Jennings, A. Thakur, and S. Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024. URL https://arxiv.org/abs/2407.11214

work page arXiv 2024
[56]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461

2019
[57]

A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020. URL https://arxiv.org/abs/1905.00537

work page arXiv 2020
[58]

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 2024. URL https://arxiv.org/abs/2406.01574

work page internal anchor Pith review arXiv 2024
[59]

J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368

work page arXiv 2024
[60]

H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, M. Kinniment, A. Lajko, S. Nix, L. Sato, W. Saunders, M. Taran, B. West, and E. Barnes. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts, 2024. URL https://a...

work page arXiv 2024
[61]

Grok-2 beta release, 2024

xAI. Grok-2 beta release, 2024. URL https://x.ai/blog/grok-2

2024
[62]

F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024

2024
[63]

Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/1809.09600

work page internal anchor Pith review arXiv 2018
[64]

S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045

work page internal anchor Pith review arXiv 2024
[65]

A. K. Zhang, N. Perry, R. Dulepet, J. Ji, J. W. Lin, E. Jones, C. Menders, G. Hussein, S. Liu, D. Jasper, P. Peetathawatchai, A. Glenn, V. Sivashankar, D. Zamoshchin, L. Glikbarg, D. Askaryar, M. Yang, T. Zhang, R. Alluri, N. Tran, R. Sangpisit, P. Yiorkadjis, K. Osele, G. Raghupathi, D. Boneh, D. E. Ho, and P. Liang. Cybench: A framework for evaluating ...

work page arXiv 2024
[66]

Agieval: A human-centric benchmark for evaluating foundation models

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

work page arXiv 2023

A Authors

We offered optional co-authorship to all question submitters with an accepted question in HUMANITY’S LAST EXAM (including both public and private...
[67]

Independent Researcher
[68]

University of California, Berkeley
[69]

Massachusetts Institute of Technology
[70]

University of Cambridge
[71]

University of Oxford
[72]

Princeton University
[73]

Carnegie Mellon University
[74]

University of Chicago
[75]

University of Michigan
[76]

École Polytechnique Fédérale de Lausanne
[77]

University of Toronto
[78]

University of Illinois Urbana-Champaign
[79]

Washington University
[80]

University of Wisconsin-Madison

Showing first 80 references.