pith. machine review for the scientific record

arxiv: 2403.05530 · v5 · submitted 2024-03-08 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Aakanksha Chowdhery, Aaron Cohen, Aaron Parisi, Abe Ittycheriah, Abhanshu Sharma, Abhijit Karmarkar, Abhimanyu Goyal, Abhi Mohan, Abhishek Chakladar, Abhishek Sinha, Achintya Singhal, Ada Ma, Adam Bloniarz, Adam Iwanicki, Adam Paszke, Adam R. Brown, Adam Sadovsky, Adams Yu, Aditya Barua, Aditya Siddhant, Adnan Ozturel, Adrian Goedeckemeyer, Adrian Hutter, Adria Puigdomenech Badia, Adria Recasens, Aedan Pope, Agoston Weisz, Aishwarya Kamath, Ajay Kannan, Alanna Walton, Alban Rrustemi, Albert Cui, Alberto Magni, Albert Webson, Albert Weston, Albin Cassirer, Ale Jakse Hartman, Alejandro Lince, Alek Andreev, Alek Dimitriev, Aleksandra Faust, Alek Wenjiao Wang, Alena Repina, Alen Carin, Alexander Chen, Alexander Neitz, Alexander Pritzel, Alexandra Chronopoulou, Alexandre Frechette, Alexandre Moufarek, Alexandre Senges, Alex Castro-Ros, Alexey Guseynov, Alex Goldin, Alex Grills, Alex Kaskasoli, Alex Korchemniy, Alex Morris, Alex Polozov, Alex Tomala, Alex Tudor, Alex Yakubovich, Alex Zhai, Aliaksei Severyn, Alice Talbert, Alicia Parrish, Ali Elqursh, Ali Khodaei, Alireza Ghaffarkhah, Allan Dafoe, Ambrose Slone, Amir Globerson, Amit Marathe, Amit Raul, Amol Mandhane, Anais White, Anand Iyer, Ananth Agarwal, Anastasia Petrushkina, Anastasija Ilic, Anca Dragan, Anders Andreassen, Andras Orban, Andrea Burns, Andrea Michi, Andreas Terzis, Andrea Tacchetti, Andreea Marzoca, Andre Elisseeff, Andrei Sozanschi, Andrew Bolt, Andrew Brock, Andrew Dai, Andrew Leach, Andrey Khorlin, Andy Coenen, Andy Swing, Angeliki Lazaridou, Angelos Filos, Anhad Mohananey, Anirudh Baddepudi, Anirudh GP, Anita Gergely, Anitha Vijayakumar, Anja Hauth, Ankesh Anand, Ankur Bapna, Ankush Garg, Anmol Gulati, Anna Bortsova, Anna Bulanova, Anna Koop, Annie Louis, Anselm Levskaya, Anthony Baryshnikov, Antoine He, Antoine Miech, Antoine Yang, Anton Algymr, Anton Briukhov, Anton Ruddock, Anton Tsitsulin, Anudhyan Boral, Anurag Arnab, Anu Sinha, Arjun Kar, Arnar Mar Hrafnkelsson, Arpi Vezer, Arthur Guez, Artiom Myaskovsky, Arun Ahuja, Asaf Aharoni, Ashish Shenoy, Ashwin Sreevatsa, Aurko Roy, Austin Matthews, Aviral Kumar, Avi Singh, Axel Stjerngren, Ayush Dubey, Azade Nova, Balaji Lakshminarayanan, Balaji Venkatraman, Bartek Perz, Basil Mustafa, Bat-Orgil Batsaikhan, Becca Roelofs, Beer Changpinyo, Behnam Neyshabur, Ben Bariach, Ben Caine, Benjamin Lee, Bernd Bohnet, Betty Chan, Bhavishya Mittal, Biao Zhang, Bibo Xu, Bill Rosgen, Bogdan Damoc, Bo Li, Boxi Wu, Boyu Wang, Bramandia Ramadhana, Brennan Saeta, Brian McWilliams, Brice Hulse, Brona Robenek, Bryan Seybold, Bryce Petrini, Caglar Unlu, Canfer Akbulut, Carey Radebaugh, Carl Crous, Carl Lebsack, Carlos Araya, Carl Saroufim, Carrie Grimes Bostock, Carrie Muir, Celine Smith, Ce Zheng, Chaitanya Malaviya, Chalence Safranek-Shrader, Chao Jia, Charbel Kaed, Charlie Chen, Charline Le Lan, Charlotte Smith, Chen Elkind, Cheng Li, Chenjie Gu, Chenkai Kuang, Chenxi Liu, Chester Kwak, Chetan Tekur, Chih-Kuan Yeh, Chih-Wei Chen, Chimezie Iwuanyanwu, Chintu Kumar, Chloe Thornton, Cho-Jui Hsieh, Chong Jiang, Chongyang Shi, Chris Alberti, Chris Dyer, Chris Gorgolewski, Chris Larkin, Christel Ngani, Christian Frank, Christina Butterfield, Christina Kouridi, Christina Lyu, Christina Sorokin, Christof Angermueller, Christopher A. 
Choquette-Choo, Christopher Yew, Christoph Hirnschall, Christos Kaplanis, Chris Welty, Chu-Cheng Lin, Chulayuth Asawaroengchai, Chung-Cheng Chiu, Cicero Nogueira dos Santos, CJ Carey, Clara Huiyi Hu, Clemens Meyer, Clement Farabet, Colin Gaffney, Colton Bishop, Connie Tao, Constant Segal, Corey Fry, Cosmin Paduraru, Cosmo Du, Courtney Biles, Craig Swanson, Dalia El Badawy, Damien Vincent, Damion Yates, Dan Banica, Dan Garrette, Dangyi Liu, Danhao Guo, Dan Holtmann-Rice, Dan Horgan, Dan Hurt, Daniel Balle, Daniel Finchelstein, Danielle Eisenbud, Daniel Sohn, Daniel Toyama, Daniel Vlasic, Daniel Zheng, Danilo Martins, Dan Popovici, Dario de Cesare, Dasha Valter, Dave Lacey, Dave Orr, David Barker, David Engel, David Kao, David Madras, David Reid, David Reitter, David Silver, David Soergel, David Steiner, David Tao, Dawei Jia, Dawn Bloxwich, Da-Woon Chung, Dayou Du, Demetra Brady, Demis Hassabis, Denese Owusu-Afriyie, Denis Teplyashin, Denis Vnukov, Dennis Daun, Denny Zhou, Dessie Petrova, Devendra Sachan, Diana Gage Wright, Diana Mincu, Diane Wu, Dian Yu, Diego de las Casas, Dinghua Li, Dipanjan Das, Disha Shrivastava, Dj Dvijotham, DJ Strouse, Dmitry Lepikhin, Dominika Rogozinska, Dominik Grewe, Dominik Paulus, DongHyun Choi, Dong Li, Dongseong Hwang, Doug Fritz, Drew A. Hudson, Drew Garmon, Dror Marcus, Duc Dung Nguyen, Dustin Tran, Dustin Zelle, Ed Chi, Egor Filonov, Ehsan Amid, Elahe Dabir, Elahe Rahimtoroghi, Elena Buchatskaya, Elena Gribovskaya, Eli Collins, Elie Bursztein, Elizabeth Cole, Eliza Rutherford, Elnaz Davoodi, Elspeth White, Emanuel Taropa, Emilio Parisotto, Emily Caveness, Emily Xue, Emma Wang, Enrique Piqueras, Eren Sezener, Erica Moreira, Eric Chu, Eric Ni, Eric Noland, Eri Latorre-Chimoto, Ethan Dyer, Evan Palmer, Evan Rosen, Evan Senter, Evgenii Eltyshev, Ewa Andrejczuk, Fabian Mentzer, Fabio Pardo, Fabio Viola, Fadi Biadsy, Fangxiaoyu Feng, Fangyu Liu, Fantine Huot, Fan Yang, Federico Lebron, Federico Piccinini, Fei Xia, Felipe Tiengo Ferreira, Felix de Chaumont Quitry, Felix Fischer, Felix Gimeno, Feryal Behbahani, Filip Pavetic, Flavien Prost, Florian Luisier, Folake Abu, François-Xavier Aubet, Francesco Piccinno, Francesco Pongetti, Francoise Beaufays, Francois Galilee, Frank Perbet, Fred Alcober, Frederick Liu, Gabe Barth-Maron, Gabriela Surita, Gabriel Carvajal, Gamaleldin Elsayed, Gan Song, Garrett Bingham, Garrett Tanzer, Gary Wang, Gemini Team Google: Petko Georgiev, Gena Gibson, Geng Yan, Geoff Brown, George Polovets, George Tucker, George van den Driessche, Gheorghe Comanici, Goker Erdogan, Golan Pundak, Golnaz Ghiasi, Gowoon Chen, Grant Uy, Gregory Thornton, Guangda Lai, Guillermo Garrido, Guolong Su, Haibin Zhang, Hanie Sedghi, Hanjun Dai, Han Lu, Hannah Forbes, Hannah Muckenhirn, Hannah Sheahan, Hanzhao Lin, Hardie Cate, Haroon Qureshi, Harry Askham, Harry Richardson, Harshal Godhia, Harsha Vashisht, Harsh Mehta, Heidi Howard, Heiga Zen, Helen Miller, Heng Chen, Heng-Tze Cheng, Henryk Michalewski, Hideto Kazawa, Hilal Dib, Hoang Nguyen, Hoi Lam, Hongkun Yu, Honglong Cai, Hongmin Fan, Huaixiu Steven Zheng, Huanjie Zhou, Hui Li, Hyeontaek Lim, Hyo Lee, HyunJeong Choe, Iain Barr, Ian Mackinnon, Ian Tenney, Igor Mordatch, Ilia Shumailov, Inaki Iturrate, Inderjit Dhillon, Indro Bhattacharya, Ioana Bica, Ioannis Antonoglou, Ionel Gog, Irene Cai, Isaac Caswell, Isabel Gao, Ishita Dasgupta, Iulia Comsa, Ivan Jurin, Ivan Philips, Ivo Danihelka, Ivo Penchev, Ivy Zheng, Izhak Shafran, Jackie Xiang, Jack W. 
Rae, Jaclyn Konzelmann, Jacob Austin, Jacob Devlin, Jaehoon Lee, Jake Walker, Jakub Sygnowski, James Besley, James Cobon-Kerr, James Keeling, James Lee-Thorp, James Lottes, James Manyika, James Martens, James Molloy, James Qin, James Svensson, Janek Nowakowski, Jane Labanowski, Jane Park, Jarek Wilkiewicz, Jasmine Liu, Jason Riesa, Javier Snaider, Jayaram Mudigonda, Jay Hoover, Jay Pavagadhi, Jay Whang, JD Co-Reyes, Jean-Baptiste Alayrac, Jean-Baptiste Lespiau, Jeff Dean, Jeffrey Zhao, Jeff Seibert, Jeff Stanway, Jennifer Beattie, Jennifer Prendki, Jennifer Pullman, Jenny Brennan, Jens Heitkaemper, Jeremiah Liu, Jeremy Chen, Jeremy Greer, Jeremy Wiesner, Jessica Austin, Jessica Landon, Jessica Lo, Jiageng Zhang, Jiao Sun, Jiaqi Mu, Jiawei Xia, Jiepu Jiang, Jilin Chen, Ji Liu, Jingchen Ye, Jing Li, Jin Huang, Jin Miao, Jinwei Xing, Jiri Simsa, Joana Ijazi, Joe Stanton, Johan Schalkwyk, John Carpenter, Johnson Jia, John Wieting, John Zhang, Jonas Adler, Jonas Rothfuss, Jonathan Caton, Jonathan Lai, Jon Clark, Jong Lee, Jon Simon, Joost van Amersfoort, Jordan Griffith, Jordan Grimstad, Jordi Orbay, Josef Broder, Joseph Pagadora, Josh Lipschultz, Josh Newlan, Joshua Maynez, Josip Djolonga, Josip Matak, Jovana Mitrovic, Julian Eisenschlos, Julian Schrittwieser, Julia Wiesinger, Juliette Love, Junehyuk Jung, Junhyuk Oh, Junwen Bai, Junwhan Ahn, Jun Xu, Juraj Gottweis, Justin Chiu, Justin Frye, Justin Gilmer, Justin Mao-Jones, Kai Kang, Kaisheng Yao, Kalesha Bullard, Kalpesh Krishna, Kareem Ayoub, Kareem Mohamed, Karel Lenc, Karolis Misiunas, Kartikeya Badola, Kashyap Krishnakumar, Kate Baumli, Kate Olszewska, Katerina Tsihlas, Katherine Lee, Kathryn Tunyasuvunakool, Katie Millican, Kati Goshvadi, Kaushal Patel, Kaushik Shivakumar, Kavya Kopparapu, Kay McKinney, Kazuki Osawa, Kedar Soparkar, Kefan Xiao, Keith Anderson, Kelvin Xu, Ken Durden, Ken Franko, Keran Rong, Kevin Hui, Kevin Kilgour, Kevin Ramirez, Kevin Swersky, Kevin Villela, Khalid Salama, Khe Chai Sim, Khuslen Baatarsukh, Kiam Choo, Kieran Milan, Kim Paterson, Kingshuk Dasgupta, Kiran Vodrahalli, Komal Jalan, Koray Kavukcuoglu, Kornraphop Kawintiranon, Kostas Aisopos, Kremena Goranova, Kris Cao, Krishna Haridasan, Krystal Kallarackal, Kyle Levin, Lakshman Yagati, Lambert Rosique, Lam Nguyen Thiet, Lampros Lamprou, Lars Lowe Sjos, Laura Knight, Laurent El Shafey, Laurent Shefey, Le Hou, Lei Zhang, Lenin Simicich, Lev Proleev, Lewis Ho, Lewis Liu, Lexi Walker, Libin Bai, Li Lao, Lilly Taylor, Lily Wang, Lily Yu, Linda Friso, Lisa Anne Hendricks, Lisa Lee, Lisa Wang, Livio Baldini Soares, Loic Matthey, Lora Aroyo, Loren Maggiore, Lorenzo Blanco, Luca Invernizzi, Lucas Dixon, Lucas Gonzalez, Lucia Loher, Lucy Kim, Luheng He, Luis C. 
Cobo, Lukas Zilka, Luke Vilnis, Lu Li, Luyu Wang, Machel Reid, Madhavi Sewak, Madhu Gurumurthy, Mahdis Mahdieh, Mahmoud Alnahlawi, Mai Gimenez, Maigo Le, Maja Trebacz, Majd Al Merey, Malcolm Reynolds, Manaal Faruqui, Mandy Guo, Manish Reddy Vuyyuru, Mani Varadarajan, Mantas Pajarskas, Mara Finkelstein, Marcello Maggioni, Marco Selvi, Marco Tagliasacchi, Marcus Wainwright, Marcus Wu, Maria Abi Raad, Maria Georgaki, Mariko Iinuma, Marin Georgiev, Mario Cortes, Mario Lucic, Mario Pinto, Mark Epstein, Mark Geller, Mark Omernick, Marko Velic, Martin Baeuml, Martin Chadwick, Martin Polacek, Martin Sundermeyer, Martin Wicke, Marvin Ritter, Mary Chesus, Mary Phuong, Massimo Nicosia, Matan Eyal, Mateo Wirth, Matko Bosnjak, Matt Harvey, Matthew Johnson, Matthew Lamm, Matthew Mauger, Matthew Rahtz, Matthew Tung, Matthew Wiethoff, Matthias Bauer, Matt Miecnikowski, Maxim Krikun, Meenu Gaba, Megan Barnes, Megan Li, Megha Goel, Mehran Kazemi, Meire Fortunato, Melvin Johnson, Mia Chen, Mia Glaese, Michael Azzam, Michael B. Chang, Michael Chang, Michael Fink, Michael Isard, Michael Kwong, Michael Laskin, Michael Quinn, Michael Sharman, Michela Paganini, Mihaela Rosca, Mihajlo Velimirovic, Milad Nasr, Mimi Jasarevic, Mina Khan, Mingqiu Wang, Ming Zhang, Minh Giang, Minmin Chen, Misha Khalman, Miteyan Patel, Mohamed Elhawaty, Mohammad Saleh, Mohsen Jafari, Moran Ambar, Mostafa Dehghani, Motoki Sano, Mrinal Shukla, Mukarram Tariq, Mukund Sundararajan, Nandita Dukkipati, Nan Hua, Nan Wei, Nanxin Chen, Naseer Shaik, Natalie Clay, Nathan Byrd, Nathan Lintz, Nathan Schucher, Neeraj Gaur, Neera Vats, Neil Houlsby, Nejc Trdin, Nemanja Rakićević, Niccolo Dal Santo, Nicholas FitzGerald, Nick Felt, Nick Fernando, Nicola De Cao, Nikhil Sethi, Nikolay Savinov, Nilesh Tripuraneni, Nimesh Ghelani, Nina Martin, Nir Levine, Nir Shabat, Nishant Ranka, Nishesh Gupta, Nithya Attaluri, Noah Fiedel, Nobuyuki Morioka, Nora Kassner, Norbert Kalb, Norman Casagrande, Obaid Sarvana, Olaf Ronneberger, Olcan Sercinoglu, Oliver Woodman, Olivia Wiles, Olivier Dousse, Orhan Firat, Oriol Vinyals, Oscar Chang, Oskar Bunyan, Pablo Sprechmann, Paramjit Sandhu, Parker Schuh, Paul Barham, Paul Kishan Rubenstein, Paul Komarek, Paul Michel, Paul Natsev, Paul Voigtlaender, Pedram Pejman, Pedro Valenzuela, Pei Sun, Pen Li, Peter Choy, Peter Hawkins, Peter Humphreys, Phil Chen, Phil Crone, Phoebe Thacker, Pidong Wang, Piyush Patil, Pouya Samangouei, Prakash Shroff, Pranav Shyam, Praseem Banzal, Prateek Jain, Pratik Joshi, Praveen Kallakuri, Praveen Kumar, Praveen Srinivasan, Premal Shah, Priya Jhakra, Priyanka Agrawal, Priya Ponnapalli, Pulkit Mehta, Qiao Zhang, Qijun Tan, Qingze Wang, Qiujia Li, Quan Wang, Quan Yuan, Quoc Le, Rachel Saputro, Rachel Sterneck, Radu Soricut, Rahma Chaabouni, Rajagopal Ananthanarayanan, Rajkumar Samuel, Rakesh Shivanna, Ramona Comanescu, Ramya Sree Boppana, Raoul de Liedekerke, Raphael Lopez Kaufman, Rasmus Larsen, Ravi Addanki, Ravin Kumar, Ravi Rajwar, Rebeca Santamaria-Fernandez, Reiko Tojo, Remi Crocker, Renshen Wang, Rhys May, Ricardo Aguilar, Richard Ives, Richard Powell, Richard Tanburn, Rich Munoz, Riham Mansour, Rishabh Agarwal, Rishabh Joshi, Rishika Sinha, RJ Skerry-Ryan, Robin Strudel, Rohan Anil, Rohan Jain, Rohin Shah, Roman Ring, Romina Datta, Ronny Huang, Roopal Garg, Roopali Vij, Rory Blevins, Rosanne Liu, Ross Hemsley, Ross Mcilroy, Roy Frostig, Ruibo Liu, Rui Wang, Ruizhe Zhao, Rui Zhu, Ruoxin Sang, Rupert Kemp, Ruslan Habalov, Ryan Burnell, Saaber Fatehi, Sadegh Jazayeri, Sadh MNM Khan, Sahitya 
Potluri, Salem Haykal, Salvatore Scellato, Sameer Agarwal, Samer Hassan, Samira Daruki, Sammy Jerome, Sanaz Bahargam, Sandeep Kumar, Sanil Jain, Sanjay Ganapathy, Sanjay Ghemawat, Sankalp Singh, Santiago Ontanon, Sarah Cogan, Sarah Hodkinson, Sarah York, Sara Mc Carthy, Sara McCarthy, Sasha Goldshtein, Sayed Hadi Hashemi, Sean Sechrist, Sean Sun, Seb Arnold, Sebastian Borgeaud, Sebastian Krause, Sebastian Riedel, Sebastien Cevey, Sebastien M. R. Arnold, Sebastien Pereira, Seb Noury, Sergey Brin, Sergi Caelles, Seth Odoom, Shaan Bijwadia, Shalini Pal, Shane Gu, Shantanu Thakoor, Shaobo Hou, Sharad Vikram, Shariq Iqbal, Sharon Lin, Shashank V, Shawn Lu, Sheleem Kashem, Shereen Ashraf, Sherry Yang, Shibo Wang, Shixiang Shane Gu, Sholto Douglas, Shourya Sarcar, Shreya Singh, Shreyas Rammohan Belle, Shruti Rijhwani, Shuang Song, Shubham Agrawal, Shubin Zhao, Shuo-yiin Chang, Shyam Upadhyay, Siamak Shakeri, Sid Dalmia, Siddhartha Brahma, Siddhartha Reddy Jonnalagadda, Siddharth Gopal, Siddharth Goyal, Sid Lall, Sid Mittal, Siim Poder, Simon Tokumine, Sina Samangooei, Siyuan Qiao, Skye Giordano, Slav Petrov, S. M. Ali Eslami, Sneha Kudugunta, Soheil Hassas Yeganeh, Solomon Chang, Solomon Kim, Somer Greene, Sonam Goenka, Soo Kwak, Sophia Austin, Sophie Bridgers, Soroosh Mariooryad, Soroush Radpour, Srini Narayanan, Srivatsan Srinivasan, Stephanie Winkler, Stephan Lee, Stephen Spencer, Steph Hughes-Fitt, Steven Baker, Steven Hand, Steven Hansen, Steven Kan, Steven Zheng, Sujeevan Rajayogam, Sujoy Basu, Sumit Bagri, Susan Zhang, Swaroop Mishra, Takaki Makino, Tamara von Glehn, Tao Zhu, Tara Sainath, Taylan Bilal, Taylor Tobin, Ted Klimenko, Tejasi Latkar, Thais Kagohara, Thang Luong, Thanumalayan Sankaranarayana Pillai, Thi Avrahami, Thibault Sellam, Thibault Sottiaux, Thomas Brovelli, Tianhe Yu, Tianqi Liu, Tiberiu Sosea, Tim Blyth, Tim Green, Timothy Chung, Timothy Dozat, Timothy Lillicrap, Tina Ornduff, Toby Shevlane, Tolga Bolukbasi, Tomas Kocisky, Tomasz Kępa, Tom Eccles, Tom Hennigan, Tom Hudson, Tom Kwiatkowski, Tom Le Paine, Tomy Tsai, Tong Zhou, Trevor Strohman, Trieu Trinh, Tsendsuren Munkhdalai, Tulsee Doshi, Tyler Liechty, Vahab Mirrokni, Vaibhav Aggarwal, Vaishakh Keshava, Valentin Anklin, Valentin Dalibard, Vedant Misra, Victor Campos, Victor Cotruta, Victor Ungureanu, Vihan Jain, Vijay Bolina, Vikas Yadav, Vikram Rao, Vinay Ramasesh, Vincent Hellendoorn, Vincent Zhuang, Ving Ian Lei, Vinod Koverkathu, Viorica Patraucean, Vitaly Nikolaev, Vittorio Selo, Vivek Sharma, Vlad-Doru Ion, Vladimir Feinberg, Vlado Galic, Wael Farhan, Warren Weilun Chen, Wei Chen, Weiyi Wang, Wen Ding, Wenhao Jia, Wiktor Gworek, Will Hawkins, William Wong, William Zeng, Willi Gierke, Wojciech Fica, Wojciech Stokowiec, Wolfgang Macherey, Woohyun Han, Wooyeol Kim, Xavier Garcia, Xerxes Dotiwalla, Xiance Si, XiangHai Sheng, Xiang Zhou, Xiaowei Li, Xiao Wu, Xi Chen, Xihui Wu, Xinjian Li, Xinyang Geng, Xinyi Wu, Xinyun Chen, Xinyu Ye, Xi Xiong, Xuehan Xiong, Xuezhi Wang, Yaguang Li, Yaming Xu, Yamini Bansal, Yana Kulizhskaya, Yang Gao, Yang Xu, Yanhua Sun, Yannie Liang, Yannis Assael, Yao Zhao, Yasemin Altun, Yaxin Liu, Yelin Kim, Ye Yuan, Ye Zhang, Yicheng Wang, Yifan Ding, Yifan He, Yiming Gu, Yingjie Miao, Ying Xu, Yingying Bi, Yiran Mao, Yi Su, Yi Sun, Yi-Xuan Tan, Yi Yao, Yong Cheng, Yonghui Wu, Yuan Cao, Yuan Liu, Yuan Zhang, Yuanzhong Xu, Yuchung Cheng, Yujia Li, Yujing Zhang, Yuma Koizumi, Yunhan Xu, Yunhao Tang, Yunjie Li, Yuri Chervonyi, Yury Sulsky, Zachary Nado, Zach Fisher, Zach Gleicher, Zafarali 
Ahmed, Zaheer Abbas, Zalan Borsos, Zeyncep Cankara, Zeynep Cankara, Zhe Chen, Zheng Xu, Zhenkai Zhu, Zhen Yang, Zhichun Wu, Zhishuai Zhang, Zhitao Gong, Zhufeng Pan, Zhuyun Xiao, Ziyue Wang, Zizhao Zhang, Zoe Ashwood, Zoltan Egyed, Zora Tung

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords Gemini 1.5 · long context · multimodal · context length · language models · video understanding · document QA

The pith

Gemini 1.5 models recall and reason over fine-grained details from millions of tokens of multimodal context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Gemini 1.5 family of models, consisting of an updated Pro version and a new lightweight Flash variant. These models process context lengths reaching millions of tokens across text documents, video, and audio inputs. They demonstrate near-perfect accuracy on retrieval tasks while advancing performance on long-document question answering, long-video question answering, and long-context speech recognition. The models also match or exceed the prior Gemini 1.0 Ultra results on a wide range of standard benchmarks and exhibit new behaviors such as learning translations for rare languages from grammar manuals.

Core claim

Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA, and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, the authors report continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens.

What carries the argument

The long-context processing in Gemini 1.5 models that supports recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.
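As a rough sense of scale, the sketch below converts a 10-million-token budget into hours of video and audio; the per-frame and per-second token costs are illustrative assumptions for this back-of-envelope check, not figures from the report.

# Rough conversion of a 10M-token context budget into media durations.
# Token costs are illustrative assumptions, not values from the report.
BUDGET = 10_000_000           # tokens

TOKENS_PER_FRAME = 258        # assumed cost of one video frame
FRAMES_PER_SECOND = 1         # assumed video sampling rate
TOKENS_PER_AUDIO_SECOND = 25  # assumed cost of one second of audio

video_hours = BUDGET / (TOKENS_PER_FRAME * FRAMES_PER_SECOND) / 3600
audio_hours = BUDGET / TOKENS_PER_AUDIO_SECOND / 3600
print(f"video: ~{video_hours:.1f} h at {FRAMES_PER_SECOND} fps")  # ~10.8 h
print(f"audio: ~{audio_hours:.1f} h")                             # ~111.1 h

Under these assumed rates, a 10M-token window holds roughly ten hours of 1 fps video or over a hundred hours of audio, which is the regime the phrase "hours of video and audio" points at.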

Load-bearing premise

The internal benchmarks accurately measure genuine long-context utilization rather than benefiting from training-data overlap or selective test construction.

What would settle it

A test inserting a unique fact at a random position in a fresh 10-million-token document never seen in training, then querying the model for that fact and measuring whether retrieval accuracy stays above 99 percent.
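A minimal Python sketch of that protocol, assuming a generic model_generate(prompt) text interface; the callable, the synthetic filler text, and the trial count are placeholders, not the paper's own needle-insertion setup.

import random
import string

def run_needle_test(model_generate, haystack_tokens: int, trials: int = 100) -> float:
    """Estimate long-context retrieval accuracy: hide a fresh fact at a
    random depth in synthetic filler text and ask the model to repeat it."""
    hits = 0
    for _ in range(trials):
        secret = "".join(random.choices(string.ascii_lowercase, k=12))
        needle = f"The secret keyword is {secret}."
        # Synthetic filler guarantees the document was never seen in training.
        filler = ["The sky was a uniform grey that day."] * max(1, haystack_tokens // 8)
        filler.insert(random.randrange(len(filler) + 1), needle)
        prompt = " ".join(filler) + "\nWhat is the secret keyword? Answer in one word."
        if secret in model_generate(prompt):
            hits += 1
    return hits / trials

# accuracy = run_needle_test(model.generate, haystack_tokens=10_000_000)  # hypothetical handle
# The claim under test: accuracy stays above 0.99 at 10M tokens.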

read the original abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Gemini 1.5 family of multimodal models, including an updated Gemini 1.5 Pro and a new lightweight Gemini 1.5 Flash. It claims these models achieve near-perfect recall (>99%) on long-context retrieval tasks across modalities up to at least 10M tokens, improve the state-of-the-art on long-document QA, long-video QA, and long-context ASR, match or surpass Gemini 1.0 Ultra on broad benchmarks, show continued scaling in next-token prediction, and demonstrate real-world utility including 26-75% time savings in professional tasks and the ability to learn English-to-Kalamang translation from a grammar manual.

Significance. If the long-context performance claims hold under independent scrutiny, the work would mark a substantial advance in scaling multimodal context windows to millions of tokens, enabling new capabilities in processing extended documents, video, and audio. The reported generational leap over prior models (e.g., Claude 3.0 at 200k, GPT-4 Turbo at 128k) and the novel low-resource language learning example could influence evaluation standards and architectural research in the field.

major comments (2)
  1. [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).
  2. [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed.
minor comments (2)
  1. [Real-world use cases section] The professional time-savings study (26-75% across 10 job categories) lacks details on methodology, sample size, or controls, which would strengthen the real-world use-case claims.
  2. [Benchmark comparison paragraphs] Some comparisons to prior models (Claude 3.0, GPT-4 Turbo) would benefit from explicit citations to the exact evaluation protocols or papers being referenced.
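Major comment 1 turns in part on contamination checks that the report does not describe. A minimal Python sketch of the standard n-gram overlap screen such a check might use; the eval-document and training-shard handles are hypothetical.

def ngram_set(text: str, n: int = 13) -> set:
    """All n-grams of whitespace tokens; 13-grams are a common screen size."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(eval_doc: str, training_shards: list, n: int = 13) -> float:
    """Fraction of the eval document's n-grams that also occur in training
    text; a nonzero rate flags possible test-set leakage."""
    eval_grams = ngram_set(eval_doc, n)
    if not eval_grams:
        return 0.0
    train_grams = set()
    for shard in training_shards:  # in practice, stream shards from disk
        train_grams |= ngram_set(shard, n)
    return len(eval_grams & train_grams) / len(eval_grams)

# rate = contamination_rate(needle_document, corpus_shards)  # hypothetical handles

A retrieval score is only evidence of genuine long-context use if this rate is essentially zero on the evaluation contexts.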

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive review of our manuscript introducing the Gemini 1.5 family of models. We address the major comments point by point below, providing the strongest honest clarifications possible given the proprietary nature of certain evaluation details.

read point-by-point responses
  1. Referee: [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).

    Authors: We agree that greater transparency on benchmark construction would strengthen verifiability. However, as these are proprietary internal benchmarks, we cannot release raw data, exact test-set definitions, full needle-insertion protocols, contamination checks, or ablation studies. The evaluations adapt standard needle-in-a-haystack methods to multimodal long contexts, using novel or held-out content to test genuine retrieval and reasoning. We have partially revised the manuscript to include additional high-level descriptions of the evaluation approach in the relevant sections. Error bars are not reported because performance is near ceiling across consistent runs; the results demonstrate clear improvements on long-document QA, long-video QA, and long-context ASR over prior models. revision: partial

  2. Referee: [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed.

    Authors: We acknowledge that specific held-out test set details and multiple-run statistics are not provided. Overlap with pre-training data is avoided by constructing evaluation contexts from post-cutoff or synthetic sources, but exact protocols cannot be disclosed to maintain benchmark integrity. The generational leap is demonstrated by the models' ability to process and recall from contexts up to 10M tokens, far exceeding the limits of models like Claude 3.0 (200k) and GPT-4 Turbo (128k), with near-perfect recall observed consistently. We have added a clarifying note in the revised manuscript on the use of held-out data for these limits studies. revision: partial
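On the rebuttal's claim that error bars are unnecessary near ceiling: the width of an interval on a near-ceiling proportion still depends on the trial count, which the report does not state. A minimal Wilson-interval sketch in Python, with the trial counts as hypothetical examples.

from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

# Hypothetical trial counts: perfect recall on 50 trials vs. 1000 trials.
print(wilson_interval(50, 50))      # ≈ (0.929, 1.0): cannot certify ">99%"
print(wilson_interval(1000, 1000))  # ≈ (0.996, 1.0): ">99%" is supportable

Even with all trials correct, the Wilson lower bound clears 0.99 only at roughly 400 trials or more, so the number of runs behind the >99% figure matters.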

standing simulated objections (unresolved)
  • Full disclosure of proprietary internal benchmark construction details, raw data, exact test sets, and complete ablation studies remains withheld, on confidentiality grounds.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

This is an empirical model release paper reporting benchmark results for Gemini 1.5 on long-context retrieval, QA, and ASR tasks. No algebraic derivations, first-principles predictions, or fitted parameters are presented that reduce by construction to the paper's own inputs. Self-citations to prior Gemini work are present but not load-bearing for the new long-context claims, which rest on held-out evaluations rather than tautological redefinitions or renamed fits. The central results are externally falsifiable via benchmark performance and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical model-release report containing no mathematical derivations, fitted constants, or theoretical postulates; all claims rest on benchmark measurements whose construction details are not supplied.

pith-pipeline@v0.9.0 · 10729 in / 1323 out tokens · 48951 ms · 2026-05-10T14:30:11.265354+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

    cs.LG 2026-05 unverdicted novelty 8.0

    OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...

  2. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  3. From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

    cs.SE 2026-04 unverdicted novelty 8.0

    MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...

  4. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  5. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  6. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  7. AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

    cs.CV 2026-05 conditional novelty 7.0

    AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...

  8. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  9. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

  10. Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.

  11. HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.

  12. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  13. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  14. AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

    astro-ph.IM 2026-05 unverdicted novelty 7.0

    AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

  15. Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving

    cs.OS 2026-05 unverdicted novelty 7.0

    Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than pri...

  16. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  17. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.

  18. MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.

  19. ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

    cs.SD 2026-04 unverdicted novelty 7.0

    ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.

  20. On Bayesian Softmax-Gated Mixture-of-Experts Models

    stat.ML 2026-04 unverdicted novelty 7.0

    Bayesian softmax-gated mixture-of-experts models achieve posterior contraction for density estimation and parameter recovery using Voronoi losses, plus two strategies for choosing the number of experts.

  21. Using large language models for embodied planning introduces systematic safety risks

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

  22. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  23. Verification Modulo Tested Library Contracts

    cs.PL 2026-04 unverdicted novelty 7.0

    A new framework synthesizes library method contracts that are adequate for client verification and pass testing scrutiny, using CHC solvers and ICE learning.

  24. From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.

  25. Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.

  26. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  27. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.

  28. Validity-Calibrated Reasoning Distillation

    cs.LG 2026-04 unverdicted novelty 7.0

    Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.

  29. Skill-Conditioned Visual Geolocation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.

  30. Skill-Conditioned Visual Geolocation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...

  31. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  32. MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments

    cs.CV 2026-04 unverdicted novelty 7.0

    MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.

  33. How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

    cs.AI 2026-04 unverdicted novelty 7.0

    A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.

  34. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  35. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  36. Retrieval Augmented Conversational Recommendation with Reinforcement Learning

    cs.IR 2026-04 unverdicted novelty 7.0

    RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

  37. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  38. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  39. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  40. TextGrad: Automatic "Differentiation" via Text

    cs.CL 2024-06 unverdicted novelty 7.0

    TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

  41. Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

    cs.CV 2026-05 unverdicted novelty 6.0

    Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

  42. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  43. PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...

  44. Training-Inference Consistent Segmented Execution for Long-Context LLMs

    cs.CL 2026-05 conditional novelty 6.0

    A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.

  45. Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.

  46. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  47. Personal Visual Context Learning in Large Multimodal Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.

  48. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  49. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  50. XPERT: Expert Knowledge Transfer for Effective Training of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.

  51. Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

    cs.CV 2026-05 unverdicted novelty 6.0

    Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D...

  52. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  53. HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

  54. On the Blessing of Pre-training in Weak-to-Strong Generalization

    cs.LG 2026-05 unverdicted novelty 6.0

    Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

  55. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  56. Shattering the Echo Chamber: Hidden Safeguards in Manuscripts Against the AI Takeover of Peer Review

    cs.CR 2026-05 unverdicted novelty 6.0

    IntraGuard uses three intra-stream PDF injection methods to embed explicit refusal triggers and implicit review markers, achieving up to 84% defense success against 7 commercial chatbots across 12 venues without affec...

  57. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  58. From Priors to Perception: Grounding Video-LLMs in Physical Reality

    cs.CV 2026-05 unverdicted novelty 6.0

    Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...

  59. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  60. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
