Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Pith reviewed 2026-05-10 14:30 UTC · model grok-4.3
The pith
Gemini 1.5 models recall and reason over fine-grained details from millions of tokens of multimodal context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, the authors report continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens.
What carries the argument
The long-context processing in Gemini 1.5 models, which supports recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.
Load-bearing premise
The internal benchmarks accurately measure genuine long-context utilization rather than benefiting from training-data overlap or selective test construction.
What would settle it
A test inserting a unique fact at a random position in a fresh 10-million-token document never seen in training, then querying the model for that fact and measuring whether retrieval accuracy stays above 99 percent.
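A minimal sketch of that test, treating whole words as a rough proxy for tokens and using a hypothetical query_model callable as a stand-in for whichever long-context API is under test; a real run would use genuinely novel filler text rather than a repeated pangram:

```python
import random

def build_haystack(n_words: int, needle: str) -> str:
    """Fresh synthetic document of roughly n_words words with the needle
    sentence spliced in at a random position."""
    filler = "The quick brown fox jumps over the lazy dog."  # stand-in; use genuinely novel text in practice
    words = (filler.split() * (n_words // 9 + 1))[:n_words]
    pos = random.randint(0, len(words))
    words[pos:pos] = needle.split()
    return " ".join(words)

def run_trial(query_model) -> bool:
    """One retrieval trial: hide a unique fact, then ask for it."""
    secret = f"secret-{random.randrange(10**9)}"
    needle = f"The secret keyword is {secret}."
    haystack = build_haystack(10_000_000, needle)  # rough 10M-token analogue
    prompt = haystack + "\n\nWhat is the secret keyword mentioned above?"
    return secret in query_model(prompt)  # query_model: hypothetical API wrapper

def retrieval_accuracy(query_model, trials: int = 100) -> float:
    """Fraction of trials in which the model returns the hidden keyword."""
    hits = sum(run_trial(query_model) for _ in range(trials))
    return hits / trials  # the paper's claim predicts this stays above 0.99
```

Repeating the trial across insertion depths and context lengths would give the familiar depth-by-length recall grid; the claim under test is that the 10M-token column stays above 99%.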
read the original abstract
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Gemini 1.5 family of multimodal models, including an updated Gemini 1.5 Pro and a new lightweight Gemini 1.5 Flash. It claims these models achieve near-perfect recall (>99%) on long-context retrieval tasks across modalities up to at least 10M tokens, improve the state-of-the-art on long-document QA, long-video QA, and long-context ASR, match or surpass Gemini 1.0 Ultra on broad benchmarks, show continued scaling in next-token prediction, and demonstrate real-world utility including 26-75% time savings in professional tasks and the ability to learn English-to-Kalamang translation from a grammar manual.
Significance. If the long-context performance claims hold under independent scrutiny, the work would mark a substantial advance in scaling multimodal context windows to millions of tokens, enabling new capabilities in processing extended documents, video, and audio. The reported generational leap over prior models (e.g., Claude 3.0 at 200k, GPT-4 Turbo at 128k) and the novel low-resource language learning example could influence evaluation standards and architectural research in the field.
major comments (2)
- [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).
- [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed (a sketch of a minimal overlap check follows this list).
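On the overlap question specifically, even a coarse, reportable screen would help. Below is a minimal sketch of a word n-gram overlap check between an evaluation context and sampled training text, with eval_context and training_shards as hypothetical inputs and 13-grams as a common unit; this is illustrative, not the authors' protocol:

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Lowercased word n-grams; 13-grams are a common unit for contamination checks."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(eval_context: str, training_shards: list[str], n: int = 13) -> float:
    """Fraction of the eval context's n-grams that also occur in the sampled training text."""
    eval_grams = ngrams(eval_context, n)
    if not eval_grams:
        return 0.0
    train_grams: set[str] = set()
    for shard in training_shards:  # in practice a streamed pass or Bloom-filter sketch over the corpus
        train_grams |= ngrams(shard, n)
    return len(eval_grams & train_grams) / len(eval_grams)
```

Reporting a figure like this per benchmark, alongside per-run variance, would address much of the verifiability concern raised above.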
minor comments (2)
- [Real-world use cases section] The professional time-savings study (26-75% across 10 job categories) lacks details on methodology, sample size, and controls; providing these would strengthen the real-world use-case claims.
- [Benchmark comparison paragraphs] Some comparisons to prior models (Claude 3.0, GPT-4 Turbo) would benefit from explicit citations to the exact evaluation protocols or papers being referenced.
Simulated Author's Rebuttal
We thank the referee for their constructive review of our manuscript introducing the Gemini 1.5 family of models. We address the major comments point by point below, providing the strongest honest clarifications possible given the proprietary nature of certain evaluation details.
read point-by-point responses
-
Referee: [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).
Authors: We agree that greater transparency on benchmark construction would strengthen verifiability. However, as these are proprietary internal benchmarks, we cannot release raw data, exact test-set definitions, full needle-insertion protocols, contamination checks, or ablation studies. The evaluations adapt standard needle-in-a-haystack methods to multimodal long contexts, using novel or held-out content to test genuine retrieval and reasoning (the prompt construction is sketched after these responses). We have partially revised the manuscript to include additional high-level descriptions of the evaluation approach in the relevant sections. Error bars are not reported because performance is near ceiling across consistent runs; the results demonstrate clear improvements on long-document QA, long-video QA, and long-context ASR over prior models. revision: partial
-
Referee: [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed.
Authors: We acknowledge that specific held-out test set details and multiple-run statistics are not provided. Overlap with pre-training data is avoided by constructing evaluation contexts from post-cutoff or synthetic sources, but exact protocols cannot be disclosed to maintain benchmark integrity. The generational leap is demonstrated by the models' ability to process and recall from contexts up to 10M tokens, far exceeding the limits of models like Claude 3.0 (200k) and GPT-4 Turbo (128k), with near-perfect recall observed consistently. We have added a clarifying note in the revised manuscript on the use of held-out data for these limits studies. revision: partial
- Not provided, citing confidentiality requirements: full disclosure of proprietary internal benchmark construction details, raw data, exact test sets, and complete ablation studies.
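The report's appendix does spell out the prompt construction for the multimodal needle tests: a text header, then each video frame preceded by an mm:ss timestamp, with the needle being a frame carrying the caption "The secret word is 'needle'", followed by the query "What is the secret word?". A minimal sketch of that interleaving, with Part as a hypothetical text-or-image container rather than the actual serving format:

```python
from dataclasses import dataclass

@dataclass
class Part:
    """Hypothetical container holding either a text chunk or raw frame bytes."""
    text: str | None = None
    image: bytes | None = None

def build_video_niah_prompt(frames: list[bytes], needle_index: int, needle_frame: bytes) -> list[Part]:
    """Interleave mm:ss timestamps and frame bytes as the appendix describes,
    substituting the captioned needle frame at needle_index."""
    parts = [Part(text="Look through each frame in the video carefully and answer the question.")]
    for i, frame in enumerate(frames):
        parts.append(Part(text=f"{i // 60:02}:{i % 60:02}"))  # e.g. frame 10000 -> "166:40"
        parts.append(Part(image=needle_frame if i == needle_index else frame))
    parts.append(Part(text="What is the secret word?"))
    return parts
```

The audio variant is analogous: a spoken segment saying 'The secret keyword is "needle".' is embedded in a VoxPopuli haystack, and the model is asked to name the keyword.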
Circularity Check
No significant circularity in derivation chain
full rationale
This is an empirical model release paper reporting benchmark results for Gemini 1.5 on long-context retrieval, QA, and ASR tasks. No algebraic derivations, first-principles predictions, or fitted parameters are presented that reduce by construction to the paper's own inputs. Self-citations to prior Gemini work are present but not load-bearing for the new long-context claims, which rest on held-out evaluations rather than tautological redefinitions or renamed fits. The central results are externally falsifiable via benchmark performance and do not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than pri...
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
On Bayesian Softmax-Gated Mixture-of-Experts Models
Bayesian softmax-gated mixture-of-experts models achieve posterior contraction for density estimation and parameter recovery using Voronoi losses, plus two strategies for choosing the number of experts.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Verification Modulo Tested Library Contracts
A new framework synthesizes library method contracts that are adequate for client verification and pass testing scrutiny, using CHC solvers and ICE learning.
-
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
-
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
-
Offline Materials Optimization with CliqueFlowmer
CliqueFlowmer combines clique-based model-based optimization with transformer and flow models to generate materials that optimize target properties better than generative baselines.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
-
Training-Inference Consistent Segmented Execution for Long-Context LLMs
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
-
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D...
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...