Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Pith reviewed 2026-05-10 14:30 UTC · model grok-4.3
The pith
Gemini 1.5 models recall and reason over fine-grained details from millions of tokens of multimodal context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, the authors report continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens.
What carries the argument
The long-context processing in Gemini 1.5 models, which supports recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.
Load-bearing premise
The internal benchmarks accurately measure genuine long-context utilization rather than benefiting from training-data overlap or selective test construction.
What would settle it
A test inserting a unique fact at a random position in a fresh 10-million-token document never seen in training, then querying the model for that fact and measuring whether retrieval accuracy stays above 99 percent.
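A minimal sketch of that test, treating whole words as a rough proxy for tokens and using a hypothetical query_model callable as a stand-in for whichever long-context API is under test; a real run would use genuinely novel filler text rather than a repeated pangram:

```python
import random

def build_haystack(n_words: int, needle: str) -> str:
    """Fresh synthetic document of roughly n_words words with the needle
    sentence spliced in at a random position."""
    filler = "The quick brown fox jumps over the lazy dog."  # stand-in; use genuinely novel text in practice
    words = (filler.split() * (n_words // 9 + 1))[:n_words]
    pos = random.randint(0, len(words))
    words[pos:pos] = needle.split()
    return " ".join(words)

def run_trial(query_model) -> bool:
    """One retrieval trial: hide a unique fact, then ask for it."""
    secret = f"secret-{random.randrange(10**9)}"
    needle = f"The secret keyword is {secret}."
    haystack = build_haystack(10_000_000, needle)  # rough 10M-token analogue
    prompt = haystack + "\n\nWhat is the secret keyword mentioned above?"
    return secret in query_model(prompt)  # query_model: hypothetical API wrapper

def retrieval_accuracy(query_model, trials: int = 100) -> float:
    """Fraction of trials in which the model returns the hidden keyword."""
    hits = sum(run_trial(query_model) for _ in range(trials))
    return hits / trials  # the paper's claim predicts this stays above 0.99
```

Repeating the trial across insertion depths and context lengths would give the familiar depth-by-length recall grid; the claim under test is that the 10M-token column stays above 99%.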
read the original abstract
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Gemini 1.5 family of multimodal models, including an updated Gemini 1.5 Pro and a new lightweight Gemini 1.5 Flash. It claims these models achieve near-perfect recall (>99%) on long-context retrieval tasks across modalities up to at least 10M tokens, improve the state-of-the-art on long-document QA, long-video QA, and long-context ASR, match or surpass Gemini 1.0 Ultra on broad benchmarks, show continued scaling in next-token prediction, and demonstrate real-world utility including 26-75% time savings in professional tasks and the ability to learn English-to-Kalamang translation from a grammar manual.
Significance. If the long-context performance claims hold under independent scrutiny, the work would mark a substantial advance in scaling multimodal context windows to millions of tokens, enabling new capabilities in processing extended documents, video, and audio. The reported generational leap over prior models (e.g., Claude 3.0 at 200k, GPT-4 Turbo at 128k) and the novel low-resource language learning example could influence evaluation standards and architectural research in the field.
major comments (2)
- [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).
- [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed (a sketch of a minimal overlap check follows this list).
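On the overlap question specifically, even a coarse, reportable screen would help. Below is a minimal sketch of a word n-gram overlap check between an evaluation context and sampled training text, with eval_context and training_shards as hypothetical inputs and 13-grams as a common unit; this is illustrative, not the authors' protocol:

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Lowercased word n-grams; 13-grams are a common unit for contamination checks."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(eval_context: str, training_shards: list[str], n: int = 13) -> float:
    """Fraction of the eval context's n-grams that also occur in the sampled training text."""
    eval_grams = ngrams(eval_context, n)
    if not eval_grams:
        return 0.0
    train_grams: set[str] = set()
    for shard in training_shards:  # in practice a streamed pass or Bloom-filter sketch over the corpus
        train_grams |= ngrams(shard, n)
    return len(eval_grams & train_grams) / len(eval_grams)
```

Reporting a figure like this per benchmark, alongside per-run variance, would address much of the verifiability concern raised above.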
minor comments (2)
- [Real-world use cases section] The professional time-savings study (26-75% across 10 job categories) lacks details on methodology, sample size, and controls; providing these would strengthen the real-world use-case claims.
- [Benchmark comparison paragraphs] Some comparisons to prior models (Claude 3.0, GPT-4 Turbo) would benefit from explicit citations to the exact evaluation protocols or papers being referenced.
Simulated Author's Rebuttal
We thank the referee for their constructive review of our manuscript introducing the Gemini 1.5 family of models. We address the major comments point by point below, providing the strongest honest clarifications possible given the proprietary nature of certain evaluation details.
read point-by-point responses
-
Referee: [Abstract and evaluation sections on long-context retrieval/QA/ASR] The central claims of near-perfect recall (>99%) up to 10M tokens and SOTA improvements on long-context tasks rest on internal benchmarks whose construction details, test-set definitions, needle-insertion protocols, contamination checks, raw data, error bars, and ablation studies are not provided. This makes it impossible to verify whether the results reflect genuine long-context utilization rather than test-set artifacts or post-hoc choices (see abstract and the sections describing retrieval, QA, and ASR evaluations).
Authors: We agree that greater transparency on benchmark construction would strengthen verifiability. However, as these are proprietary internal benchmarks, we cannot release raw data, exact test-set definitions, full needle-insertion protocols, contamination checks, or ablation studies. The evaluations adapt standard needle-in-a-haystack methods to multimodal long contexts, using novel or held-out content to test genuine retrieval and reasoning (the prompt construction is sketched after these responses). We have partially revised the manuscript to include additional high-level descriptions of the evaluation approach in the relevant sections. Error bars are not reported because performance is near ceiling across consistent runs; the results demonstrate clear improvements on long-document QA, long-video QA, and long-context ASR over prior models. revision: partial
-
Referee: [Sections reporting benchmark results and limits of long-context ability] The manuscript does not report the exact held-out test sets, how they avoid overlap with pre-training data, or multiple-run statistics for the reported performance figures. Without these, the robustness of the 'generational leap' claim over existing models cannot be assessed.
Authors: We acknowledge that specific held-out test set details and multiple-run statistics are not provided. Overlap with pre-training data is avoided by constructing evaluation contexts from post-cutoff or synthetic sources, but exact protocols cannot be disclosed to maintain benchmark integrity. The generational leap is demonstrated by the models' ability to process and recall from contexts up to 10M tokens, far exceeding the limits of models like Claude 3.0 (200k) and GPT-4 Turbo (128k), with near-perfect recall observed consistently. We have added a clarifying note in the revised manuscript on the use of held-out data for these limits studies. revision: partial
- Not provided, citing confidentiality requirements: full disclosure of proprietary internal benchmark construction details, raw data, exact test sets, and complete ablation studies.
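The report's appendix does spell out the prompt construction for the multimodal needle tests: a text header, then each video frame preceded by an mm:ss timestamp, with the needle being a frame carrying the caption "The secret word is 'needle'", followed by the query "What is the secret word?". A minimal sketch of that interleaving, with Part as a hypothetical text-or-image container rather than the actual serving format:

```python
from dataclasses import dataclass

@dataclass
class Part:
    """Hypothetical container holding either a text chunk or raw frame bytes."""
    text: str | None = None
    image: bytes | None = None

def build_video_niah_prompt(frames: list[bytes], needle_index: int, needle_frame: bytes) -> list[Part]:
    """Interleave mm:ss timestamps and frame bytes as the appendix describes,
    substituting the captioned needle frame at needle_index."""
    parts = [Part(text="Look through each frame in the video carefully and answer the question.")]
    for i, frame in enumerate(frames):
        parts.append(Part(text=f"{i // 60:02}:{i % 60:02}"))  # e.g. frame 10000 -> "166:40"
        parts.append(Part(image=needle_frame if i == needle_index else frame))
    parts.append(Part(text="What is the secret word?"))
    return parts
```

The audio variant is analogous: a spoken segment saying 'The secret keyword is "needle".' is embedded in a VoxPopuli haystack, and the model is asked to name the keyword.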
Circularity Check
No significant circularity in derivation chain
full rationale
This is an empirical model release paper reporting benchmark results for Gemini 1.5 on long-context retrieval, QA, and ASR tasks. No algebraic derivations, first-principles predictions, or fitted parameters are presented that reduce by construction to the paper's own inputs. Self-citations to prior Gemini work are present but not load-bearing for the new long-context claims, which rest on held-out evaluations rather than tautological redefinitions or renamed fits. The central results are externally falsifiable via benchmark performance and do not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents
OTora provides the first unified framework for reasoning-level denial-of-service attacks on LLM agents, achieving up to 10x more reasoning tokens and order-of-magnitude latency increases while preserving task accuracy...
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
Tutti is a GPU-direct SSD-backed KV cache that removes CPU bottlenecks via object abstraction, GPU io_uring, and slack scheduling, delivering near-DRAM performance at 2x higher request rate and 27% lower cost than pri...
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
On Bayesian Softmax-Gated Mixture-of-Experts Models
Bayesian softmax-gated mixture-of-experts models achieve posterior contraction for density estimation and parameter recovery using Voronoi losses, plus two strategies for choosing the number of experts.
-
Using large language models for embodied planning introduces systematic safety risks
LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Verification Modulo Tested Library Contracts
A new framework synthesizes library method contracts that are adequate for client verification and pass testing scrutiny, using CHC solvers and ICE learning.
-
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves small LLMs by using relative local validity of next steps to dynamically adjust imitation strength instead of enforcing full trajectory matching.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
-
Skill-Conditioned Visual Geolocation for Vision-Language Models
GeoSkill uses an evolving Skill-Graph initialized from expert trajectories and grown via autonomous analysis of successful and failed reasoning rollouts to boost geolocation accuracy, faithfulness, and generalization ...
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
MARINER is a new benchmark dataset and evaluation framework for fine-grained perception and causal reasoning in open-water scenes using 16,629 images across 63 vessel categories, diverse environments, and maritime incidents.
-
How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Retrieval Augmented Conversational Recommendation with Reinforcement Learning
RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.
-
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
-
Offline Materials Optimization with CliqueFlowmer
CliqueFlowmer combines clique-based model-based optimization with transformer and flow models to generate materials that optimize target properties better than generative baselines.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
TextGrad: Automatic "Differentiation" via Text
TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.
-
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
PRISM achieves higher accuracy than baselines on long-horizon agent tasks at an order-of-magnitude smaller context budget by combining hierarchical bundle search, query-sensitive costing, evidence compression, and ada...
-
Training-Inference Consistent Segmented Execution for Long-Context LLMs
A training-inference consistent segmented execution framework for long-context LLMs matches full-context performance with substantially lower peak memory at very long lengths.
-
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D generates efficient 3D proxy representations via semantic clustering from video frames and aligns them to VLMs through multi-stage training on the new SpaceSpan dataset, achieving competitive performance on 3D...
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...