pith. machine review for the scientific record.

arxiv: 2403.08295 · v4 · submitted 2024-03-13 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team: Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, and 99 more authors

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Gemma · open language models · Gemini · LLM benchmarks · model release · safety evaluation · pretrained models

The pith

Gemma open models built from Gemini research outperform similar open models on 11 of 18 text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gemma as a family of lightweight open language models developed using the research and technology from the Gemini models. It releases both 2 billion and 7 billion parameter versions in pretrained and instruction-tuned forms, with evaluations showing they lead other open models on eleven of eighteen standard text-based benchmarks for understanding, reasoning, and safety. The authors argue that making such capable models openly available, together with detailed responsibility assessments, supports safer progress across the field and opens the door to further LLM innovations.

Core claim

Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. The models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. Two sizes are released (2 billion and 7 billion parameters) with both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, accompanied by comprehensive safety and responsibility evaluations.
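The arithmetic behind the headline "11 of 18" figure is a simple per-task win count between Gemma and the best similarly sized open baseline. A minimal sketch (all task names and scores below are made up for illustration, not the paper's numbers):

```python
def count_wins(gemma_scores, baseline_scores):
    """Return how many tasks Gemma strictly outperforms the baseline on."""
    assert gemma_scores.keys() == baseline_scores.keys()
    return sum(gemma_scores[t] > baseline_scores[t] for t in gemma_scores)

# Toy example with three made-up tasks and scores:
gemma = {"mmlu": 64.3, "arc": 61.1, "gsm8k": 46.4}
baseline = {"mmlu": 62.5, "arc": 61.1, "gsm8k": 39.6}
print(count_wins(gemma, baseline))  # → 2 (ties do not count as wins)
```

Note that a strict-inequality count is one convention; whether ties count as wins is exactly the kind of protocol detail the referee report below asks the authors to pin down.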

What carries the argument

The Gemma model family, which adapts Gemini research and technology to produce efficient open language models at 2B and 7B scales.

Load-bearing premise

The chosen academic benchmarks and safety metrics are representative of real-world capabilities and risks.

What would settle it

Independent evaluations on held-out tasks, or external safety audits, in which the Gemma models fail to match or exceed the reported advantage on a majority of evaluations.

read the original abstract

This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces the Gemma family of lightweight open language models (2B and 7B parameters) derived from Gemini research and technology. It reports strong performance on academic benchmarks for language understanding, reasoning, and safety, with the claim that Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks. The authors release both pretrained and instruction-tuned model checkpoints along with comprehensive safety and responsibility evaluations and a description of the model development process.

Significance. If the benchmark results hold, the work makes a meaningful contribution to open LLM research by releasing high-performing, accessible models with accompanying safety assessments. The provision of model weights enables direct verification of the performance claims and supports further community experimentation, which strengthens the paper's value for reproducibility.

minor comments (3)
  1. [Abstract and §1] The abstract and introduction reference outperformance on 11 of 18 tasks but would benefit from an early summary table or explicit list of the tasks and baseline models to improve immediate readability for readers scanning the paper.
  2. [Model Development and Evaluation sections] The description of model development provides a high-level overview of training but could clarify the exact evaluation protocols (e.g., few-shot settings, prompt templates) used for the 18 text-based tasks to facilitate precise replication by others.
  3. [Results tables and figures] Figure and table captions should explicitly state the source of any external baseline numbers (e.g., from original papers or reproduced runs) to avoid ambiguity in the comparisons.
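The second minor comment asks for exact evaluation protocols. A hypothetical sketch of the kind of reproducible detail meant here — a k-shot prompt builder whose template and exemplars are placeholders, not the paper's actual evaluation harness:

```python
def build_kshot_prompt(exemplars, question, k=2, template="Q: {q}\nA: {a}"):
    """Assemble a k-shot prompt from (question, answer) exemplar pairs,
    ending with the target question and an empty answer slot."""
    shots = [template.format(q=q, a=a) for q, a in exemplars[:k]]
    shots.append(template.format(q=question, a="").rstrip())
    return "\n\n".join(shots)

prompt = build_kshot_prompt(
    [("2+2?", "4"), ("3+5?", "8"), ("7-1?", "6")],
    "9-4?",
)
print(prompt)
```

Publishing the template string, the value of k, and the exemplar selection rule for each of the 18 tasks would make the reported numbers directly replicable.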

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of our manuscript and for recommending acceptance. We appreciate the recognition of the value in releasing high-performing open models with accompanying safety evaluations and detailed development descriptions.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper is a model release report that describes Gemma as built from Gemini research technology and evaluates it empirically on public academic benchmarks. The central claim of outperforming similar open models on 11 of 18 tasks rests on externally verifiable benchmark scores and released checkpoints, not on any internal equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. No uniqueness theorems, ansatzes, or self-definitional steps appear; the evaluation methodology follows standard practices and supplies artifacts for independent checking, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Claims rest on standard transformer training practices and benchmark comparisons rather than new theoretical constructs; several training hyperparameters and data choices are fitted during development.

free parameters (2)
  • model scale and architecture details
    Specific layer counts, hidden dimensions, and attention heads chosen for the 2B and 7B sizes.
  • training data mixture and weighting
    Selection and proportions of pretraining data sources tuned to achieve reported performance.
axioms (1)
  • domain assumption Transformer-based language models trained on large text corpora can achieve strong benchmark performance
    Invoked throughout the model development and evaluation sections.
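The architecture free parameters in the ledger can be made concrete with a back-of-the-envelope parameter count. The numeric values below are illustrative placeholders (not Gemma's published configuration), chosen only to show how layer count, hidden width, and vocabulary size compose into a ~2B total:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    # Illustrative placeholder values, not Gemma's published hyperparameters.
    n_layers: int = 18
    d_model: int = 2048
    n_heads: int = 8
    d_ff: int = 16384
    vocab_size: int = 256_000

def approx_params(cfg: TransformerConfig) -> int:
    """Rough dense-parameter estimate: embedding table plus, per layer,
    the four attention projections and the two MLP projections."""
    embed = cfg.vocab_size * cfg.d_model
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, output projections
    mlp = 2 * cfg.d_model * cfg.d_ff       # up- and down-projections
    return embed + cfg.n_layers * (attn + mlp)

print(f"{approx_params(TransformerConfig()) / 1e9:.2f}B")  # → 2.03B
```

The estimate ignores norms, biases, and gated-MLP variants; its point is that the "free parameters" the ledger flags are a handful of integers that jointly fix the model scale.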

pith-pipeline@v0.9.0 · 5891 in / 1196 out tokens · 38486 ms · 2026-05-10T15:50:05.869074+00:00 · methodology

discussion (0)

Anyone can read Pith papers without signing in; commenting requires signing in with ORCID, Apple, or X.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViMU: Benchmarking Video Metaphorical Understanding

    cs.CV 2026-05 unverdicted novelty 8.0

    ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

  2. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  3. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  4. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  5. Adaptive Stopping for Multi-Turn LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG an...

  6. The Linear Representation Hypothesis and the Geometry of Large Language Models

    cs.CL 2023-11 conditional novelty 8.0

    Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.

  7. Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

    cs.LG 2026-05 conditional novelty 7.0

    A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.

  8. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  9. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  10. TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

    cs.CL 2026-05 unverdicted novelty 7.0

    TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

  11. IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.

12. CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

    cs.CV 2026-05 conditional novelty 7.0

    LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without an...

  13. Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

    cs.LG 2026-05 unverdicted novelty 7.0

    Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.

  14. Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters

    cs.CV 2026-05 unverdicted novelty 7.0

    Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.

  15. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  16. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  17. Entropy-informed Decoding: Adaptive Information-Driven Branching

    cs.LG 2026-05 unverdicted novelty 7.0

    EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fi...

  18. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  19. Privacy Without Losing Place: A Paradigm for Private Retrieval in Spatial RAGs

    cs.CR 2026-05 unverdicted novelty 7.0

    PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.

  20. Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering

    cs.CR 2026-05 unverdicted novelty 7.0

    Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.

  21. Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.

  22. Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

    cs.CR 2026-05 unverdicted novelty 7.0

    RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

  23. The Hidden Cost of Thinking: Energy Use and Environmental Impact of LMs Beyond Pretraining

    cs.CY 2026-05 unverdicted novelty 7.0

    Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than ...

  24. Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

    cs.DC 2026-05 unverdicted novelty 7.0

    Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.

  25. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  26. Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

    cs.LG 2026-04 unverdicted novelty 7.0

    Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.

  27. Can an MLP Absorb Its Own Skip Connection?

    cs.LG 2026-04 accept novelty 7.0

    Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.

  28. Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

    cs.CL 2026-04 unverdicted novelty 7.0

    Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.

  29. ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

    cs.LG 2026-04 unverdicted novelty 7.0

    ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.

  30. RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

    cs.CV 2026-04 unverdicted novelty 7.0

    RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

  31. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  32. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  33. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  34. AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    AtlasOCR delivers the first open-source Darija OCR by fine-tuning Qwen2.5-VL 3B, achieving state-of-the-art results on custom and existing benchmarks for both Darija and Arabic.

  35. BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...

  36. Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

  37. Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures

    cs.SE 2026-04 accept novelty 7.0

    Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.

  38. The limits of bio-molecular modeling with large language models : a cross-scale evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

  39. GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    cs.LG 2024-10 accept novelty 7.0

    LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

  40. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  41. Jamba: A Hybrid Transformer-Mamba Language Model

    cs.CL 2024-03 conditional novelty 7.0

    Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.

  42. Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

    cs.LG 2026-05 unverdicted novelty 6.0

    Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

  43. Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

    stat.ML 2026-05 unverdicted novelty 6.0

    A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% ...

  44. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  45. Large Spectrum Models (LSMs): Decoder-Only Transformer-Powered Spectrum Activity Forecasting via Tokenized RF Data

    cs.NI 2026-05 unverdicted novelty 6.0

    Decoder-only transformers trained on tokenized RF spectrum data from 22 TB of measurements achieve 3.25 dB RMSE in spectrum activity forecasting across 33 bands.

  46. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  47. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  48. Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

    cs.AI 2026-05 unverdicted novelty 6.0

    EditRisk-Bench demonstrates that malicious knowledge editing reliably induces incorrect or unsafe reasoning in LLMs while largely preserving general capabilities.

  49. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  50. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  51. ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

  52. Common-agency Games for Multi-Objective Test-Time Alignment

    cs.GT 2026-05 unverdicted novelty 6.0

    CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

  53. ModelLens: Finding the Best for Your Task from Myriads of Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.

  54. ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction

    cs.CV 2026-05 unverdicted novelty 6.0

    ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...

  55. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  56. RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RefusalGuard constrains updates in hidden representation space to preserve safety-relevant geometric structure during fine-tuning, maintaining low attack success rates on safety benchmarks while preserving task performance.

  57. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  58. DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    DUAL-BLADE uses a dual-path KV-cache framework with NVMe-direct access to reduce prefill and decode latency by up to 33% and 42% while improving SSD utilization 2.2x under tight memory budgets.

  59. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  60. Estimating Tail Risks in Language Model Output Distributions

    cs.LG 2026-04 unverdicted novelty 6.0

    Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 118 Pith papers · 20 internal anchors

  1. [1]

    Almazrouei, H

    E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models, 2023

  2. [2]

    Amodei, C

    D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man \'e . Concrete problems in AI safety. arXiv preprint, 2016

  3. [5]

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. L...

  4. [6]

    Barham, A

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ml, 2022

  5. [8]

    R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39, 1952

  6. [11]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H...

  7. [12]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017

  8. [14]

    Clark, I

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  9. [16]

    J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. a. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng. Large scale distributed deep networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/pap...

  10. [18]

    Gemini: A family of highly capable multimodal models, 2023

    Gemini Team . Gemini: A family of highly capable multimodal models, 2023

  11. [20]

    Hendrycks, C
