CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
super hub Mixed citations
Gemma: Open Models Based on Gemini Research and Technology
Mixed citation behavior. Most common role is background (60%).
abstract
This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed descript
authors
co-cited works
representative citing papers
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.
LiteLVLM prunes visual tokens for pixel grounding by reversing CLIP visual-text similarity to retain referent region tokens, outperforming prior methods by over 5% with 22% speedup and 2.3x memory reduction without any training.
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
PAS encodes locations via relative anchors and bins to deliver roughly 370-400m adversarial error in spatial RAG while retaining over half the baseline retrieval performance and keeping generation quality robust.
Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
Full development of 7B and 32B Olmo 3 models used 12.3 GWh datacenter energy and emitted 4,251 tCO2eq, with development overheads accounting for 82% of compute and reasoning models costing 17x more to post-train than instruction-tuned ones.
Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
Skip-connected MLPs and residual-free MLPs of equal width represent generically disjoint function classes for common activations, with explicit impossibility proofs and a non-generic absorption condition for ReLU and GELU.
Fine-tuning shows higher proficiency than in-context learning on in-distribution generalization in formal languages, with equal out-of-distribution performance and diverging inductive biases at high proficiency.
citing papers explorer
-
Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
EditRisk-Bench demonstrates that malicious knowledge editing reliably induces incorrect or unsafe reasoning in LLMs while largely preserving general capabilities.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
-
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not yet safe for clinical use.
-
Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.