PaLM 2 Technical Report

Alexandre Passos, Andrew M. Dai, Dmitry Lepikhin, Melvin Johnson, Orhan Firat, Rohan Anil · 2023 · cs.CL · arXiv 2305.10403

Canonical reference. 88% of citing Pith papers cite this work as background.

104 Pith papers citing it

Background 88% of classified citations

abstract

We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

citation-role summary

background 23 · baseline 2 · method 1
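The "88% of classified citations" figure quoted on this page appears to follow from the role counts above. The following is a minimal sketch of that arithmetic, assuming the percentage is taken over the 26 classified citations in the summary rather than over all 104 citing papers; the names `role_counts` and `role_share` are hypothetical and used only for illustration.

```python
# Counts copied from the citation-role summary on this page.
# Assumption: these 26 citations are the "classified" subset of the 104 citing papers.
role_counts = {"background": 23, "baseline": 2, "method": 1}


def role_share(counts, role):
    """Return the percentage of classified citations that carry the given role."""
    total = sum(counts.values())
    return 100.0 * counts[role] / total


total_classified = sum(role_counts.values())
background_pct = role_share(role_counts, "background")

# Print the share with one decimal so the rounding to a whole percent is visible.
print(f"background: {background_pct:.1f}% of {total_classified} classified citations")
```

Under this assumption the background share rounds to the whole-number percentage shown in the page header.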

citation-polarity summary

background 23 · baseline 2 · use method 1

claims ledger

method backbone by representing actions as another language, and training action text tokens together with vision-language data. metadata) per embodiment (Fig. 2(b)), where Franka dominates. Fig. 2(c) shows the breakdown of trajectories per embodiment. To further analyze the diversity, we use the language annotations present in our data. We use the PaLM language model [3] to extract objects and behaviors from the instructions. Fig. 2(d,e) show the diversity of skills and objects. While most skills be
background For instance, Flan-PaLM-540B, which is instruction-finetuned on 1.8K tasks, outperforms PaLM-540B by a large margin (+9.4% on average). The finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks, as illustrated in Fig 14. Fig. 14: Flan-PaLM finetuning consist of 473 datasets in above task categories. Courtesy of [74]. PaLM-2 [75] is a more compute-efficient LLM with better multilingual and reasoning capabilities, compared to its predecessor PaLM. PaLM-2 is traine
baseline 0
Prometheus-8x7b-v2 [353] · Mixtral-8x7B-Instruct [319] · 93.0 · 47.1 · 80.5 · 77.4 · 74.5
Critic-RM-Rank [991] · Llama-3.1-70B-Instruct [168] · 97.0 · 58.0 · 84.0 · 92.0 · 82.8
RM [689] · Llama-3.1-70B-Instruct [168] · 98.3 · 74.5 · 83.8 · 88.0 · 86.4
SynRM [968] · Llama-3.1-70B-Instruct [168] · 97.5 · 76.8 · 86.3 · 88.5 · 87.3
CLoud [17] · Llama-3-70B-Instruct [168] · 98.0 · 75.6 · 87.6 · 89.0 · 87.6
FLAMe-RM-24B [753] · PaLM-2-24B [16] · 92.2 · 75.7 · 89.6 · 93.8 · 87.8
SteerLM-RM 70B [829] · Llama-2-70B-chat [743] · 91.3 · 80.3 · 90.6 · 92.8 · 88.8
Llama-3-OffsetBias-RM-8B [585] · Llama-3-
background standing of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI. 14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [121]. With the advent of Transformer, OpenAI developed two initial GPT models, namely GPT-1 [122] and GPT-2 [26], which can be considered as the foundation to more powerful models

authors

Alexandre Passos, Andrew M. Dai, Dmitry Lepikhin, Melvin Johnson, Orhan Firat, Rohan Anil

representative citing papers

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Logic-Regularized Verifier Elicits Reasoning from LLMs

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.

Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.

E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

cs.CR · 2026-05-01 · unverdicted · novelty 7.0

E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-based attacks.

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

cs.AI · 2026-04-23 · conditional · novelty 7.0

Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.

RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

cs.CL · 2026-03-11 · unverdicted · novelty 7.0

PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.

M$^3$KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

cs.CL · 2025-12-23 · unverdicted · novelty 7.0

M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

cs.LG · 2025-05-25 · unverdicted · novelty 7.0

ActiveDPO is a theoretically grounded active data selection method for sample-efficient LLM alignment that parameterizes the reward model directly with the LLM being aligned.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

cs.LG · 2024-06-06 · conditional · novelty 7.0

Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

cs.RO · 2023-10-13 · unverdicted · novelty 7.0

A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

cs.CL · 2023-10-10 · conditional · novelty 7.0

Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

Large Language Models as Optimizers

cs.LG · 2023-09-07 · unverdicted · novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.

Large Language Models for Multi-Lingual Equivalent Mutant Detection: An Extended Empirical Study

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

LLM-based methods achieve higher F1-scores than traditional approaches for equivalent mutant detection in Java and C, with fine-tuned code embeddings performing best and showing cross-lingual generalization.

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

A unified detection and unlearning framework identifies and mitigates data poisoning in summarization models, achieving 85-92% detection and up to 96% behavior restoration across multiple architectures.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

citing papers explorer

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
cs.AI · 2026-04-23 · conditional · none · ref 2 · internal anchor
Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
cs.AI · 2024-07-01 · accept · none · ref 7 · internal anchor
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Learning Interactive Real-World Simulators
cs.AI · 2023-10-09 · conditional · none · ref 4 · internal anchor
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Capabilities of Gemini Models in Medicine
cs.AI · 2024-04-29 · unverdicted · none · ref 4 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
cs.AI · 2023-12-14 · conditional · none · ref 51 · internal anchor
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
cs.AI · 2023-09-19 · unverdicted · none · ref 1 · internal anchor
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI · 2025-03-12 · unverdicted · none · ref 16 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
cs.AI · 2025-02-19 · unverdicted · none · ref 14 · internal anchor
ROE framework lets LLM defeat Very Hard bot in TextStarCraft II via keyframe selection, expert/self-experience decisions, and post-game reflection for new self-experience.
Data-driven Machine Learning Cannot Reach Symbolic-level Logical Reasoning -- The Limit of the Scaling Law
cs.AI · 2026-06-24 · unverdicted · none · ref 35 · internal anchor
Supervised deep learning cannot reach symbolic-level syllogistic reasoning due to indistinguishable training data across 24 valid types and contradictory training targets in end-to-end premise-to-conclusion mapping.
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
cs.AI · 2024-08-23 · unverdicted · none · ref 21 · internal anchor
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
