MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
PaLM 2 Technical Report
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
claims ledger
- method backbone by representing actions as another language, and training action text tokens together with vision-language data. metadata) per embodiment (Fig. 2(b)), where Franka dominates. Fig. 2(c) shows the breakdown of trajectories per embodiment. To further analyze the diversity, we use the language annotations present in our data. We use the PaLM language model [3] to extract objects and behaviors from the instructions. Fig. 2(d,e) show the diversity of skills and objects. While most skills be
- background For instance, Flan-PaLM-540B, which is instruction-finetuned on 1.8K tasks, outperforms PaLM-540B by a large margin (+9.4% on average). The finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks, as illustrated in Fig 14. Fig. 14: Flan-PaLM finetuning consist of 473 datasets in above task categories. Courtesy of [74]. PaLM-2 [75] is a more compute-efficient LLM with better multilingual and reasoning capabilities, compared to its predecessor PaLM. PaLM-2 is traine
- baseline 0; Prometheus-8x7b-v2 [353] Mixtral-8x7B-Instruct [319] 93.0 47.1 80.5 77.4 74.5; Critic-RM-Rank [991] Llama-3.1-70B-Instruct [168] 97.0 58.0 84.0 92.0 82.8; RM [689] Llama-3.1-70B-Instruct [168] 98.3 74.5 83.8 88.0 86.4; SynRM [968] Llama-3.1-70B-Instruct [168] 97.5 76.8 86.3 88.5 87.3; CLoud [17] Llama-3-70B-Instruct [168] 98.0 75.6 87.6 89.0 87.6; FLAMe-RM-24B [753] PaLM-2-24B [16] 92.2 75.7 89.6 93.8 87.8; SteerLM-RM 70B [829] Llama-2-70B-chat [743] 91.3 80.3 90.6 92.8 88.8; Llama-3-OffsetBias-RM-8B [585] Llama-3-
- background standing of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI. 14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai models was already explored in the early days of OpenAI, while it was attempted with recurrent neural networks (RNN) [121]. With the advent of Transformer, OpenAI developed two initial GPT models, namely GPT-1 [122] and GPT-2 [26], which can be considered as the foundation to more powerful models
representative citing papers
BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-based attacks.
Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
M³KG-RAG improves multimodal reasoning in large language models by constructing multi-hop knowledge graphs and selectively pruning retrieved context with GRASP.
ActiveDPO is a theoretically grounded active data selection method for sample-efficient LLM alignment that parameterizes the reward model directly with the LLM being aligned.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
Introduces BanglaMedVQA dataset of clinically validated image-question-answer pairs and benchmarks foundation models, finding substantially lower performance than on English MedVQA especially on diagnostic questions.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Steer-to-Detect learns a steering vector injected into LLM hidden states to boost class separability and applies hypothesis testing with finite-sample Type I/II error guarantees for generated-text detection.
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
citing papers explorer
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.