Recognition: 2 theorem links · Lean Theorem
PaLM 2 Technical Report
Pith reviewed 2026-05-12 11:54 UTC · model grok-4.3
The pith
PaLM 2 raises quality on English, multilingual, and reasoning tasks while cutting inference time and compute compared to PaLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaLM 2 is a new family of language models that, across sizes, produces measurably higher accuracy on downstream English and multilingual tasks and on reasoning suites such as BIG-Bench, while requiring less compute per token at inference time than the original PaLM.
What carries the argument
Mixture-of-objectives training on a Transformer backbone that jointly optimizes for language modeling, translation, and reasoning signals.
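The report does not disclose the components or weights of this mixture, so any concrete rendering is speculative. A minimal sketch of how a weighted multi-objective loss can be combined, with the objective names taken from the summary above and the weights invented purely for illustration:

```python
# Hypothetical illustration of a mixture-of-objectives training loss.
# Objective names follow the summary above; the weights are invented for
# illustration and are NOT the values used for PaLM 2 (those are undisclosed).

def mixture_loss(component_losses: dict, weights: dict) -> float:
    """Combine per-objective losses into one scalar, normalized by total weight."""
    assert set(component_losses) == set(weights), "every objective needs a weight"
    total = sum(weights.values())
    return sum(weights[k] * component_losses[k] for k in component_losses) / total

losses  = {"language_modeling": 2.31, "translation": 1.64, "reasoning": 2.05}
weights = {"language_modeling": 0.5, "translation": 0.3, "reasoning": 0.2}
print(round(mixture_loss(losses, weights), 3))  # weighted average of the components
```

The mixture weights are exactly the kind of free parameter flagged in the ledger further down.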
If this is right
- Large gains on BIG-Bench and other reasoning benchmarks hold across model sizes.
- Faster inference enables more natural, lower-latency user interactions.
- Lower compute per token supports broader deployment of the models.
- Performance on responsible-AI evaluations stays stable while allowing inference-time toxicity control.
- The same efficiency pattern appears in both pre-trained and fine-tuned variants.
Where Pith is reading between the lines
- The efficiency pattern could lower the energy cost of running large models at scale.
- Similar training mixtures might be tested on non-Transformer architectures to check whether the gains are architecture-specific.
- If the multilingual improvements generalize, they could reduce the need for separate language-specific models.
Load-bearing premise
The chosen English, multilingual, and reasoning benchmarks plus the responsible-AI tests fully represent real-world use without undisclosed data filtering or post-training adjustments.
What would settle it
Running PaLM 2 and PaLM on a fresh set of tasks and hardware never seen during their development and finding no consistent quality or speed advantage for PaLM 2.
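One concrete form such a test could take: score both models on the same held-out task set and check whether the per-task differences consistently favor PaLM 2. The sketch below uses placeholder task names and scores, not results from the report, and a simple bootstrap over task-level differences:

```python
# Placeholder falsification check: compare per-task scores of the two models on
# tasks neither saw during development. All task names and scores are invented.
import random

palm_scores  = {"task_a": 0.61, "task_b": 0.55, "task_c": 0.70, "task_d": 0.48}
palm2_scores = {"task_a": 0.66, "task_b": 0.59, "task_c": 0.69, "task_d": 0.57}

diffs = [palm2_scores[t] - palm_scores[t] for t in sorted(palm_scores)]

def bootstrap_win_rate(diffs, n_resamples=10_000, seed=0):
    """Fraction of resamples (over tasks) whose mean difference favors PaLM 2."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        wins += sum(sample) / len(sample) > 0
    return wins / n_resamples

print(f"mean difference: {sum(diffs) / len(diffs):+.3f}")
print(f"bootstrap win rate for PaLM 2: {bootstrap_win_rate(diffs):.1%}")
# A win rate near 50% on genuinely unseen tasks and hardware would count against
# a consistent PaLM 2 advantage; a rate near 100% would support it.
```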
original abstract
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
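The abstract's "inference-time control over toxicity without additional overhead" is the kind of behavior usually obtained by conditioning generation on a signal the model already learned during training, for example a prepended control tag. The abstract does not describe the mechanism, so the sketch below illustrates that general idea with hypothetical names; it is not presented as PaLM 2's implementation.

```python
# Illustration of inference-time control via a prepended control tag.
# The tag string and generate() stub are hypothetical; this is not a description
# of PaLM 2's actual mechanism, which the abstract does not specify.

LOW_TOXICITY_TAG = "<low_toxicity>"  # a tag a model could have seen during training

def build_prompt(user_prompt, control=None):
    """Prepend an optional control tag; costs no extra forward passes at inference."""
    return f"{control} {user_prompt}" if control else user_prompt

def generate(prompt):
    # Stand-in for a real decoding call. A model trained with tagged examples
    # would shift its output distribution when the tag is present.
    return f"[model output conditioned on: {prompt!r}]"

print(generate(build_prompt("Summarize this forum thread.")))
print(generate(build_prompt("Summarize this forum thread.", LOW_TOXICITY_TAG)))
```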
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PaLM 2, a Transformer-based language model trained using a mixture of objectives. It claims superior multilingual and reasoning capabilities, greater compute efficiency, and faster inference relative to PaLM, supported by extensive evaluations showing significantly improved quality on English, multilingual, and reasoning benchmarks (including large gains on BIG-Bench) across model sizes, plus stable performance on responsible-AI evaluations and inference-time toxicity control.
Significance. If the performance gains are genuine and stem from the mixture-of-objectives training rather than data overlap or undisclosed adjustments, the work advances understanding of efficient scaling for large language models and demonstrates practical benefits for deployment. The broad evaluation suite covering reasoning, multilingual, and responsible-AI tasks is a strength, though the high-level reporting limits replicability.
major comments (2)
- [Evaluations and Training sections] The manuscript provides no description of training data sources, decontamination procedures, or explicit confirmation that benchmark test sets (e.g., BIG-Bench) were excluded from the pretraining mixture. This is load-bearing for the central claim of 'significantly improved quality on downstream tasks' and 'large improvements over PaLM on BIG-Bench' because gains could arise from data contamination rather than the new training approach.
- [Abstract and Efficiency discussion] Quantitative details on inference efficiency (e.g., latency, throughput, or FLOPs comparisons to PaLM) and the specific mixture weights or model-size variants are absent from the high-level descriptions. These omissions undermine evaluation of the 'faster and more efficient inference' and 'more compute-efficient' claims, which are central to the contribution.
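To make the requested comparison concrete: a minimal harness that measures per-token latency and throughput for any model exposed behind a common generate() interface. The model stub, prompts, and timings below are placeholders, not measurements from the report.

```python
# Placeholder harness for the latency/throughput comparison requested above.
# The dummy model, prompts, and timings are invented; no number here comes from
# the report. Running the same harness for PaLM and PaLM 2 on identical hardware
# and identical prompts is what would substantiate the efficiency claim.
import time

def measure(generate, prompts, max_new_tokens=128):
    """Return (seconds per generated token, generated tokens per second)."""
    start = time.perf_counter()
    total_tokens = 0
    for prompt in prompts:
        total_tokens += len(generate(prompt, max_new_tokens))
    elapsed = time.perf_counter() - start
    return elapsed / total_tokens, total_tokens / elapsed

def dummy_generate(prompt, max_new_tokens):
    time.sleep(0.001 * max_new_tokens)      # pretend decoding cost
    return list(range(max_new_tokens))      # pretend token ids

prompts = ["Translate to French: hello", "2 + 2 = ?", "Summarize: ..."]
sec_per_tok, tok_per_sec = measure(dummy_generate, prompts)
print(f"{sec_per_tok * 1000:.2f} ms/token, {tok_per_sec:.1f} tokens/s")
```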
minor comments (1)
- [Abstract] The distinction between pre-trained models, fine-tuned variants, and user-facing products is noted but could be clarified with explicit mapping of which reported results apply to base models versus products.
Simulated Author's Rebuttal
We thank the referee for their detailed review and valuable suggestions. We address the major comments below and have updated the manuscript accordingly where feasible.
point-by-point responses
-
Referee: [Evaluations and Training sections] The manuscript provides no description of training data sources, decontamination procedures, or explicit confirmation that benchmark test sets (e.g., BIG-Bench) were excluded from the pretraining mixture. This is load-bearing for the central claim of 'significantly improved quality on downstream tasks' and 'large improvements over PaLM on BIG-Bench' because gains could arise from data contamination rather than the new training approach.
Authors: We appreciate this important point. Due to the proprietary nature of the training data, we are unable to provide a full description of the data sources. However, we confirm that the pretraining mixture was carefully curated to exclude evaluation benchmarks, including those in BIG-Bench, using standard decontamination techniques. We have added a clarification in the Training section of the revised manuscript to explicitly state that benchmark test sets were not included in pretraining. This addresses the concern regarding potential data contamination. revision: partial
-
Referee: [Abstract and Efficiency discussion] Quantitative details on inference efficiency (e.g., latency, throughput, or FLOPs comparisons to PaLM) and the specific mixture weights or model-size variants are absent from the high-level descriptions. These omissions undermine evaluation of the 'faster and more efficient inference' and 'more compute-efficient' claims, which are central to the contribution.
Authors: We agree that providing more quantitative details would strengthen the manuscript. In the revised version, we have included specific comparisons of inference latency and throughput for PaLM 2 versus PaLM, along with details on the mixture-of-objectives weights and the different model size variants used in our experiments. These additions are now present in the Efficiency discussion section. revision: yes
- Not addressed: full disclosure of training data sources and exact compositions, which remain proprietary.
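For reference, the "standard decontamination techniques" the rebuttal appeals to usually amount to an n-gram overlap check between pretraining documents and benchmark test items. The report does not spell out its exact procedure, so the sketch below is an assumption about what such a check typically looks like, with placeholder data:

```python
# Hypothetical n-gram decontamination check; the report does not spell out its
# actual procedure. A training document is flagged if it shares any 8-gram with
# a benchmark test item. The documents below are placeholders.

def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc, test_items, n=8):
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in test_items)

benchmark = ["Which of the following best explains why the sky appears blue during the day?"]
doc_clean = "A discussion of Rayleigh scattering and atmospheric optics in general terms."
doc_leaky = ("Quiz answer key: which of the following best explains "
             "why the sky appears blue during the day?")

print(is_contaminated(doc_clean, benchmark))  # False -> keep in pretraining
print(is_contaminated(doc_leaky, benchmark))  # True  -> exclude from pretraining
```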
Circularity Check
No circularity: empirical results on external benchmarks
full rationale
The PaLM 2 technical report presents training details and measured performance on public external benchmarks (BIG-Bench, English/multilingual/reasoning suites). No load-bearing step reduces a claimed prediction or first-principles result to a quantity defined by the authors' own fitted parameters, self-citations, or ansatz. Distinctions between pre-trained models, fine-tuned variants, and user-facing products are explicit and do not create self-definition. Central claims rest on independent evaluation outcomes rather than internal re-labeling of inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- training objective mixture weights
- model size variants
axioms (1)
- domain assumption: Standard scaling assumptions in large language model training hold for the new mixture of objectives.
Forward citations
Cited by 37 Pith papers
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
Logic-Regularized Verifier Elicits Reasoning from LLMs
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
-
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
-
E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-bas...
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve evolves white-box inventory policies from LLMs with statistical safety guarantees and outperforms classical and deep learning methods on synthetic and real retail data.
-
To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.
-
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
-
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.
-
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
InvEvolve uses LLMs and RL to generate certified inventory policies that outperform classical and deep learning methods on synthetic and real data while providing multi-period performance guarantees.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Corrective Retrieval Augmented Generation
CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generati...
-
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
-
Large Language Models Cannot Self-Correct Reasoning Yet
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Llama Guard is an instruction-tuned Llama2-7b model that performs multi-class safety classification on prompts and responses, matching or exceeding existing moderation tools on benchmarks while supporting taxonomy cus...
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Persistent anti-muslim bias in large language models
Abid, A., Farooqi, M., and Zou, J. Persistent anti-muslim bias in large language models. arXiv preprint arXiv:2101.05783, 2021. URL https://arxiv.org/abs/2101.05783
-
[2]
Akhbardeh, F., Arkhangorodsky, A., Biesialska, M., Bojar, O., Chatterjee, R., Chaudhary, V., Costa-jussa, M. R., España-Bonet, C., Fan, A., Federmann, C., Freitag, M., Graham, Y., Grundkiewicz, R., Haddow, B., Harter, L., Heafield, K., Homan, C., Huck, M., Amponsah-Kaakyire, K., Kasai, J., Khashabi, D., Knight, K., Kocmi, T., Koehn, P., Lourie, N., Mo...
work page 2021
-
[3]
Appen. Guide to fair pay, 2023. URL https://success.appen.com/hc/en-us/articles/9557008940941-Guide-to-Fair-Pay
-
[5]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
Bapna, A., Caswell, I., Kreutzer, J., Firat, O., van Esch, D., Siddhant, A., Niu, M., Baljekar, P., Garcia, X., Macherey, W., Breiner, T., Axelrod, V., Riesa, J., Cao, Y., Chen, M. X., Macherey, K., Krikun, M., Wang, P., Gutkin, A., Shah, A., Huang, Y., Chen, Z., Wu, Y., and Hughes, M. Building machine translation systems for the next thousand languages. ...
-
[7]
Pathways: Asynchronous distributed dataflow for ml
Barham, P., Chowdhery, A., Dean, J., Ghemawat, S., Hand, S., Hurt, D., Isard, M., Lim, H., Pang, R., Roy, S., et al. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems, 4:430--449, 2022
work page 2022
-
[8]
Fairness and machine learning limitations and opportunities
Barocas, S., Hardt, M., and Narayanan, A. Fairness and machine learning limitations and opportunities. 2017
work page 2017
-
[9]
Barocas, S., Guo, A., Kamar, E., Krones, J., Morris, M. R., Vaughan, J. W., Wadsworth, W. D., and Wallach, H. Designing disaggregated evaluations of ai systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES '21, pp.\ 368–378, New York, NY, USA, 2021. Association for Computing Machin...
-
[10]
Bender, E. M. and Friedman, B. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604, 2018. doi:10.1162/tacl_a_00041. URL https://aclanthology.org/Q18-1041
-
[11]
Semantic parsing on Freebase from question-answer pairs
Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533--1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160
work page 2013
-
[12]
Re-contextualizing fairness in NLP: The case of India
Bhatt, S., Dev, S., Talukdar, P., Dave, S., and Prabhakaran, V. Re-contextualizing fairness in NLP: The case of India. September 2022. URL https://arxiv.org/abs/2209.12226
-
[13]
Piqa: Reasoning about physical commonsense in natural language
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020
work page 2020
-
[14]
Language (technology) is power: A critical survey of "bias" in NLP
Blodgett, S. L., Barocas, S., Daumé, III, H., and Wallach, H. Language (technology) is power: A critical survey of "bias" in NLP. May 2020. URL https://arxiv.org/abs/2005.14050
-
[15]
Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets
Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1004--...
-
[16]
Nuanced metrics for measuring unintended bias with real data for text classification
Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. Nuanced metrics for measuring unintended bias with real data for text classification, 2019. URL https://arxiv.org/abs/1903.04561
-
[17]
Bowman, S. R. and Dahl, G. E. What will it take to fix benchmarking in natural language understanding?, 2021
work page 2021
-
[18]
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax
work page 2018
-
[19]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...
work page 2020
-
[20]
The secret sharer: Evaluating and testing unintended memorization in neural networks
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, volume 267, 2019
work page 2019
-
[21]
Extracting training data from large language models
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021
work page 2021
-
[23]
Casad, B. J., Hale, P., and Wachs, F. L. Stereotype threat among girls: Differences by gender identity and math education context, 2017
work page 2017
-
[24]
Question directed graph attention network for numerical reasoning over text
Chen, K., Xu, W., Cheng, X., Xiaochuan, Z., Zhang, Y., Song, L., Wang, T., Qi, Y., and Chu, W. Question directed graph attention network for numerical reasoning over text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 6759--6768, Online, November 2020. Association for Computational Linguistics. doi...
-
[26]
PaLM: Scaling Language Modeling with Pathways
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022. URL https://arxiv.org/abs/2204.02311
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
Scaling Instruction-Finetuned Language Models
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts,...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages
Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. TACL, 2020. URL https://aclanthology.org/2020.tacl-1.30
work page 2020
-
[29]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[31]
Crenshaw, K. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics, 1989
work page 1989
-
[32]
Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf
work page 2015
-
[33]
Daniels, P. T. and Bright, W. The world's writing systems. Oxford University Press on Demand, 1996
work page 1996
-
[34]
Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., and Scheuerman, M. K. Bringing the people back in: Contesting benchmark machine learning datasets, 2020
work page 2020
-
[35]
Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J. M., and Chang, K.-W. Harms of gender exclusivity and challenges in non-binary representation in language technologies, 2021 a . URL https://arxiv.org/abs/2108.12084
-
[36]
On measures of biases and harms in NLP
Dev, S., Sheng, E., Zhao, J., Amstutz, A., Sun, J., Hou, Y., Sanseverino, M., Kim, J., Nishi, A., Peng, N., and Chang, K.-W. On measures of biases and harms in NLP . August 2021 b . URL https://arxiv.org/abs/2108.03362
-
[37]
BERT: Pre-training of deep bidirectional transformers for language understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019. URL https://aclanthology.org/N19-1423
work page 2019
-
[38]
Diaz, M., Kivlichan, I. D., Rosen, R., Baker, D. K., Amironesei, R., Prabhakaran, V., and Denton, E. CrowdWorkSheets : Accounting for individual and collective identities underlying crowdsourced dataset annotation. June 2022. URL https://arxiv.org/abs/2206.08931
-
[39]
Build it break it fix it for dialogue safety: Robustness from adversarial human attack
Dinan, E., Humeau, S., Chintagunta, B., and Weston, J. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.\ 4537--4546, Hong Kong, China,...
-
[40]
Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021
Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021
work page 2021
-
[41]
GLaM: Efficient scaling of language models with mixture-of-experts
Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., Zoph, B., Fedus, L., Bosma, M., Zhou, Z., Wang, T., Wang, Y. E., Webster, K., Pellat, M., Robinson, K., Meier-Hellstern, K., Duke, T., Dixon, L., Zhang, K., Le, Q. V., Wu, Y., Chen, Z., and Cui, C. GLaM: Efficient Scaling o... arXiv preprint arXiv:2112.06905
-
[42]
DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs
Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2368--237... doi:10.18653/v1/N19-1246
-
[44]
Experts, errors, and context: A large-scale study of human evaluation for machine translation
Freitag, M., Foster, G., Grangier, D., Ratnakar, V., Tan, Q., and Macherey, W. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460--1474, 2021. doi:10.1162/tacl_a_00437. URL https://aclanthology.org/2021.tacl-1.87
-
[45]
Freitag, M., Rei, R., Mathur, N., Lo, C.-k., Stewart, C., Avramidis, E., Kocmi, T., Foster, G., Lavie, A., and Martins, A. F. T. Results of WMT 22 metrics shared task: Stop using BLEU -- neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pp.\ 46--68, Abu Dhabi, United Arab Emirates (Hybrid), D...
work page 2022
-
[46]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., Jones, A., Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Hatfield-Dodds, Z., Henighan, T., Hernandez, D., Hume, T., Jacobson, J., Johnston, S., Kravec, S., Olsson, C., Ringer, S., Tran-J...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Word embeddings quantify 100 years of gender and ethnic stereotypes
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635--E3644, 2018. doi:10.1073/pnas.1720347115. URL https://www.pnas.org/doi/abs/10.1073/pnas.1720347115
-
[48]
Handling bias in toxic speech detection: A survey
Garg, T., Masud, S., Suresh, T., and Chakraborty, T. Handling bias in toxic speech detection: A survey. January 2022. URL https://arxiv.org/abs/2202.00126
-
[49]
Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets, 2021
work page 2021
-
[50]
Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356--3369, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.301. URL https://...
- [52]
-
[53]
Improving alignment of dialogue agents via targeted human judgements
Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., Campbell-Gillingham, L., Uesato, J., Huang, P.-S., Comanescu, R., Yang, F., See, A., Dathathri, S., Greig, R., Chen, C., Fritz, D., Elias, J. S., Green, R., Mokrá, S., Fernando, N., Wu, B., Foley, R., Young, S., Gabriel, I., Is...
work page internal anchor Pith review arXiv 2022
-
[54]
Intrinsic bias metrics do not correlate with application bias
Goldfarb-Tarrant, S., Marchant, R., Muñoz Sánchez, R., Pandya, M., and Lopez, A. Intrinsic bias metrics do not correlate with application bias. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1926--194...
-
[55]
Google. Our principles, 2018. URL https://ai.google/responsibility/principles/. Accessed May 16, 2023
work page 2018
-
[56]
Generative ai prohibited use policy, 2023 a
Google. Generative ai prohibited use policy, 2023 a . URL https://policies.google.com/terms/generative-ai/use-policy. Accessed May 16, 2023
work page 2023
-
[57]
Palm api and makersuite additional terms of service, 2023 b
Google. Palm api and makersuite additional terms of service, 2023 b . URL https://developers.generativeai.google/terms. Accessed May 16, 2023
work page 2023
-
[58]
Is your toxicity my toxicity? Exploring the impact of rater identity on toxicity annotation
Goyal, N., Kivlichan, I., Rosen, R., and Vasserman, L. Is your toxicity my toxicity? Exploring the impact of rater identity on toxicity annotation. May 2022. URL https://arxiv.org/abs/2205.00501
-
[59]
Generating sequences with recurrent neural networks, 2014
Graves, A. Generating sequences with recurrent neural networks, 2014
work page 2014
-
[60]
Towards a critical race methodology in algorithmic fairness
Hanna, A., Denton, E., Smart, A., and Smith-Loud, J. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pp.\ 501–512, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450369367. doi:10.1145/3351095.3372826. URL https://doi.org/10....
-
[61]
XL-Sum: Large-scale multilingual abstractive summarization for 44 languages
Hasan, T., Bhattacharjee, A., Islam, M. S., Mubasshir, K., Li, Y.-F., Kang, Y.-B., Rahman, M. S., and Shahriyar, R. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4693--4703, Online, August 2021. Association for Computational Linguistics. doi...
-
[62]
Women also snowboard: Overcoming bias in captioning models
Hendricks, L. A., Burns, K., Saenko, K., Darrell, T., and Rohrbach, A. Women also snowboard: Overcoming bias in captioning models (extended abstract), 2018
work page 2018
-
[64]
Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735--1780, 11 1997. ISSN 0899-7667. doi:10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735
-
[65]
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., et al. Training compute-optimal large language models. NeurIPS, 2022. URL https://arxiv.org/abs/2203.15556
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[66]
Universal language model fine-tuning for text classification
Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 328--339, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.18653/v1/P18-1031. URL https://aclanthology.org/P18-1031
-
[67]
Hsiao, S. and Collins, E. Try bard and share your feedback. https://blog.google/technology/ai/try-bard/, March 2023. Accessed: 2023-5-5
work page 2023
-
[68]
LoRA: Low-Rank Adaptation of Large Language Models
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA : Low-Rank adaptation of large language models. June 2021. URL https://arxiv.org/abs/2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[70]
Jacobs, A. Z. and Wallach, H. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp.\ 375–385, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi:10.1145/3442188.3445901. URL https://doi.org/10.1145/3442188.3445901
-
[72]
Survey of hallucination in natural language generation
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1--38, mar 2023. doi:10.1145/3571730. URL https://doi.org/10.1145
-
[73]
Toxic comment classification challenge, 2018
Jigsaw. Toxic comment classification challenge, 2018. URL https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
work page 2018
-
[74]
Exploring the role of human raters in creating nlp datasets, 2019 a
Jigsaw. Exploring the role of human raters in creating nlp datasets, 2019 a . URL https://medium.com/jigsaw/creating-labeled-datasets-and-exploring-the-role-of-human-raters-56367b6db298
work page 2019
-
[75]
Jigsaw multilingual toxic comment classification, 2019 b
Jigsaw. Jigsaw multilingual toxic comment classification, 2019 b . URL https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
work page 2019
-
[76]
Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601--1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10....
-
[77]
Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67--78, 2020
work page 2020
-
[79]
The misgendering machines: Trans/hci implications of automatic gender recognition
Keyes, O. The misgendering machines: Trans/hci implications of automatic gender recognition. Proc. ACM Hum.-Comput. Interact., 2(CSCW), nov 2018. doi:10.1145/3274357. URL https://doi.org/10.1145/3274357
-
[80]
Kneser, R. and Ney, H. Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pp.\ 181--184 vol.1, 1995. doi:10.1109/ICASSP.1995.479394
-
[81]
Pretraining language models with human preferences
Korbak, T., Shi, K., Chen, A., Bhalerao, R., Buckley, C. L., Phang, J., Bowman, S. R., and Perez, E. Pretraining language models with human preferences, 2023. URL https://arxiv.org/abs/2302.08582
-
[82]
Quality at a glance: An audit of web-crawled multilingual datasets
Kreutzer, J., Caswell, I., Wang, L., Wahab, A., van Esch, D., Ulzii-Orshikh, N., Tapo, A., Subramani, N., Sokolov, A., Sikasote, C., et al. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50--72, 2022
work page 2022
-
[83]
Natural questions: A benchmark for question answering research
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguist...
-
[84]
Ladhak, F., Durmus, E., Cardie, C., and McKeown, K. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4034--4048, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.360. URL https://aclantholog...
-
[85]
RACE: Large-scale ReAding comprehension dataset from examinations
Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785--794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi:10.18653/v1/D17-1082. URL https://aclanthology...
-
[86]
Lee, C. Welcome, singular "they". https://apastyle.apa.org/blog/singular-they, 2019. Accessed: 2022-11-18
work page 2019
-
[88]
The power of scale for parameter-efficient prompt tuning
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 3045--3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-main.243. URL https:/...
-
[89]
Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012
work page 2012
-
[91]
Holistic Evaluation of Language Models
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[92]
The flan collection: Designing data and methods for effective instruction tuning
Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023
discussion (0)