PaLM: Scaling Language Modeling with Pathways
Pith reviewed 2026-05-10 23:39 UTC · model grok-4.3
The pith
Scaling a language model to 540 billion parameters produces state-of-the-art few-shot results on hundreds of benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a 540-billion parameter densely activated Transformer language model using the Pathways system across multiple TPU pods, the authors demonstrate continued scaling benefits through state-of-the-art few-shot performance on hundreds of benchmarks. The model outperforms the finetuned state of the art on multi-step reasoning tasks and exceeds average human performance on BIG-bench, where a significant number of tasks show discontinuous improvements only at the largest size. PaLM also exhibits strong multilingual and code generation capabilities.
What carries the argument
PaLM, the 540-billion parameter Pathways Language Model, a densely activated Transformer trained efficiently via the Pathways ML system on 6144 TPU v4 chips.
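For a sense of how a dense decoder-only Transformer reaches this parameter regime, a back-of-the-envelope count is sketched below. The layer count, width, and vocabulary size are illustrative placeholders rather than PaLM's reported configuration, and the formula ignores details (attention variant, feed-forward activation, biases) that shift the exact total.

```python
# Back-of-the-envelope parameter count for a dense decoder-only Transformer.
# All hyperparameters here are illustrative placeholders, not PaLM's exact config.

def transformer_params(n_layers: int, d_model: int, vocab_size: int,
                       d_ff_mult: int = 4) -> int:
    attn = 4 * d_model * d_model               # Q, K, V, and output projections
    ffn = 2 * d_model * (d_ff_mult * d_model)  # up- and down-projections
    embed = vocab_size * d_model               # token embeddings
    return n_layers * (attn + ffn) + embed

if __name__ == "__main__":
    # Values chosen only so the total lands near the half-trillion regime.
    total = transformer_params(n_layers=118, d_model=18432, vocab_size=256_000)
    print(f"~{total / 1e9:.0f}B parameters")   # ~486B with these placeholders
```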
If this is right
- Few-shot prompts alone suffice to exceed finetuned systems on multi-step reasoning tasks.
- Average human performance is reached on a broad suite of language tasks without task-specific training.
- Multilingual tasks and source code generation improve alongside English benchmarks as scale increases.
- Some tasks exhibit sharp performance increases only when model size reaches hundreds of billions of parameters (see the detection sketch after this list).
- Analyses of bias, toxicity, and training-data memorization become feasible as a function of model scale, up to the largest models.
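As referenced above, the "sharp increase" claim can be made operational by comparing the largest model's score against a smooth extrapolation from the smaller models. A minimal sketch follows under assumed inputs; the model sizes, accuracies, log-linear baseline, and margin are all illustrative choices, not values from the paper.

```python
import numpy as np

def looks_discontinuous(sizes_b, scores, largest_size_b, largest_score,
                        margin=0.10):
    """Flag a task whose largest-model score beats a log-linear extrapolation
    from the smaller models by more than `margin` (absolute accuracy)."""
    slope, intercept = np.polyfit(np.log10(sizes_b), scores, deg=1)
    predicted = slope * np.log10(largest_size_b) + intercept
    return largest_score - predicted > margin

if __name__ == "__main__":
    # Hypothetical accuracies for 8B and 62B models on one BIG-bench-style
    # task, followed by a hypothetical 540B score well above the trend line.
    print(looks_discontinuous([8, 62], [0.12, 0.15], 540, 0.58))  # True
```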
Where Pith is reading between the lines
- Similar scaling combined with efficient training systems could reduce the data needed for new applications in other modalities.
- Models of this size may enable practical systems that handle varied real-world queries with minimal adaptation.
- The pattern of discontinuous gains suggests that certain capabilities emerge only after crossing specific size thresholds.
- Ongoing scaling will require new methods to manage memorization of training data and unintended biases.
Load-bearing premise
That the observed performance gains from scaling to 540 billion parameters will continue to appear on tasks and data outside the specific benchmarks and training distribution used.
What would settle it
A follow-up experiment that trains a model at or above 540 billion parameters and finds no further gains or discontinuous jumps on BIG-bench tasks, or that matches the reported results without scaling.
Original abstract
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PaLM, a 540-billion parameter densely activated Transformer language model trained on 6144 TPU v4 chips using the Pathways system. It claims continued scaling benefits via state-of-the-art few-shot results across hundreds of language understanding and generation benchmarks, including breakthrough performance that outperforms finetuned SOTA on multi-step reasoning tasks and exceeds average human performance on BIG-bench (with discontinuous jumps on a significant number of tasks). Additional results cover multilingual tasks, code generation, bias/toxicity analysis, and memorization studies as a function of scale.
Significance. If the empirical results hold, the work provides substantial evidence for scaling benefits at the 540B parameter regime, particularly for few-shot reasoning and multilingual capabilities. The inclusion of bias, toxicity, and memorization analyses is a strength that aids responsible assessment of large models. The demonstration of efficient large-scale training via Pathways is a notable engineering contribution.
major comments (2)
- [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions.
- [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.
minor comments (3)
- [Abstract] The abstract states results on 'hundreds of benchmarks' but does not enumerate the exact count or breakdown by category, reducing clarity.
- [Figures] Figure captions and scaling plots would benefit from explicit axis labels for model size and data volume to facilitate direct comparison with prior scaling studies.
- [Memorization analysis] The memorization analysis section could include a direct comparison table against smaller models (e.g., 8B or 62B variants) for quantitative context; a sketch of the underlying measurement follows below.
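The quantity such a table would report is typically a verbatim-continuation rate. A minimal sketch follows, assuming a hypothetical `model_generate` greedy-decoding hook; this mirrors the common extraction-style metric and is not claimed to be the paper's exact protocol.

```python
def memorization_rate(model_generate, training_examples,
                      prefix_len=50, cont_len=50):
    """Fraction of sampled training sequences whose next `cont_len` tokens the
    model reproduces exactly when greedily continuing a `prefix_len` prompt.

    `model_generate(prefix_tokens, max_new_tokens)` is a hypothetical greedy
    decoding hook returning a list of token ids; `training_examples` are
    token-id lists sampled from the training corpus.
    """
    hits = 0
    for tokens in training_examples:
        prefix = tokens[:prefix_len]
        target = tokens[prefix_len:prefix_len + cont_len]
        continuation = model_generate(prefix, max_new_tokens=cont_len)
        hits += continuation[:cont_len] == target
    return hits / len(training_examples)
```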
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work and recommendation for minor revision. We appreciate the constructive feedback on improving the substantiation of our claims and the reproducibility of our experimental setup. We address each major comment below and outline the changes we will make to the manuscript.
Point-by-point responses
-
Referee: [Benchmark results section] The claims of SOTA few-shot performance, breakthrough reasoning results, and outperforming human performance on BIG-bench are presented without reported statistical error bars, multiple evaluation runs, or precise protocol details (e.g., prompt formatting, decoding parameters), which are load-bearing for substantiating the scaling and discontinuous improvement assertions.
Authors: We agree that additional protocol details are necessary to fully substantiate the reported results. In the revised manuscript, we will expand the evaluation sections to provide precise information on prompt formatting (including exact templates and number of shots), decoding parameters (e.g., temperature, top-p, and beam size where applicable), and the standardized evaluation harness used across benchmarks. For BIG-bench, we followed the official few-shot protocol defined by the benchmark. Regarding statistical error bars and multiple runs, the computational cost of full evaluations on the 540B model across hundreds of tasks is extremely high, rendering repeated runs infeasible within our resource constraints. We prioritized comprehensive coverage of tasks over variance estimation. However, we will add notes on prompt sensitivity for key reasoning tasks where we observed consistent gains, and we maintain that the magnitude of the observed improvements (including discontinuous jumps) aligns with prior scaling studies even in the absence of error bars. revision: partial
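To illustrate the kind of protocol detail at issue, a minimal few-shot evaluation sketch follows. The Q:/A: template, delimiter, shot count, and `model_generate` hook are illustrative assumptions, not the harness actually used for PaLM.

```python
def build_few_shot_prompt(exemplars, query, k=5):
    """Assemble a k-shot prompt from (question, answer) pairs.

    The Q:/A: template and blank-line delimiter are illustrative choices; the
    point is that such details must be reported for results to be reproducible.
    """
    shots = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
    return "\n\n".join(shots + [f"Q: {query}\nA:"])

def exact_match_accuracy(model_generate, exemplars, eval_set, k=5):
    """Exact-match accuracy under greedy decoding (temperature 0).

    `model_generate(prompt, max_new_tokens, temperature)` is a hypothetical
    hook for whatever model is being evaluated.
    """
    correct = 0
    for question, gold in eval_set:
        prompt = build_few_shot_prompt(exemplars, question, k)
        answer = model_generate(prompt, max_new_tokens=64, temperature=0.0)
        correct += answer.strip() == gold.strip()
    return correct / len(eval_set)
```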
-
Referee: [Training data and setup] The description of the 780B token training corpus and data filtering/mixture is high-level; this directly impacts reproducibility of the reported scaling observations and assessment of potential contamination effects on the few-shot and BIG-bench results.
Authors: We acknowledge that a more detailed description would aid reproducibility and contamination analysis. We will revise the 'Training Data' section (and associated appendix) to include expanded details on the data mixture ratios, specific sources within each category (web, books, code, multilingual), the quality filtering and deduplication methods applied, and the resulting token counts per category that total 780B tokens. We will also add a subsection discussing our contamination mitigation steps, including n-gram overlap checks against major benchmarks. While the full corpus cannot be released due to its scale and proprietary elements, these additions will provide sufficient information to interpret the scaling results and assess potential data leakage effects. revision: yes
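A minimal sketch of an n-gram overlap check of the kind described in the response, under simplifying assumptions: whitespace tokenization, a fixed n of 13, and a single-overlap threshold are illustrative choices, and the paper's actual contamination analysis may differ.

```python
def ngrams(text, n=13):
    """Word-level n-grams after lowercasing and whitespace tokenization."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_example, training_ngrams, n=13, threshold=1):
    """Flag an evaluation example that shares at least `threshold` n-grams
    with the training corpus; n=13 and threshold=1 are illustrative choices."""
    return len(ngrams(benchmark_example, n) & training_ngrams) >= threshold

if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog " * 3
    print(is_contaminated("quick brown fox jumps over",
                          ngrams(corpus, n=5), n=5))  # True
```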
Circularity Check
No significant circularity; direct empirical scaling results
Full rationale
This is a large-scale empirical study reporting training of a 540B-parameter Transformer on 6144 TPU v4 chips and its few-shot evaluation across hundreds of benchmarks, including BIG-bench and reasoning tasks. No derivations, equations, or first-principles predictions appear; all performance claims rest on the reported experimental measurements rather than any fitted parameter being renamed as a prediction or any self-citation chain substituting for independent evidence. Bias, toxicity, and memorization analyses are likewise direct empirical checks. The central claims therefore remain self-contained experimental observations without reduction to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count
- training data mixture and volume
axioms (2)
- domain assumption: The Transformer architecture remains effective at 540B scale
- domain assumption: Few-shot evaluation on standard benchmarks measures meaningful capability gains
Forward citations
Cited by 60 Pith papers
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
AgentBench: Evaluating LLMs as Agents
AgentBench is a new multi-environment benchmark showing commercial LLMs outperform open-source models up to 70B parameters in agent tasks mainly due to better long-term reasoning and instruction following.
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
-
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs
LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
-
Rates of forgetting for the sequentially Markov coalescent
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
-
A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
RWKV: Reinventing RNNs for the Transformer Era
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
-
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
VALL-E is a neural codec language model trained on 60K hours of speech that performs zero-shot TTS, synthesizing natural speech that matches an unseen speaker's voice, emotion, and environment from a 3-second prompt.
-
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions
MetaColloc meta-learns a universal set of neural basis functions offline so that new PDEs can be solved at test time with a single linear solve instead of per-equation neural-network optimization.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
Repeating high-quality filtered German web data over multiple epochs produces better language models than single-pass training on larger, more diverse but lower-quality sets, even after seven epochs.
-
Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
CoUR uses LLMs for efficient RL reward design through uncertainty quantification and similarity selection, achieving better performance and lower evaluation costs on IsaacGym and Bidexterous Manipulation benchmarks.
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
SeLaR: Selective Latent Reasoning in Large Language Models
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models
GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
-
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
-
Textbooks Are All You Need II: phi-1.5 technical report
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
-
YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
-
Textbooks Are All You Need
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
-
MiniLLM: On-Policy Distillation of Large Language Models
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.