MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs
Pith reviewed 2026-05-22 21:16 UTC · model grok-4.3
The pith
MultiBLiMP 1.0 supplies over 128,000 minimal pairs across 101 languages to test LLMs on subject-verb agreement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MultiBLiMP 1.0 is a massively multilingual benchmark of linguistic minimal pairs covering 101 languages and two types of subject-verb agreement. It contains more than 128,000 minimal pairs created using a fully automated pipeline that leverages the large-scale linguistic resources of Universal Dependencies and UniMorph. The benchmark is designed to evaluate the abilities of large language models at an unprecedented multilingual scale and to highlight the shortcomings of the current state-of-the-art in modelling low-resource languages.
What carries the argument
The fully automated pipeline that draws on Universal Dependencies treebanks and UniMorph morphological resources to construct minimal pairs testing subject-verb agreement.
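To make the idea concrete, here is a minimal sketch of how a minimal pair could be built by swapping a verb for a differently inflected form from its paradigm. It is not the authors' pipeline: the `PARADIGMS` table, the `Token` class, the feature labels, and `make_minimal_pair` are all hypothetical stand-ins for UniMorph-style paradigms and UD-style annotated tokens.

```python
from dataclasses import dataclass

# Toy UniMorph-style paradigm: lemma -> {feature bundle: inflected form}.
# Entries and feature labels are illustrative, not taken from the benchmark.
PARADIGMS = {
    "walk": {("PRS", "3", "SG"): "walks", ("PRS", "3", "PL"): "walk"},
}


@dataclass
class Token:
    form: str
    lemma: str
    feats: tuple  # e.g. ("PRS", "3", "SG"); left empty for non-verbs in this toy


def make_minimal_pair(tokens, verb_idx):
    """Return (grammatical, ungrammatical) strings, or None if no swap exists."""
    verb = tokens[verb_idx]
    paradigm = PARADIGMS.get(verb.lemma)
    if paradigm is None:
        return None  # lemma not covered by the paradigm table
    tense, person, number = verb.feats
    flipped = (tense, person, "PL" if number == "SG" else "SG")
    wrong_form = paradigm.get(flipped)
    # Skip syncretic cells: identical forms would give no contrast.
    if wrong_form is None or wrong_form == verb.form:
        return None
    good = " ".join(t.form for t in tokens)
    bad = " ".join(
        wrong_form if i == verb_idx else t.form for i, t in enumerate(tokens)
    )
    return good, bad


sentence = [
    Token("The", "the", ()),
    Token("dog", "dog", ()),
    Token("walks", "walk", ("PRS", "3", "SG")),
]
pair = make_minimal_pair(sentence, 2)
# pair == ("The dog walks", "The dog walk")
```

The check for identical forms shows one place where syncretism, a mismatch source raised in the referee report below, can silently remove candidate pairs.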
Load-bearing premise
The fully automated pipeline produces minimal pairs that are linguistically valid and representative of the target agreement phenomena across all 101 languages.
What would settle it
A manual review of sampled minimal pairs from several low-resource languages that identifies a high rate of grammatically invalid or non-representative examples.
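As a rough illustration of how such a review could be quantified, the sketch below computes a Wilson score interval for the proportion of invalid pairs in a sample. The function name `wilson_interval` and the counts used are assumptions for illustration only; no such figures appear in the paper.

```python
import math


def wilson_interval(invalid, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = invalid / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical review outcome: 10 invalid pairs found in a sample of 200.
low, high = wilson_interval(10, 200)
```

A narrow interval near zero would support the load-bearing premise for the sampled languages; a high lower bound would undercut it.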
Original abstract
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
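Minimal-pair benchmarks are conventionally scored by checking whether a model assigns higher probability to the grammatical sentence than to its ungrammatical counterpart. The sketch below assumes that convention; the abstract itself does not spell out the scoring rule. The `log_prob` callable and the `TOY_SCORES` table are hypothetical stand-ins for a real language model.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    log_prob: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the model
    scores the grammatical sentence strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to score")
    correct = sum(1 for good, bad in pairs if log_prob(good) > log_prob(bad))
    return correct / len(pairs)


# Stand-in scorer: a lookup table of made-up sentence log-probabilities.
TOY_SCORES = {
    "The dog walks": -10.0,
    "The dog walk": -12.5,
    "The dogs walk": -11.0,
    "The dogs walks": -10.5,  # the toy model gets this pair wrong
}

accuracy = minimal_pair_accuracy(
    [("The dog walks", "The dog walk"), ("The dogs walk", "The dogs walks")],
    TOY_SCORES.__getitem__,
)
# accuracy == 0.5
```

In practice `log_prob` would sum token log-probabilities from the model under test, and accuracy would be reported per language.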
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MultiBLiMP 1.0, a benchmark of over 128,000 minimal pairs targeting two types of subject-verb agreement phenomena across 101 languages. Pairs are generated via a fully automated pipeline that extracts from Universal Dependencies trees and UniMorph paradigms. The work positions the benchmark as a tool to evaluate LLMs at unprecedented multilingual scale and to demonstrate shortcomings of current models on low-resource languages.
Significance. If the generated pairs prove linguistically valid and representative, the scale of the resource (101 languages, >128k pairs) would constitute a substantial advance over existing agreement benchmarks, which are typically English-centric or limited to a handful of languages. The automated construction from public UD and UniMorph resources is a clear strength that enables this coverage without manual annotation per language.
major comments (1)
- [Abstract / pipeline description] The central claim that MultiBLiMP 1.0 provides a reliable benchmark for LLM evaluation rests on the assumption that the automated pipeline produces linguistically valid and representative minimal pairs for subject-verb agreement in all 101 languages. No human validation, inter-annotator agreement scores, per-language error analysis, or quantitative assessment of mismatches (e.g., due to null subjects, syncretism, or incomplete UD coverage) is described, leaving the soundness of the >128k pairs unanchored.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The concern regarding the lack of explicit validation for the generated minimal pairs is well-taken, and we address it directly below while outlining planned revisions.
- Referee: [Abstract / pipeline description] The central claim that MultiBLiMP 1.0 provides a reliable benchmark for LLM evaluation rests on the assumption that the automated pipeline produces linguistically valid and representative minimal pairs for subject-verb agreement in all 101 languages. No human validation, inter-annotator agreement scores, per-language error analysis, or quantitative assessment of mismatches (e.g., due to null subjects, syncretism, or incomplete UD coverage) is described, leaving the soundness of the >128k pairs unanchored.
Authors: We agree that the manuscript would benefit from greater transparency on pipeline soundness. The current version focuses on the scale enabled by fully automated extraction from Universal Dependencies and UniMorph without including human validation or per-language error rates. In revision we will add a new subsection under Methods that (i) quantifies UD coverage and annotation completeness for the targeted agreement phenomena across the 101 languages, (ii) discusses known sources of mismatch such as null subjects and syncretism with concrete examples drawn from the data, and (iii) reports results from a small-scale human validation study (approximately 200 pairs across 10 typologically diverse languages) including inter-annotator agreement. These additions will anchor the benchmark's reliability while preserving the core contribution of automated, large-scale coverage.
Revision: yes
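The rebuttal does not say which inter-annotator agreement statistic would be reported. As one common option for two annotators, the sketch below computes Cohen's kappa over valid/invalid judgements. The function name `cohens_kappa` and the label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgements on four minimal pairs.
annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
# observed = 0.75, expected = 0.5, so kappa = 0.5
```

With more than two annotators per item, a statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual alternative.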
Circularity Check
No circularity: benchmark constructed from independent external resources
The paper presents MultiBLiMP 1.0 as a benchmark of >128k minimal pairs for subject-verb agreement across 101 languages, generated by a fully automated pipeline that extracts from Universal Dependencies trees and UniMorph paradigms. These are pre-existing, publicly maintained linguistic resources independent of the current work. No derivation chain, equations, fitted parameters, or predictions are described that reduce by construction to the paper's own outputs or self-citations. The central claim (evaluation of LLMs at scale and highlighting shortcomings for low-resource languages) rests on the benchmark's construction and downstream experiments rather than any self-referential loop. This matches the default expectation of a non-circular benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Universal Dependencies and UniMorph provide accurate and consistent linguistic annotations suitable for automated minimal-pair extraction in 101 languages.
Forward citations
Cited by 4 Pith papers
- Implicit Representations of Grammaticality in Language Models
  Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.
- Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
  Sparse crosscoders on LLM checkpoint triplets track emergence, maintenance, and discontinuation of linguistic features during pretraining via a new RelIE metric.
- Different types of syntactic agreement recruit the same units within large language models
  Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.
- Multilingual Vision-Language Models, A Survey
  The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-base...