MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Arianna Bisazza; Jaap Jumelet; Joakim Nivre; Leonie Weissweiler

arxiv: 2504.02768 · v4 · submitted 2025-04-03 · 💻 cs.CL

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

Pith reviewed 2026-05-22 21:16 UTC · model grok-4.3

keywords: minimal pairs · multilingual benchmark · subject-verb agreement · large language models · low-resource languages · Universal Dependencies · UniMorph · automated pipeline

The pith

MultiBLiMP 1.0 supplies over 128,000 minimal pairs across 101 languages to test LLMs on subject-verb agreement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MultiBLiMP 1.0, a benchmark of more than 128,000 linguistic minimal pairs spanning 101 languages and two types of subject-verb agreement. The pairs are produced by a fully automated pipeline built on Universal Dependencies and UniMorph. The resource enables large-scale evaluation of how well large language models handle these agreement phenomena, and it highlights where current models fall short on low-resource languages.

Core claim

MultiBLiMP 1.0 is a massively multilingual benchmark of linguistic minimal pairs covering 101 languages and two types of subject-verb agreement. It contains more than 128,000 minimal pairs created using a fully automated pipeline that leverages the large-scale linguistic resources of Universal Dependencies and UniMorph. The benchmark is designed to evaluate the abilities of large language models at an unprecedented multilingual scale and to highlight the shortcomings of the current state-of-the-art in modelling low-resource languages.

What carries the argument

The fully automated pipeline that draws on Universal Dependencies treebanks and UniMorph morphological resources to construct minimal pairs testing subject-verb agreement.
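As a rough illustration of how such a pipeline could work, the sketch below takes a sentence annotated in a UD-like style and swaps the verb for the paradigm cell that differs only in number. The token layout, the `PARADIGM` table, and `make_minimal_pair` are hypothetical stand-ins chosen for this example; they are not the authors' code, and they simplify the real UD and UniMorph formats.

```python
from typing import Optional

# Toy UniMorph-style paradigm: lemma -> {feature bundle -> inflected form}.
# Hypothetical data for illustration only.
PARADIGM = {
    "walk": {("PRS", "3", "SG"): "walks", ("PRS", "3", "PL"): "walk"},
}

# Toy UD-style sentence: form, lemma, dependency relation,
# 1-based head index (0 = root), and a number feature where relevant.
SENTENCE = [
    {"form": "The", "lemma": "the", "deprel": "det", "head": 2, "number": None},
    {"form": "dog", "lemma": "dog", "deprel": "nsubj", "head": 3, "number": "SG"},
    {"form": "walks", "lemma": "walk", "deprel": "root", "head": 0, "number": "SG"},
]


def make_minimal_pair(tokens: list[dict]) -> Optional[tuple[str, str]]:
    """Return (grammatical, ungrammatical), or None if no valid swap exists."""
    for tok in tokens:
        if tok["deprel"] != "nsubj":
            continue
        verb = tokens[tok["head"] - 1]
        # Only use sentences where subject and verb agree in the annotation.
        if verb["number"] is None or verb["number"] != tok["number"]:
            continue
        other = "PL" if verb["number"] == "SG" else "SG"
        cells = PARADIGM.get(verb["lemma"], {})
        wrong = cells.get(("PRS", "3", other))
        # Skip missing cells and syncretic forms (the swap would change nothing).
        if wrong is None or wrong == verb["form"]:
            continue
        good = " ".join(t["form"] for t in tokens)
        bad = " ".join(wrong if t is verb else t["form"] for t in tokens)
        return good, bad
    return None


pair = make_minimal_pair(SENTENCE)
```

The check against syncretic forms matters because a swap that yields an identical string would not produce a contrast at all.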

Load-bearing premise

The fully automated pipeline produces minimal pairs that are linguistically valid and representative of the target agreement phenomena across all 101 languages.

What would settle it

A manual review of sampled minimal pairs from several low-resource languages that identifies a high rate of grammatically invalid or non-representative examples.
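To make "a high rate" concrete, a reviewer could attach a confidence interval to the invalid-pair rate observed in a sample. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; the counts are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(invalid: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = invalid / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical review outcome: 12 invalid pairs found in a sample of 200.
low, high = wilson_interval(invalid=12, n=200)
```

With small per-language samples the interval is wide, which is one reason a per-language error rate needs more than a handful of checked pairs.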

Original abstract

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
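The abstract does not spell out the scoring rule. In BLiMP-style evaluation the usual convention is that a model passes a pair when it assigns a higher probability to the grammatical sentence than to the ungrammatical one, and accuracy is the fraction of pairs passed. The sketch below assumes that convention; the unigram `log_prob` scorer is a hypothetical stand-in for a real language model.

```python
import math
from typing import Callable

# Hypothetical unigram "language model": a stand-in for a real LLM scorer.
UNIGRAM = {"the": 0.3, "dog": 0.2, "dogs": 0.1, "walks": 0.25, "walk": 0.15}


def log_prob(sentence: str) -> float:
    """Sum of unigram log-probabilities; unknown words get a small floor."""
    return sum(math.log(UNIGRAM.get(w.lower(), 1e-6)) for w in sentence.split())


def minimal_pair_accuracy(pairs, scorer: Callable[[str], float]) -> float:
    """Fraction of pairs where the grammatical sentence scores strictly higher."""
    hits = sum(scorer(good) > scorer(bad) for good, bad in pairs)
    return hits / len(pairs)


# Each pair is (grammatical, ungrammatical).
pairs = [
    ("The dog walks", "The dog walk"),
    ("The dogs walk", "The dogs walks"),
]
accuracy = minimal_pair_accuracy(pairs, log_prob)
```

The toy scorer ignores context, so it prefers "walks" in both pairs and gets only one of the two right. A model has to track the subject's number to pass both.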

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MultiBLiMP gives a large automated multilingual minimal-pair set for subject-verb agreement, but its reliability hinges on an unvalidated pipeline.

MultiBLiMP 1.0 delivers over 128,000 minimal pairs for two types of subject-verb agreement across 101 languages. The pairs are generated automatically from Universal Dependencies trees and UniMorph paradigms. That scale is the main new element. Earlier minimal-pair work stayed smaller and covered far fewer languages, so this extends the approach to low-resource settings in a practical way. The automated route keeps the effort manageable and lets the authors reach languages that usually get left out of such tests. The paper also shows some model results that point to weaker performance on low-resource languages, which aligns with what many people already suspect but now has a broader test bed. The construction itself looks clean on paper, with no obvious circularity since it draws from public external resources.

The soft spot is the validation. The abstract and stress-test note give no sign of human review, inter-annotator checks, or per-language error rates. In languages with null subjects, syncretism, or thin UD coverage, the extracted pairs could easily mismatch real agreement patterns. If those mismatches are common, the benchmark numbers on model behavior become harder to interpret. The full paper might contain more checks, but nothing in the provided material anchors the claim that the pairs are linguistically sound everywhere.

This work is aimed at people who evaluate LLMs on grammatical generalization across languages. Anyone building or testing multilingual models could use the released pairs as a starting point, provided they treat the low-resource results with caution. It is worth sending to peer review because the scale is genuinely new and the data release could be useful, even if the authors need to add concrete validation evidence before the benchmark can be treated as reliable.

Referee Report

1 major / 0 minor

Summary. The paper introduces MultiBLiMP 1.0, a benchmark of over 128,000 minimal pairs targeting two types of subject-verb agreement phenomena across 101 languages. Pairs are generated via a fully automated pipeline that extracts from Universal Dependencies trees and UniMorph paradigms. The work positions the benchmark as a tool to evaluate LLMs at unprecedented multilingual scale and to demonstrate shortcomings of current models on low-resource languages.

Significance. If the generated pairs prove linguistically valid and representative, the scale of the resource (101 languages, >128k pairs) would constitute a substantial advance over existing agreement benchmarks, which are typically English-centric or limited to a handful of languages. The automated construction from public UD and UniMorph resources is a clear strength that enables this coverage without manual annotation per language.

major comments (1)

[Abstract / pipeline description] The central claim that MultiBLiMP 1.0 provides a reliable benchmark for LLM evaluation rests on the assumption that the automated pipeline produces linguistically valid and representative minimal pairs for subject-verb agreement in all 101 languages. No human validation, inter-annotator agreement scores, per-language error analysis, or quantitative assessment of mismatches (e.g., due to null subjects, syncretism, or incomplete UD coverage) is described, leaving the soundness of the >128k pairs unanchored.
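One of the mismatch sources named above, null subjects, can be screened for mechanically. The sketch below parses a sentence in CoNLL-U, the ten-column tab-separated format used by Universal Dependencies, and flags number-marked verbs that have no overt `nsubj` dependent. The example sentence and the helper names are illustrative assumptions, and the source does not say whether the authors' pipeline applies a filter of this kind.

```python
# A tiny CoNLL-U fragment (Spanish, pro-drop): "Caminan ." has no overt subject.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = "\n".join([
    "1\tCaminan\tcaminar\tVERB\t_\tNumber=Plur|Person=3\t0\troot\t_\t_",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])


def parse_conllu(block: str) -> list[dict]:
    """Parse one sentence; skips comments, multiword ranges and empty nodes."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # e.g. "1-2" ranges or "1.1" empty nodes
            continue
        feats = {} if cols[5] == "_" else dict(
            f.split("=", 1) for f in cols[5].split("|")
        )
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "upos": cols[3],
            "feats": feats, "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens


def verbs_without_overt_subject(tokens: list[dict]) -> list[str]:
    """Forms of number-marked verbs that have no nsubj dependent."""
    heads_with_subject = {
        t["head"] for t in tokens if t["deprel"].startswith("nsubj")
    }
    return [
        t["form"] for t in tokens
        if t["upos"] == "VERB" and "Number" in t["feats"]
        and t["id"] not in heads_with_subject
    ]


flagged = verbs_without_overt_subject(parse_conllu(CONLLU))
```

Counting how often this fires per language would give one of the quantitative mismatch measures the comment asks for.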

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The concern regarding the lack of explicit validation for the generated minimal pairs is well-taken, and we address it directly below while outlining planned revisions.

Referee: [Abstract / pipeline description] The central claim that MultiBLiMP 1.0 provides a reliable benchmark for LLM evaluation rests on the assumption that the automated pipeline produces linguistically valid and representative minimal pairs for subject-verb agreement in all 101 languages. No human validation, inter-annotator agreement scores, per-language error analysis, or quantitative assessment of mismatches (e.g., due to null subjects, syncretism, or incomplete UD coverage) is described, leaving the soundness of the >128k pairs unanchored.

Authors: We agree that the manuscript would benefit from greater transparency on pipeline soundness. The current version focuses on the scale enabled by fully automated extraction from Universal Dependencies and UniMorph without including human validation or per-language error rates. In revision we will add a new subsection under Methods that (i) quantifies UD coverage and annotation completeness for the targeted agreement phenomena across the 101 languages, (ii) discusses known sources of mismatch such as null subjects and syncretism with concrete examples drawn from the data, and (iii) reports results from a small-scale human validation study (approximately 200 pairs across 10 typologically diverse languages) including inter-annotator agreement. These additions will anchor the benchmark's reliability while preserving the core contribution of automated, large-scale coverage.

revision: yes
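A validation study of the size proposed in the rebuttal (about 200 pairs over 10 languages, so roughly 20 per language) could be set up along the following lines. This is a minimal sketch under assumed names: `stratified_sample` and `cohens_kappa` are written for this example, the annotator labels are invented, and Cohen's kappa is only one of several possible agreement measures.

```python
import random
from collections import Counter


def stratified_sample(pairs_by_lang: dict[str, list], per_lang: int, seed: int = 0):
    """Draw up to `per_lang` pairs from each language, reproducibly."""
    rng = random.Random(seed)
    return {
        lang: rng.sample(pairs, min(per_lang, len(pairs)))
        for lang, pairs in pairs_by_lang.items()
    }


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical judgements from two annotators on ten sampled pairs.
ann_a = ["valid"] * 8 + ["invalid"] * 2
ann_b = ["valid"] * 7 + ["invalid"] * 3
kappa = cohens_kappa(ann_a, ann_b)
```

Sampling a fixed number per language, rather than proportionally to benchmark size, keeps low-resource languages from being swamped by the largest treebanks.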

Circularity Check

0 steps flagged

No circularity: benchmark constructed from independent external resources

The paper presents MultiBLiMP 1.0 as a benchmark of >128k minimal pairs for subject-verb agreement across 101 languages, generated by a fully automated pipeline that extracts from Universal Dependencies trees and UniMorph paradigms. These are pre-existing, publicly maintained linguistic resources independent of the current work. No derivation chain, equations, fitted parameters, or predictions are described that reduce by construction to the paper's own outputs or self-citations. The central claim (evaluation of LLMs at scale and highlighting shortcomings for low-resource languages) rests on the benchmark's construction and downstream experiments rather than any self-referential loop. This matches the default expectation of a non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that UD and UniMorph annotations are sufficiently accurate for automatic minimal-pair generation across 101 languages; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Universal Dependencies and UniMorph provide accurate and consistent linguistic annotations suitable for automated minimal-pair extraction in 101 languages.
The pipeline depends on these resources being reliable, and no additional manual verification steps are described.

pith-pipeline@v0.9.0 · 5621 in / 1135 out tokens · 46710 ms · 2026-05-22T21:16:01.188131+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Implicit Representations of Grammaticality in Language Models
cs.CL 2026-05 unverdicted novelty 7.0
Linear probes on LM hidden states detect grammaticality better than string probabilities, generalize to human benchmarks and other languages, and correlate weakly with likelihood.

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
cs.CL 2025-09 unverdicted novelty 6.0
Sparse crosscoders on LLM checkpoint triplets track emergence, maintenance, and discontinuation of linguistic features during pretraining via a new RelIE metric.

Different types of syntactic agreement recruit the same units within large language models
cs.CL 2025-12 unverdicted novelty 5.0
Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.

Multilingual Vision-Language Models, A Survey
cs.CL 2025-09 accept novelty 3.0
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-base...