MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Pith reviewed 2026-05-13 22:36 UTC · model grok-4.3
The pith
A new benchmark of 1,729 Portuguese math problems shows frontier LLMs lose accuracy on figure-based and open-ended questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Math-PT supplies 1,729 problems in Portuguese from authentic Brazilian and Portuguese sources; when LLMs are tested on it, frontier reasoning models outperform open-weight models on multiple-choice questions while all models show reduced performance on items that contain figures or demand open-ended responses.
What carries the argument
The Math-PT dataset itself, a curated collection of 1,729 native Portuguese mathematical problems used to benchmark LLM reasoning across question formats.
If this is right
- Frontier models can serve as stronger baselines for Portuguese multiple-choice math tasks than open-weight alternatives.
- Visual and generative math reasoning remain weaker points even for top models when the language is not English.
- Releasing the dataset and model outputs allows direct comparison and future fine-tuning targeted at Portuguese math.
Where Pith is reading between the lines
- Multilingual LLM training should incorporate native-language math sources rather than relying on translated English data.
- Similar benchmarks in other languages could reveal whether the observed format gaps are language-specific or general.
- Open-weight models might narrow the performance difference if trained on comparable volumes of Portuguese competition problems.
Load-bearing premise
The 1,729 problems drawn from Portuguese Olympiads and exams form a representative and high-quality sample of mathematical reasoning in European and Brazilian Portuguese.
What would settle it
Re-running the same models on an independent collection of recent Portuguese math problems drawn from different exams or textbooks, and checking whether the relative rankings and format-specific drops persist or change substantially.
read the original abstract
The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance in multiple choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MATH-PT, a dataset of 1,729 mathematical problems in European and Brazilian Portuguese curated from native Olympiads, competitions, and exams. It benchmarks current LLMs and reports that frontier reasoning models outperform open-weight models on multiple-choice questions but exhibit reduced performance on figure-containing and open-ended questions, with the dataset and outputs released to support further research.
Significance. If the performance differences are shown to be robust, the work supplies a valuable non-English math reasoning benchmark that directly addresses linguistic bias in the field. Releasing the full dataset and model outputs strengthens reproducibility and enables targeted follow-up studies on multilingual reasoning.
major comments (2)
- [Abstract and Results] The central claim of decreased performance on figure-containing questions (abstract) is load-bearing but unsupported by details on input formatting: it is unclear whether figures were omitted, replaced by textual descriptions, or provided only to multimodal models. Without this protocol, the observed drop cannot be attributed to reasoning limitations rather than missing visual input.
- [Evaluation] No information is given on the scoring protocol for open-ended questions (exact match, LLM-as-judge, or human verification), inter-annotator agreement, or statistical significance of gaps across question types. These omissions undermine confidence in the reported performance differences.
minor comments (2)
- [Dataset Construction] Provide the exact breakdown of problem sources (Portugal vs. Brazil) and any curation or difficulty calibration steps applied to the 1,729 problems.
- [Model Evaluation] Clarify whether any vision-language models were included in the benchmark and how their results were handled for figure questions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional methodological detail will improve the clarity and interpretability of our results. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
Referee: [Abstract and Results] The central claim of decreased performance on figure-containing questions (abstract) is load-bearing but unsupported by details on input formatting: it is unclear whether figures were omitted, replaced by textual descriptions, or provided only to multimodal models. Without this protocol, the observed drop cannot be attributed to reasoning limitations rather than missing visual input.
Authors: We agree that the input formatting protocol for figure-containing questions must be stated explicitly to support the reported performance differences. The current manuscript does not provide this level of detail. We will add a dedicated paragraph in the Evaluation section describing the protocol: text-only models received the problem text with figures omitted (no textual descriptions were substituted), while multimodal models received the original images. This clarification will allow readers to attribute the observed drop to the lack of visual input for non-multimodal models rather than to reasoning limitations alone. revision: yes
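To make the committed protocol concrete, here is a minimal Python sketch of the branching described above. The `Problem` type and `build_prompt` helper are hypothetical illustrations, not the authors' actual harness.

```python
# Hypothetical sketch of the input-formatting protocol described in the
# response above; all names are illustrative, not the authors' code.
from dataclasses import dataclass, field

@dataclass
class Problem:
    text: str                                    # problem statement (Portuguese)
    figures: list = field(default_factory=list)  # raw image bytes, may be empty

def build_prompt(problem: Problem, multimodal: bool) -> dict:
    """Text-only models see only the statement, with figures silently
    omitted (no textual description substituted); multimodal models
    additionally receive the original images."""
    if multimodal:
        return {"text": problem.text, "images": problem.figures}
    return {"text": problem.text, "images": []}
```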
Referee: [Evaluation] No information is given on the scoring protocol for open-ended questions (exact match, LLM-as-judge, or human verification), inter-annotator agreement, or statistical significance of gaps across question types. These omissions undermine confidence in the reported performance differences.
Authors: We acknowledge that the Evaluation section currently omits these specifics. Open-ended answers were scored via exact match to the ground-truth solutions after normalizing for equivalent LaTeX and symbolic representations; no LLM-as-judge or additional human verification was applied. Inter-annotator agreement is not applicable because scoring was fully automated and deterministic. We will expand the section to document this protocol, report bootstrap confidence intervals or McNemar tests for the performance gaps across question types, and include any relevant statistical details to increase confidence in the differences. revision: yes
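The scoring and significance commitments above are concrete enough to sketch. The following minimal Python illustration is ours, not the authors' code: a sympify-based normalizer stands in for the described LaTeX/symbolic normalization (a robust version would need a real LaTeX parser on top of sympy), and a percentile bootstrap gives a confidence interval for the accuracy gap between two question subsets.

```python
# Hedged sketch of the automated scoring and significance protocol the
# rebuttal commits to; simplified stand-in, not the authors' pipeline.
import random
import sympy

def normalize(answer: str):
    """Canonicalize an answer symbolically when possible, else as text."""
    try:
        return sympy.simplify(sympy.sympify(answer))
    except Exception:                 # unparseable model output
        return answer.strip().lower()

def exact_match(pred: str, gold: str) -> bool:
    """Deterministic exact match after normalization (no LLM-as-judge)."""
    return normalize(pred) == normalize(gold)

def bootstrap_gap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the accuracy gap between two question
    subsets (e.g., multiple-choice vs. figure items); inputs are 0/1 lists."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        acc_a = sum(rng.choice(scores_a) for _ in scores_a) / len(scores_a)
        acc_b = sum(rng.choice(scores_b) for _ in scores_b) / len(scores_b)
        gaps.append(acc_a - acc_b)
    gaps.sort()
    return gaps[int(alpha / 2 * n_boot)], gaps[int((1 - alpha / 2) * n_boot) - 1]
```

A gap whose bootstrap interval excludes zero would support the claimed format-specific drop; a paired design (same model, two conditions on the same items) would instead call for the McNemar test mentioned in the response.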
Circularity Check
No circularity: benchmark dataset release with direct empirical evaluation
full rationale
The paper introduces the Math-PT dataset of 1,729 native Portuguese math problems and reports LLM performance on it. All claims are empirical comparisons (e.g., frontier models vs. open-weight models on multiple-choice vs. figure/open-ended subsets). No equations, fitted parameters, predictions, or self-citations reduce any reported result to prior quantities by construction. The work is self-contained as a data release plus evaluation; the central claims rest on the new dataset itself rather than any imported uniqueness theorem or ansatz.
Reference graph
Works this paper leans on
- [1] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint arXiv:2103.03874.
- [2] Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, and Kai Chen. MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark. Findings of ACL 2024, pages 6884–6915.
- [3] Evaluation of Question Answer Generation for Portuguese: Insights and Datasets. Findings of EMNLP 2024, pages 5315–5327.
- [4] Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning. Proceedings of EMNLP 2020, pages 2362–2376.
- [5] Kimi Team. Kimi K2: Open Agentic Intelligence. Preprint, arXiv:2507.20534.
- [6] Jackson Trager, Francielle Vargas, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Yalda Daryani, Farzan Karimi Malekabadi, and Flor Miriam Plaza-del Arco. MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Multi-hop Hate Speech Explanation. Findings of EMNLP 2025, pages 15709–15740.