Auditing automated research assessment: an interpretable machine learning approach to validate funding criteria
Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3
The pith
Machine learning models predict Brazilian research grant levels with high accuracy, but only a narrow set of criteria actually drives the distinctions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PQ grant levels carry a robust statistical signal that machine learning models can recover at mean AUC scores of 0.96. This signal is carried almost entirely by bibliographic production, graduate-level supervision, and institutional management roles. In contrast, multiple criteria that the regulations explicitly emphasize show no measurable contribution to distinguishing top-tier researchers from others.
What carries the argument
Block-based Boruta feature selection run across multiple machine learning classifiers to quantify the statistical contribution of each operationalized regulatory dimension extracted from CVs and OpenAlex data.
If this is right
- Grant levels contain a structured statistical signal that can be recovered reliably from public data sources.
- Explanatory power concentrates in bibliographic output, graduate supervision, and management roles.
- Several criteria named in the regulations contribute nothing detectable to classification success.
- The practical evaluative signal is substantially narrower than the formal regulatory list.
Where Pith is reading between the lines
- Agencies could simplify official guidelines to match the small set of features that actually separate levels.
- Applicants might focus effort on the predictive activities while de-emphasizing the non-contributory ones.
- The same audit approach could be applied to grant programs in other countries to test for similar mismatches.
- Adding qualitative indicators such as peer review letters might alter which features rise to importance.
Load-bearing premise
The variables pulled from CVs and bibliometric databases accurately and completely represent the regulatory dimensions without major measurement error or selection bias.
What would settle it
Repeating the full pipeline on a fresh cohort of grant applicants from a later cycle and obtaining either AUC scores below 0.85 or a different set of dominant features would show the claimed concentration of explanatory power does not hold.
read the original abstract
This paper empirically examines the practical validity of the official evaluation criteria underpinning the Research Productivity (PQ) Grant framework, as governed by the Brazilian National Council for Scientific and Technological Development (CNPq). By operationalizing regulatory dimensions (including bibliographic output, human resource training, and scientific recognition) as measurable variables extracted from CVs and OpenAlex bibliometric data, we treat policy-defined indicators as testable hypotheses rather than a priori assumptions. Using a block-based adaptation of the Boruta feature selection algorithm across several machine learning classifiers, we evaluate the statistical contribution of each dimension in distinguishing grant levels, with a focus on identifying top-tier (Level 1A) researchers. Our models achieve high predictive performance, with mean AUC scores reaching 0.96, indicating that PQ levels carry a robust and structured statistical signal. However, explanatory power is heavily concentrated within a limited subset of features, specifically bibliographic production, graduate-level supervision and institutional management roles. Conversely, several criteria explicitly emphasized in the regulations demonstrated no detectable statistical contribution to classification outcomes. These findings reveal a potential misalignment between the formal regulatory framework and the effective signals driving evaluation outcomes, suggesting that the practical evaluative signal is substantially more compact than officially stated and providing evidence-based insights for the refinement and transparency of research assessment policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically audits the Brazilian CNPq PQ grant criteria by extracting measurable variables from CVs and OpenAlex data to represent regulatory dimensions such as bibliographic output, human resource training, and scientific recognition. Using a block-based adaptation of the Boruta algorithm with multiple ML classifiers, it reports mean AUC scores up to 0.96 for distinguishing grant levels (focusing on Level 1A), but finds that explanatory power is concentrated in bibliographic production, graduate supervision, and institutional management roles, while other explicitly regulated criteria show no detectable contribution, indicating a potential misalignment between formal policy and effective evaluative signals.
Significance. If the feature operationalization holds, the work offers evidence-based insights for improving transparency in research funding policies and illustrates the value of interpretable ML for auditing automated assessment systems. It merits credit for grounding the analysis in actual regulatory text and real bibliometric/CV data rather than synthetic benchmarks, and for adapting Boruta in a block-wise manner to respect grouped criteria.
major comments (2)
- [Methods (feature extraction)] Methods section on data extraction and variable construction: no validation (e.g., inter-rater reliability, manual audit against original CNPq regulations, or error-rate estimates) is reported for how CV self-reports and OpenAlex entries are mapped to the policy dimensions. This is load-bearing for the central claim of 'no contribution' for non-selected criteria, because higher measurement error or incomplete coverage in those dimensions (as opposed to bibliographic counts) would cause Boruta to drop them even if they matter in the true process.
- [Results (AUC and Boruta outcomes)] Results on model performance and feature selection: the reported mean AUC of 0.96 and the concentration of signal in three feature blocks are presented without accompanying details on sample size, class distribution, cross-validation folds, or handling of missing data. Without these, it is impossible to evaluate whether the high performance and the 'compact signal' conclusion are robust or artifacts of the dataset characteristics.
minor comments (2)
- [Abstract] The abstract states 'several machine learning classifiers' without naming them; the methods section should list the exact algorithms and hyperparameters for reproducibility.
- [Figures/Tables] Figure captions and table legends could more explicitly link back to the regulatory dimensions being tested to aid readers unfamiliar with CNPq criteria.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our analysis on the alignment between CNPq PQ criteria and empirical signals. We address each major comment point by point below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
- Referee: [Methods (feature extraction)] Methods section on data extraction and variable construction: no validation (e.g., inter-rater reliability, manual audit against original CNPq regulations, or error-rate estimates) is reported for how CV self-reports and OpenAlex entries are mapped to the policy dimensions. This is load-bearing for the central claim of 'no contribution' for non-selected criteria, because higher measurement error or incomplete coverage in those dimensions (as opposed to bibliographic counts) would cause Boruta to drop them even if they matter in the true process.
Authors: We agree that the lack of reported validation for the feature mapping process represents a genuine limitation in the current manuscript, particularly since it underpins the interpretation of which criteria contribute to grant level prediction. The mappings were constructed by aligning each regulatory dimension verbatim from the official CNPq resolutions to specific, extractable fields in the Lattes CV platform and OpenAlex (e.g., publication counts for bibliographic output, number of supervised theses for human resource training). Bibliographic features benefit from standardized, machine-readable data, while others rely on structured CV sections. However, no inter-rater reliability or error-rate audit was performed or reported. In the revised manuscript, we will add a dedicated subsection in Methods with the full mapping table, explicit rules for each proxy, and assumptions. We will also conduct and report a manual audit on a random sample of 100 CVs to quantify extraction accuracy and discuss potential differential error rates across blocks. This will allow us to evaluate whether the Boruta results for non-selected criteria could be influenced by measurement issues. revision: yes
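The committed 100-CV audit reduces to a simple proportion estimate with an uncertainty interval. A minimal sketch, where the count of correct extractions (93) is an illustrative number, not a reported result:

```python
# Sketch of the proposed manual-audit computation: extraction accuracy on a
# random sample of CVs, with a normal-approximation 95% confidence interval.
# The sample size (100) comes from the rebuttal; the counts are made up.
import math

def audit_accuracy(n_correct, n_sampled, z=1.96):
    """Point estimate and approximate 95% CI for extraction accuracy."""
    p = n_correct / n_sampled
    se = math.sqrt(p * (1 - p) / n_sampled)
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

p, (lo, hi) = audit_accuracy(n_correct=93, n_sampled=100)
print(f"accuracy = {p:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Running the same computation per feature block would expose exactly the differential error rates the referee worries about.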
- Referee: [Results (AUC and Boruta outcomes)] Results on model performance and feature selection: the reported mean AUC of 0.96 and the concentration of signal in three feature blocks are presented without accompanying details on sample size, class distribution, cross-validation folds, or handling of missing data. Without these, it is impossible to evaluate whether the high performance and the 'compact signal' conclusion are robust or artifacts of the dataset characteristics.
Authors: We acknowledge that the results section omits essential details needed to assess robustness, which is a valid criticism. The mean AUC of 0.96 was obtained via block-based Boruta applied to multiple classifiers on the dataset of PQ grant recipients, with the signal concentrating in bibliographic production, supervision, and management blocks. In the revised manuscript, we will expand the Results section to report the exact sample size, the distribution of grant levels (with emphasis on the 1A vs. lower levels split), the cross-validation scheme (including number of folds and any stratification), and the missing data handling approach (e.g., complete-case analysis or imputation for CV fields). We will also add supplementary tables with per-fold AUCs, feature importance stability across runs, and sensitivity checks excluding blocks to confirm the compact signal is not driven by dataset artifacts or imbalance. revision: yes
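The promised per-fold reporting follows a standard pattern; in this sketch the classifier, fold count, and synthetic imbalanced data are assumptions standing in for the paper's setup (the 85/15 split mimics a 1A vs. lower-levels imbalance, not the actual class distribution):

```python
# Sketch of stratified cross-validation with per-fold AUC reporting, the kind
# of detail the revised Results section commits to. All numbers are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy stand-in for the Level 1A vs. lower-levels split.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratio per fold
fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print("per-fold AUCs:", [round(a, 3) for a in fold_aucs])
print(f"mean AUC = {np.mean(fold_aucs):.3f} +/- {np.std(fold_aucs):.3f}")
```

Reporting the fold-level spread alongside the mean is what distinguishes a robust 0.96 from one propped up by a lucky split.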
Circularity Check
No significant circularity in empirical ML auditing of policy criteria
full rationale
The paper extracts features from independent external sources (CVs and OpenAlex bibliometric data) to operationalize regulatory dimensions, then applies Boruta feature selection and classifiers to predict PQ grant levels. The reported AUC of 0.96 and concentration of explanatory power in bibliographic production, supervision, and institutional roles are statistical outcomes of model performance on the data, not reductions to fitted inputs or self-definitions by construction. No equations, ansatzes, or uniqueness theorems are presented that collapse the central claim (misalignment between regulations and effective signals) back to the inputs. The analysis is self-contained against external benchmarks and does not rely on load-bearing self-citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- ML hyperparameters
axioms (1)
- domain assumption: features extracted from CVs and OpenAlex accurately represent the regulatory dimensions
Reference graph
Works this paper leans on
- [1] D. Acuna, S. Allesina, and K. Kording. Future impact: Predicting scientific success. Nature, 489:201–202, 2012.
- [2]
- [3] R. Barata and M. Goldbaum. Perfil dos pesquisadores com bolsa de produtividade em pesquisa do CNPq da área de saúde coletiva. Cadernos de Saúde Pública, 19:1863–1876, 2003.
- [4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. ISSN 1573-0565.
- [5] A. C. Brito, F. N. Silva, and D. R. Amancio. Analyzing the influence of prolific collaborations on authors productivity and visibility. Scientometrics, 128(4):2471–2487, 2023.
- [6] A. C. M. Brito, F. N. Silva, and D. R. Amancio. A complex network approach to political analysis: Application to the Brazilian Chamber of Deputies. PLoS ONE, 15(3):e0229928, 2020.
- [7] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pages 785–794. ACM, 2016.
- [8] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Critérios para avaliação de Bolsas de Produtividade, PQ e DT, nas Chamadas de 2024, 2025 e 2026. Technical report, CNPq, Brasília, Brazil, 2024. URL http://memoria2.cnpq.br/web/guest/chamadas-publicas?p_p_id=resultadospo...
- [9] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Portal de Dados Abertos do CNPq: Open datasets on Brazilian research funding and scholarships, 2025. URL http://dadosabertos.cnpq.br/. Accessed: Nov. 2025.
- [10] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Plataforma Lattes: Researcher resumes database, 2026. URL https://lattes.cnpq.br/. Accessed: Mar. 2026.
- [11] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- [12] M. Couto, F. Carmo, A. Jacob Junior, R. Marcacini, and F. Lobato. Characterization of co-authorship networks of CNPq productivity fellows: an approach based on data science. Pages 113–120, 2024. doi:10.5753/kdmile.2024.244728.
- [13] K. Fioravante, I. M. M. R. Robaina, and N. Almir. As bolsas de produtividade em pesquisa do CNPq: um olhar sobre os pesquisadores nível PQ-2 da área da geografia. Boletim Goiano de Geografia, 43, 2023.
- [14] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63:3–42, 2006.
- [15] D. Hicks, P. Wouters, L. Waltman, S. de Rijcke, and I. Rafols. The Leiden Manifesto for research metrics. Nature, 520:429–431, 2015.
- [16] G. Hooker, L. Mentch, and S. Zhou. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Statistics and Computing, 31(6):82, 2021.
- [17] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17), pages 3149–3157, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- [18] M. Kursa and W. Rudnicki. Feature selection with the Boruta package. Journal of Statistical Software, 36:1–13, 2010.
- [19] J. P. Mena-Chalco. Dataset on Brazilian Curriculum Vitae from the Lattes platform. Unpublished dataset, University of São Paulo (USP), 2025. URL http://vision.ime.usp.br/ jmena/coleta-Lattes-23022025-09032025/. Accessed: Jan. 2026.
- [20] L. Oliveira, N. Santos, and J. B. Rocha. Geosciences of CNPq from research productivity fellows. Anuário do Instituto de Geociências - UFRJ, 39:142, 2016.
- [21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [22] O. Penner, R. K. Pan, A. Petersen, K. Kaski, and S. Fortunato. On the predictability of future impact in science. Scientific Reports, 3:3052, 2013.
- [23] M. Perlin, D. Borenstein, T. Imasato, and M. Reichert. The determinants and impact of research grants: The case of Brazilian productivity scholarships. Journal of Informetrics, 18(4):101563, 2024.
- [24] C. Picinin, L. Pilatti, J. Kovaleski, A. Graeml, and B. Pedroso. Comparison of performance of researchers recipients of CNPq productivity grants in the field of Brazilian production engineering. Scientometrics, 2016. doi:10.1007/s11192-016-2070-7.
- [25] J. Priem, H. Piwowar, and R. Orr. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Preprint arXiv:2205.01833, 2022.
- [26] L. Rodrigues, M. Gouvêa, F. Marques, and S. Mourão. Overview of the scientific production in the pharmacy area in Brazil: profile and productivity of researchers granted with fellowships by the National Council for Scientific and Technological Development. Scientometrics, 110:1157, 2017. doi:10.1007/s11192-016-2210-0.
- [27] M. Silva and J. DeSantana. Analysis of the profile of Brazilian fellowship researchers productivity in physiotherapy: Observational study. Revista Brasileira de Pós-Graduação, 20:1–19, 2025. doi:10.21713/rbpg.v20i41.1888.
- [28] R. Q. Souto, G. da Silva Lacerda, G. M. C. Costa, A. L. Cavalcanti, I. S. X. França, F. S. Sousa, et al. Caracterização dos pesquisadores bolsistas de produtividade do CNPq da área de enfermagem: estudo transversal. Online Brazilian Journal of Nursing, 11(2):261–273, 2012.
- [29] J. A. Tohalino and D. R. Amancio. On predicting research grants productivity via machine learning. Journal of Informetrics, 16(2):101260, 2022.