Auditing automated research assessment: an interpretable machine learning approach to validate funding criteria
Pith reviewed 2026-05-10 15:44 UTC · model grok-4.3
The pith
Machine learning models predict Brazilian research grant levels with high accuracy, but only a narrow set of criteria actually drives the distinctions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PQ grant levels carry a robust statistical signal that machine learning models can recover at mean AUC scores of 0.96. This signal is carried almost entirely by bibliographic production, graduate-level supervision, and institutional management roles. In contrast, multiple criteria that the regulations explicitly emphasize show no measurable contribution to distinguishing top-tier researchers from others.
What carries the argument
Block-based Boruta feature selection run across multiple machine learning classifiers to quantify the statistical contribution of each operationalized regulatory dimension extracted from CVs and OpenAlex data.
If this is right
- Grant levels contain a structured statistical signal that can be recovered reliably from public data sources.
- Explanatory power concentrates in bibliographic output, graduate supervision, and management roles.
- Several criteria named in the regulations contribute nothing detectable to classification success.
- The practical evaluative signal is substantially narrower than the formal regulatory list.
Where Pith is reading between the lines
- Agencies could simplify official guidelines to match the small set of features that actually separate levels.
- Applicants might focus effort on the predictive activities while de-emphasizing the non-contributory ones.
- The same audit approach could be applied to grant programs in other countries to test for similar mismatches.
- Adding qualitative indicators such as peer review letters might alter which features rise to importance.
Load-bearing premise
The variables pulled from CVs and bibliometric databases accurately and completely represent the regulatory dimensions without major measurement error or selection bias.
What would settle it
Repeating the full pipeline on a fresh cohort of grant applicants from a later cycle and obtaining either AUC scores below 0.85 or a different set of dominant features would show the claimed concentration of explanatory power does not hold.
read the original abstract
This paper empirically examines the practical validity of the official evaluation criteria underpinning the Research Productivity (PQ) Grant framework, as governed by the Brazilian National Council for Scientific and Technological Development (CNPq). By operationalizing regulatory dimensions (including bibliographic output, human resource training, and scientific recognition) as measurable variables extracted from CVs and OpenAlex bibliometric data, we treat policy-defined indicators as testable hypotheses rather than a priori assumptions. Using a block-based adaptation of the Boruta feature selection algorithm across several machine learning classifiers, we evaluate the statistical contribution of each dimension in distinguishing grant levels, with a focus on identifying top-tier (Level 1A) researchers. Our models achieve high predictive performance, with mean AUC scores reaching 0.96, indicating that PQ levels carry a robust and structured statistical signal. However, explanatory power is heavily concentrated within a limited subset of features, specifically bibliographic production, graduate-level supervision and institutional management roles. Conversely, several criteria explicitly emphasized in the regulations demonstrated no detectable statistical contribution to classification outcomes. These findings reveal a potential misalignment between the formal regulatory framework and the effective signals driving evaluation outcomes, suggesting that the practical evaluative signal is substantially more compact than officially stated and providing evidence-based insights for the refinement and transparency of research assessment policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically audits the Brazilian CNPq PQ grant criteria by extracting measurable variables from CVs and OpenAlex data to represent regulatory dimensions such as bibliographic output, human resource training, and scientific recognition. Using a block-based adaptation of the Boruta algorithm with multiple ML classifiers, it reports mean AUC scores up to 0.96 for distinguishing grant levels (focusing on Level 1A), but finds that explanatory power is concentrated in bibliographic production, graduate supervision, and institutional management roles, while other explicitly regulated criteria show no detectable contribution, indicating a potential misalignment between formal policy and effective evaluative signals.
Significance. If the feature operationalization holds, the work offers evidence-based insights for improving transparency in research funding policies and illustrates the value of interpretable ML for auditing automated assessment systems. It merits credit for grounding the analysis in actual regulatory text and real bibliometric/CV data rather than synthetic benchmarks, and for adapting Boruta in a block-wise manner to respect grouped criteria.
major comments (2)
- [Methods (feature extraction)] Methods section on data extraction and variable construction: no validation (e.g., inter-rater reliability, manual audit against original CNPq regulations, or error-rate estimates) is reported for how CV self-reports and OpenAlex entries are mapped to the policy dimensions. This is load-bearing for the central claim of 'no contribution' for non-selected criteria, because higher measurement error or incomplete coverage in those dimensions (as opposed to bibliographic counts) would cause Boruta to drop them even if they matter in the true process.
- [Results (AUC and Boruta outcomes)] Results on model performance and feature selection: the reported mean AUC of 0.96 and the concentration of signal in three feature blocks are presented without accompanying details on sample size, class distribution, cross-validation folds, or handling of missing data. Without these, it is impossible to evaluate whether the high performance and the 'compact signal' conclusion are robust or artifacts of the dataset characteristics.
minor comments (2)
- [Abstract] The abstract states 'several machine learning classifiers' without naming them; the methods section should list the exact algorithms and hyperparameters for reproducibility.
- [Figures/Tables] Figure captions and table legends could more explicitly link back to the regulatory dimensions being tested to aid readers unfamiliar with CNPq criteria.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help improve the clarity and rigor of our analysis on the alignment between CNPq PQ criteria and empirical signals. We address each major comment point by point below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
- Referee: [Methods (feature extraction)] Methods section on data extraction and variable construction: no validation (e.g., inter-rater reliability, manual audit against original CNPq regulations, or error-rate estimates) is reported for how CV self-reports and OpenAlex entries are mapped to the policy dimensions. This is load-bearing for the central claim of 'no contribution' for non-selected criteria, because higher measurement error or incomplete coverage in those dimensions (as opposed to bibliographic counts) would cause Boruta to drop them even if they matter in the true process.
Authors: We agree that the lack of reported validation for the feature mapping process represents a genuine limitation in the current manuscript, particularly since it underpins the interpretation of which criteria contribute to grant level prediction. The mappings were constructed by aligning each regulatory dimension verbatim from the official CNPq resolutions to specific, extractable fields in the Lattes CV platform and OpenAlex (e.g., publication counts for bibliographic output, number of supervised theses for human resource training). Bibliographic features benefit from standardized, machine-readable data, while others rely on structured CV sections. However, no inter-rater reliability or error-rate audit was performed or reported. In the revised manuscript, we will add a dedicated subsection in Methods with the full mapping table, explicit rules for each proxy, and assumptions. We will also conduct and report a manual audit on a random sample of 100 CVs to quantify extraction accuracy and discuss potential differential error rates across blocks. This will allow us to evaluate whether the Boruta results for non-selected criteria could be influenced by measurement issues. revision: yes
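The committed 100-CV audit reduces to a simple proportion estimate with an uncertainty interval. A minimal sketch, where the count of correct extractions (93) is an illustrative number, not a reported result:

```python
# Sketch of the proposed manual-audit computation: extraction accuracy on a
# random sample of CVs, with a normal-approximation 95% confidence interval.
# The sample size (100) comes from the rebuttal; the counts are made up.
import math

def audit_accuracy(n_correct, n_sampled, z=1.96):
    """Point estimate and approximate 95% CI for extraction accuracy."""
    p = n_correct / n_sampled
    se = math.sqrt(p * (1 - p) / n_sampled)
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

p, (lo, hi) = audit_accuracy(n_correct=93, n_sampled=100)
print(f"accuracy = {p:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Running the same computation per feature block would expose exactly the differential error rates the referee worries about.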
- Referee: [Results (AUC and Boruta outcomes)] Results on model performance and feature selection: the reported mean AUC of 0.96 and the concentration of signal in three feature blocks are presented without accompanying details on sample size, class distribution, cross-validation folds, or handling of missing data. Without these, it is impossible to evaluate whether the high performance and the 'compact signal' conclusion are robust or artifacts of the dataset characteristics.
Authors: We acknowledge that the results section omits essential details needed to assess robustness, which is a valid criticism. The mean AUC of 0.96 was obtained via block-based Boruta applied to multiple classifiers on the dataset of PQ grant recipients, with the signal concentrating in bibliographic production, supervision, and management blocks. In the revised manuscript, we will expand the Results section to report the exact sample size, the distribution of grant levels (with emphasis on the 1A vs. lower levels split), the cross-validation scheme (including number of folds and any stratification), and the missing data handling approach (e.g., complete-case analysis or imputation for CV fields). We will also add supplementary tables with per-fold AUCs, feature importance stability across runs, and sensitivity checks excluding blocks to confirm the compact signal is not driven by dataset artifacts or imbalance. revision: yes
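The promised per-fold reporting follows a standard pattern; in this sketch the classifier, fold count, and synthetic imbalanced data are assumptions standing in for the paper's setup (the 85/15 split mimics a 1A vs. lower-levels imbalance, not the actual class distribution):

```python
# Sketch of stratified cross-validation with per-fold AUC reporting, the kind
# of detail the revised Results section commits to. All numbers are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy stand-in for the Level 1A vs. lower-levels split.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratio per fold
fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = GradientBoostingClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print("per-fold AUCs:", [round(a, 3) for a in fold_aucs])
print(f"mean AUC = {np.mean(fold_aucs):.3f} +/- {np.std(fold_aucs):.3f}")
```

Reporting the fold-level spread alongside the mean is what distinguishes a robust 0.96 from one propped up by a lucky split.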
Circularity Check
No significant circularity in empirical ML auditing of policy criteria
full rationale
The paper extracts features from independent external sources (CVs and OpenAlex bibliometric data) to operationalize regulatory dimensions, then applies Boruta feature selection and classifiers to predict PQ grant levels. The reported AUC of 0.96 and concentration of explanatory power in bibliographic production, supervision, and institutional roles are statistical outcomes of model performance on the data, not reductions to fitted inputs or self-definitions by construction. No equations, ansatzes, or uniqueness theorems are presented that collapse the central claim (misalignment between regulations and effective signals) back to the inputs. The analysis is self-contained against external benchmarks and does not rely on load-bearing self-citations.
Axiom & Free-Parameter Ledger
free parameters (1)
- ML hyperparameters
axioms (1)
- domain assumption: features extracted from CVs and OpenAlex accurately represent the regulatory dimensions
Reference graph
Works this paper leans on
- [1] D. Acuna, S. Allesina, and K. Kording. Future impact: Predicting scientific success. Nature, 489:201–202, 2012.
- [2]
- [3] R. Barata and M. Goldbaum. Perfil dos pesquisadores com bolsa de produtividade em pesquisa do CNPq da área de saúde coletiva. Cadernos de Saúde Pública, 19:1863–1876, 2003.
- [4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. ISSN 1573-0565.
- [5] A. C. Brito, F. N. Silva, and D. R. Amancio. Analyzing the influence of prolific collaborations on authors productivity and visibility. Scientometrics, 128(4):2471–2487, 2023.
- [6] A. C. M. Brito, F. N. Silva, and D. R. Amancio. A complex network approach to political analysis: Application to the Brazilian Chamber of Deputies. PLoS ONE, 15(3):e0229928, 2020.
- [7] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pages 785–794. ACM, 2016.
- [8] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Critérios para avaliação de Bolsas de Produtividade, PQ e DT, nas Chamadas de 2024, 2025 e 2026. Technical report, CNPq, Brasília, Brazil, 2024. URL http://memoria2.cnpq.br/web/guest/chamadas-publicas?p_p_id=resultadospo...
- [9] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Portal de Dados Abertos do CNPq: Open datasets on Brazilian research funding and scholarships, 2025. URL http://dadosabertos.cnpq.br/. Accessed: Nov. 2025.
- [10] Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq). Plataforma Lattes: Researcher resumes database, 2026. URL https://lattes.cnpq.br/. Accessed: Mar. 2026.
- [11] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- [12] M. Couto, F. Carmo, A. Jacob Junior, R. Marcacini, and F. Lobato. Characterization of co-authorship networks of CNPq productivity fellows: an approach based on data science. Pages 113–120, 2024. doi:10.5753/kdmile.2024.244728.
- [13] K. Fioravante, I. M. M. R. Robaina, and N. Almir. As bolsas de produtividade em pesquisa do CNPq: um olhar sobre os pesquisadores nível PQ-2 da área da geografia. Boletim Goiano de Geografia, 43, 2023.
- [14] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63:3–42, 2006.
- [15] D. Hicks, P. Wouters, L. Waltman, S. de Rijcke, and I. Rafols. The Leiden Manifesto for research metrics. Nature, 520:429–431, 2015.
- [16] G. Hooker, L. Mentch, and S. Zhou. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Statistics and Computing, 31(6):82, 2021.
- [17] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17), pages 3149–3157, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
- [18] M. Kursa and W. Rudnicki. Feature selection with the Boruta package. Journal of Statistical Software, 36:1–13, 2010.
- [19] J. P. Mena-Chalco. Dataset on Brazilian Curriculum Vitae from the Lattes platform. Unpublished dataset, University of São Paulo (USP), 2025. URL http://vision.ime.usp.br/ jmena/coleta-Lattes-23022025-09032025/. Accessed: Jan. 2026.
- [20] L. Oliveira, N. Santos, and J. B. Rocha. Geosciences of CNPq from research productivity fellows. Anuário do Instituto de Geociências - UFRJ, 39:142, 2016.
- [21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [22] O. Penner, R. K. Pan, A. Petersen, K. Kaski, and S. Fortunato. On the predictability of future impact in science. Scientific Reports, 3:3052, 2013.
- [23] M. Perlin, D. Borenstein, T. Imasato, and M. Reichert. The determinants and impact of research grants: The case of Brazilian productivity scholarships. Journal of Informetrics, 18(4):101563, 2024.
- [24] C. Picinin, L. Pilatti, J. Kovaleski, A. Graeml, and B. Pedroso. Comparison of performance of researchers recipients of CNPq productivity grants in the field of Brazilian production engineering. Scientometrics, 2016. doi:10.1007/s11192-016-2070-7.
- [25] J. Priem, H. Piwowar, and R. Orr. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. Preprint arXiv:2205.01833, 2022.
- [26] L. Rodrigues, M. Gouvêa, F. Marques, and S. Mourão. Overview of the scientific production in the pharmacy area in Brazil: profile and productivity of researchers granted with fellowships by the National Council for Scientific and Technological Development. Scientometrics, 110:1157, 2017. doi:10.1007/s11192-016-2210-0.
- [27] M. Silva and J. DeSantana. Analysis of the profile of Brazilian fellowship researchers productivity in physiotherapy: Observational study. Revista Brasileira de Pós-Graduação, 20:1–19, 2025. doi:10.21713/rbpg.v20i41.1888.
- [28] R. Q. Souto, G. da Silva Lacerda, G. M. C. Costa, A. L. Cavalcanti, I. S. X. França, F. S. Sousa, et al. Caracterização dos pesquisadores bolsistas de produtividade do CNPq da área de enfermagem: estudo transversal. Online Brazilian Journal of Nursing, 11(2):261–273, 2012.
- [29] J. A. Tohalino and D. R. Amancio. On predicting research grants productivity via machine learning. Journal of Informetrics, 16(2):101260, 2022.