Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Annabella Sánchez-Guzmán; Denis Helic; Lisette Espín-Noboa; Lukas Eberhard

arXiv: 2605.28187 · v1 · pith:CQIVOZUPnew · submitted 2026-05-27 · cs.IR · cs.AI · cs.CY · cs.SI

Pith reviewed 2026-06-29 09:57 UTC · model grok-4.3

classification: cs.IR, cs.AI, cs.CY, cs.SI

keywords: scholar recommendation, LLM prompting, persona effects, academic discovery, factuality, diversity, AI bias audits

The pith

Persona prompts in LLMs alter which scholars get recommended as experts, with location and context driving separate effects on factuality and diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that prompt design, specifically the persona elements like language and location, produces measurable differences in LLM scholar recommendations that go beyond the choice of model. It builds a benchmark to test this across many models, disciplines, and prompt variations, then measures outputs against a standard database for accuracy and balance. A reader would care because these systems now influence whose research gains visibility in academia. The results indicate that prompt choices affect who appears on expert lists in ways that can reduce or increase factual accuracy and group representation.

Core claim

The authors develop a benchmark that isolates the effects of model choice from those of persona prompts (language, location, role-and-task) and context (field, seniority, k) when LLMs recommend scholars. They evaluate outputs from 43 models across six disciplines against Semantic Scholar on technical quality measures (factuality, coverage) and social representativeness measures (diversity, parity). Basic technical quality tracks model choice, factuality and parity track context, and diversity tracks location. South Africa persona prompts produce less factual lists while Japan persona prompts produce factual lists that are homogeneous and favor highly productive scholars.

What carries the argument

A benchmark that varies persona prompts and context while holding model fixed, then scores recommended scholars against Semantic Scholar on factuality, coverage, diversity, and parity.
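The factorial design described above can be sketched as a simple grid over prompt dimensions. This is a minimal illustration, not the authors' code: the factor levels, the `TEMPLATE` wording, and the names `PERSONA`, `CONTEXT`, and `build_grid` are all hypothetical. Only the dimension names, the three languages, and the two example countries come from this summary.

```python
from itertools import product

# Hypothetical factor levels; the paper's full value lists are not given here.
PERSONA = {
    "language": ["en", "de", "es"],
    "location": ["South Africa", "Japan"],
    "role_and_task": ["student looking for a supervisor"],
}
CONTEXT = {
    "field": ["physics"],
    "seniority": ["senior"],
    "k": [5, 10],
}

# Hypothetical English template; the real template appears in the paper's Figure 2.
TEMPLATE = (
    "As a {role_and_task} based in {location}, "
    "list {k} {seniority} scholars in {field}."
)


def build_grid(persona, context):
    """Return one dict per combination of persona and context levels."""
    factors = {**persona, **context}
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


grid = build_grid(PERSONA, CONTEXT)
first_prompt = TEMPLATE.format(**grid[0])  # unused keys such as "language" are ignored
print(len(grid), first_prompt)
```

Each cell of the grid would then be sent to every audited model, so that model identity and prompt dimensions vary independently of each other.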

If this is right

Model choice sets the baseline technical quality of scholar lists.
Context details such as field and seniority control how factual and balanced the lists are.
Location specified in the prompt controls how diverse the recommended scholars are.
South Africa persona prompts lower the factuality of the output lists.
Japan persona prompts raise factuality but reduce diversity and skew toward high-productivity scholars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users in different countries could receive systematically different pictures of who counts as an expert through the same LLM.
Prompt templates might need region-specific adjustments to avoid uneven visibility for scholars from certain locations.
Auditing prompts could become a standard step when deploying LLMs for academic search tasks.

Load-bearing premise

Comparing LLM outputs to Semantic Scholar gives an unbiased standard for judging both technical quality and social representativeness of the recommendations.

What would settle it

If factuality, coverage, diversity, and parity scores stayed identical across all persona prompt variations in the benchmark, the claim that prompt design is a non-trivial factor would not hold.

Figures

Figures reproduced from arXiv: 2605.28187 by Annabella Sánchez-Guzmán, Denis Helic, Lisette Espín-Noboa, Lukas Eberhard.

**Figure 1.** Auditing pipeline for quantifying persona and context effects in LLM-based scholar recommendations. The pipeline systematically varies persona variables (who asks the question), including language, role, and country, and context variables (what is asked), including the number of requested scholars, their seniority, field, and subfield. These controlled prompt variations are passed to 43 LLMs and evaluated …

**Figure 2.** Zero-shot prompt template. The English variant of the prompt template; the German and Spanish variants are shown in Section A. Placeholders in braces (role-and-task, location, k, seniority, field, sub-field) are instantiated with the values of six of the seven audited prompt dimensions; the seventh dimension, language, is implicit in the text. All three variants are functionally equivalent translations: sa…

**Figure 3.** Sensitivity of evaluation metrics to prompt variables and LLM choice. Heatmaps report ω² effect sizes from per-metric ANOVA models quantifying the influence of persona, context, and LLM factors (rows) across technical quality and social representativeness metrics (columns). Darker cells indicate stronger influence, and the Residual row reports 1 − R², the variance not attributable to the modeled factors. …

**Figure 4.** Drivers of technical quality in LLM recommendations. Regression coefficients (points) with 95% confidence intervals (bars) from regressions of four technical-quality metrics: validity, seniority factuality, field factuality, and location factuality. Each row is one level of a categorical predictor relative to its reference category (in brackets); the coefficient is the change in the metric on its [0, 1] sc…

**Figure 5.** Drivers of social representativeness in LLM recommendations. Same layout as …

**Figure 6.** Technical quality vs. social representativeness across evaluated LLMs. Each point is one of the 43 audited models, scored on aggregate technical quality (x-axis) and aggregate social representativeness (y-axis). Parity sums over gender, ethnicity, publications, and citations, and factuality over author, field, seniority, and location. Dashed lines mark the per-axis medians, defining quadrants Q1–Q4. Model…
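The ω² effect size named in the Figure 3 caption can be illustrated with a small one-way example. This is a sketch under simplifying assumptions: the paper fits per-metric ANOVA models with several factors, whereas the function below handles a single factor, and the function name and toy scores are hypothetical. It uses the standard one-way estimator, ω² = (SS_between − df_between · MS_within) / (SS_total + MS_within).

```python
def omega_squared(groups):
    """One-way ANOVA omega-squared for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    n = len(all_scores)
    grand_mean = sum(all_scores) / n

    # Between-group and total sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_within = ss_total - ss_between

    df_between = len(groups) - 1
    df_within = n - len(groups)
    ms_within = ss_within / df_within

    return (ss_between - df_between * ms_within) / (ss_total + ms_within)


# Toy factuality scores for two hypothetical prompt locations.
toy_scores = [[0.2, 0.3, 0.4], [0.6, 0.7, 0.8]]
effect = omega_squared(toy_scores)
print(round(effect, 3))
```

The estimator can come out slightly negative when a factor explains almost nothing; such values are usually reported as zero.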

Original abstract

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a benchmark to separate model choice from persona prompt effects in LLM scholar recommendations, but the abstract gives no methods or baseline validation, leaving the claims hard to assess.

The main thing to know is that this paper proposes a benchmark for disentangling LLM model effects from persona prompt effects (language, location, role) when recommending scholars, and it reports that location prompts especially shape diversity and factuality while model choice drives basic quality.

It is new in extending prior English-centric audits by varying those persona axes across 43 models, multiple fields, seniorities, and k values, then scoring outputs against Semantic Scholar on factuality, coverage, diversity, and parity. The high-level split of effects by model, context, and location is a clear step past persona-agnostic work.

The soft spot is that the abstract describes no experimental design, data processing steps, statistical tests, or error handling, so a reader cannot verify from it whether the data support the specific claims about South Africa or Japan prompts. The concern that Semantic Scholar may carry its own coverage biases by language or region is reasonable and unaddressed here; if the baseline itself under-represents certain locations, then prompt-induced differences could be artifacts rather than real effects.

This is for researchers working on LLM auditing or fairness in academic information retrieval. A reader interested in prompt design would get value from the benchmark framing if the full methods hold up.

It deserves peer review because the disentanglement idea is sensible and the topic is timely, even though the current writeup is too thin on evidence to evaluate the results.

Referee Report

2 major / 2 minor

Summary. The manuscript audits 43 LLMs for scholar recommendation by varying persona prompts (language, location, role-and-task) and context (field, seniority, k) across six disciplines. Recommended scholars are compared to Semantic Scholar to quantify technical quality (factuality, coverage) and social representativeness (diversity, parity). The central claims are that model choice drives basic technical quality, context drives factuality and parity, and location drives diversity, with South Africa prompts producing less factual lists and Japan prompts producing highly factual but homogeneous lists skewed toward high-productivity scholars. The conclusion is that prompt design is a non-trivial axis requiring systematic audit alongside model choice.

Significance. If the empirical results hold after addressing baseline validation, the work provides a scalable benchmark for disentangling prompt versus model effects in LLM-based academic search, with implications for fairness in who is surfaced as an expert. The audit's scale (43 models, multiple disciplines and contexts) and explicit separation of prompt axes are strengths that could support reproducible follow-up studies.

major comments (2)

[Methods (comparison to Semantic Scholar)] The evaluation treats Semantic Scholar as the neutral reference for factuality (existence/correctness), coverage, diversity, and parity, yet the manuscript provides no validation that SS coverage or distributions are unbiased with respect to the location (South Africa, Japan) and language axes used in the persona prompts. This is load-bearing for the attribution of differences to prompts rather than baseline artifacts.
[Results (location-prompt findings)] The headline location-prompt effects (SA prompts reduce factuality; Japan prompts increase homogeneity) rest on the untested assumption that SS provides an unbiased ground truth across the tested persona axes; without explicit checks (e.g., coverage rates by region/language in the six disciplines), the causal link to prompt design cannot be isolated.
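The explicit check suggested in the major comments (coverage rates by region) could look roughly like the sketch below. Everything here is an assumption for illustration: the function name, the idea of an external scholar list with region labels, and the toy identifiers are hypothetical and do not come from the manuscript.

```python
from collections import defaultdict


def coverage_by_region(external_scholars, reference_ids):
    """Share of externally listed scholars found in the reference database, per region.

    external_scholars: iterable of (scholar_id, region) pairs from an independent source.
    reference_ids: set of scholar ids present in the reference database.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for scholar_id, region in external_scholars:
        total[region] += 1
        if scholar_id in reference_ids:
            found[region] += 1
    return {region: found[region] / total[region] for region in total}


# Toy data: two scholars per region, one of them missing from the reference set.
external = [("a", "ZA"), ("b", "ZA"), ("c", "JP"), ("d", "JP")]
reference = {"a", "c", "d"}
rates = coverage_by_region(external, reference)
print(rates)
```

If such rates differed sharply between regions, lower factuality for one location could partly reflect gaps in the reference database rather than the prompt itself.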

minor comments (2)

[Abstract] The abstract states high-level findings but omits any description of experimental design, data processing, statistical tests, or error handling, which reduces immediate assessability even though the full methods are presumably present.
[Methods] Ensure operational definitions and formulas for the four metrics (factuality, coverage, diversity, parity) are stated explicitly, including how ties or missing SS entries are handled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to validate Semantic Scholar as a reference. The two major comments raise a single core issue regarding potential bias in the ground truth, which we address point by point below. We agree this assumption merits explicit discussion.

Referee: [Methods (comparison to Semantic Scholar)] The evaluation treats Semantic Scholar as the neutral reference for factuality (existence/correctness), coverage, diversity, and parity, yet the manuscript provides no validation that SS coverage or distributions are unbiased with respect to the location (South Africa, Japan) and language axes used in the persona prompts. This is load-bearing for the attribution of differences to prompts rather than baseline artifacts.

Authors: We acknowledge that the manuscript does not provide explicit validation or coverage statistics for Semantic Scholar broken down by the location and language axes. Semantic Scholar was selected as the reference because it is the largest open academic graph with broad disciplinary coverage and is commonly used in similar recommendation audits. In the revision we will add a limitations subsection that (a) states the assumption explicitly, (b) discusses how regional or language biases in SS could affect absolute factuality scores, and (c) reports any readily available aggregate coverage indicators for the six disciplines. Comprehensive per-region validation would require external datasets not integrated in the current study.

Revision: partial

Referee: [Results (location-prompt findings)] The headline location-prompt effects (SA prompts reduce factuality; Japan prompts increase homogeneity) rest on the untested assumption that SS provides an unbiased ground truth across the tested persona axes; without explicit checks (e.g., coverage rates by region/language in the six disciplines), the causal link to prompt design cannot be isolated.

Authors: The reported location effects are differences relative to a fixed SS reference; therefore the comparative claims (SA prompts produce lower factuality scores than other locations, Japan prompts produce higher homogeneity) remain internally consistent even if SS itself has coverage skew. We agree, however, that the manuscript should qualify the interpretation by noting that absolute factuality attributions could be confounded by reference bias. The revision will insert clarifying language in the results and discussion sections stating that the findings demonstrate prompt-induced shifts relative to SS and that future work should cross-validate against additional sources. This does not change the core empirical patterns but strengthens the causal framing.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparisons to external Semantic Scholar benchmark

The paper conducts a purely empirical audit: 43 LLMs are prompted with varying persona (language, location, role) and context (field, seniority, k) settings; outputs are scored for factuality, coverage, diversity, and parity by direct comparison to Semantic Scholar records across six disciplines. No equations, fitted parameters, predictions, or derivations appear. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. All central claims (e.g., location prompts drive diversity; South Africa prompts reduce factuality) are measured against an independent external database rather than reducing to the paper's own inputs by construction. This is the standard case of a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information available from the abstract to identify or populate free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

53 extracted references · 15 canonical work pages · 4 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
[3]

I.; Gunasekar, S.; Chandrasekaran, V.; Li, J.; Yuksekgonul, M.; Peshawaria, R.; Naik, R.; and Nushi, B

Abdin, M. I.; Gunasekar, S.; Chandrasekaran, V.; Li, J.; Yuksekgonul, M.; Peshawaria, R.; Naik, R.; and Nushi, B. 2024. Kitab: Evaluating llms on constraint satisfaction for information retrieval. In International Conference on Learning Representations, volume 2024, 30664--30686

2024
[4]

Anonymous. 2025. Anonymous Repository for Auditing LLMs as People Recommender Systems Across Languages and Countries. https://anonymous.4open.science/r/PersonasScholarRec. Anonymous repository for double-blind review

2025
[5]

Anzenberg, E.; Samajpati, A.; Chandrasekar, S.; and Kacholia, V. 2025. Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions. arXiv preprint arXiv:2507.02087

work page arXiv 2025
[6]

S.; and Jayagopi, D

Awasthi, D.; Rao, P. S.; and Jayagopi, D. B. 2025. ResumeGenAI: Supporting Job Seekers with LLM-Driven Resume Feedback. In Proceedings of the 7th ACM Conference on Conversational User Interfaces, 1--9

2025
[7]

G.; and Esp \' n-Noboa, L

Barolo, D.; Valentin, C.; Karimi, F.; Gal \'a rraga, L.; M \'e ndez, G. G.; and Esp \' n-Noboa, L. 2025. Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations. arXiv preprint arXiv:2506.00074

work page arXiv 2025
[8]

Benjamini, Y.; and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1): 289--300

1995
[9]

S.; and Pagan, A

Breusch, T. S.; and Pagan, A. R. 1979. A simple test for heteroscedasticity and random coefficient variation. Econometrica: Journal of the econometric society, 1287--1294

1979
[10]

S.; Zhang, Y.; Kejriwal, M.; and Calyam, P

Cheng, X.; Edara, L. S.; Zhang, Y.; Kejriwal, M.; and Calyam, P. 2024. Influence Role Recognition and LLM-Based Scholar Recommendation in Academic Social Networks. In 2024 IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA), 1--11. IEEE

2024
[11]

De Araujo, P. H. L.; R \"o ttger, P.; Hovy, D.; and Roth, B. 2025. Principled personas: Defining and measuring the intended effects of persona prompting on task performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 26845--26874

2025
[12]

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Durmus, E.; Nguyen, K.; Liao, T. I.; Schiefer, N.; Askell, A.; Bakhtin, A.; Chen, C.; Hatfield-Dodds, Z.; Hernandez, D.; Joseph, N.; et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Espin-Noboa, L.; and Mendez, G. G. 2026. Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation. arXiv preprint arXiv:2602.08873

work page internal anchor Pith review Pith/arXiv arXiv 2026
[14]

L.; Bonchi, F.; and Castillo, C

Fabbri, F.; Croci, M. L.; Bonchi, F.; and Castillo, C. 2022. Exposure inequality in people recommender systems: The long-term effects. In Proceedings of the international AAAI conference on web and social media, volume 16, 194--204

2022
[15]

Hanusz, Z.; Tarasinska, J.; and Zielinski, W. 2016. Shapiro--Wilk test with known mean. REVSTAT-statistical Journal, 14(1): 89--100

2016
[16]

Hu, T.; and Collier, N. 2024. Quantifying the Persona Effect in LLM Simulations. arXiv preprint arXiv:2402.10811

work page arXiv 2024
[17]

M.; Macedo, M.; Oliveira, M.; Karimi, F.; and Menezes, R

Jaramillo, A. M.; Macedo, M.; Oliveira, M.; Karimi, F.; and Menezes, R. 2025. Systematic comparison of gender inequality in scientific rankings across disciplines. arXiv preprint arXiv:2501.13061

work page arXiv 2025
[18]

Jiao, J.; Afroogh, S.; Xu, Y.; and Phillips, C. 2025. Navigating llm ethics: Advancements, challenges, and future directions. AI and Ethics, 1--25

2025
[19]

Karimi, F.; Wagner, C.; Lemmerich, F.; Jadidi, M.; and Strohmaier, M. 2016. Inferring Gender from Names on the Web: A Comparative Evaluation of Gender Detection Methods. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, 53--54. New York, New York, USA: ACM Press

2016
[20]

Kim, J.; Yang, N.; and Jung, K. 2025. Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts in Zero-shot Reasoning Tasks. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 848--862

2025
[21]

Kinney, R. M.; Anastasiades, C.; Authur, R.; Beltagy, I.; Bragg, J.; Buraczynski, A.; Cachola, I.; Candra, S.; Chandrasekhar, Y.; Cohan, A.; Crawford, M.; Downey, D.; Dunkelberger, J.; Etzioni, O.; Evans, R.; Feldman, S.; Gorney, J.; Graham, D. W.; Hu, F.; Huff, R.; King, D.; Kohlmeier, S.; Kuehl, B.; Langan, M.; Lin, D.; Liu, H.; Lo, K.; Lochner, J.; Mac...

work page arXiv 2023
[22]

S.; Bell, A.; Hulsey, W.; Larivière, V.; Monroe-White, T.; and Sugimoto, C

Kozlowski, D.; Murray, D. S.; Bell, A.; Hulsey, W.; Larivière, V.; Monroe-White, T.; and Sugimoto, C. R. 2022. Avoiding bias when inferring race using name-based approaches. PLOS ONE

2022
[23]

D.; and Finley, J

Kroes, A. D.; and Finley, J. R. 2023. Demystifying omega squared: Practical guidance for effect size in common analysis of variance designs. Psychological Methods

2023
[24]

R.; and Koch, G

Landis, J. R.; and Koch, G. G. 1977. The measurement of observer agreement for categorical data. biometrics, 159--174

1977
[25]

Letteri, I.; and Vittorini, P. 2024. Exploring the Impact of LLM-Generated Feedback: Evaluation from Professors and Students in Data Science Courses. In International Conference in Methodologies and intelligent Systems for Techhnology Enhanced Learning, 11--20. Springer

2024
[26]

Li, T.; Qin, Y.; and Sheng, O. R. L. 2025. A Multi-Task Evaluation of LLMs' Processing of Academic Text Input. arXiv preprint arXiv:2508.11779

work page arXiv 2025
[27]

Liang, L.; and Acuna, D. 2021. demographicx: A Python package for estimating gender and ethnicity using deep learning transformers. Zenodo https://doi. org/10.5281/zenodo, 4898367

work page doi:10.5281/zenodo 2021
[28]

Lin, A.; Wang, J.; Zhu, Z.; and Caverlee, J. 2022. Quantifying and mitigating popularity bias in conversational recommender systems. In Proceedings of the 31st ACM international conference on information & knowledge management, 1238--1247

2022
[29]

Liu, Y.; Elekes, \'A .; Lu, J.; Dorantes-Gilardi, R.; and Barab \'a si, A.-L. 2025. Unequal Scientific Recognition in the Age of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, 23558--23568

2025
[30]

P.-W.; Qiu, J.; Wang, Z.; Yu, H.; Chen, Y.; Zhang, G.; and Lo, B

Lo, F. P.-W.; Qiu, J.; Wang, Z.; Yu, H.; Chen, Y.; Zhang, G.; and Lo, B. 2025. AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 4223--4232

2025
[31]

Lutz, M.; Sen, I.; Ahnert, G.; Rogers, E.; and Strohmaier, M. 2025. The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, 23212--23237. Suzhou, China: Association for Computational Linguistics

2025
[32]

Pava, J.; Meinhardt, C.; Zaman, H. B. U.; Friedman, T.; Truong, S. T.; Zhang, D.; Marivate, V.; and Koyejo, S. 2025. Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

2025
[33]

Polonioli, A. 2021. The ethics of scientific recommender systems. Scientometrics, 126(2): 1841--1848

2021
[34]

Priem, J.; Piwowar, H.; and Orr, R. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

K.; and Bijoy Das, A

Sakib, S. K.; and Bijoy Das, A. 2024. Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations. In 2024 IEEE International Conference on Big Data (BigData), 1585--1592

2024
[36]

Sandnes, F. E. 2024. Can we identify prominent scholars using ChatGPT? Scientometrics, 129(1): 713--718

2024
[37]

Sood, G.; and Laohaprapanon, S. 2018. Predicting Race and Ethnicity From the Sequence of Characters in a Name. arXiv preprint arXiv:1805.02109

work page arXiv 2018
[38]

Tonneau, M.; Sehgal, N. K. R.; Malhotra, N.; Kazemi, S.; Orozco-Olvera, V.; Mu \ n oz Boudet, A. M.; Subramanian, L.; Fraiberger, S. P.; Guntuku, S. C.; and Hofmann, V. 2026. Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias . arXiv preprint arXiv:2601.18486v2

work page arXiv 2026
[39]

V \'a s \'a rhelyi, O.; and Horv \'a t, E.- \'A . 2023. Who benefits from altmetrics? The effect of team gender composition on the link between online visibility and citation impact. arXiv preprint arXiv:2308.00405

work page arXiv 2023
[40]

V \'a s \'a rhelyi, O.; Zakhlebin, I.; Milojevi \'c , S.; and Horv \'a t, E.- \'A . 2021. Gender inequities in the online dissemination of scholars’ work. Proceedings of the National Academy of Sciences, 118(39): e2102945118

2021
[41]

Wald, A. 1943. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical society, 54(3): 426--482

1943
[42]

E.; and Koyejo, S

Wang, A.; Ho, D. E.; and Koyejo, S. 2025. The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior. Patterns, 6(12)

2025
[43]

A.; Liu, F.; Georgiev, G

Wang, Y.; Wang, M.; Manzoor, M. A.; Liu, F.; Georgiev, G. N.; Das, R. J.; and Nakov, P. 2024. Factuality of Large Language Models: A Survey. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 19519--19529. Miami, Florida, USA: Association for Computational Linguistics

2024
[44]

Weeber, F.; Neplenbroek, V.; Batzner, J.; and Pad \'o , S. 2026. One Persona , Many Cues , Different Results : How Sociodemographic Cues Impact LLM Personalization . arXiv preprint arXiv:2601.18572

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Whittle, R. 2024. Why Microsoft’s Copilot AI falsely accused court reporter of crimes he covered. The Conversation. Accessed: 2026-05-22

2024
[46]

Wilson, K.; Sim, M.; Gueorguieva, A.-M.; and Caliskan, A. 2025. No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, 2692--2704

2025
[47]

B.; and Hu, G

Xu, S. B.; and Hu, G. 2025. Rethinking the author name ambiguity problem and beyond: The case of the Chinese context. Accountability in research, 32(6): 913--936

2025
[48]

Xu, Y.; Hu, L.; Zhao, J.; Qiu, Z.; Xu, K.; Ye, Y.; and Gu, H. 2025. A survey on multilingual large language models: Corpora, alignment, and bias. Frontiers of Computer Science, 19(11): 1911362

2025
[49]

Ye, W.; Zhang, Q.; Zhou, X.; Hu, W.; Tian, C.; and Cheng, J. 2024. Correcting Factual Errors in LLMs via Inference Paths Based on Knowledge Graph. In 2024 International Conference on Computational Linguistics and Natural Language Processing (CLNLP), 12--16. IEEE

2024
[50]

Ye, X.; and Durrett, G. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. Advances in neural information processing systems, 35: 30378--30392

2022
[51]

Zhang, X.; Li, S.; Hauer, B.; Shi, N.; and Kondrak, G. 2023. Don’t Trust ChatGPT when your Question is not in English: A Study of Multilingual Abilities and Types of LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7915--7927

2023
[52]

Zhao, J.; Zhang, S.; Xu, N.; and Wang, L. 2025. SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys. arXiv preprint arXiv:2512.02763

work page arXiv 2025
[53]

Zheng, M.; Pei, J.; Logeswaran, L.; Lee, M.; and Jurgens, D. 2024. When” a helpful assistant” is not really helpful: Personas in system prompts do not improve performances of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 15126--15154

2024

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[3] [3]

I.; Gunasekar, S.; Chandrasekaran, V.; Li, J.; Yuksekgonul, M.; Peshawaria, R.; Naik, R.; and Nushi, B

Abdin, M. I.; Gunasekar, S.; Chandrasekaran, V.; Li, J.; Yuksekgonul, M.; Peshawaria, R.; Naik, R.; and Nushi, B. 2024. Kitab: Evaluating llms on constraint satisfaction for information retrieval. In International Conference on Learning Representations, volume 2024, 30664--30686

2024

[4] [4]

Anonymous. 2025. Anonymous Repository for Auditing LLMs as People Recommender Systems Across Languages and Countries. https://anonymous.4open.science/r/PersonasScholarRec. Anonymous repository for double-blind review

2025

[5] [5]

Anzenberg, E.; Samajpati, A.; Chandrasekar, S.; and Kacholia, V. 2025. Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions. arXiv preprint arXiv:2507.02087

work page arXiv 2025

[6] [6]

S.; and Jayagopi, D

Awasthi, D.; Rao, P. S.; and Jayagopi, D. B. 2025. ResumeGenAI: Supporting Job Seekers with LLM-Driven Resume Feedback. In Proceedings of the 7th ACM Conference on Conversational User Interfaces, 1--9

2025

[7] [7]

G.; and Esp \' n-Noboa, L

Barolo, D.; Valentin, C.; Karimi, F.; Gal \'a rraga, L.; M \'e ndez, G. G.; and Esp \' n-Noboa, L. 2025. Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations. arXiv preprint arXiv:2506.00074

work page arXiv 2025

[8] Benjamini, Y.; and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1): 289–300.

[9] Breusch, T. S.; and Pagan, A. R. 1979. A simple test for heteroscedasticity and random coefficient variation. Econometrica: Journal of the Econometric Society, 1287–1294.

[10] Cheng, X.; Edara, L. S.; Zhang, Y.; Kejriwal, M.; and Calyam, P. 2024. Influence Role Recognition and LLM-Based Scholar Recommendation in Academic Social Networks. In 2024 IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA), 1–11. IEEE.

[11] De Araujo, P. H. L.; Röttger, P.; Hovy, D.; and Roth, B. 2025. Principled personas: Defining and measuring the intended effects of persona prompting on task performance. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 26845–26874.

[12] Durmus, E.; Nguyen, K.; Liao, T. I.; Schiefer, N.; Askell, A.; Bakhtin, A.; Chen, C.; Hatfield-Dodds, Z.; Hernandez, D.; Joseph, N.; et al. 2023. Towards measuring the representation of subjective global opinions in language models. arXiv preprint arXiv:2306.16388.

[13] Espin-Noboa, L.; and Mendez, G. G. 2026. Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation. arXiv preprint arXiv:2602.08873.

[14] Fabbri, F.; Croci, M. L.; Bonchi, F.; and Castillo, C. 2022. Exposure inequality in people recommender systems: The long-term effects. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, 194–204.

[15] Hanusz, Z.; Tarasinska, J.; and Zielinski, W. 2016. Shapiro–Wilk test with known mean. REVSTAT-Statistical Journal, 14(1): 89–100.

[16] Hu, T.; and Collier, N. 2024. Quantifying the Persona Effect in LLM Simulations. arXiv preprint arXiv:2402.10811.

[17] Jaramillo, A. M.; Macedo, M.; Oliveira, M.; Karimi, F.; and Menezes, R. 2025. Systematic comparison of gender inequality in scientific rankings across disciplines. arXiv preprint arXiv:2501.13061.

[18] Jiao, J.; Afroogh, S.; Xu, Y.; and Phillips, C. 2025. Navigating LLM ethics: Advancements, challenges, and future directions. AI and Ethics, 1–25.

[19] Karimi, F.; Wagner, C.; Lemmerich, F.; Jadidi, M.; and Strohmaier, M. 2016. Inferring Gender from Names on the Web: A Comparative Evaluation of Gender Detection Methods. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, 53–54. New York, New York, USA: ACM Press.

[20] Kim, J.; Yang, N.; and Jung, K. 2025. Persona is a Double-Edged Sword: Rethinking the Impact of Role-play Prompts in Zero-shot Reasoning Tasks. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, 848–862.

[21] Kinney, R. M.; Anastasiades, C.; Authur, R.; Beltagy, I.; Bragg, J.; Buraczynski, A.; Cachola, I.; Candra, S.; Chandrasekhar, Y.; Cohan, A.; Crawford, M.; Downey, D.; Dunkelberger, J.; Etzioni, O.; Evans, R.; Feldman, S.; Gorney, J.; Graham, D. W.; Hu, F.; Huff, R.; King, D.; Kohlmeier, S.; Kuehl, B.; Langan, M.; Lin, D.; Liu, H.; Lo, K.; Lochner, J.; Mac... arXiv, 2023.

[22] Kozlowski, D.; Murray, D. S.; Bell, A.; Hulsey, W.; Larivière, V.; Monroe-White, T.; and Sugimoto, C. R. 2022. Avoiding bias when inferring race using name-based approaches. PLOS ONE.

[23] Kroes, A. D.; and Finley, J. R. 2023. Demystifying omega squared: Practical guidance for effect size in common analysis of variance designs. Psychological Methods.

[24] Landis, J. R.; and Koch, G. G. 1977. The measurement of observer agreement for categorical data. Biometrics, 159–174.

[25] Letteri, I.; and Vittorini, P. 2024. Exploring the Impact of LLM-Generated Feedback: Evaluation from Professors and Students in Data Science Courses. In International Conference in Methodologies and Intelligent Systems for Technology Enhanced Learning, 11–20. Springer.

[26] Li, T.; Qin, Y.; and Sheng, O. R. L. 2025. A Multi-Task Evaluation of LLMs' Processing of Academic Text Input. arXiv preprint arXiv:2508.11779.

[27] Liang, L.; and Acuna, D. 2021. demographicx: A Python package for estimating gender and ethnicity using deep learning transformers. Zenodo. https://doi.org/10.5281/zenodo.4898367.

[28] Lin, A.; Wang, J.; Zhu, Z.; and Caverlee, J. 2022. Quantifying and mitigating popularity bias in conversational recommender systems. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 1238–1247.

[29] Liu, Y.; Elekes, Á.; Lu, J.; Dorantes-Gilardi, R.; and Barabási, A.-L. 2025. Unequal Scientific Recognition in the Age of LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, 23558–23568.

[30] Lo, F. P.-W.; Qiu, J.; Wang, Z.; Yu, H.; Chen, Y.; Zhang, G.; and Lo, B. 2025. AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 4223–4232.

[31] Lutz, M.; Sen, I.; Ahnert, G.; Rogers, E.; and Strohmaier, M. 2025. The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, 23212–23237. Suzhou, China: Association for Computational Linguistics.

[32] Pava, J.; Meinhardt, C.; Zaman, H. B. U.; Friedman, T.; Truong, S. T.; Zhang, D.; Marivate, V.; and Koyejo, S. 2025. Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts.

[33] Polonioli, A. 2021. The ethics of scientific recommender systems. Scientometrics, 126(2): 1841–1848.

[34] Priem, J.; Piwowar, H.; and Orr, R. 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.

[35] Sakib, S. K.; and Bijoy Das, A. 2024. Challenging Fairness: A Comprehensive Exploration of Bias in LLM-Based Recommendations. In 2024 IEEE International Conference on Big Data (BigData), 1585–1592.

[36] Sandnes, F. E. 2024. Can we identify prominent scholars using ChatGPT? Scientometrics, 129(1): 713–718.

[37] Sood, G.; and Laohaprapanon, S. 2018. Predicting Race and Ethnicity From the Sequence of Characters in a Name. arXiv preprint arXiv:1805.02109.

[38] Tonneau, M.; Sehgal, N. K. R.; Malhotra, N.; Kazemi, S.; Orozco-Olvera, V.; Muñoz Boudet, A. M.; Subramanian, L.; Fraiberger, S. P.; Guntuku, S. C.; and Hofmann, V. 2026. Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias. arXiv preprint arXiv:2601.18486v2.

[39] Vásárhelyi, O.; and Horvát, E.-Á. 2023. Who benefits from altmetrics? The effect of team gender composition on the link between online visibility and citation impact. arXiv preprint arXiv:2308.00405.

[40] Vásárhelyi, O.; Zakhlebin, I.; Milojević, S.; and Horvát, E.-Á. 2021. Gender inequities in the online dissemination of scholars’ work. Proceedings of the National Academy of Sciences, 118(39): e2102945118.

[41] Wald, A. 1943. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3): 426–482.

[42] Wang, A.; Ho, D. E.; and Koyejo, S. 2025. The inadequacy of offline large language model evaluations: A need to account for personalization in model behavior. Patterns, 6(12).

[43] Wang, Y.; Wang, M.; Manzoor, M. A.; Liu, F.; Georgiev, G. N.; Das, R. J.; and Nakov, P. 2024. Factuality of Large Language Models: A Survey. In Al-Onaizan, Y.; Bansal, M.; and Chen, Y.-N., eds., Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 19519–19529. Miami, Florida, USA: Association for Computational Linguistics.

[44] Weeber, F.; Neplenbroek, V.; Batzner, J.; and Padó, S. 2026. One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization. arXiv preprint arXiv:2601.18572.

[45] Whittle, R. 2024. Why Microsoft’s Copilot AI falsely accused court reporter of crimes he covered. The Conversation. Accessed: 2026-05-22.

[46] Wilson, K.; Sim, M.; Gueorguieva, A.-M.; and Caliskan, A. 2025. No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, 2692–2704.

[47] Xu, S. B.; and Hu, G. 2025. Rethinking the author name ambiguity problem and beyond: The case of the Chinese context. Accountability in Research, 32(6): 913–936.

[48] Xu, Y.; Hu, L.; Zhao, J.; Qiu, Z.; Xu, K.; Ye, Y.; and Gu, H. 2025. A survey on multilingual large language models: Corpora, alignment, and bias. Frontiers of Computer Science, 19(11): 1911362.

[49] Ye, W.; Zhang, Q.; Zhou, X.; Hu, W.; Tian, C.; and Cheng, J. 2024. Correcting Factual Errors in LLMs via Inference Paths Based on Knowledge Graph. In 2024 International Conference on Computational Linguistics and Natural Language Processing (CLNLP), 12–16. IEEE.

[50] Ye, X.; and Durrett, G. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. Advances in Neural Information Processing Systems, 35: 30378–30392.

[51] Zhang, X.; Li, S.; Hauer, B.; Shi, N.; and Kondrak, G. 2023. Don’t Trust ChatGPT when your Question is not in English: A Study of Multilingual Abilities and Types of LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7915–7927.

[52] Zhao, J.; Zhang, S.; Xu, N.; and Wang, L. 2025. SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys. arXiv preprint arXiv:2512.02763.

[53] Zheng, M.; Pei, J.; Logeswaran, L.; Lee, M.; and Jurgens, D. 2024. When "a helpful assistant" is not really helpful: Personas in system prompts do not improve performances of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 15126–15154.