arxiv: 2606.09637 · v1 · pith:XQGMHTAEnew · submitted 2026-06-08 · 💻 cs.SE

Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation

Pith reviewed 2026-06-27 15:19 UTC · model grok-4.3

classification 💻 cs.SE
keywords persona generation · LLM agents · critique-refinement · industrial evaluation · software engineering · requirements elicitation · agentic systems · expert validation

The pith

PerGent generates personas via an iterative LLM critique-refinement loop that reaches 96.9% expert approval in an industrial test.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PerGent as a method to automate persona creation for software engineering tasks such as requirements elicitation. An orchestrator coordinates a generator LLM and a critic LLM, refining the output over multiple rounds using sources such as interviews, surveys, and job postings. In a deployment at Kinaxis, PerGent outperformed three baselines, including one-shot approaches: it earned the highest expert approval rate, reproduced more content from prior expert personas, and added substantial new material. This matters because manual persona development remains costly and difficult to scale, so a reliable automated alternative could widen the use of personas in design and validation.

Core claim

PerGent, an industry-grade method for persona generation built around an iterative critique-refinement loop, uses a generator and a critic LLM agent coordinated by an orchestrator to refine personas from external resources such as interviews, surveys, and job postings through a user-defined maximum number of rounds. In an expert in-situ evaluation at Kinaxis, PerGent achieved the highest expert approval rate of 96.9 percent, exceeding all baselines. Compared to baselines, PerGent reproduces a larger proportion of expert content while also contributing substantial new content beyond the pre-LLM personas.

What carries the argument

The critique-refinement loop in which a generator LLM produces personas and a critic LLM evaluates them against provided data sources, coordinated by an orchestrator for iterative passes up to a maximum round limit.
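
The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `refine_persona`, `LoopResult`, `stub_generate`, and `stub_critique` are hypothetical, the two agents are plain Python callables standing in for LLM calls, and stopping early when the critic has no objections is an assumption (the source only states that there is a user-defined maximum number of rounds).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical agent interfaces (assumptions, not taken from the paper).
# Generator: (resources, previous_draft, critique) -> persona text
Generator = Callable[[List[str], Optional[str], Optional[str]], str]
# Critic: (persona, resources) -> critique text, or None when it has no objections
Critic = Callable[[str, List[str]], Optional[str]]


@dataclass
class LoopResult:
    persona: str
    rounds_used: int  # refinement rounds actually performed
    llm_calls: int    # generator calls plus critic calls


def refine_persona(generate: Generator, critique: Critic,
                   resources: List[str], max_rounds: int) -> LoopResult:
    """Orchestrate generator and critic for at most max_rounds refinements."""
    if max_rounds < 0:
        raise ValueError("max_rounds must be non-negative")
    persona = generate(resources, None, None)  # initial draft
    calls, rounds = 1, 0
    while rounds < max_rounds:
        feedback = critique(persona, resources)
        calls += 1
        if feedback is None:  # assumed early exit: critic is satisfied
            break
        persona = generate(resources, persona, feedback)
        calls += 1
        rounds += 1
    return LoopResult(persona, rounds, calls)


# Toy stand-ins for the two LLM agents, for illustration only.
def stub_generate(resources, previous, feedback):
    if previous is None:
        return "Supply planner"
    return previous + " | addressed: " + feedback


def stub_critique(persona, resources):
    missing = [r for r in resources if r not in persona]
    return missing[0] if missing else None
```

With `max_rounds=0` the sketch degenerates to a single generator call, which corresponds to the one-shot setting the paper uses as a baseline.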

If this is right

  • PerGent exceeds all three baselines, including the one-shot ones, in expert approval rate during in-situ evaluation.
  • PerGent reproduces a larger proportion of content from pre-existing expert personas than the baselines.
  • PerGent contributes substantial new content not found in pre-LLM expert personas.
  • The method supports deployment and evaluation inside an active industrial context such as Kinaxis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the iterative loop drives the gains, single-shot LLM methods may consistently miss nuanced details that multiple critique passes can capture.
  • The same generator-critic structure could be tested on related artifacts such as user stories or acceptance criteria.
  • Lessons from the Kinaxis deployment could inform how teams set round limits or select data sources when adapting the method elsewhere.

Load-bearing premise

Expert approval rates from a single-company in-situ setting provide an unbiased and generalizable measure of persona quality.

What would settle it

A replication study at a different company with independent experts that finds PerGent's approval rate falls below one or more one-shot baselines.
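
The check such a replication would run can be sketched as below. This is an illustrative assumption about how approvals might be tallied, not the paper's analysis procedure: `approval_rates` and `baselines_beating` are hypothetical names, and the input is assumed to be a list of (method, approved) pairs, one per expert judgment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def approval_rates(judgments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of approved items per generation method."""
    counts = defaultdict(lambda: [0, 0])  # method -> [approved, total]
    for method, approved in judgments:
        counts[method][1] += 1
        if approved:
            counts[method][0] += 1
    return {m: a / t for m, (a, t) in counts.items()}


def baselines_beating(rates: Dict[str, float], target: str = "PerGent") -> List[str]:
    """List methods whose approval rate is strictly higher than the target's."""
    return sorted(m for m, r in rates.items() if m != target and r > rates[target])
```

A non-empty result from `baselines_beating` on the replication data would be the outcome described above. A real study would also need a significance test and an agreement measure, which this sketch omits.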

Figures

Figures reproduced from arXiv: 2606.09637 by David Dewar, Mehrdad Sabetzadeh, Mohammad Hossein Amini, Shiva Nejati.

Figure 1. Example of a supply-planner persona; each section may …
Figure 3. Overview of our agentic persona generator (PerGent).
Figure 4. Workflow of PERGENT and the baselines. PERGENTNORES uses both Steps 1 and 2, but without external resources; ONESHOT and ONESHOT+RES use only Step 1, with external resources used only by ONESHOT+RES.
  RQ3 (Cost). What is the cost of PerGent in terms of LLM calls and tokens per call? We compare PerGent and the baselines on persona-generation cost using the number of LLM calls and tokens required to generat…
Figure 5. Experiment workflows for our research questions (RQ1, RQ2 and RQ3).
Figure 6. Edit categories for generated persona items.
Figure 7. Distinctness vs. full preservation. Relative to ONESHOT, PERGENT improves distinctness and full preservation by 10.7% and 12.5%, respectively; the corresponding gains over ONESHOT+RES are 9.4% and 12.4%, and over PERGENTNORES are 5.1% and 9.3%.

Original abstract

Personas are widely used in software engineering to support requirements elicitation, design, and validation, but their manual creation is costly, time-consuming, and hard to scale. Recent LLM-based approaches automate persona generation from textual data; however, they typically rely on single-shot generation and subjective evaluations, limiting practical reliability. We present PerGent, an industry-grade method for persona generation built around an iterative critique-refinement loop. Specifically, PerGent uses a generator and a critic LLM agent, coordinated by an orchestrator, to iteratively refine personas using external resources such as interviews, surveys, and job postings through a critique-refinement loop with a user-defined maximum number of rounds. We deploy and evaluate PerGent in an industrial setting at Kinaxis, comparing it with three baselines, including one-shot methods. In an expert in-situ evaluation, PerGent achieved the highest expert approval rate (96.9%), exceeding all baselines. We further compare PerGent-generated personas with best-practice personas manually created by domain experts prior to the adoption of LLMs. Compared to baselines, PerGent reproduces a larger proportion of expert content while also contributing substantial new content beyond the pre-LLM personas. We conclude with lessons learned from deploying and evaluating PerGent at Kinaxis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents PerGent, an agentic persona generation method that uses a generator-critic LLM pair coordinated by an orchestrator in an iterative critique-refinement loop (with user-defined max rounds and external resources such as interviews and job postings). In an industrial deployment at Kinaxis, PerGent is compared to three baselines (including one-shot LLM methods) and to pre-LLM manually created expert personas; the central empirical claim is that PerGent attains a 96.9% expert approval rate (highest among methods) while reproducing a larger share of expert content and adding substantial new content.

Significance. If the evaluation results hold under more rigorous controls, the work supplies a rare industrial case study showing that multi-agent iterative refinement can outperform single-shot LLM generation for a practically important SE artifact. The explicit comparison against pre-LLM expert personas provides independent grounding that is uncommon in LLM persona papers and strengthens the practical relevance of the findings.

major comments (1)
  1. [Abstract and Evaluation section] The central superiority claim rests on the reported 96.9% expert approval rate and the reproduction metric versus pre-LLM personas, yet the manuscript supplies no information on the number of experts, blinding procedures, inter-rater agreement, or how familiarity with the iterative Kinaxis context was controlled. These omissions are load-bearing because the evaluation is in-situ and the iterative loop is unique to PerGent.
minor comments (1)
  1. [Method section] The description of how external resources are ingested and how the orchestrator decides termination could be expanded for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on the evaluation methodology. We agree that additional details are needed to support the reported results and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central superiority claim rests on the reported 96.9% expert approval rate and the reproduction metric versus pre-LLM personas, yet the manuscript supplies no information on the number of experts, blinding procedures, inter-rater agreement, or how familiarity with the iterative Kinaxis context was controlled. These omissions are load-bearing because the evaluation is in-situ and the iterative loop is unique to PerGent.

    Authors: We agree that the manuscript omits key details on the evaluation procedure. In the revised version we will expand the Evaluation section with a dedicated paragraph reporting the exact number of experts who performed the approval ratings, any blinding procedures employed, inter-rater agreement statistics, and how experts' familiarity with the Kinaxis context was handled. Because the study is an in-situ industrial deployment, complete blinding to the generation method was not feasible; we will explicitly note this limitation and describe the mitigation steps taken. These additions will allow readers to assess the strength of the 96.9% approval claim and the pre-LLM persona comparison.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external expert judgments and pre-LLM baselines

Full rationale

The paper presents an industrial evaluation of PerGent using expert approval rates (96.9%) collected in-situ at Kinaxis and direct comparisons to independently created pre-LLM personas. These metrics are external to the generation process and not reduced to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked; the central claims are grounded in independent human assessment rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the procedural effectiveness of the critique-refinement loop and on the validity of expert approval as a quality signal. No numerical parameters are fitted, no new physical or mathematical entities are postulated, and the only notable assumption is the reliability of the expert evaluation.

axioms (1)
  • Domain assumption: Expert in-situ approval rates constitute a reliable and unbiased proxy for persona quality and utility.
    The 96.9% approval figure and the comparison to pre-LLM personas are treated as decisive evidence of superiority.

pith-pipeline@v0.9.1-grok · 5759 in / 1350 out tokens · 26600 ms · 2026-06-27T15:19:23.217806+00:00 · methodology
