pith. sign in

arxiv: 2605.22971 · v1 · pith:QWMP2MIFnew · submitted 2026-05-21 · 💻 cs.CL · cs.HC

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Pith reviewed 2026-05-25 05:50 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords LLMexpertise estimationSlack logsdomain knowledgecommunication analysiszero-shot inferenceGeminiGPT models
0
0 comments X

The pith

Large language models can estimate individual domain knowledge from Slack logs, with accuracy varying sharply by model and showing only weak dependence on message volume.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs can automatically map who knows what inside an organization by reading employees' chat histories instead of asking direct questions. It processes 27,188 Slack messages from 43 users and pits seven models against self-reported skill ratings collected from 27 participants on a 0-100 scale. Gemini 2.5 Flash records the lowest mean absolute error at 21.13 percent while GPT-family models produce noticeably larger errors; crucially, giving a model more messages does not reliably reduce that error. Organizations lose productivity when they cannot locate internal expertise, so an automated method could help surface hidden knowledge if the error rates prove acceptable in practice.

Core claim

Zero-shot prompting of LLMs on raw Slack communication logs yields domain-knowledge estimates whose mean absolute error against self-reported ratings reaches 21.13 percent for Gemini 2.5 Flash, with GPT models showing significantly larger discrepancies, and with estimation accuracy depending only weakly on the number of messages available per user.

What carries the argument

Zero-shot LLM inference of per-user skill levels (0-100) from unstructured, long-term communication logs, scored by mean absolute error against self-reported ground truth.

If this is right

  • Automated mapping of expertise from everyday chat data is feasible with current models.
  • Model choice matters: Gemini 2.5 Flash outperforms GPT variants on this task.
  • Simply supplying more messages does not guarantee better estimates.
  • Any real-world use requires privacy-preserving techniques because raw logs contain sensitive content.
  • Richer, structure-aware knowledge representations would likely reduce current error rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations could first test the method on anonymized message summaries to limit privacy exposure before scaling to full logs.
  • The same pipeline might be applied to email threads or project-management comments with only minor prompt changes.
  • A follow-up study could check whether LLM estimates predict actual task success better than the self-ratings used here.
  • Teams that adopt such tools should periodically re-validate estimates against fresh objective measures rather than treating them as static.

Load-bearing premise

Self-reported skill ratings from the 27 participants accurately reflect their true domain knowledge and can serve as reliable ground truth for measuring LLM estimation error.

What would settle it

Replace the self-reported ratings with objective skill-test scores on the same topics for the same participants and re-measure the LLMs' mean absolute error.

Figures

Figures reproduced from arXiv: 2605.22971 by Ko Watanabe, Shoya Ishimaru.

Figure 1
Figure 1. Figure 1: Concept image of this study. This study aims to estimate the domain knowledge of a person from their communication logs [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed workflow. The first step is to export Slack communication logs as JSON files. Then, the backend [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Number of messages per user (UID). The quantity varies among users, so we examine the impact of this variation on estimation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLMs prompt template used to estimate domain knowledge from Slack logs during evaluation. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Web application user interface (UI) for self-annotation. Users were instructed to self-rate their skills, which were displayed on [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Experiment procedure. Participants first listen to the experiment instructions from the experiment conductor. The user then [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The number of words extracted from the communication logs for each participant. The bar chart shows the number of words [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance comparison of LLMs domain knowledge estimation. The bar chart shows the mean absolute error ( [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
read the original abstract

Employees often struggle to identify ``who knows what,'' leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs can infer individual domain knowledge from Slack logs, evaluated via zero-shot estimates from seven models (Gemini, Claude, GPT families) against self-reported skill ratings from 27 of 43 users whose 27,188 messages were analyzed. Gemini 2.5 Flash achieves the lowest MAE of 21.13%, GPT models perform worse, accuracy depends only weakly on message volume, and the work concludes this demonstrates feasibility and limits of automated expertise mapping while calling for privacy-preserving approaches.

Significance. If self-reports validly proxy true knowledge, the concrete MAE benchmarks and the negative result on message volume would provide useful empirical grounding for LLM-based expertise estimation in organizations. The study supplies a direct model comparison on real logs, which could serve as a baseline for future work on structure-aware knowledge representations.

major comments (2)
  1. [Evaluation] Evaluation (as described in the abstract): The central claim that LLMs infer 'individual domain knowledge' rests on comparing model outputs to self-reported skill ratings from 27 participants, yet no external validation (objective tests, peer ratings, or supervisor assessments) is reported. Self-assessments are known to be biased, so the MAE of 21.13% quantifies agreement with self-perception rather than accuracy of knowledge estimation; this directly undermines the feasibility conclusion.
  2. [Methods] Methods: Participant selection, domain definitions, and prompting details for the zero-shot estimates are not specified, preventing assessment of whether the 27 self-reports cover the same knowledge dimensions used in the LLM prompts or whether the reported model differences are robust.
minor comments (2)
  1. [Abstract] Clarify whether model estimates were produced only for the 27 participants who provided self-reports or for all 43 users, and report the exact number of messages per participant used in each estimate.
  2. [Results] Add error bars, confidence intervals, or statistical tests for the MAE comparisons across models to support statements about significant differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below, acknowledging where revisions are needed to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The central claim that LLMs infer 'individual domain knowledge' rests on comparing model outputs to self-reported skill ratings from 27 participants, yet no external validation (objective tests, peer ratings, or supervisor assessments) is reported. Self-assessments are known to be biased, so the MAE of 21.13% quantifies agreement with self-perception rather than accuracy of knowledge estimation; this directly undermines the feasibility conclusion.

    Authors: We agree that self-reports are an imperfect proxy subject to well-documented biases and that the reported MAE reflects agreement with self-perception rather than objective accuracy. The manuscript will be revised to explicitly frame the results as measuring concordance with self-assessed knowledge, to add a dedicated limitations subsection on this point, and to moderate the feasibility claims to reflect this scope. No external validation data were collected, so we cannot claim stronger validity. revision: yes

  2. Referee: [Methods] Participant selection, domain definitions, and prompting details for the zero-shot estimates are not specified, preventing assessment of whether the 27 self-reports cover the same knowledge dimensions used in the LLM prompts or whether the reported model differences are robust.

    Authors: The full manuscript contains a Methods section describing recruitment from a single organization, the ten predefined technical domains, and the zero-shot prompt template. However, we acknowledge that these details are insufficiently explicit for reproducibility. We will expand the section with participant demographics, exact domain-question mappings, and full prompt examples to allow direct comparison between self-report items and LLM inputs. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper reports an empirical comparison of LLM zero-shot outputs to self-reported skill ratings from 27 participants, with no equations, parameter fitting, or derivations present in the provided text. The MAE values and feasibility conclusions are direct measurements against independent participant data rather than reductions by construction. No self-citations are invoked as load-bearing premises, and no ansatz, uniqueness theorems, or renamings of known results appear. The evaluation chain is self-contained against the stated ground truth.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is an empirical benchmarking study; it relies on standard domain assumptions about ground truth and signal presence in logs but introduces no fitted parameters, new entities, or ad-hoc axioms beyond those typical for LLM evaluation.

axioms (2)
  • domain assumption Self-reported skill ratings accurately reflect participants' true domain knowledge.
    The entire evaluation pipeline treats these ratings as the reference standard against which LLM outputs are measured.
  • domain assumption Slack communication logs contain extractable signals about users' domain expertise.
    This premise underpins the decision to use message logs as input for knowledge estimation.

pith-pipeline@v0.9.0 · 5665 in / 1346 out tokens · 36401 ms · 2026-05-25T05:50:41.565789+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

  1. [1]

    2025.Claude Haiku 4.5

    Anthropic. 2025.Claude Haiku 4.5. https://www.anthropic.com/news/claude-haiku-4-5 AI language model

  2. [2]

    2025.Claude Sonnet 4.5

    Anthropic. 2025.Claude Sonnet 4.5. https://www.anthropic.com/news/claude-sonnet-4-5 AI language model

  3. [3]

    Sasa Arsovski, Hasmik Osipyan, Muniru Idris Oladele, and Adrian David Cheok. 2019. Automatic knowledge extraction of any Chatbot from conversation.Expert Systems with Applications137 (2019), 343–348

  4. [4]

    Jacques Bughin, Michael Chui, and James Manyika. 2012. Capturing business value with social technologies.McKinsey Quarterly4, 1 (2012), 72–80

  5. [5]

    Liuqing Chen, Zhaojun Jiang, Duowei Xia, Zebin Cai, Lingyun Sun, Peter Childs, and Haoyu Zuo. 2024. BIDTrainer: An LLMs-driven Education Tool for Enhancing the Understanding and Reasoning in Bio-inspired Design. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, ...

  6. [6]

    Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. 2023. The future landscape of large language models in medicine. Communications medicine3, 1 (2023), 141

  7. [7]

    Elnara Galimzhanova, Cristina Ioana Muntean, Franco Maria Nardini, Raffaele Perego, and Guido Rocchietti. 2023. Rewriting conversational utterances with instructed large language models. In2023 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT). IEEE, 56–63

  8. [8]

    2025.Firebase

    Google. 2025.Firebase. Technical Report. Google. https://firebase.google.com/ Firebase is a backend-as-a-service platform for building web and mobile applications

  9. [9]

    2025.Gemini 2.5 Flash

    DeepMind & Google. 2025.Gemini 2.5 Flash. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash Multimodal large-language model

  10. [10]

    2025.Gemini 2.5 Pro

    DeepMind & Google. 2025.Gemini 2.5 Pro. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro Multimodal large-language model

  11. [11]

    Xiaodong Gu, Kang Min Yoo, and Jung-Woo Ha. 2021. Dialogbert: Discourse-aware response generation via learning to recover and rank utterances. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 12911–12919

  12. [12]

    Jizhou Huang, Ming Zhou, and Dan Yang. 2007. Extracting chatbot knowledge from online discussion forums. InProceedings of the 20th International Joint Conference on Artifical Intelligence(Hyderabad, India)(IJCAI’07). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 423–428

  13. [13]

    Samuel Kernan Freire, Chaofan Wang, Santiago Ruiz-Arenas, and Evangelos Niforatos. 2023. Tacit Knowledge Elicitation for Shop-floor Workers with an Intelligent Assistant. InExtended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI EA ’23). Association for Computing Machinery, New York, NY, USA, Article 266, ...

  14. [14]

    Ksenija Kosilova and Ilze Birzniece. 2024. Survey on Organizational Chat Conversation Analysis: Exploring Dialogue Summarization from a Knowledge Discovery Perspective.Complex Systems Informatics and Modeling Quarterly39 (2024), 86–104

  15. [15]

    Rajeev Kumar, Harishankar Kumar, and Kumari Shalini. 2025. Leveraging Knowledge Graphs and LLMs for Context-Aware Messaging.arXiv preprint arXiv:2503.13499(2025)

  16. [16]

    Yuhan Liu, Aadit Shah, Jordan Ackerman, and Manaswi Saha. 2025. Exploring the Design Space of Real-time LLM Knowledge Support Systems: A Case Study of Jargon Explanations. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 633, 20 pages. doi:10.1145/3706...

  17. [17]

    Ryugo Morita, Ko Watanabe, Jinjia Zhou, Andreas Dengel, and Shoya Ishimaru. 2025. GenAIReading: Augmenting Human Cognition with Interactive Digital Textbooks Using Large Language Models and Image Generation Models. InProceedings of the Augmented Humans International Conference 2025 (AHs ’25). Association for Computing Machinery, New York, NY, USA, 289–301...

  18. [18]

    Ummara Mumtaz, Awais Ahmed, and Summaya Mumtaz. 2023. LLMs-Healthcare: Current applications and challenges of large language models in various medical specialties.arXiv preprint arXiv:2311.12882(2023)

  19. [19]

    Kotaro Oomori, Yoshio Ishiguro, and Jun Rekimoto. 2024. SkillsInterpreter: A Case Study of Automatic Annotation of Flowcharts to Support Browsing Instructional Videos in Modern Martial Arts using Large Language Models. InProceedings of the Augmented Humans International Conference 2024 (Melbourne, VIC, Australia)(AHs ’24). Association for Computing Machin...

  20. [20]

    2024.GPT-4o System Card

    OpenAI. 2024.GPT-4o System Card. Technical Report. OpenAI. https://openai.com/ja-JP/index/hello-gpt-4o/ System card describing the GPT-4o omni model

  21. [21]

    2025.GPT-5 System Card

    OpenAI. 2025.GPT-5 System Card. Technical Report. OpenAI. https://openai.com/ja-JP/index/introducing-gpt-5/ System card describing the GPT-5 model family

  22. [22]

    2025.OpenAI o3 and o4-mini System Card

    OpenAI. 2025.OpenAI o3 and o4-mini System Card. Technical Report. OpenAI. https://openai.com/ja-JP/index/o3-o4-mini-system-card/ System card describing the OpenAI o3 and o4-mini reasoning models

  23. [23]

    2018.Workplace Knowledge and Productivity Report

    Panopto and YouGov. 2018.Workplace Knowledge and Productivity Report. Technical Report. Panopto. https://www.panopto.com/resource/ebook/ valuing-workplace-knowledge/ Survey of 1 001 US employees in organisations with 200+ employees; average large US business loses US$47 million annually from inefficient knowledge sharing

  24. [24]

    Jun Rekimoto. 2025. GazeLLM: Multimodal LLMs incorporating Human Visual Attention. InProceedings of the Augmented Humans International Conference 2025 (AHs ’25). Association for Computing Machinery, New York, NY, USA, 302–311. doi:10.1145/3745900.3746075 Manuscript submitted to ACM 18 Watanabe et al

  25. [25]

    Joni Salminen, Danial Amin, Soon-Gyo Jung, and Bernard Jansen. 2025. The Use of Large Language Models in HCI: A Critical Analysis of Synthetic Users. InProceedings of the Augmented Humans International Conference 2025 (AHs ’25). Association for Computing Machinery, New York, NY, USA, 413–417. doi:10.1145/3745900.3746108

  26. [26]

    Haruki Suzawa, Ko Watanabe, Andreas Dengel, and Shoya Ishimaru. 2025. Augmenting Online Meetings with Context-Aware Real-time Music Generation. InProceedings of the Augmented Humans International Conference 2025 (AHs ’25). Association for Computing Machinery, New York, NY, USA, 487–490. doi:10.1145/3745900.3746116

  27. [27]

    Hirotaka Takita, Shannon L Walston, Yasuhito Mitsuyama, Ko Watanabe, Shoya Ishimaru, and Daiju Ueda. 2025. Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan.Japanese Journal of Radiology(2025), 1–11

  28. [28]

    Anna Tigunova. 2020. Extracting Personal Information from Conversations. InCompanion Proceedings of the Web Conference 2020(Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 284–288. doi:10.1145/3366424.3382089

  29. [29]

    Anthony J Trippe. 2022. Knowledge management with patents: A systematic method for capturing and sharing unique technical information for Corporate guidance.World Patent Information69 (2022), 102110

  30. [30]

    Xiaomei Wang and Xiaoyu Chen. 2024. Towards human-ai mutual learning: A new research paradigm.arXiv preprint arXiv:2405.04687(2024)

  31. [31]

    2018.The future of work: Robots, AI, and automation

    Darrell M West. 2018.The future of work: Robots, AI, and automation. Bloomsbury Publishing USA

  32. [32]

    Tongshuang Wu, Haiyi Zhu, Maya Albayrak, Alexis Axon, Amanda Bertsch, Wenxing Deng, Ziqi Ding, Boyuan Guo, Sireesh Gururaja, Tzu-Sheng Kuo, Jenny T Liang, Ryan Liu, Ihita Mandal, Jeremiah Milbauer, Xiaolin Ni, Namrata Padmanabhan, Subhashini Ramkumar, Alexis Sudjianto, Jordan Taylor, Ying-Jui Tseng, Patricia Vaidos, Zhijin Wu, Wei Wu, and Chenyang Yang. 2...

  33. [33]

    Kanta Yamaoka, Ko Watanabe, Koichi Kise, Andreas Dengel, and Shoya Ishimaru. 2023. Experience is the Best Teacher: Personalized Vocabulary Building Within the Context of Instagram Posts and Sentences from GPT-3. InAdjunct Proceedings of the 2022 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2022 ACM International Symposi...

  34. [34]

    Kanta Yamaoka, Ko Watanabe, Koichi Kise, Andreas Dengel, and Shoya Ishimaru. 2025. Img2Vocab: Explore Words Tied to Your Life With LLMs and Social Media Images.IEEE Access13 (2025), 20456–20471. doi:10.1109/ACCESS.2025.3533076

  35. [35]

    Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. 2024. Harnessing the power of llms in practice: A survey on chatgpt and beyond.ACM Transactions on Knowledge Discovery from Data18, 6 (2024), 1–32

  36. [36]

    Ruike Zhang, Yuan Tian, Penghui Wei, Daniel Dajun Zeng, and Wenji Mao. 2024. An LLM-enabled knowledge elicitation and retrieval framework for zero-shot cross-lingual stance identification. InFindings of the Association for Computational Linguistics: EMNLP 2024. 12253–12266

  37. [37]

    Shuo Zhang, Mu-Chun Wang, and Krisztian Balog. 2022. Analyzing and simulating user utterance reformulation in conversational recommender systems. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 133–143

  38. [38]

    Xinghua Zhang, Haiyang Yu, Yongbin Li, Minzheng Wang, Longze Chen, and Fei Huang. 2024. The imperative of conversation analysis in the era of llms: A survey of tasks, techniques, and trends.arXiv preprint arXiv:2409.14195(2024)

  39. [39]

    Songchen Zhou, Mark Armstrong, Giulia Barbareschi, Toshihiro Ajioka, Zheng Hu, Masatane Muto, Ando Ryoichi, Kentaro Yoshifuji, and Kouta Minamizawa. 2025. Augmented Body Communicator: Enhancing daily body expression for people with upper limb limitations through LLM and a robotic arm. InProceedings of the Augmented Humans International Conference 2025 (AH...

  40. [40]

    2019.A Model to Forecast Future Paradigms: Volume 1: Introduction to Knowledge Is Power in Four Dimensions

    Bahman Zohuri and Farhang Mossavar-Rahmani. 2019.A Model to Forecast Future Paradigms: Volume 1: Introduction to Knowledge Is Power in Four Dimensions. Apple academic press. Manuscript submitted to ACM