pith. machine review for the scientific record.

arxiv: 2510.19028 · v3 · submitted 2025-10-21 · 💻 cs.CL

Recognition: unknown

Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues

Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Doğruöz, Alice Oh, Najoung Kim

keywords: social, llms, reasoning, korean, english, models, scripts, capabilities

As LLMs are increasingly deployed in real-world interactions, their social reasoning in interpersonal communication becomes critical. To explore these capabilities, we introduce SCRIPTS, a 1.1k-dialogue dataset in English and Korean sourced from movie scripts, and propose a social reasoning task based on SCRIPTS that evaluates the capacity of LLMs to infer the social relationships (e.g., friends, lovers) between speakers in each dialogue. We evaluate nine models on our task and find that current LLMs achieve around 75–80% on the English dataset and 58–69% on the Korean dataset, and that models predict an Unlikely relationship in 10–25% of responses in both languages. Furthermore, we find that thinking models and chain-of-thought prompting provide minimal benefits for social reasoning and occasionally amplify social biases. In sum, current LLMs show significant limitations in social reasoning, especially for Korean, highlighting the need for efforts to develop socially aware LLMs across languages.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.