Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3
The pith
A collaborative multi-agent system generates murder mystery scripts to improve vision-language models' reasoning with deceptive and incomplete information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their collaborative multi-agent framework for synthesizing role-driven multiplayer game scripts, combined with chain-of-thought fine-tuning on uncertainty-modeling data and GRPO-based reinforcement learning with agent-monitored reward shaping, produces substantial gains in VLMs' narrative reasoning, hidden-fact extraction, and deception-resilient understanding in murder-mystery scenarios.
What carries the argument
The collaborative multi-agent script generation framework that coordinates agent interactions to create character-identity-specific multimodal contexts and reasoning chains, which then supply the data and reward signals for the two-stage training procedure.
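The two training signals named here are standard enough to sketch. GRPO scores each sampled response relative to the other samples for the same prompt, and "agent-monitored reward shaping" implies a monitor score folded into the task reward. A minimal sketch under those assumptions (the `alpha` weighting and all function names are illustrative, not taken from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and (population) std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def shaped_reward(task_reward, monitor_score, alpha=0.5):
    """Illustrative agent-monitored shaping: add a weighted score from a
    monitoring agent (e.g., one rating character-consistent reasoning) to
    the base task reward. The combination rule is an assumption, not the
    paper's stated method."""
    return task_reward + alpha * monitor_score
```

With four sampled answers rewarded [1, 0, 1, 0], the correct answers receive advantage close to +1 and the incorrect ones close to -1, so the policy update pushes probability mass toward the group's better responses.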
If this is right
- VLMs develop improved ability to extract hidden facts from narratives that include deliberate deception and partial clues.
- The generated scripts allow fine-grained control over uncertainty levels and role-based intentions in the training data.
- Agent-monitored reinforcement learning produces models that exhibit character-specific reasoning patterns during inference.
- The method supplies a scalable route for constructing both training sets and evaluation benchmarks for multimodal multi-hop reasoning under imperfect information.
Where Pith is reading between the lines
- The same multi-agent script-generation loop could be applied to other imperfect-information domains such as legal evidence analysis or medical case review with conflicting reports.
- If the deception-handling skills transfer, the trained models may perform better on real-world tasks like spotting misleading claims in social media threads or news reports.
- Extending the framework to generate scripts for non-murder games such as negotiation or espionage scenarios would test broader applicability without changing the core machinery.
Load-bearing premise
The synthetic scripts and agent-monitored rewards are assumed to capture the essential structure of real-world imperfect information and deception so that gains transfer to other multimodal multi-hop reasoning tasks.
What would settle it
Train the model on the generated scripts, then test it on a fresh collection of human-written murder mystery games or another imperfect-information task; if accuracy shows no improvement over an identically trained baseline without the multi-agent scripts, the central claim does not hold.
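The settling experiment is a paired comparison on shared held-out items, so the "no improvement" verdict needs an interval around the accuracy gain, not a point estimate. A minimal scoring sketch with hypothetical names (the paired bootstrap is a common choice for this, not a procedure the paper specifies):

```python
import random

def accuracy(preds, golds):
    """Fraction of exact-match answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def bootstrap_gain_ci(correct_a, correct_b, n_boot=2000, seed=0):
    """Paired-bootstrap 95% CI for the accuracy gain of system A over B
    on the same held-out items (entries are 1 = correct, 0 = wrong).
    If the interval contains 0, the gain is indistinguishable from noise."""
    rng = random.Random(seed)
    n = len(correct_a)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot)]
```

Running this over per-item correctness of the multi-agent-trained model versus the identically trained baseline on human-written mysteries directly operationalizes the falsification criterion.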
Figures
Original abstract
Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a collaborative multi-agent framework for generating role-driven scripts and multimodal contexts for Murder Mystery Games to improve VLMs' reasoning under imperfect and deceptive information. It describes generating character-specific clues and multi-hop chains via agent interactions, followed by a two-stage training process: (1) chain-of-thought fine-tuning on curated synthetic datasets modeling uncertainty and deception, and (2) GRPO-based reinforcement learning using agent-monitored reward shaping to encourage character-aware inference. The central claim is that this yields significant gains in narrative reasoning, hidden fact extraction, and deception-resilient understanding, supported by extensive experiments.
Significance. If the empirical results hold, the work provides a scalable pipeline for synthesizing tailored training data and fine-tuning VLMs on multi-agent, multi-modal tasks involving deception and partial information. This could advance robust reasoning in socially complex domains and establish useful benchmarks for imperfect-information multi-hop inference. The agent-collaborative generation and monitoring mechanism is a constructive technical contribution.
Major comments (1)
- §4 (Experimental Evaluation): the central claim of 'significant boosts' in performance is load-bearing for the paper's contribution, yet the manuscript supplies no quantitative metrics, baseline comparisons, ablation results on the two training stages, or statistical details (e.g., standard deviations or significance tests), preventing evaluation of the magnitude and reliability of the reported gains.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of the collaborative multi-agent framework and two-stage training approach. We address the single major comment below and will incorporate the requested details in a revised manuscript.
Point-by-point responses
Referee: §4 (Experimental Evaluation): the central claim of 'significant boosts' in performance is load-bearing for the paper's contribution, yet the manuscript supplies no quantitative metrics, baseline comparisons, ablation results on the two training stages, or statistical details (e.g., standard deviations or significance tests), preventing evaluation of the magnitude and reliability of the reported gains.
Authors: We agree that the current version of §4 does not provide the quantitative metrics, baseline comparisons, ablation studies, or statistical details needed to fully substantiate the claims of significant performance gains. The manuscript currently focuses on describing the framework, data generation process, and high-level experimental observations without including specific numerical results or rigorous comparisons. In the revised manuscript we will expand §4 to report concrete performance metrics (e.g., accuracy on narrative reasoning, hidden-fact extraction, and deception-resilient inference tasks), direct comparisons against relevant VLM baselines and alternative training regimes, ablation results isolating the contributions of chain-of-thought fine-tuning versus GRPO reinforcement learning, and statistical details such as means, standard deviations across runs, and significance tests. These additions will be presented in tables and figures to enable precise evaluation of the reported improvements.
Revision: yes
Circularity Check
No significant circularity in claimed derivation
Full rationale
The paper describes a collaborative multi-agent framework for script generation followed by a two-stage training process (CoT fine-tuning on synthetic data and GRPO reinforcement learning with agent-monitored rewards). No equations, derivations, or parameter-fitting steps are presented that reduce by construction to the same inputs or self-citations. The central claims rest on experimental performance gains from externally generated scripts and standard RL procedures, which are independent of the method's own outputs. No load-bearing self-citation chains or ansatz smuggling are evident.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
Reference graph
Works this paper leans on
-
[1]
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey. Preprint, arXiv:2503.12605.
-
[2]
Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing. Advances in Neural Information Processing Systems, 37:128374–128395.
Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, and Swarnadeep Saha. 2025. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. Preprint, arXiv:2505.10320.
-
[3]
Qinglin Zhu, Runcong Zhao, Bin Liang, Jinhua Du, Lin Gui, and Yulan He. 2025. PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games. Preprint, arXiv:2404.17662.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges.
Appendix excerpts: agent prompt settings
Recovered fragments of the paper's appendix (Agent Prompt Settings), which specifies each agent's system prompt:
- OutlineAgent (Figure 9): generates a story outline in JSON with four fields: Title (a concise title summarizing the story), Characters (the key characters involved, excluding the victim), Timeline of Events (a chronological account of each character's actions on the day of the incident), and Background Stories (a clear, engaging background for each character outlining their motives, secrets, and past experiences leading up to the murder).
- CharacterAgent (Figure 10): emits, per character, a JSON object with name (the character's name), back (a detailed, immersive first-person account of the character's background and actions on the day of the crime, conveying emotions, motives, and perceptions, with `[]` marking any related image clues in text), and m (0 or 1, whether the character is the murderer).
- CriticAgent: evaluates the script's outline and character details against four criteria: plot complexity (intricacy, twists, suspense, depth), character development (well-defined characters with clear motivations, actions aligned with personality, strong connection to the plot), difficulty level (how challenging the mystery is to solve and how engaging the investigation's obstacles are), and logical rationality (conflicts or irrationality in a character's behavior across the time sequence). It returns a JSON object with an "evaluation" block covering those four criteria and a "feedback" block with suggestions for outline improvement and for character-detail improvement.
- ClueAgent (text clues): produces three to five single-sentence text clues describing crime-scene information such as observations, evidence, environmental details, or other background clues that could lead to the incident; clues must not directly reveal or explicitly identify the murderer (example: "There are a lot of ice cubes in the cabinets in the game room").
- ClueAgent (image clues, Figure 12): first attempts AI-generated image creation with a text-to-image model, requiring the image to accurately represent the clue description; for clues with a logical or relational structure (timelines, suspect connections, evidence charts) it generates well-formed XML diagram code that is easy to convert to a visual format; as a fallback it performs a web image search, evaluating retrieved images for semantic similarity and visual accuracy and justifying any selection; the final output is the best available image, any XML representation, and a justification of the choices made (example image clue: "A heavy, marble globe bookend lies on the carpet near the victim's desk, stained dark red. Its matching pair sits undisturbed on the desk corner").
- QaAgent: first builds a multi-hop clue pool by aggregating global information, including all role-scripts and the direct textual and image-based clues produced by the ClueAgent; this pool is the foundation for generating more complex, multi-step reasoning question-answer pairs. It then creates questions tailored to test different VLM capabilities, including Long Script QA derived from all role-scripts, which challenges the VLM to comprehend and reason across extensive narrative contexts.
- RoleplayAgent, innocent roles (Figures 18-19): the goals are to identify the murderer responsible for the victim's death; reconstruct the crime method (how the murder was committed, how any secret room was created, and how the murderer built an alibi); infer the motive behind the murder; and, if the victim was found in a locked-room situation with no suspects entering, deduce how the killing was possible under those constraints. No assumptions are allowed: secret tunnels, unknown poisons, or identity swaps may not be imagined without explicit supporting clues. The agent strategically asks questions of other players to uncover contradictions or new insights and shares relevant clues from its character's knowledge, with a strict JSON output of the form {"question": "...", "clues share": "..."}.
- RoleplayAgent, murderer role (Figure 20): avoids suspicion by acting like a cooperative player and maintaining a consistent character background; diverts suspicion toward other players by asking strategic but misleading questions, highlighting inconsistencies in others' statements, and selectively sharing real or partial clues to frame others; never reveals or hints at its true role; and, for locked-room mysteries or alibi verification, subtly guides others away from the real explanation without making obviously false claims. It shares carefully chosen clues that seem helpful but ultimately create doubt or confusion, using the same strict JSON output format.
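The RoleplayAgent prompts demand a strictly formatted JSON turn with "question" and "clues share" fields, so any pipeline consuming these turns has to validate them before use. A minimal validator sketch (the field names come from the prompt excerpts; the parser itself is hypothetical, not part of the paper):

```python
import json

# Field names taken from the RoleplayAgent output format in the excerpts.
REQUIRED_FIELDS = ("question", "clues share")

def parse_turn(raw):
    """Parse and validate one RoleplayAgent turn. Returns the dict on
    success; raises ValueError on malformed or incomplete output."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return obj
```

Rejecting malformed turns at this boundary keeps a single off-format model response from silently corrupting the generated game transcript.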