Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3
The pith
A collaborative multi-agent system generates murder mystery scripts to improve vision-language models' reasoning with deceptive and incomplete information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their collaborative multi-agent framework for synthesizing role-driven multiplayer game scripts, combined with chain-of-thought fine-tuning on uncertainty-modeling data and GRPO-based reinforcement learning with agent-monitored reward shaping, produces substantial gains in VLMs' narrative reasoning, hidden-fact extraction, and deception-resilient understanding in murder-mystery scenarios.
What carries the argument
The collaborative multi-agent script generation framework that coordinates agent interactions to create character-identity-specific multimodal contexts and reasoning chains, which then supply the data and reward signals for the two-stage training procedure.
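The two training signals named here are standard enough to sketch. GRPO scores each sampled response relative to the other samples for the same prompt, and "agent-monitored reward shaping" implies a monitor score folded into the task reward. A minimal sketch under those assumptions (the `alpha` weighting and all function names are illustrative, not taken from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and (population) std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def shaped_reward(task_reward, monitor_score, alpha=0.5):
    """Illustrative agent-monitored shaping: add a weighted score from a
    monitoring agent (e.g., one rating character-consistent reasoning) to
    the base task reward. The combination rule is an assumption, not the
    paper's stated method."""
    return task_reward + alpha * monitor_score
```

With four sampled answers rewarded [1, 0, 1, 0], the correct answers receive advantage close to +1 and the incorrect ones close to -1, so the policy update pushes probability mass toward the group's better responses.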
If this is right
- VLMs develop improved ability to extract hidden facts from narratives that include deliberate deception and partial clues.
- The generated scripts allow fine-grained control over uncertainty levels and role-based intentions in the training data.
- Agent-monitored reinforcement learning produces models that exhibit character-specific reasoning patterns during inference.
- The method supplies a scalable route for constructing both training sets and evaluation benchmarks for multimodal multi-hop reasoning under imperfect information.
Where Pith is reading between the lines
- The same multi-agent script-generation loop could be applied to other imperfect-information domains such as legal evidence analysis or medical case review with conflicting reports.
- If the deception-handling skills transfer, the trained models may perform better on real-world tasks like spotting misleading claims in social media threads or news reports.
- Extending the framework to generate scripts for non-murder games such as negotiation or espionage scenarios would test broader applicability without changing the core machinery.
Load-bearing premise
The synthetic scripts and agent-monitored rewards are assumed to capture the essential structure of real-world imperfect information and deception so that gains transfer to other multimodal multi-hop reasoning tasks.
What would settle it
Train the model on the generated scripts, then test it on a fresh collection of human-written murder mystery games or another imperfect-information task; if accuracy shows no improvement over an identically trained baseline without the multi-agent scripts, the central claim does not hold.
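The settling experiment is a paired comparison on shared held-out items, so the "no improvement" verdict needs an interval around the accuracy gain, not a point estimate. A minimal scoring sketch with hypothetical names (the paired bootstrap is a common choice for this, not a procedure the paper specifies):

```python
import random

def accuracy(preds, golds):
    """Fraction of exact-match answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def bootstrap_gain_ci(correct_a, correct_b, n_boot=2000, seed=0):
    """Paired-bootstrap 95% CI for the accuracy gain of system A over B
    on the same held-out items (entries are 1 = correct, 0 = wrong).
    If the interval contains 0, the gain is indistinguishable from noise."""
    rng = random.Random(seed)
    n = len(correct_a)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gains.sort()
    return gains[int(0.025 * n_boot)], gains[int(0.975 * n_boot)]
```

Running this over per-item correctness of the multi-agent-trained model versus the identically trained baseline on human-written mysteries directly operationalizes the falsification criterion.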
Figures
Original abstract
Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a collaborative multi-agent framework for generating role-driven scripts and multimodal contexts for Murder Mystery Games to improve VLMs' reasoning under imperfect and deceptive information. It describes generating character-specific clues and multi-hop chains via agent interactions, followed by a two-stage training process: (1) chain-of-thought fine-tuning on curated synthetic datasets modeling uncertainty and deception, and (2) GRPO-based reinforcement learning using agent-monitored reward shaping to encourage character-aware inference. The central claim is that this yields significant gains in narrative reasoning, hidden fact extraction, and deception-resilient understanding, supported by extensive experiments.
Significance. If the empirical results hold, the work provides a scalable pipeline for synthesizing tailored training data and fine-tuning VLMs on multi-agent, multi-modal tasks involving deception and partial information. This could advance robust reasoning in socially complex domains and establish useful benchmarks for imperfect-information multi-hop inference. The agent-collaborative generation and monitoring mechanism is a constructive technical contribution.
Major comments (1)
- §4 (Experimental Evaluation): the central claim of 'significant boosts' in performance is load-bearing for the paper's contribution, yet the manuscript supplies no quantitative metrics, baseline comparisons, ablation results on the two training stages, or statistical details (e.g., standard deviations or significance tests), preventing evaluation of the magnitude and reliability of the reported gains.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of the collaborative multi-agent framework and two-stage training approach. We address the single major comment below and will incorporate the requested details in a revised manuscript.
Point-by-point responses
Referee: §4 (Experimental Evaluation): the central claim of 'significant boosts' in performance is load-bearing for the paper's contribution, yet the manuscript supplies no quantitative metrics, baseline comparisons, ablation results on the two training stages, or statistical details (e.g., standard deviations or significance tests), preventing evaluation of the magnitude and reliability of the reported gains.
Authors: We agree that the current version of §4 does not provide the quantitative metrics, baseline comparisons, ablation studies, or statistical details needed to fully substantiate the claims of significant performance gains. The manuscript currently focuses on describing the framework, data generation process, and high-level experimental observations without including specific numerical results or rigorous comparisons. In the revised manuscript we will expand §4 to report concrete performance metrics (e.g., accuracy on narrative reasoning, hidden-fact extraction, and deception-resilient inference tasks), direct comparisons against relevant VLM baselines and alternative training regimes, ablation results isolating the contributions of chain-of-thought fine-tuning versus GRPO reinforcement learning, and statistical details such as means, standard deviations across runs, and significance tests. These additions will be presented in tables and figures to enable precise evaluation of the reported improvements.
Revision: yes
Circularity Check
No significant circularity in claimed derivation
Full rationale
The paper describes a collaborative multi-agent framework for script generation followed by a two-stage training process (CoT fine-tuning on synthetic data and GRPO reinforcement learning with agent-monitored rewards). No equations, derivations, or parameter-fitting steps are presented that reduce by construction to the same inputs or self-citations. The central claims rest on experimental performance gains from externally generated scripts and standard RL procedures, which are independent of the method's own outputs. No load-bearing self-citation chains or ansatz smuggling are evident.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval
ReTrack calibrates directional bias in composed video features using semantic disentanglement and bidirectional evidence alignment to improve retrieval performance on CVR and CIR tasks.
Reference graph
Works this paper leans on
-
[1]
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey. Preprint, arXiv:2503.12605.
-
[2]
Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing. Advances in Neural Information Processing Systems, 37:128374–128395.
Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, and Swarnadeep Saha. 2025. J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning. Preprint, arXiv:2505.10320.
-
[3]
Qinglin Zhu, Runcong Zhao, Bin Liang, Jinhua Du, Lin Gui, and Yulan He. 2025. PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games. Preprint, arXiv:2404.17662.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges.
Appendix excerpts: agent prompt settings
Recovered fragments of the paper's appendix (Agent Prompt Settings), which specifies each agent's system prompt:
- OutlineAgent (Figure 9): generates a story outline in JSON with four fields: Title (a concise title summarizing the story), Characters (the key characters involved, excluding the victim), Timeline of Events (a chronological account of each character's actions on the day of the incident), and Background Stories (a clear, engaging background for each character outlining their motives, secrets, and past experiences leading up to the murder).
- CharacterAgent (Figure 10): emits, per character, a JSON object with name (the character's name), back (a detailed, immersive first-person account of the character's background and actions on the day of the crime, conveying emotions, motives, and perceptions, with `[]` marking any related image clues in text), and m (0 or 1, whether the character is the murderer).
- CriticAgent: evaluates the script's outline and character details against four criteria: plot complexity (intricacy, twists, suspense, depth), character development (well-defined characters with clear motivations, actions aligned with personality, strong connection to the plot), difficulty level (how challenging the mystery is to solve and how engaging the investigation's obstacles are), and logical rationality (conflicts or irrationality in a character's behavior across the time sequence). It returns a JSON object with an "evaluation" block covering those four criteria and a "feedback" block with suggestions for outline improvement and for character-detail improvement.
- ClueAgent (text clues): produces three to five single-sentence text clues describing crime-scene information such as observations, evidence, environmental details, or other background clues that could lead to the incident; clues must not directly reveal or explicitly identify the murderer (example: "There are a lot of ice cubes in the cabinets in the game room").
- ClueAgent (image clues, Figure 12): first attempts AI-generated image creation with a text-to-image model, requiring the image to accurately represent the clue description; for clues with a logical or relational structure (timelines, suspect connections, evidence charts) it generates well-formed XML diagram code that is easy to convert to a visual format; as a fallback it performs a web image search, evaluating retrieved images for semantic similarity and visual accuracy and justifying any selection; the final output is the best available image, any XML representation, and a justification of the choices made (example image clue: "A heavy, marble globe bookend lies on the carpet near the victim's desk, stained dark red. Its matching pair sits undisturbed on the desk corner").
- QaAgent: first builds a multi-hop clue pool by aggregating global information, including all role-scripts and the direct textual and image-based clues produced by the ClueAgent; this pool is the foundation for generating more complex, multi-step reasoning question-answer pairs. It then creates questions tailored to test different VLM capabilities, including Long Script QA derived from all role-scripts, which challenges the VLM to comprehend and reason across extensive narrative contexts.
- RoleplayAgent, innocent roles (Figures 18-19): the goals are to identify the murderer responsible for the victim's death; reconstruct the crime method (how the murder was committed, how any secret room was created, and how the murderer built an alibi); infer the motive behind the murder; and, if the victim was found in a locked-room situation with no suspects entering, deduce how the killing was possible under those constraints. No assumptions are allowed: secret tunnels, unknown poisons, or identity swaps may not be imagined without explicit supporting clues. The agent strategically asks questions of other players to uncover contradictions or new insights and shares relevant clues from its character's knowledge, with a strict JSON output of the form {"question": "...", "clues share": "..."}.
- RoleplayAgent, murderer role (Figure 20): avoids suspicion by acting like a cooperative player and maintaining a consistent character background; diverts suspicion toward other players by asking strategic but misleading questions, highlighting inconsistencies in others' statements, and selectively sharing real or partial clues to frame others; never reveals or hints at its true role; and, for locked-room mysteries or alibi verification, subtly guides others away from the real explanation without making obviously false claims. It shares carefully chosen clues that seem helpful but ultimately create doubt or confusion, using the same strict JSON output format.
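The RoleplayAgent prompts demand a strictly formatted JSON turn with "question" and "clues share" fields, so any pipeline consuming these turns has to validate them before use. A minimal validator sketch (the field names come from the prompt excerpts; the parser itself is hypothetical, not part of the paper):

```python
import json

# Field names taken from the RoleplayAgent output format in the excerpts.
REQUIRED_FIELDS = ("question", "clues share")

def parse_turn(raw):
    """Parse and validate one RoleplayAgent turn. Returns the dict on
    success; raises ValueError on malformed or incomplete output."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return obj
```

Rejecting malformed turns at this boundary keeps a single off-format model response from silently corrupting the generated game transcript.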