Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

arxiv: 2605.31354 · v1 · pith:APS26XOLnew · submitted 2026-05-29 · 💻 cs.AI · cs.LG

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou This is my paper

Pith reviewed 2026-06-28 22:18 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords shared-state collaborationnoise reinforcementpolicy collapsevisual agentshallucinationsCoSee frameworkresource-constrained modelsdocument visual question answering

0 comments

The pith

Naive shared workspaces amplify hallucinations in resource-constrained visual agents rather than resolving them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how shared working memory affects collaboration among small visual reasoning models in tasks like document question answering. It finds that without careful checks, sharing intermediate notes often makes errors worse by letting ungrounded information spread and by pushing the models to give vague responses. The authors introduce an auditing method to track how information flows through read, write, and verify steps. This reveals that the key issue is maintaining accurate communication between parts of the system rather than needing deeper reasoning. Readers would care because many practical AI setups use small models and shared states, so understanding these breakdowns helps design better systems.

Core claim

Modular visual reasoning systems with weak learners rely on shared working memory, but this leads to noise accumulation where ungrounded notes reinforce hallucinations and added context causes policy collapse to short-form answers. The CoSee framework audits the read-write-verify loop to trace failures across benchmarks, showing that increased compute without verification can degrade performance and that the bottleneck is communication fidelity.

What carries the argument

The CoSee auditing framework, which formalizes the read-write-verify loop to trace information flow in collaborative visual reasoning.

If this is right

Ungrounded notes in shared workspaces get reused as evidence, amplifying hallucinations.
Added context from sharing shifts models toward under-specified, short-form answers.
Increased compute can correlate negatively with performance without explicit verification.
The primary bottleneck for resource-constrained agents is communication fidelity rather than reasoning depth.
Trace-level diagnostics from the auditing method provide a baseline for reliable modular design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Verification steps should be prioritized when designing shared memory for multi-agent visual systems.
The identified failure modes may appear in other collaborative setups beyond visual agents.
Adding explicit checks could reverse the negative correlation between compute and performance.
Testing the framework on larger models would show if the issues persist or change.

Load-bearing premise

The introduced CoSee auditing framework and its read-write-verify loop faithfully capture the actual information flow and failure dynamics without introducing its own artifacts or selection effects in the multi-page, chart, and web benchmarks.

What would settle it

Applying the CoSee framework to a new set of multi-page and chart benchmarks and observing neither noise reinforcement nor policy collapse under naive sharing.

Figures

Figures reproduced from arXiv: 2605.31354 by Yunpeng Zhou.

**Figure 1.** Figure 1: CoSee overview. Shared-board collaboration with trace logging and integrity auditing enables controlled, cost-normalized evaluation under strict budgets. Our study finds that naive board use is not a reliable win under small-model constraints, while a lightweight verified-board gate mitigates chart-centric failures. on reasoning-heavy tasks. Specifically, we find that the multi-agent setting frequently und… view at source ↗

**Figure 2.** Figure 2: Effect sizes across reasoning distributions. Pairedbootstrap confidence intervals (95%) for ∆(Method − Baseline). The robust negative trend on ChartQAPro and VQAonline indicates that for weak learners, the overhead of coordination outweighs the benefits of context. Diagnosing Policy Drift. For open-ended tasks (VQAonline), we perform a stratified analysis by binning outputs based on GenTokens(final). B… view at source ↗

**Figure 3.** Figure 3: Diagnosing Policy Collapse on VQAonline. Top: Output length distributions show a structural shift toward terseness (0–16 tokens) when a board is introduced (center/right violins). Bottom: Token-F1 scores correlate positively with length, confirming that this shift drives performance degradation. tleneck. Under the regime of weak learners (small models under single-GPU constraints), we observe that the sh… view at source ↗

**Figure 4.** Figure 4: Causal Failure Analysis. We stratify errors into dominant mechanisms. ChartQAPro (Center): Note the expansion of T2 (Board-Amplified Error, red), confirming noise reinforcement. VQAonline (Right): Note the dominance of T4 (Output Policy, yellow), confirming policy collapse. 0 20 40 60 80 100 120 Extra mean output tokens vs baseline 4 2 0 2 4 6 8 10 12 Drop in raw EM (pp) ChartQAPro: quality drop vs cost (c… view at source ↗

**Figure 5.** Figure 5: Cost–Utility analysis on ChartQAPro. We plot the degradation in exact match accuracy (∆ EM) against the additional computational cost (mean output tokens). Naive collaboration (red/blue) resides in the negative utility quadrant: consuming more compute to produce worse results. Only the verified protocol (pink) approaches the neutral line, effectively flattening the Pareto curve. dynamics we observe would b… view at source ↗

read the original abstract

Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4B--8B models) through the lens of noise accumulation. We introduce CoSee, an auditing framework that formalizes the read-write-verify loop to trace information flow in document visual question answering. Across multi-page, chart, and web-based benchmarks, we find a counter-intuitive degradation: naive shared workspaces often amplify hallucinations rather than resolve them. We identify two dominant failure modes: Noise Reinforcement, where ungrounded notes are reused as evidence, and Policy Collapse, where added context shifts the model toward under-specified, short-form answers. Using cost-accuracy Pareto frontiers, we show that increased compute can correlate negatively with performance without explicit verification. Our findings suggest that for resource-constrained agents, the bottleneck lies not in reasoning depth but in communication fidelity, providing trace-level diagnostics and a mechanistic baseline for reliable modular design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoSee names Noise Reinforcement and Policy Collapse in small-model shared workspaces but the auditing loop itself may be shaping the failures it reports.

read the letter

The paper's main takeaway is that shared workspaces in 4B-8B visual agents on document, chart, and web QA tasks often make hallucinations worse through two mechanisms: ungrounded notes getting reused as evidence, and extra context pushing models toward short, vague answers. It introduces CoSee to trace the read-write-verify loop and shows that more compute can hurt accuracy on Pareto frontiers without explicit verification.

What the work does cleanly is give names and a trace-level view to problems that matter for modular agent design under tight budgets. The emphasis on communication fidelity rather than raw reasoning depth is a useful shift, and the counter-intuitive degradation result is worth testing further.

The soft spot is the stress-test point. CoSee adds an explicit verification step and auditing structure on top of the shared workspace. That extra machinery could alter context length, prompting, or note retention in ways that produce or amplify the very noise reinforcement and policy collapse being measured. The abstract gives no indication of a direct no-audit baseline or controls for those effects, so the central claim rests on an assumption that the framework is neutral. Experimental details on error bars, data rules, and frontier construction are also missing from the provided text, which keeps the empirical strength provisional.

This is aimed at people building resource-constrained modular visual systems who need concrete failure diagnostics. A reader working on agent collaboration would get practical value from the named modes even if the framework needs refinement.

It deserves peer review so the experimental controls and neutrality of CoSee can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoSee, a read-write-verify auditing framework, to trace information flow and diagnose failure modes in shared-state collaboration among weak visual agents (4B-8B models) on multi-page, chart, and web VQA benchmarks. It claims that naive shared workspaces amplify hallucinations rather than mitigate them, identifying two dominant modes—Noise Reinforcement (ungrounded notes reused as evidence) and Policy Collapse (added context driving under-specified short-form answers)—and shows via cost-accuracy Pareto frontiers that increased compute can correlate negatively with performance absent explicit verification, concluding that the bottleneck is communication fidelity rather than reasoning depth.

Significance. If the empirical claims hold after controls for framework artifacts, the work supplies useful trace-level diagnostics and a mechanistic baseline for modular agent design in resource-constrained regimes, underscoring that collaboration can degrade rather than improve performance when state is shared naively. The provision of explicit failure-mode identification and Pareto analysis is a constructive contribution to the literature on reliable multi-agent visual reasoning.

major comments (2)

[§3] §3 (CoSee framework definition): The read-write-verify loop adds an explicit verification step and auditing structure on top of the shared workspace. This machinery could alter context length, prompting style, or note retention relative to a purely naive shared state, creating the risk that Noise Reinforcement and Policy Collapse are partly framework-induced rather than intrinsic properties of naive collaboration. An ablation that isolates the verify component (or compares unmodified shared state against CoSee) is needed to support the central attribution.
[§5] §5 (Pareto frontier results and benchmark comparisons): The claim that increased compute correlates negatively with performance without verification rests on how the frontiers are constructed and how "naive" baselines are implemented versus CoSee-augmented runs. Without reported controls for selection effects, context overhead, or data exclusion rules in the multi-page/chart/web suites, it is difficult to separate the reported degradation from artifacts of the auditing loop itself.

minor comments (2)

[§3.1] Notation for the read/write/verify primitives is introduced without a compact summary table; a small table listing the exact prompt templates and state-update rules would improve reproducibility.
[§5.3] Figure captions for the Pareto plots should explicitly state whether error bars reflect multiple random seeds or only single-run variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the attribution of the reported failure modes.

read point-by-point responses

Referee: [§3] §3 (CoSee framework definition): The read-write-verify loop adds an explicit verification step and auditing structure on top of the shared workspace. This machinery could alter context length, prompting style, or note retention relative to a purely naive shared state, creating the risk that Noise Reinforcement and Policy Collapse are partly framework-induced rather than intrinsic properties of naive collaboration. An ablation that isolates the verify component (or compares unmodified shared state against CoSee) is needed to support the central attribution.

Authors: We agree that isolating the verification step is important for attribution. Our naive baselines are already implemented without the read-write-verify loop. In the revision we will add an explicit ablation that runs the same shared-state protocol with and without the verify component, reporting any differences in note retention, context length, and observed failure rates. revision: yes
Referee: [§5] §5 (Pareto frontier results and benchmark comparisons): The claim that increased compute correlates negatively with performance without verification rests on how the frontiers are constructed and how "naive" baselines are implemented versus CoSee-augmented runs. Without reported controls for selection effects, context overhead, or data exclusion rules in the multi-page/chart/web suites, it is difficult to separate the reported degradation from artifacts of the auditing loop itself.

Authors: We acknowledge the need for explicit controls. The revised manuscript will include (i) measured context overhead for each condition, (ii) the precise selection and exclusion rules applied to the multi-page, chart, and web suites, and (iii) additional Pareto curves that hold context length and data subsets fixed across naive and CoSee conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical diagnostic framework with no derivations or self-referential fitting

full rationale

The paper introduces the CoSee auditing framework as a new contribution and reports empirical observations of failure modes (Noise Reinforcement, Policy Collapse) across benchmarks. No equations, parameter fitting, uniqueness theorems, or derivation chains appear in the abstract or described content. Claims rest on experimental traces rather than reducing to self-defined inputs or prior self-citations. The framework is presented as an external auditing tool, not derived from the results it measures, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5708 in / 1038 out tokens · 12841 ms · 2026-06-28T22:18:40.854366+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 2 internal anchors

[1]

URL https: //aclanthology.org/2025.acl-long.291/

doi: 10.18653/v1/2025.acl-long.291. URL https: //aclanthology.org/2025.acl-long.291/. Jain, C., Wu, Y ., Zeng, Y ., Liu, J., hengyu Dai, S., Shao, Z., Wu, Q., and Wang, H. Simpledoc: Multi- modal document understanding with dual-cue page re- trieval and iterative refinement.ArXiv, abs/2506.14035,

work page doi:10.18653/v1/2025.acl-long.291 2025
[2]

copy” case: P(Zu =Z v = 1) =p . Under the “independent

URL https://api.semanticscholar. org/CorpusID:279410653. Jiang, B., Zhuang, Z., Shivakumar, S. S., Roth, D., and Tay- lor, C. J. Multi-agent vqa: Exploring multi-agent founda- tion models in zero-shot visual question answering, 2024. URLhttps://arxiv.org/abs/2403.14783. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., a...

work page doi:10.1109/cvpr52733.2024.01480 2024
[3]

SQuAD: 100, 000+ Questions for Machine Comprehension of Text , booktitle =

URL https://proceedings.mlr.press/ v260/nguyen25c.html. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Su, J., Duh, K., and Carreras, X. (eds.),Proceed- ings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Associat...

work page doi:10.18653/v1/d16-1264 2016
[4]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

URL https://aclanthology.org/2025. emnlp-main.893/. Wang, D., Raman, N., Sibue, M., Ma, Z., Babkin, P., Kaur, S., Pei, Y ., Nourbakhsh, A., and Liu, X. DocLLM: A layout-aware generative language model for multimodal document understanding. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meet- ing of the Association for C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.463 2025
[5]

org/CorpusID:247595263

URL https://api.semanticscholar. org/CorpusID:247595263. Wang, Z., Wan, W., Lao, Q., Chen, R., Lang, M., Wang, X., Wang, K., and Lin, L. Towards top-down reasoning: An explainable multi-agent approach for visual question answering, 2025. URL https://arxiv.org/abs/ 2311.17331. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., L...

work page doi:10.18653/v1/2021.acl-long 2025
[6]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

URL https://aclanthology.org/2021. acl-long.201/. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y ., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601. Yi, Z., Liu, J., Xiao, T., and Albert, M. V . A multi-agent system for complex reasoning in radiology v...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[1] [1]

URL https: //aclanthology.org/2025.acl-long.291/

doi: 10.18653/v1/2025.acl-long.291. URL https: //aclanthology.org/2025.acl-long.291/. Jain, C., Wu, Y ., Zeng, Y ., Liu, J., hengyu Dai, S., Shao, Z., Wu, Q., and Wang, H. Simpledoc: Multi- modal document understanding with dual-cue page re- trieval and iterative refinement.ArXiv, abs/2506.14035,

work page doi:10.18653/v1/2025.acl-long.291 2025

[2] [2]

copy” case: P(Zu =Z v = 1) =p . Under the “independent

URL https://api.semanticscholar. org/CorpusID:279410653. Jiang, B., Zhuang, Z., Shivakumar, S. S., Roth, D., and Tay- lor, C. J. Multi-agent vqa: Exploring multi-agent founda- tion models in zero-shot visual question answering, 2024. URLhttps://arxiv.org/abs/2403.14783. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., a...

work page doi:10.1109/cvpr52733.2024.01480 2024

[3] [3]

SQuAD: 100, 000+ Questions for Machine Comprehension of Text , booktitle =

URL https://proceedings.mlr.press/ v260/nguyen25c.html. Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Su, J., Duh, K., and Carreras, X. (eds.),Proceed- ings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, Austin, Texas, November 2016. Associat...

work page doi:10.18653/v1/d16-1264 2016

[4] [4]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

URL https://aclanthology.org/2025. emnlp-main.893/. Wang, D., Raman, N., Sibue, M., Ma, Z., Babkin, P., Kaur, S., Pei, Y ., Nourbakhsh, A., and Liu, X. DocLLM: A layout-aware generative language model for multimodal document understanding. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meet- ing of the Association for C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.463 2025

[5] [5]

org/CorpusID:247595263

URL https://api.semanticscholar. org/CorpusID:247595263. Wang, Z., Wan, W., Lao, Q., Chen, R., Lang, M., Wang, X., Wang, K., and Lin, L. Towards top-down reasoning: An explainable multi-agent approach for visual question answering, 2025. URL https://arxiv.org/abs/ 2311.17331. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., L...

work page doi:10.18653/v1/2021.acl-long 2025

[6] [6]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

URL https://aclanthology.org/2021. acl-long.201/. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y ., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601. Yi, Z., Liu, J., Xiao, T., and Albert, M. V . A multi-agent system for complex reasoning in radiology v...

work page internal anchor Pith review Pith/arXiv arXiv 2021