MASSW: A new dataset and benchmark tasks for AI-assisted scientific workflows

Xingjian Zhang, Yutong Xie, Jin Huang, Jinge Ma, Zhaoying Pan, Qijia Liu, Ziyang Xiong, Tolga Ergen, Dongsub Shim, Honglak Lee, Qiaozhu Mei · 2025 · DOI 10.18653/v1/2025.findings-naacl.127

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

citing papers explorer

Showing 2 of 2 citing papers.

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks · cs.CL · 2026-06-23 · unverdicted · none · ref 27
Can AI Agents Synthesize Scientific Conclusions? · cs.AI · 2026-06-09 · unverdicted · none · ref 137
