StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
StepSTEM benchmark and dynamic-programming step alignment show top MLLMs achieve only 38.29% accuracy on graduate STEM tasks requiring interleaved cross-modal reasoning.