SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2025 (2). Verdicts: UNVERDICTED (2).
Citing papers explorer
- ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.