Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
10 Pith papers cite this work.
citing papers explorer
- PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
  PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
  FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
- X+Slides: Benchmarking Audience-Conditioned Slide Generation
  X+Slides is a new benchmark that measures audience-conditioned slide generation quality via 8,133 source-grounded probes across 113 topics, reporting Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness on three existing systems.
- Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
  Visual-SDPO distills visual feedback from rendered code outputs into a student policy via grounded credit weighting and GRPO, yielding over 10-point gains on chart/UI/slide benchmarks.
- Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
  Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.
- PresentAgent-2: Towards Generalist Multimodal Presentation Agents
  PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
- Narrative-Driven Paper-to-Slide Generation via ArcDeck
  ArcDeck models paper-to-slide generation as narrative reconstruction using discourse parsing and multi-agent refinement, plus a new ArcBench benchmark, to improve flow and coherence over direct summarization.
- VideoAgent: Personalized Synthesis of Scientific Videos
  VideoAgent is a modular framework that redefines scientific video synthesis as an intent-driven planning problem and introduces the SciVidEval benchmark for multimodal quality and pedagogical utility.
- Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising
  SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.
- PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation