pith.

Baseline reference

Omni-video: Democratizing unified video understanding and generation

Baseline reference. 60% of classified citations from citing Pith papers (3 of 5) use this work as a benchmark or comparison.

13 Pith papers citing it
Baseline: 60% of classified citations

hub tools

citation-role summary

baseline: 3 · background: 2

citation-polarity summary

fields

cs.CV: 12 · cs.AI: 1

years

2026: 13

verdicts

unverdicted: 13


representative citing papers

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework in which an MLLM planner outputs semantic representations that a DiT renderer uses to generate or edit videos; the authors report state-of-the-art benchmark performance.

Valley3: Scaling Omni Foundation Models for E-commerce

cs.AI · 2026-05-02 · unverdicted · novelty 4.0

Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying competitive on general ones.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 100 · 2 links


  • ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks cs.CV · 2026-04-09 · unverdicted · none · ref 31

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.