4DLangVGGT: 4D language-visual geometry grounded transformer
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
  A two-stage diversity-plus-entropy token selection framework speeds up visual geometry transformers by over 85% on 500-image scenes while preserving baseline accuracy.
- High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians
  A multi-view feed-forward transformer provides initial poses and geometry from calibrated videos, followed by physics-aware Gaussian optimization with tetrahedral and collision constraints to produce robust 4D hand-object reconstructions.