End-to-end object detection with transformers
9 Pith papers cite this work.
representative citing papers
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
XDecomposer uses set prediction and phase-query decomposition to jointly identify phases and reconstruct multiphase PXRD patterns without priors.
citing papers explorer
- SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
  SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
- Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
  IC-Seg is a multi-turn clarification framework with hierarchical GRPO optimization that resolves ambiguous queries in referring video object segmentation and introduces the Ambi-RVOS benchmark.
- Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction
  Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
- ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
  ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
- ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
  ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to 40% on hallucination benchmarks and 20% under adversarial perturbations.
- Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
  Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.