pith. machine review for the scientific record.

arxiv: 2511.16624 · v1 · submitted 2025-11-20 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

SAM 3D: 3Dfy Anything in Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 11:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords 3D reconstruction · single image · generative model · object shape · annotation pipeline · natural images · computer vision · 3D ground truth

The pith

SAM 3D generates geometry, texture, and layout of objects from one natural image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAM 3D as a generative model that reconstructs 3D objects with shape, surface appearance and spatial arrangement from a single photograph of a real scene. To overcome scarce training data, the authors built a pipeline that interleaves human judgment and model predictions to label object shape, texture and pose across many natural images containing occlusions and clutter. Training proceeds in stages: initial learning on synthetic examples followed by alignment to the real annotated set. The resulting model produces outputs that human evaluators prefer to those of earlier methods at a rate of at least five to one.

Core claim

SAM 3D is a generative model for visually grounded 3D object reconstruction that predicts geometry, texture, and layout from a single image. It is trained on a large collection of such reconstructions obtained through a human- and model-in-the-loop annotation pipeline, using synthetic pretraining followed by real-world alignment to break the previous data barrier for natural images.
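The review gives no pseudocode for this schedule, but its shape is simple. A minimal sketch, assuming a PyTorch-style setup in which Sam3DStandIn, the loaders, and all hyperparameters are illustrative stand-ins rather than the paper's actual components:

```python
import torch

# Hypothetical stand-in for the reconstruction model; the paper's actual
# architecture is not specified in the review above.
class Sam3DStandIn(torch.nn.Module):
    def __init__(self, feat_dim: int = 512, latent_dim: int = 256):
        super().__init__()
        self.head = torch.nn.Linear(feat_dim, latent_dim)  # placeholder encoder head

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        return self.head(image_feats)  # placeholder "3D latent" prediction


def staged_training(model, synthetic_loader, real_loader, loss_fn):
    """Stage 1: synthetic pretraining; Stage 2: real-world alignment."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Stage 1: rendered assets give exact shape/texture/pose supervision,
    # so labels are dense and cheap here.
    for feats, target in synthetic_loader:
        opt.zero_grad()
        loss_fn(model(feats), target).backward()
        opt.step()

    # Stage 2: fine-tune on the human-verified annotations from the data
    # engine, at a lower learning rate to preserve the pretrained prior.
    for group in opt.param_groups:
        group["lr"] = 1e-5
    for feats, target in real_loader:
        opt.zero_grad()
        loss_fn(model(feats), target).backward()
        opt.step()
    return model
```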

What carries the argument

The human- and model-in-the-loop annotation pipeline that supplies accurate 3D ground truth for natural images with occlusion and clutter, which in turn supports the multi-stage training of the generative reconstruction model.
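What such a loop could look like, as a minimal sketch: every name below is a hypothetical stand-in, since the text above states only that model proposals and human judgments alternate to produce 3D labels.

```python
import random

# All functions here are hypothetical stand-ins for one step of the loop.

def propose_candidates(model, image, n=8):
    """Model proposes n candidate 3D reconstructions for one image."""
    return [f"candidate_mesh_{i}" for i in range(n)]  # placeholder proposals

def ask_annotator(image, candidates):
    """Stand-in for a human picking the best candidate, or None if all fail."""
    return random.choice([None] + list(range(len(candidates))))

def retrain(model, dataset):
    """Stand-in for the model-improvement step on all accepted labels."""
    return model

def data_engine(model, image_pool, rounds=3):
    dataset = []
    for _ in range(rounds):
        remaining = []
        for image in image_pool:
            candidates = propose_candidates(model, image)
            choice = ask_annotator(image, candidates)
            if choice is None:
                remaining.append(image)  # hard case: retry with a better model
            else:
                dataset.append((image, candidates[choice]))
        model = retrain(model, dataset)
        image_pool = remaining  # later rounds focus on current failures
    return model, dataset

model, labels = data_engine(model=None, image_pool=[f"img_{i}" for i in range(100)])
```

One design point worth noting in this sketch: images whose candidates are all rejected stay in the pool, so later, stronger models get another pass at exactly the cases the current model fails on.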

If this is right

  • Superior 3D reconstruction quality on cluttered natural images compared with earlier single-image methods.
  • At least a 5:1 preference margin in blind human evaluations on real-world objects and scenes.
  • Public release of code, trained weights, an interactive demo, and a new benchmark dataset for in-the-wild 3D reconstruction.
  • Demonstration that combining synthetic pretraining with large-scale real annotations overcomes the data limitation for 3D vision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same annotation loop could be applied to video sequences to obtain consistent 3D models over time.
  • The released benchmark could become a standard test set that future single-image 3D methods must surpass.
  • Scaling the pipeline further might support training models that reconstruct entire scenes rather than isolated objects.

Load-bearing premise

The human- and model-in-the-loop annotation pipeline produces accurate, unbiased 3D ground truth at scale for natural images with occlusion and clutter.

What would settle it

If human preference tests on real-world objects and scenes show no clear advantage or a reversal of the reported 5:1 win rate for SAM 3D over recent prior work, the performance claim would not hold.
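For scale, a 5:1 win rate means SAM 3D is preferred in roughly 83% of pairwise comparisons. A minimal sketch of the significance check such a test implies, with invented rater counts (the paper's actual study size is not given above):

```python
from scipy.stats import binomtest

# Illustrative only: the comparison counts below are invented, not the paper's.
wins, total = 500, 600  # SAM 3D preferred in 500 of 600 pairwise comparisons

result = binomtest(wins, total, p=0.5, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95)

print(f"win rate = {wins / total:.3f}")                 # 0.833, i.e. ~5:1
print(f"p-value vs. 50/50 null = {result.pvalue:.2e}")
print(f"95% CI for win rate = [{ci.low:.3f}, {ci.high:.3f}]")
```

A reversal, or a confidence interval straddling 0.5, on such a test is exactly the outcome the paragraph above describes.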

read the original abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents SAM 3D, a generative model for single-image 3D reconstruction of objects that outputs geometry, texture, and layout. It introduces a scalable human- and model-in-the-loop annotation pipeline to generate 3D ground-truth data for natural images with occlusion and clutter, trains via synthetic pretraining followed by real-world alignment, and reports at least a 5:1 win rate in human preference tests over prior work on real-world scenes. Code, weights, demo, and a new benchmark are to be released.

Significance. If the data quality and performance claims hold, the work would meaningfully advance in-the-wild 3D reconstruction by addressing the scarcity of accurate 3D annotations for natural images, enabling stronger context-aware models; the planned public releases would further support reproducibility and downstream applications in vision and graphics.

major comments (3)
  1. [Abstract] The headline claim of 'at least a 5:1 win rate in human preference tests' is presented without the quantitative detail needed to assess it: number of raters, total comparisons, confidence intervals, inter-rater agreement, error analysis, and evaluation-protocol specifics are all required to judge whether the result supports the central performance assertion.
  2. [Annotation pipeline] The human- and model-in-the-loop process (likely §3–4) is asserted to produce accurate, unbiased 3D ground truth at scale for occluded natural images, yet no external validation is reported (multi-view consistency checks, depth-sensor comparisons, or laser-scan ground truth on a held-out set). Because the downstream synthetic-pretraining and real-alignment stages and the 5:1 preference result depend directly on this data fidelity, the absence of such verification is load-bearing; a sketch of one such depth-sensor check follows the minor comments.
  3. [Evaluation] The human preference tests are the sole quantitative evidence offered for superiority over recent work. Without objective metrics (e.g., Chamfer distance, IoU on reconstructed meshes, or pose accuracy on a standard benchmark) or an ablation isolating the contribution of the new data versus the training schedule, it is difficult to determine whether the gains are robust or artifactual.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from a concise statement of the precise architectural differences from prior single-image 3D methods (e.g., explicit comparison to recent diffusion-based or NeRF-based baselines).
  2. [Figures] Figure captions and legends should include more detail on what is being visualized (e.g., input image, predicted mesh, texture map, and any overlaid ground-truth annotations) to aid readers in interpreting qualitative results.
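
Major comment 2 asks for external validation of the annotated labels against sensor data. A minimal sketch of such a check, assuming aligned per-pixel depth maps in meters; the function, metrics, and toy inputs are illustrative, not from the paper:

```python
import numpy as np

def depth_agreement(rendered: np.ndarray, sensor: np.ndarray,
                    valid: np.ndarray) -> dict:
    """Compare depth rendered from an annotated 3D label to sensor depth.

    All arrays are aligned H x W maps; `valid` masks pixels where the
    sensor returned a measurement. Metrics are standard depth-estimation
    choices, not the paper's protocol.
    """
    r, s = rendered[valid], sensor[valid]
    abs_rel = np.abs(r - s) / s
    return {
        "abs_rel_median": float(np.median(abs_rel)),
        "rmse_m": float(np.sqrt(np.mean((r - s) ** 2))),
        "delta<1.25": float(np.mean(np.maximum(r / s, s / r) < 1.25)),
    }

# Toy usage: synthetic maps stand in for a held-out validation capture.
rng = np.random.default_rng(0)
sensor = rng.uniform(0.5, 5.0, size=(480, 640))   # meters
rendered = sensor + rng.normal(0.0, 0.05, size=sensor.shape)
print(depth_agreement(rendered, sensor, valid=sensor > 0))
```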

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with clear indications of planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The headline claim of 'at least a 5:1 win rate in human preference tests' is presented without the quantitative detail needed to assess it: number of raters, total comparisons, confidence intervals, inter-rater agreement, error analysis, and evaluation-protocol specifics are all required to judge whether the result supports the central performance assertion.

    Authors: The abstract is intentionally concise and summarizes the primary result, while the full quantitative details—including the number of raters, total comparisons, confidence intervals, inter-rater agreement, error analysis, and evaluation protocol—are provided in the Evaluation section. To address the concern, we will revise the abstract to include a brief reference to the human study scale and key supporting statistics from the main text. revision: yes

  2. Referee: [Annotation pipeline] The human- and model-in-the-loop process (likely §3–4) is asserted to produce accurate, unbiased 3D ground truth at scale for occluded natural images, yet no external validation is reported (multi-view consistency checks, depth-sensor comparisons, or laser-scan ground truth on a held-out set). Because the downstream synthetic-pretraining and real-alignment stages and the 5:1 preference result depend directly on this data fidelity, the absence of such verification is load-bearing.

    Authors: We agree that additional external validation would increase confidence in the data fidelity. The pipeline incorporates internal model-assisted consistency checks during annotation. In the revised manuscript, we will add a validation subsection that reports multi-view consistency metrics and comparisons to depth-sensor data on a held-out set to directly address this point. revision: yes

  3. Referee: [Evaluation] The human preference tests are the sole quantitative evidence offered for superiority over recent work. Without objective metrics (e.g., Chamfer distance, IoU on reconstructed meshes, or pose accuracy on a standard benchmark) or an ablation isolating the contribution of the new data versus the training schedule, it is difficult to determine whether the gains are robust or artifactual.

    Authors: Human preference evaluation is the most appropriate primary metric for assessing perceptual quality of 3D reconstructions in natural, cluttered scenes. We will nevertheless expand the Evaluation section to include objective metrics such as Chamfer distance and pose accuracy on standard benchmarks, along with ablations that separate the contributions of the new annotated data from the training schedule. revision: yes
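
The Chamfer distance the authors commit to reporting has a compact definition. A minimal NumPy version, assuming uniformly sampled surface point clouds; this is a standard formulation, not the paper's evaluation code:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).

    Mean squared distance from each point to its nearest neighbor in the
    other cloud, summed over both directions. The paper's exact variant is
    not specified in the material above.
    """
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage: two noisy samplings of the unit sphere.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
noisy = pts + rng.normal(scale=0.01, size=pts.shape)
print(f"chamfer = {chamfer_distance(pts, noisy):.6f}")
```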

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical data generation, training, and external human evaluation.

full rationale

The paper presents an annotation pipeline to generate 3D data at scale from natural images, followed by multi-stage training of a generative model and evaluation via human preference tests. No equations, derivations, or self-referential definitions are present in the provided text that would reduce any prediction or result to fitted inputs or prior outputs by construction. The data pipeline, training framework, and preference-based evaluation are described as sequential and externally validated steps without load-bearing self-citations or ansatzes that collapse the logic. This is a standard empirical ML paper structure with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the annotation pipeline yielding high-quality 3D labels and on the assumption that synthetic pretraining plus real-world alignment suffices to overcome data scarcity.

free parameters (1)
  • model training hyperparameters
    Standard deep learning parameters tuned on the new dataset.
axioms (1)
  • domain assumption: Single natural images contain sufficient visual cues for accurate 3D object reconstruction despite occlusion and clutter
    Invoked implicitly as the basis for the task and evaluation on real-world scenes.

pith-pipeline@v0.9.0 · 5543 in / 1095 out tokens · 94918 ms · 2026-05-11T11:41:46.946357+00:00 · methodology

discussion (0)


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 8.0

    ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...

  2. neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

    cs.CV 2026-04 unverdicted novelty 8.0

    neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

  3. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 7.0

    MiXR enables in-situ 3D design by harvesting real-world geometry for user-defined compositions that generative AI then refines, outperforming text-only generative methods in control and fidelity per a 12-person study.

  4. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  5. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  6. LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

    cs.CV 2026-04 unverdicted novelty 7.0

    LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.

  7. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  8. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  9. Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

    cs.CV 2026-04 unverdicted novelty 7.0

    A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

  10. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  11. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  12. THOM: Generating Physically Plausible Hand-Object Meshes From Text

    cs.CV 2026-04 unverdicted novelty 7.0

    THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.

  13. Focusable Monocular Depth Estimation

    cs.CV 2026-05 unverdicted novelty 6.0

    FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.

  14. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  15. Few-Click-Driven Interactive 3D Segmentation with Semantic Embedding

    cs.CV 2026-05 unverdicted novelty 6.0

    A point-Transformer interactive 3D instance segmentation model handles multiple clicks jointly in one pass and reports over 20% mIoU gains versus baselines plus 8-10% cross-dataset improvement for one-click-per-instan...

  16. Creative Robot Tool Use by Counterfactual Reasoning

    cs.RO 2026-05 unverdicted novelty 6.0

    Robots discover causal tool features through VLM suggestions and physics-based counterfactual perturbations in simulation, then transfer manipulation skills via conditioned keypoint matching.

  17. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV 2026-04 unverdicted novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...

  18. GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning

    cs.RO 2026-04 unverdicted novelty 6.0

    GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for ...

  19. FurnSet: Exploiting Repeats for 3D Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

  20. ShapeGen: Robotic Data Generation for Category-Level Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ShapeGen generates shape-diverse 3D robotic manipulation demonstrations without simulators by curating a functional shape library and applying a minimal-annotation pipeline for novel, physically plausible data.

  21. Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

    cs.CV 2026-04 unverdicted novelty 6.0

    GraG reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work by using compact Sum-of-Gaussians tracking initialized from large models and refined with 2D losses.

  22. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  23. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  24. PhyMix: Towards Physically Consistent Single-Image 3D Indoor Scene Generation with Implicit--Explicit Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.

  25. A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures

    cs.CV 2026-04 conditional novelty 6.0

    A pipeline using SAM segmentation and Hi3DGen mesh generation, evaluated on 69 medieval figures, produces usable 3D models for XR and tactile applications with Hi3DGen as the best starting point.

  26. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  27. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  28. MemoryDiorama: Generating Dynamic 3D Diorama from Everyday Photos for Memory Recall

    cs.HC 2026-04 unverdicted novelty 6.0

    MemoryDiorama generates animated 3D dioramas from photos via LLM scene analysis and generative components, yielding richer autobiographical recall than photo-only or static diorama baselines.

  29. Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

    cs.CV 2026-04 unverdicted novelty 6.0

    BoxerNet lifts 2D bounding boxes to metric 3D boxes via transformer regression with aleatoric uncertainty and median depth encoding, then fuses multi-view results to outperform CuTR by large margins on open-world benchmarks.

  30. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  31. UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.

  32. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  33. A Model-based Visual Contact Localization and Force Sensing System for Compliant Robotic Grippers

    cs.RO 2026-05 unverdicted novelty 5.0

    A visual system combines deep learning contact localization with inverse FEA to estimate forces on soft grippers at 0.23 N RMSE during loading.

  34. 3DPipe: A Pipelined GPU Framework for Scalable Generalized Spatial Join over Polyhedral Objects

    cs.DB 2026-04 unverdicted novelty 5.0

    3DPipe delivers up to 9x faster 3D spatial joins on polyhedra via GPU pipelining, multi-level pruning, and chunked streaming compared to prior GPU methods.

  35. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  36. A Rapid Deployment Pipeline for Autonomous Humanoid Grasping Based on Foundation Models

    cs.RO 2026-04 unverdicted novelty 4.0

    An end-to-end pipeline combining Roboflow annotation, SAM 3D reconstruction, and FoundationPose tracking reduces humanoid grasping deployment to 30 minutes with mAP 0.995 detection and sub-1.05 mm pose precision.

  37. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV 2026-04 unverdicted novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
