pith. machine review for the scientific record.

arxiv: 2605.07385 · v1 · submitted 2026-05-08 · 💻 cs.GR · cs.CV

Recognition: 1 theorem link

· Lean Theorem

Velocity-Space 3D Asset Editing

Hao Liu, Jingfeng Guo, Junjie Wang, Ruihang Chu, Ruotong Li, Yujiu Yang, Yuxuan Lin

Pith reviewed 2026-05-11 02:03 UTC · model grok-4.3

classification 💻 cs.GR cs.CV
keywords 3D asset editing · rectified flow · velocity field · local editing · identity preservation · ODE sampler · generative models · mask-free editing

The pith

VS3D performs local 3D asset editing inside the rectified-flow sampler by anchoring source identity, amplifying consistent edits, and deciding preservation token by token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that local 3D editing succeeds when the velocity field itself is reshaped during sampling rather than controlled by external masks or post-processing. Existing approaches fail because one velocity field cannot stay strong on the edit region while staying zero on preserved content, producing leakage, weak edits, and drag at geometry and material stages. VS3D counters this with three sampler interventions: RASI anchors the unconditional path to the original asset reconstruction, PMG boosts the velocity difference only where edits are consistent across subsamples, and TAR lets the model choose per token whether to keep or change geometry and material. If correct, this removes the need for manual masks, model retraining, or inversion while keeping edits faithful.
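The leakage problem above can be made concrete with a toy numerical sketch. Everything here is an illustrative assumption, not the paper's model: a 1D "asset" of ten tokens and hand-written velocity fields standing in for the conditional and unconditional branches of a guided rectified-flow sampler. The point is that any guidance weight strong enough to drive the edit also pushes on preserved tokens wherever the velocity difference is nonzero there.

```python
import numpy as np

def euler_guided_step(x, t, dt, v_cond, v_uncond, w=3.0):
    """One Euler step: x <- x + dt * (v_uncond + w * (v_cond - v_uncond))."""
    vu = v_uncond(x, t)
    return x + dt * (vu + w * (v_cond(x, t) - vu))

def v_uncond(x, t):
    return -x  # drift toward the prior mean

def v_cond(x, t):
    v = -x
    v[5:] += 2.0   # strong edit signal on the target region (tokens 5-9)
    v[:5] += 0.1   # small spurious signal on preserved tokens: leakage
    return v

x = np.linspace(-1.0, 1.0, 10)
x_next = euler_guided_step(x, t=0.5, dt=0.1, v_cond=v_cond, v_uncond=v_uncond)
leak = np.abs(v_cond(x, 0.5) - v_uncond(x, 0.5))[:5]
print("edit signal on preserved tokens:", leak)  # nonzero: identity leakage
```

Raising `w` strengthens the edit and the leakage in lockstep, which is exactly the "no dedicated amplification channel" problem the paper names.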

Core claim

A single velocity field cannot simultaneously drive a strong edit on the target region and vanish on preserved content, so rectified-flow samplers produce identity leakage, lack dedicated amplification, and suffer identity drag. VS3D solves each problem with a velocity-space module: Reconstruction-Anchored Source Injection turns the unconditional embedding into a per-step asset-specific anchor, Partial-Mean Guidance contrasts high- and low-quality velocity differences to amplify only consistent edits, and Twin-Agreement Residual injection lets the sampler decide token by token what to preserve at the geometry and material stages.
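A hedged sketch of how such interventions could sit inside one sampler step, on the same toy 1D asset. The signatures, thresholds, mixing weights, and the stand-in velocity model are all our own assumptions; the paper's real modules act on SLAT latents inside a frozen rectified-flow DiT, and its actual per-step scheduling is not specified in the material reviewed here.

```python
import numpy as np

EDIT_REGION = np.arange(10) >= 5  # toy asset: tokens 5-9 are the edit target

def v_model(x, t, edit, seed=0):
    """Toy frozen velocity field; `seed` mimics a stochastic subsample."""
    noise = np.random.default_rng(seed).normal(scale=0.02, size=x.shape)
    v = -x + noise
    if edit:
        v = v + np.where(EDIT_REGION, 2.0, 0.0)
    return v

def vs3d_style_step(x, x_src, t, dt, w=3.0, tau_v=0.5, tau_keep=0.2):
    # RASI-like: anchor the "unconditional" path on the source asset by
    # mixing in a reconstruction pull toward x_src (hypothetical form).
    v_anchor = v_model(x, t, edit=False) + 0.5 * (x_src - x)
    # PMG-like: two subsample estimates of the edit direction; amplify
    # only where they agree in sign and clear a magnitude threshold.
    d1 = v_model(x, t, edit=True, seed=1) - v_anchor
    d2 = v_model(x, t, edit=True, seed=2) - v_anchor
    mean_d = 0.5 * (d1 + d2)
    consistent = (np.sign(d1) == np.sign(d2)) & (np.abs(mean_d) > tau_v)
    x_next = x + dt * (v_anchor + np.where(consistent, w * mean_d, mean_d))
    # TAR-like: tokens with no consistent edit signal are treated as
    # preserved and snapped back onto the source residual.
    keep = np.abs(mean_d) < tau_keep
    return np.where(keep, x_src, x_next), keep

x_src = np.linspace(-1.0, 1.0, 10)   # "source asset"
x = x_src.copy()
for _ in range(5):
    x, keep = vs3d_style_step(x, x_src, t=0.5, dt=0.1)
print("kept tokens:", keep)  # preserved half stays True, edit half False
```

Under these toy assumptions the preserved tokens return to the source exactly while the target tokens move, which is the qualitative behaviour the claim demands; whether the real modules interact this cleanly is precisely what the referee questions below.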

What carries the argument

The combination of three velocity-space interventions—Reconstruction-Anchored Source Injection (RASI), Partial-Mean Guidance (PMG), and Twin-Agreement Residual (TAR)—applied directly inside the rectified-flow ODE sampler.

Load-bearing premise

The three interventions can be combined in the sampler without creating new artifacts or requiring per-asset tuning beyond the described procedure.

What would settle it

An edited asset in which the target region fails to change as intended or in which preserved regions show visible shifts in shape or material after applying the three modules would falsify the claim.
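That falsification criterion can be made operational with a toy check. The grids, mask, and tolerance below are our own illustrative assumptions, not the paper's evaluation protocol: compare source and edited assets outside the edit region, and flag either visible drift on preserved content or a null edit on the target.

```python
import numpy as np

def preservation_check(src, edited, edit_mask, tol=1e-2):
    """Hypothetical test of the claim: drift above `tol` on preserved
    voxels, or no change at all on the target region, counts against it."""
    preserved = ~edit_mask
    drift = float(np.abs(edited[preserved] - src[preserved]).max())
    edit_mag = float(np.abs(edited[edit_mask] - src[edit_mask]).max())
    return {"preserved_drift": drift,
            "edit_magnitude": edit_mag,
            "falsified": drift > tol or edit_mag == 0.0}

# Toy 4x4x4 occupancy grids: the edit adds material in one corner only.
src = np.zeros((4, 4, 4))
edited = src.copy()
mask = np.zeros_like(src, dtype=bool)
mask[:2, :2, :2] = True
edited[mask] = 1.0

report = preservation_check(src, edited, mask)
print(report)
```

A real evaluation would use perceptual or geometric metrics (e.g. LPIPS on renders or a surface distance) rather than raw voxel differences, but the pass/fail logic is the same.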

Figures

Figures reproduced from arXiv: 2605.07385 by Hao Liu, Jingfeng Guo, Junjie Wang, Ruihang Chu, Ruotong Li, Yujiu Yang, Yuxuan Lin.

Figure 1. Overview of the VS3D pipeline. A source 3D asset is rendered and 2D-edited to obtain the condition. Stage 1 operates on the dense occupancy latent: RASI (§3.2) optimises a per-step ϕ_t to suppress v_Δ on non-edited regions, and PMG (§3.3) amplifies the edit signal via subsample extrapolation. Stages 2–3 handle sparse geometry and material SLATs: TAR (§3.4) computes a token-wise p_keep map (blue = preserve, re…
Figure 2. Qualitative comparison. Our method is the only mask-free approach that jointly preserves the non-edited region and produces high-quality edits. Red boxes highlight the edited regions for easier visual inspection and are not part of the model input. In the Mask row, black regions indicate the user-provided editing mask required by mask-based methods. Geometry in preserved regions closely matches the origina…
Figure 3. Ablation study. Each module is progressively added and resolves a distinct failure mode. Furthermore, VS3D operates entirely through residual injection on frozen DiTs; consequently, the editing quality is upper-bounded by the generative capacity of the TRELLIS backbone itself. Any failure mode of the base model propagates into the edited output unchanged. Despite this limitation, VS3D demonstrates that hig…
Figure 4. TAR keep-SLAT map visualisation (Part 1). Each row: source rendering, edited rendering, geometry keep-SLAT map, and material keep-SLAT map. Warm colours denote the edited region (high twin discrepancy); cool colours denote the preserved region where TAR injects source residuals.
Figure 5. TAR keep-SLAT map visualisation (Part 2). Each row: source rendering, edited rendering, geometry keep-SLAT map, and material keep-SLAT map. Warm colours denote the edited region (high twin discrepancy); cool colours denote the preserved region where TAR injects source residuals.
Figure 6. Extended editing gallery (Part 1). Large-scale qualitative results of VS3D across diverse assets and edit operations. Each group shows the source asset and the edited result under the corresponding text instruction.
Figure 7. Extended editing gallery (Part 2). Additional large-scale results demonstrating VS3D’s robustness on mixed-operation edits and complex compositional instructions.
original abstract

Editing a 3D asset locally, modifying a target region while preserving the rest, is a fundamental requirement of native 3D editing. Existing methods enforce locality through mechanisms external to the generator, such as manual 3D masks, post-hoc voxel merging, or 2D multi-view lifting. None of them intervene where the corruption actually originates: inside the ODE sampler. For a rectified-flow generator to achieve faithful local editing, its velocity field should be strong over the target editing region while vanishing on preserved content. Yet a single velocity field can hardly satisfy both requirements simultaneously, leading to three problems: (i) identity leakage that keeps the edit signal non-zero on preserved regions; (ii) no dedicated edit-amplification channel, so strengthening the edit inevitably perturbs identity; and (iii) an identity drag at the geometry and material stages, where a global condition pulls every token toward the target. We propose VS3D (Velocity-Space 3D Asset Editing), an inversion-free, training-free, and mask-free framework that addresses each problem with a targeted intervention inside the sampler. VS3D integrates three complementary modules, each corresponding to a specific stage of the editing pipeline. Reconstruction-Anchored Source Injection (RASI) absorbs identity leakage by turning the unconditional embedding into a per-step, asset-specific anchor calibrated through source reconstruction. Partial-Mean Guidance (PMG) amplifies the edit signal by contrasting high- and low-quality subsample estimates of the velocity difference, active only where a consistent edit exists. Twin-Agreement Residual injection (TAR) lets the sampler decide token by token what to preserve at the geometry and material stages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VS3D, an inversion-free, training-free, and mask-free framework for local 3D asset editing that intervenes directly inside the rectified-flow ODE sampler. It identifies three problems (identity leakage, lack of edit amplification, and identity drag at geometry/material stages) and addresses them via three modules: Reconstruction-Anchored Source Injection (RASI) that calibrates the unconditional embedding from source reconstruction, Partial-Mean Guidance (PMG) that amplifies consistent velocity differences from high/low-quality subsamples, and Twin-Agreement Residual injection (TAR) that applies token-wise residuals only on agreeing tokens.

Significance. If the modules prove orthogonal and robust, the velocity-space approach could meaningfully advance 3D editing by eliminating reliance on external masks, post-hoc merging, or 2D lifting while remaining training-free. The explicit targeting of the velocity field v_t rather than latent or pixel space is a conceptual strength, and the absence of per-asset inversion or fine-tuning would be practically valuable if the interventions generalize without new artifacts.

major comments (2)
  1. The central claim that RASI, PMG, and TAR can be combined inside the sampler without mutual interference or per-asset tuning is load-bearing but unsupported by any derivation or normalization analysis of their additive effects on v_t. No equation shows how the RASI anchor, PMG velocity-difference contrast, and TAR residual are scheduled across denoising steps or scaled so that PMG amplification does not override TAR preservation on the same token (see the integration description following the problem statement).
  2. The paper asserts that PMG is 'active only where a consistent edit exists' and TAR decides 'token by token,' yet provides no formal definition or threshold for consistency/agreement, nor any ablation demonstrating that these decisions remain stable when all three modules operate simultaneously. This leaves the orthogonality assumption untested and risks re-introducing identity drag or high-frequency artifacts at geometry/material stages.
minor comments (2)
  1. The abstract and introduction use several invented acronyms (RASI, PMG, TAR) without immediate expansion on first use, which reduces readability.
  2. No quantitative metrics, datasets, or baseline comparisons are referenced in the provided description, making it difficult to assess the magnitude of improvement over existing mask-based or post-hoc methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the integration and orthogonality of the proposed modules in VS3D. We address each major comment below and plan to revise the manuscript accordingly to provide additional analysis and ablations.

point-by-point responses
  1. Referee: The central claim that RASI, PMG, and TAR can be combined inside the sampler without mutual interference or per-asset tuning is load-bearing but unsupported by any derivation or normalization analysis of their additive effects on v_t. No equation shows how the RASI anchor, PMG velocity-difference contrast, and TAR residual are scheduled across denoising steps or scaled so that PMG amplification does not override TAR preservation on the same token (see the integration description following the problem statement).

    Authors: We appreciate the referee highlighting the need for explicit integration details. The manuscript describes the modules as complementary interventions on distinct aspects of the velocity field (RASI on unconditional embeddings, PMG on velocity differences, TAR on token residuals), but we acknowledge the absence of a formal derivation or normalization analysis for their combined effects and scheduling. In the revision we will add a dedicated subsection with equations specifying the per-step scaling and scheduling of the three terms, along with empirical validation of non-interference across denoising stages. revision: yes

  2. Referee: The paper asserts that PMG is 'active only where a consistent edit exists' and TAR decides 'token by token,' yet provides no formal definition or threshold for consistency/agreement, nor any ablation demonstrating that these decisions remain stable when all three modules operate simultaneously. This leaves the orthogonality assumption untested and risks re-introducing identity drag or high-frequency artifacts at geometry/material stages.

    Authors: We agree that the current description of activation criteria and joint behavior is qualitative. The manuscript states that PMG activates on consistent edits and TAR operates token-wise, yet omits explicit thresholds and a combined ablation. We will revise to include formal definitions (e.g., velocity-difference threshold for PMG consistency and token-agreement score for TAR) and a new ablation table evaluating all three modules together, confirming stability and absence of introduced artifacts at geometry and material stages. revision: yes
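The formal definitions the rebuttal promises could take a shape like the following. Both forms and both thresholds (`tau_v`, `tau_d`) are our own illustrative assumptions about what "a velocity-difference threshold for PMG consistency" and "a token-agreement score for TAR" might mean; the authors' revision may define them differently.

```python
import numpy as np

def pmg_consistency(d1, d2, tau_v=0.5):
    """Hypothetical PMG test: an edit is consistent on a token when the
    two subsample velocity-difference estimates agree in sign and their
    mean magnitude clears tau_v."""
    return (np.sign(d1) == np.sign(d2)) & (np.abs(0.5 * (d1 + d2)) > tau_v)

def tar_agreement(twin_a, twin_b, tau_d=0.1):
    """Hypothetical TAR score: low twin discrepancy marks a token as
    preserved (inject the source residual), high discrepancy as edited,
    mirroring the keep-SLAT maps in Figures 4-5."""
    disc = np.abs(twin_a - twin_b)
    return disc <= tau_d, disc

d1 = np.array([0.02, 1.9, -0.03])
d2 = np.array([-0.01, 2.1, 0.04])
print(pmg_consistency(d1, d2))                               # [False  True False]
keep, disc = tar_agreement(np.array([0.0, 1.0]),
                           np.array([0.05, 1.8]))
print(keep)                                                  # [ True False]
```

Stating the definitions this explicitly would also make the promised joint ablation checkable: one can verify directly whether PMG amplification ever fires on a token TAR marks as preserved.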

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided manuscript text and abstract present VS3D as a conceptual framework that integrates three named interventions (RASI, PMG, TAR) inside a rectified-flow sampler to mitigate identity leakage, lack of edit amplification, and identity drag. No equations, derivations, fitted parameters, or self-citations appear that would reduce any claimed result to its own inputs by construction. The description remains at the level of module definitions and qualitative problem statements without a load-bearing mathematical chain that collapses into self-definition or renamed fits. The central claim therefore stands as an independent proposal rather than a tautological restatement of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the assumption that velocity-field interventions can be localized without side effects; the three modules are introduced as new constructs without independent prior validation.

axioms (2)
  • domain assumption A single velocity field in a rectified-flow generator cannot simultaneously be strong on the edit region and zero on preserved regions.
    Stated as the root cause of the three listed problems.
  • domain assumption Rectified-flow ODE sampling allows per-step injection of auxiliary signals without breaking the overall generative process.
    Required for all three modules to function during sampling.
invented entities (3)
  • Reconstruction-Anchored Source Injection (RASI) no independent evidence
    purpose: Absorbs identity leakage by turning unconditional embedding into per-step asset-specific anchor
    New module proposed to solve leakage; no independent evidence provided.
  • Partial-Mean Guidance (PMG) no independent evidence
    purpose: Amplifies edit signal by contrasting high- and low-quality subsample velocity estimates
    New module proposed to solve lack of dedicated amplification channel; no independent evidence provided.
  • Twin-Agreement Residual injection (TAR) no independent evidence
    purpose: Allows token-by-token decision on preservation at geometry and material stages
    New module proposed to solve identity drag; no independent evidence provided.

pith-pipeline@v0.9.0 · 5615 in / 1610 out tokens · 29369 ms · 2026-05-11T02:03:33.699679+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean — reality_from_one_distinction (link relevance: unclear)

    VS3D integrates three complementary modules... Reconstruction-Anchored Source Injection (RASI) absorbs identity leakage by turning the unconditional embedding into a per-step, asset-specific anchor... Partial-Mean Guidance (PMG) amplifies the edit signal by contrasting high- and low-quality subsample estimates... Twin-Agreement Residual injection (TAR) lets the sampler decide token by token what to preserve

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

  1. [1] R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6038–6047, 2023.

  2. [2] V. Kulikov, M. Kleiner, I. Huberman-Spiegelglas, and T. Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.

  3. [3] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

  4. [4] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

  5. [5] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  6. [6] J. Xiang, Z. Lv, S. Xu, Y. Deng, R. Wang, B. Zhang, D. Chen, X. Tong, and J. Yang. Structured 3D latents for scalable and versatile 3D generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21469–21480, 2025.

  7. [7] J. Xiang, X. Chen, S. Xu, R. Wang, Z. Lv, Y. Deng, H. Zhu, Y. Dong, H. Zhao, N. J. Yuan, and J. Yang. Native and compact structured latents for 3D generation. arXiv preprint arXiv:2512.14692, 2025.

  8. [8] L. Li, Z. Huang, H. Feng, G. Zhuang, R. Chen, C. Guo, and L. Sheng. VoxHammer: Training-free precise and coherent 3D editing in native 3D space. arXiv preprint arXiv:2508.19247, 2025.

  9. [9] S. Hu, Y. Wei, F. Zha, Y. Guo, and J. Zhang. Easy3E: Feed-forward 3D asset editing via rectified voxel flow. arXiv preprint arXiv:2602.21499, 2026.

  10. [10] T.-F. Hsiao, B.-K. Ruan, Y.-L. Liu, and H.-H. Shuai. VecSet-Edit: Unleashing pre-trained LRM for mesh editing from single image. arXiv preprint arXiv:2602.04349, 2026.

  11. [11] A. Barda, M. Gadelha, V. G. Kim, N. Aigerman, A. H. Bermano, and T. Groueix. Instant3dit: Multiview inpainting for fast editing of 3D objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16273–16282, 2025.

  12. [12] Y. Chi, X. Li, Z. Huang, and J. M. Rehg. Vinedresser3D: Agentic text-guided 3D editing. arXiv preprint arXiv:2602.19542, 2026.

  13. [13] I. Gat, D. Cohen-Bar, G. Levy, E. Richardson, and D. Cohen-Or. ShapeUP: Scalable image-conditioned 3D editing. arXiv preprint arXiv:2602.05676, 2026.

  14. [14] J. Ye, S. Xie, R. Zhao, Z. Wang, H. Yan, W. Zu, L. Ma, and J. Zhu. Nano3D: A training-free approach for efficient 3D editing without masks. arXiv preprint arXiv:2510.15019, 2025.

  15. [15] J. Huang, X. Hu, S. Shi, Z. Tian, and L. Jiang. Edit360: 2D image edits to 3D assets from any angle. arXiv preprint arXiv:2506.10507, 2025.

  16. [16] Z. Lai, Y. Zhao, H. Liu, Z. Zhao, Q. Lin, H. Shi, X. Yang, M. Yang, S. Yang, Y. Feng, S. Zhang, X. Huang, D. Luo, F. Yang, F. Yang, L. Wang, S. Liu, Y. Tang, Y. Cai, Z. He, T. Liu, Y. Liu, J. Jiang, Linus, J. Huang, and C. Guo. Hunyuan3D 2.5: Towards high-fidelity 3D assets generation with ultimate details. arXiv preprint arXiv:2506.16504, 2025.

  17. [17] Z. Li, Y. Wang, H. Zheng, Y. Luo, and B. Wen. Sparc3D: Sparse representation and construction for high-resolution 3D shapes modeling. arXiv preprint arXiv:2505.14521, 2025.

  18. [18] S. Wu, Y. Lin, F. Zhang, Y. Zeng, Y. Yang, Y. Bao, J. Qian, S. Zhu, X. Cao, P. Torr, and Y. Yao. Direct3D-S2: Gigascale 3D generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025.

  19. [19] Z. Lai, Y. Zhao, Z. Zhao, H. Liu, Q. Lin, J. Huang, C. Guo, and X. Yue. LATTICE: Democratize high-fidelity 3D generation at scale. arXiv preprint arXiv:2512.03052, 2025.

  20. [20] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3D objects. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023.

  21. [21] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 586–595, 2018.

  22. [22] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  23. [23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763, 2021.