
arxiv: 2606.20364 · v1 · pith:TOEKK7EEnew · submitted 2026-06-18 · 💻 cs.LG

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

Pith reviewed 2026-06-26 18:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords VLM judge · 3D mesh generation · single-image to 3D · de-biasing · parameter-efficient adaptation · win-rate evaluation · conditioner repair · furniture assets

The pith

Under a hardened VLM-as-3D-judge protocol, the best lightweight adaptation reaches parity with the base generator, and no adaptation clears the 65 percent win-rate target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a de-biased VLM judge, originally developed for ranking single-image 3D mesh quality, can be hardened into an optimization signal that specializes the base generator (TRELLIS) on furniture assets using only public models and lightweight adaptation. It separates the training judge from the evaluation judge, applies position-bias correction, and fixes three failure modes, producing an independent signal whose calibration shows clear-gap win-rates of 0.83-1.0 and base-vs-base win-rates near 0.5. Independent base samples show a 0.94 order-flip rate, so any preference signal has to be engineered through quality-contrastive pairs. Across six methods, two input regimes, and a severity sweep, the strongest result is parity (0.50 win-rate) from conditioner repair under severe degradation, and no method clears the 65 percent target. The paper reads this outcome mechanistically: clean inputs saturate the judge, flow-based fine-tuning washes out through the sampler, and conditioning repair is the only locus that moves geometry. Win-rates are directional at n=8 objects.

Core claim

The central claim is that converting the de-biased VLM-as-3D-judge from ranking to optimization requires explicit hardening against circularity and saturation, after which lightweight parameter-efficient adaptations on public data match but do not surpass the strong base generator, with the mechanistic limit that base samples carry essentially no learnable preference.

What carries the argument

The hardened VLM-as-3D-judge that separates training and evaluation models, corrects position bias, and repairs three failure modes (image overload, geometry-hiding splat renders, reference-free judging) to supply an independent optimization signal.
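Position-bias correction and the order-flip rate can be illustrated with a small sketch. The paper's exact correction is not spelled out in the abstract, so this assumes a common approach: query the judge in both presentation orders and accept a verdict only when it survives the swap. It also assumes "order-flip rate" means the fraction of pairs whose verdict changes with presentation order, and that order-dependent verdicts count as half a win. The `Judge` signature and both stub judges are hypothetical stand-ins for a real VLM call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (reference_image, first, second) -> "first" | "second".
Judge = Callable[[str, str, str], str]
Pair = Tuple[str, str, str]  # (reference_image, candidate_a, candidate_b)


def debiased_verdict(judge: Judge, ref: str, a: str, b: str) -> str:
    """Ask the judge in both presentation orders; return 'a', 'b', or 'flip'."""
    a_wins_when_first = judge(ref, a, b) == "first"
    a_wins_when_second = judge(ref, b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    return "flip"  # the verdict followed the slot, not the content


def order_flip_rate(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs whose verdict changes when the order is swapped."""
    flips = sum(debiased_verdict(judge, *p) == "flip" for p in pairs)
    return flips / len(pairs)


def win_rate(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Win-rate of candidate_a; order-dependent verdicts count as half a win (assumption)."""
    score = {"a": 1.0, "b": 0.0, "flip": 0.5}
    return sum(score[debiased_verdict(judge, *p)] for p in pairs) / len(pairs)


# Stub judges standing in for a real VLM call.
def slot_biased(ref, first, second):
    return "first"  # always prefers whatever is shown first


def content_based(ref, first, second):
    return "first" if "good" in first else "second"  # toy rule keyed on the name


pairs = [("img0", "good_mesh", "bad_mesh"), ("img1", "good_mesh", "bad_mesh")]
print(order_flip_rate(slot_biased, pairs), win_rate(slot_biased, pairs))      # 1.0 0.5
print(order_flip_rate(content_based, pairs), win_rate(content_based, pairs))  # 0.0 1.0
```

In this toy setup, a judge driven purely by slot position flips on every pair and lands at a 0.5 win-rate, which is the pattern the base-vs-base calibration figure (about 0.5) and the 0.94 order-flip rate point toward.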

If this is right

  • Independent base samples carry essentially no learnable preference, requiring quality-contrastive construction for any signal.
  • Conditioning repair under severe degradation is the only locus that moves geometry; other adaptations wash out through the sampler.
  • Matching a strong public-data base with cheap adaptation shows that exceeding it requires more than lightweight PEFT on public data.
  • The hardened judge protocol functions as a reusable independent evaluator for 3D generation quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The saturation on clean inputs implies the judge may be most useful when inputs are deliberately degraded or paired with lower-quality contrasts.
  • The result could extend to testing whether heavier adaptation techniques or private data sources would be needed to surpass the base.
  • The protocol may apply to other single-image generation domains where cheap proxies fail but a hardened VLM can supply directional preference.

Load-bearing premise

The judge supplies an independent, non-saturated optimization signal that can be used to specialize the generator without the signal being washed out by the sampler or already maximized on clean base outputs.

What would settle it

An adaptation method that produces a win-rate of 65 percent or higher against the base generator on the n=8 test objects would falsify the result that no method clears the target.
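To make the threshold concrete, the arithmetic below works out what clearing 65 percent means at n=8. It is a sketch under stated assumptions: one decided comparison per object (no ties, no repeated samples), which the abstract does not confirm, and a fair-coin null for a method that is truly at parity. The function names are illustrative.

```python
from math import comb


def min_wins_to_clear(n: int, target: float) -> int:
    """Smallest number of wins k such that k / n >= target."""
    return next(k for k in range(n + 1) if k / n >= target)


def chance_of_at_least(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of clearing by luck alone."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n, target = 8, 0.65
k = min_wins_to_clear(n, target)
print(k, k / n)                            # 6 0.75
print(round(chance_of_at_least(k, n), 4))  # 0.1445
```

Under these assumptions the target is effectively 6 of 8 wins, and a method at true parity would still reach it about one time in seven, so a single run that clears 65 percent at n=8 would be suggestive rather than decisive.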

Original abstract

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a hardened de-biased VLM-as-3D-judge protocol (distinct training judge Qwen2.5-VL-7B and evaluation judge InternVL3-8B, with position-bias correction and fixes for image overload, splat rendering, and reference-free judging) to optimize the TRELLIS single-image-to-3D generator on furniture assets via six lightweight PEFT adaptation methods. It reports that independent base samples show no learnable preference (0.94 order-flip rate), that the most effective method (conditioner repair under severe degradation) reaches only parity (0.50 win-rate) with the base, and that no method exceeds the >=65% target; the result is attributed to judge saturation on clean inputs and signal washout through the sampler. Calibration evidence is provided (clear-gap win-rates 0.83-1.0; base-vs-base ~0.5), and win-rates are described as directional at n=8 objects.

Significance. If the empirical findings hold after addressing sample-size limitations, the work would usefully document the practical barriers to turning a ranking-grade VLM judge into an optimization signal for 3D generation: clean base outputs already saturate the judge, flow-DiT fine-tuning erases preference information, and only targeted conditioning repair moves geometry. The separation of judges and the calibration protocol constitute reusable methodological contributions that future work can adopt. The negative result on public-data lightweight adaptation also supplies a concrete baseline indicating that exceeding strong open generators will require either larger-scale data, architectural changes, or stronger preference signals.

major comments (2)
  1. [Results / abstract] Results section (and abstract): All win-rate claims rest on n=8 objects. For a binomial proportion at n=8, the 95% CI around an observed 0.50 is approximately [0.22, 0.78] (Wilson score interval); an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.
  2. [Methods / evaluation protocol] Sections on adaptation methods and evaluation protocol: The abstract does not detail exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.
minor comments (2)
  1. [Results] The manuscript should explicitly state the exact number of objects, prompts, and renderings used for each win-rate comparison and whether the same 8 objects were used across all conditions.
  2. [Calibration] Clarify whether the 0.94 order-flip rate on base samples was measured on the same n=8 objects or a larger held-out set.
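The interval width raised in major comment 1 can be checked directly. The sketch below uses the Wilson score interval, which is an assumption on our part since the report does not name its method; other constructions (Wald, Clopper-Pearson) give somewhat wider bounds at this sample size. It also assumes one comparison per object.

```python
from math import sqrt


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for wins in (4, 5, 6):
    lo, hi = wilson_interval(wins, 8)
    print(f"{wins}/8 = {wins / 8:.3f}: [{lo:.2f}, {hi:.2f}]")
# 4/8 = 0.500: [0.22, 0.78]
# 5/8 = 0.625: [0.31, 0.86]
# 6/8 = 0.750: [0.41, 0.93]
```

All three intervals contain 0.50, and the interval for 4 of 8 also contains 0.65, so at n=8 the data cannot statistically separate parity from the target.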

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for recognizing the methodological contributions of the de-biased judge protocol. We address the two major comments below.

  1. Referee: [Results / abstract] All win-rate claims rest on n=8 objects. For a binomial proportion at n=8, the 95% CI around an observed 0.50 is approximately [0.22, 0.78] (Wilson score interval); an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.

    Authors: We agree the sample size limits statistical power and will add 95% binomial confidence intervals, standard errors, and an explicit discussion of overlap with 0.5 to the results section and abstract. The manuscript already qualifies results as directional at n=8; we will further temper language around the >=65% target and parity claim to reflect uncertainty. A note on the absence of formal hypothesis testing will be included. We cannot expand to larger n within this study. revision: partial

  2. Referee: [Methods / evaluation protocol] The abstract does not detail exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.

    Authors: The full manuscript already describes the six PEFT methods, input regimes, and evaluation protocol in the methods section. To improve verifiability we will expand the methods and supplementary material with exact hyperparameters, dataset sizes, and any statistical considerations used. The abstract will be revised to indicate that full implementation details are provided in the paper. revision: yes

standing simulated objections not resolved
  • Increasing the evaluation set beyond n=8 objects is not feasible due to the computational cost of 3D generation and VLM judging.

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation results are independent of judge construction

full rationale

The paper's central finding—that no adaptation method exceeds the 65% win-rate target and the best reaches only parity—is an empirical outcome measured on n=8 objects using a hardened judge protocol. The protocol explicitly separates the training judge (Qwen2.5-VL-7B) from the evaluation judge (InternVL3-8B) and reports base-vs-base win-rates near 0.5 as calibration. These steps prevent the optimization signal from being self-referential. No derivation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from overlapping-author prior work, and the negative result is not forced by the inputs. The work is self-contained against the reported public-model benchmarks and internal calibration checks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the companion study's judge being reliable after hardening and on the assumption that public data and models suffice to test whether preference signals can be learned via lightweight adaptation.


