
arxiv: 2606.07602 · v1 · pith:BUFKONVEnew · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

Pith reviewed 2026-06-28 23:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LEGO assembly generation · physical reasoning · reinforcement learning · LLM post-training · spatial-physics reasoning · data selection · PhysHack

The pith

Physical validity alone lets LLMs generate LEGO structures that are geometrically misaligned and semantically inconsistent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a failure mode called PhysHack, in which models produce physically valid LEGO assemblies that nevertheless violate geometric and semantic requirements. It shows that standard validity signals are too weak to prevent this, so models can optimize for constraint satisfaction without fidelity to the intended design. To fix this, the authors introduce a model-based filter that selects a small subset of training trajectories, then train on that subset with PVPO, a reinforcement-learning procedure that adds voxel-space geometric rewards to the physical-feasibility objective. The reported result is better alignment, stability, and calibration, with less need for heavy post-hoc rejection sampling.

Core claim

Physical validity is an insufficient proxy for reliable physical reasoning because models can satisfy validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated; PVPO mitigates this by training on a small, model-selected set of trajectories and coupling physical feasibility with voxel-space geometric rewards.

What carries the argument

PVPO, a sample-efficient reinforcement-learning method that adds voxel-space geometric rewards to physical-feasibility training on model-selected trajectories.
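To make the idea of coupling feasibility with a voxel-space reward concrete, here is a minimal sketch. It is not the paper's reward: the additive form, the use of intersection-over-union as the voxel term, and the names `voxel_iou`, `combined_reward`, and `lam` are all assumptions made for illustration (the page only mentions a "voxel weight λ" in the Figure 4 caption).

```python
import numpy as np

def voxel_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-union of two boolean occupancy grids of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("occupancy grids must have the same shape")
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

def combined_reward(is_valid: bool, pred, target, lam: float = 0.5) -> float:
    """Hypothetical reward: validity indicator plus lam times voxel IoU."""
    return float(is_valid) + lam * voxel_iou(pred, target)

# Toy target: a 2x2x2 block inside a 4x4x4 grid.
target = np.zeros((4, 4, 4), dtype=bool)
target[:2, :2, :2] = True

# Two physically "valid" candidates (validity is assumed, not simulated here).
tiny = np.zeros_like(target)
tiny[0, 0, 0] = True          # a single voxel: valid but far from the target
full = target.copy()          # matches the target exactly

print(combined_reward(True, tiny, target, lam=0.0))  # validity only
print(combined_reward(True, full, target, lam=0.0))  # validity only
print(combined_reward(True, tiny, target, lam=0.5))
print(combined_reward(True, full, target, lam=0.5))
```

With `lam=0.0` the two candidates receive the same reward, which is the PhysHack situation in miniature; any positive `lam` separates them in this toy case.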

If this is right

  • Generated assemblies exhibit higher structural and semantic alignment with the input specification.
  • Physical validity and structural stability both increase without extra rejection sampling at test time.
  • Test-time selection becomes more predictive of final semantic and structural quality.
  • The same data-selection-plus-reward approach works across different model backbones and test-time scaling regimes.
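The test-time selection mentioned above can be illustrated with a small best@K sketch: draw K candidates, pick the one a selection score ranks highest, and then ask how good that pick is under a separate quality measure. The candidate fields (`validity`, `alignment`) and the function name `best_at_k` are hypothetical; the paper's actual selection metrics are not described on this page.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_at_k(
    candidates: Sequence[T],
    select_score: Callable[[T], float],
    quality: Callable[[T], float],
    k: int,
) -> float:
    """Quality of the candidate that select_score ranks first among the first k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pool = candidates[:k]
    if not pool:
        raise ValueError("no candidates to select from")
    chosen = max(pool, key=select_score)  # ties resolve to the earliest candidate
    return quality(chosen)

# Toy candidates: all are (nearly) valid, but alignment differs a lot.
samples = [
    {"validity": 1.0, "alignment": 0.2},
    {"validity": 0.9, "alignment": 0.8},
    {"validity": 1.0, "alignment": 0.3},
]

by_validity = best_at_k(
    samples, lambda c: c["validity"], lambda c: c["alignment"], k=3
)
by_combined = best_at_k(
    samples, lambda c: c["validity"] + c["alignment"], lambda c: c["alignment"], k=3
)
print(by_validity, by_combined)
```

In this toy pool, selecting on validity alone returns a poorly aligned structure, while a score that also reflects alignment picks the better one. That gap is what "more predictive selection" would close.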

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection-plus-geometric-reward pattern could be tested on other spatial-physics generation tasks such as molecule or furniture design.
  • If the data filter is the main source of gain, simpler non-RL selection methods might achieve similar calibration improvements.
  • Calibration gains suggest that internal model scores become usable for ranking without external verifiers.
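Since the calibration claim is reported via ECE (see the Figure 5 caption), a short sketch of the standard binned expected calibration error may help. How the paper defines a "correct" outcome is not stated here, so the binary `correct` labels below (for example, "the selected assembly exceeded some quality threshold") are an assumption, as are the bin count and function name.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if confidences.shape != correct.shape:
        raise ValueError("confidences and correct must have the same shape")
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in [0, n_bins - 1];
    # a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return float(ece)

# A perfectly calibrated toy scorer and an overconfident one.
calibrated = expected_calibration_error([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(calibrated, overconfident)
```

A lower value means the confidence used for ranking tracks actual quality more closely, which is the property the bullet above depends on.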

Load-bearing premise

A model can pick a small fraction of trajectories that preserve semantic and geometric fidelity without large-scale human labels or exhaustive verification.
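As a rough illustration of what "picking a small fraction" means operationally, the sketch below keeps the top-scoring share of trajectories given per-trajectory scores. The scores themselves are the hard part and are not specified on this page; `select_top_fraction` and the example scores are hypothetical, and the paper's selector may work quite differently.

```python
import numpy as np

def select_top_fraction(scores, fraction: float) -> np.ndarray:
    """Return the indices (in ascending order) of the highest-scoring fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(np.ceil(fraction * len(scores))))
    # Stable sort on negated scores: highest first, ties keep original order.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:n_keep])

# Hypothetical model-assigned value scores for four trajectories.
trajectory_scores = [0.1, 0.9, 0.5, 0.7]
kept = select_top_fraction(trajectory_scores, fraction=0.5)
print(kept)
```

Whether the retained subset actually preserves semantic and geometric fidelity depends entirely on how well the scores correlate with those properties, which is exactly the premise in question.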

What would settle it

A controlled experiment in which PVPO-trained models show no measurable gain in semantic or geometric alignment over baselines trained only on validity signals when evaluated on held-out prompts.
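One way such a comparison could be scored is a paired bootstrap over held-out prompts, sketched below. This is an illustrative analysis choice, not something the paper is known to use; the function name, the resample count, and the toy numbers are all assumptions.

```python
import numpy as np

def paired_bootstrap_ci(treated, baseline, n_resamples: int = 10_000,
                        alpha: float = 0.05, seed: int = 0):
    """Mean per-prompt difference (treated - baseline) with a percentile bootstrap CI."""
    treated = np.asarray(treated, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    if treated.shape != baseline.shape or treated.ndim != 1 or treated.size == 0:
        raise ValueError("inputs must be non-empty 1-D arrays of equal length")
    diffs = treated - baseline  # paired by prompt
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(low), float(high)

# Toy per-prompt alignment scores (made up for illustration only).
baseline_scores = np.array([0.20, 0.40, 0.60, 0.50, 0.30])
treated_scores = baseline_scores + 0.10
mean_gain, ci_low, ci_high = paired_bootstrap_ci(treated_scores, baseline_scores)
print(mean_gain, ci_low, ci_high)
```

An interval that comfortably includes zero on real held-out prompts would be the kind of null result described above; an interval clearly above zero would count against it.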

Figures

Figures reproduced from arXiv: 2606.07602 by Ge Lin Kan, Minghao Liu, Weiyang Liu, Yuhuan Yuan, Zhouliang Yu.

Figure 1: Qualitative examples of PVPO-generated LEGO structures across diverse object categories. The scores shown below each example correspond to Qwen-VL, DINOv3, and CLIP evaluations.

Figure 2: Test-Time Scaling on Physics–Structure Alignment. Best@K results evaluated by Qwen2-VL, CLIP, and DINOv3 under different selection metrics. PVPO consistently outperforms the full-dataset training baseline.

Figure 3: Examples of physically valid LEGO assem…

Figure 4: Left: Semantic alignment (Qwen-VL/CLIP/DINOv3), physics validity, voxel alignment, and generated-brick number under different voxel weight λ on Qwen2.5-3B-Instruct. Right: Stability@K (%) versus regeneration attempts K during the test-time inference on Qwen2.5-3B-Instruct (K=1–16) and Llama3.2-1B-Instruct (K=1)

Figure 5: Confidence calibration measured by ECE under three best@k selection mechanisms. Row blocks…

Figure 6: Qualitative comparison on two representative LEGO generation tasks. Bottle (left) and square table (right) show the generated structures from Full Data, PVPO, High-Value training, and ground truth. PVPO and High-Value produce cleaner geometry and better visual alignment than Full Data, black bricks indicate collisions.
Original abstract

LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies a data-induced failure mode termed PhysHack in LLM-based LEGO assembly generation, where models produce physically valid structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. It proposes a model-based data selection method using only a small fraction of trajectories, followed by PVPO, a sample-efficient RL post-training approach that couples physical feasibility constraints with voxel-space geometric rewards. The central claims are that physical validity alone is an insufficient proxy for reliable reasoning and that PVPO improves structural/semantic alignment, physical validity, structural stability, and calibration while reducing reliance on post-hoc rejection sampling, with particular gains in making test-time selection predictive of quality.

Significance. If the empirical results hold with rigorous validation, the work would be significant for sample-efficient post-training of LLMs on spatial-physics tasks. It provides a concrete demonstration that validity proxies can be gamed and offers a method to mitigate misalignment without exhaustive human labels or rejection sampling. The focus on calibration and test-time predictability addresses a practical bottleneck in deploying such models.

major comments (2)
  1. [Abstract] The central empirical claims (PVPO improves structural and semantic alignment, physical validity, structural stability, calibration, and mitigates PhysHack) are stated without any quantitative metrics, baselines, ablation studies, error bars, or statistical details, which is load-bearing because the soundness of the method and the insufficiency of validity proxies cannot be assessed from the given text.
  2. [Abstract] The model-based data selection step is presented as reliably identifying a small fraction of trajectories that preserve semantic and geometric fidelity, but no description is supplied of selector training, held-out validation sets, or correlation with independent human semantic/geometric labels; this is load-bearing for the claim that selection corrects rather than reinforces PhysHack.
minor comments (1)
  1. [Abstract] The acronyms PhysHack and PVPO are introduced without explicit expansion or prior reference on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and propose targeted revisions to strengthen the abstract's presentation of our claims while preserving its role as a concise summary.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claims (PVPO improves structural and semantic alignment, physical validity, structural stability, calibration, and mitigates PhysHack) are stated without any quantitative metrics, baselines, ablation studies, error bars, or statistical details, which is load-bearing because the soundness of the method and the insufficiency of validity proxies cannot be assessed from the given text.

    Authors: We agree that the abstract would benefit from including select quantitative indicators to better convey the scale of improvements. The full manuscript already reports these details (including baselines, ablations, error bars, and statistical tests) in Sections 4–5. In revision we will add concise quantitative highlights to the abstract, such as relative gains in semantic alignment and validity, while keeping the text within length constraints.
    revision: yes

  2. Referee: [Abstract] The model-based data selection step is presented as reliably identifying a small fraction of trajectories that preserve semantic and geometric fidelity, but no description is supplied of selector training, held-out validation sets, or correlation with independent human semantic/geometric labels; this is load-bearing for the claim that selection corrects rather than reinforces PhysHack.

    Authors: The abstract summarizes the high-level approach; the manuscript details the selector training procedure, held-out validation, and correlation analysis with human labels in Section 3.2. To address the concern directly in the abstract, we will insert a brief clause referencing the validation process and its alignment with human judgments.
    revision: yes

Circularity Check

0 steps flagged

No circularity; claims are empirical with no derivation chain

Full rationale

The paper presents an empirical method (model-based data selection + PVPO RL) and reports experimental improvements in alignment, validity, stability, and calibration. No equations, derivations, or mathematical claims appear in the abstract or described content. All central assertions (e.g., PVPO mitigates PhysHack, physical validity is insufficient) are framed as experimental outcomes rather than derived quantities. No self-definitional, fitted-input, or self-citation load-bearing steps exist that reduce to inputs by construction. The work is self-contained against external benchmarks via reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities beyond the named method and failure mode can be extracted.

invented entities (2)
  • PhysHack · no independent evidence
    purpose: Names the data-induced failure mode in which physical-validity constraints are satisfied while semantic or geometric fidelity is lost.
    Introduced in the abstract to characterize the observed failure mode.
  • PVPO · no independent evidence
    purpose: Sample-efficient reinforcement learning algorithm that couples physical feasibility with voxel-space geometric rewards.
    New method proposed in the abstract.

pith-pipeline@v0.9.1-grok · 5717 in / 1196 out tokens · 23681 ms · 2026-06-28T23:37:13.820588+00:00 · methodology

