
arxiv: 2605.16137 · v2 · pith:Y22NRFWCnew · submitted 2026-05-15 · 💻 cs.CV · cs.RO

STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System

Pith reviewed 2026-05-20 19:10 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords tabletop scene generation · embodied AI · physics-aware layout · LLM for spatial reasoning · simulation-ready scenes · progressive generation · pose correction

The pith

STABLE generates simulation-ready tabletop scenes by alternating a fine-tuned LLM with a physics pose corrector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create tabletop scenes directly from task instructions that can be dropped into physics simulators without manual fixes for collisions or floating objects. Pure LLM approaches fall short because they lack reliable 3D spatial understanding, so STABLE pairs a Semantic Reasoner trained on structured scene data with a Physics Corrector that uses flow-based denoising to adjust object poses. The two modules run in alternation under a progressive schedule that first places task-critical objects and then adds background items. This combination is meant to keep the output faithful to the original instructions while satisfying basic physical constraints. The result matters for Embodied AI because it supplies ready-to-use training environments at scale.

Core claim

STABLE consists of a Semantic Reasoner, a fine-tuned LLM that produces coarse layouts from task instructions, and a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates. By alternating between the two in a progressive generation process that grows the scene from task-critical objects outward, the system yields layouts that conform to the given instructions while meeting physical plausibility criteria, outperforming prior LLM-only methods on validity metrics.

What carries the argument

The semantics-physics dual system that alternates a fine-tuned LLM Semantic Reasoner with a flow-based Physics Corrector under a progressive object-addition schedule.
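The alternation described above can be sketched as a short control loop. This is a minimal illustration only, not the authors' implementation: the `Layout` container, the stage names, and the `propose` / `correct` callables are hypothetical stand-ins for the fine-tuned LLM and the flow-based corrector, and the toy functions exist only so that the loop runs end to end.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical pose format: (x, y, z, yaw).
Pose = Tuple[float, float, float, float]


@dataclass
class Layout:
    poses: Dict[str, Pose] = field(default_factory=dict)


# Assumed stage order, from task-critical objects to background objects.
STAGES: List[str] = ["task_objects", "important_background", "secondary_background"]


def generate_scene(
    instruction: str,
    propose: Callable[[str, str, Layout], Dict[str, Pose]],
    correct: Callable[[str, Layout], Layout],
    stages: List[str] = STAGES,
) -> Layout:
    """Alternate a semantic proposal step and a physics correction step per stage."""
    layout = Layout()
    for stage in stages:
        # Semantic step: coarse poses for the objects added at this stage.
        layout.poses.update(propose(instruction, stage, layout))
        # Physics step: adjust the poses of everything placed so far.
        layout = correct(instruction, layout)
    return layout


def toy_propose(instruction: str, stage: str, layout: Layout) -> Dict[str, Pose]:
    # Stand-in for the LLM: one object per stage, floating 5 cm above the table.
    index = len(layout.poses)
    return {f"{stage}_obj": (0.2 * index, 0.0, 0.05, 0.0)}


def toy_correct(instruction: str, layout: Layout) -> Layout:
    # Stand-in for the corrector: drop every object onto the table plane z = 0.
    fixed = {name: (x, y, 0.0, yaw) for name, (x, y, z, yaw) in layout.poses.items()}
    return Layout(poses=fixed)


scene = generate_scene("set the table for tea", toy_propose, toy_correct)
```

Running the corrector after every stage, rather than once at the end, means later proposals are conditioned on poses that have already been corrected, which is the point of the progressive schedule as the page describes it.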

If this is right

  • Generated scenes can be loaded directly into simulators without post-processing for collisions or floating objects.
  • Scene layouts remain faithful to the input task instructions even after physics corrections are applied.
  • Progressive addition of objects from critical to background maintains both semantic and physical consistency.
  • Physical validity metrics improve over methods that rely exclusively on large language models for layout prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alternating correction pattern could be tested on other scene types such as kitchen counters or warehouse shelves.
  • Replacing the flow-based corrector with a learned dynamics model might allow handling of more complex interactions like stacking.
  • The approach could reduce the amount of human annotation needed to create large-scale simulation datasets for robot training.
  • Integrating the system with real-time simulation feedback might enable iterative refinement when initial corrections fall short.

Load-bearing premise

The Physics Corrector can produce pose updates that remove physical violations while still keeping the layout aligned with the original task instructions.

What would settle it

Generate scenes from the same task instructions with STABLE and with a pure-LLM baseline, then run identical physics simulations and measure collision, penetration, and stability failure rates; if the rates are statistically indistinguishable, the dual-system advantage would be falsified.
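One way to make "statistically indistinguishable" concrete is a two-proportion z-test on failure counts. The sketch below is an assumption about how such a comparison might be run, not a procedure taken from the paper, and the counts in the usage lines are made-up placeholders rather than reported results.

```python
import math


def two_proportion_z_test(fail_a: int, n_a: int, fail_b: int, n_b: int):
    """Two-sided z-test for a difference between two failure rates."""
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both groups have identical all-pass or all-fail outcomes.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Placeholder counts for illustration only (failures, scenes).
z_same, p_same = two_proportion_z_test(10, 100, 10, 100)
z_diff, p_diff = two_proportion_z_test(5, 200, 40, 200)
```

The same test would be repeated per failure type (collision, penetration, stability), ideally with a correction for multiple comparisons.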

Figures

Figures reproduced from arXiv: 2605.16137 by Feng Zheng, Jiangmiao Pang, Jinkun Hao, Xudong Xu, Yanwei Fu, Yixuan Yang, Zhaoyang Lyu, Zhen Luo.

Figure 1: Overview of our proposed STABLE for tabletop scene generation. STABLE is a Semantics–Physics dual system that alternates between an LLM-based Semantic Reasoner and a geometry-aware flow-based Physics Corrector to generate diverse, task-aligned, and simulation-ready tabletop layouts.
Figure 2: Overview of our proposed framework. (a) The training pipeline. Our framework decomposes task-oriented layout generation into two decoupled modules. The Semantic Reasoner (left) is an LLM-based model trained to generate structured JSON layouts progressively in three levels: task-oriented objects ($O_t$), important background ($O_B$), and secondary background ($O_b$). The Physics Corrector (right) is a flow-base…
Figure 3: Qualitative comparison of task-conditioned tabletop scenes generated by STABLE and baselines. Red/yellow/blue boxes denote Physical Failure, Missing Objects, and Task Misalignment, respectively. STABLE yields task-aligned and physically plausible, simulation-ready layouts.
Panel labels: "Add more fruit to the bowl." · "Remove bowls from plates." · "Remove green plants." · "Add one more book"
Figure 5: Qualitative comparison on Rearrangement. Compared with StructDiffusion and LEGO-NET, STABLE produces more physically consistent layouts and better preserves functional relations under clutter.
Figure 6: Generalization Results. STABLE generalizes to unseen tabletop types, producing well-structured and collision-free layouts.
Interpolation path and target velocity. We use the standard linear path between endpoints: $x_t = (1 - t)\,x_0 + t\,x_1$, $t \sim U[0, 1]$. Under this path, the target velocity field is constant: $v_{\text{target}} = \frac{dx_t}{dt} = x_1 - x_0$. We parameterize the conditional velocity with a neural network $v_\theta(x_t, t, C)$ …
Figure 7: Generalization to unseen tabletop types. STABLE generalizes to tabletop types that are not included in MesaTask-10K, including nightstands, TV stands, and side tables. For each unseen tabletop type, task instructions are generated by GPT-4o. STABLE produces coherent, task-aligned, and physically plausible layouts on these out-of-distribution support surfaces.
Figure 8: Generalization to unseen object assets. We introduce 100 new high-quality assets generated by Hunyuan3D and prioritize this new asset set during test-time retrieval. STABLE remains effective under these unseen geometries, suggesting that the Physics Corrector learns transferable geometry-aware pose correction rather than memorizing the original asset library.
Figure 9: Controllability under ambiguous or conflicting instructions. We test STABLE with deliberately unusual or conflicting spatial constraints. Although STABLE is not designed as a dedicated conflict-resolution system, it follows the given instructions in a largely literal and controllable way while maintaining physically plausible layouts when possible.
Figure 10: More qualitative comparisons of task-to-scene generation results across Steerable, MesaTask, MesaTask with refine, and STABLE.
Figure 11: Additional qualitative results generated by STABLE.
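The linear interpolation path and constant target velocity quoted in the extract under Figure 6 can be turned into a small worked example. This is a hedged sketch in plain Python, not the paper's training code: the function names, the oracle velocity field, and the use of Euler integration for sampling are assumptions made for illustration. The 4D pose (3D position plus z-rotation) follows the description in the extracted reference text.

```python
from typing import Callable, List, Optional

Vector = List[float]
VelocityFn = Callable[[Vector, float, Optional[object]], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Linear path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Constant target velocity x1 - x0 along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(v_fn: VelocityFn, x0: Vector, x1: Vector, t: float, cond=None) -> float:
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = v_fn(x_t, t, cond)
    v_tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_tgt)) / len(v_tgt)


def euler_integrate(v_fn: VelocityFn, x0: Vector, steps: int = 10, cond=None) -> Vector:
    """Assumed sampler: integrate dx/dt = v(x, t, cond) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = v_fn(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy example: a floating, unrotated object (x0) and its corrected pose (x1).
x0 = [0.0, 0.0, 0.3, 0.0]   # (x, y, z, yaw), illustrative values
x1 = [0.1, 0.2, 0.0, 1.0]


def oracle(x_t: Vector, t: float, cond=None) -> Vector:
    # A perfect velocity field for this single pair; a trained network would replace it.
    return target_velocity(x0, x1)


loss = flow_matching_loss(oracle, x0, x1, t=0.37)
x_end = euler_integrate(oracle, x0, steps=10)
```

Because the target velocity is constant along a linear path, a perfect predictor reaches `x1` regardless of the number of Euler steps; with a learned, imperfect field the step count becomes a quality and speed trade-off.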
Original abstract

Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding object collisions or floating due to LLMs' inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual-system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, which ensures the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes STABLE, a semantics-physics dual system for generating simulation-ready tabletop scenes from task instructions. It combines a Semantic Reasoner (fine-tuned LLM on a structured tabletop scene dataset) that produces coarse layouts with a Physics Corrector (physics-aware flow-based denoising model) that outputs pose updates for physical plausibility. The system uses a progressive generation paradigm that alternates between the modules, starting with task-critical objects and incrementally adding background objects, with the central claim that this yields scenes strictly conforming to instructions and with significantly improved physical validity over prior LLM-only methods.

Significance. If the quantitative results hold, the work offers a practical pipeline for Embodied AI that mitigates LLM limitations in 3D spatial reasoning while incorporating physics constraints. The dual-system design and progressive paradigm could support more reliable simulation environments for robotics and task planning, provided the semantic-physics interplay is rigorously validated.

major comments (2)
  1. [§3.3] §3.3 (Progressive Generation Paradigm): The description of alternating Semantic Reasoner and Physics Corrector steps does not specify any conditioning, regularization, or constraint mechanism that ensures the flow-based pose updates preserve task-specific semantic relations (e.g., relative positions or containment implied by the instruction). Without such a mechanism, the Physics Corrector risks displacing objects in ways that violate earlier semantic decisions, which is load-bearing for the claim of strict task conformance.
  2. [§4] §4 (Experiments): The abstract asserts that STABLE 'significantly enhances the physical validity of scenes over prior art' and 'strictly conform[s] to task instructions,' yet the provided experimental summary lacks quantitative metrics, specific baselines, error bars, or ablation studies on the dual-system components. This absence prevents assessment of whether the improvements are statistically meaningful or attributable to the proposed architecture.
minor comments (2)
  1. [§2.2] The notation for the flow-based denoiser in §2.2 could be clarified by explicitly defining the conditioning inputs (task embedding, current layout) and the loss terms used during training.
  2. [Figure 3] Figure 3 caption should include the exact number of scenes and task instructions used in the qualitative examples to allow readers to gauge representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation.

  1. Referee: [§3.3] §3.3 (Progressive Generation Paradigm): The description of alternating Semantic Reasoner and Physics Corrector steps does not specify any conditioning, regularization, or constraint mechanism that ensures the flow-based pose updates preserve task-specific semantic relations (e.g., relative positions or containment implied by the instruction). Without such a mechanism, the Physics Corrector risks displacing objects in ways that violate earlier semantic decisions, which is load-bearing for the claim of strict task conformance.

    Authors: We thank the referee for this observation. The Physics Corrector is a flow-based model whose inputs include the task instruction embedding and the current coarse layout (object categories and poses) produced by the Semantic Reasoner; its training objective includes a term that penalizes large deviations from the initial semantic poses. This conditioning and regularization are intended to keep task-specific relations intact while correcting only physical violations. We agree that §3.3 would benefit from an explicit description of these mechanisms and will revise the section accordingly in the next version. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts that STABLE 'significantly enhances the physical validity of scenes over prior art' and 'strictly conform[s] to task instructions,' yet the provided experimental summary lacks quantitative metrics, specific baselines, error bars, or ablation studies on the dual-system components. This absence prevents assessment of whether the improvements are statistically meaningful or attributable to the proposed architecture.

    Authors: We acknowledge that the current experimental section would benefit from a clearer and more detailed presentation of the quantitative results. In the revised manuscript we will expand §4 to explicitly report the physical validity metric (percentage of collision-free and stable scenes under physics simulation), the task-conformance score, comparisons against LLM-only baselines, standard deviations across repeated trials, and ablation studies that isolate the contributions of the progressive paradigm and the Physics Corrector. These additions will make the statistical significance and architectural attribution more transparent. revision: yes
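The reporting promised in the second response (a validity percentage with standard deviations across repeated trials) could be aggregated along the following lines. This is a sketch under stated assumptions: the per-scene record format with `collision_free` and `stable` flags is hypothetical, and the sample trials are placeholders rather than results from the paper.

```python
import statistics
from typing import Dict, List, Tuple

SceneResult = Dict[str, bool]


def validity_rate(scene_results: List[SceneResult]) -> float:
    """Fraction of scenes that are both collision-free and stable."""
    ok = sum(1 for r in scene_results if r["collision_free"] and r["stable"])
    return ok / len(scene_results)


def summarize_trials(trials: List[List[SceneResult]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the validity rate across trials."""
    rates = [validity_rate(trial) for trial in trials]
    mean = statistics.mean(rates)
    std = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, std


# Placeholder data: two trials of two scenes each.
trials = [
    [{"collision_free": True, "stable": True}, {"collision_free": True, "stable": True}],
    [{"collision_free": True, "stable": True}, {"collision_free": False, "stable": True}],
]
mean_rate, std_rate = summarize_trials(trials)
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between STABLE and a baseline exceeds run-to-run variation.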

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents STABLE as a pipeline that integrates a fine-tuned LLM (Semantic Reasoner) with a physics-aware flow-based denoising model (Physics Corrector) under a progressive generation paradigm. No equations, fitted parameters, or self-referential definitions appear in the abstract or description that would reduce any claimed prediction or output to an input quantity by construction. The physical plausibility and semantic alignment claims are positioned as outcomes of the dual-system design rather than tautological redefinitions or renamings of known results. Self-citations, if present in the full text, are not load-bearing for the central architecture, which draws on standard LLM fine-tuning and flow-based models without importing uniqueness theorems or ansatzes from prior author work in a circular manner. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of fine-tuned LLMs for coarse layouts and flow-based models for physics corrections, plus the assumption that alternation preserves semantics; no new physical entities are postulated.

free parameters (1)
  • structured tabletop scene dataset
    The dataset used to fine-tune the Semantic Reasoner is a key input whose quality and coverage directly affect coarse layout generation.
axioms (2)
  • domain assumption: Fine-tuned LLMs can produce coarse layouts that are semantically aligned with task instructions
    This underpins the Semantic Reasoner module as described in the abstract.
  • domain assumption: A physics-aware flow-based model can refine object poses for physical plausibility without breaking semantic alignment
    This is required for the Physics Corrector to improve validity while keeping task conformance.

pith-pipeline@v0.9.0 · 5760 in / 1374 out tokens · 79418 ms · 2026-05-20T19:10:13.378979+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

    cs.CV 2026-05 unverdicted novelty 5.0

    Code-as-Room is an MLLM-based agentic pipeline that parses top-down images into multi-stage Blender code synthesis with cross-stage memory to generate functional 3D rooms.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    I-design: Personalized llm interior designer

    Çelen, A., Han, G., Schindler, K., Van Gool, L., Armeni, I., Obukhov, A., and Wang, X. I-design: Personalized llm interior designer. arXiv preprint arXiv:2404.02838,

  3. [3]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088,

  4. [4]

    Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning

    Hao, J., Liang, N., Luo, Z., Xu, X., Zhong, W., Yi, R., Jin, Y., Lyu, Z., Zheng, F., Ma, L., et al. Mesatask: Towards task-driven tabletop scene generation via 3d spatial reasoning. arXiv preprint arXiv:2509.22281,

  5. [5]

    Midi: Multi-instance diffusion for single image to 3d scene generation

    Huang, Z., Guo, Y., An, X., Yang, Y., Li, Y., Zou, Z., Liang, D., Liu, X., Cao, Y., and Sheng, L. Midi: Multi-instance diffusion for single image to 3d scene generation. arXiv preprint arXiv:2412.03558,

  6. [6]

    Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior

    Lin, C. and Mu, Y. Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. arXiv preprint arXiv:2402.04717,

  7. [7]

    PAT3D: Physics-Augmented Text-to-3D Scene Generation

    Lin, G., Huang, K., Liu, M., Gao, R., Chen, H., Chen, L., Lu, B., Komura, T., Liu, Y., Zhu, J.-Y., et al. Pat3d: Physics-augmented text-to-3d scene generation. arXiv preprint arXiv:2511.21978,

  8. [8]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

  9. [9]

    StructDiffusion: Object-centric diffusion for semantic rearrangement of novel objects

    Liu, W., Hermans, T., Chernova, S., and Paxton, C. StructDiffusion: Object-centric diffusion for semantic rearrangement of novel objects. In Workshop on Language and Robotics at CoRL 2022,

  10. [10]

    Steerable scene generation with post training and inference-time search

    Pfaff, N., Dai, H., Zakharov, S., Iwase, S., and Tedrake, R. Steerable scene generation with post training and inference-time search. arXiv preprint arXiv:2505.04831,

  11. [11]

    Layoutvlm: Differentiable optimization of 3d layout via vision-language models

    Sun, F.-Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., and Wu, J. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. arXiv preprint arXiv:2412.02193,

  12. [12]

    Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy

    Tian, Y., Yang, Y., Xie, Y., Cai, Z., Shi, X., Gao, N., Liu, H., Jiang, X., Qiu, Z., Yuan, F., et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025.

  13. [13]

    Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image

    Wang, Z., He, Y., Yang, L., Zou, W., Ma, H., Liu, L., Sui, W., Guo, Y., and Su, H. Tabletopgen: Instance-level interactive 3d tabletop scene generation from text or single image. arXiv preprint arXiv:2512.01204, 2025.

  14. [14]

    Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent

    Yang, Y., Jia, B., Zhang, S., and Huang, S. Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. Yang, Y., Lu, J., Zhao, Z., Luo, Z., Yu, J. J., Sanchez, V., and Zheng, F. Llplace: The 3d indoor scene layout generation and editing via la...

  15. [15]

    Cast: Component-aligned 3d scene reconstruction from an rgb image

    Yao, K., Zhang, L., Yan, X., Zeng, Y., Zhang, Q., Xu, L., Yang, W., Gu, J., and Yu, J. Cast: Component-aligned 3d scene reconstruction from an rgb image. arXiv preprint arXiv:2502.12894,

  16. [16]

    The denoising network is a 1D U-Net with a hidden dimension of 512 and self-conditioning

    with linear interpolation paths to learn the vector field that transports samples from a standard Gaussian prior to the data distribution. The denoising network is a 1D U-Net with a hidden dimension of 512 and self-conditioning. Each object is represented as a 4D vector (3D position + z-rotation), conditioned on 64-dimensional point cloud features and lea...

  17. [17]

    MesaTask generates a structured tabletop layout from the task instruction and retrieves 3D assets accordingly

    as a representative task-to-scene method. MesaTask generates a structured tabletop layout from the task instruction and retrieves 3D assets accordingly. We use the official preprocessing and evaluation protocol provided by MesaTask. Holodeck-Table. We adopt the tabletop adaptation of HOLODECK (Yang et al., 2024b) provided in MesaTask. Concretely, the pipel...

  18. [18]

    to the layouts generated by MesaTask, following the same solver configuration and stopping criteria as in the original implementation. Steerable. We also compare against a steerable post-processing baseline (Pfaff et al., 2025), where we first use our Semantic Reasoner to generate a coarse (potentially colliding) layout from the task instruction and then f...