ActionMap: Robot Policy Learning via Voxel Action Heatmap
Pith reviewed 2026-06-27 21:56 UTC · model grok-4.3
The pith
A voxel heatmap action head improves VLA model performance across backbones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a voxel heatmap action head, which assigns a probability to every voxel in a discretized action space rather than regressing to a single point, can be dropped into existing VLA backbones and produces higher task success, comparable or faster convergence, and markedly better data efficiency on both simulated and real robot manipulation benchmarks.
What carries the argument
The argument rests on the voxel heatmap action head, which outputs a probability volume over the discretized action space so that geometric proximity between actions is directly encoded in the training signal.
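One way to picture how proximity can enter the training signal is a soft target spread over neighboring voxels, scored with cross-entropy. The summary above does not say how the target heatmap is built, so the Gaussian smoothing, the `sigma` value, the 4×4×4 grid, the action bounds, and every function name below are illustrative assumptions rather than the paper's method.

```python
import math
from itertools import product

def voxel_centers(lo, hi, n):
    """Centers of n equal-width bins covering [lo, hi] along one axis."""
    step = (hi - lo) / n
    return [lo + (i + 0.5) * step for i in range(n)]

def gaussian_target(action, lo, hi, n, sigma):
    """Assumed soft target: Gaussian weights around `action`, normalized to sum to 1."""
    centers = voxel_centers(lo, hi, n)
    weights = {}
    for idx in product(range(n), repeat=3):
        d2 = sum((centers[i] - a) ** 2 for i, a in zip(idx, action))
        weights[idx] = math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(weights.values())
    return {idx: w / total for idx, w in weights.items()}

def peaked_prediction(peak, n, eps=0.01):
    """Toy prediction: mass 1 - eps on one voxel, eps spread over the rest."""
    others = n ** 3 - 1
    return {idx: (1.0 - eps) if idx == peak else eps / others
            for idx in product(range(n), repeat=3)}

def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(probs[idx]) for idx, t in target.items())

N = 4  # tiny grid for illustration only
target = gaussian_target((0.1, 0.1, 0.1), -1.0, 1.0, N, sigma=0.5)
best_voxel = max(target, key=target.get)

# A prediction one voxel away from the best voxel vs. one in the far corner.
loss_near = cross_entropy(target, peaked_prediction((1, 2, 2), N))
loss_far = cross_entropy(target, peaked_prediction((0, 0, 0), N))
print(best_voxel, loss_near < loss_far)
```

Under this assumed target, a near miss is penalized less than a far miss, which is the kind of graded signal a single-point regression target does not provide by construction.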
If this is right
- The heatmap head produces higher average success than the original single-point heads on the same backbones.
- Convergence occurs at comparable or faster rates across the tested architectures.
- Data efficiency improves noticeably when training sets are reduced in size.
- The advantage appears consistently when the head is inserted into architecturally distinct models.
Where Pith is reading between the lines
- The same probability-volume approach could be tested on continuous control problems outside vision-language-action settings.
- Voxel resolution and action-space bounds become explicit hyperparameters that future designs might optimize separately from the backbone.
- If the geometric encoding is the active ingredient, similar structured representations may help other regression-style outputs in robotics.
Load-bearing premise
The observed gains come from the voxel heatmap representation rather than from any other unstated change in training procedure or implementation when the head is swapped in.
What would settle it
A controlled swap of the heatmap head into the identical backbones with every other training detail held fixed that shows the success-rate gap closing to zero.
Original abstract
Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://showlab.github.io/ActionMap/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ActionMap, a voxel heatmap action head for vision-language-action (VLA) models that replaces single-point predictors (L1 regression, autoregressive bins, or flow matching) with a probability distribution over a discretized voxel grid in action space. The head is claimed to drop into existing VLAs; experiments on LIBERO (four-suite average) and real Franka manipulation report +8.2% gains over OpenVLA-OFT's L1 head at matched training steps, comparable or faster convergence on two architecturally distinct backbones, and improved data efficiency at low data regimes. The cross-backbone consistency is presented as evidence that action representation is an independent performance lever.
Significance. If the gains are shown to arise specifically from the voxel heatmap representation under controlled conditions, the result would establish action decoding as a meaningful design axis in VLAs separate from backbone or data scaling. The reported consistency across backbones is a positive indicator of robustness; the absence of parameter-free derivations or machine-checked proofs means significance rests entirely on the quality of the empirical isolation.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the claim that the heatmap head produces the reported gains (+8.2% LIBERO average, better data efficiency) when 'dropped into' existing VLAs requires explicit confirmation that voxel grid resolution, probability normalization, action recovery method (e.g., expectation vs. argmax), and loss scaling were held identical to the baseline L1 regression. No such controlled swap protocol or ablation is described, leaving open the possibility that performance differences arise from unstated implementation choices rather than the representation itself.
- [§3] §3 (Method): the conversion from voxel heatmap to continuous action and the precise loss (cross-entropy over voxels) must be shown to have matched effective capacity to the original L1 head; without this, the attribution of faster convergence and data efficiency to geometric proximity exploitation cannot be isolated from differences in optimization landscape or output dimensionality.
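The controlled-swap check the referee asks for can be sketched as a configuration diff: two run configurations should differ in the action head and nothing else. The keys and values below are hypothetical placeholders, not settings reported by the paper.

```python
def config_diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Hypothetical baseline run; none of these numbers come from the paper.
baseline = {
    "backbone": "OpenVLA-OFT",
    "action_head": "l1_regression",
    "train_steps": 50000,
    "batch_size": 8,
    "learning_rate": 5e-4,
    "seed": 0,
}

# The swapped run changes only the head.
heatmap = dict(baseline, action_head="voxel_heatmap")

diff = config_diff(baseline, heatmap)
print(diff)
```

A swap is controlled in this sense only when the diff contains the single key `action_head`; any other entry is a confound that would need its own ablation.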
minor comments (2)
- [Abstract] Abstract: the statement 'converges at comparable or faster rates' should specify the metric (e.g., steps to reach 80% success) and include error bars or seed counts for all reported numbers.
- [§4] Figure captions and §4: voxel grid resolution and action-space bounds are not stated; these parameters directly affect the heatmap representation and should be reported for reproducibility.
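A convergence metric of the kind suggested in the first minor comment, steps to reach a success threshold reported with seed counts and spread, could be computed as follows. The curves are synthetic placeholders chosen for illustration, not results from the paper, and the function names are assumptions.

```python
from statistics import mean, stdev

def steps_to_threshold(curve, threshold):
    """First training step whose success rate reaches `threshold`, else None.

    `curve` is a list of (step, success_rate) pairs sorted by step.
    """
    for step, rate in curve:
        if rate >= threshold:
            return step
    return None

def summarize(curves_by_seed, threshold=0.8):
    """Seed count, number of seeds that reached the threshold, and mean/std of steps."""
    hits = [steps_to_threshold(c, threshold) for c in curves_by_seed]
    reached = [h for h in hits if h is not None]
    return {
        "n_seeds": len(hits),
        "n_reached": len(reached),
        "mean_steps": mean(reached) if reached else None,
        "std_steps": stdev(reached) if len(reached) > 1 else None,
    }

# Synthetic evaluation curves for three seeds (placeholders only).
curves = [
    [(1000, 0.50), (2000, 0.82), (3000, 0.90)],
    [(1000, 0.60), (2000, 0.79), (3000, 0.85)],
    [(1000, 0.40), (2000, 0.70), (3000, 0.78)],
]
summary = summarize(curves)
print(summary)
```

Reporting `n_reached` alongside the mean matters because seeds that never reach the threshold would otherwise silently drop out of the average.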
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The two major comments both concern the need for explicit documentation of the controlled head-swap protocol and capacity matching. We address each point below and will revise the manuscript accordingly.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that the heatmap head produces the reported gains (+8.2% LIBERO average, better data efficiency) when 'dropped into' existing VLAs requires explicit confirmation that voxel grid resolution, probability normalization, action recovery method (e.g., expectation vs. argmax), and loss scaling were held identical to the baseline L1 regression. No such controlled swap protocol or ablation is described, leaving open the possibility that performance differences arise from unstated implementation choices rather than the representation itself.
Authors: We agree that a clear description of the controlled swap is required. In the original experiments the voxel grid was fixed at 64×64×64 for the three action dimensions, probabilities were obtained via softmax, the continuous action was recovered as the expectation (weighted sum of voxel centers), and the loss was scaled so that its magnitude matched the L1 baseline at initialization. These choices were held constant across all reported backbone swaps. We will add a dedicated paragraph in §3.2 and a table in §4.1 that enumerates every hyperparameter kept identical between the L1 and heatmap heads, together with the exact protocol used to replace the decoder.
Revision: yes
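The decoding step described in this response, softmax over voxel logits followed by the expectation over voxel centers, can be sketched in plain Python. The 8×8×8 grid (instead of 64×64×64, to keep the loop fast), the action bounds of [-1, 1], and the function names are assumptions made for illustration.

```python
import math

def voxel_centers(lo, hi, n):
    """Centers of n equal-width bins covering [lo, hi] along one axis."""
    step = (hi - lo) / n
    return [lo + (i + 0.5) * step for i in range(n)]

def decode_expectation(logits, lo, hi):
    """Softmax over an n x n x n logit volume, then the expected (x, y, z) action."""
    n = len(logits)
    centers = voxel_centers(lo, hi, n)
    # Subtract the maximum logit for numerical stability before exponentiating.
    peak = max(v for plane in logits for row in plane for v in row)
    total = 0.0
    acc = [0.0, 0.0, 0.0]
    for ix in range(n):
        for iy in range(n):
            for iz in range(n):
                w = math.exp(logits[ix][iy][iz] - peak)
                total += w
                acc[0] += w * centers[ix]
                acc[1] += w * centers[iy]
                acc[2] += w * centers[iz]
    return tuple(a / total for a in acc)

N = 8
uniform = [[[0.0] * N for _ in range(N)] for _ in range(N)]
action_uniform = decode_expectation(uniform, -1.0, 1.0)

# One strongly preferred voxel: the expectation lands near its center.
peaked = [[[0.0] * N for _ in range(N)] for _ in range(N)]
peaked[6][1][3] = 50.0
action_peaked = decode_expectation(peaked, -1.0, 1.0)
print(action_uniform, action_peaked)
```

Because the output is a weighted average, the decoded action is continuous even though the grid is discrete. An argmax decoder would instead snap to a voxel center, which is one reason the referee asks for the recovery method to be stated.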
-
Referee: [§3] §3 (Method): the conversion from voxel heatmap to continuous action and the precise loss (cross-entropy over voxels) must be shown to have matched effective capacity to the original L1 head; without this, the attribution of faster convergence and data efficiency to geometric proximity exploitation cannot be isolated from differences in optimization landscape or output dimensionality.
Authors: We acknowledge the concern. The heatmap head replaces the final linear layer with a projection to voxel logits (output dimension 64³ = 262,144 versus 7 for the L1 head) followed by a softmax and expectation. While the backbone parameters remain unchanged, the increased output dimensionality could affect optimization. In the revised manuscript we will report (i) the exact parameter count of each head, (ii) an ablation that narrows the output-dimensionality gap by using a coarser 16³ grid (4,096 logits), and (iii) training curves with identical learning-rate schedules. These additions will allow readers to separate representational benefits from capacity or optimization differences.
Revision: yes
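The size of the capacity gap can be made concrete with a little arithmetic. The sketch below assumes each head is a single dense linear projection with a bias and uses a hypothetical backbone width of 4096; neither assumption is confirmed by the source, and a factorized or convolutional head would have a very different count.

```python
def linear_head_params(hidden_dim, out_dim, bias=True):
    """Parameter count of one dense layer mapping hidden_dim -> out_dim."""
    return hidden_dim * out_dim + (out_dim if bias else 0)

HIDDEN_DIM = 4096  # hypothetical backbone hidden size, for illustration only

out_l1 = 7            # L1 regression head output, as stated in the rebuttal
out_grid64 = 64 ** 3  # 262,144 voxel logits
out_grid16 = 16 ** 3  # 4,096 voxel logits

params = {
    "l1": linear_head_params(HIDDEN_DIM, out_l1),
    "heatmap_64": linear_head_params(HIDDEN_DIM, out_grid64),
    "heatmap_16": linear_head_params(HIDDEN_DIM, out_grid16),
}
for name, count in params.items():
    print(f"{name}: {count:,}")
```

Under these assumptions the 16³ grid shrinks the heatmap head by a factor of 64 relative to 64³, yet it still emits 4,096 logits against 7 regression outputs, so the proposed ablation reduces the dimensionality gap rather than eliminating it.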
Circularity Check
No circularity: empirical comparisons on public benchmarks with no derivations or fitted predictions
The paper introduces an empirical action head (voxel heatmap) and reports performance gains via direct comparisons on LIBERO and real-robot tasks against existing backbones at matched training steps. No equations, first-principles derivations, parameter fits renamed as predictions, or self-citation chains appear in the load-bearing claims. The central evidence consists of benchmark results rather than any reduction of outputs to inputs by construction. This is self-contained experimental work with no derivation chain to inspect for circularity.
Forward citations
Cited by 1 Pith paper
-
Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos
GRA extracts 2D waypoints from synthetic videos to supervise VLA vision while restricting action training to real data, outperforming pseudo-action baselines on real-robot tasks.