Recognition: 2 theorem links · Lean Theorem
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Pith reviewed 2026-05-10 20:18 UTC · model grok-4.3
The pith
Event-augmented VLA models restore robotic manipulation success in dark and blurred scenes via direct event fusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
E-VLA demonstrates that directly leveraging motion and structural cues in event streams, rather than reconstructing images from events, preserves semantic perception and perception-action consistency in VLA models under adverse conditions such as extreme low light and motion blur. Experiments on a collected real-world dataset show that parameter-free overlay fusion of accumulated event maps onto RGB images raises Pick-Place success from 0% to 60% at 20 lux and from 0% to 20-25% under 1000 ms blur, with further gains from a learned event adapter.
What carries the argument
Overlay fusion of accumulated event maps onto RGB images, together with a lightweight, pretrained-compatible event adapter; both inject motion cues directly into the VLA's visual input to maintain performance when conventional frames degrade.
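As a rough illustration of that overlay fusion, the sketch below accumulates events in a fixed time window into a signed polarity map and alpha-blends its magnitude onto the RGB frame. This is a minimal sketch, not the authors' implementation: the event tuple format, the normalization, and the blend weight are assumptions.

```python
import numpy as np

def accumulate_events(events, t0, window_ms, shape):
    """Accumulate events in [t0, t0 + window_ms) into a signed polarity map.

    events: iterable of (t_ms, x, y, polarity) tuples with polarity in {-1, +1}
    (an assumed format; real event-camera SDKs differ).
    """
    acc = np.zeros(shape, dtype=np.float32)
    for t, x, y, p in events:
        if t0 <= t < t0 + window_ms:
            acc[y, x] += p  # positive and negative events cancel per pixel
    return acc

def overlay_fusion(rgb, event_map, alpha=0.5):
    """Parameter-free overlay: no trainable weights, only fixed hyperparameters
    (the accumulation window upstream, and alpha here)."""
    mag = np.abs(event_map)
    peak = mag.max()
    if peak > 0:
        mag /= peak  # normalize event magnitude to [0, 1]
    # Blend the event magnitude into every RGB channel.
    fused = (1 - alpha) * rgb.astype(np.float32) + alpha * 255.0 * mag[..., None]
    return np.clip(fused, 0, 255).astype(np.uint8)
```

In dark frames the RGB term contributes almost nothing, so pixels where events fired dominate the fused input, which is the intuition behind the reported low-light gains.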
Load-bearing premise
The real-world RGB-event-action dataset and the selected tasks plus illumination conditions are representative enough that the observed robustness gains will hold for other robots, tasks, and VLA backbones.
What would settle it
Running the Pick-Place task at 20 lux illumination on a different robot arm or unseen VLA backbone and measuring no meaningful success improvement over the image-only baseline would show the gains do not transfer.
Original abstract
Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces E-VLA, a framework augmenting Vision-Language-Action (VLA) models with event-camera streams to improve robotic manipulation robustness under low illumination and motion blur. It describes the collection of a synchronized real-world RGB-event-action dataset using a DAVIS346 camera across diverse tasks and lighting, proposes lightweight pretrained-compatible integration methods (parameter-free event-map overlay and a learned event adapter), and reports empirical success-rate gains on tasks such as Pick-Place and Sorting (e.g., 0% to 60% at 20 lux with overlay, 0% to 90% with adapter; 0% to 20-25% under 1000 ms blur).
Significance. If the reported gains prove reproducible, the work supplies concrete evidence that event-based motion cues can be fused into existing VLA pipelines to recover performance where RGB perception collapses, without requiring full image reconstruction. The open release of the dataset and code is a clear strength that supports reproducibility and follow-on research. The significance is limited by the narrow scope of tested conditions and backbones.
major comments (3)
- §4.2 (Event Integration Strategies): The abstract and methods describe the overlay fusion as 'parameter-free,' yet the event accumulation window must be selected and is not ablated across the reported conditions; this choice directly affects the input to the VLA and therefore the measured gains (e.g., the 60% success figure on Pick-Place at 20 lux).
- Experiments section (Tables 1-2 and associated text): Success rates are given as single point estimates (0%, 60%, 90%, etc.) without trial counts, standard deviations, or statistical significance tests, preventing assessment of whether the claimed improvements over the image-only baseline are reliable.
- §5 (Discussion and Conclusion): The broader claim that E-VLA provides 'systematic evidence' for robust embodied intelligence rests on a single custom dataset and one VLA backbone; no cross-backbone evaluation or external benchmark results are presented, leaving the generalization premise untested.
minor comments (2)
- Figure 2: The event-map overlay examples would be clearer if the accumulation window and polarity rendering parameters were stated in the caption.
- Related Work: Several recent papers on event-based robotic perception (e.g., event-driven SLAM or low-light tracking) are not cited; adding them would better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: §4.2 (Event Integration Strategies): The abstract and methods describe the overlay fusion as 'parameter-free,' yet the event accumulation window must be selected and is not ablated across the reported conditions; this choice directly affects the input to the VLA and therefore the measured gains (e.g., the 60% success figure on Pick-Place at 20 lux).
Authors: We thank the referee for this observation. The descriptor 'parameter-free' specifically denotes that the overlay fusion introduces no trainable parameters (in contrast to the learned event adapter). The accumulation window is a fixed hyperparameter; while the manuscript states that event windowing was studied, we did not include a dedicated ablation table quantifying its effect on the reported success rates. In the revision we will add an ablation study across multiple window sizes for the low-light and motion-blur conditions to show sensitivity of the gains. revision: yes
- Referee: Experiments section (Tables 1-2 and associated text): Success rates are given as single point estimates (0%, 60%, 90%, etc.) without trial counts, standard deviations, or statistical significance tests, preventing assessment of whether the claimed improvements over the image-only baseline are reliable.
Authors: We agree that single-point estimates limit reliability assessment. Each reported success rate was computed from 20 independent trials per condition. We will revise Tables 1–2 and the accompanying text to report trial counts, mean success rates with standard deviations, and paired statistical significance tests against the image-only baseline. revision: yes
- Referee: §5 (Discussion and Conclusion): The broader claim that E-VLA provides 'systematic evidence' for robust embodied intelligence rests on a single custom dataset and one VLA backbone; no cross-backbone evaluation or external benchmark results are presented, leaving the generalization premise untested.
Authors: We acknowledge the scope limitation. The custom dataset was collected because no public synchronized RGB-event-action manipulation dataset existed at the time, and evaluation was performed on a representative VLA backbone to demonstrate integration feasibility. We will revise the discussion and conclusion to moderate the language, explicitly stating that the results supply evidence for the proposed integration methods under the tested conditions and backbone while noting the value of future cross-backbone and benchmark studies. No additional backbone experiments will be added in this revision. revision: partial
Circularity Check
No circularity: empirical gains measured directly on collected dataset
full rationale
The paper's central claims consist of measured success-rate improvements (0% to 60-90% on Pick-Place at 20 lux, 0% to 20-25% under 1000 ms blur) obtained by running standard VLA models plus simple fusion or adapter on a newly collected teleoperated RGB-event-action dataset. No equations, fitted parameters, or self-citations are invoked to derive these numbers; the results are direct experimental outputs. The work contains no self-definitional loops, no predictions that reduce to fitted inputs by construction, and no load-bearing uniqueness theorems imported from prior author work. The derivation chain is therefore self-contained empirical reporting rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- event accumulation window
axioms (1)
- domain assumption: Event streams provide reliable motion and structural cues under low light and motion blur where frame-based RGB fails.
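Since the accumulation window is the ledger's lone free parameter, its sensitivity could be probed with a simple sweep. The sketch below is hypothetical: `evaluate` stands in for the paper's rollout-based success-rate measurement, which it does not implement.

```python
def sweep_window(windows_ms, evaluate):
    """Hypothetical ablation over the event accumulation window.

    evaluate: callable mapping a window size (ms) to a success rate in [0, 1];
    it stands in for running manipulation rollouts under a fixed condition.
    """
    results = {w: evaluate(w) for w in windows_ms}
    best = max(results, key=results.get)  # window with the highest success rate
    return results, best
```

A real ablation would run this per condition (20 lux, 1000 ms blur, etc.) and report trial counts and deviations alongside each window's mean success rate.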
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images... on Pick-Place at 20 lux, success increases from 0% to 60%"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.