TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
Pith reviewed 2026-05-10 15:42 UTC · model grok-4.3
The pith
TCL speeds up tensor program tuning by over 12x on CPU and GPU while improving inference latency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TCL is a compiler framework for cross-hardware tensor program optimization built on three components: an RDU Sampler that selects only 10 percent of programs while preserving cost-model accuracy, a Mamba-based cost model that captures long-range dependencies efficiently, and a continuous knowledge distillation method that transfers knowledge progressively across platforms. Together these components deliver substantially faster tuning and modestly better inference latency than Tenset-MLP on both CPU and GPU.
What carries the argument
The RDU Sampler, which jointly scores tensor programs for representativeness, diversity, and uncertainty to enable data-efficient active learning that trains accurate cost models from far fewer examples.
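The page does not give the sampler's scoring formula, so the following is only a minimal sketch of how three such signals might be combined in a greedy batch selection. The function name `rdu_select`, the equal default weights, the density-based representativeness term, the nearest-selected-distance diversity term, and the ensemble-variance uncertainty term are all assumptions made for illustration, not the paper's method.

```python
import math
from statistics import pvariance


def _min_max(values):
    """Scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rdu_select(features, ensemble_preds, budget, weights=(1.0, 1.0, 1.0)):
    """Greedily pick `budget` program indices by a weighted R + D + U score.

    features:       one numeric feature vector per tensor program (assumed given)
    ensemble_preds: per program, predictions from several cost models (assumed given)
    """
    n = len(features)
    w_r, w_d, w_u = weights

    # Pairwise Euclidean distances between program feature vectors.
    d = [[math.dist(a, b) for b in features] for a in features]
    max_d = max(max(row) for row in d) or 1.0

    # Representativeness (assumed form): programs close to many others score higher.
    rep = _min_max([-sum(row) / n for row in d])
    # Uncertainty (assumed form): variance across the ensemble's predictions.
    unc = _min_max([pvariance(p) for p in ensemble_preds])

    selected = []
    while len(selected) < min(budget, n):
        best_i, best_score = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            # Diversity (assumed form): distance to the nearest already-selected program.
            div = min((d[i][s] for s in selected), default=0.0) / max_d
            score = w_r * rep[i] + w_d * div + w_u * unc[i]
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

Under this sketch, a 10 percent selection would correspond to calling `rdu_select(features, ensemble_preds, budget=max(1, len(features) // 10))`. The quadratic distance matrix is acceptable for illustration but would likely need approximation at dataset scale.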
If this is right
- Tuning is roughly 16.8x faster on CPU and 12.5x faster on GPU, on average, for mainstream deep learning models.
- Final optimized programs run with 1.20x (CPU) and 1.13x (GPU) lower inference latency, on average, than those produced by the prior Tenset-MLP baseline.
- Data collection cost for cost-model training falls to roughly one-tenth of previous requirements.
- Knowledge can be transferred to new hardware platforms without retraining from scratch or suffering parameter explosion.
- The same three-component structure supports progressive improvement as additional hardware targets are encountered.
Where Pith is reading between the lines
- The continual-distillation design may allow incremental updates when entirely new hardware families appear without discarding prior knowledge.
- Because only a small program subset is needed, the approach could be applied in resource-constrained environments such as edge-device optimization loops.
- The method's emphasis on uncertainty sampling suggests it could be combined with online feedback from actual hardware runs to further refine the cost model over time.
Load-bearing premise
Selecting only 10 percent of tensor programs with the RDU criteria keeps the cost model's accuracy close enough to the full-data version that optimization quality does not degrade on new programs or platforms.
What would settle it
Train the cost model once on the full dataset and once on the RDU-selected 10 percent subset, then compare both the prediction error on held-out tensor programs and the final tuned inference latency; a large gap in either metric would falsify the efficiency claim.
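The comparison described above could be scripted roughly as follows. This is a sketch under stated assumptions: the two cost models are taken to be already trained, their predictions on the same held-out programs are passed in as plain lists, and the helper names (`mape`, `kendall_tau`, `compare_cost_models`) are hypothetical. MAPE and Kendall's tau are used here because the referee report names them; the paper may report other metrics.

```python
from itertools import combinations


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def kendall_tau(y_true, y_pred):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(y_true)), 2):
        sign = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(y_true) * (len(y_true) - 1) // 2
    return (concordant - discordant) / n_pairs


def compare_cost_models(measured, pred_full, pred_subset):
    """Report both metrics for the full-data model and the 10 percent subset model."""
    return {
        name: {"mape": mape(measured, preds), "tau": kendall_tau(measured, preds)}
        for name, preds in (("full", pred_full), ("subset", pred_subset))
    }
```

Ranking quality (tau) matters more than absolute error for auto-tuning, since the tuner only needs to order candidates correctly. The second half of the test, comparing final tuned inference latency, requires hardware measurements and is not covered by this sketch.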
Original abstract
Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
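The "x lower" and "x faster" figures in the abstract can be converted to percentage reductions for easier comparison. The short sketch below assumes that "N x lower latency" means the baseline value divided by the TCL value equals N; the abstract does not define the ratio explicitly, and the function name `percent_reduction` is hypothetical.

```python
def percent_reduction(ratio):
    """Convert an 'N-times lower' ratio (baseline / new) into a percentage reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


# Ratios quoted in the abstract, relative to Tenset-MLP.
reported = [
    ("CPU tuning time", 16.8),
    ("GPU tuning time", 12.48),
    ("CPU inference latency", 1.20),
    ("GPU inference latency", 1.13),
]

for label, ratio in reported:
    print(f"{label}: {ratio}x -> {percent_reduction(ratio):.1f}% lower")
```

Under that reading, the latency ratios of 1.20x and 1.13x correspond to reductions of about 16.7 percent and 11.5 percent respectively.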
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TCL, a framework for efficient cross-hardware tensor program optimization in deep learning compilers. It consists of three main components: the RDU Sampler for selecting only 10% of tensor programs using representativeness, diversity, and uncertainty to reduce data collection costs while preserving accuracy; a Mamba-based cost model for efficient long-range dependency capture with reduced parameterization; and a continuous knowledge distillation approach for progressive knowledge transfer across hardware platforms. The paper reports that on mainstream DL models for CPU and GPU, TCL achieves 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency compared to Tenset-MLP.
Significance. If the empirical results hold under rigorous validation, TCL could meaningfully advance DL compiler optimization by reducing the high costs of offline data collection and improving transferability across hardware. The combination of active learning sampling, lightweight sequence modeling via Mamba, and continual distillation targets practical bottlenecks in auto-tuning, with potential for broader adoption if the speedups and latency gains prove robust.
major comments (2)
- [Abstract] The central tuning-time claims (16.8× on CPU, 12.48× on GPU) rest on the RDU sampler's 10% selection preserving near-original cost-model accuracy. The abstract asserts this but supplies no quantitative bounds (MAPE, Kendall-τ, or similar) on held-out programs or unseen hardware platforms, nor an ablation isolating sampler-induced ranking errors from the Mamba and distillation components. Without these, it is impossible to confirm that the reported latency gains are not eroded by mis-ranked candidates.
- [Experiments] The abstract presents concrete average speedups and latency reductions but omits all details on statistical significance, run-to-run variance, data splits, or ablation controls. This absence directly affects the soundness of the cross-hardware performance assertions and prevents assessment of whether the gains are reliable or platform-specific artifacts.
minor comments (1)
- [Abstract] The baseline 'Tenset-MLP' is referenced without a brief description or citation; adding one sentence would improve readability for readers unfamiliar with the prior work.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The two major comments identify areas where additional quantitative detail and statistical rigor would strengthen the presentation of our results. We address each point below and commit to revisions that directly incorporate the requested information without altering the core claims or methodology.
- Referee: [Abstract] The central tuning-time claims (16.8× on CPU, 12.48× on GPU) rest on the RDU sampler's 10% selection preserving near-original cost-model accuracy. The abstract asserts this but supplies no quantitative bounds (MAPE, Kendall-τ, or similar) on held-out programs or unseen hardware platforms, nor an ablation isolating sampler-induced ranking errors from the Mamba and distillation components. Without these, it is impossible to confirm that the reported latency gains are not eroded by mis-ranked candidates.
  Authors: We agree that the abstract would be improved by explicit quantitative bounds on the RDU sampler. The current abstract summarizes end-to-end outcomes but does not report MAPE, Kendall-τ, or a dedicated isolation ablation. In the revised manuscript we will add a concise statement to the abstract citing the sampler's held-out Kendall-τ (reported in Section 4.2) and will insert a new ablation table in the experiments section that isolates the sampler's contribution to final ranking quality and latency from the Mamba cost model and distillation stages. These additions will allow readers to verify that any sampler-induced ranking discrepancies do not materially erode the reported speedups.
  Revision: yes
- Referee: [Experiments] The abstract presents concrete average speedups and latency reductions but omits all details on statistical significance, run-to-run variance, data splits, or ablation controls. This absence directly affects the soundness of the cross-hardware performance assertions and prevents assessment of whether the gains are reliable or platform-specific artifacts.
  Authors: We concur that the experiments section would benefit from explicit statistical details. While averages across models are reported, the manuscript does not currently include run-to-run standard deviations, precise data-split descriptions, or expanded ablation controls. In the revision we will add: (i) standard deviations computed over five independent tuning runs per model, (ii) a description of the 80/20 random splits used for cost-model training together with 5-fold cross-validation results, and (iii) additional ablation tables that systematically vary each TCL component while holding the others fixed. These changes will demonstrate consistency across CPU and GPU and rule out platform-specific artifacts.
  Revision: yes
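The statistical additions promised in this response are straightforward to script. The sketch below shows one way to compute a mean and sample standard deviation over repeated tuning runs, an 80/20 random split, and 5-fold cross-validation indices. All function names are hypothetical, and running the folds over the 80 percent training portion is an assumption; the rebuttal does not say how the split and the cross-validation relate.

```python
import random
from statistics import mean, stdev


def summarize_runs(latencies_ms):
    """Mean and sample standard deviation over independent tuning runs."""
    return mean(latencies_ms), stdev(latencies_ms)


def train_test_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly and split them into (train, test)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation
```

A typical use would call `train_test_split(len(dataset))` once, iterate `k_fold(train_indices, k=5)` for model selection, and report `summarize_runs` over the five measured latencies per model. Fixing the seed keeps the split reproducible across reruns.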
Circularity Check
No circularity: empirical comparisons to external baseline
The paper's central claims consist of measured speedups (16.8×/12.48× tuning time, 1.20×/1.13× latency) obtained by running TCL against the external Tenset-MLP baseline on mainstream DL models for CPU and GPU. The three enablers (RDU sampler, Mamba cost model, continual distillation) are introduced as engineering choices whose effectiveness is shown via ablation studies and end-to-end experiments; none of the reported quantities is obtained by fitting a parameter to a subset and then relabeling the fit as a prediction, nor is any load-bearing premise justified solely by a self-citation whose content reduces to the present result. The derivation chain therefore remains self-contained against external benchmarks and contains no self-definitional, fitted-input, or self-citation circularity.