pith. machine review for the scientific record.

arxiv: 2604.14951 · v1 · submitted 2026-04-16 · 💻 cs.CV · cs.AI · cs.CL · cs.MM

Recognition: unknown

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Evelyn Turri, Gabriele Mattioli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, Sara Sarto

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.MM
keywords multimodal tool selection · retrieval-based tool use · open-world generalization · multimodal large language models · direct preference optimization · tool learning · Hugging Face models

The pith

Multimodal models select external tools for open-world tasks by converting queries into structured descriptions and retrieving semantic matches from tool cards, without retraining for new tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a retrieval-based framework that lets multimodal large language models handle tool selection for tasks involving text, images, and other inputs in settings where tools may be entirely new. Rather than training a model to output fixed tool names, the approach first turns the user query into a structured task description, then ranks available tools by how well their machine-readable descriptions match that task. A new dataset built from Hugging Face model cards supports evaluation, and direct preference optimization further aligns the descriptions. A reader would care because prior tool-use systems are restricted to text-only inputs and tools seen in training, so they cannot scale to the open, multimodal world where real AI assistants operate.
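
Both sides of the match share one schema. Judging from the qualitative examples reproduced in Figures 6 and 8 below, it is a JSON object with input, process, and output fields; a minimal illustration, with the query-side values taken from Figure 6 and the tool-side card invented for contrast:

```python
# Query side: task description generated by the MLLM from the user query
# "I bought a book while traveling in Spain. The title reads 'El secreto
# de la vida'. What does this mean in English?" (values from Figure 6).
task_description = {
    "input": "A book title in Spanish.",
    "process": "Translate the Spanish book title into English.",
    "output": "The English translation of the book title.",
}

# Tool side: a hypothetical card in the same schema, as would be derived
# from a Hugging Face model card (this particular card is illustrative).
tool_card = {
    "input": "Text in a source language.",
    "process": "Machine translation from the source language into English.",
    "output": "The translated English text.",
}
```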

Core claim

The paper establishes that an MLLM can transform a multimodal query into a structured task description, after which the most appropriate tool is retrieved by semantic matching against rich tool descriptions. This retrieval formulation supports adding new tools without retraining, and a subsequent preference optimization stage using DPO improves selection accuracy, as validated through experiments on a new open-world multimodal tool-use dataset derived from Hugging Face model cards.

What carries the argument

The retrieval step that matches an MLLM-generated structured task description against semantically rich, machine-readable tool descriptions to identify the best tool for the query.
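
In code, this step is a nearest-neighbor search over embedded descriptions. A minimal sketch, assuming sentence-transformers as a stand-in encoder (the paper's experiments use Qwen3-Embedding) and tool cards serialized to JSON strings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-in encoder; the paper's experiments use Qwen3-Embedding.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_tool(task_description: str, tool_cards: list[str]) -> int:
    """Index of the tool card that best matches a (serialized) task description."""
    # Normalized embeddings make the dot product equal to cosine similarity.
    vecs = encoder.encode([task_description] + tool_cards,
                          normalize_embeddings=True)
    scores = vecs[1:] @ vecs[0]
    return int(np.argmax(scores))

# Usage: both sides are JSON strings in the {"input", "process", "output"} schema.
# best = select_tool(json.dumps(task_description), [json.dumps(c) for c in cards])
```

Adding a new tool is then just appending its card to tool_cards; nothing is retrained, which is the open-world property the claim rests on.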

If this is right

  • Tool selection succeeds on tools never seen during any training phase.
  • The same pipeline works for inputs that combine text with images or video.
  • Direct preference optimization measurably improves how well task descriptions guide tool choice.
  • A standardized dataset of tool descriptions now exists for benchmarking open-world multimodal tool use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could support very large, dynamically updated tool libraries by relying on existing documentation rather than curated training examples.
  • Semantic retrieval on generated descriptions may prove more robust to changes in tool interfaces than classifier-based selection.
  • Downstream agents could chain this retrieval step with actual tool execution to measure end-to-end task success rather than selection accuracy alone.

Load-bearing premise

The structured task description produced from a multimodal query aligns closely enough with the semantic content of tool descriptions to enable reliable retrieval without any extra supervision or fine-tuning.

What would settle it

If retrieval accuracy falls to baseline levels when tool descriptions are rephrased or when the generated task descriptions are replaced with random text of similar length, the alignment premise would be falsified.
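
Both controls are cheap to run once the pipeline exists. A sketch of the random-text arm, reusing the hypothetical select_tool above; the rephrasing arm would swap in paraphrased tool cards instead:

```python
import random
import string

def random_text_like(text: str) -> str:
    """Nonsense words matching the original text's length (a crude control)."""
    words = []
    while sum(len(w) + 1 for w in words) < len(text):
        words.append("".join(random.choices(string.ascii_lowercase,
                                            k=random.randint(3, 9))))
    return " ".join(words)[: len(text)]

def top1_accuracy(descriptions: list[str], gold: list[int],
                  tool_cards: list[str]) -> float:
    """Fraction of queries whose best-scoring tool is the ground-truth one."""
    hits = sum(select_tool(d, tool_cards) == g for d, g in zip(descriptions, gold))
    return hits / len(gold)

# The premise fails if these two numbers meet near 1 / len(tool_cards):
# acc_real   = top1_accuracy(generated_descriptions, gold, tool_cards)
# acc_random = top1_accuracy([random_text_like(d) for d in generated_descriptions],
#                            gold, tool_cards)
```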

Figures

Figures reproduced from arXiv: 2604.14951 by Evelyn Turri, Gabriele Mattioli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara, Sara Sarto.

Figure 1
Figure 1. Pipeline of dataset creation. We first perform a metadata acquisition step by scraping model cards from Hugging Face. The collected data then undergoes a cleaning stage. Finally, given a user query and its associated prompt, we feed this information into an LLM to generate a structured tool description in a standardized JSON format. view at source ↗
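
A rough sketch of the acquisition and cleaning stages, using huggingface_hub's list_models and ModelCard APIs; the task filter, the length threshold, and the generate_tool_description stub are illustrative assumptions rather than the paper's actual choices:

```python
from huggingface_hub import HfApi, ModelCard

api = HfApi()

# Step 1 (metadata acquisition): enumerate candidate tools and pull their cards.
raw_cards = []
for model in api.list_models(task="translation", limit=50):
    try:
        card = ModelCard.load(model.id)
        raw_cards.append({"model_id": model.id, "card_text": card.text})
    except Exception:
        continue  # some repositories have no loadable card

# Step 2 (cleaning): drop cards too short to describe the tool (illustrative filter).
cleaned = [c for c in raw_cards if len(c["card_text"]) > 200]

# Step 3 (generation): an LLM turns each cleaned card into the standardized
# {"input", "process", "output"} JSON; prompt examples appear in Figures 8-10.
# `generate_tool_description` is a hypothetical stand-in for that call.
# tool_cards = [generate_tool_description(c["card_text"]) for c in cleaned]
```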
Figure 2
Figure 2. Qualitative dataset examples illustrating Hugging Face model cards and the corresponding tool descriptions represented in the standardized JSON format. view at source ↗
Figure 3
Figure 3. Overview of our pipeline. RaTA-Tool supports multimodal user queries by jointly processing the input prompt and associated modalities. A fine-tuned LLM encodes the combined inputs to generate a structured task description of the user request, further aligned via Direct Preference Optimization (DPO). This description is embedded and used to retrieve the most relevant tool from an external tool collection. view at source ↗
Figure 4
Figure 4. Qualitative results of RaTA-Tool under different input modalities. view at source ↗
Figure 5
Figure 5. Additional qualitative examples of RaTA-Tool for text-only user queries. view at source ↗
Figure 6
Figure 6. Additional qualitative examples of RaTA-Tool for multimodal user queries with image inputs. view at source ↗
Figure 7
Figure 7. Qualitative comparison of task descriptions with zero-shot inference and by RaTA-Tool, along with the ground-truth descriptions. view at source ↗
Figure 8
Figure 8. Example of a JSON-style prompt used for model description generation during dataset creation. view at source ↗
Figure 9
Figure 9. Example of a JSON-style prompt used by RaTA-Tool for task-description generation at inference time. view at source ↗
Figure 10
Figure 10. Examples of NL-style prompts used for model description generation during dataset creation and for task-description generation with RaTA-Tool at inference time. view at source ↗
read the original abstract

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
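
The abstract's preference stage maps onto standard DPO tooling. A minimal sketch with TRL's DPOTrainer (API as of recent TRL releases), under the assumption that preference pairs contrast task descriptions that do and do not retrieve the gold tool; the model name and the single row below are placeholders, not the paper's setup:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small placeholder, not the paper's model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: for one query, a task description that retrieves the gold
# tool ("chosen") versus one that retrieves the wrong tool ("rejected").
# This single row is an illustrative placeholder.
pairs = Dataset.from_list([{
    "prompt": "Describe the task for: What does this Spanish book title mean in English?",
    "chosen": '{"input": "Text in Spanish.", "process": "Translate to English.", '
              '"output": "The English translation."}',
    "rejected": '{"input": "An image.", "process": "Caption the image.", '
                '"output": "A caption."}',
}])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="rata-dpo", beta=0.1),  # beta weights the KL penalty
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```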

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces RaTA-Tool, a retrieval-based framework for open-world multimodal tool selection. An MLLM converts multimodal user queries into structured task descriptions, which are matched against semantically rich tool descriptions derived from Hugging Face model cards to select the appropriate tool. Direct Preference Optimization (DPO) is applied to improve alignment, and a new dataset for this task is contributed. Experiments are reported to show significant performance gains over prior methods, especially in open-world and multimodal settings.

Significance. If the experimental results hold, the work addresses a clear gap in tool learning by supporting generalization to unseen tools and multimodal inputs without retraining. The dataset construction from standardized HF model cards and the explicit separation of query-to-description conversion from retrieval are practical strengths that could enable extensible tool-use systems. The DPO stage provides a straightforward way to refine retrieval alignment.

major comments (2)
  1. [§4] §4 (Experiments): The central claim of significant improvement in open-world multimodal tool selection requires quantitative support; the manuscript must report specific metrics (e.g., accuracy or recall@K), baselines (direct MLLM mapping, standard retrieval), error bars, and explicit construction details for the open-world splits to substantiate generalization.
  2. [§3.2] §3.2 (Task Description Generation): The assumption that MLLM-generated structured task descriptions are sufficiently aligned with tool metadata for reliable retrieval without extra supervision is load-bearing for the open-world claim; an ablation isolating the conversion step versus end-to-end retrieval would be needed to confirm this holds across query types.
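
For concreteness on the metrics named in major comment 1: given each query's ranked tool list and ground-truth tool, recall@K is a few lines (names hypothetical), and accuracy is its k = 1 case:

```python
def recall_at_k(ranked: list[list[int]], gold: list[int], k: int) -> float:
    """Fraction of queries whose ground-truth tool appears in the top-k ranking."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

# Tool-selection accuracy is the k = 1 special case: recall_at_k(ranked, gold, 1)
```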
minor comments (3)
  1. The abstract would be strengthened by briefly stating the magnitude of reported gains and the primary baseline.
  2. [§3.1] Notation for the structured task description format and similarity metric should be defined explicitly in §3.1 for reproducibility.
  3. Figure 3 (pipeline overview) would benefit from clearer labeling of the DPO preference-pair construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for the constructive comments, which will help strengthen the quantitative rigor and clarity of the paper. We address each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim of significant improvement in open-world multimodal tool selection requires quantitative support; the manuscript must report specific metrics (e.g., accuracy or recall@K), baselines (direct MLLM mapping, standard retrieval), error bars, and explicit construction details for the open-world splits to substantiate generalization.

    Authors: We agree that detailed quantitative support is essential. The manuscript reports accuracy and recall@K metrics with comparisons to prior methods in Section 4. To fully address the comment, we will revise the section to explicitly add the suggested baselines (direct MLLM mapping and standard retrieval), include error bars from multiple runs, and provide expanded details on open-world split construction, including unseen tool selection criteria and partitioning methodology. revision: yes

  2. Referee: [§3.2] §3.2 (Task Description Generation): The assumption that MLLM-generated structured task descriptions are sufficiently aligned with tool metadata for reliable retrieval without extra supervision is load-bearing for the open-world claim; an ablation isolating the conversion step versus end-to-end retrieval would be needed to confirm this holds across query types.

    Authors: We acknowledge the importance of validating the task description generation step for the open-world claim. While the current experiments focus on the full pipeline, we will add an ablation study in the revised manuscript (updating Section 3.2 and the experiments) that isolates the structured conversion by comparing against an end-to-end retrieval baseline across text-only, image-only, and multimodal query types to confirm alignment benefits without extra supervision. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a retrieval-based framework where an MLLM converts multimodal queries into structured task descriptions, followed by semantic matching against tool metadata from Hugging Face cards and optional DPO alignment. No equations, fitted parameters, or derivations are presented that reduce the claimed performance gains to inputs by construction. The method uses standard retrieval and preference optimization techniques without self-definitional loops or load-bearing self-citations that collapse the central claim. The dataset construction and open-world extensibility arguments remain independent of the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that MLLMs can reliably produce structured task descriptions and that semantic similarity between those descriptions and tool metadata is a sufficient proxy for tool utility. No new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Multimodal LLMs can convert arbitrary image-plus-text queries into structured task descriptions that preserve user intent.
    Invoked in the description of the first stage of RaTA-Tool.
  • domain assumption Semantic similarity between task descriptions and tool metadata correlates with actual tool usefulness.
    Central to the retrieval step.

pith-pipeline@v0.9.0 · 5565 in / 1238 out tokens · 28184 ms · 2026-05-10T11:20:31.347096+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

Flamingo: a Visual Language Model for Few-Shot Learning

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a Visual Language Model for Few-Shot Learning. In: NeurIPS (2022)

  3. [3]

Language Models are Few-Shot Learners

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language Models are Few-Shot Learners. In: NeurIPS (2020)

  4. [4]

The Revolution of Multimodal Large Language Models: A Survey

Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R.: The Revolution of Multimodal Large Language Models: A Survey. In: ACL Findings (2024)

  5. [5]

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Chen, F., Han, M., Zhao, H., Zhang, Q., Shi, J., Xu, S., Xu, B.: X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. arXiv preprint arXiv:2305.04160 (2023)

  6. [6]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., et al.: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv preprint arXiv:2406.07476 (2024)

  7. [7]

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Cocchi, F., Moratelli, N., Caffagni, D., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R.: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning. In: ICCV Workshops (2025)

  8. [8]

Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization

    Compagnoni, A., Caffagni, D., Moratelli, N., Baraldi, L., Cornia, M., Cucchiara, R.: Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization. In: BMVC (2025)

  9. [9]

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

    Compagnoni, A., Morini, M., Sarto, S., Cocchi, F., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering. In: CVPR (2026)

  10. [10]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL (2019)

  11. [11]

VITA: Towards Open-Source Interactive Omni Multimodal LLM

    Fu, C., Lin, H., Long, Z., Shen, Y., Dai, Y., Zhao, M., Zhang, Y.F., Dong, S., Li, Y., Wang, X., et al.: VITA: Towards Open-Source Interactive Omni Multimodal LLM. arXiv preprint arXiv:2408.05211 (2024)

  12. [12]

    The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)

  13. [13]

OneLLM: One Framework to Align All Modalities with Language

    Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: OneLLM: One Framework to Align All Modalities with Language. In: CVPR (2024)

  14. [14]

LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-Rank Adaptation of Large Language Models. In: ICLR (2022)

  15. [15]

Unsupervised Dense Information Retrieval with Contrastive Learning

    Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Unsupervised Dense Information Retrieval with Contrastive Learning. TMLR (2021)

  16. [16]

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

    Jia, H., Jiang, C., Xu, H., Ye, W., Dong, M., Yan, M., Zhang, J., Huang, F., Zhang, S.: SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization. In: CVPR (2025)

  17. [17]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin: Qwen2.5-Omni Technical Report. arXiv preprint arXiv:2503.20215 (2025)

  18. [18]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In: ICML (2023)

  19. [19]

Improved Baselines with Visual Instruction Tuning

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved Baselines with Visual Instruction Tuning. In: CVPR (2024)

  20. [20]

Visual Instruction Tuning

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning. In: NeurIPS (2023)

  21. [21]

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Moratelli, N., Caffagni, D., Cornia, M., Baraldi, L., Cucchiara, R.: Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization. In: BMVC (2024)

  22. [22]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al.: WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 (2021)

  23. [23]

Training language models to follow instructions with human feedback

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)

  24. [24]

Gorilla: Large Language Model Connected with Massive APIs

    Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large Language Model Connected with Massive APIs. In: NeurIPS (2024)

  25. [25]

MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Pipoli, V., Saporita, A., Bolelli, F., Cornia, M., Baraldi, L., Grana, C., Cucchiara, R., Ficarra, E.: MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models. In: ICCV (2025)

  26. [26]

CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

Poppi, T., Uzkent, B., Garg, A., Porto, L., Kessler, G., Yang, Y., Cornia, M., Baraldi, L., Cucchiara, R., Schiffers, F.: CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models. arXiv preprint arXiv:2601.04778 (2026)

  27. [27]

Tool Learning with Foundation Models

Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., Zeng, Z., Zhou, X., Huang, Y., Xiao, C., et al.: Tool Learning with Foundation Models. ACM Computing Surveys 57(4), 1–40 (2024)

  28. [28]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al.: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In: ICLR (2024)

  29. [29]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In: NeurIPS (2023)

  30. [30]

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Sarto, S., Moratelli, N., Cornia, M., Baraldi, L., Cucchiara, R.: Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training. IJCV 133(11), 7647–7671 (2025)

  31. [31]

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language Models Can Teach Themselves to Use Tools. In: NeurIPS (2023)

  32. [32]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., Zhuang, Y.: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. In: NeurIPS (2023)

  33. [33]

Preference Ranking Optimization for Human Alignment

    Song, F., Yu, B., Li, M., Yu, H., Huang, F., Li, Y., Wang, H.: Preference Ranking Optimization for Human Alignment. In: AAAI (2024)

  34. [34]

    LaMDA: Language Models for Dialog Applications

    Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.T., Jin, A., Bos, T., Baker, L., Du, Y., et al.: LaMDA: Language Models for Dialog Applications. arXiv preprint arXiv:2201.08239 (2022)

  35. [35]

Diffusion Model Alignment Using Direct Preference Optimization

Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion Model Alignment Using Direct Preference Optimization. In: CVPR (2024)

  36. [36]

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

    Wang, C., Luo, W., Dong, S., Xuan, X., Li, Z., Ma, L., Gao, S.: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning. In: WACV (2025)

  37. [37]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N.: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671 (2023)

  38. [38]

β-DPO: Direct Preference Optimization with Dynamic β

Wu, J., Xie, Y., Yang, Z., Wu, J., Gao, J., Ding, B., Wang, X., He, X.: β-DPO: Direct Preference Optimization with Dynamic β. In: NeurIPS (2024)

  39. [39]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025)

  40. [40]

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

    Yang, R., Song, L., Li, Y., Zhao, S., Ge, Y., Li, X., Shan, Y.: GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. In: NeurIPS (2023)

  41. [41]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Zhang, H., Li, X., Bing, L.: Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In: EMNLP Demos (2023)

  42. [42]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., Zhou, J.: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025)

  43. [43]

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

Zhao, Z., Guo, L., Yue, T., Chen, S., Shao, S., Zhu, X., Yuan, Z., Liu, J.: ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst. arXiv preprint arXiv:2305.16103 (2023)
