Reinforcement Learning for Optimizing RAG for Domain Chatbots

Anusua Trivedi; Kyung Kim; Mandar Kulkarni; Praveen Tangarajan

arxiv: 2401.06800 · v1 · pith:KZ2VQRNBnew · submitted 2024-01-10 · 💻 cs.CL · cs.AI


Mandar Kulkarni, Praveen Tangarajan, Kyung Kim, Anusua Trivedi


keywords: model, policy, retrieval, cost, in-house, optimize, queries, accuracy


With the advent of Large Language Models (LLMs), conversational assistants have become prevalent for domain use cases. LLMs acquire the ability to perform contextual question answering through training, and Retrieval Augmented Generation (RAG) further enables a bot to answer domain-specific questions. This paper describes a RAG-based approach for building a chatbot that answers users' queries using Frequently Asked Questions (FAQ) data. We train an in-house retrieval embedding model using the infoNCE loss, and experimental results demonstrate that the in-house model works significantly better than a well-known general-purpose public embedding model, both in retrieval accuracy and in Out-of-Domain (OOD) query detection. As the LLM, we use a paid, API-based ChatGPT model. We noticed that previously retrieved context can be reused to generate an answer for specific patterns or sequences of queries (e.g., follow-up queries), so there is scope to reduce the number of LLM tokens and hence the cost. Assuming a fixed retrieval model and a fixed LLM, we optimize the number of LLM tokens using Reinforcement Learning (RL). Specifically, we propose a policy-based model external to the RAG pipeline, which interacts with the pipeline through policy actions and updates the policy to optimize the cost. The policy model can take one of two actions: fetch FAQ context or skip retrieval. We use the API-based GPT-4 as the reward model and train the policy model using policy gradient on multiple training chat sessions. As the policy model, we experimented with a public GPT-2 model and an in-house BERT model. With the proposed RL-based optimization combined with a similarity threshold, we achieve significant cost savings along with slightly improved accuracy. Although we demonstrate results for the FAQ chatbot, the proposed RL approach is generic and can be tried with any existing RAG pipeline.
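The fetch-or-skip policy described in the abstract can be illustrated with a minimal policy-gradient (REINFORCE) sketch. This is not the authors' implementation: the paper uses GPT-2 or BERT as the policy model and GPT-4 as the reward model, whereas the sketch below assumes a one-feature logistic policy and a hand-written toy reward. The `similarity` feature, the 0.5 follow-up cutoff, the token-cost penalty of 1.0, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random

FETCH, SKIP = 1, 0  # the two policy actions named in the abstract


def fetch_probability(w, b, similarity):
    """Probability of choosing FETCH under a logistic policy (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(w * similarity + b)))


def toy_reward(action, similarity):
    """Hypothetical stand-in for the GPT-4 reward model.

    Assumption: a query whose similarity to the previously retrieved
    context exceeds 0.5 is a follow-up that can be answered without
    fetching new context.
    """
    is_follow_up = similarity > 0.5
    if action == FETCH:
        # Adequate answer (+1) minus an assumed token-cost penalty (1.0).
        return 1.0 - 1.0
    # Skipping is free, but only yields a good answer for follow-ups.
    return 1.0 if is_follow_up else -1.0


def train_policy(steps=10000, lr=0.5, seed=0):
    """REINFORCE on single-step episodes with a Bernoulli policy."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        similarity = rng.random()  # toy query feature in [0, 1)
        p = fetch_probability(w, b, similarity)
        action = FETCH if rng.random() < p else SKIP
        reward = toy_reward(action, similarity)
        # For a Bernoulli policy, d log pi(a|x) / d logit = a - p.
        grad_logit = action - p
        w += lr * reward * grad_logit * similarity
        b += lr * reward * grad_logit
    return w, b


w, b = train_policy()
print("P(fetch | similarity=0.1):", round(fetch_probability(w, b, 0.1), 3))
print("P(fetch | similarity=0.9):", round(fetch_probability(w, b, 0.9), 3))
```

Under these toy assumptions, the learned policy should prefer fetching for low-similarity queries and skipping for high-similarity (follow-up) queries, which mirrors the cost-saving behavior the abstract describes. A real setup would replace `toy_reward` with a scored LLM judgment and the scalar feature with the chat-session context.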


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

State Contamination in Memory-Augmented LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

Toxic context can be laundered into memory summaries that stay below toxicity thresholds while still driving higher downstream toxicity in LLM agents compared to neutral baselines.

EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
cs.AI 2026-04 unverdicted novelty 6.0

EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming ...

Self-Aligned Reward: Towards Effective and Efficient Reasoners
cs.LG 2025-09 unverdicted novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation
cs.CL 2025-05 unverdicted novelty 5.0

RioRAG uses nugget-centric verification with cross-source checks to create dense verifiable rewards for RL-based optimization of long-form RAG, yielding higher factual recall and faithfulness on LongFact and RAGChecker.

Retrieval-Augmented Generation for AI-Generated Content: A Survey
cs.CV 2024-02 accept novelty 5.0

A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.