Carolina Guide: A Multi-Agent RAG System with Institutional Guardrails for Academic Policy Assistance

Ben Torsion; Jun Zhou

arxiv: 2606.28360 · v1 · pith:VKMAP75Inew · submitted 2026-06-11 · 💻 cs.IR · cs.AI

Pith reviewed 2026-06-30 11:10 UTC · model grok-4.3

keywords RAG · multi-agent systems · institutional guardrails · academic policy assistance · retrieval success · safety evaluation · university advising

The pith

A multi-agent RAG system with institutional guardrails answers university policy queries at 98.9 percent retrieval success while refusing unsafe requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Carolina Guide as a retrieval-augmented generation system built for students seeking academic policy information at one university. It uses a modular multi-agent pipeline that retrieves official documents, generates citation-supported responses, and applies guardrails to block inappropriate requests such as course recommendations or personalized advising. On a 90-query test set spanning six departments, the retrieval stage places a genuinely relevant chunk first for 98.9 percent of queries and reaches 98.9 percent success at a relevance threshold of two or higher. The guardrail component, tested on 30 adversarial queries, refuses 86 percent of unsafe inputs while still covering 93 percent of benign ones, for a Safety F1 of 0.89. The authors conclude that standard RAG designs must be altered to emphasize safety, transparency, and departmental control rather than open-ended conversation when the domain is high-stakes institutional policy.

Core claim

A modular multi-agent RAG pipeline equipped with institutional guardrails can deliver citation-supported answers to academic policy questions while correctly refusing unsafe queries, achieving 98.9 percent retrieval success at the genuinely-relevant threshold, first-relevant-chunk rank-1 performance on 98.9 percent of queries, and a guardrail Safety F1 of 0.89 that refuses 86 percent of adversarial inputs while preserving 93 percent coverage of benign queries.
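The retrieval figures in this claim can be made concrete with a minimal sketch of how MRR@10 and success at a relevance threshold are usually computed. The function names, the graded-relevance list layout, and the use of k=10 for the success metric are assumptions for illustration; the paper's own evaluation code is not shown on this page. The sample data is hypothetical and chosen only because 89 rank-1 hits out of 90 queries is the breakdown that matches the reported 98.9 percent.

```python
from typing import List, Sequence


def first_relevant_rank(grades: Sequence[int], threshold: int = 2, k: int = 10) -> int:
    """Return the 1-based rank of the first chunk graded >= threshold in the top k, or 0."""
    for rank, grade in enumerate(grades[:k], start=1):
        if grade >= threshold:
            return rank
    return 0


def mrr_at_k(all_grades: List[Sequence[int]], threshold: int = 2, k: int = 10) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant chunk (0 if none)."""
    total = 0.0
    for grades in all_grades:
        rank = first_relevant_rank(grades, threshold, k)
        if rank:
            total += 1.0 / rank
    return total / len(all_grades)


def success_at_k(all_grades: List[Sequence[int]], threshold: int = 2, k: int = 10) -> float:
    """Fraction of queries with at least one chunk graded >= threshold in the top k."""
    hits = sum(1 for grades in all_grades if first_relevant_rank(grades, threshold, k) > 0)
    return hits / len(all_grades)


# Hypothetical graded runs (not the paper's data): 89 queries with a relevant
# chunk at rank 1, and 1 query with no chunk reaching the threshold.
graded_runs = [[3, 1, 0]] * 89 + [[1, 0, 0]]

print(round(mrr_at_k(graded_runs), 3))      # 0.989
print(round(success_at_k(graded_runs), 3))  # 0.989
```

Under these assumptions both metrics reduce to 89/90, which is why the two headline numbers coincide: a single query without a relevant chunk accounts for the whole gap from 100 percent.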

What carries the argument

The modular multi-agent pipeline with institutional guardrails that retrieves policy documents, enforces citation support, and refuses requests outside institutional scope such as course recommendations.
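One way to picture the citation-enforcement step is a post-generation check that every sentence of an answer cites at least one chunk that was actually retrieved. The sketch below is an assumption about how such a check could look, not the authors' implementation: the bracketed `[n]` citation format, the sentence splitter, and the function names are all hypothetical.

```python
import re
from typing import List, Set

# Assumed citation format: bracketed integers such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def cited_ids(sentence: str) -> Set[int]:
    """Collect the chunk ids cited in one sentence."""
    return {int(match) for match in CITATION_PATTERN.findall(sentence)}


def uncited_sentences(answer: str, retrieved_ids: Set[int]) -> List[str]:
    """Return sentences that cite nothing, or cite a chunk that was not retrieved."""
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems = []
    for sentence in sentences:
        ids = cited_ids(sentence)
        if not ids or not ids <= retrieved_ids:
            problems.append(sentence)
    return problems


# Placeholder answer text; chunks 1 and 2 are assumed to be the retrieved set.
answer = "Policy statement A [1]. Policy statement B [4]. Uncited statement."
print(uncited_sentences(answer, {1, 2}))
```

In this example the second sentence is flagged because it cites a chunk outside the retrieved set, and the third because it cites nothing. A pipeline could regenerate or refuse when the returned list is non-empty; which of those the paper's system does is not stated here.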

If this is right

MMR reranking, retrieval context of k=20, and citation enforcement each add measurable practical value to answer quality.
The system reduces advising bottlenecks by returning policy-grounded answers with source citations.
Departmental autonomy is preserved because guardrails can be tuned per policy domain.
Production LLM systems for institutional guidance must prioritize safety and transparency over conversational flexibility.
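For readers unfamiliar with MMR reranking, the following is a minimal sketch of the standard Maximal Marginal Relevance procedure of Carbonell and Goldstein: each step picks the candidate that best trades off similarity to the query against similarity to chunks already selected. The cosine similarity, the toy two-dimensional vectors, and the trade-off weight `lam` are illustrative assumptions; the paper's embedding model and parameter values are not given on this page.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec: Sequence[float],
               chunk_vecs: List[Sequence[float]],
               top_n: int,
               lam: float = 0.7) -> List[int]:
    """Return indices of chunks in MMR order (lam=1.0 is pure relevance ranking)."""
    candidates = list(range(len(chunk_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < top_n:
        def score(i: int) -> float:
            relevance = cosine(query_vec, chunk_vecs[i])
            # Redundancy is the highest similarity to any chunk already chosen.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors (hypothetical): chunks 0 and 1 are near-duplicates, chunk 2 differs.
query = [1.0, 1.0]
chunks = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_rerank(query, chunks, top_n=3, lam=0.5))  # [0, 2, 1]
print(mmr_rerank(query, chunks, top_n=3, lam=1.0))  # [0, 1, 2]
```

With `lam=0.5` the near-duplicate chunk 1 is pushed behind the more distinct chunk 2, which is the diversity effect that MMR is meant to provide over plain similarity ranking.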

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same guardrail pattern could be ported to other universities by swapping the underlying policy corpus without redesigning the agents.
The refusal logic might transfer to other constrained domains such as financial aid rules or housing regulations.
Larger-scale deployment logs could reveal whether students actually change behavior after receiving the guarded answers.

Load-bearing premise

The 90-query test set across six departments and the 30 adversarial queries are representative of real student interactions and sufficiently stress-test the guardrails for production deployment.

What would settle it

A new test collection of several hundred real student queries drawn from live advising records that yields retrieval success below 90 percent or guardrail refusal of more than 20 percent of benign queries would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2606.28360 by Ben Torsion, Jun Zhou.

**Figure 1.** System architecture showing three-layer design with hybrid PostgreSQL-Qdrant database, five-agent pipeline, and asynchro…

Original abstract

University students often struggle to navigate complex academic policies, leading to advising bottlenecks and delayed access to critical information. Although large language models (LLMs) offer promise for automated assistance, their tendency toward hallucination and inability to enforce institutional constraints make them unsuitable for high-stakes policy guidance without careful architectural design. We present Carolina Guide, a retrieval-augmented generation (RAG) system for academic policy assistance at the University of South Carolina (USC). The system employs a modular multi-agent pipeline with institutional guardrails to provide citation-supported, policy-grounded answers to student queries while refusing unsafe requests such as course recommendations or personalized advising. We evaluate the system on a 90-query test set across 6 departments, achieving 98.9% retrieval success at the >= 2 threshold (genuinely relevant results) with the first relevant chunk at rank 1 for 98.9% of queries (MRR@10 for rel >= 2 = 0.989). Through systematic baseline comparisons and ablation studies, we show that each architectural component, namely MMR reranking, adequate retrieval context (k=20), and citation enforcement, contributes measurable practical value despite limited statistical power at 90 queries. The evaluation of the guardrail on 30 adversarial queries demonstrates a Safety F1 of 0.89, correctly refusing 86% of unsafe queries while maintaining 93% coverage of benign queries. These results show that production-ready LLM systems for institutional policy guidance require rethinking standard RAG patterns to prioritize safety, transparency, and departmental autonomy over conversational sophistication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a straightforward report on deploying a guarded multi-agent RAG system for one university's policies, with solid but small-scale numbers.

Carolina Guide puts together a multi-agent RAG pipeline with citation enforcement and refusal guardrails to answer student questions about academic policies at the University of South Carolina. The system refuses unsafe requests like course recommendations and sticks to retrieved policy text.

The work is new mainly as an application. It shows a concrete modular setup that includes MMR reranking, k=20 context, and a separate guardrail agent. On their 90-query test set spanning six departments they report 98.9% retrieval success at the relevant threshold and MRR@10 of 0.989. The guardrail reaches 0.89 Safety F1 on 30 adversarial queries while keeping 93% coverage on benign ones. They include baseline comparisons and ablations that indicate each piece contributes something measurable.

The paper is open about the limited statistical power at n=90. The main soft spot is the evaluation data itself. Without details on whether the queries came from real logs or were synthesized, how relevance and safety labels were assigned, or how well the adversarial set covers actual edge cases, the headline metrics are hard to generalize beyond this specific test distribution. That limits how far the production-readiness claim can be taken.
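To put a number on the limited statistical power at n=90, a Wilson score interval can be computed for the retrieval success rate. This is a sketch using the standard Wilson formula with z=1.96 for an approximate 95 percent interval; the choice of interval method and the reading of 98.9 percent as 89 successes out of 90 are assumptions, not something the paper reports.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed reading of the headline figure: 89 of 90 queries succeed.
low, high = wilson_interval(89, 90)
print(f"{low:.3f} to {high:.3f}")
```

For 89 of 90 the interval runs from roughly 0.94 to 0.998, so the data are compatible with a true success rate several points below the headline figure even before any question about how representative the queries are.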

The paper is useful for teams at other universities or similar institutions who need a working template for policy assistance with safety constraints. Readers looking for new algorithmic ideas will not find them here. The citation pattern is normal for applied RAG work and the thinking is clear and honest about what was measured.

I would bring it to a reading group on practical LLM deployments. I would not cite it for technical novelty. It deserves peer review as a systems paper if the authors add more on data provenance and perhaps some additional evaluation.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Carolina Guide, a multi-agent RAG system with institutional guardrails for academic policy assistance at USC. It employs a modular pipeline to deliver citation-supported answers while refusing unsafe requests such as course recommendations. On a 90-query test set across 6 departments, the system achieves 98.9% retrieval success at the >=2 relevance threshold with first relevant chunk at rank-1 for 98.9% of queries (MRR@10 for rel>=2 = 0.989). Ablation studies attribute value to MMR reranking, k=20 context, and citation enforcement. The guardrail is tested on 30 adversarial queries, yielding Safety F1 of 0.89 (86% refusal of unsafe, 93% coverage of benign). The authors conclude that production-ready LLM policy systems require rethinking standard RAG to prioritize safety, transparency, and departmental autonomy over conversational features.

Significance. If the metrics generalize, the work provides concrete empirical support for modular multi-agent RAG designs and explicit guardrails in high-stakes institutional settings. The systematic baseline comparisons and ablation studies on specific components (MMR, context size, citation enforcement) are a strength, offering actionable evidence of their contributions despite the acknowledged limited statistical power. The focus on refusing unsafe queries directly addresses LLM hallucination risks in policy domains and could inform similar deployments.

major comments (2)

[Evaluation section on the 90-query test set] The provenance of the queries (real student logs vs. synthetic generation), any blinding, and inter-annotator agreement for relevance labels are not reported. This is load-bearing for the central claims of 98.9% retrieval success and MRR@10=0.989, because without evidence that the test distribution matches actual student interactions across departments, the generalization to production use cannot be assessed.
[Guardrail evaluation section on the 30 adversarial queries] The construction process for the adversarial queries and their coverage of unsafe request types (e.g., personalized advising variants or cross-department edge cases) are not specified. This directly affects the reliability of the Safety F1=0.89 result as support for the guardrail's robustness claim.
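The Safety F1 figure is easier to interpret with the arithmetic written out. The sketch below assumes the common convention that a refusal of an unsafe query is the positive class, so recall is the share of unsafe queries refused; the paper's exact definition is not given on this page, and the counts in the example are hypothetical rather than the paper's confusion matrix.

```python
from typing import Tuple


def safety_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), treating 'refuse an unsafe query' as positive.

    tp: unsafe queries refused; fp: benign queries refused; fn: unsafe queries answered.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P; only meaningful when 2R > F1."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts for illustration only.
print(safety_f1(tp=8, fp=2, fn=2))

# Under the assumed definition, the reported F1 and refusal rate imply a precision.
print(round(implied_precision(0.89, 0.86), 2))  # 0.92
```

If the assumed definition matches the paper's, a recall of 0.86 and an F1 of 0.89 imply a refusal precision near 0.92, computed from rounded inputs. Reporting the underlying counts would remove the need for this kind of back-calculation.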

minor comments (1)

[Abstract] The abstract's phrasing of 'first relevant chunk at rank-1 for 98.9% of queries' should be aligned more precisely with the MRR@10 definition to avoid potential misreading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. The comments highlight important aspects of transparency that we will address in a revised manuscript.

Referee: [Evaluation section on the 90-query test set] The provenance of the queries (real student logs vs. synthetic generation), any blinding, and inter-annotator agreement for relevance labels are not reported. This is load-bearing for the central claims of 98.9% retrieval success and MRR@10=0.989, because without evidence that the test distribution matches actual student interactions across departments, the generalization to production use cannot be assessed.

Authors: We agree that these details are necessary for readers to evaluate the strength of our claims. We will revise the Evaluation section to explicitly describe the provenance of the 90 queries, any blinding procedures employed, and inter-annotator agreement (or lack thereof) for the relevance labels. This addition will clarify the scope and limitations of the reported metrics.

revision: yes

Referee: [Guardrail evaluation section on the 30 adversarial queries] The construction process for the adversarial queries and their coverage of unsafe request types (e.g., personalized advising variants or cross-department edge cases) are not specified. This directly affects the reliability of the Safety F1=0.89 result as support for the guardrail's robustness claim.

Authors: We concur that specifying the query construction process and coverage of unsafe request types is required to substantiate the guardrail evaluation. We will expand the Guardrail evaluation section to detail how the 30 adversarial queries were developed and the range of unsafe request types they encompass, including variants such as personalized advising and cross-department cases.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on held-out test set with no derivations or self-citation chains

The paper describes a RAG system and reports performance via direct measurements (98.9% retrieval success, MRR@10=0.989, Safety F1=0.89) on a fixed 90-query test set and 30 adversarial queries. No equations, parameter fitting presented as prediction, uniqueness theorems, or self-citations appear in the provided text. Claims rest on explicit evaluation results rather than any derivation that reduces to its own inputs by construction. This is the standard case of an empirical systems paper whose central claims are falsifiable against the reported test distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; the central claims rest on empirical test-set performance rather than mathematical axioms, fitted parameters, or postulated entities.

Reference graph

Works this paper leans on

23 extracted references · 7 canonical work pages · 2 internal anchors

[1] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)

[2] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. 335–336

[3] CJ Date. 1994. An introduction to database systems. Addison-Wesley, Reading, Massachusetts (1994)

[4] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627 (2023)

[5] Ashok K Goel and Lalith Polepeddi. 2018. Jill Watson: A virtual teaching assistant for online education. In Learning engineering for online education. Routledge, 120–143

[6] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446

[7] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547

[8] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In EMNLP (1). 6769–6781

[9] Alice Kerly, Phil Hall, and Susan Bull. 2006. Bringing chatbots into education: Towards natural language negotiation of open learner models. In International conference on innovative techniques and applications of artificial intelligence. Springer, 179–192

[10] Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024 (2022)

[11] Martin Kleppmann. 2019. Designing data-intensive applications

[12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems 33 (2020), 9459–9474

[13] Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). 163–173

[14] Shangkun Liu, Cuixia Zhang, and Jiaman Ma. 2020. A novel ensemble deep learning model for stock prediction based on stock prices and news. PLoS ONE 15, 9 (2020), e0238314. doi:10.1371/journal.pone.0238314

[15] Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 (2022)

[16] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744

[17] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118 (2020)

[18] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389

[19] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294 (2023)

[20] Ben Torkian and Jun Zhou. 2026. Design and Implementation of a Safety-First AI Chatbot Architecture for Public Health Resource Navigation. In Practice and Experience in Advanced Research Computing 2026. ACM. Under Review

[21] Ben Torkian and Jun Zhou. 2026. A Multi-Agent AI System for Automated High School Transcript Processing: Collaborative Document Analysis at Scale. In Practice and Experience in Advanced Research Computing 2026. ACM. Under Review

[22] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling

[23] Jun Zhou, Yuhang Lu, Karen Smith, Colin Wilder, Song Wang, Paul Sagona, and Ben Torkian. 2019. A framework for design identification on heritage objects. In Practice and Experience in Advanced Research Computing 2019: Rise of the Machines (learning). ACM, 1–8
