ChhetriSachinPaudel_MCS_2026

Title	ChhetriSachinPaudel_MCS_2026
Alternative Title	Using Custom Data to Train and Evaluate Chatbots with Existing AI Tools
Creator	Chhetri, Sachin Paudel
Contributors	Zhang, Yong (advisor)
Collection Name	Master of Computer Science
Abstract	This paper investigates whether a practical, domain-specific chatbot can be built effectively from custom institutional data under modest hardware constraints, and whether retrieval-based grounding or parameter-efficient fine-tuning is the more dependable strategy for this setting. The study uses a custom Weber State University Computer Science and MSCS corpus containing 851 normalized records, split at the document level into training, validation, and test partitions to prevent data leakage. Three approaches were implemented and evaluated on a locked 52-question benchmark containing 44 answerable questions and 8 unanswerable control questions: a dense Retrieval-Augmented Generation (RAG) system, a LoRA fine-tuned model, and a QLoRA fine-tuned model. The dense RAG system used Qwen/Qwen3-Embedding-0.6B for retrieval, Qwen/Qwen3-Reranker-0.6B for reranking, and Qwen/Qwen3-4B-Instruct-2507 for grounded answer generation. LoRA and QLoRA used Qwen/Qwen2.5-1.5B-Instruct as a shared base model and were trained on QA-aligned supervision derived strictly from the training and validation splits. Evaluation was conducted using a deterministic custom suite reporting Token F1 (word-level), Semantic Similarity, Factual Term Match, Abstention Accuracy, and runtime summaries rather than an LLM-as-a-judge framework. RAG achieved the strongest overall answer quality, with a Semantic Similarity of 0.8290 and Factual Term Match of 0.9091, substantially outperforming LoRA and QLoRA. LoRA and QLoRA achieved higher abstention accuracy overall, but that advantage mainly reflected their tendency to answer more often rather than stronger evidence calibration. These findings suggest that for small institutional knowledge bases, retrieval-based grounding remains the most reliable method for factual accuracy and source fidelity, while parameter-efficient fine-tuning offers a lighter-weight adaptation strategy with weaker grounding. 
The paper concludes with a practical recommendation: use RAG when evidence-backed answers matter most and treat LoRA or QLoRA as supporting baselines rather than replacements for retrieval.
Subject	Universities and colleges--Data processing; Chatbots; Question-answering systems; Information retrieval; Machine learning; Natural language processing (Computer science)
Keywords	Computer Science; Retrieval-Augmented Generation (RAG); LLMs; Chatbots
Digital Publisher	Digitized by Special Collections & University Archives, Stewart Library, Weber State University.
Date	2026-05
Medium	theses
Type	Text
Access Extent	23 page pdf
Conversion Specifications	Adobe Acrobat
Language	eng
Rights	The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce his or her thesis, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. For further information:
Source	University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University
OCR Text	Using Custom Data to Train and Evaluate Chatbots with Existing AI Tools
by Sachin Paudel Chhetri
A Thesis/Project in the Field of Master of Science in Computer Science
Approved:
Dr. Yong Zhang, Advisor/Committee Chair
Dr. Abdulmalek Al-Gahmi, Committee Member
Dr. Meher Shaikh, Committee Member
Weber State University
April 2026

Abstract

This paper investigates whether a practical, domain-specific chatbot can be built effectively from custom institutional data under modest hardware constraints, and whether retrieval-based grounding or parameter-efficient fine-tuning is the more dependable strategy for this setting. The study uses a custom Weber State University Computer Science and MSCS corpus containing 851 normalized records, split at the document level into training, validation, and test partitions to prevent data leakage. Three approaches were implemented and evaluated on a locked 52-question benchmark containing 44 answerable questions and 8 unanswerable control questions: a dense Retrieval-Augmented Generation (RAG) system, a LoRA fine-tuned model, and a QLoRA fine-tuned model. The dense RAG system used Qwen/Qwen3-Embedding-0.6B for retrieval, Qwen/Qwen3-Reranker-0.6B for reranking, and Qwen/Qwen3-4B-Instruct-2507 for grounded answer generation. LoRA and QLoRA used Qwen/Qwen2.5-1.5B-Instruct as a shared base model and were trained on QA-aligned supervision derived strictly from the training and validation splits. Evaluation was conducted using a deterministic custom suite reporting Token F1 (word-level), Semantic Similarity, Factual Term Match, Abstention Accuracy, and runtime summaries rather than an LLM-as-a-judge framework. RAG achieved the strongest overall answer quality, with a Semantic Similarity of 0.8290 and Factual Term Match of 0.9091, substantially outperforming LoRA and QLoRA.
LoRA and QLoRA achieved higher abstention accuracy overall, but that advantage mainly reflected their tendency to answer more often rather than stronger evidence calibration. These findings suggest that for small institutional knowledge bases, retrieval-based grounding remains the most reliable method for factual accuracy and source fidelity, while parameter-efficient fine-tuning offers a lighter-weight adaptation strategy with weaker grounding. The paper concludes with a practical recommendation: use RAG when evidence-backed answers matter most and treat LoRA or QLoRA as supporting baselines rather than replacements for retrieval.

Table of Contents
1. Introduction
2. Related Work
    2.1 Retrieval-Augmented Generation
    2.2 Parameter-Efficient Fine-Tuning
    2.3 Evaluation of Generative Question Answering
3. Methodology
    3.1 Data Pipeline and Corpus Construction
    3.2 Benchmark Design
    3.3 Dense RAG System
    3.4 LoRA and QLoRA Fine-Tuning
    3.5 Evaluation Design
4. Results
    4.1 Main Benchmark Results
    4.2 Interpreting Token F1 (word-level) and Runtime
    4.3 Manual Review by the Author and Category-Level Findings
5. Discussion
    5.1 Why RAG Performed Best
    5.2 What LoRA Contributed
    5.3 Why QLoRA Lagged Behind LoRA
    5.4 Abstention as a Separate Dimension
    5.5 Reproducibility and Implementation Scope
    5.6 Limitations
6. Future Work
7. Conclusion
References

1. Introduction

Large language models have made it possible to build conversational systems that appear knowledgeable across a wide range of domains. However, broad capability does not automatically translate to dependable performance in narrow institutional settings such as university advising, departmental policy interpretation, or program-specific question answering. In these environments, the central challenge is not simply generating fluent text. The real challenge is producing answers that are grounded in authoritative local documents, remain faithful to institutional language, and avoid confident mistakes when the required information is not present.

This paper addresses that challenge by studying a practical question: for a small, domain-specific university question-answering system built from custom institutional data, what are the practical trade-offs between retrieval-based grounding and parameter-efficient fine-tuning, and which method is most dependable under realistic student hardware constraints? This question was motivated by the goal of building a chatbot using Weber State University Computer Science data without access to large proprietary infrastructure. The project was designed around open-source tools, local and Colab-friendly workflows, and a methodology that could be reproduced by a graduate student or small research team.

Retrieval-Augmented Generation, LoRA, and QLoRA were selected as the three primary technical approaches for comparison. RAG offers the ability to answer questions by retrieving relevant source passages at inference time, which is especially attractive for university information that is structured, policy-driven, and subject to change. LoRA and QLoRA, by contrast, offer parameter-efficient ways to adapt an existing language model to a local domain without full-model retraining.
These methods are appealing under limited hardware budgets because they dramatically reduce memory and compute requirements compared with standard fine-tuning.

A rigorous comparison requires reproducibility, leakage prevention, benchmark discipline, and interpretable results. For that reason, the study was built around document-level train/validation/test separation, a manually curated benchmark, unanswerable control questions, manual review, and a shared evaluation pipeline applied consistently across all three methods.

The results show that dense RAG is the strongest method for grounded answer quality in this setting, while LoRA remains a credible lightweight baseline and QLoRA provides the most hardware-efficient training path but the weakest answer fidelity. More broadly, the project contributes a reproducible workflow for small-scale institutional question-answering research using custom data, open-source models, and modest hardware. A secondary contribution of this work is a fully reproducible evaluation pipeline for comparing retrieval-based and fine-tuning-based approaches on custom institutional data.

2. Related Work

2.1 Retrieval-Augmented Generation

Retrieval-Augmented Generation was introduced as a framework that combines a parametric generator with a non-parametric memory in the form of an external document retriever [1]. Rather than relying entirely on model parameters to recall facts, a RAG system retrieves relevant passages and conditions generation on those passages at inference time. This makes the system more adaptable to knowledge-intensive tasks, especially when correctness depends on access to external sources rather than memorized associations.

For a project centered on university program information, advising processes, and departmental policies, this motivation is especially relevant. Institutional facts are often spread across webpages, policy documents, and FAQ materials.
Many answers require precise language rather than broad semantic approximation. RAG is therefore a natural baseline for a domain-specific chatbot because it grounds responses in source material instead of relying solely on internalized weights.

Later survey work situates RAG as a broader design family rather than a single architecture [2]. The survey highlights major design dimensions such as retrieval strategy, chunking, reranking, context construction, and evaluation. That framing is useful here because the present study does not evaluate an abstract notion of retrieval alone; it evaluates a specific dense retrieval stack tailored to grounded institutional question answering.

2.2 Parameter-Efficient Fine-Tuning

LoRA provides a way to adapt pretrained language models by freezing the base weights and training only low-rank update matrices [3]. This greatly reduces the number of trainable parameters and lowers memory requirements, making fine-tuning possible on constrained hardware while often preserving strong downstream performance. LoRA is particularly attractive in resource-constrained settings because it offers a realistic path for students or small teams to adapt a general model to a local corpus without full-model retraining.

QLoRA extends this idea by combining low-rank adapters with 4-bit quantization of the base model [4]. This approach can preserve competitive fine-tuning quality while dramatically reducing the hardware cost of training large language models. QLoRA is especially relevant to Colab-style experimentation because it directly targets the question of how far parameter-efficient methods can be pushed on modest GPUs.

Recent PEFT survey work reinforces the practical motivation for including LoRA and QLoRA in this comparison.
Broad surveys of parameter-efficient fine-tuning describe these methods as part of a larger family of techniques designed to reduce trainable parameters, memory use, and implementation cost while preserving useful adaptation behavior [5], [6]. A recent LoRA-focused survey also shows how low-rank adaptation has expanded into many variants and applications for large language models, which supports treating LoRA as a central and current baseline rather than a minor implementation detail [7].

Repeatability is also an important concern for fine-tuning-based experiments. Work on repeated QLoRA fine-tuning shows that even when hardware, software, and training settings are controlled, repeated fine-tuning runs can still exhibit variation [8]. This concern supports the reproducibility-oriented design of the present study, including fixed data splits, a locked benchmark, explicit training configurations, and deterministic evaluation metrics.

In the context of this study, LoRA and QLoRA were included not because they were assumed to outperform retrieval on every metric, but because they represent realistic alternatives that many students and small teams might consider when retrieval infrastructure is unavailable or deployment simplicity matters more than maximum grounding fidelity. The study's emphasis on task-aligned supervision (Section 3.4) also connects to the instruction tuning literature, which shows that fine-tuning language models on instruction-structured tasks can substantially improve downstream generalization when the training format matches intended use [9].

2.3 Evaluation of Generative Question Answering

Evaluating open-ended generative question answering is difficult. Token-overlap metrics alone often understate the quality of answers that paraphrase correct information rather than reproduce it verbatim.
Surveys of large language model evaluation therefore emphasize the need for multi-metric assessment, careful interpretation, and caution when choosing automatic scoring criteria [10].

One widely discussed framework is RAGAS, which provides automated evaluation for retrieval-augmented systems using metrics such as faithfulness and answer relevance [11]. However, RAGAS relies on LLM-as-a-judge behavior and is best understood as one evaluation framework among several rather than as a mandatory standard. Related work on judge-based evaluation similarly shows that model-based judging can be useful while still introducing another model into the loop, along with its own biases, dependencies, and reproducibility challenges [12].

For a study comparing a retrieval-based system with non-retrieval fine-tuned systems on the same locked benchmark, a deterministic local metric suite offers a cleaner apples-to-apples comparison across methods. That consideration directly informed the final evaluation design used in this paper. Related work on calibration and uncertainty also suggests that language models can exhibit partial awareness of what they know and do not know, making abstention behavior a meaningful evaluation dimension rather than an ad hoc design choice [13].

3. Methodology

3.1 Data Pipeline and Corpus Construction

The project corpus was assembled from custom Weber State University Computer Science and MSCS materials. After preprocessing, normalization, and filtering, the final corpus contained 851 records, partitioned at the document level into 664 training, 91 validation, and 96 test records. Document-level splitting was a deliberate methodological choice to prevent leakage: all passages from a given source document were assigned to exactly one split so that the evaluation set would not contain adjacent or near-duplicate text from documents already seen during training.

The finalized corpus drew from two main data sources.
Dataset 1 consisted of public Weber State CS and MSCS web content, including department pages, program information, advising materials, FAQs, and academic policy pages. Dataset 2 consisted of structured departmental policy and operational documents. A third dataset consisting of institutional email correspondence was considered during planning but was ultimately excluded. This exclusion was methodologically appropriate because a representative email corpus would have required stronger anonymization, broader privacy coordination, and possibly additional institutional review beyond the project timeline.

The preprocessing pipeline normalized HTML, DOCX, JSON, and plain text source files into a shared schema. Empty, duplicated, and obvious boilerplate records were removed. The resulting corpus was suitable for both dense retrieval indexing and QA-aligned supervision.

Table 1. Corpus split counts
Split        Records
Train        664
Validation   91
Test         96
Total        851

Table 2. Source composition of the final corpus
Source group                                   Records
Public Weber State CS/MSCS web content         313
Structured policy and operational documents    538
Total                                          851

3.2 Benchmark Design

The final benchmark contained 52 questions, of which 44 were answerable and 8 were unanswerable control questions. The answerable set was distributed across factual, policy/process, and FAQ/how-to categories. Each answerable question was manually verified against authoritative source material, and reference answers were written directly from those sources rather than generated automatically.

The eight unanswerable control questions were included as a methodological safeguard. In a university setting, a confidently incorrect answer about program requirements, administrative procedures, or policy language can be more harmful than an honest abstention. Unanswerable controls therefore served as a direct test of calibration and caution.
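
The document-level split and duplicate removal described in Section 3.1 can be illustrated with a minimal sketch. The thesis does not state how documents were assigned to partitions, so the hash-based assignment, the split fractions, and the field names `doc_id` and `text` are assumptions made only for illustration.

```python
import hashlib
from collections import defaultdict


def assign_split(doc_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a source document ID to one split, deterministically (assumed scheme)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_records(records):
    """Drop empty and duplicated records, then group by document-level split."""
    splits = defaultdict(list)
    seen = set()
    for rec in records:
        key = (rec["doc_id"], rec["text"].strip())
        if not key[1] or key in seen:
            continue  # skip empty text and exact duplicates
        seen.add(key)
        # The split depends only on doc_id, so passages of one document stay together.
        splits[assign_split(rec["doc_id"])].append(rec)
    return splits


# Hypothetical records; the IDs and texts are placeholders, not corpus content.
records = [
    {"doc_id": "advising-faq", "text": "passage 1"},
    {"doc_id": "advising-faq", "text": "passage 2"},
    {"doc_id": "advising-faq", "text": "passage 2"},
    {"doc_id": "program-handbook", "text": "passage 1"},
]
splits = split_records(records)
```

Because the split is a function of the document identifier alone, no document can contribute passages to more than one partition, which is the leakage property the methodology relies on.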
The benchmark was intentionally limited to 52 questions rather than expanded toward a much larger target. After normalization, the retained sources converged into one coherent institutional domain, and constructing a much larger set of high-quality, non-redundant, manually verified questions from held-out material was not methodologically justified. The final benchmark therefore reflects a deliberate quality-over-quantity decision.

Table 3. Composition of the locked evaluation benchmark
Category               Questions   Answerable   Unanswerable
Factual                18          18           0
Policy / Process       19          19           0
FAQ / How-to           7           7            0
Unanswerable Control   8           0            8
Total                  52          44           8

3.3 Dense RAG System

The final RAG system used a dense retrieval pipeline rather than a lexical retrieval approach. Lexical retrieval is limited by vocabulary mismatch: when user phrasing differs from the wording in the source material, sparse methods such as BM25 can miss highly relevant passages. This limitation is especially pronounced in policy and FAQ settings, where questions and source documents often use different surface forms for the same underlying concept. Dense retrieval addresses this gap by operating in a shared semantic embedding space.

The final dense RAG stack consisted of three Qwen-family models: Qwen/Qwen3-Embedding-0.6B as the retriever, Qwen/Qwen3-Reranker-0.6B as the cross-encoder reranker, and Qwen/Qwen3-4B-Instruct-2507 as the generator. At inference time, each question was embedded using the dense retriever and compared against a prebuilt dense index of held-out test records. The top 12 candidates were retrieved by cosine similarity and then reranked using the cross-encoder. The top 5 passages, up to a maximum context budget of 2,400 characters, were injected into a grounded prompt for generation. If the retrieved evidence fell below the configured dense-score threshold, the system abstained rather than generating a potentially unsupported answer.
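
The retrieve, rerank, budget, and abstain steps described above can be sketched in plain Python. This is an illustrative outline, not the thesis implementation: the threshold value, the rule that the best dense score is compared against it, the whole-passage budget policy, and the `rerank_score` callable standing in for the cross-encoder are all assumptions.

```python
import math

TOP_K_RETRIEVE = 12        # dense candidates kept (stated in the text)
TOP_K_CONTEXT = 5          # passages passed to the generator (stated in the text)
MAX_CONTEXT_CHARS = 2400   # context budget (stated in the text)
DENSE_THRESHOLD = 0.5      # hypothetical value; the thesis does not report it


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_context(query_vec, index, rerank_score):
    """Return passages for the grounded prompt, or None to signal abstention.

    index: list of (passage_text, embedding) pairs.
    rerank_score: callable(passage_text) -> float, a stand-in for a cross-encoder.
    """
    scored = sorted(((cosine(query_vec, emb), text) for text, emb in index), reverse=True)
    candidates = scored[:TOP_K_RETRIEVE]
    # Assumed abstention rule: the best dense score must reach the threshold.
    if not candidates or candidates[0][0] < DENSE_THRESHOLD:
        return None
    reranked = sorted((text for _, text in candidates), key=rerank_score, reverse=True)
    context, used = [], 0
    for text in reranked[:TOP_K_CONTEXT]:
        if used + len(text) > MAX_CONTEXT_CHARS:
            break  # assumed policy: stop before exceeding the character budget
        context.append(text)
        used += len(text)
    return context


# Toy index with placeholder passages and two-dimensional embeddings.
index = [
    ("passage about admission requirements", [1.0, 0.0]),
    ("passage about office hours", [0.0, 1.0]),
]
on_topic = select_context([0.9, 0.1], index, lambda t: float("admission" in t))
off_topic = select_context([-1.0, -1.0], index, lambda t: 0.0)
```

In this toy run `on_topic` holds both passages with the admission passage ranked first, while `off_topic` is `None`, which corresponds to the abstention path.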
This design preserved a clean separation between retrieval and answer generation, making system behavior easier to interpret and reducing the likelihood of unsupported responses on out-of-scope questions.

3.4 LoRA and QLoRA Fine-Tuning

Both LoRA and QLoRA used Qwen/Qwen2.5-1.5B-Instruct as the shared base model. This choice was driven by practical constraints and methodological fairness. The Colab T4 GPU provides approximately 15 GB of VRAM, which makes stable fine-tuning of 7B or 8B parameter models difficult without severe compromises in gradient accumulation or training configuration. Qwen2.5-1.5B-Instruct fit within that hardware budget and allowed LoRA and QLoRA to be compared under identical base-model conditions, isolating quantization as the primary experimental difference between them.

A key design decision was the move to QA-aligned supervision. Earlier experiments using chunk-restatement style training produced weaker results because the training task did not closely resemble the final evaluation task. In practice, this supervision format encouraged the model to restate local content rather than answer benchmark-style questions directly, which limited downstream question-answering performance. In the final system, deterministic question-answer pairs were generated from the training and validation splits, yielding 520 training examples and 62 validation examples. This alignment between supervision format and benchmark task substantially improved the quality of both fine-tuned models.

LoRA and QLoRA shared the same base model, prompt structure, and QA-aligned supervision. LoRA used rank-16 adapters with three training epochs, a learning rate of 2e-4, cosine scheduling, gradient accumulation of 16, and a maximum sequence length of 512. QLoRA kept the same supervision, prompt format, and adapter structure, but loaded the base model in NF4 4-bit quantization with a paged AdamW-8bit optimizer, making quantization the main experimental variable.
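
To make the scale of rank-16 adaptation concrete, the following sketch counts the trainable parameters that LoRA adds to a single weight matrix. The rank comes from the configuration above; the layer width of 1024 is a hypothetical value chosen for round numbers and is not the actual dimension of Qwen2.5-1.5B-Instruct.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def dense_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen full weight matrix W (d_out x d_in)."""
    return d_out * d_in


RANK = 16        # adapter rank reported in the text
D_MODEL = 1024   # hypothetical layer width, for illustration only

adapter = lora_trainable_params(D_MODEL, D_MODEL, RANK)
frozen = dense_params(D_MODEL, D_MODEL)
fraction = adapter / frozen

print(adapter, frozen, fraction)  # 32768 1048576 0.03125
```

Under these assumed dimensions the adapter trains about 3% of the parameters of the matrix it modifies, while the frozen base weights are the part that QLoRA additionally stores in 4-bit form.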
At inference time, neither fine-tuned approach used external retrieval.

3.5 Evaluation Design

The evaluation suite was designed to be deterministic, locally reproducible, and applicable equally to all three methods, including the unanswerable control questions. Rather than using an LLM-as-a-judge framework, the study implemented a custom deterministic suite reporting Token F1 (word-level), Semantic Similarity, Factual Term Match, Abstention Accuracy, and latency/runtime summaries.

Semantic Similarity was computed using BAAI/bge-small-en-v1.5 embeddings and served as the primary answer-quality metric for generative question answering, following the broader sentence-embedding approach popularized by Sentence-BERT for cosine-based semantic comparison [14]. BAAI/bge-small-en-v1.5 was selected as a compact and practical embedding model for sentence-level semantic comparison, with the broader choice of embedding-based evaluation informed by MTEB, a large benchmark for evaluating text embeddings across diverse tasks and languages [15]. Factual Term Match measured partial factual agreement through thresholded token overlap, while Abstention Accuracy measured whether the system answered when answers existed and abstained otherwise.

In addition to automatic metrics, 30 responses per method were reviewed manually on a three-point support scale to assess source support and hallucination behavior. The manual review was performed by the author using the following rubric: 3 = correct and relevant, 2 = partially correct, and 1 = incorrect or irrelevant.

Deterministic metrics support strict reproducibility across computing environments and allow the three methods to be scored on identical, auditable criteria. For a study centered on a locked benchmark and cross-method comparison, this offers a stronger foundation than judge-based scoring, which can vary across model versions and API configurations.

4. Results

4.1 Main Benchmark Results

The three methods were evaluated on the same 52-question locked benchmark. Table 4 presents the main comparison results.

RAG performed best on all answer-quality metrics by a substantial margin. Its Semantic Similarity reached 0.8290, compared with 0.7297 for LoRA and 0.6914 for QLoRA. Factual Term Match showed the same pattern, with RAG at 0.9091, LoRA at 0.6136, and QLoRA at 0.5000. Token F1 (word-level) also favored RAG, though Token F1 is not the best standalone measure for this generative task.

LoRA clearly outperformed QLoRA on answer quality, suggesting measurable degradation from 4-bit quantization at the 1.5B scale. QLoRA remained a meaningful efficiency-oriented baseline, but it did not preserve enough answer fidelity to match either dense RAG or full-precision LoRA in this institutional question-answering setting.

Table 4. Main benchmark results across the three methods
Metric                  RAG      LoRA     QLoRA
Token F1 (word-level)   0.4839   0.2490   0.1935
Semantic Similarity     0.8290   0.7297   0.6914
Factual Term Match      0.9091   0.6136   0.5000
Abstention Accuracy     0.6731   0.9038   0.8846
Note: Semantic Similarity is the primary answer-quality metric for this generative question-answering task because correct answers often paraphrase the reference text. Token F1 is computed at the word level after normalization.

4.2 Interpreting Token F1 (word-level) and Runtime

Token F1 (word-level) must be interpreted carefully because the systems were solving generative rather than extractive question answering. A response can be semantically correct while showing limited lexical overlap with the reference answer. For that reason, Semantic Similarity is the more meaningful primary metric in this study. The RAG system’s Semantic Similarity score of 0.8290 is therefore more informative than its Token F1 score of 0.4839 when assessing overall answer quality.
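
The gap between lexical overlap and meaning can be seen in a minimal word-level Token F1 sketch. The thesis states only that Token F1 is computed at the word level after normalization; the specific normalization below (lowercasing and keeping alphanumeric tokens) and the example sentences are assumptions for illustration.

```python
import re
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase and split into alphanumeric word tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: a paraphrase that shares only one word with the reference.
reference = "Students must apply before the deadline"
paraphrase = "Applications are due by the cutoff date"

print(round(token_f1(paraphrase, reference), 4))  # 0.1538
print(token_f1(reference, reference))             # 1.0
```

The paraphrase conveys roughly the same requirement yet scores about 0.15, which is why an embedding-based metric is the more informative primary measure for generative answers.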
Table 5 summarizes training and inference characteristics and provides context for the quality-latency trade-off observed across methods.

Table 5. Training and runtime summary
Measure                  LoRA                    QLoRA                   Dense RAG
Base / generator model   Qwen2.5-1.5B-Instruct   Qwen2.5-1.5B-Instruct   Qwen3-4B-Instruct-2507
Training runtime         696 s                   566 s                   N/A
Training loss (final)    0.8402                  0.8477                  N/A
Mean inference latency   1,328 ms                2,488 ms                6,830 ms
P95 inference latency    2,381 ms                4,886 ms                8,851 ms
Max inference latency    3,058 ms                11,730 ms               8,980 ms
Peak CUDA allocated      175 MB                  175 MB                  178 MB
Note: All runs were executed on Google Colab T4. The maximum QLoRA latency spike likely reflects first-token dequantization overhead on long inputs.

Mean inference latency was 1,328 ms for LoRA, 2,488 ms for QLoRA, and 6,830 ms for dense RAG. QLoRA showed a large maximum-latency spike of 11.73 seconds, likely reflecting first-token dequantization overhead on long inputs, although this hypothesis was not independently verified. These results indicate that retrieval improves answer quality at the cost of latency, while fine-tuning-based approaches are lightweight at inference time.

4.3 Manual Review by the Author and Category-Level Findings

Table 6. Manual review summary
Method   Average rating (1–3)   Source-supported   Hallucination rate
RAG      2.57                   93.3%              6.7%
LoRA     1.33                   26.7%              50.0%
QLoRA    1.23                   20.0%              40.0%
Note: Manual review was conducted by the author using a fixed three-point rubric and was not used for model retraining.

Manual review supported the automatic results. RAG received an average rating of 2.57 out of 3, with 93.3% of reviewed responses judged source-supported and only 6.7% judged hallucinatory. LoRA averaged 1.33, with 26.7% source-supported responses and a 50.0% hallucination rate. QLoRA averaged 1.23, with 20.0% source-supported responses and a 40.0% hallucination rate.
This pattern reinforces the quantitative findings: RAG was not merely better on automated metrics, but also more trustworthy when judged for source support and hallucination behavior.

Performance also differed by question type. Per-category Semantic Similarity for RAG reached 0.786 on factual questions, 0.806 on FAQ/how-to questions, and 0.878 on policy/process questions, indicating that retrieval was especially effective where institutional wording precision and traceability to official documents mattered most. LoRA was somewhat more competitive on simple factual questions, suggesting that fine-tuning can internalize stable mappings such as names, numbers, or short program facts. Even in those cases, however, RAG remained slightly stronger overall. The category pattern reinforces the practical conclusion that retrieval is especially valuable when the task depends on exact institutional language rather than on broad topic familiarity.

5. Discussion

5.1 Why RAG Performed Best

The strongest overall finding of the study is that retrieval-based grounding was the most dependable strategy for this task. The university domain used here consists of program requirements, advising instructions, deadlines, and policy language that are better answered by grounding generation in explicit source passages than by expecting a compact model to store all relevant knowledge internally.

Three components worked together to produce RAG’s advantage. Dense retrieval helped surface relevant evidence even when question wording differed from the source text. Reranking improved precision by selecting the passages most directly connected to the question. Grounded prompting then constrained answer generation to a bounded set of institutional evidence. Together, these components yielded stronger semantic alignment with gold references and much lower hallucination rates in manual review.
5.2 What LoRA Contributed

LoRA still produced an important positive result. Once training shifted from generic chunk restatement to QA-aligned supervision, the model became a respectable baseline and exceeded the original 0.70 semantic-similarity target. This is a meaningful methodological finding: parameter-efficient fine-tuning can work under tight hardware constraints when the training objective is carefully aligned with the downstream task.

At the same time, LoRA remained weaker than RAG because a retrieval-free model must rely entirely on internalized parametric knowledge. That limitation becomes especially visible when questions require exact policy wording, multi-step procedural details, or distinctions that are easy to blur without explicit evidence at inference time. LoRA therefore emerges as a credible lightweight baseline, but not as a full replacement for grounded retrieval in evidence-sensitive institutional question answering.

5.3 Why QLoRA Lagged Behind LoRA

QLoRA was included because it represents a more memory-efficient version of the same adaptation strategy. In principle, that makes it attractive for modest GPUs. In this study, however, QLoRA underperformed LoRA on every answer-quality metric. The most plausible explanation is that 4-bit quantization at the 1.5B scale compressed the representation enough to reduce answer fidelity in a narrow institutional domain.

This result should not be generalized too aggressively. It does not make QLoRA irrelevant as a method. QLoRA remains meaningful for resource-constrained experimentation and may perform more competitively at larger model scales or in different domains. Within the present study, however, the quality loss was large enough that QLoRA could not match either dense RAG or full-precision LoRA.

5.4 Abstention as a Separate Dimension

The abstention results are the most nuanced finding in the study.
RAG scored 0.6731 on Abstention Accuracy, while LoRA and QLoRA scored 0.9038 and 0.8846, respectively. On first inspection, this appears to favor the fine-tuned models. A closer reading shows that the metric reflects different answer strategies rather than a simple quality ranking.

RAG correctly abstained on all 8 unanswerable control questions (100%), indicating strong caution when evidence was absent. Its lower overall Abstention Accuracy came from incorrectly abstaining on 17 of the 44 answerable questions (38.6%) when retrieved evidence fell below threshold. LoRA and QLoRA, by contrast, rarely abstained. That behavior increased their abstention scores on answerable items but also raised hallucination rates in manual review.

Abstention behavior and answer quality therefore need to be interpreted separately. A system can answer more often and still be less trustworthy overall. This distinction is essential to understanding why higher abstention accuracy for LoRA and QLoRA should not be read as stronger evidence calibration.

5.5 Reproducibility and Implementation Scope

The implementation favored direct use of the Hugging Face ecosystem over higher-level orchestration frameworks such as LangChain or LlamaIndex, which introduce abstraction layers that can complicate reproducibility in a benchmark-locked study. For this project, that choice improved transparency, version stability, and control over retrieval and generation behavior. A benchmark-locked study benefits from auditable scripts and explicit control over model loading, prompts, thresholds, and evaluation artifacts.

This engineering choice matters because research conclusions should be traceable to specific, reproducible implementation decisions. In that sense, the study contributes not only a model comparison but also a reproducibility-oriented workflow for small-scale institutional question-answering research.
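Returning briefly to the abstention figures in Section 5.4, the reported RAG score can be checked by hand. The sketch below assumes that Abstention Accuracy is the share of all 52 benchmark questions on which the abstain-or-answer decision was correct. That definition is an assumption made here for illustration, and the function name is hypothetical, but it reproduces the reported 0.6731 from the counts given in the text.

```python
def abstention_accuracy(answerable_total, answerable_abstained,
                        unanswerable_total, unanswerable_abstained):
    """Share of questions where the abstain/answer decision was correct.

    Assumed definition: answering an answerable question and abstaining on
    an unanswerable one both count as correct decisions.
    """
    correct_on_answerable = answerable_total - answerable_abstained
    correct = correct_on_answerable + unanswerable_abstained
    return correct / (answerable_total + unanswerable_total)


# Counts reported for RAG: 17 of 44 answerable questions wrongly abstained,
# and all 8 unanswerable control questions correctly abstained.
rag_score = abstention_accuracy(44, 17, 8, 8)
rag_wrong_abstain_rate = 17 / 44

print(round(rag_score, 4))               # 0.6731, i.e. 35 of 52 decisions correct
print(round(rag_wrong_abstain_rate, 3))  # 0.386
```

Under the same assumed definition, the reported 0.9038 and 0.8846 would correspond to 47 and 46 correct decisions out of 52 for LoRA and QLoRA. The text does not report how those counts split between answerable and unanswerable questions, which is one reason the aggregate score alone says little about calibration.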
5.6 Limitations

Several limitations should be acknowledged. First, the benchmark contains 52 questions rather than a much larger evaluation set. This was a deliberate quality-over-quantity decision, but it still limits the breadth of the empirical claims. Second, both fine-tuned systems used a 1.5B base model rather than larger architectures, so the results should not be generalized uncritically to larger-scale fine-tuning. Third, the corpus excludes advising emails and other private communication data, which may contain forms of institutional knowledge not fully represented in public pages and policy documents. Finally, the work focuses on one institutional domain, so broader claims about all domain-specific question-answering tasks would require replication on additional corpora.

The interactive demo also involved its own engineering constraints, but those practical deployment issues do not change the validity of the locked benchmark results reported here. The manual review was conducted by the author alone; inter-rater reliability was not measured, which limits the strength of the qualitative findings.

6. Future Work

Future work could extend this study in several directions. First, the benchmark could be expanded with a larger and more diverse set of manually verified institutional questions, including additional unanswerable controls. Second, the manual review process could be strengthened by adding multiple reviewers and measuring inter-rater reliability. Third, future experiments could evaluate larger base models and hybrid systems that combine retrieval with lightweight fine-tuning for style or response behavior. Finally, replication on additional institutional domains would help determine how well the present findings generalize beyond Weber State University CS materials.

7. Conclusion

This paper investigated a practical research question: for a small, domain-specific university chatbot built from custom institutional data under modest hardware constraints, what are the trade-offs between retrieval-based grounding and parameter-efficient fine-tuning, and which method is most dependable? Based on the completed experiments, the answer is clear.

Dense RAG was the strongest method overall. It produced the highest Semantic Similarity, the highest Factual Term Match, and the strongest manual-review performance. Its advantage was especially clear on policy and process questions where institutional wording precision mattered most. These findings support the argument that retrieval remains the safest and most dependable strategy for grounded question answering on small institutional corpora.

LoRA also produced an important positive result. With QA-aligned supervision and a carefully chosen compact base model, it delivered respectable performance under tight hardware constraints and outperformed QLoRA on most metrics. This suggests that parameter-efficient fine-tuning remains useful when latency, deployment simplicity, or lack of retrieval infrastructure are significant constraints. QLoRA provided the most hardware-efficient training path but showed the weakest answer fidelity in this particular setting.

Taken together, the results point to a practical decision framework. For university advising and policy-oriented systems where correctness and grounding are critical, RAG should be the default choice. LoRA can serve as a lightweight adaptation method when retrieval is impractical, but it should not be assumed to replace grounded retrieval. QLoRA remains valuable as a resource-efficient baseline, though its quality limitations relative to full-precision fine-tuning must be acknowledged.
More broadly, this project demonstrates that a rigorous, reproducible, and defensible institutional chatbot research pipeline can be constructed using custom data, open-source models, and consumer-grade hardware. That contribution extends beyond the question of which method performed best: it establishes that this kind of principled comparison can be carried out responsibly in a real academic setting with limited resources.

References

[1] P. Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” arXiv:2005.11401, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401
[2] Y. Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” arXiv:2312.10997, 2023. [Online]. Available: https://arxiv.org/abs/2312.10997
[3] E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” arXiv:2106.09685, 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
[4] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv:2305.14314, 2023. [Online]. Available: https://arxiv.org/abs/2305.14314
[5] Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang, “Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey,” Transactions on Machine Learning Research, 2024. [Online]. Available: https://arxiv.org/abs/2403.14608
[6] L. Xu, H. Xie, S.-Z. J. Qin, X. Tao, and F. L. Wang, “Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment,” arXiv:2312.12148, 2023. [Online]. Available: https://arxiv.org/abs/2312.12148
[7] Y. Mao, Y. Ge, Y. Fan, W. Xu, Y. Mi, Z. Hu, and Y. Gao, “A Survey on LoRA of Large Language Models,” arXiv:2407.11046, 2024. [Online]. Available: https://arxiv.org/abs/2407.11046
[8] S. S. Alahmari, L. O. Hall, P. R. Mouton, and D. B. Goldgof, “Repeatability of Fine-Tuning Large Language Models Illustrated Using QLoRA,” IEEE Access, vol. 12, pp. 153221-153231, 2024, doi: 10.1109/ACCESS.2024.3470850.
[9] J. Wei et al., “Finetuned Language Models Are Zero-Shot Learners,” arXiv:2109.01652, 2021. [Online]. Available: https://arxiv.org/abs/2109.01652
[10] Y. Chang et al., “A Survey on Evaluation of Large Language Models,” arXiv:2307.03109, 2023. [Online]. Available: https://arxiv.org/abs/2307.03109
[11] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated Evaluation of Retrieval Augmented Generation,” arXiv:2309.15217, 2023. [Online]. Available: https://arxiv.org/abs/2309.15217
[12] L. Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” arXiv:2306.05685, 2023. [Online]. Available: https://arxiv.org/abs/2306.05685
[13] S. Kadavath et al., “Language Models (Mostly) Know What They Know,” arXiv:2207.05221, 2022. [Online]. Available: https://arxiv.org/abs/2207.05221
[14] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019, pp. 3982-3992.
[15] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “MTEB: Massive Text Embedding Benchmark,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2014-2037.
Format	application/pdf
ARK	ark:/87278/s6mwj36p
Setname	wsu_smt
ID	169753
Reference URL	https://digital.weber.edu/ark:/87278/s6mwj36p