| Title | VoskanyanMichael_MCS_2026 |
| Alternative Title | Vulnerability-Preserving Program Reduction for Common Vulnerabilities and Exposures using LLM-Based Reduction |
| Creator | Voskanyan, Michael |
| Contributors | Christi, Arpit (advisor); Al-Gahmi, Abdulmalek (advisor); Valle, Hugo (advisor) |
| Collection Name | Master of Computer Science |
| Abstract | Program debugging is challenging, time-consuming, and tedious because developers often spend substantial effort simplifying and isolating the failure-inducing portion of a program. To support this process, many program and test reduction techniques have been proposed to automatically minimize failure-inducing inputs while preserving the observed fault. In the security setting, however, reduction must preserve vulnerability-relevant behavior rather than merely produce a smaller artifact. Recent work has explored the use of large language models (LLMs) to assist program reduction, but such techniques have not been widely evaluated for vulnerability-preserving reduction. This thesis evaluates LVulnReducer (LVR), an HDD-first, LLM-assisted reduction pipeline for vulnerability-inducing programs, on 25 Common Vulnerabilities and Exposures (CVE) artifacts drawn from the San2Patch benchmark. Reduction quality is evaluated relative to C-Reduce using Clang AST-based statement comparison, while reduction effectiveness is measured using source lines removed and percentage reduction. The results show that LVulnReducer substantially reduces vulnerable programs while preserving benchmark-specific exploit evidence. On average, LVR removes 3,321.32 source lines per artifact, corresponding to an average reduction of 94.62%. In comparison to the C-Reduce reference reductions, LVR achieves an average precision of 99.66% and an average recall of 97.16%, indicating that its reductions remain structurally close to those produced by C-Reduce. Although C-Reduce still produces smaller final artifacts overall, the results show that LLM-assisted reduction can meaningfully improve on lightweight hierarchical mechanical reduction and approximate the behavior of a mature reduction baseline in the vulnerability-preserving setting. |
| Subject | Computer science; Large language models; Computer programming-Reduction; Natural language processing (Computer science) |
| Digital Publisher | Digitized by Special Collections & University Archives, Stewart Library, Weber State University. |
| Date | 2026-04 |
| Medium | theses |
| Type | Text |
| Access Extent | 37-page PDF |
| Conversion Specifications | Adobe Acrobat |
| Language | eng |
| Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce his or her thesis, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. For further information: |
| Source | University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University |
| OCR Text | VULNERABILITY-PRESERVING PROGRAM REDUCTION FOR COMMON VULNERABILITIES AND EXPOSURES USING LLM-BASED REDUCTION By Michael Voskanyan A thesis Submitted to the faculty of the MSCS Graduate Program of Weber State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science Graduation April 24, 2026 Ogden, Utah Approved: Date: 04/24/2026 Committee Chair, Arpit Christi, Ph.D. 04/24/2026 Committee Member, Abdulmalek Al-Gahmi, Ph.D. 04/26/2026 Committee Member, Hugo Valle, Ph.D. ACKNOWLEDGMENTS I would like to thank my committee chair, Dr. Arpit Christi, and my committee members, Dr. Abdulmalek Al-Gahmi and Dr. Hugo Valle, for their guidance and support throughout the course of this research. I would also like to thank my friends, colleagues, and especially my family for their support and encouragement throughout my studies. ABSTRACT Program debugging is challenging, time-consuming, and tedious because developers often spend substantial effort simplifying and isolating the failure-inducing portion of a program. To support this process, many program and test reduction techniques have been proposed to automatically minimize failure-inducing inputs while preserving the observed fault. In the security setting, however, reduction must preserve vulnerability-relevant behavior rather than merely produce a smaller artifact. Recent work has explored the use of large language models (LLMs) to assist program reduction, but such techniques have not been widely evaluated for vulnerability-preserving reduction. This thesis evaluates LVulnReducer (LVR), an HDD-first, LLM-assisted reduction pipeline for vulnerability-inducing programs, on 25 Common Vulnerabilities and Exposures (CVE) artifacts drawn from the San2Patch benchmark. Reduction quality is evaluated relative to C-Reduce using Clang AST-based statement comparison, while reduction effectiveness is measured using source lines removed and percentage reduction.
The results show that LVulnReducer substantially reduces vulnerable programs while preserving benchmark-specific exploit evidence. On average, LVR removes 3,321.32 source lines per artifact, corresponding to an average reduction of 94.62%. In comparison to the C-Reduce reference reductions, LVR achieves an average precision of 99.66% and an average recall of 97.16%, indicating that its reductions remain structurally close to those produced by C-Reduce. Although C-Reduce still produces smaller final artifacts overall, the results show that LLM-assisted reduction can meaningfully improve on lightweight hierarchical mechanical reduction and approximate the behavior of a mature reduction baseline in the vulnerability-preserving setting.
TABLE OF CONTENTS
ACKNOWLEDGMENTS ii
ABSTRACT iii
LIST OF TABLES vi
LIST OF FIGURES vii
1 Introduction 1
2 Related Work 3
3 Approach 5
3.1 Motivation 5
3.2 Workflow 6
3.3 AST-Based Comparison Methodology 8
4 LVulnReducer: Usage, Architecture and Implementation 10
4.1 Availability 10
4.2 How to use LVR 10
4.3 Architecture 11
5 Experiments 14
5.1 Research Questions 14
5.2 Subjects 14
5.3 Procedure 16
5.4 Measurements 16
6 Evaluation 18
6.1 Applicability 18
6.2 Accuracy 18
6.3 Effectiveness - Reduction Size 19
6.4 Interpretation of Results 21
7 Threats to Validity 23
7.1 Construct Validity 23
7.2 Internal Validity 24
7.3 External Validity 24
7.4 Reliability 25
8 Conclusion and Future Work 27
8.1 Conclusion 27
8.2 Future Work 28
LIST OF TABLES
5.1 Subject details: CVE ID, project, vulnerable file and the source size (lines of code). 15
6.1 Precision and Recall: CVE ID, Precision, Recall 20
6.2 Reduction Size: CVE ID, LVR reduction, LVR percentage reduction, HDD reduction, HDD percentage reduction, C-Reduce reduction, and C-Reduce percentage reduction 20
LIST OF FIGURES
3.1 Architecture and reduction workflow of LVR. The pipeline begins with benchmark setup and HDD-based mechanical reduction, followed by bounded LLM-guided semantic candidate generation. Each candidate is validated by a benchmark-specific interestingness oracle. Accepted candidates are fed back into the mechanical reduction stage to expose additional deletion opportunities, while rejected candidates are rolled back and replaced with the next candidate. 7
3.2 AST-based comparison methodology used for evaluating reduction accuracy. The original, C-Reduce, and LVR artifacts are converted into normalized statement sets using Clang ASTs, after which removal sets are compared to compute TP, FP, FN, precision, and recall. 9
CHAPTER 1 Introduction Debugging is one of the most time-consuming and challenging aspects of software development. A significant amount of developer time is spent locating, understanding, and isolating a given fault. Additionally, in many cases, the failure-inducing behavior is embedded within a much larger program context that includes parsing logic, configuration handling, and other functionality unrelated to the fault itself. Program reduction can shorten debugging time by removing code that is unrelated to the fault, sparing developers the effort of analyzing irrelevant code.
Beyond debugging, program or test reduction can also improve automatic fault localization [1, 2]. Researchers have proposed and evaluated many techniques, algorithms, and tools for program and test reduction [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Recently, researchers have proposed and evaluated LLM-based program reduction techniques. These approaches combine the program-transformation capabilities of large language models with traditional reduction algorithms to produce smaller programs while preserving a fault or other properties of interest [18]. The motivation behind this line of work is that traditional reducers are highly effective at systematic deletion, whereas LLMs may contribute higher-level semantic transformations that expose additional opportunities for simplification. Beyond debugging-oriented reduction, LLM-based reducers have also been studied in the context of program debloating, where the objective is to remove unnecessary code while preserving desired program behavior [19]. In the security setting, this reduction problem becomes even more challenging. For most programs, vulnerability-specific code is interwoven with substantial amounts of code that are not directly related to the vulnerability itself. When debugging such programs, the developer must identify the vulnerability-relevant computation and isolate it from the rest of the code base. In effect, this process is similar to finding a vulnerability-inducing program slice. Performing this isolation manually can be tedious, time-consuming, and error-prone, particularly in large codebases where the vulnerable path is embedded within ordinary application logic. This difficulty motivates the use of program reduction in the security setting. In general, program or test reduction seeks to minimize an artifact while preserving a property of interest, which is often the observed fault.
In the case of vulnerability-focused reduction, the goal is to reduce the program while preserving the vulnerability-relevant behavior needed to reproduce the exploit evidence. To make this preservation objective verifiable, the reducer relies on an oracle that provides objective, automatic, and verifiable acceptance criteria. In practice, this oracle can be combined with the normal build-and-run procedure of the program so that each candidate reduction is accepted only if the required vulnerability evidence is still present. In this work, we combine LLM-assisted reduction with traditional DD/HDD-style reduction, following the general intuition of LPR (Large Language Model Aided Program Reduction) that semantic transformations and LLM-guided reasoning can complement mechanical reduction [18]. Although the present work does not adopt the full LPR pipeline end-to-end, it draws from the same core idea: progressively reducing a program while preserving the target behavior through an external acceptance criterion. Based on this idea, we propose LVulnReducer (LVR), an oracle-guided reduction pipeline designed to systematically and progressively reduce vulnerability-inducing programs. To evaluate the accuracy and effectiveness of our approach, we study vulnerability artifacts drawn from San2Patch [20] and VulnLoc [21]. For each selected benchmark instance, we apply both C-Reduce and LVR and compare the resulting reductions. In particular, we measure how closely LVR approximates the reductions produced by C-Reduce, and we also quantify reduction effectiveness in terms of the number and percentage of statements removed from each artifact. The contributions of this work are as follows. 1. We adapt existing LLM-assisted program reduction techniques to the setting of vulnerability-preserving reduction. 2. We evaluate an LLM-assisted vulnerability reducer on benchmark common vulnerability and exposure artifacts using reduction-size and statement-level comparison metrics.
The remainder of this thesis is organized as follows. Chapter 2 reviews prior work on program reduction and LLM-assisted reduction techniques. Chapter 3 describes the overall approach and the major components of LVR. Section 3.1 presents the motivation for vulnerability-preserving reduction. Chapter 4 introduces LVR, including its usage, architecture, and implementation. Chapter 5 presents the benchmark artifacts, research questions, and experimental design. Chapter 6 reports and analyzes the results. Chapter 7 discusses threats to validity. Finally, Chapter 8 concludes the thesis and outlines potential directions for future work. CHAPTER 2 Related Work Delta Debugging (DD) is one of the foundational approaches to reducing failure-inducing inputs while preserving the observed failure [17]. DD applies a binary-search-like strategy to systematically identify smaller failure-inducing inputs, treating the input as a list or sequence of components and repeatedly testing whether subsets of those components are still sufficient to preserve the target behavior. Hierarchical Delta Debugging (HDD) extends this idea to structured artifacts such as programs, XML, and HTML by exploiting their hierarchical or tree-like organization during reduction [16]. These techniques established the basic idea of iterative, oracle-guided reduction that still underlies many later reducers. Building on these ideas, Regehr et al. proposed C-Reduce, a widely used reducer for C programs [3]. C-Reduce generalizes delta-debugging-style reduction through a collection of source transformations guided by a user-provided "interestingness" script. The reducer repeatedly applies candidate transformations and retains only those that preserve the property checked by the script. When the interestingness script is designed to preserve a fault and its observable signatures, the resulting reduced program continues to exhibit the same failure behavior.
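The ddmin strategy underlying DD can be illustrated with a short sketch. This is a simplified reduce-to-complement variant for exposition only, not the implementation used by any of the reducers discussed in this thesis; the function names are illustrative.

```python
def ddmin(items, is_failing):
    """Shrink `items` while the oracle `is_failing` still holds.

    `items` is a list of input components; `is_failing` returns True
    when a candidate input still reproduces the target failure.
    """
    n = 2  # current granularity: number of chunks to split into
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try removing one chunk: test the complement of subset i.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if is_failing(complement):
                items = complement          # accept the smaller input
                n = max(n - 1, 2)           # coarsen the granularity
                reduced = True
                break
        if not reduced:
            if n >= len(items):
                break                       # already at finest granularity
            n = min(len(items), n * 2)      # refine the granularity
    return items
```

For example, if the failure requires both 'a' and 'z' to be present, `ddmin(list("qwaertzui"), lambda xs: 'a' in xs and 'z' in xs)` shrinks the nine-character input down to just those two components.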
Because of its strong reduction performance and broad use as a baseline in reduction research, this study treats C-Reduce as the reference reducer when evaluating the quality of LVR reductions [3, 7, 14, 18]. A large number of tools and techniques have since been proposed to simplify failure-inducing inputs, including C-Reduce, Picireny, Generalized Tree Reduction (GTR), ORBS, Reduktor, Perses, DDSET, ProbDD, ReduSharptor, ReduJavator, Vulcan, and T-Rec [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. These reducers can be broadly categorized as language-agnostic or language-specific. Language-agnostic reducers aim to operate across a wider range of input formats, often relying on structural or grammar-based properties, whereas language-specific reducers exploit properties of a particular language to achieve more aggressive simplification. Program reducers can also be categorized by purpose. Most reduction tools are designed for debugging and failure preservation, where the goal is to minimize a program or test case while maintaining the observed fault. However, program reduction has also been studied in other settings. For example, reduction and related transformations have been used for program debloating, where the goal is to remove unnecessary code, reduce attack surface, or improve generality [22]. Reduction has also been explored for building resource-adaptive software by removing low-priority but resource-consuming functionality [23]. These lines of work show that program reduction is not limited to debugging alone, but is a more general technique for simplifying software artifacts under specified preservation objectives. Large Language Models (LLMs) have recently been evaluated across a wide range of software engineering tasks, including code generation, summarization, program repair, vulnerability-related tasks, and other forms of software automation [24, 25].
Within the reduction setting, LLMs are especially appealing because they may contribute higher-level semantic transformations that complement the purely mechanical deletion strategies used by traditional reducers. The work most closely related to this thesis is LLM-aided Program Reduction (LPR), proposed by Zhang et al. [18]. LPR combines traditional reduction with LLM-guided semantic reasoning, using the model to help identify transformations that may expose additional opportunities for simplification. In their evaluation, LPR outperformed several language-agnostic reducers and achieved results comparable to strong language-specific reducers such as C-Reduce. This thesis follows the same general intuition that LLM-guided semantic transformations may complement mechanical reduction, but applies that idea in a different setting. Whereas LPR was evaluated on a different benchmark suite, the present work focuses on vulnerability-inducing artifacts drawn from security-oriented benchmarks such as San2Patch and VulnLoc, and evaluates reduction quality relative to C-Reduce using AST-based statement comparison. Related work also includes LLM-based reduction for other purposes, such as the LEADER framework proposed by Lin et al. for program debloating [19]. Together, these studies suggest that LLMs can play a useful supporting role in program simplification. The contribution of the present work is to study that role specifically in the context of vulnerability-inducing programs, where the reduction must preserve exploit-relevant behavior under a benchmark-specific oracle. CHAPTER 3 Approach Informed by the related work, this study evaluates whether LLM-assisted source reduction can improve vulnerability-focused reduction beyond what is achieved using a lightweight hierarchical mechanical reducer alone.
The goal is not to prove full semantic equivalence, but rather to determine whether LVR can produce smaller vulnerability-inducing artifacts that more closely approximate the reductions performed by C-Reduce. 3.1 Motivation Vulnerability-inducing programs are often substantially larger than the code actually needed to reproduce the observed failure. In real software, the behavior that triggers a sanitizer report or crash is typically embedded within large amounts of surrounding logic related to input parsing, configuration, I/O, error handling, and general application functionality. As a result, direct analysis of the original vulnerable artifact can be difficult, since much of the source code is unrelated to the vulnerability of interest. The vulnerable behavior is often obscured by implementation detail, making it harder to determine which computations are essential to the vulnerability and which are merely incidental to the surrounding program context. This complicates manual debugging and slows efforts to isolate the failure-relevant portion of the program. Program reduction addresses this problem by removing code that is unnecessary for reproducing the target behavior. In the security setting, this is especially valuable because a reduced artifact can expose the portion of the source more directly associated with the observed failure, making the program easier to inspect, compare, and analyze. By preserving the exploit evidence while eliminating unrelated program context, reduction can provide a more focused view of the computation that contributes to the vulnerability. The motivation for this work, therefore, is to evaluate whether LLM-assisted reduction can serve as a practical complement to mechanical reduction for vulnerability-inducing programs. Traditional reducers such as C-Reduce are highly effective at minimizing failure-inducing inputs through systematic, oracle-guided source transformations.
Recent LLM-based program reduction techniques suggest that language models may provide an additional source of semantic guidance that can complement mechanical reduction. Motivated by this possibility, the present study evaluates whether an HDD-first, LLM-assisted reducer can produce smaller vulnerability-inducing artifacts than lightweight hierarchical mechanical reduction alone, and how closely those reductions approximate the results produced by C-Reduce. The following section describes the workflow used to evaluate this approach and to compare its reductions against both a lightweight hierarchical baseline and the C-Reduce reference reducer. 3.2 Workflow To evaluate this question in a systematic way, the workflow of this study is organized around four source artifacts for each benchmark instance: the original vulnerable file, the final reduction produced by C-Reduce, the final reduction produced by LVR, and the final reduction produced by the HDD baseline. For reduction-size analysis, all three reducers are compared directly and evaluated by lines reduced. For AST-based accuracy analysis, LVR is compared against the C-Reduce reference reduction, while HDD is used as a mechanical baseline to measure how much additional reduction is obtained from LLM-assisted semantic edits. For each selected CVE, reduction begins from the original vulnerable source file and is constrained by a benchmark-specific interestingness oracle that rebuilds the program, reruns the proof-of-concept input, and checks whether the required exploit evidence is preserved. This ensures that the reducer is evaluated under the same intended preservation objective for a given benchmark instance. LVR operates in two stages. First, it applies hierarchical mechanical reduction to remove structurally deletable code. Once that phase reaches a fixed point, it applies bounded LLM-guided semantic edits to the remaining artifact.
Each candidate edit is validated by the same benchmark-specific oracle so that semantic changes are accepted only if the required exploit evidence is still preserved. C-Reduce is run on the same original vulnerable source file using a reducer-specific interestingness harness that enforces the same intended exploit-evidence criterion. In addition, a lightweight hierarchical mechanical baseline is used to measure how much improvement is obtained from adding bounded semantic edits beyond mechanical reduction alone. The resulting reduced artifacts are then compared quantitatively. Reduction size is measured using lines of code and statement counts, while reduction quality is measured relative to C-Reduce. In this study, C-Reduce is treated as the reference reducer, and LVR is evaluated in terms of how closely its final removals match those produced by C-Reduce. This comparison makes it possible to evaluate not only how much reduction occurs, but how closely LVR approximates the reductions produced by a strong and widely used baseline. The overall architecture and reduction workflow of LVR are illustrated in Figure 3.1, which shows how the benchmark setup, mechanical reduction stage, semantic candidate generation stage, and oracle-driven accept/reject loop interact during reduction. Figure 3.1: Architecture and reduction workflow of LVR. The pipeline begins with benchmark setup and HDD-based mechanical reduction, followed by bounded LLM-guided semantic candidate generation. Each candidate is validated by a benchmark-specific interestingness oracle. Accepted candidates are fed back into the mechanical reduction stage to expose additional deletion opportunities, while rejected candidates are rolled back and replaced with the next candidate.
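The lines-reduced measurement used for reduction-size analysis amounts to simple line accounting per artifact. A minimal sketch of that arithmetic (the function name is illustrative, not part of LVR):

```python
def reduction_size(original_lines, reduced_lines):
    """Return lines removed and percentage reduction for one artifact."""
    removed = original_lines - reduced_lines
    percent = 100.0 * removed / original_lines
    return removed, percent

# A hypothetical artifact of 3,500 lines reduced to 190 lines:
removed, percent = reduction_size(3500, 190)
# removed == 3310 lines, percent ≈ 94.57%
```

Averaging these two quantities over all 25 CVE artifacts yields the per-artifact averages reported in the abstract.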
3.3 AST-Based Comparison Methodology To make this comparison more structured and less sensitive to purely textual variation, the original file, the C-Reduce output, and the LVR output are each converted into statement-level representations using Clang-generated Abstract Syntax Trees (ASTs) [26]. Rather than comparing raw source lines, which can be affected by formatting, blank lines, braces, and other non-semantic text, the evaluation is performed over normalized AST-derived statements extracted from each program. This makes the comparison better aligned with the structural transformations performed by the reducers. Let S_orig, S_creduce, and S_lvr denote the sets of normalized statements extracted from the original program, the C-Reduce reduction, and the LVR reduction, respectively. The removed-statement sets are then defined relative to the original program as:
• R_creduce = S_orig \ S_creduce
• R_lvr = S_orig \ S_lvr
These sets allow LVR to be compared against C-Reduce using standard set-based metrics. True positives are statements removed by both reducers, false negatives are statements removed by C-Reduce but not by LVR, and false positives are statements removed by LVR but not by C-Reduce. Precision therefore measures how many of the statements removed by LVR were also removed by C-Reduce, whereas recall measures how completely LVR matches the statements removed by C-Reduce. Under this formulation, precision and recall do not represent semantic correctness in an absolute sense. Rather, they quantify structural agreement between LVR and the C-Reduce reference reduction. This provides a practical and reproducible way to evaluate reduction quality at scale across benchmark CVEs while avoiding the noise introduced by direct text-based comparison.
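The set-based metrics defined above map directly onto set operations. A small sketch, assuming each input is a Python set of normalized statement strings (the function name is illustrative):

```python
def removal_agreement(s_orig, s_creduce, s_lvr):
    """Compare LVR's removals against the C-Reduce reference removals."""
    r_creduce = s_orig - s_creduce   # statements removed by C-Reduce
    r_lvr = s_orig - s_lvr           # statements removed by LVR
    tp = len(r_creduce & r_lvr)      # removed by both reducers
    fp = len(r_lvr - r_creduce)      # removed only by LVR
    fn = len(r_creduce - r_lvr)      # removed only by C-Reduce
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

For instance, if C-Reduce removes six statements and LVR removes five of those same six and nothing else, precision is 1.0 and recall is 5/6.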
The overall AST-based comparison process is illustrated in Figure 3.2, which shows how the original vulnerable file, the C-Reduce reduction, and the LVR reduction are transformed into normalized statement sets and compared using set-based accuracy metrics. Figure 3.2: AST-based comparison methodology used for evaluating reduction accuracy. The original, C-Reduce, and LVR artifacts are converted into normalized statement sets using Clang ASTs, after which removal sets are compared to compute TP, FP, FN, precision, and recall. CHAPTER 4 LVulnReducer: Usage, Architecture and Implementation LVR is an oracle-guided source reduction tool for vulnerability-inducing C artifacts. Its purpose is to produce a substantially smaller source-level artifact that still satisfies externally specified exploit-evidence criteria under the original build-and-run harness. LVR is not intended to preserve full program correctness or semantic equivalence in the conventional compiler sense. Instead, it preserves whatever failure evidence is required by the benchmark-specific interestingness oracle. In the current prototype, LVR is implemented as an HDD-first reduction pipeline that combines hierarchical mechanical deletion with bounded LLM-guided simplification. The mechanical phase serves as the primary reduction engine, while the LLM is used to expose additional reduction opportunities after purely structural deletion stalls. 4.1 Availability LVulnReducer is publicly available at https://github.com/MichaelVoskanyan/LVR. The current implementation targets C programs from the San2Patch and VulnLoc benchmarks and is designed to operate within the benchmark working directory using the existing build system and proof-of-concept inputs. 4.2 How to use LVR LVulnReducer is currently used as a benchmark-driven research prototype rather than as a general end-user reduction tool.
A typical LVR run is performed inside a specific CVE benchmark directory after sourcing an lvr_setup.sh script that configures the Python environment, the target file, and the LLM endpoint. Once the environment is configured, the reducer is launched with python -m lvr.main. The key requirement for using LVR is not a separate test file, but a benchmark-specific interestingness.sh script. This script acts as the acceptance oracle for the reducer. It is responsible for rebuilding the project, rerunning the proof-of-concept input, and determining whether the candidate reduction still preserves the required exploit evidence. In practice, the script may either use the project's default build system or override it with a faster, benchmark-specific build procedure. For this reason, build and run behavior are treated as part of the oracle rather than as logic hard-coded into LVR itself. The lvr_setup.sh script typically defines the vulnerable source file through the LVR_TARGET_FILE environment variable, and configures model access through the environment variables LVR_USE_API, LVR_BASE_URL, and LVR_MODEL. Additional environment variables control semantic-phase behavior, including how often LLM-based edits are attempted, how much surrounding source context is shown to the model, and the allowed size range of semantic regions. After these parameters are set, python -m lvr.main executes the reduction pipeline against the target file in the current benchmark directory. In the current implementation, the LLM endpoint is typically served using llama.cpp. Before launching python -m lvr.main, the selected model is started through an OpenAI-compatible local server so that LVR can issue completion requests through the configured LVR_BASE_URL. In our experiments, this was done by launching llama-server with the selected Qwen2.5-Coder model, GPU offloading enabled, a fixed context window, and a local port bound to the endpoint used by the reducer.
This local serving setup allows LVR to use an external LLM during semantic reduction without embedding model execution directly into the reducer itself. During execution, LVR first applies its mechanical reduction phase to the file identified by LVR_TARGET_FILE. When that phase reaches exhaustion, LVR extracts bounded semantic regions from the current reduced artifact and queries the LLM for localized candidate edits. Each candidate is then written in place to the target file and validated by invoking the same interestingness.sh script. If the oracle returns success, the candidate is accepted and reduction continues from the smaller artifact. Otherwise, the file is restored and the next candidate is tried. The tool writes benchmark-local output during execution. Build logs, run logs, and candidate-level diagnostics are written under logs/, while accepted reduced artifacts are stored under reduced/. As a result, each run produces both a final reduced source file and an execution trail that can be used to inspect how the reduction progressed. In its current form, LVR is used on a per-benchmark basis: the user enters a CVE directory, sources the benchmark's lvr_setup.sh configuration, ensures that a suitable interestingness.sh oracle is present, and then launches the reducer with python -m lvr.main. 4.3 Architecture LVR is organized around four main components: a reduction driver, a mechanical reducer, an LLM-assisted semantic reducer, and an external interestingness oracle. These components operate over a single target source file inside the benchmark working directory and interact through an iterative accept-or-reject workflow. The reduction driver, implemented in main.py, coordinates the overall reduction process. It initializes logging, loads the target file specified by LVR_TARGET_FILE, invokes the mechanical and semantic reduction phases, and manages acceptance, rollback, and artifact persistence.
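The write-validate-rollback cycle described above can be sketched in a few lines. This is an illustrative sketch, not LVR's actual code; the helper name try_candidate is hypothetical, while interestingness.sh follows the benchmark convention described in the text.

```python
import shutil
import subprocess

def try_candidate(target_file, candidate_text):
    """Write a candidate reduction in place, validate it with the
    benchmark oracle, and roll back the file if the oracle rejects it.
    """
    backup = target_file + ".bak"
    shutil.copyfile(target_file, backup)       # keep a rollback copy
    with open(target_file, "w") as f:
        f.write(candidate_text)
    # The oracle rebuilds the project, reruns the proof-of-concept
    # input, and exits 0 only if the exploit evidence is still present.
    ok = subprocess.run(["./interestingness.sh"]).returncode == 0
    if not ok:
        shutil.copyfile(backup, target_file)   # reject: restore the file
    return ok
```

Keeping build-and-run behavior inside the oracle script, as LVR does, means this loop never needs to know how any particular benchmark is compiled or exercised.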
The driver is responsible for controlling the reduction rounds and for ensuring that each candidate modification is validated before it becomes part of the current working artifact. The mechanical reducer serves as the first stage of each reduction round and is the authoritative reduction engine in the current prototype. Although its implementation class retains the historical name DDMinReducer, the current reducer is not a flat ddmin implementation. Instead, it performs lightweight hierarchy-guided reduction over the source file: it constructs a lexical hierarchy from the current source artifact and attempts structured deletions over bounded regions such as statements and blocks. This phase is deterministic and is intended to remove as much irrelevant code as possible before any LLM interaction occurs. Once the mechanical phase reaches exhaustion, LVR enters its semantic reduction phase. In this stage, the reducer extracts bounded semantic regions from the current reduced artifact and submits them to the LLM as localized reduction opportunities. Rather than rewriting the entire file, the model is asked to produce a small number of bounded candidate edits for a selected region. This design deliberately constrains the semantic phase so that the LLM acts as a proposal generator rather than an unrestricted code editor. The external interestingness oracle provides the acceptance criterion for both reduction phases. The oracle is implemented as a benchmark-specific interestingness.sh script and is responsible for rebuilding the project, rerunning the proof-of-concept input, and checking whether the reduced artifact still preserves the required exploit evidence. LVR therefore separates search from validation: the reducer proposes candidate simplifications, but the oracle is the only component that determines whether a candidate is accepted. These components interact in a round-based pipeline.
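The accept-or-reject workflow at the heart of this pipeline can be sketched in a few lines. This is a minimal illustration, not LVR's driver code: the function name, the in-memory backup strategy, and the callable oracle are assumptions, and real candidates would come from the mechanical reducer or from bounded LLM edits.

```python
from pathlib import Path


def try_candidates(target: Path, candidates, oracle) -> str:
    """Write each candidate in place; keep it if the oracle accepts it,
    restore the previous version otherwise. Returns the final content.

    `candidates` is any iterable of proposed file contents; `oracle` is a
    callable returning True when the exploit evidence is still preserved.
    """
    for candidate in candidates:
        backup = target.read_text()       # snapshot the current artifact
        target.write_text(candidate)      # apply the candidate in place
        if oracle(target):
            continue                      # accepted: candidate is the new current artifact
        target.write_text(backup)         # rejected: roll back to the snapshot
    return target.read_text()
```

Because validation always runs against the file on disk, both mechanical deletions and LLM-proposed edits flow through exactly the same acceptance path.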
Each round begins with the mechanical reducer operating on the current target file. After the mechanical phase stalls, the semantic reducer selects bounded regions from the resulting artifact and queries the LLM for localized candidate edits. Each candidate is written in place to the target file and validated by invoking the oracle. If the oracle returns success, the candidate is accepted and becomes the new current artifact. If the oracle rejects the candidate, the previous version of the file is restored and the next candidate is attempted. This process continues until no additional acceptable reductions can be found. LVR also includes a supporting artifact-management layer. Build logs, run logs, and candidate-level diagnostics are written under logs/, while accepted reduced artifacts are preserved under reduced/. Candidate backups and diffs are stored so that rejected edits can be rolled back safely and accepted edits can be inspected after the run completes. This logging and persistence layer is important in practice because reduction failures are common, and understanding why a candidate was rejected is often necessary for debugging both the reducer and the oracle. In its current form, the overall architecture is therefore intentionally conservative. Mechanical hierarchical deletion performs most of the shrinking, the semantic phase is bounded and opportunistic, and the interestingness oracle remains the final authority on preservation. This architecture makes the system practical for benchmark-driven evaluation while limiting the instability that would arise from unrestricted whole-file LLM rewriting.

CHAPTER 5 Experiments

This section describes the experimental design used to evaluate LVR. It presents the research questions, the benchmark subjects and their selection, the procedure used to generate reduction results, and the measurements collected for subsequent analysis. The corresponding experimental results are reported and discussed in Chapter 6.
5.1 Research Questions

This study evaluates LLM-assisted program reduction on vulnerability-inducing artifacts drawn from the San2Patch benchmark. In particular, it examines both the effectiveness of reduction, measured by the number and percentage of statements removed, and the quality of the resulting reductions relative to C-Reduce. Accordingly, the study is organized around the following research questions:

1. RQ1: How accurate are LLM-based reductions when performing vulnerability-preserving reductions?
2. RQ2: How much reduction is achieved when performing vulnerability-preserving reductions?

5.2 Subjects

Most prior reduction research has focused on failure-inducing reductions and has relied on benchmark suites designed for reproducible fault-triggering inputs. Because the goal of this work is to evaluate vulnerability-preserving reduction, the study uses the San2Patch benchmark, which contains Common Vulnerabilities and Exposures (CVEs) for C programs. For each CVE, the benchmark provides the vulnerable program together with the exploit input or command sequence needed to reproduce the observed behavior. These benchmark subjects are appropriate for this study because they provide concrete, repeatable reproduction workflows and a diverse set of real vulnerability instances for reduction. The evaluated CVEs were randomly selected from the benchmark and then screened against the following inclusion criteria.

1. Reproducibility and automation: Each selected CVE had to support an end-to-end, repeatable reduction workflow. In particular, the system needed to run successfully on the benchmark instance using a documented reproduction procedure and produce (a) a reduced artifact, (b) logs of validation attempts, and (c) a final preservation decision under the defined oracle.

2.
Availability of a vulnerability-preservation oracle: Each selected CVE had to admit an oracle that could be defined using objective and automatable exploit evidence, such as preservation of the sanitizer or error class together with a crash, blame, or sink signature when available. The same intended oracle criterion also had to be applicable to both LVR and the C-Reduce baseline so that the resulting reductions could be compared fairly.

The details of the subjects are given in Table 5.1. The selected subjects span multiple projects and vulnerable source files, providing diversity in both code base and bug location.

Table 5.1: Subject details: CVE ID, project, vulnerable file, and the source size (lines of code).

CVE ID           Project     Vulnerable File                           Source Size (in lines)
CVE-2016-10094   libtiff     tiff2pdf.c                                5585
CVE-2016-10092   libtiff     tiffcrop.c                                9223
CVE-2016-10272   libtiff     tiffcrop.c                                9223
CVE-2016-9532    libtiff     tiffcrop.c                                9223
CVE-2017-5225    libtiff     tiffcp.c                                  1893
CVE-2017-7595    libtiff     tif_jpeg.c                                2429
CVE-2017-7599    libtiff     tif_dirwrite.c                            2930
CVE-2017-7600    libtiff     tif_dirwrite.c                            2930
CVE-2017-7601    libtiff     tif_jpeg.c                                2436
CVE-2012-2806    libjpeg     jdmarker.c                                1364
CVE-2017-15232   libjpeg     jquant1.c                                 857
CVE-2018-14498   libjpeg     rdbmp.c                                   488
CVE-2016-8691    jasper      jpc_cs.c                                  1662
CVE-2016-9557    jasper      jas_image.c                               1547
CVE-2016-5844    libarchive  archive_read_support_format_iso9660.c     3263
CVE-2012-5134    libxml2     parser.c                                  15441
CVE-2016-9264    libming     listmp3.c                                 194
CVE-2018-8806    libming     decompile.c                               3449
CVE-2018-8964    libming     decompile.c                               3449
CVE-2024-24146   libming     decompile.c                               3449
CVE-2024-24148   libming     outputscript.c                            2085
CVE-2017-5974    zziplib     memdisk.c                                 480
CVE-2017-5975    zziplib     memdisk.c                                 480
CVE-2017-5976    zziplib     memdisk.c                                 466
CVE-2013-7437    potrace     bitmap_io.c                               829
Average                                                                3415.00
Median                                                                 2429
Min                                                                    194
Max                                                                    15441

5.3 Procedure

For each benchmark instance, we generate three types of reductions: (1) reference reductions using C-Reduce, (2) LLM-assisted reductions using LVR, and (3) HDD-based
reductions using the mechanical baseline alone. Across all three reducers, the same intended preservation oracle and verification procedure are used so that each reduction is evaluated against the same vulnerability-preservation criterion.

C-Reduce based ground-truth reductions: The ideal way to establish a ground-truth reduction would be for an expert developer to manually inspect, debug, simplify, and isolate the vulnerability-preserving statements in the source code. In practice, however, the original source files and the resulting reductions are often too large for such manual ground-truth construction to be feasible. Accordingly, and in line with prior work that treats C-Reduce as a standard comparison baseline, this study uses the final C-Reduce reduction as the reference for evaluation [7, 14, 18].

LVR LLM-assisted reductions: LVR is used to generate reductions with LLM support during the reduction process. The approach follows a structured, HDD-first protocol in which mechanical reduction is performed first and bounded LLM-guided semantic edits are applied afterward to further simplify the artifact while preserving compilation and vulnerability reproduction. Each candidate reduction is validated using the benchmark harness and the same preservation oracle.

HDD-based reductions: A standalone HDD-based reducer is used to produce mechanical-only reductions for comparison purposes. These reductions are generated under the same preservation objective and are validated using the same benchmark-specific oracle as the other reducers. This baseline makes it possible to measure how much additional reduction is obtained from LLM-assisted semantic edits beyond hierarchical mechanical reduction alone.

5.4 Measurements

To answer the research questions about accuracy and reduction magnitude, we measure the following data for each CVE.

1. OSCS: Original Source Code Size. The size, in source lines, of the original benchmark file used in the reduction.
2. CRS: C-Reduce Size.
The program size in lines after C-Reduce reduction.
3. CRRS: C-Reduce Reduction Size. The number of lines removed by C-Reduce.
4. LVRS: LVR Size. The program size in lines after LVR reduction.
5. LVRRS: LVR Reduction Size. The number of lines removed by LVR.
6. HDDS: HDD Size. The program size in lines after HDD reduction.
7. HDDRS: HDD Reduction Size. The number of lines removed by HDD reduction.
8. false-positive: The number of false-positive statements, i.e., statements that were not removed by C-Reduce but were removed by LVR.
9. false-negative: The number of false-negative statements, i.e., statements that were removed by C-Reduce but were missed by LVR.
10. true-positive: The number of correctly removed statements, i.e., cases where both C-Reduce and LVR removed a statement.
11. CRPRS: C-Reduce Percentage Reduction Size. The percentage of lines removed by C-Reduce reduction.
12. LVPRS: LVR Percentage Reduction Size. The percentage of lines removed by LVR reduction.
13. HDDPRS: HDD Percentage Reduction Size. The percentage of lines removed by HDD reduction.

CHAPTER 6 Evaluation

This chapter presents the evaluation results for LVR. The analysis is organized around the research questions defined in Chapter 5. We first consider the applicability of LVR across the evaluated benchmark subjects, and then examine reduction accuracy and effectiveness relative to C-Reduce and the HDD baseline.

6.1 Applicability

We applied LVR to 25 CVEs across 8 projects, covering 85,375 lines of vulnerable source code. The evaluated cases span multiple libraries and vulnerable source files, including benchmarks from libtiff, libjpeg, jasper, libarchive, libxml2, libming, zziplib, and potrace. This provides evidence that the current implementation is not limited to a single program, source file, or vulnerability class. On these evaluated subjects, LVR completed successfully and produced final reductions once the benchmark setup and benchmark-specific interestingness oracle had been established.
In each case, the reducer was able to operate within the benchmark-driven workflow, apply its HDD-first reduction process, invoke the benchmark-specific oracle, and preserve the required exploit evidence under the defined acceptance criterion. This indicates that the approach is operationally applicable across multiple benchmark instances rather than being confined to a narrow proof-of-concept example. The evaluated cases also vary in original source size and in the location of the vulnerable target file, ranging from smaller source files to substantially larger ones. The ability of LVR to run across this range of benchmark subjects suggests that the overall workflow, including setup, mechanical reduction, bounded semantic candidate generation, and oracle-guided validation, is robust enough to support evaluation across multiple projects and vulnerable artifacts.

6.2 Accuracy

RQ1: How accurate are LLM-based reductions when performing vulnerability-preserving reductions?

Accuracy is evaluated by comparing the final LVR reductions against the corresponding C-Reduce reduction at the level of Clang AST-derived statements. In this study, C-Reduce is treated as the reference reducer, and LVR is measured in terms of how closely its removals match those produced by C-Reduce. This makes it possible to quantify reduction quality structurally rather than by relying on raw source lines, which may be affected by formatting, braces, blank lines, and other non-semantic artifacts. For each benchmark instance, three source files are analyzed: the original vulnerable file, the final C-Reduce output, and the final LVR output. Each file is parsed using Clang, and the resulting JSON AST is processed to extract a normalized set of statement-level program elements. Let Sorig, Screduce, and Slvr denote the resulting statement sets for the original program, the C-Reduce reduction, and the LVR reduction, respectively.
The removed-statement sets are then defined relative to the original program as Rcreduce = Sorig \ Screduce and Rlvr = Sorig \ Slvr. Using these removal sets, true positives (TP), false positives (FP), and false negatives (FN) are computed as follows. TP denotes statements removed by both C-Reduce and LVR. FP denotes statements removed by LVR but not by C-Reduce. FN denotes statements removed by C-Reduce but not by LVR. Precision is therefore defined as the fraction of statements removed by LVR that were also removed by C-Reduce, while recall is defined as the fraction of statements removed by C-Reduce that were also removed by LVR:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Under this formulation, precision and recall quantify the degree of structural agreement between LVR and the C-Reduce reference reduction. This definition of accuracy is intentionally comparative rather than absolute. The reported precision and recall therefore do not claim to measure semantic correctness in a general sense; rather, they measure how closely LVR matches the statement-level reductions produced by C-Reduce across benchmark CVEs. Because C-Reduce is the strongest baseline used in this study, this provides a practical and reproducible way to evaluate reduction accuracy. The precision and recall for each CVE are reported in Table 6.1. The average precision is 99.66% and the median precision is 100.00%. The average recall is 97.16% and the median recall is 98.28%. Overall, the high precision and recall indicate that LVR produces reductions that closely approximate the C-Reduce reference reductions at the AST-statement level.

6.3 Effectiveness - Reduction Size

RQ2: What is the size of the reduction when performing vulnerability-preserving reductions?

To answer Research Question 2, reduction size is measured in terms of source lines removed and percentage of source lines removed relative to the original vulnerable program.
These measurements are used to quantify how much each artifact is reduced by LVR. For comparison, the same reduction-size measurements are also reported for C-Reduce and the HDD baseline. The reduction-size results for each benchmark instance are summarized in Table 6.2.

Table 6.1: Precision and Recall: CVE ID, Precision, Recall

CVE ID            Precision   Recall
CVE-2016-10094    99.77       98.73
CVE-2016-10092    99.58       98.59
CVE-2016-10272    99.70       98.17
CVE-2016-9532     99.82       97.58
CVE-2017-5225     99.02       94.59
CVE-2017-7595     99.82       100.00
CVE-2017-7599     100.00      99.35
CVE-2017-7600     100.00      99.35
CVE-2017-7601     100.00      99.46
CVE-2012-2806     100.00      98.71
CVE-2017-15232    100.00      96.39
CVE-2018-14498    99.08       85.04
CVE-2016-8691     100.00      98.36
CVE-2016-9557     99.75       98.51
CVE-2016-5844     100.00      99.06
CVE-2012-5134     99.91       98.13
CVE-2016-9264     95.65       94.29
CVE-2018-8806     99.71       97.42
CVE-2018-8964     99.71       98.28
CVE-2024-24146    100.00      99.00
CVE-2024-24148    100.00      99.35
CVE-2017-5974     100.00      96.03
CVE-2017-5975     100.00      96.73
CVE-2017-5976     100.00      95.92
CVE-2013-7437     100.00      91.93
Average           99.66       97.16
Median            100.00      98.28
Min               95.65       85.04
Max               100.00      100.00

Table 6.2: Reduction Size: CVE ID, LVR reduction size (LVRRS), LVR percentage reduction (LVPRS), HDD reduction size (HDDRS), HDD percentage reduction (HDDPRS), C-Reduce reduction size (CRRS), and C-Reduce percentage reduction (CRPRS)

CVE ID            LVRRS    LVPRS    HDDRS    HDDPRS   CRRS     CRPRS
CVE-2016-10094    5486     98.23%   5215     93.38%   5557     99.50%
CVE-2016-10092    9160     99.32%   9032     97.93%   9188     99.62%
CVE-2016-10272    9136     99.06%   9033     97.94%   9190     99.64%
CVE-2016-9532     9123     98.92%   8843     95.88%   9203     99.78%
CVE-2017-5225     1834     96.88%   1769     93.45%   1856     98.05%
CVE-2017-7595     2401     98.85%   2383     98.11%   2422     99.71%
CVE-2017-7599     2899     98.94%   2848     97.20%   2913     99.42%
CVE-2017-7600     2899     98.94%   2851     97.30%   2913     99.42%
CVE-2017-7601     2402     98.60%   2381     97.74%   2430     99.75%
CVE-2012-2806     1288     94.43%   1272     93.26%   1345     98.61%
CVE-2017-15232    805      93.93%   787      91.83%   837      97.67%
CVE-2018-14498    406      83.20%   369      75.61%   463      94.88%
CVE-2016-8691     1570     94.46%   1422     85.56%   1617     97.29%
CVE-2016-9557     1447     93.54%   1414     91.40%   1505     97.29%
CVE-2016-5844     3037     93.07%   2986     91.51%   3240     99.30%
CVE-2012-5134     14846    96.15%   13953    90.36%   15383    99.62%
CVE-2016-9264     176      90.72%   168      86.60%   176      90.72%
CVE-2018-8806     3349     97.10%   3302     95.74%   3404     98.70%
CVE-2018-8964     3347     97.04%   3315     96.11%   3415     99.01%
CVE-2024-24146    3390     98.29%   3378     97.94%   3439     99.71%
CVE-2024-24148    2053     98.47%   2023     97.03%   2074     99.47%
CVE-2017-5974     397      82.71%   352      73.33%   444      92.50%
CVE-2017-5975     422      87.92%   402      83.75%   458      95.42%
CVE-2017-5976     393      84.33%   372      79.83%   442      94.85%
CVE-2013-7437     767      92.52%   720      86.85%   814      98.19%
Average           3321.32  94.62%   3223.60  91.43%   3389.12  97.92%
Median            2401     96.88%   2381     93.38%   2422     99.01%
Min               176      82.71%   168      73.33%   176      90.72%
Max               14846    99.32%   13953    98.11%   15383    99.78%

The average reduction size produced by LVR is 3,321.32 lines, and the median reduction size is 2,401 lines. The average reduction size produced by the HDD baseline is 3,223.60 lines, and its median reduction size is 2,381 lines. The average reduction size produced by C-Reduce is 3,389.12 lines, and the median reduction size is 2,422 lines. In percentage terms, the average and median reductions produced by LVR are 94.62% and 96.88%, respectively. The average and median percentage reductions produced by HDD are 91.43% and 93.38%, respectively. The average and median percentage reductions produced by C-Reduce are 97.92% and 99.01%, respectively. Because the original vulnerable files vary substantially in size, percentage reduction provides a more informative view of reduction effectiveness than raw lines removed alone. For example, removing 8 lines from a 10-line program yields an 80% reduction, whereas removing 10 lines from a 100-line program yields only a 10% reduction. Percentage reduction therefore captures relative reduction more accurately across benchmark instances of different sizes. Viewed in this way, all three reducers substantially reduce vulnerable programs, but they do so to different degrees.
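As a concrete illustration of how these metrics are computed, the statement-level precision/recall comparison from Section 6.2 and the line-based percentage reduction can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the real pipeline extracts the statement sets from Clang JSON ASTs, which is elided here.

```python
def removal_metrics(s_orig, s_creduce, s_lvr):
    """Precision/recall of LVR's removals against the C-Reduce reference.

    Arguments are sets of normalized statement identifiers for the original
    file and the two reductions (plain values stand in for statements here).
    """
    r_creduce = s_orig - s_creduce              # statements removed by C-Reduce
    r_lvr = s_orig - s_lvr                      # statements removed by LVR
    tp = len(r_creduce & r_lvr)                 # removed by both
    fp = len(r_lvr - r_creduce)                 # removed only by LVR
    fn = len(r_creduce - r_lvr)                 # removed only by C-Reduce
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def pct_reduction(original_lines, lines_removed):
    """Percentage of source lines removed relative to the original size."""
    return 100.0 * lines_removed / original_lines
```

For instance, pct_reduction(5585, 5486) evaluates to about 98.23, matching the LVR percentage reported for CVE-2016-10094 in Table 6.2.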
The HDD results are important because they provide a mechanical-only baseline against which the effect of LLM assistance can be interpreted. Across the completed benchmark cases, LVR outperforms HDD in both raw reduction size and percentage reduction. On average, LVR removes 97.72 more lines than HDD, and its average percentage reduction is 3.19 percentage points higher. The same pattern is reflected in the median values, where LVR exceeds HDD by 20 lines and by 3.50 percentage points. More importantly, LVR produces larger reductions than HDD on all completed CVEs. This suggests that the bounded LLM-guided semantic phase exposes additional reduction opportunities beyond hierarchical mechanical deletion alone. In the LPR study, the reductions produced by the LLM-assisted approach were reported to be comparable to those of C-Reduce. In the present study, C-Reduce performs better than LVR on both average and median reduction size, as well as on average and median percentage reduction. More specifically, C-Reduce produces smaller artifacts than LVR across all completed CVEs. However, this does not mean that LVR is ineffective. Rather, the results show that it occupies an intermediate position between the lightweight HDD baseline and C-Reduce, producing substantial reductions while still falling short of the strongest baseline.

6.4 Interpretation of Results

The results provide a clear answer to both research questions. First, LVR is effective at substantially reducing vulnerability-inducing source artifacts. Across the evaluated benchmark cases, the average reduction produced by LVR is 3,321.32 lines, with a median reduction of 2,401 lines. In percentage terms, the average reduction is 94.62%, and the median reduction is 96.88%. These values indicate that the proposed approach is capable of aggressively shrinking vulnerable source files while still preserving the benchmark-specific exploit evidence required by the oracle.
Second, the reduction-size results clarify the role of the HDD baseline. HDD provides a mechanical-only point of comparison, allowing the effect of bounded LLM-guided semantic edits to be isolated more clearly. Across the evaluated cases, HDD achieves an average reduction of 3,223.60 lines and a median reduction of 2,381 lines, corresponding to an average percentage reduction of 91.43% and a median percentage reduction of 93.38%. LVR therefore improves on the HDD baseline in both raw reduction size and percentage reduction, indicating that the semantic phase contributes additional simplification beyond hierarchical mechanical deletion alone. Third, the AST-based comparison shows that the reductions produced by LVR remain structurally close to those produced by C-Reduce. The average precision is 99.66%, and the median precision is 100.00%. The average recall is 97.16%, and the median recall is 98.28%. Under the comparison methodology used in this study, these values indicate that most of the statements removed by LVR are also removed by C-Reduce, and that LVR is able to recover most of the removals made by the reference reduction. This suggests a high degree of structural agreement between the two reducers, even though the resulting programs are not identical. At the same time, the results also show that LVR does not match the reduction strength of C-Reduce. On the evaluated benchmark cases, C-Reduce produces larger reductions both on average and in the median, whether measured by raw lines removed or by percentage reduction. The average gap between the reductions produced by C-Reduce and LVR is 67.80 lines, with a median gap of 47 lines. This indicates that while LVR often approaches the behavior of C-Reduce, a measurable gap remains between the two approaches. Taken together, these findings place LVR in an intermediate position between the pure HDD baseline and C-Reduce. 
It substantially improves on the mechanical baseline while remaining structurally close to the C-Reduce reference reductions. This is a useful result for the present study, because the central goal was not to prove that LLM assistance outperforms C-Reduce, but to determine whether it can improve vulnerability-focused reduction beyond mechanical deletion alone and approximate the behavior of a strong existing baseline [3, 7, 14, 18]. The results suggest that it can, while also showing that there is still room for improvement in both reduction strength and consistency.

CHAPTER 7 Threats to Validity

In this section, we discuss the threats to validity in our experiments and how we mitigated them. As with any empirical study of program reduction, the results should be interpreted in light of several limitations. At the same time, the evaluation design was chosen to make those limitations explicit while still providing a practical and reproducible basis for comparison.

7.1 Construct Validity

The main construct-validity concern is the definition of reduction accuracy. In this study, C-Reduce is treated as the reference reducer, and LVR is evaluated in terms of how closely its reductions match those produced by C-Reduce. This does not provide an absolute notion of semantic correctness. However, this choice is deliberate and reflects the practical constraints of vulnerability-oriented reduction. In principle, one might wish to define accuracy against the exact source-level vulnerability mechanism repaired by the developer patch, but this is not directly measurable in a consistent way across the benchmark. Developer patches often introduce guards or mitigation logic rather than explicitly identifying the exact vulnerable statement or logic, and sanitizer evidence does not always reveal every source line that contributed to the fault. As a result, patch-grounded semantic equivalence is not available as a robust, quantitative metric.
Treating C-Reduce as the reference reducer is therefore a practical and widely used approximation that permits systematic comparison across benchmark instances. A second construct-validity concern is the use of AST-derived statements as the comparison unit. This abstraction does not capture every possible semantic nuance. However, it provides a substantially stronger basis for comparison than raw source lines or text-level diffs. Because both LVR and C-Reduce perform staged transformations that may alter variable names, formatting, whitespace, blank lines, and other non-semantic details, direct textual comparison would produce misleading disagreement even when the resulting reductions are structurally similar. Clang-based AST comparison therefore serves as the most practical available heuristic for measuring structural agreement between the reducers while avoiding artifacts introduced by purely textual variation. A third construct-validity concern is the use of benchmark-specific interestingness oracles. These oracles preserve observable exploit evidence, such as sanitizer class, blame function, and stack-trace structure, but they do not provide a formal guarantee of full semantic equivalence. This limitation is inherent to oracle-guided vulnerability reduction in the absence of a stronger, fully automatable semantic specification of the vulnerability. Expanding the system to recover full root-cause semantics would require analyses such as deeper static taint tracking, symbolic execution, or automated vulnerability reasoning, which would substantially broaden the scope of the project beyond source reduction.

7.2 Internal Validity

A major internal-validity concern is the use of benchmark-specific interestingness scripts. Because each CVE requires its own oracle, the strictness of the acceptance condition may vary across cases and can influence how much reduction is achievable.
This threat was mitigated by keeping the preservation criterion aligned across reducers for each benchmark instance. In particular, whenever a benchmark required a modified or manually controlled build procedure for C-Reduce, the same effective build procedure was also used for LVR so that the comparison was not confounded by differences in how the project was compiled or executed. The scripts themselves are archived together with the final artifacts so that the exact preservation conditions can be inspected and reproduced. Another internal-validity concern is build-system variability across benchmarks. In some cases, the benchmark's default build workflow could not be preserved literally for C-Reduce, because C-Reduce operates by copying the target source file into a temporary reduction directory rather than reducing the full benchmark tree in place. In those cases, the build procedure had to be adapted so that the candidate artifact could be compiled at all. This adaptation was not intended to alter the vulnerability or the preservation criterion; it was only a practical necessity of integrating C-Reduce with benchmark projects that were not designed for file-isolated reduction. To preserve fairness, equivalent build procedures were used for LVR on the same benchmark whenever such adaptations were necessary. Environmental instability during long-running reductions, especially for C-Reduce, is another internal-validity concern. Some runs were affected by SSH session limits, long execution times, or system-level interruptions. This threat was mitigated by discarding interrupted or corrupted runs, resetting the affected benchmark state, and rerunning the reducers from clean starting points. Only completed start-to-finish reductions were retained in the reported results. Runtime was not included as an evaluation metric precisely because fair and reproducible end-to-end timing could not be guaranteed in the available environment.
7.3 External Validity

Do our results generalize? An important question is the extent to which the results of this study generalize beyond the evaluated benchmark set. The present evaluation is limited to vulnerability-inducing artifacts written in C and reduced using C-Reduce as the reference baseline. Similarly, the current implementation of LVR is designed for C programs and was evaluated using the Qwen2.5-Coder-32B model, primarily because a local deployment of that model was available for experimentation. The general workflow may be adaptable to other programming languages and other model backends, although such extensions are not evaluated in this study. The results should therefore be interpreted primarily as evidence about vulnerability-preserving reductions for C programs under the current implementation and model configuration. A second external-validity concern is that the evaluation is performed on a subset of benchmark CVEs rather than on every case available in the full benchmark suite. This limits the degree to which the results can be generalized beyond the evaluated subset. However, the selected cases span multiple projects and multiple bug classes, which helps reduce the risk that the findings are artifacts of a single codebase or a single failure type. The conclusions should therefore be interpreted as applying to the evaluated subset of vulnerability-inducing C artifacts rather than to all possible vulnerable programs. A related concern is that some benchmark cases originate from the same project and, in some instances, from the same source-file family. This reduces the independence of some observations and may make the aggregate results somewhat more reflective of the structure of particular codebases than of the full space of vulnerability-inducing programs.
However, these cases still correspond to distinct CVEs and distinct vulnerability triggers, and they remain relevant as separate reduction targets because the study evaluates benchmark instances rather than attempting to estimate the prevalence of a phenomenon across all possible source files. Finally, the study focuses on benchmarks whose failures can be constrained using sanitizer-visible exploit evidence and benchmark-specific interestingness oracles. The results may therefore generalize less directly to industrial codebases outside the benchmark setting, to vulnerabilities that do not produce strong dynamic failure evidence, or to settings in which vulnerability preservation cannot be expressed through an objective and automatable oracle.

7.4 Reliability

Are our results reliable and reproducible? The VulnLoc and San2Patch benchmarks are publicly available at https://github.com/acorn421/san2patch-benchmark. LVR is also publicly available at https://github.com/MichaelVoskanyan/LVR, and the final reduction results for both C-Reduce and LVR are available at https://github.com/MichaelVoskanyan/lvr_benchmark_results. In addition, the benchmark-specific interestingness scripts used in the experiments are archived together with the final reduced artifacts. This makes it possible for other researchers to inspect the exact preservation criteria used for each CVE and to rerun the reduction process on the same benchmark instances. The overall workflow is designed to be reproducible. Given the benchmark artifacts, the reducer, the interestingness scripts, and the reported configurations, another researcher can rerun LVR and C-Reduce on the same subjects and verify the resulting reductions. Because the evaluation is based on explicit benchmark artifacts, oracle scripts, and preserved outputs, the reported results do not depend on undocumented manual analysis.
The use of an LLM introduces some sensitivity to model and backend configuration, since different model deployments may generate different semantic candidates across runs. However, in this study the LLM is used only in a bounded candidate-generation role, and every candidate is validated by the same benchmark-specific oracle before it is accepted. As a result, the correctness of the reported reductions is determined by the oracle and the benchmark harness rather than by the LLM alone. While the exact semantic candidates may vary across model backends, the overall reduction workflow remains reproducible as long as the same benchmark setup, oracle scripts, and tool configurations are used.

CHAPTER 8
Conclusion and Future Work

8.1 Conclusion

This thesis presented LVulnReducer (LVR), an LLM-assisted reducer for vulnerability-preserving source reduction. The goal of this work was not to replace established reducers such as C-Reduce, but to evaluate whether bounded LLM-guided semantic edits could improve on lightweight hierarchical mechanical reduction and produce reductions that more closely approximate those generated by C-Reduce. To study this question, LVR was evaluated on 25 benchmark CVEs drawn from the San2Patch benchmark, using benchmark-specific interestingness oracles to preserve vulnerability-relevant exploit evidence during reduction.

The evaluation shows that LVR is applicable, accurate, and effective under the comparison methodology used in this thesis. Across the evaluated benchmark instances, LVR produced substantial reductions in vulnerability-inducing source files while preserving the required exploit evidence under the benchmark-specific oracle. At the same time, the AST-based comparison against C-Reduce showed high precision and recall, indicating that the reductions produced by LVR remain structurally close to the C-Reduce reference reductions.
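As a sketch of the structural comparison, the snippet below shows one plausible formulation of precision and recall over the sets of statements retained by the two reducers. The thesis extracts statements with a Clang AST-based tool; here that extraction is assumed to have already happened, and the statement identifiers are illustrative toy values only.

```python
def precision_recall(lvr_stmts: set, creduce_stmts: set) -> tuple:
    """One plausible formulation of the comparison metrics:
    precision = fraction of LVR-retained statements also retained by C-Reduce;
    recall    = fraction of C-Reduce-retained statements also retained by LVR."""
    common = lvr_stmts & creduce_stmts
    precision = len(common) / len(lvr_stmts) if lvr_stmts else 1.0
    recall = len(common) / len(creduce_stmts) if creduce_stmts else 1.0
    return precision, recall

# Toy statement sets (hypothetical identifiers, not real benchmark output):
lvr = {"call:memcpy", "decl:buf", "if:len>cap", "ret:0"}
cr = {"call:memcpy", "decl:buf", "if:len>cap"}
p, r = precision_recall(lvr, cr)  # p == 0.75, r == 1.0
```

Under this formulation, high precision means LVR keeps little that C-Reduce discards, and high recall means LVR discards little that C-Reduce keeps.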
On average, LVR reduced vulnerable programs by 3,321.32 lines, corresponding to an average percentage reduction of 94.62%, while achieving an average precision of 99.66% and an average recall of 97.16%.

The results also clarify the role of LLM assistance in this setting. LVR improves on the lightweight HDD baseline, showing that bounded semantic edits can expose additional reduction opportunities beyond hierarchical mechanical deletion alone. However, C-Reduce remains the strongest reducer in terms of raw minimization. The contribution of this thesis is therefore not to show that LLM assistance surpasses C-Reduce, but rather to show that an HDD-first, oracle-guided, LLM-assisted reduction pipeline can meaningfully narrow the gap between lightweight mechanical reduction and a mature reduction baseline in the context of vulnerability-inducing programs.

More broadly, this thesis shows that vulnerability-preserving reduction can be studied quantitatively using benchmark artifacts, oracle-guided reduction, and AST-based structural comparison. This provides a practical basis for evaluating LLM-assisted reduction in a security setting, where fully specifying the underlying vulnerability semantics is often not directly feasible. In that sense, LVR represents a useful step toward LLM-assisted reduction pipelines for security-focused software analysis.

8.2 Future Work

Several directions remain for future work. The first is to strengthen the preservation objective used during reduction. In the current prototype, acceptance is based on benchmark-specific exploit evidence gathered from the build-and-run oracle. A natural extension would be to augment this with stronger program-analysis signals so that accepted reductions are constrained not only by observed failure behavior, but also by static or dynamic evidence about the vulnerability-relevant path. One promising direction is to integrate lightweight static analysis into the reduction workflow.
For example, static taint or dataflow analysis using tools such as CodeQL [27] or Joern [28] could be used to identify source locations, data dependencies, or sink-relevant regions that should be preserved during reduction. This would provide a stronger structural bias than sanitizer output alone, while remaining significantly cheaper than full semantic reasoning over the entire program.

A second direction is to incorporate symbolic execution for cases where static analysis is too coarse or where the vulnerability depends on more complex path constraints. Tools such as KLEE [29] or angr [30] could be used to help determine whether a reduced artifact still preserves the critical path conditions needed to reach the vulnerable behavior. This would be especially useful for vulnerabilities whose root cause depends on value relationships or guard conditions that are not fully visible in sanitizer output.

A third direction is to combine reduction with directed fuzzing. Once a candidate artifact has been reduced to a smaller vulnerability-focused core, directed fuzzing tools such as AFL++ [31] or LibFuzzer [32] could be used to revalidate or rediscover the vulnerability with less search overhead. This suggests a broader workflow in which reduction is performed before automated vulnerability detection, confirmation, or oracle generation, thereby avoiding the cost of repeatedly analyzing large amounts of dead or irrelevant code.

More broadly, this points toward an expanded role for reduction in automated security research. Many existing vulnerability analysis pipelines spend the majority of their time executing or analyzing code that is unrelated to the faults of interest. A reduction-first workflow could serve as a preprocessing step for automated vulnerability detection, crash reproduction systems, fault localization, or exploit-evidence generation by shrinking the search space before more expensive downstream analyses are applied.
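As a concrete illustration of the analysis-guided preservation idea sketched above, the snippet below shows how the dynamic oracle could be combined with a must-preserve set of line numbers emitted by a static taint analysis. The line numbers and the acceptance rule are purely hypothetical; nothing like this is implemented in the current LVR prototype.

```python
def accept_candidate(kept_lines: set, must_preserve: set,
                     dynamic_ok: bool) -> bool:
    """Accept a reduction candidate only if it keeps every analysis-flagged
    line AND still reproduces the exploit evidence dynamically."""
    return dynamic_ok and must_preserve <= kept_lines

# Hypothetical flagged lines: 12 (taint source), 40 (guard), 41 (sink).
must_keep = {12, 40, 41}

# A candidate that keeps all flagged lines and passes the dynamic oracle:
assert accept_candidate({12, 40, 41, 90}, must_keep, dynamic_ok=True)

# A candidate that drops the guard on line 40 is rejected even if the
# dynamic evidence happens to persist:
assert not accept_candidate({12, 41, 90}, must_keep, dynamic_ok=True)
```

The intent is that static evidence constrains which structurally important regions may be deleted, while the dynamic oracle continues to gate acceptance on observed exploit behavior.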
In that sense, the long-term value of systems like LVR may lie not only in producing smaller artifacts for human inspection, but also in helping automated security pipelines focus computation on the code that matters most.

References

[1] A. Christi, M. L. Olson, M. A. Alipour, and A. Groce, "Reduce before you localize: Delta-debugging and spectrum-based fault localization," in 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSRE Workshops), Memphis, TN, USA, 2018, pp. 184–191.
[2] D. Vince, R. Hodován, and Á. Kiss, "Reduction-assisted fault localization: Don't throw away the byproducts!" in ICSOFT, 2021, pp. 196–206.
[3] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, "Test-case reduction for C compiler bugs," in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012, pp. 335–346.
[4] R. Gopinath, A. Kampmann, N. Havrikov, E. O. Soremekun, and A. Zeller, "Abstracting failure-inducing inputs," in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020, pp. 237–248.
[5] S. Herfert, J. Patra, and M. Pradel, "Automatically reducing tree-structured test inputs," in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), 2017, pp. 861–871.
[6] R. Hodován and Á. Kiss, "Modernizing hierarchical delta debugging," in Proceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation, 2016, pp. 31–37.
[7] C. Sun, Y. Li, Q. Zhang, T. Gu, and Z. Su, "Perses: Syntax-guided program reduction," in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 361–371.
[8] D. Weber and A. Christi, "ReduSharptor: A tool to simplify developer-written C# unit tests," International Journal of Software Engineering & Applications, vol. 14, pp. 29–40, 2023.
[9] D. Binkley et al., "ORBS: Language-independent program slicing," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 109–120.
[10] D. Stepanov, M. Akhin, and M. Belyaev, "ReduKtor: How we stopped worrying about bugs in Kotlin compiler," in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 317–326.
[11] G. Wang, R. Shen, J. Chen, Y. Xiong, and L. Zhang, "Probabilistic delta debugging," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021), 2021, pp. 881–892.
[12] G. Wang et al., "A probabilistic delta debugging approach for abstract syntax trees," in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 763–773.
[13] B. Wilber, T. D. Le, and A. Christi, "ReduJavator: A tool to simplify developer-written Java unit tests," in 2024 IEEE International Conference on Data and Software Engineering (ICoDSE), 2024, pp. 199–204.
[14] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, "Pushing the limit of 1-minimality of language-agnostic program reduction," Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 636–664, 2023.
[15] Z. Xu, Y. Tian, M. Zhang, J. Zhang, P. Liu, Y. Jiang, and C. Sun, "T-Rec: Fine-grained language-agnostic program reduction guided by lexical syntax," ACM Transactions on Software Engineering and Methodology, vol. 34, no. 2, pp. 1–31, 2025.
[16] G. Misherghi and Z. Su, "HDD: Hierarchical delta debugging," in Proceedings of the 28th International Conference on Software Engineering, 2006, pp. 142–151.
[17] A. Zeller and R. Hildebrandt, "Simplifying and isolating failure-inducing input," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.
[18] M. Zhang, Y. Tian, Z. Xu, Y. Dong, S. H. Tan, and C. Sun, "LPR: Large language models-aided program reduction," in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 261–273.
[19] B. Lin, S. Wang, Y. Qin, L. Chen, and X. Mao, "Large language models-aided program debloating," IEEE Transactions on Software Engineering, 2025.
[20] Y. Kim, S. Shin, H. Kim, and J. Yoon, "Logs in, patches out: Automated vulnerability repair via tree-of-thought LLM analysis," in 34th USENIX Security Symposium (USENIX Security '25), 2025, pp. 4401–4419.
[21] S. Shen, A. Kolluri, Z. Dong, P. Saxena, and A. Roychoudhury, "Localizing vulnerabilities statistically from one exploit," in Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security (ASIA CCS '21), 2021, pp. 537–549.
[22] Q. Xin, M. Kim, Q. Zhang, and A. Orso, "Program debloating via stochastic optimization," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, 2020, pp. 65–68.
[23] A. Christi, A. Groce, and R. Gopinath, "Resource adaptation via test-based software minimization," in 2017 IEEE 11th International Conference on Self-Adaptive and Self-Organizing Systems (SASO), 2017, pp. 61–70.
[24] Z. Zheng, K. Ning, Q. Zhong, J. Chen, W. Chen, L. Guo, W. Wang, and Y. Wang, "Towards an understanding of large language models in software engineering tasks," Empirical Software Engineering, vol. 30, no. 2, p. 50, 2025.
[25] Q. Zhang, C. Fang, Y. Xie, Y. Zhang, S. Yu, W. Sun, Y. Yang, and Z. Chen, "A survey on large language models for software engineering," Science China Information Sciences, vol. 69, no. 4, p. 141102, 2026.
[26] LLVM Project, "Clang: A C language family frontend for LLVM," https://clang.llvm.org/, accessed: 2026-04-22.
[27] GitHub, "CodeQL documentation," https://codeql.github.com/docs/, accessed: 2026-04-15.
[28] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, "Modeling and discovering vulnerabilities with code property graphs," in 2014 IEEE Symposium on Security and Privacy, 2014, pp. 590–604.
[29] C. Cadar, D. Dunbar, and D. Engler, "KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs," in Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008, pp. 209–224.
[30] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, "SoK: (State of) the art of war: Offensive techniques in binary analysis," in 2016 IEEE Symposium on Security and Privacy, 2016, pp. 138–157.
[31] A. Fioraldi, D. Maier, H. Eissfeldt, and M. Heuse, "AFL++: Combining incremental steps of fuzzing research," in WOOT '20: 14th USENIX Workshop on Offensive Technologies, 2020.
[32] LLVM Project, "libFuzzer – a library for coverage-guided fuzz testing," https://llvm.org/docs/LibFuzzer.html, accessed: 2026-04-15. |
| Format | application/pdf |
| ARK | ark:/87278/s6qz825m |
| Setname | wsu_smt |
| ID | 166263 |
| Reference URL | https://digital.weber.edu/ark:/87278/s6qz825m |



