| Title | Wilber, Brandon MCS_2026 |
| Alternative Title | Unsupervised Statement Reduction of Java Unit Tests Experiencing Failure |
| Creator | Wilber, Brandon |
| Contributors | Christi, Arpit (advisor) |
| Collection Name | Master of Computer Science |
| Abstract | Software development produces a massive volume of logical statements that depend on, influence, and build on one another. Programs composed of these statements can fail programmatically, by breaking existing rules within the project structure or failing compiler checks, as well as logically; because logical failures usually do not crash the program, a functioning program requires configured test cases to validate its logic. When a program crashes or fails a test case, the environment it is running in will attempt to reference a line of code where it believes the user can start investigating what has happened. Because a program changes as it executes, this starting point may not be especially helpful, and locating the source of the fault can require traveling through a large amount of the codebase, becoming a massive time investment. Determining what code is required to maintain the existing failure allows irrelevant code to be hidden or removed, greatly simplifying both locating the cause of errors and maintaining the codebase as it evolves. A tool can be constructed to automate this process. Running externally, this program can modify one or more projects and create minimal test cases, removing many code elements unrelated to the existing faults. Algorithms of varying complexity can be used to find the combinations of code statements required to maintain the failure. The processing time to reduce the code statements can vary, but is of minimal concern because the process can be automated without user intervention. Testing of multiple open-source projects with known failures determined that, in all cases, a significant portion of code could be removed while maintaining the original error. |
| Subject | Debugging in computer science; Computer science; Algorithms |
| Digital Publisher | Digitized by Special Collections & University Archives, Stewart Library, Weber State University. |
| Date | 2026-03 |
| Medium | theses |
| Type | Text |
| Access Extent | 34 page pdf |
| Conversion Specifications | Adobe Acrobat |
| Language | eng |
| Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce his or her thesis, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author |
| Source | University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University |
| OCR Text | UNSUPERVISED STATEMENT REDUCTION OF JAVA UNIT TESTS EXPERIENCING FAILURE

By Brandon Wilber

A project report submitted to the faculty of the MSCS Graduate Program of Weber State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science. April 28, 2023, Ogden, Utah.

Approved: Committee Chair, Arpit Christi, Ph.D. (05/15/2023); Committee member, Yong Zhang, Ph.D. (5/16/2023); Committee member, Nicole Anderson, Ph.D. (5/17/2023)

Copyright © 2023 Brandon Wilber. All Rights Reserved.

ACKNOWLEDGMENTS

I would like to thank Dr. Arpit Christi for being my committee chair and helping to guide this project to its completion. I would also like to thank Dr. Yong Zhang and Dr. Nicole Anderson for providing additional support as members of my committee.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
Contribution of Authors
1 Introduction
2 Related Work
  2.1 DDMIN and HDD
  2.2 Additional Resources
3 Motivation
  3.1 Reduced Test Case, Purpose
  3.2 Preservation of Fault Location
  3.3 The Cost of Compilation and Testing
  3.4 Statement-level Reduction
4 ReduJavator: Setup, Workflow, Implementation
  4.1 Project Setup
  4.2 Workflow
  4.3 Implementation
5 Experiments
  5.1 Research Questions
  5.2 Subjects
  5.3 Process
  5.4 Measurement
  5.5 Statement Combinations, Multiple Algorithms
6 Results
  6.1 Applicability
  6.2 Algorithm Comparison
  6.3 Accuracy
7 Threats to Validity
  7.1 Construct Validity
  7.2 Internal Validity
  7.3 External Validity
  7.4 Reliability
8 Future Work and Conclusion
  8.1 Future Work
  8.2 Conclusion

LIST OF TABLES

1 Author Contributions
5.1 Class, Method, and Statement Composition of Test Projects
6.1 Reduction, Compiles and Failed Compiles, for both Algorithms
6.2 True-Positives, False-Positives, False-Negatives, Precision and Recall

LIST OF FIGURES

2.1 The Delta Debugging Minimizing Algorithm [2]
2.2 The Hierarchical Delta Debugging Algorithm [3]
4.1 ReduJavator Workflow
4.2 Reducer Method, Interior Workflow
4.3 Basic Code Example
4.4 Sample Code Example: JavaParser AST Representation
4.5 Sample Code Example: SimplifiedTree Representation
4.6 Sample Code Example: SimplifiedTree Representation, Level 1
4.7 Sample Code Example: SimplifiedTree Representation, Level 2 (Nested Children)
4.8 Pseudo Code, Custom Algorithm
6.1 Manual Optimum Reduction Compared to Custom Algorithm Reduction — Project ID: #8
6.2 Manual Optimum Reduction Compared to DDMIN Algorithm Reduction — Project ID: #27

Contribution of Authors

The following authors contributed to the manuscript: Brandon Wilber (BW) and Arpit Christi (AC).

Table 1: Author Contributions
Conception: BW. Data Collection: BW. Empirical Analysis: BW. Final Draft Thesis: BW, AC.

CHAPTER 1
Introduction

The debugging of modern software can be a time-consuming and complex task; when an error is encountered, the IDE or compiler often provides only a starting point for the investigation. From this starting point a developer will need to backtrack through relevant code to determine what elements contribute to the failure. Experimentation with statements may be required to discover whether they are part of the problem. In the case of unit testing, there are often dozens or even hundreds of configurations being tested. When a particular failure is being targeted and code segments are adjusted to change behavior, care must be taken to maintain the original error. One of the simplest ways to improve debugging is to remove tests and associated statements that are not part of the failure. Simplifying the failing test code reduces the number of statements to be analyzed while also removing additional points of failure that could be created during the investigation. Removal of non-faulty code also improves the accuracy of fault localization techniques [1]. Repeatedly compiling and testing bug-free code increases compilation and execution time, so removing this code reduces time spent waiting. One of the more popular search algorithms used to automatically attempt reduction of failing test cases is the Delta Debugging algorithm (DDMIN), proposed by Zeller and Hildebrandt [2].
This algorithm processes test inputs in various combinations, creating increasingly smaller subsets of inputs until a minimum set has been reached. Many programming languages such as Java or C use conditional statements and loops that allow for complex execution paths, resulting in statements nested within statements. In these cases, where there are multiple levels of code, DDMIN does not consider the underlying program structure and reduces the program by treating it as a flat array or list. There is, however, a Hierarchical Delta Debugging (HDD) algorithm, created by Misherghi and Su [3], that can attempt reduction of multi-level code blocks with the help of an Abstract Syntax Tree (AST). For the purposes of this project, we propose a tool, ReduJavator, capable of automated statement reduction of failing test code, that focuses on the following:

1. Able to reduce Java test cases, utilizing JUnit, that contain failed assertions or thrown exceptions during execution.
2. Works with Maven, Gradle, and Apache Ant build tools.
3. Uses modern language and plugin versions, while supporting programs reliant on older versions.
4. Can work on code in its existing state, without requiring preprocessing modifications or analysis.

We evaluated this tool by testing 30 failures from 6 open-source projects in Defects4j to show that it is both applicable and accurate. ReduJavator provides correct reduction with high precision (94.29%) and recall (100%).

CHAPTER 2
Related Work

2.1 DDMIN and HDD

The previously mentioned Delta Debugging algorithm [2] is frequently used for automatically testing a variety of statement combinations. The pseudo-code for the Delta Debugging Minimizing Algorithm can be seen in Figure 2.1. The algorithm iterates over a number of groups, testing them individually, and then tests the complements of those configurations.
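The group-and-complement iteration being described can be sketched as follows. This is a minimal, self-contained interpretation of DDMIN over an abstract list of "statements", where the `fails` predicate stands in for "the candidate still compiles and the test still fails"; all names are illustrative and not taken from the thesis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the DDMIN (delta debugging minimization) loop.
public class DDMin {
    public static <T> List<T> minimize(List<T> input, Predicate<List<T>> fails) {
        List<T> current = new ArrayList<>(input);
        int n = 2;                                   // start with two groups
        while (current.size() >= 2) {
            List<List<T>> groups = split(current, n);
            boolean reduced = false;
            for (List<T> group : groups) {           // test each group alone
                if (fails.test(group)) {
                    current = group;
                    n = 2;
                    reduced = true;
                    break;
                }
            }
            if (!reduced) {                          // then each complement
                for (List<T> group : groups) {
                    List<T> complement = new ArrayList<>(current);
                    complement.removeAll(group);     // assumes distinct elements
                    if (fails.test(complement)) {
                        current = complement;
                        n = Math.max(n - 1, 2);
                        reduced = true;
                        break;
                    }
                }
            }
            if (!reduced) {
                if (n >= current.size()) break;      // granularity exhausted
                n = Math.min(current.size(), 2 * n); // refine the granularity
            }
        }
        return current;
    }

    // Splits the list into n roughly equal, contiguous groups.
    private static <T> List<List<T>> split(List<T> list, int n) {
        List<List<T>> groups = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < n; i++) {
            int end = start + (list.size() - start) / (n - i);
            groups.add(new ArrayList<>(list.subList(start, end)));
            start = end;
        }
        return groups;
    }
}
```

For example, if the "failure" is preserved exactly when statements 3 and 7 are both present, this sketch reduces an eight-statement input down to those two.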
If there is a successful compilation and test failure, that group configuration becomes the new set of inputs for further testing; any inputs that were excluded are no longer considered. If all group and complement-group combinations have been tested without a failure, the granularity of the search is increased by creating a larger number of smaller groups, and testing begins again. This process continues until the number of groups equals the number of remaining inputs and the group-complement phase finishes, at which point the reduction is complete.

Figure 2.1: The Delta Debugging Minimizing Algorithm [2]

During the testing phase of this tool, 10.58% of statements were conditional or looping statements. An entire looping or conditional statement can be treated as a single input, so in cases where the whole statement can be removed, DDMIN can still perform well for statement reduction. However, if the entire conditional or looping statement cannot be removed, DDMIN will be unable to perform reduction on its interior statements. An abstract syntax tree is a tree representation of a program's structure and all the paths that can be taken, and it preserves the parent/child hierarchy that exists for control statements such as conditional branches or loops. With the assistance of an AST, the HDD algorithm [3] can step through multi-level code statements and apply DDMIN reduction to blocks of code at any level within the tree. The pseudo-code for the HDD algorithm can be seen in Figure 2.2.

Figure 2.2: The Hierarchical Delta Debugging Algorithm [3]

2.2 Additional Resources

The HDD algorithm processes the AST starting at the top level of the tree and works its way down to the leaf statements, using context-free grammars to represent lists, which can create potentially unbalanced trees.
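The level-by-level descent can be sketched as follows; this is a minimal interpretation, with a greedy one-node-at-a-time pass standing in for the per-level DDMIN call the real algorithm performs, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of HDD's outer loop: walk the tree level by level and try to
// remove nodes at each level. A greedy one-node-at-a-time pass stands in
// for the DDMIN call the real algorithm applies per level.
public class HddSketch {
    public static final class Node {
        public final String label;
        public final List<Node> children = new ArrayList<>();
        public boolean kept = true;
        public Node(String label) { this.label = label; }
        public Node add(Node child) { children.add(child); return this; }
    }

    // True if a kept node with the given label is reachable through kept
    // ancestors only (a removed parent hides its whole subtree).
    public static boolean hasKept(Node node, String label) {
        if (!node.kept) return false;
        if (node.label.equals(label)) return true;
        for (Node c : node.children)
            if (hasKept(c, label)) return true;
        return false;
    }

    // fails: does the tree, restricted to kept nodes, still reproduce the failure?
    public static void reduce(Node root, Predicate<Node> fails) {
        List<Node> level = List.of(root);
        while (!level.isEmpty()) {
            List<Node> next = new ArrayList<>();
            for (Node parent : level) {
                for (Node child : parent.children) {
                    child.kept = false;          // tentatively remove
                    if (!fails.test(root)) {     // failure lost: restore it
                        child.kept = true;
                        next.add(child);         // and descend into it later
                    }
                }
            }
            level = next;                        // move one level down
        }
    }
}
```

Removing a node prunes its entire subtree, which is why a hierarchical pass can discard whole conditional or looping blocks that a flat list-based pass cannot.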
Research done by Hodovan and Kiss determined that using extended context-free grammars to parse inputs and build trees resulted in more balanced trees, which improved performance and often led to higher levels of reduction [4]. In both the work done for HDD and the alternatives proposed by Hodovan and Kiss, reduction was attempted beyond the statement level, considering how to maintain syntactically correct code fragments that conform to grammar specifications while removing characters within statements. Eliminating characters, keywords, and variables from within code statements may result in non-compilable code that is not useful to a developer. The focus for ReduJavator is at the statement level; Christi et al. propose that this is the most useful way to approach the automated reduction of code [5, 6]. Attempting to remove portions of a statement is very likely to create a situation where the code will no longer compile, and even if it does compile, the modification is likely to impact any later statements that depend on the modified segment of code. Christi et al. present the case that statement-focused reduction provides the potential for accurate removal of unnecessary code while maintaining an improved level of performance, by not attempting modifications that are unlikely to compile.

CHAPTER 3
Motivation

3.1 Reduced Test Case, Purpose

The reduction of code in failing unit tests benefits both the developer and automated fault localization; both require a result that compiles and maintains a usable test case. The methods used for reduction are automated, so the time requirement is inconsequential, and the resulting code preserves the failure so that developers can continue from the reduced state.

3.2 Preservation of Fault Location

Often unit testing will result in the creation of many objects that are tested, modified, and tested again.
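For instance, a test of roughly this shape (hypothetical code; a plain method with explicit checks stands in for a JUnit test so the example is self-contained):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the pattern described above: an object is
// created, checked, modified, and checked again, as a unit test would do.
public class CartTest {
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static List<String> run() {
        List<String> cart = new ArrayList<>();     // object under test
        cart.add("apple");
        check(cart.size() == 1, "one item after first add");

        cart.add("banana");                        // modify the object...
        cart.remove("apple");
        check(cart.size() == 1, "still one item"); // ...and check it again
        check(cart.contains("banana"), "banana retained");
        return cart;
    }
}
```

If reduction removed the `cart.remove("apple")` statement here, the later checks would fail for a brand-new reason; this is exactly the kind of freshly created failure that statement reduction must avoid.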
This process helps ensure that operations meant to adjust the functionality of the object perform in an expected way. Removing a statement that modifies an object may create exceptions or cause previously passing assertions to fail. ReduJavator focuses on maximizing the reduction of statements while avoiding the creation of new failures.

3.3 The Cost of Compilation and Testing

Any change to code, testing or otherwise, requires that the code be re-compiled. Depending on the structure of a project, a single class may be compiled alone, or the entire project may need to be rebuilt and recompiled. If the average time to compile a project were 15 seconds and it were compiled 100 times during attempted reduction, this alone would take 25 minutes. One of the goals for ReduJavator is to reduce the number of attempts to compile code that is unlikely to compile, as well as to prevent the need for backtracking or duplicate testing. By reducing the number of configuration variants, the time required for reduction can be significantly reduced.

3.4 Statement-level Reduction

Rather than attempting to delete segments of characters from statements, eliminate keywords or programmatic grammar, or trim various nodes from the AST, only entire statements are targeted for reduction. Focusing only on statement removal significantly decreases the number of potential test combinations and increases the likelihood of successful compilation, so the statement combinations can actually be tested.

CHAPTER 4
ReduJavator: Setup, Workflow, Implementation

4.1 Project Setup

The ReduJavator tool can be used to reduce Maven, Gradle, and Apache Ant Java projects. Maven must be configured on the local system and is a dependency for ReduJavator. JUnit is used for testing and is required to be used by the projects to be reduced. To reduce Gradle projects, Gradle is also required on the local system.
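The thesis does not show its build invocation; one plausible way such a tool drives a build and reads the outcome is via `ProcessBuilder`, using the build tool's exit code to signal whether the candidate compiled and failed its test. The command shape and helper names below are assumptions for illustration, not ReduJavator's actual code.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Sketch of driving a build tool from Java and reading its exit code.
public class BuildRunner {
    // Builds a command like: mvn -q test -Dtest=MyClass#myMethod
    // (-Dtest=Class#method is Maven Surefire's single-test selector).
    public static List<String> mavenTestCommand(String testClass, String method) {
        return List.of("mvn", "-q", "test", "-Dtest=" + testClass + "#" + method);
    }

    // Runs the command in the project root. A non-zero exit code signals a
    // compile error or a test failure; a reducer must tell these apart,
    // for example by also inspecting the build log.
    public static int run(File projectRoot, List<String> command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(projectRoot)
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Because every reduction attempt pays this build cost, trimming unnecessary compile attempts translates directly into saved wall-clock time.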
To simplify the deployment of multiple projects and the ease of changing between them, a TestObject class is used. The execution tool (Maven, Gradle, or Ant), the root file path to the targeted project, the relative path from the root to the targeted class, the targeted class name, and the targeted method(s) can all be set on this object. The TestObject is passed to a Reduce function and the minimized output is returned; this workflow can be seen in Figure 4.1. If there are multiple failing methods in a class and those methods are independent of each other, the methods can be individually targeted for reduction. More than one TestObject can be queued and processed in sequence.

Figure 4.1: ReduJavator Workflow

4.2 Workflow

A third-party library, JavaParser [7], is used to process the targeted failing method into an abstract syntax tree. The AST is then parsed by a custom SimplifiedTree class that creates a flat tree structure, omitting the many JavaParser elements and nodes that are not relevant for reduction. A backup is then made of the targeted class and reduction is performed; if more than one project was targeted, reduction is performed on all methods and classes until they have all been reduced. This can be seen in Figure 4.2.

Figure 4.2: Reducer Method, Interior Workflow

4.3 Implementation

Reduction is performed at the statement level; for this purpose, the JavaParser AST contains many nodes that are not needed and is more complicated than required. Continuously traversing a dense tree to modify targeted statement blocks and make changes to the AST structure is a process that benefits from a decrease in complexity. To roughly show the structure of a JavaParser AST, consider the block of code depicted in Figure 4.3. A truncated representation of the AST created by JavaParser for this code, focusing on the depth of the actual code statements, can be seen in Figure 4.4.
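Figure 4.3 itself did not survive the OCR; the description that follows suggests code of roughly this shape, with an initial declaration, an if / else if / else, and a final checked value. The names and logic here are entirely hypothetical stand-ins.

```java
// Hypothetical stand-in for the kind of code shown in Figure 4.3:
// an initial declaration, a nested if / else if / else, and a result
// a test would assert on.
public class BasicExample {
    public static int clamp(int x) {
        int result = x;                 // initial variable declaration
        if (result < 0) {               // "then" branch
            result = 0;
        } else if (result > 100) {      // nested "else if"
            result = 100;
        } else {                        // "else"
            result = x;
        }
        return result;                  // a final assertion would check this
    }
}
```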
Figure 4.3: Basic Code Example

Figure 4.4: Sample Code Example: JavaParser AST Representation

Some key things to note with this AST representation: the initial variable declaration and final assertion statements are 7 levels deep within the AST, the "then" case for the if statement is 8 levels deep, and the "else if" and "else" statements are further nested, 13 levels deep in the AST. The SimplifiedTree method steps through the JavaParser AST to find and focus on the statements. The SimplifiedTree structure is composed of nodes that store a reference to the JavaParser statement group nodes (BlockStmts); each node has an index, flags indicating whether it is disabled or being sampled for removal, flags for whether it is a parent (or, if it is a child, what its parent is), and, for children, which sub-block of statements it belongs to. A SimplifiedTree representation of the code example is depicted in Figure 4.5: unnecessary JavaParser elements, as well as conditional logic and looping structures, have been removed, and everything is simply considered a statement. This SimplifiedTree is passed for reduction, but reduction does not take place on the entire tree at once. To begin, the full tree is stepped through and reduction is performed on the subset of statements at level 1; this can be seen in Figure 4.6. If an attempted removal is at the single-statement level and the statement cannot be removed, a check is performed to see whether it is a nested statement (a conditional block or loop); if it is, reduction is attempted on the nested child statements (see Figure 4.7).

For attempted statement reduction, one or more statements are removed from the specified method and the code is compiled. If the code fails to compile, a different combination of statements is tried. If the compile is successful, the class is tested.
If the test passes, the failure has been removed (this is undesired) and a new combination of statements is tried. If the test fails, the statements that were excluded for that test can be marked as disabled, and reduction continues on the now-smaller subset of statements. The choice of which statements to test is made by a given algorithm; the previously mentioned Delta Debugging algorithm is one such example. For this project I also developed my own algorithm; its pseudo-code can be seen in Figure 4.8.

Figure 4.5: Sample Code Example: SimplifiedTree Representation

Figure 4.6: Sample Code Example: SimplifiedTree Representation, Level 1

When the reduction algorithm completes, the SimplifiedTree is stepped through and all JavaParser AST BlockStmts are set to whatever statements were not disabled. The custom algorithm is recursive and does not use any loops; it attempts to remove the "back" half of a set of statements and recurses to locally increase granularity for attempted removal. At any recursion level, "front" statements are never targeted for removal before reduction has completed for the "back" statements.

Figure 4.7: Sample Code Example: SimplifiedTree Representation, Level 2 (Nested Children)

Figure 4.8: Pseudo Code, Custom Algorithm

CHAPTER 5
Experiments

5.1 Research Questions

To evaluate this tool, the following questions are considered:

1. How applicable is ReduJavator?
2. When performing statement reduction on failing unit tests, how accurate are the results?

RQ1: The expectation is that the completed tool will function correctly when applied to unfamiliar projects, both in terms of general execution and logical processing. To test this, ReduJavator will be run on 30 real-world failures from 6 open-source Java projects in Defects4j. The number of failures correctly handled and the number of statements processed will be reported.
If a significant number of failures and statements are correctly processed, we can claim that the tool is highly applicable.

RQ2: Results are considered accurate when only statements that should be removed are removed. Reduced accuracy can present in two ways:

1. Statements that should be removed are not removed (False-Negative).
2. Statements that should not be removed are removed (False-Positive).

The False-Positive and False-Negative labeling of statements is used to calculate the standard measures of precision and recall for accuracy. The precision and recall metrics are expected to be high, confirming a low occurrence of undesired behavior.

5.2 Subjects

Defects4j was chosen as a database to locate open-source projects with preserved failures [8]. Many of these failures represent project states, compiler language targets, and plugin versions that are more than a decade old. Selected failures needed to meet the following requirements:

• The program needed to compile with the existing failure
• Failures/errors needed to be targetable
  – Failing states with multiple non-functional methods or errors across numerous classes were passed on
• Reduction had to be possible
  – Methods with errors that could not logically be reduced were passed on

Of the available Defects4j projects, 30 failures were tested from the following 6: commons-lang, jodatime, commons-codec, commons-compress, commons-csv, and jsoup. Some failures allowed reduction across more than one class, and some allowed reduction across multiple methods. The specifics of the test projects can be seen in Table 5.1; the Parent Statements label in this table refers to conditional and looping statements. To optimize the use of space, tables following 5.1 refer to the projects simply by their ID. The total number of conditional or looping statements was 64, representing 10.58% of all statements. For all testing, Maven version 3.9.1 and Java 8 were used.
5.3 Process

For every method experiencing an assertion failure or throwing an exception, the optimum minimum set of statements required to maintain the fault was determined manually. From this point on, this minimum state is referred to as the optimum reduction.

5.4 Measurement

The original state of every failing method is recorded, and the minimized state is compared against it as well as against the optimum reduction. The following are recorded:

1. Percent of statements removed for each project
2. Ratio of compiles for all failures to the number of statements in all failures
3. Percent of compiles resulting in failure for each project

The minimized test cases returned by ReduJavator are measured for accuracy using the performance metrics precision and recall [9]. Precision and recall can be measured by determining the following three outcomes for each statement after reduction has been performed:

1. True-Positive: Statements that were not required for the error and were correctly removed.
2. False-Positive: Statements that were required for the error and were incorrectly removed.
3. False-Negative: Statements that were not required for the error and were incorrectly left in the code.

Table 5.1: Class, Method, and Statement Composition of Test Projects

5.5 Statement Combinations, Multiple Algorithms

All testing was performed twice: once using the DDMIN approach in combination with the SimplifiedTree AST, so that it could handle hierarchical structures, and once with the custom algorithm. The custom algorithm is outlined in Chapter 4, Section 3, with pseudo-code in Figure 4.8. All of the metrics indicated in the Measurement section are used to compare the algorithms against each other.

CHAPTER 6
Results

6.1 Applicability

As indicated in Table 5.1, 605 statements were targeted for reduction across 32 classes and 47 methods from 6 different projects.
All statements were correctly processed without ReduJavator raising any exceptions or causing unexpected behavior. Early testing highlighted some minor logic deficiencies that had not been exposed by custom test-case examples; time was taken to correct these issues and improve the overall tool experience (for both algorithms). The 6 projects selected from Defects4j show variety in both functional purpose and programmatic style, so we expect results from these tests to be applicable to a wider range of projects.

6.2 Algorithm Comparison

The percentage of statements removed, the number of compiles, and the number of failed compiles for both algorithms can be found in Table 6.1. The % removed columns represent reduction across the entire failure, in cases where there was more than one method or class. Several project IDs in Table 6.1 have an asterisk next to them; the meaning of these is explained later in this analysis. The DDMIN algorithm averaged 69.42% statement reduction and the custom algorithm averaged 67.35%. When counting compile attempts, duplicate statement combinations were ignored and only unique combinations were counted. The average number of compiles for DDMIN reduction was 1.35 times the number of statements; the custom algorithm required an average of 0.61 times the number of statements (54.82% fewer compiles than DDMIN). DDMIN's compile attempts failed 53.62% of the time, compared to 33.51% for the custom algorithm (a relative reduction of 37.5% compared to DDMIN).

6.3 Accuracy

The standard measurements of precision and recall are used to report accuracy; these measurements are determined from a statement-level analysis of tool results compared to the optimum reduction solution. In the context of ReduJavator, precision measures how well the tool was able to predict which statements should be removed.
Recall measures how many of the statements that should have been removed were removed by the tool. These measurements are used because the focus of ReduJavator is the correct removal of statements, and these metrics provide insight into how successful the algorithms were at achieving those goals. Precision is measured as True-Positive / (True-Positive + False-Positive); recall is measured as True-Positive / (True-Positive + False-Negative).

Table 6.1: Reduction, Compiles and Failed Compiles, for both Algorithms

Before calculating these values, consider Figure 6.1, an example of an optimum reduction result compared to a custom algorithm reduction result. Here the optimum reduction solution throws an exception (as did the full code state): the value of 29 is not a supported number for the month of February. The custom reduction is similar and contains the same failing statement, but is missing the DateTimeFormatter object. Without this formatting object there is no exception, and the assert fails not because 29 is an invalid value, but because the parsed result is incorrectly set to a default value of -1. This is an example of a removed statement that is neither required for the class to compile nor required for the class to experience a failure at the same statement location as the original code. In this situation, the tool cannot correctly determine exactly what is needed by considering only compilation and test-result exit codes. In this example, the removed formatting object is a statement that was removed when it should not have been: a False-Positive.

Figure 6.1: Manual Optimum Reduction Compared to Custom Algorithm Reduction — Project ID: #8

Table 6.2 contains the True-Positive, False-Positive, and False-Negative counts, along with precision and recall results, for both the DDMIN algorithm and the custom algorithm. Looking at the recall for the custom algorithm, we can see it is 100%: every statement that should have been removed was removed.
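As a quick concrete check of these formulas in code (the counts below are made up for illustration and are not taken from Table 6.2):

```java
// Precision and recall over statement-level outcomes, as defined above.
public class Accuracy {
    public static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    public static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }
}
```

With illustrative counts of 33 true-positives and 2 false-positives, precision would be 33/35 ≈ 94.29%; with zero false-negatives, recall is 100%, the same shape as the custom algorithm's reported results.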
The False-Negatives for the custom algorithm are all 0. This is unsurprising, because all statement types were correctly parsed by ReduJavator and statement reduction was attempted for every statement, down to the single-statement level. The precision for the custom algorithm is 94.29%; combined with the recall result that everything that should have been removed was removed, this means that 5.71% of the statements that were removed were not required to compile or to fail the test. These statements were, however, required to add additional context to the failure or to maintain a specific mode of failure.

Table 6.2: True-Positives, False-Positives, False-Negatives, Precision and Recall

Investigating some of the cases where the DDMIN algorithm created False-Negatives, one such example can be seen in Figure 6.2. Here the original non-reduced code created three objects and tested them with 3 assertions; the "two" object and its subsequent assertion are retained in the optimum reduction. The removal of the "two" object by the DDMIN algorithm results in a failed assertion for the "both" object. In this case, the reduction by DDMIN created a different error that did not originally exist, allowing the original error and its associated statements to be removed. By removing the statements connected to the original failure and keeping statements that were not related to it, the DDMIN algorithm simply achieves lower precision and recall scores. The potential severity of this undesired behavior is not sufficiently reflected in the DDMIN precision and recall scores, which appear acceptably high. This is where the asterisk next to several project IDs comes in: DDMIN reduced to a failing statement that differed from the original failing statement in 15 out of the 30 failures, or 18 out of 47 methods (38.3% of the time). The custom algorithm consistently maintained the original failing statement.
Furthermore, in every case where DDMIN achieved a higher level of reduction than the custom algorithm, it did so by focusing on a different failing statement that erred for a reason DDMIN itself created.

Figure 6.2: Manual Optimum Reduction Compared to DDMIN Algorithm Reduction — Project ID: #27

CHAPTER 7
Threats to Validity

7.1 Construct Validity

Does our data confirm the utility of ReduJavator? Thirty failures from Defects4j open-source projects were tested using both the DDMIN algorithm and the custom algorithm. Using the custom algorithm, precision of 94.29% and recall of 100% were achieved. The reasons for less-than-100% precision were investigated and determined to be not a flaw in the tool itself, but a limitation of its reliance on compilation and testing outcomes to determine success or failure.

7.2 Internal Validity

Was bias effectively mitigated during testing? The 30 reduced failures were drawn from 6 different open-source projects, covering a variety of development styles and purposes. Testing these open-source projects revealed, and allowed us to fix, a few flaws in the tool; these issues had not been discovered through our custom test cases. The results of custom testing showed the potential for a 60% decrease in required compiles compared with DDMIN, while analysis of the open-source testing revealed a compilation decrease closer to 50%. For the precision and recall scores, the optimum reduction used as the reference was occasionally beyond the scope of what ReduJavator could reasonably be expected to handle, but we opted to score against this higher standard.

7.3 External Validity

Can our results be generalized? The open-source projects tested covered a variety of topics and exhibited multiple programming styles; we expect the results of the tool to be applicable to other project types as well.
The current implementation of the tool is Java-focused and depends on the JavaParser library, but the logical techniques used could be adapted to work with other languages and AST structures.

7.4 Reliability

Is our evaluation of ReduJavator reliable? Custom unit tests confirm that ReduJavator is capable of parsing and reducing all conditional and looping statement types that the Java language supports, as well as lambda expressions. Testing of 30 different failures from Defects4j showed that all code elements were interpreted and reduced as expected; we believe that continued experimentation would produce the same behavior.

CHAPTER 8
Future Work and Conclusion

8.1 Future Work

The current state of ReduJavator has shown that it is efficient and effective, but there is always room for additional improvement. One limitation highlighted by the DDMIN reduction is that automated reliance on compilation and test exit codes creates the opportunity for the original failing statement to be removed. There is more than one way this could be addressed, but flagging the failing statement before reduction, to ensure that it cannot be removed, would be a relatively low-cost improvement that would help prevent the program from modifying the underlying code more than it should. Even though the DDMIN algorithm may require more compiles than the custom algorithm, this improvement would allow DDMIN to reduce toward the correct error 100% of the time.

Compilation time represents the greatest resource cost for the reduction tool, so decreasing the number of attempted compiles would be a helpful improvement. The DDMIN algorithm frequently attempted to remove code statements that are required by later statements, resulting in failed compilation.
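The flagging idea can be sketched as follows; the names and the one-statement-at-a-time loop below are hypothetical, not ReduJavator's API. The reducer records the failing statement up front and simply never offers it for removal, so no oracle outcome can strip it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PinnedReduction {
    // One-at-a-time reduction that never offers the pinned (originally
    // failing) statement for removal. Illustrative only.
    static List<String> reduce(List<String> stmts, int pinnedIndex,
                               Predicate<List<String>> stillFails) {
        String pinned = stmts.get(pinnedIndex);
        List<String> kept = new ArrayList<>(stmts);
        for (int i = kept.size() - 1; i >= 0; i--) {
            if (kept.get(i).equals(pinned)) continue; // never remove the flagged statement
            List<String> candidate = new ArrayList<>(kept);
            candidate.remove(i);
            if (stillFails.test(candidate)) kept = candidate; // keep the smaller version
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> stmts = List.of(
                "int a = 1;", "int b = 2;", "assertEquals(3, a + b);");
        // Degenerate oracle that always reports "still failing": without the
        // pin, every statement would be removed; with it, the assertion survives.
        System.out.println(reduce(stmts, 2, c -> true));
        // prints: [assertEquals(3, a + b);]
    }
}
```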
The custom algorithm attempts to remove later code segments before earlier ones, so in cases where neither the later nor the earlier statement is required, both can be cleanly removed without causing a failed compilation attempt. In situations where both the earlier and later statements are required to maintain the failure, the custom algorithm will still attempt to remove the earlier statement, resulting in a failed compile. Using pattern recognition or other parsing techniques to build a map of how statements are connected would be a helpful improvement to the current program. The parsing and logical mapping of the code would likely take only milliseconds, while each avoided compile attempt could reduce the total reduction time by tens of seconds.

Maven and JUnit support targeting single classes and methods for testing. ReduJavator uses this targeted approach when possible, but it ultimately depends on how each project is configured. Many of the open-source projects required the entire codebase to be compiled every time the test code changed, and test suites that exercised many other, non-failing classes were also seen frequently. The ability to decouple the test classes from the main project and target them directly would significantly reduce how long reduction takes.

Having observed that 5.71% of the removed statements were required to fully support the original failure, it may be beneficial not to remove them. Removing statements has the benefit of creating a smaller set of code to analyze, but a number of missing statements can also make the logic of the program harder to follow.
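A crude version of such a statement map can be built from def-use information; the `Stmt` shape and helper below are hypothetical, not JavaParser's API. A removal attempt is skipped, saving a doomed compile, whenever a later statement still uses a name the candidate defines.

```java
import java.util.List;
import java.util.Set;

public class DependencySkip {
    // Hypothetical statement record: its text, the names it defines,
    // and the names it uses.
    record Stmt(String text, Set<String> defs, Set<String> uses) {}

    // A statement can be dropped on its own only if no later statement
    // uses a name it defines (a deliberately crude approximation).
    static boolean safeToRemove(List<Stmt> stmts, int i) {
        for (int j = i + 1; j < stmts.size(); j++) {
            for (String def : stmts.get(i).defs()) {
                if (stmts.get(j).uses().contains(def)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Stmt> stmts = List.of(
                new Stmt("int a = 1;",     Set.of("a"), Set.of()),
                new Stmt("int b = a + 1;", Set.of("b"), Set.of("a")),
                new Stmt("log(b);",        Set.of(),    Set.of("b")));
        System.out.println(safeToRemove(stmts, 0)); // false: "a" is used later
        System.out.println(safeToRemove(stmts, 2)); // true: nothing after it
    }
}
```

A real map would need to handle reassignment, scoping, and side effects, which is why the text above frames this as future work rather than a solved problem.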
If, instead of being removed, statements not required for the failure were commented out, someone debugging the reduction could more clearly see the flow of the program and easily restore a statement that may have been better left alone.

8.2 Conclusion

During the testing of the open-source Defects4j failures, the most difficult task was simply determining what was required to successfully compile these projects. Many projects built their own dependencies, had complex build files, or used a combination of more than one build tool. A user who is familiar with the project being reduced, which is likely if they are reducing unit test code to find the source of an error, would already have cleared the largest hurdle to reduction with ReduJavator. After analysis of the testing results for the 30 failures, ReduJavator performed quite well when paired with the custom search algorithm. All of the failures were fully parsed, compiled, and reduced without error, and within a reasonable amount of time (especially compared with the compilation cost of DDMIN). While the tool successfully removed all statements that were not required for a failure, while maintaining the failing statement, it also showed a tendency to remove additional statements that were required to maintain the full optimum reduction state. The frequency of these extra removals was quite low, however, and their impact was not significant: just a slight variance in the effective cause of the same failing statement. Considering all factors, ReduJavator is widely applicable, easy to use, and produces accurate statement reduction.

References

[1] A. Christi, M. L. Olson, M. A. Alipour, and A.
Groce, “Reduce before you localize: Delta-debugging and spectrum-based fault localization,” in 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 2018, pp. 184–191.

[2] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.

[3] G. Misherghi and Z. Su, “HDD: Hierarchical delta debugging,” in Proceedings of the 28th International Conference on Software Engineering, 2006, pp. 142–151.

[4] R. Hodován and Á. Kiss, “Modernizing hierarchical delta debugging,” in Proceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation, 2016, pp. 31–37.

[5] A. Christi and A. Groce, “Target selection for test-based resource adaptation,” in 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2018, pp. 458–469.

[6] A. Christi, A. Groce, and R. Gopinath, “Resource adaptation via test-based software minimization,” in 2017 IEEE 11th International Conference on Self-Adaptive and Self-Organizing Systems (SASO). IEEE, 2017, pp. 61–70.

[7] R. Hosseini and P. Brusilovsky, “JavaParser: A fine-grain concept indexing tool for Java problems,” in CEUR Workshop Proceedings, vol. 1009. University of Pittsburgh, 2013, pp. 60–63.

[8] R. Just, D. Jalali, and M. D. Ernst, “Defects4J: A database of existing faults to enable controlled testing studies for Java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 437–440.

[9] C. Goutte and E. Gaussier, “A probabilistic interpretation of precision, recall and F-score, with implication for evaluation,” in Advances in Information Retrieval: 27th European Conference on IR Research (ECIR 2005), Santiago de Compostela, Spain, March 21–23, 2005. Springer, 2005, pp. 345–359.
| Format | application/pdf |
| ARK | ark:/87278/s6rc0pd8 |
| Setname | wsu_smt |
| ID | 158367 |
| Reference URL | https://digital.weber.edu/ark:/87278/s6rc0pd8 |



