Title | Stone, Jacob MCS_2024
Alternative Title | An Analysis of Unsupervised, Semi-Supervised, and Supervised Machine Learning Models to Categorize Procurement Data |
Creator | Stone, Jacob |
Collection Name | Master of Computer Science |
Description | This thesis focuses on both manually categorizing items and using machine learning to categorize items, and analyzes the results of each. The former achieves 100% accuracy at a huge time cost; the latter varies in performance, with some models proving beneficial and some not.
Abstract | Categorizing item purchases can be a headache for major companies, but it can be used in beneficial practices such as cost tracking and other useful metrics of spending. The major issue that can hold a company back is the time cost of categorizing every item that is purchased. Once the bulk of this work is done, however, machine learning can be used to classify further items without the same time spent. This thesis focuses on both manually categorizing items and using machine learning to categorize items, and analyzes the results of each. The former achieves 100% accuracy at a huge time cost; the latter varies in performance, with some models proving beneficial and some not. Six machine learning algorithms were implemented, spanning unsupervised, semi-supervised, and supervised learning models. Two different methods of balancing the dataset were evaluated, and many hours were spent determining the optimal preprocessing steps. Each model was then tuned to its best hyperparameters to find its best performance, and time to execute was used as an additional evaluation criterion.
Subject | Machine learning; Artificial intelligence; Algorithms |
Digital Publisher | Stewart Library, Weber State University, Ogden, Utah, United States of America |
Date | 2024 |
Medium | Thesis |
Type | Text |
Access Extent | 959 KB; 47-page PDF
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce his or her theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University
OCR Text | An Analysis of Unsupervised, Semi-Supervised, and Supervised Machine Learning Models to Categorize Procurement Data

by Jacob Stone

A Thesis in the Field of Computer Science for the Degree of MASTER OF SCIENCE in Computer Science

Approved:
Robert Ball, Advisor/Committee Chair
Abdulmalek Al-Gahmi, Committee Member
Joshua Jensen, Committee Member

WEBER STATE UNIVERSITY
2024

Abstract

Categorizing item purchases can be a headache for major companies, but it can be used in beneficial practices such as cost tracking and other useful metrics of spending. The major issue that can hold a company back is the time cost of categorizing every item that is purchased. Once the bulk of this work is done, however, machine learning can be used to classify further items without the same time spent. This thesis focuses on both manually categorizing items and using machine learning to categorize items, and analyzes the results of each. The former achieves 100% accuracy at a huge time cost; the latter varies in performance, with some models proving beneficial and some not.

Six machine learning algorithms were implemented, spanning unsupervised, semi-supervised, and supervised learning models. Two different methods of balancing the dataset were evaluated, and many hours were spent determining the optimal preprocessing steps. Each model was then tuned to its best hyperparameters to find its best performance, and time to execute was used as an additional evaluation criterion.

Dedication

Thank you to my mom, my dad, and Dan. You’ve shown me that limits do not exist.

Table of Contents

Dedication (iii)
Background (5)
Purpose (9)
Approach (12)
Criteria (15)
Related Work (17)
Milestone 1: Data Wrangling (21)
Milestone 2: Machine Learning Algorithms (27)
Milestone 3: Analysis of Results (34)
Milestone 4: Selection of Best Algorithm (38)
Conclusion (42)
References (45)

Background

Michael goes to the store for his weekly grocery shopping. He plans to purchase the same items he always gets: milk, eggs, bread, and some proteins. He lives a very routine life but is not invulnerable to an erroneous impulse buy from time to time.
This week, he decided to buy a 12-pack of energy drinks to try to replace his sugar-heavy Starbucks coffees during his morning commute. Fast-forward two months, when Michael goes to check his expenses through his online bank statement. Everything falls into place as it normally does: easy placement of his purchases of milk, eggs, bread, ground beef, and chicken breasts. All these expenses are listed with an easy item description like ‘12PK EGGS LARGE’ and ‘GAL 2% MILK’. However, the monthly totals do not quite add up as they normally do, and he notices that his coffee category is a little lower than he remembers. Another odd piece of this puzzle is a mysterious ‘12-PK SEASONAL RED KIRKLAND’ that he cannot remember.

Michael is faced with a problem because he can't recall making this purchase from two months ago, and the description provided is too vague to pinpoint the item. Costco offers a wide range of products under their Kirkland brand, and terms like "seasonal" and "red" leave the possibilities open to interpretation. It could be anything from beverages to ready-made meals or even household items like candles. This uncertainty leaves Michael unsure if he bought a coffee alternative that could have been beneficial, or if it was a regrettable choice he'd rather avoid in the future. This is a small issue that might have been easy to address at the time of purchase, but as time passes, identifying vague item descriptions becomes increasingly challenging. If Michael were to change grocery stores, the situation could worsen because the items he purchases might be listed differently on his online statement. There's no universal standard for labeling items, so the names can vary between stores or even be very similar, adding to the confusion.

Personal expense tracking, as illustrated by Michael's situation, highlights how confusion can arise from vague item descriptions. Another scenario involves a major corporation needing to change brands for a routine bulk purchase. For instance, if a large tech company switches the brand of coffee creamer in their break rooms due to a shortage during COVID, it might not raise alarms in the accounting department initially. However, a year later, when the budget is reviewed and the 'Food and Beverage' category shows a 40% increase from projections, many individuals could be surprised and upset about the unexpected expense. This discrepancy can occur if the vendor fails to provide distinct item descriptions for the new product compared to the old one. In this scenario, it might be simpler to track down the supplier if there was a complete switch, but if the change was from a 32oz bottle to single servings commonly seen at diners, identifying the change could be nearly impossible. The distinction between "Nestle Coffee mate Liquid Creamer Singles" and "Nestle Coffee mate Liquid Creamer 32oz" might slip under the radar unless someone is tasked with meticulously reviewing every individual line item of company purchases, which is highly improbable.

Another way this could happen is through a gradual increase in the price of an item. Without a specific event to mark the change, such as switching brands, the gradual increase might go unnoticed until it significantly affects the expense report or exceeds a predefined cost limit for purchases.
This situation could pose several challenges if it were being tracked: it might remain hidden if tracking only involves averaging the items in that category, it could be completely overlooked if only the category itself is being monitored, or it might be missed entirely if the item has a different enough name that it's not recognized within that category. This issue, where multiple similar descriptions represent the same actual item, is known as a 'record linkage' problem: two or more descriptions refer to the same underlying entity [1].

The aim of this thesis was to tackle the challenge of categorizing item descriptions, which are often inconsistent. This inconsistency leads to difficulties in labeling each line item accurately, often due to minor variations in spelling or formatting. While it's relatively straightforward for a human to recognize at a glance that descriptions like '30 CT HEAVY DUTY PAPER PLATES' and 'PARTY PLATES, PAPER, 50 PK' refer to the same kind of item, teaching a machine to strip away extraneous information and focus solely on the essential nouns for every purchasable item is a daunting task.

The objective was to categorize every line item, which was broken down into two primary areas of focus: item description normalization and supplier normalization. Item description normalization aimed to address instances where slightly different descriptions referred to the same product, even though the literal descriptions didn't match exactly. Supplier normalization tackled the same issue but from the supplier's perspective. Various suppliers may label their transactions with additional information, leading to challenges in grouping them accurately. For instance, Uber may label each trip with the day, such as 'UBER TRIP JULY 7, 10:37PM', or a charge could be labeled as 'UBER EATS NOV 10 9:34AM'. The former would typically belong to a transportation category, while the latter would fit into a food category. By parsing these labels, any instance of 'UBER TRIP' could be categorized as 'Transportation', and any instance of 'UBER EATS' could be categorized as 'Food and Beverage'.

Supplier normalization presented a more straightforward approach to categorization compared to item normalization. This is because there are typically fewer suppliers than items. Consequently, there tend to be fewer discrepancies in supplier descriptions compared to item descriptions. However, the focus of this thesis was on item normalization because there are significantly more discrepancies among similar items, and there's a much larger variety of items to differentiate between.

Purpose

The main purpose for attempting to solve this problem was traceability over time for the cost of goods. My research question was the following: How well do machine learning algorithms categorize items compared to user-created rules? The specific algorithms that I tested are the following:
• Naïve-Bayes
• K-Means clustering
• Random forest
• XGBoost
• Label Propagation

My hypothesis was that the machine learning algorithms would be able to adequately categorize the data, but that there would not be a meaningful difference between the machine learning algorithms and the user-created rules, with the user-created rules doing better where any difference existed.

Normalizing similar items like ‘HEAVY DUTY PAPER PLATES’ and ‘PARTY PLATES’ into ‘PLATES’ allowed all items that are "plates" to be identified and analyzed. Without this normalization, there would have been several line items all describing disposable plates.
These items could not be consolidated for any kind of analysis since their strings were not exactly the same. Companies need to be able to track their spending down to the smallest possible detail, which can mean even finding the cost for each unit within a bulk purchase. By normalizing various items to more generalized categories, ideally by stripping away adjectives and retaining the core noun of the line item, it becomes feasible to analyze the cost of each unit of individual purchases. This approach is also applicable to personal finance budgeting services, although it may be more effective within larger corporations due to the availability of a larger data pool [2, 3, 4, 5].

Having full traceability enables better control over the company's finances and facilitates informed decision-making regarding budgeting. For instance, consider a scenario where a regularly purchased product suddenly doubles in cost. This discrepancy would highlight data that falls beyond the company's cost tolerance threshold. When a data point surpasses the company's pricing tolerance, it signals a need to reconsider suppliers, product brands, or budget allocations. An effective method for identifying the necessity to switch suppliers is by comparing the product cost against the company's cost tolerance. Six Sigma, a widely adopted methodology for process improvement, offers a framework for establishing safe tolerances. Six Sigma aims to enhance business processes by minimizing defects, errors, and variation, ultimately improving quality and efficiency. Through its structured approach, DMAIC (Define, Measure, Analyze, Improve, Control), Six Sigma identifies and addresses causes of variation to achieve nearly flawless quality levels, with only 3.4 defects per million opportunities [6].

Currently, most banking and budgeting software effectively categorizes line items, providing users with insights into their spending on groceries, gas, and other expenses. However, these tools typically lack the capability to delve deeply into tracking specific fluctuations in routine purchases, such as the price changes in weekly egg purchases [7]. While it's possible for users to manually track such details with extra effort, the software itself often lacks this tracking ability, limiting users to monitoring fluctuations only within predefined categories. Without this capability, both individual users and corporations lack complete transparency into their purchasing habits [8, 9]. Integrating this functionality into banking and budgeting software would significantly enhance users' financial management capabilities. For example, the software could generate alerts when the price of a routine purchase, like bulk egg orders, suddenly exceeds a specified price limit per unit.

Approach

Machine learning (ML) is capable of processing words and strings, but its performance shines when dealing with numerical data. When working with text-based inputs, such as words and strings, ML algorithms benefit from converting them into meaningful numerical representations. This conversion process, known as pre-processing, is essential for efficient analysis. Converting words and strings into numerical values simplifies the computational tasks involved. Without this conversion, analyzing raw text would require more intricate algorithms, leading to slower and less accurate results. Thus, pre-processing plays a pivotal role in optimizing both the accuracy and speed of ML models [10, 11].
Several well-established pre-processing methods exist for transforming strings into numerical data. One such method is label encoding, which is particularly useful for binary categories, like "yes" or "no." In label encoding, all instances of "YES" are assigned the value 1, while "NO" is assigned 0. Moreover, label encoding is applicable to multi-class classification problems, assigning a unique numerical value to each distinct string category. For instance, in a dataset with categories like "cat," "dog," and "other," label encoding would assign them the values 0, 1, and 2, respectively [12]. This encoding technique effectively converts string-based output features into numerical formats.

One-hot encoding proves beneficial for categorical variables with a limited range of options, akin to distinguishing between different colors of marbles in a bag, such as 'RED', 'BLUE', and 'GREEN'. This technique simplifies the identification of distinct categories. However, its drawback lies in the creation of a binary column for each possible outcome. Although these new columns are numeric, they introduce additional dimensions to the data, with each column representing a potential outcome. Given the dataset's substantial size, comprising hundreds of unique classifications and thousands of data entries, employing one-hot encoding would significantly increase memory usage [13, 14].

In the realm of natural language processing and text analysis, term frequency (TF) emerges as a robust method. TF measures the frequency of a given word within a larger corpus of text, offering insights into its prevalence within the dataset. While this approach proves useful for understanding the distribution of words in item descriptions, it may present limitations. Simply relying on frequency counts might lead to misleading interpretations, particularly considering the varied lengths and content of the descriptions.

The approach utilized in this study is term frequency–inverse document frequency (TF-IDF). TF-IDF is a refinement of TF that evaluates a given word's significance within a corpus, or collection of texts, by comparing its frequency against the total word count. Additionally, this method involves filtering out high-frequency, low-context words, commonly referred to as stop words. Stop words are the most common words in a language that do not help describe the key words: words like ‘the’ and ‘and’ mentioned above, as well as ‘a’, ‘I’, ‘for’, ‘are’, and other general helping verbs and prepositions [15].

While TF-IDF proves to be a robust algorithm, like all machine learning (ML) algorithms, it requires seed data to function effectively. By manually inputting some data, the TF-IDF algorithm gains targets and a growing dataset of texts to analyze. The aim is to accumulate enough diverse data points to enable the algorithm to classify items autonomously, without the need for continuous manual input. Let's break down the process using the paper plates example mentioned earlier: initially, a sufficient number of line items classified as 'PLATES' by the seed data will exhibit similar TF-IDF values. This consistency in TF-IDF values enables various machine learning algorithms to effectively identify and classify them. As the overall corpus expands with more diverse data, these algorithms gradually enhance their classification accuracy.
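To make the preprocessing just described concrete, the following minimal sketch shows how TF-IDF vectorization with stop-word removal and label encoding might look in scikit-learn. The column names and sample rows are illustrative stand-ins, not the thesis's actual (confidential) data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-ins for the confidential procurement lines
df = pd.DataFrame({
    "item_description": [
        "30 CT HEAVY DUTY PAPER PLATES",
        "PARTY PLATES, PAPER, 50 PK",
        "GAL 2% MILK",
    ],
    "category": ["PLATES", "PLATES", "DAIRY"],
})

# TF-IDF turns each description into a sparse numeric vector; the built-in
# English stop-word list drops low-context words such as 'the' and 'and'
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["item_description"])

# Label encoding maps each category string to a unique integer
encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])

print(X.shape)  # (3, number of distinct terms)
print(y)        # [1 1 0]: both plate descriptions share one target value

Note that the two plate descriptions already share terms like 'paper' and 'plates'; that overlap is exactly what the TF-IDF values expose for the classifiers.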
This corpus growth creates a snowball effect wherein the algorithms become increasingly adept at categorizing line items based on past classifications. Initially, achieving accurate labeling may require stronger correlations. However, as more items fall under a specific category, the corpus grows larger. Consequently, the target that item descriptions need to match becomes more expansive. This expansion of the target facilitates a faster categorization process.

The machine learning algorithms that I used for this thesis are the following:
• Naïve-Bayes
• K-Means clustering
• Random forest
• XGBoost
• Label Propagation

Criteria

The evaluation process involved converting the strings into numerical values and subjecting them to analysis using standard machine learning algorithms. A classification report served as the benchmark for comparison, employing samplings of manually categorized data. These reports served as the primary metric for determining each algorithm's performance. The method achieving the highest F1 score, derived from TF-IDF and label-encoding pre-processing, was identified as the most effective.

Permission to utilize the data was obtained from the Institutional Review Board (IRB) after submitting an application. This step was necessary as the data involved human subjects, even if only indirectly through manual labeling. An IRB application and subsequent approval are mandatory whenever human subjects are involved in an experiment, regardless of their role. In this case, employees from the company used a template derived from the thesis to label each line item. The accuracy of the results obtained from the thesis was then measured against these manually labeled data sets. Due to the confidentiality of the data, none of the actual data will be shared, but the findings and results of the thesis will be made available.

A crucial step in this work is the seed data on which the preprocessing algorithms were initially based. This seed data was created by an employee at one of the companies included in this work. The thesis chair, Dr. Ball, and I also created the format and templating to go along with the data that was created for the seed. With each of these contributions, the seed data was completed and formatted into data that was able to initialize the algorithm for the test data that came from the companies.

The data used is confidential, and its protection was paramount during this work. The data was stored on a local machine behind encryption. Non-disclosure agreements were signed by all parties involved in this thesis who are not employed with the company, and no more parties were brought into this work. The safety of this data was one of the primary goals of all involved.

Related Work

In their paper titled "An Improved TF-IDF Approach for Text Classification," Zhang et al. (2005) delve into the application of TF-IDF in identifying characteristic words and feature words. These words are distinguished by their high confidence and correlation with the classification of specific categories. A characteristic word singularly assigns a string to a category, whereas a feature word necessitates the presence of two feature words within the text for category classification. The study exclusively focuses on utilizing TF-IDF alongside their enhanced method of employing characteristic words and feature words for data classification. Subsequent research suggests a departure from generalized testing methodologies.
However, the findings demonstrate an enhancement over conventional TF-IDF approaches, especially when the confidence level in identifying characteristic words is raised to 96%. This adjustment implies that fewer documents contain characteristic words, but those identified are more indicative of category classification, thereby enhancing the method's efficacy [16].

In "A Taxonomy of Privacy-Preserving Record Linkage Techniques" by D. Vatsalan et al. (2013), the authors explore various string comparison techniques to address privacy concerns in record linkage, particularly within medical health records. Unlike some studies, this paper does not treat TF-IDF as a standalone methodology. Furthermore, it does not establish strong correlations between TF-IDF and the results of other methods. Instead, the paper prioritizes the concept of privacy preservation over empirical experimentation. It delves into a range of techniques aimed at safeguarding sensitive data during record linkage processes, known as privacy-preserving record linkage (PPRL). The focus remains on proposing strategies to ensure data privacy rather than presenting experimental findings or emphasizing the outcomes of the methods explored [17].

In "Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping" by Bilenko, M. et al. (2005), the authors employed a unique approach involving a custom similarity function and similarity matrix to construct a perceptron for addressing record linkage challenges in online shopping contexts. Unlike traditional methods that rely on pre-existing training data, this approach was designed to accommodate the continuous influx of new data inherent in online shopping environments. The perceptron-based similarity function, coupled with clustering techniques, proved instrumental in achieving robust results. By dynamically adapting to evolving datasets, this methodology demonstrated effectiveness in accurately linking records despite the constantly changing landscape of online shopping [1].

In "Multi-co-training for Document Classification using Various Document Representations: TF–IDF, LDA, and Doc2Vec" by D. Kim et al. (2019), the authors investigate the effectiveness of several document representation methods, including TF-IDF, latent Dirichlet allocation (LDA), and Doc2Vec. They explore both supervised learning within each method and co-training between pairs of these methods, utilizing classifiers such as Naïve Bayes and random forests. The primary objective is to determine the optimal approach among supervised learning within individual methods, co-training between different combinations of methods, and various classifier options. This comprehensive evaluation is conducted across five distinct collections of texts. The findings suggest that co-training methods yield superior performance. However, it's important to note that this approach is computationally intensive and demonstrates optimal results under specific conditions characterized by low dimensionality and a scarcity of labeled documents [18].

In "Research Paper Classification Systems Based on TF-IDF and LDA Schemes" by Kim and Gil (2019), the study focuses on categorizing research papers into meaningful categories. This is achieved by extracting key words from the abstracts using latent Dirichlet allocation (LDA), followed by clustering the papers into categories based on these keywords using TF-IDF.
The findings indicate that utilizing a combination of both LDA and TF-IDF methods significantly enhances the classification process, resulting in higher clustering scores compared to other combinations [19].

In "Opportunities to Reduce Operating Expenses in Industrial Enterprises" by Gheorghe (2013), the author emphasizes that while reducing expenditure is important, it should not be the primary focus of a business. Instead, the main objective should be to maintain a healthy balance between expenditure and income. Although certain expenses may initially incur higher costs, they may be necessary to facilitate efficient operations. The author suggests that once the benefit of a particular expenditure diminishes or is no longer applicable, it becomes imperative to analyze it over time. This approach allows for a strategic evaluation of expenses and enables the identification of areas where reductions can be made without compromising operational efficiency [20].

In "Forecasting as a Way to Reduce the Risks of a Cash Flow Deficit in Agricultural Organizations" by Nosov et al. (2021), the authors introduce the concept of cost tracking as forecasting, a pivotal tool in keeping a business healthy. The forecasting here shows whether a big farm will be able to sustain its business through the loans it needs to operate. Projecting out the cost of operating expenses and procurement of goods is the optimal way to determine whether a year will make or break the business, and it allows time to mitigate risks as the organization sees fit. Their project was born from a significant omission one year that required the development of two separate cash-flow plans, broken down by quarters, to ensure the success of the business for the next 18 months [21].

Milestone 1: Data Wrangling

The company involved with this thesis spends approximately $45 million in procurement annually, with half of that data being categorized into a taxonomy known as the United Nations Standard Products and Services Codes (UNSPSC). Despite its widespread use, this taxonomy is extensive, encompassing sections that are irrelevant to the company's operations while lacking some categories necessary for accurately representing the goods and services procured.

Taxonomies, by definition, are structured systems of classification, often hierarchical in nature. While the term is commonly associated with biological classifications [22], it pertains to any systematic method of categorization. Universally recognized taxonomies include the Linnaean system in biology, which categorizes living organisms into kingdoms, phyla, classes, orders, families, genera, and species. Additionally, there are specialized taxonomies used in various fields such as education [23].

The initial phase of this project involved various setup tasks, including the creation of seed data, importing raw procurement data, and cleaning the data. Notably, not all items in the raw data were categorized. These uncategorized items could be goods and services not defined by the UNSPSC, or they could be items that somehow evaded initial categorization. Due to the sheer volume of data processed by this large company, the initial capture of data is not systematically verified for completion. As shown in Table 1 below, there are instances where items were left uncategorized, as well as cases where the UNSPSC categorization was incorrect or aligned with another taxonomy.
This table provides an excerpt of the raw data received, highlighting items that required processing.

One noteworthy example is the line 'XTM1U,' which remained uncategorized by the company prior to my receiving the data. Through further investigation, it was determined that this item referred to a battery type, allowing for its categorization within the new taxonomy. However, identifying ambiguous items often necessitated individual research, as there was no consistent rule or pattern for such cases.

Table 1. Item descriptions and associated categories (Name | UNSPSC | Google Taxonomy)
Staples Hype Stick Highlighters, Chisel, Assorted, 5/Pack (29349) | Writing Utensils | Office Supplies > Office Instruments > Writing & Drawing Instruments > Markers & Highlighters
Sharpie(R) Accent(R) Highlighters, Fluorescent Orange, Pack Of 12 | Writing Utensils | Office Supplies > Office Instruments > Writing & Drawing Instruments > Markers & Highlighters
Sharpie(R) Liquid Accent(R) Pen-Style Highlighters, Assorted Colors, Pack Of 10 | Writing Utensils | Office Supplies > Office Instruments > Writing & Drawing Instruments > Markers & Highlighters
Little Giant Ladder Systems Flip-N-Lite, 5-Foot, Stepladder, Aluminum, Type 1A, 300 lbs Rated (15273-001) | (uncategorized) | Hardware > Tools > Ladders & Scaffolding > Step Stools
Deco 79 94449 Wood Metal Marble Globe, 6" x 11", white | Workstation Items | Home & Garden > Decor > World Globes
100 foot garden hose and sprayer | Landscaping and Garden Equipment | Home & Garden > Lawn & Garden > Watering & Irrigation > Garden Hoses
elesories White Noise Machine, Small Sound Machine for Adults Baby Sleeping, Also Be Used as a Multifunctional Speaker for Home, Office Privacy / Nursery / Travel / 13 Soothing Sounds | Baby Items and Accessories | Baby & Toddler > Baby Safety > Baby Monitors
16 Pcs 3.5"-4" Unfinished Natural Wood Slices Circles with Bark for Coasters DIY Crafts Christmas Ornaments Rustic Wedding Decorations Centerpiece | Crafting Materials | Arts & Entertainment > Hobbies & Creative Arts > Arts & Crafts > Art & Crafting Materials > Craft Shapes & Bases > Craft Wood & Shapes
RYB HOME Bedroom Blackout Curtains - Black Curtains Solar Light Block Insulated Drapes Energy Saving for Bedroom Dining Living Room, 42 x 45 inches Long, Black, Set of 2 | Curtains and Blinds | Home & Garden > Decor > Window Treatments > Curtains & Drapes
Room/Dividers/Now 36in-56in Hanging Curtain Rod with Brackets, Silver | Curtains and Blinds | Home & Garden > Decor > Window Treatments > Curtains & Drapes
PONY DANCE Room Darkening Curtains - Thermal Insulated Light Block Curtain Drapes with Back Tab Energy Saving for Kitchen, 52 Wide x 54 Long, Greyish White, 2 Pieces | Curtains and Blinds | Home & Garden > Decor > Window Treatments > Curtains & Drapes
OC-CK Optics Care and Cleaning Kit SEOCCK | Eyeware Accessories | Health & Beauty > Personal Care > Vision Care > Eyewear Accessories > Eyewear Lens Cleaning Solutions
TK75230570T AA Battery, AA, High Performance, Capacity - Batteries 3,125 mAh, Standard Battery Series Procell Constant, Battery Chemistry Alkaline, Voltage - Batteries 1.5V DC, Standard Battery Pack Size 24, Max. Operating Temp. 130 Degrees F, Min. Opera | General Batteries | Electronics > Electronics Accessories > Power > Batteries
TK72953555T AA Battery, AA, Everyday, Capacity - Batteries 2,620 mAh, Standard Battery Series UltraPro, Battery Chemistry Alkaline, Voltage - Batteries 1.5V DC, Standard Battery Pack Size 24, Max. Operating Temp. 130 Degrees F, Min. Operating Temp. 20 D | General Batteries | Electronics > Electronics Accessories > Power > Batteries
XTM1U | General Batteries | Electronics > Electronics Accessories > Power > Batteries
$50 Gift Cards | Gift Cards | Arts & Entertainment > Party & Celebration > Gift Giving > Gift Cards & Certificates
$50 Gift Card | Gift Cards | Arts & Entertainment > Party & Celebration > Gift Giving > Gift Cards & Certificates
***Cancel line - no charge per Garrett Scottland | Installation/set up | Services > IT Services > Hardware Installation
New Pig Blue Absorbent Sock / Form Barrier & Prevent Spills from Spreading / 95 oz Absorbency / 3" x 48" / 40 Socks / 4048 | Household Cleaning Products | Home & Garden > Household Supplies > Household Cleaning Supplies
Zoom Room Licensing, 10 Rooms. | Licensing Fees | Subscriptions & Memberships > Subscriptions

There is an employee I worked with who is dedicated to categorizing each line item, and this work was used to help generate the seed data and to have data to compare against in the final step. This employee was able to indicate whether each line item is a good or a service with the Graphical User Interface (GUI) that had already been deployed. This GUI was specifically designed for this thesis and can be seen in Figure 1.

Figure 1. GUI used for item categorization. This GUI was used by the dedicated employee to help facilitate item categorization.

Knowing this information allowed for more accurate categorization, since goods and services do not typically share the same sets of categories, and the new taxonomy used in this work is different from the UNSPSC codes. In addition to the fully dedicated employee, a team of 10 other employees worked with me to determine the best way to analyze and present this data in a meaningful way to the heads of the company. This is imperative because the data the company currently uses covers approximately 15 of the level 1 UNSPSC categories, and those categories tier out into many different branches. The new taxonomy that was used utilizes about 6,000 categories that are more targeted and usable for the company. The extensive volume of unused data not only leads to wastage but also generates confusion within the organization.

Over the past two years, the team I collaborated with has iterated through various versions of data and categorizations to discern meaningful insights. Through this process, several key variables have emerged as crucial for this project:
• Item description
• Supplier
• Unit of Measure
• Quantity
• Manufacturer
• UNSPSC codes

In this study, it was determined that the 'Quantity' and 'Unit of Measure' columns were irrelevant for the project's objectives. Additionally, the 'Supplier' and 'Manufacturer' columns were deemed to be redundant and were subsequently combined. As a result, only the 'Item Description', 'Supplier', and 'UNSPSC Codes' columns remained relevant, with a new output column, 'Category', created to represent the classification within the new taxonomy (a sketch of this column reduction follows below).

Collaboratively, involving the dedicated employee, the team supporting the thesis, Dr. Ball, and myself, we meticulously analyzed each of the 45,000 line items in the example dataset used for this project. Through manual examination of item descriptions, we accurately categorized 14,000 lines, establishing the seed data necessary for further progress in the project.
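As an illustration of the column reduction described above, the sketch below uses pandas with hypothetical column names standing in for the confidential dataset's real schema; the fallback from 'Supplier' to 'Manufacturer' is one plausible way the two redundant columns could be combined:

import pandas as pd

# Hypothetical raw procurement lines; the real schema is confidential
raw = pd.DataFrame({
    "Item Description": ["30 CT HEAVY DUTY PAPER PLATES", "GAL 2% MILK"],
    "Supplier": [None, "COSTCO"],
    "Manufacturer": ["DIXIE", "KIRKLAND"],
    "Quantity": [2, 1],
    "Unit of Measure": ["PK", "GAL"],
    "UNSPSC Code": [None, None],  # many lines arrived uncategorized
})

df = pd.DataFrame()
df["item_description"] = raw["Item Description"]
# 'Supplier' and 'Manufacturer' were redundant: keep one, fall back to the other
df["supplier"] = raw["Supplier"].fillna(raw["Manufacturer"])
df["unspsc"] = raw["UNSPSC Code"]
# New output column for the target taxonomy, filled by seed labeling or a model
df["category"] = pd.NA

The 'Quantity' and 'Unit of Measure' columns are simply never copied over, mirroring the decision to drop them as irrelevant.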
Given the company's significant procurement volume, amounting to millions of dollars annually, a substantial quantity of seed data was necessary to initialize the algorithms effectively. As anticipated, this step proved to be the most time-consuming, requiring approximately 90 hours to classify and clean the dataset.

An example of straightforward categorization involves the item description 'Swiss Miss Milk Hot Cocoa K-Cup(R) Pods, 0.65 Oz, Pack Of 44 Pods,' which was assigned to the new category 'Food, Beverages & Tobacco > Beverages > Hot Chocolate'. Conversely, a more challenging instance arose with the item 'ACE ISE Cal B Calibrator 3x90mL Pk,' which lacked categorization under the UNSPSC codes. Resolving this required additional research into the item and supplier, ultimately placing it within the 'Business & Industrial > Science & Laboratory > Laboratory Equipment' category.

Once all the seed data was completed, the algorithms had all the data needed to move through the rest of the uncategorized data quickly and efficiently. This allowed all the unused categories to be eliminated, minimized the mass waste and confusion, and allowed useful data to make its way back to the team working with us so that meaningful conclusions could be drawn.

Milestone 2: Machine Learning Algorithms

The objective of this phase was to select the appropriate machine learning algorithms to utilize the seed data gathered in the previous step, while simultaneously integrating them with the GUI for rule creation supplied by the user.

During this milestone, a significant challenge arose concerning the implementation of rule generation. This concept had been under consideration within the team for the past year, and various approaches were explored. These methods ranged from utilizing WordNet, a complex system for stringing together parts of speech, to employing regular expressions and equations to capture different words, among others. While some of these attempts showed promise, they did not yield sufficient success rates to justify the considerable computational resources they would demand. The most successful aspect of this endeavor involved creating rules for major suppliers such as 'UBER', which could be labeled as 'TRANSPORTATION' (a minimal sketch of such rules follows below). However, major suppliers like 'WALMART' posed challenges due to their diverse range of products, requiring numerous sub-rules for adequate categorization. These challenges prompted a shift in focus away from pursuing the generation and analysis of user-generated rules, redirecting our efforts towards evaluating the performance of machine learning algorithms.
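Although rule generation was ultimately set aside, a minimal sketch of what such user-created supplier rules could look like is shown below; the rule table is illustrative and covers only the suppliers named in the text:

import re

# Ordered rules: more specific patterns must come before more general ones
SUPPLIER_RULES = [
    (re.compile(r"\bUBER\s+EATS\b"), "Food and Beverage"),
    (re.compile(r"\bUBER\b"), "Transportation"),
]

def categorize_by_rule(description):
    """Return the first matching rule's category, or None if no rule fires."""
    text = description.upper()
    for pattern, category in SUPPLIER_RULES:
        if pattern.search(text):
            return category
    return None

print(categorize_by_rule("UBER TRIP JULY 7, 10:37PM"))  # Transportation
print(categorize_by_rule("UBER EATS NOV 10 9:34AM"))    # Food and Beverage

A supplier like 'WALMART' illustrates why this approach breaks down: a single pattern cannot choose among groceries, electronics, and clothing, so every product family would need its own sub-rule.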
Table 2. Amazon Basics items and categorization (Name | Manufacturer | Google Taxonomy)
Amazon Basic Care Extra Strength Acetaminophen Caplets 500 mg, Pain Reliever and Fever Reducer, 50 Pouches of 2 Caplets Each, Total 100 Caplets | Amazon Basic Care | Health & Beauty > Health Care > Medicine & Drugs
Amazon Basic Care Aspirin Pain Reliever and Fever Reducer (NSAID), 325 mg Coated Tablets, 500 Count | Amazon Basic Care | Health & Beauty > Health Care > Medicine & Drugs
Amazon Basic Care Aspirin 81 mg Pain Reliever (NSAID) Chewable Tablets, Low Dose Aspirin, Orange Flavor, 36 Count | Amazon Basic Care | Health & Beauty > Health Care > Medicine & Drugs
Amazon Basic Care Allergy Relief Diphenhydramine HCl 25 mg, Antihistamine Tablets for Symptoms Due to Hay Fever and Upper Respiratory Allergies, 100 Count | Amazon Basic Care | Health & Beauty > Health Care > Medicine & Drugs
Amazon Basic Care - Original Hand Sanitizer 62%, 12 Fluid Ounce (Pack of 6) | Amazon Basic Care | Health & Beauty > Personal Care > Cosmetics > Bath & Body > Hand Sanitizers & Wipes
Amazon Basic Care - Aloe Hand Sanitizer 62%, 12 Fluid Ounce (Pack of 6) | Amazon Basic Care | Health & Beauty > Personal Care > Cosmetics > Bath & Body > Hand Sanitizers & Wipes
Amazon Basics Hardboard Office Clipboard - 6-Pack | Amazon Basics | Office Supplies > Office Instruments > Clipboards
Amazon Basics Hanging Organizer File Folders Letter Size, Green - Pack of 25 | Amazon Basics | Office Supplies > Filing & Organization > File Folders
Amazon Basics 3-Ring Binder, 1-Inch - White, 4-Pack | Amazon Basics | Office Supplies > Filing & Organization > Binding Supplies > Binders
Amazon Basics 3-Ring Binder, 1 Inch - White, 4-Pack | Amazon Basics | Office Supplies > Filing & Organization > Binding Supplies > Binders

One significant challenge encountered early in this thesis was the observation of extreme imbalance in the seed data. Given the decision to treat the task as a multi-class classification problem, originating from record linkages, each line within the taxonomy represents a unique classification. Consequently, the seed data alone encompassed approximately 1,000 different classification options, with varying quantities of categorized lines under each classification. Addressing this imbalance became a crucial preprocessing step before proceeding to build the algorithms. To validate this concern, I constructed and executed supervised algorithms including XGBoost, Random Forest, and Naïve Bayes, with their results documented in Tables 3, 4, and 5, respectively. These three supervised learning algorithms were selected based on their effectiveness in addressing multi-class classification challenges [24].

Table 3. Balanced and tuned XGBoost results
Metric | Precision | Recall | F-1 | Support
Accuracy | | | 0.8286 | 2643
Macro Avg | 0.8025 | 0.7472 | 0.7601 | 2643
Weighted Avg | 0.8339 | 0.8286 | 0.8224 | 2643

XGBoost proved to be the most time-consuming and challenging of the supervised models to set up. This algorithm, based on gradient tree boosting, is inherently complex, which explains its cumbersome nature. Despite being a custom Python library, XGBoost is powerful and does not necessarily require tuning; this also means that its effectiveness remains consistent once it is up and running [25].

Table 4. Balanced and tuned Random Forest results
Metric | Precision | Recall | F-1 | Support
Accuracy | | | 0.8487 | 2643
Macro Avg | 0.8315 | 0.7641 | 0.7811 | 2643
Weighted Avg | 0.8501 | 0.8487 | 0.8391 | 2643

Random Forest was among the simplest and quickest algorithms to set up initially; a condensed sketch of all three supervised models appears below.
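The sketch below condenses how the three supervised models might be trained and evaluated side by side. The feature matrix and labels are random stand-ins for the TF-IDF vectors and label-encoded categories (Multinomial Naïve Bayes requires the non-negative features TF-IDF provides), and the default hyperparameters shown are not the tuned values from this thesis:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Non-negative random features stand in for TF-IDF vectors
rng = np.random.default_rng(42)
X = rng.random((500, 40))
y = rng.integers(0, 5, size=500)  # stand-in for label-encoded categories

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "XGBoost": XGBClassifier(),  # handles its D-Matrix conversion internally
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": MultinomialNB(),  # pairs naturally with TF-IDF features
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), zero_division=0))

The classification report printed for each model is the same kind of report used as the benchmark in this thesis: precision, recall, F-1, and support per class, plus the macro and weighted averages.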
Tuning the hyperparameters, however, proved to be one of the most time-consuming tasks among all the models. Due to the multitude of parameters and the extensive range of possible classifications, this stage of the process required a significant amount of time for this model.

Table 5. Balanced and tuned Naïve Bayes results
Metric | Precision | Recall | F-1 | Support
Accuracy | | | 0.8489 | 2383
Macro Avg | 0.8095 | 0.7864 | 0.7911 | 2383
Weighted Avg | 0.8592 | 0.8489 | 0.8504 | 2383

Initially, setting up the Naïve Bayes model took some time, and the execution time was prolonged, primarily because I employed the Categorical Naïve Bayes learning model. However, once I had preprocessed the data, I transitioned to the Multinomial Naïve Bayes model. This proved to be the simplest and fastest algorithm in this project. Moreover, it achieved the highest F1 score, further highlighting its effectiveness.

The two most effective options identified for balancing the dataset were resampling with SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling). Both are pre-existing Python libraries employing distinct methodologies to achieve similar outcomes. SMOTE operates through the k-nearest-neighbors algorithm, identifying two data points in close proximity and interpolating a point along the line connecting them [26, 27]. While this method synergized well with my Naïve Bayes and XGBoost algorithms, it had a significant limitation: it excluded categories with very few classified lines. For SMOTE to function effectively, a category required a minimum of 20 classifications for XGBoost and 35 for Naïve Bayes. Recognizing this inherent limitation in k-nearest-neighbors algorithms, I sought advice from my thesis board on alternative approaches to balancing the data, and ADASYN was recommended as the next step.

The ADASYN algorithm for dataset balancing differs from SMOTE in its focus on the most challenging minority classes rather than uniformly oversampling all minority classes. It builds upon the fundamentals of SMOTE with the aim of surpassing its capabilities [28]. However, due to its foundation on SMOTE, ADASYN encounters similar limitations. Consequently, I encountered a comparable issue when implementing it within the algorithms: the dataset needed to be filtered to categories with at least 20 occurrences (sketched below). Nonetheless, ADASYN demonstrated superior performance when used with the Random Forest algorithm compared to the SMOTE method of balancing.

With the preprocessing steps done, I was able to move forward with fully building and tuning the algorithms. As mentioned above, I implemented three supervised algorithms: XGBoost, Random Forest, and Naïve Bayes. In addition, I implemented one semi-supervised approach: Label Propagation. The implementation of these methodologies was fairly straightforward, but tuning the hyperparameters was a time-consuming endeavor. Another approach I implemented was clustering using K-Means and K-Modes algorithms. Clustering is typically utilized in conjunction with other methodologies to interpret the meaning of the clusters. Further discussion on this approach will be provided in Milestone 3.

In my initial attempt at tuning, I used RandomSearchCV due to the dataset's substantial size. However, after running the code for several hours, I found that the values for the hyperparameters were either insignificant or even worse than the default values.
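Before turning to the grid search, the balancing step itself can be sketched as follows: filter out categories that fall below the minimum-occurrence threshold that the k-nearest-neighbors step requires, then oversample with SMOTE or ADASYN from the imbalanced-learn library. The synthetic data and the threshold of 20 (the XGBoost value quoted above) are illustrative:

import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN

MIN_OCCURRENCES = 20  # SMOTE needed at least 20 lines per category for XGBoost

# Illustrative imbalanced data: class 2 falls below the threshold
y = np.repeat([0, 1, 2], [200, 40, 15])
rng = np.random.default_rng(0)
X = rng.random((len(y), 20))

# Drop categories with too few classified lines for k-NN interpolation
labels, counts = np.unique(y, return_counts=True)
keep = np.isin(y, labels[counts >= MIN_OCCURRENCES])
X_kept, y_kept = X[keep], y[keep]

# Both balancers share the same fit_resample interface
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_kept, y_kept)
X_adasyn, y_adasyn = ADASYN(random_state=42).fit_resample(X_kept, y_kept)
print(np.bincount(y_smote), np.bincount(y_adasyn))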
The weak results from RandomSearchCV led me to believe that I needed to employ GridSearchCV with more comprehensive values for the hyperparameters to obtain reliable results. The main distinction between RandomSearchCV and GridSearchCV is that GridSearchCV explores every possible combination of values for each hyperparameter, whereas RandomSearchCV randomly pairs values to generate outputs. While RandomSearchCV is sometimes preferred over GridSearchCV due to its lower computational cost, I needed the thoroughness of GridSearchCV to verify the impact of the hyperparameters. After multiple iterations of running overnight, employing 5-fold cross-validation, and ensuring that my computer did not run out of memory and crash, I ran GridSearchCV on each of the supervised algorithms. The results indicated that hyperparameter tuning had minimal impact on the F1 score. With all preprocessing steps completed, algorithms fully built, and hyperparameters tuned, I could finally analyze the results of each algorithm.

Another approach that was considered but not fully incorporated into this work is treating the problem as a multi-label rather than a multi-class classification. The primary distinction between multi-label and multi-class problems is that multi-label problems allow for multiple outputs per line, whereas multi-class problems have a single output per line [29]. Given that the taxonomy is hierarchical with selectively branching sub-options, it is conceivable to view the categorization labels as independent. For instance, in a multi-class scenario, 'YAHTZEE' would be classified as 'TOYS & GAMES > BOARD GAMES,' whereas in a multi-label setup, it could be classified as both 'TOYS & GAMES' and 'BOARD GAMES.' However, a significant challenge with this approach is that not every option is immediately available. For example, 'YAHTZEE' cannot be categorized as 'BOARD GAMES' without first being categorized under 'TOYS & GAMES.' This hierarchical dependency makes constructing a network of possibilities exceedingly complex. Therefore, further exploration of this option will be conducted outside of this work to assess its performance.

Milestone 3: Analysis of Results

As mentioned previously, I initially ran the algorithms without balancing the dataset to assess their performance, and as expected, the results were poor. However, after implementing preprocessing steps and balancing the dataset, the performance of all three supervised approaches improved significantly.

Table 6. Comparison of supervised algorithm results before and after balancing
Imbalanced:
XGBoost | Accuracy: F-1 0.0008 | Macro Avg: P 0.0001, R 0.0031, F-1 0.0002 | Weighted Avg: P 0.0003, R 0.0008, F-1 0.0003 | Support 3599
Random Forest | Accuracy: F-1 0.0003 | Macro Avg: P 0.0000, R 0.0014, F-1 0.0000 | Weighted Avg: P 0.0000, R 0.0003, F-1 0.0000 | Support 3599
Naïve Bayes | Accuracy: F-1 0.0006 | Macro Avg: P 0.0001, R 0.0019, F-1 0.0001 | Weighted Avg: P 0.0000, R 0.0006, F-1 0.0001 | Support 3599
Balanced:
XGBoost | Accuracy: F-1 0.8286 | Macro Avg: P 0.8025, R 0.7472, F-1 0.7601 | Weighted Avg: P 0.8339, R 0.8286, F-1 0.8224 | Support 2643
Random Forest | Accuracy: F-1 0.8487 | Macro Avg: P 0.8315, R 0.7641, F-1 0.7811 | Weighted Avg: P 0.8501, R 0.8487, F-1 0.8391 | Support 2643
Naïve Bayes | Accuracy: F-1 0.8489 | Macro Avg: P 0.8095, R 0.7864, F-1 0.7911 | Weighted Avg: P 0.8592, R 0.8489, F-1 0.8504 | Support 2383

As seen in Table 6, the results from the imbalanced dataset are not passable. The columns correspond to the classification report metrics of (P)recision, (R)ecall, F-1 score, and Support. There are barely any visible metrics, let alone anything I could learn from this set of results.
Two notable things from this table are the vastly improved results of the balanced dataset, going from virtually 0 accuracy to an F1 score of roughly 0.85, and the Naïve Bayes algorithm having a lower support score. This lower support score means that more lines had to be filtered out for this algorithm to run properly. The Naïve Bayes model retained 90% of the sample size of the other two approaches, so this is not something that would have impacted the results, but it is worth noting.

The implementation of SMOTE and ADASYN on the algorithms resulted in almost equal performance improvement. However, XGBoost and Naïve Bayes performed better with SMOTE, while Random Forest performed better with ADASYN. The classification reports of the XGBoost, Random Forest, and Naïve Bayes algorithms after balancing the dataset and tuning the hyperparameters can be found in Tables 3, 4, and 5. All three supervised learning algorithms achieved an F1 score just shy of 0.85, with the Naïve Bayes algorithm performing the best among the three. The F1 score was chosen as the measure of effectiveness because it represents the harmonic mean of the model’s precision and recall, considering False Positives and False Negatives. This consideration is crucial in a classification problem as it provides insight into the model's performance beyond a simple arithmetic average of accuracy, especially in scenarios where class imbalances exist [30].

The semi-supervised learning method was employed last to assess whether it could outperform the supervised learning models. To set up this algorithm, a pseudo-semi-supervised dataset was created. Initially, the cleaned dataset underwent a train/test split. Subsequently, the training data was split again, with half of the data remaining in its original labeled state, while the other half had its classifications removed. This facilitated the creation of a semi-supervised dataset, where half of the training data was unlabeled. These labeled and unlabeled portions of the dataset were then concatenated for further analysis.

Table 7. Balanced and tuned Label Propagation results
Metric | Precision | Recall | F-1 | Support
Accuracy | | | 0.8139 | 1236
Macro Avg | 0.6402 | 0.9747 | 0.7271 | 1236
Weighted Avg | 0.8643 | 0.8139 | 0.7902 | 1236

From this point, I proceeded to develop the Label Propagation algorithm, employing the same preprocessing steps as in the supervised approaches. However, the execution of this algorithm proved to be more challenging than anticipated. I encountered issues such as early termination due to excessive memory usage, both from my IDE and my laptop, as well as obtaining very low F1 scores initially. After adjusting the filters to require a minimum of 135 classifications per category, I finally achieved a respectable F1 score of 0.8139, as shown in Table 7. This outcome was a pleasant surprise after grappling with the execution challenges of this model over several days.

The unsupervised methods implemented, K-means and K-modes clustering, did not perform well on this dataset, both before and after balancing. The metric used to evaluate the clusters, purity, is specific to clustering algorithms and determines the meaningfulness of the generated clusters. However, clustering is inherently challenging for multi-class classification problems [31, 32] and requires significant computational power to execute. Despite running the algorithms overnight, the purity results remained poor, as depicted in Table 8, prompting the decision to discontinue further experimentation.
This conclusion was reinforced through discussions with the Thesis Chair, especially considering the satisfactory performance of the supervised algorithms.

Table 8. Clustering models and their purity values
Model | Purity
K-Means | 0.5623
K-Modes | 0.3864

Purity is one of the popular metrics used to evaluate clustering algorithms. It assesses the effectiveness of the clustering by assigning each cluster to the most prevalent classification within it and then comparing this against the overall correctness of the classifications within the cluster itself. Ideally, purity scores should be as close to 1 as possible, indicating that each cluster contains only one accurate classification. Scores closer to 0.5 suggest a lack of clear separation between different classifications within the clusters [33]. Given the computational expense and the low purity scores obtained in our experimentation, further efforts to make sense of the clusters were not pursued. While purity can provide insights into clustering performance, it's important to consider other factors such as computational feasibility and the effectiveness of alternative methodologies.

Milestone 4: Selection of Best Algorithm

Analyzing the results and reflecting on the performance of the algorithms used in this thesis allows for a subjective assessment of the best approach. While acknowledging that the ultimate judgment lies with the stakeholders of the large company, this section outlines my personal perspective on the most effective algorithm based on several criteria. The evaluation criteria include the F1 score, the runtime efficiency of the program, and the challenges encountered during implementation. By weighing these factors, I aim to determine the algorithm that best suited my needs throughout the project. This evaluation is essential for identifying the most practical and effective approach for future endeavors.

Table 9. Head-to-head comparison of F-1 scores and support (final results)
XGBoost | Accuracy: F-1 0.8286 | Macro Avg: F-1 0.7601 | Weighted Avg: F-1 0.8224 | Support 2643
Random Forest | Accuracy: F-1 0.8487 | Macro Avg: F-1 0.7811 | Weighted Avg: F-1 0.8391 | Support 2643
Naïve Bayes | Accuracy: F-1 0.8489 | Macro Avg: F-1 0.7911 | Weighted Avg: F-1 0.8504 | Support 2383
Label Propagation | Accuracy: F-1 0.8139 | Macro Avg: F-1 0.7271 | Weighted Avg: F-1 0.7902 | Support 1236

It is worth noting that the Label Propagation algorithm has a significantly lower support score due to the nature of needing a semi-supervised dataset to run. The overall scores are not the only metric by which these models are being evaluated in this thesis. Time to construct the algorithms and time to execute the code are also within the criteria. The only models that took a significant amount of time to execute were the unsupervised approaches, and these neared an hour when trying to tune them. The supervised approaches took the least amount of time to execute, but the semi-supervised model took approximately the same time as the slowest supervised model. This can be seen below in Table 10.

Table 10. F-1 scores and time to execute (in minutes)
Model | Learning Type | Time (minutes) | F1 Score
XGBoost | Supervised | 8 | 0.8286
Random Forest | Supervised | 3 | 0.8487
Naïve Bayes | Supervised | 1 | 0.8489
Label Propagation | Semi-supervised | 8 | 0.8139

The first model to talk about is the semi-supervised approach with Label Propagation. Since the dataset I generated was fully labeled, I needed to modify it to be considered a semi-supervised dataset by eliminating half of the labels within the training data.
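A minimal sketch of that masking step, and of fitting scikit-learn's LabelPropagation on the result, is shown below; the data is a synthetic stand-in, and scikit-learn's convention of marking unlabeled samples with -1 does the work of "removing" the classifications:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the cleaned, vectorized dataset
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Mask half of the training labels; -1 means "unlabeled" to LabelPropagation
rng = np.random.default_rng(42)
y_semi = y_train.copy()
y_semi[rng.random(len(y_semi)) < 0.5] = -1

# The affinity computation over pairs of points drives the heavy memory use
# noted below; the knn kernel keeps it somewhat in check
model = LabelPropagation(kernel="knn", n_neighbors=7)
model.fit(X_train, y_semi)
print(model.score(X_test, y_test))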
Also, since this approach calculates the distance between each possible pair of points, it consumes a large amount of memory. This meant that the number of categories that needed to be excluded after the modification of the dataset was considerably larger than for any other algorithm, with the threshold being a minimum of 135 instances of a category. Determining this threshold emerged as the crucial hyperparameter, as using a lower value risked system crashes, while higher values yielded F1 scores half those of the supervised approaches. Surprisingly, this method yielded one of the higher F1 scores despite its memory-intensive nature, executing within a timeframe similar to that of the supervised approaches. Moving forward, this approach holds promise for the company, as it eliminates the need to manually label every single line before utilizing the categorization code on their data.

Next, let's delve into the supervised approaches, which constituted the initial group of algorithms developed. These approaches were comparatively straightforward to implement. The preprocessing steps were predefined during the proposal phase, covering everything from feature engineering, to including only the item descriptions and categories, to utilizing TF-IDF on the item descriptions, label encoding the categories, and applying SMOTE to balance the data. Initially, running the algorithms on the imbalanced dataset served as a simple trial run to ensure their functionality. However, one of the algorithms posed increasing difficulties compared to the other two. XGBoost, being a modified form of gradient descent with boosting, introduced challenges due to its built-in intermediary step that converts the data into a D-Matrix. These challenges involved proper data shaping and ensuring thorough preprocessing of the strings into numerical formats. Despite resolving these issues, XGBoost yielded the lowest F1 score among the supervised approaches and took the longest to execute. On the other hand, setting up Random Forest was simpler, and it achieved a higher F1 score with less execution time compared to XGBoost. Following a similar trend, the Naïve Bayes approach yielded an even higher F1 score with minimal execution time, approximately 1 minute.

Originally, the Naïve Bayes approach was in the middle in terms of success among the supervised algorithms. However, I decided to experiment with slight modifications to improve its performance. Initially, I employed the Categorical Naïve Bayes, only to find that it required longer execution times and yielded a lower F1 score compared to the other supervised approaches. Diving deeper into the documentation, I opted to explore the Multinomial Naïve Bayes, known for its compatibility with TF-IDF preprocessing [34]. This adjustment proved fruitful, as it enabled me to reduce the threshold from 50 to 35 occurrences and significantly decrease the execution time from 10 minutes to just 1 minute. This discovery played a pivotal role, as it facilitated a clear distinction in terms of F1 score, development time, and execution time, establishing the Naïve Bayes approach as the top performer across all three metrics. This comparison is illustrated in Table 11 below.

Table 11. Categorical Naïve Bayes vs. Multinomial Naïve Bayes
Model | F-1 | Time (min) | Threshold
Categorical NB | 0.0495 | 10 | 50
Multinomial NB | 0.8489 | 1 | 35

Finally, the unsupervised approaches, namely K-means and K-modes clustering, did not prove to be viable solutions for this project.
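For reference, the purity evaluation that ruled them out can be sketched in a few lines: run K-Means, assign each cluster its most common true label, and measure the fraction of lines that agree. The blob data here is a synthetic stand-in for the vectorized procurement lines:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def purity(true_labels, cluster_ids):
    """Fraction of samples carrying their cluster's most common label."""
    majority_total = 0
    for cluster in np.unique(cluster_ids):
        members = true_labels[cluster_ids == cluster]
        majority_total += np.bincount(members).max()
    return majority_total / len(true_labels)

# Synthetic stand-in data; the real features were TF-IDF vectors
X, y = make_blobs(n_samples=300, centers=4, random_state=0)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(f"purity: {purity(y, clusters):.4f}")

A score near 1 means each cluster is dominated by a single true category; scores near 0.5, as in Table 8, mean the clusters mix categories together.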
Finally, the unsupervised approaches, namely K-means and K-modes clustering, did not prove to be viable solutions for this project. Implementing these algorithms required extensive feature engineering to format the data in a manner acceptable to the models. Moreover, the computational demands were considerable, with execution times stretching into hours and even days. Although the K-means algorithm eventually completed its run after a full night, numerous results had to be filtered out to meet the algorithm's input requirements. The attempt to address this issue by using the K-modes clustering algorithm, which is tailored for categorical data, also yielded unsatisfactory results [35]. The computational expense of calculating distances between each line item and category led to an impractical runtime of nearly two full days. This can be seen in Table 12.

Table 12
Unsupervised model purity scores and time to execute

Model     Purity   Time to Execute
K-Means   0.5623   60+
K-Modes   0.3864   95+

Given the substantial time investment and poor outcomes observed during the initial clustering implementation, further exploration of this approach was deemed unwarranted. Consequently, the unsupervised clustering methods were disregarded as viable options for the project.

Conclusion

Cost tracking can be a priceless tool for major companies seeking to improve expenditure efficiency, but the first step is to accurately categorize expenses. Manual categorization has been shown to be extremely time consuming, though it is nearly 100% accurate. Whether that tradeoff between time and usefulness is worthwhile is a gray area that differs from person to person, but what if there were an easier way to categorize the data?

Accurately categorizing the dataset for this thesis took approximately 90 hours between the dedicated employee, Dr. Ball, and myself, and we were only able to cover 30% of the entire data pool of 45,000 lines. This was done to have the true answers to compare the machine learning predictions against, to have a manual entry method that is 100% accurate as a baseline, and to generate seed data for the learning models to train on. Using all this seed data, I developed several machine learning algorithms with the goal of determining how they performed. As expected, some did well and some did poorly, but none of them took 90 hours to execute.

These various models each had their own obstacles, but using an array of different types of models allowed me to trust the results whether they were good or bad. All of the models vastly improved after the data was preprocessed. I ran a test on the cleaned dataset without any kind of balancing to determine how imbalanced the data was, and the results came back with virtually 0% accuracy. Balancing the dataset therefore became a huge step in the preprocessing, along with label encoding the outputs and applying TF-IDF to the item descriptions. Balancing was done with both SMOTE and ADASYN, and this was the most crucial step in this work.

Tuning the hyperparameters of each model emerged as another significant hurdle in this thesis. In terms of time expenditure during the project, this step proved to be the most time-consuming aspect of the software development. Initially, I employed RandomizedSearchCV to explore potential parameter configurations. However, the parameters identified through this method did not yield significant improvements in the results. Subsequently, I turned to GridSearchCV to fine-tune the hyperparameters of all the models, as sketched below.
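As a rough illustration of that tuning setup, here is a minimal GridSearchCV sketch on synthetic data; the estimator, grid values, and scoring metric are illustrative assumptions rather than the actual search space used in this thesis.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed procurement data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# GridSearchCV cross-validates every combination in the grid, so the
# cost multiplies quickly: 3 * 3 * 3 combinations * 5 folds = 135 fits here.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 30],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",
    cv=5,
    n_jobs=-1,  # use all cores; large grids are why runs can last overnight
)
search.fit(X, y)
print(search.best_params_, search.best_score_)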
This involved running extensive code overnight to allow sufficient time for the algorithm to search for optimal parameter combinations. Unfortunately, despite the considerable time invested in hyperparameter tuning, the improvements achieved were minimal, if any, when applied to the learning models. The tuning process was eventually completed, albeit without the expected enhancement in the performance of the base learning models.

Throughout this work, a diverse array of unsupervised, semi-supervised, and supervised learning models was utilized and evaluated, with the results summarized in Tables 3-6. Among these approaches, the unsupervised methods exhibited the poorest performance in terms of development time, execution time, and overall effectiveness. In contrast, the semi-supervised approach, while challenging to set up, demonstrated relatively quick execution and yielded a strong F1 score. This suggests its potential utility in scenarios where manual labeling is impractical. The supervised approaches, on the other hand, delivered robust performance across the board, with minimal development and execution times. Among them, the Multinomial Naïve Bayes model stood out, boasting the highest F1 score and the shortest execution time. Despite not achieving the 100% accuracy of manually categorized data, its rapid execution time of just 1 minute coupled with a commendable F1 score of 0.8489 underscores its viability for practical deployment within a company setting.

References

1. M. Bilenko, S. Basu, and M. Sahami, “Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping,” Nov. 2005, doi: https://doi.org/10.1109/icdm.2005.18.
2. K. Pandit and H. Marmanis, Spend Analysis: The Window into Strategic Sourcing. Fort Lauderdale, FL: J. Ross Publishing, 2008.
3. Z. Tao, B. Wang, and L. Shu, “Analysis on the Procurement Cost of Construction Supply Chain based on Evolutionary Game Theory,” Arabian Journal for Science and Engineering, vol. 46, no. 2, pp. 1925–1940, Jan. 2021, doi: https://doi.org/10.1007/s13369-020-05261-4.
4. T. Kivisto and V. M. Virolainen, “Public procurement spend analysis at a national level in Finland,” Journal of Public Procurement, vol. 19, no. 2, pp. 108–128, Jun. 2019, doi: https://doi.org/10.1108/jopp-06-2019-028.
5. C. Obura, “Procurement Planning: The Principle of Sound Balance Between Procurement Control and Achieving Value for Money,” International Academic Journal of Procurement and Supply Chain Management, vol. 3, no. 2, pp. 19–27, 2020. Available: https://www.iajournals.org/articles/iajpscm_v3_i2_19_27.pdf
6. American Society for Quality, “Six sigma definition - what is lean six sigma?,” Asq.org, 2022. https://asq.org/quality-resources/six-sigma
7. “Use item categories in QuickBooks Desktop Enterprise,” quickbooks.intuit.com, Mar. 15, 2024. https://quickbooks.intuit.com/learn-support/en-us/help-article/inventory-management/inventory-categories-quickbooks-desktop-enterprise/L9P3ArO3n_US_en_US (accessed Apr. 08, 2024).
8. L. J. Servon and R. Kaestner, “Consumer Financial Literacy and the Impact of Online Banking on the Financial Behavior of Lower-Income Bank Customers,” Journal of Consumer Affairs, vol. 42, no. 2, pp. 271–305, Jun. 2008, doi: https://doi.org/10.1111/j.1745-6606.2008.00108.x.
9. “Rewards Category FAQ | Credit Card | Chase.com,” www.chase.com. https://www.chase.com/personal/credit-cards/rewards-category-faq
10. J. Huang, Y.-F. Li, and M. Xie, “An empirical analysis of data preprocessing for machine learning-based software cost estimation,” Information and Software Technology, vol. 67, pp. 108–127, Nov. 2015, doi: https://doi.org/10.1016/j.infsof.2015.07.004.
11. J. T. Hancock and T. M. Khoshgoftaar, “Survey on categorical data for neural networks,” Journal of Big Data, vol. 7, no. 1, Apr. 2020, doi: https://doi.org/10.1186/s40537-020-00305-w.
12. scikit-learn developers, “sklearn.preprocessing.LabelEncoder — scikit-learn 0.22.1 documentation,” scikit-learn.org, 2019. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
13. D. Ali, M. M. S. Missen, and M. Husnain, “Multiclass Event Classification from Text,” Scientific Programming, vol. 2021, pp. 1–15, Jan. 2021, doi: https://doi.org/10.1155/2021/6660651.
14. “sklearn.preprocessing.OneHotEncoder — scikit-learn 0.22 documentation,” scikit-learn.org, 2019. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
15. A. K. Uysal and S. Gunal, “The impact of preprocessing on text classification,” Information Processing & Management, vol. 50, no. 1, pp. 104–112, Jan. 2014, doi: https://doi.org/10.1016/j.ipm.2013.08.006.
16. Y. Zhang, L. Gong, and Y. Wang, “An improved TF-IDF approach for text classification,” Journal of Zhejiang University SCIENCE, vol. 6, no. 1, pp. 49–55, Jan. 2005, doi: https://doi.org/10.1631/jzus.2005.a0049.
17. D. Vatsalan, P. Christen, and V. S. Verykios, “A taxonomy of privacy-preserving record linkage techniques,” Information Systems, vol. 38, no. 6, pp. 946–969, Sep. 2013, doi: https://doi.org/10.1016/j.is.2012.11.005.
18. D. Kim, D. Seo, S. Cho, and P. Kang, “Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec,” Information Sciences, vol. 477, pp. 15–29, Mar. 2019, doi: https://doi.org/10.1016/j.ins.2018.10.006.
19. S.-W. Kim and J.-M. Gil, “Research paper classification systems based on TF-IDF and LDA schemes,” Human-centric Computing and Information Sciences, vol. 9, no. 1, Aug. 2019, doi: https://doi.org/10.1186/s13673-019-0192-7.
20. C. Gheorghe, “Opportunities to Reduce Operating Expenses in Industrial Enterprises,” Proceedings in Manufacturing Systems, vol. 8, 2013, Accessed: Mar. 16, 2023.
21. “Forecasting as a Way to Reduce the Risks of a Cash Flow Deficit in Agricultural Organizations.” Available: https://managementjournal.usamv.ro/pdf/vol.21_2/Art50.pdf
22. A. J. Cain, “taxonomy,” Encyclopædia Britannica, Apr. 13, 2018. Available: https://www.britannica.com/science/taxonomy
23. J. Irvine, “Taxonomies in Education: Overview, Comparison, and Future Directions,” Journal of Education and Development, vol. 5, no. 2, p. 1, May 2021, doi: https://doi.org/10.20849/jed.v5i2.898.
24. T. Li, C. Zhang, and M. Ogihara, “A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression,” Bioinformatics, vol. 20, no. 15, pp. 2429–2437, Apr. 2004, doi: https://doi.org/10.1093/bioinformatics/bth267.
25. T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, pp. 785–794, 2016, doi: https://doi.org/10.1145/2939672.2939785.
26. A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, Apr. 2018, doi: https://doi.org/10.1613/jair.1.11192.
27. G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Systems with Applications, vol. 73, pp. 220–239, May 2017, doi: https://doi.org/10.1016/j.eswa.2016.12.035.
28. J. Brandt and E. Lanzén, “A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification.” Available: https://www.diva-portal.org/smash/get/diva2:1519153/FULLTEXT01.pdf
29. F. Herrera, F. Charte, A. J. Rivera, and M. J. del Jesus, “Multilabel Classification,” in Multilabel Classification. Cham: Springer, 2016, doi: https://doi.org/10.1007/978-3-319-41111-8_2.
30. R. Ball and B. Rague, The Beginner’s Guide to Data Science. Springer Nature, 2022.
31. “12.1.4 - Classification by K-means | STAT 508,” online.stat.psu.edu. https://online.stat.psu.edu/stat508/lesson/12/12.1/12.1.4#:~:text=If%20we%20use%20k%2Dmeans (accessed Apr. 08, 2024).
32. A. Chaturvedi, P. E. Green, and J. D. Carroll, “K-modes Clustering,” Journal of Classification, vol. 18, no. 1, pp. 35–55, Jan. 2001, doi: https://doi.org/10.1007/s00357-001-0004-3.
33. “Evaluation of clustering,” nlp.stanford.edu. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html (accessed Sep. 30, 2021).
34. scikit-learn developers, “1.9. Naive Bayes — scikit-learn 0.21.3 documentation,” scikit-learn.org, 2019. https://scikit-learn.org/stable/modules/naive_bayes.html
35. N. A. Hamzah, S. L. Kek, and S. Saharan, “The Performance of K-Means and K-Modes Clustering to Identify Cluster in Numerical Data,” Journal of Science and Technology, vol. 9, no. 3, Dec. 2017. |
Format | application/pdf |
ARK | ark:/87278/s6401bp2 |
Setname | wsu_smt |
ID | 129706 |
Reference URL | https://digital.weber.edu/ark:/87278/s6401bp2 |