
Title Russell, Michael_MCS_2021
Alternative Title A Comparison of Natural Language Feature Engineering Techniques
Creator Russell, Michael
Collection Name Master of Computer Science
Description An evaluation of feature-engineering techniques applied to natural text, examining how they affect machine-learning models' performance. The feature-engineering techniques are spaCy vectorization, Term Frequency-Inverse Document Frequency (TF-IDF), and what this work terms novel features, namely various reading-complexity scores and relative part-of-speech frequencies. The machine-learning algorithms used are K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. Results indicate that no feature-engineering technique's success generalizes beyond the specific context in which it is employed.
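The "novel" features named in the description are concrete enough to sketch. Below is a minimal illustration (not the author's code) of computing reading-complexity scores and relative part-of-speech frequencies for one document; the spaCy model name and the textstat library are assumptions, since the record does not say which scores or tools the thesis used.

```python
from collections import Counter

import spacy
import textstat

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works


def novel_features(text: str) -> dict:
    """Readability scores plus relative POS frequencies for one document."""
    doc = nlp(text)
    n_tokens = max(len(doc), 1)
    # Relative POS frequencies: tag counts normalized by token count.
    pos_counts = Counter(token.pos_ for token in doc)
    features = {f"pos_{tag}": count / n_tokens for tag, count in pos_counts.items()}
    # Two common reading-complexity scores; the thesis does not name which
    # scores it used, so these are illustrative choices.
    features["flesch_reading_ease"] = textstat.flesch_reading_ease(text)
    features["gunning_fog"] = textstat.gunning_fog(text)
    return features


print(novel_features("The quick brown fox jumps over the lazy dog."))
```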
Abstract The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect a machine-learning algorithm's ability to predict accurately. Three feature-engineering techniques were applied to a large corpus of textual data representing books by various authors. The three feature-engineering techniques are: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. 3. Reading-complexity scores and relative part-of-speech (POS) frequencies, referred to throughout this paper as a "novel" feature-engineering technique. Three machine-learning algorithms' performance was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were: 1. K-Nearest Neighbors (KNN). 2. Support Vector Machine Classifier (SVC). 3. Gaussian Naïve Bayes (GaussianNB). The variables named above, feature-engineering techniques and machine-learning algorithms, were examined in varying permutations in order to discern possible relationships and effects among them; e.g., the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the other two feature-engineering techniques, each tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature-engineered and the intended use of that data, different feature-engineering techniques will outperform others; generally, there is no "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which a feature-engineering technique is applied can drastically affect its performance. In some cases, the subtleties behind these performance differences arise from factors in the data that offer no apparent reason for any, let alone drastic, difference in performance.
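As an illustration of the permutation experiment the abstract describes, the following sketch (not the thesis code) runs one feature-engineering technique, TF-IDF vectorization, against each of the three named classifiers using scikit-learn; the tiny corpus and author labels are placeholders standing in for the book data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder corpus and labels; the thesis used books from varying authors.
texts = [
    "the ship sailed at dawn", "storms battered the hull all night",
    "the crew sighted land at last", "waves crashed over the open deck",
    "she poured the tea carefully", "the garden party began at noon",
    "letters arrived with the morning post", "the parlor was warm and quiet",
]
authors = ["A"] * 4 + ["B"] * 4

# GaussianNB needs a dense array, so the sparse TF-IDF matrix is densified.
X = TfidfVectorizer().fit_transform(texts).toarray()

# One row of the permutation grid: TF-IDF features against each classifier.
for model in (KNeighborsClassifier(n_neighbors=3), SVC(), GaussianNB()):
    accuracy = cross_val_score(model, X, authors, cv=2).mean()
    print(f"{type(model).__name__}: {accuracy:.3f}")
```

Repeating the loop with spaCy vectors or the novel features in place of X would reproduce the full grid of feature-technique and classifier permutations the abstract outlines.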
Subject Algorithms; Computer science
Keywords Natural Language Processing; Machine Learning; Feature Engineering; Genres; Authorship Attribution
Digital Publisher Stewart Library, Weber State University
Date 2021
Medium Thesis
Type Text
Access Extent 1.79 MB; 73-page PDF
Language eng
Rights The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their thesis, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights.
Source University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University
Format application/pdf
ARK ark:/87278/s6xftxtn
Setname wsu_smt
ID 96852
Reference URL https://digital.weber.edu/ark:/87278/s6xftxtn