Title | Russell, Michael_MCS_2021 |
Alternative Title | A Comparison of Natural Language Feature Engineering Techniques
Creator | Russell, Michael |
Collection Name | Master of Computer Science |
Description | An evaluation of feature-engineering techniques applied to natural text, examining how such techniques affect machine-learning models' performance. The feature-engineering techniques evaluated are spaCy vectorization, Term Frequency-Inverse Document Frequency, and what this work refers to as novel features, namely various reading complexity scores and relative part-of-speech frequencies. The machine-learning algorithms used in this work are K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. The results indicate that no feature-engineering technique generalizes its success beyond the specific context in which it is employed.
Abstract | The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect machine-learning algorithms' ability to predict accurately. Three feature-engineering techniques were applied to a large collection of textual data representing books from various authors. The three feature-engineering techniques are: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. The performance of three machine-learning algorithms was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were: 1. K-Nearest Neighbors (KNN). 2. Support Vector Machine Classifier (SVC). 3. Gaussian Naïve Bayes (GaussianNB). The variables named above (feature-engineering techniques and machine-learning algorithms) were examined in varying permutations in order to discern possible relationships and effects among them; e.g., the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the next two feature-engineering techniques, each tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature engineered and the intended use of such data, different feature-engineering techniques will be better than others. Generally, there is no "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which feature-engineering techniques are applied can drastically affect their performance. In some cases, these performance differences arise from factors in the data that offer no apparent reason for causing any difference at all, let alone a drastic one.
Subject | Algorithms; Computer science |
Keywords | Natural Language Processing; Machine Learning; Feature Engineering; Genres; Authorship Attribution |
Digital Publisher | Stewart Library, Weber State University |
Date | 2021 |
Medium | Thesis |
Type | Text |
Access Extent | 1.79 MB; 73-page PDF
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
OCR Text | A Comparison of Natural Language Feature Engineering Techniques by Michael Russell A Thesis in the Field of Computer Science for the Degree of Master of Science in Computer Science Approved: Dr. Robert Ball Advisor/Committee Chair Dr. Abdulmalek Al-Gahmi Committee Member Joshua Jensen Committee Member WEBER STATE UNIVERSITY 2021 Abstract The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect machine-learning algorithms' ability to predict accurately. Three feature-engineering techniques were applied to a large collection of textual data representing books from various authors. The three feature-engineering techniques are: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. The performance of three machine-learning algorithms was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were: 1. K-Nearest Neighbors (KNN). 2. Support Vector Machine Classifier (SVC). 3. Gaussian Naïve Bayes (GaussianNB). The variables named above (feature-engineering techniques and machine-learning algorithms) were examined in varying permutations in order to discern possible relationships and effects among them; e.g., the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the next two feature-engineering techniques, each tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature engineered and the intended use of such data, different feature-engineering techniques will be better than others. Generally, there is no "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which feature-engineering techniques are applied can drastically affect their performance. In some cases, these performance differences arise from factors in the data that offer no apparent reason for causing any difference at all, let alone a drastic one.
Table of Contents Introduction ..... 5 Related Work ..... 12 Obtaining Full-length Books for Dataset ..... 15 Scraping GoodReads for Genres ..... 16 Cleaning the Data ..... 20 Preprocessing and Feature Engineering the Text ..... 23 spaCy vectorization ..... 23 Term Frequency-Inverse Document Frequency (TF-IDF) ..... 24 Novel Feature-Engineering Techniques ..... 25 Feature-Engineering Pipeline ..... 27 Challenges Faced with Feature Engineering ..... 29 Multilabel Classification ..... 35 Building and Scoring Multilabel Classifiers ..... 36 Most Important Features ..... 42 Challenges Faced with Multilabel Classification ..... 44 Author Prediction: A New Approach ..... 48 Conclusion ..... 63 Introduction Natural language processing (NLP) refers to the branch of computer science which aims to allow computers to understand and deal with language data in a way similar to humans. Processing textual data is becoming increasingly important and relevant to data scientists, researchers, and social media companies, to name only a few. While intentions may differ for processing natural text via software programs, the rudimentary process remains the same. For textual data to be interpreted or be useful to a computer, it must be transformed into some sort of numerical representation. Techniques that produce this transformation are referred to as "feature engineering" in the literature. These techniques range in levels of computation and complexity as well as in their efficacy to represent all or certain aspects of the original text. The Bag-of-Words feature-engineering method works by taking all of the unique words in a text and counting the number of occurrences of each word within a given text segment. For example, a sentence such as: "The man walked up the mountain and he saw something he had not seen before" contains the unique set of words: {"the", "man", "walked", "up", "mountain", "and", "he", "saw", "something", "had", "not", "seen", "before"} Using this unique set of words as an index, an array representation of counts of words can be produced as follows: [2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1] This array representation could then be fed to a machine-learning algorithm as input.
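As a concrete illustration (not code from the thesis), the Bag-of-Words transformation just described can be reproduced with Python's standard library; the vocabulary is built in order of first appearance so the resulting array matches the walk-through above:

    from collections import Counter

    sentence = "The man walked up the mountain and he saw something he had not seen before"
    tokens = sentence.lower().split()

    # Build the vocabulary in order of first appearance, then count occurrences.
    vocabulary = list(dict.fromkeys(tokens))
    counts = Counter(tokens)

    bag_of_words = [counts[word] for word in vocabulary]
    print(vocabulary)
    # ['the', 'man', 'walked', 'up', 'mountain', 'and', 'he', 'saw', 'something', 'had', 'not', 'seen', 'before']
    print(bag_of_words)
    # [2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]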
Despite being one of the simplest feature-engineering methods for natural language, Bag-of-Words still achieves its purpose and is a proven and tested method still used in many applications today [1]. More sophisticated feature-engineering techniques such as term frequency-inverse document frequency, various vectorization techniques, and n-grams often employ similar tactics to this fundamental method. Certain pieces of information from the original text are lost during such a transformation, i.e., counts of words do not capture relational positioning between words, definitions of words, or sentiment. This occurs not only during the transformation process described but, to some degree, during human and machine translation of a text. Rabinovich describes in her work how text traits can successfully identify an author's gender, but when that same text is translated, such traits are obfuscated to the point that author gender becomes unpredictable [2]. Not only does modifying, translating, or transforming text leave an identifiable mark [3], but such processes cannot be performed while retaining all traits and aspects of the original text. Because certain feature-engineering techniques may be better than others at retaining certain information of a text, such as sentiment or meaning, one should consider what information from their text is most important to their objective. This research addresses differences in the abilities of three feature-engineering techniques in the task of identifying the correct author of a text using several different machine-learning algorithms. Such differences may reflect which pieces of information from a text are more useful in the task of identifying its author. Also, the results may indicate how well each feature-engineering technique captures relevant information of a given text. The task used to measure differences among feature-engineering techniques in this work is authorship attribution. Much work has been done on authorship attribution and other stylometric problems [4] [5] [6], such as author clustering, which is the task of successfully separating texts of n number of authors into n clusters; author verification, which compares pairs of texts and predicts which one belongs to a determined author; author profiling, which predicts demographic and other author information like gender, age, etc.; and author diarization, which clusters a single text written by n authors into n clusters, much like author clustering except it is performed on a single text written by more than one author [4]. The feature-engineering techniques in this work, excepting the readability scores, obtain data directly from the raw text of a book rather than by some intermediary processing of such text. For example, the readability scores process the text of the book via readability formulas and return a score based on such formulas. Additionally, some have claimed improved authorship identification results when first predicting the gender and age of the author of a work and then using such details as features in addition to the feature-engineered text to improve authorship prediction [7]. Because the aim of this work is to compare feature-engineering techniques' performance in a natural-language-processing context, such measures to improve predictive performance for authorship identification are irrelevant.
It should be noted here that my initial task differed from the one described above, namely using the task of predicting the correct author for a given text to measure differences between feature-engineering techniques. Initially, I planned to measure such differences between feature-engineering techniques via genre classification. Much of my early work described below will cover details pertaining to this initial method as well as why I shifted from genre classification to author classification. The results of my first method revealed useful information that contributes to the overall substance of this paper. My research question is the following: Which of the three feature-engineering techniques (spaCy vectorization, TF-IDF, and the novel techniques suggested in this work) produces the highest accuracy when used with the three machine-learning algorithms (K-Nearest Neighbors [8], Gaussian Naïve Bayes, and Support Vector Machine Classification) on a dataset consisting of a corpus of full-length book texts, where the objective is authorship attribution? Performance here is measured by the accuracy of the three machine-learning algorithms' predictions of correct authorship for given books. Variables such as the number of authors and books in the dataset will be considered and tested incrementally and systematically to assess relevant effects. The three feature-engineering techniques this work compares have been described above. The machine-learning algorithms that are used to examine the performance of each feature-engineering technique are briefly described here. K-Nearest Neighbors is a classification method that represents data in an n-dimensional space where n is the number of features in the data. To classify a given example, the algorithm assigns the most popular class of the k-nearest datapoints, where k is the number of nearest neighbors used. The distance between a datapoint and the rest of the datapoints can be calculated using the Euclidean distance formula. Figure 1 shows a 2D plot of datapoints belonging to three different classes. The different classes are represented by color and shape. An unidentified datapoint is shown as a question mark inside a circle. The K-Nearest Neighbors algorithm would classify this datapoint by finding the most popular class of the k-nearest neighbors. If we use a k of four, the datapoint would be assigned the class of the green triangles. This is because the four closest neighbors to the respective datapoint consist of three green triangles and one purple star. Green triangles are the most popular class. Figure 1: A visualization of datapoints in a 2-dimensional space with class designations denoted by color and shape; red circles represent one class, purple stars another, and green triangles a third. An unassigned datapoint is shown as a question mark. Gaussian Naïve Bayes has proven to be effective for classification problems involving natural language processing [9]. Bayes' Theorem calculates the conditional probability of an event given knowledge of a related event. The naïve aspect of the Naïve Bayes classifier is that it assumes the probability value of a given feature is independent of any other feature's value. This assumption, while often inaccurate, does not seem to disrupt the Naïve Bayes classifier's ability to classify well [10], particularly with text classification [11]. The Gaussian variant of the Naïve Bayes classifier makes an additional assumption that continuous data are distributed in accordance with the normal, or Gaussian, distribution.
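A minimal Sci-Kit Learn sketch of the two classifiers described so far; the datapoints are invented and only loosely mirror Figure 1, and this is illustrative rather than the thesis code:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB

    # Toy 2-D datapoints with three classes, analogous to Figure 1 (values are invented).
    X = np.array([[1.0, 1.2], [1.1, 0.9], [3.0, 3.1], [3.2, 2.8], [3.1, 3.3],
                  [5.0, 1.0], [5.2, 1.1], [4.9, 0.8]])
    y = np.array(["red_circle", "red_circle", "green_triangle", "green_triangle",
                  "green_triangle", "purple_star", "purple_star", "purple_star"])

    # K-Nearest Neighbors: Euclidean distance by default, majority vote among k neighbors.
    knn = KNeighborsClassifier(n_neighbors=4).fit(X, y)

    # Gaussian Naive Bayes: assumes feature independence and normally distributed features.
    gnb = GaussianNB().fit(X, y)

    unknown = np.array([[3.0, 2.9]])      # the "question mark" datapoint
    print(knn.predict(unknown))           # majority class among the 4 nearest neighbors
    print(gnb.predict(unknown))           # class with the highest Gaussian likelihood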
An example of how the probabilities of two events are calculated according to a Gaussian distribution is shown in Figure 2 [12]. Figure 2: A graphic showing the Gaussian probability curve for two events. Points are shown to indicate the joint probability of either event [62]. Figure 2 shows the Gaussian, or normal, probability curve for class A and class B, as well as the probability of an event from either class A or class B given that an event from the counterpart class has occurred. Probabilities generated by this method allow for classification based on such probabilities, i.e., a generated probability can be rounded to conclude whether a certain outcome or label is appropriate. The Support Vector Machine classifier works by finding the optimal hyperplane in an n-dimensional space where n is the number of features for the dataset. A hyperplane is optimal when it most distinctly separates the examples in a dataset into their classes while maximizing the margin between the boundary of the hyperplane and the nearest datapoints. The datapoints closest to this boundary are called support vectors. Using the boundary, the Support Vector Machine can classify the data accordingly. Support Vector Machine classifiers have proven to be very effective in applications for textual data in various stylometric tasks [13]. Related Work Having an initial objective of comparing feature-engineering techniques' predictive performance in the task of book genre classification, I found many related works which, instead of predicting book genres, predicted movie or music genres. These alternative media sources have non-textual formats which require entirely different feature-engineering methods. For example, Barney et al [14] use poster images for movies as input in their model predicting genres for movies. Battu et al [15] alternatively use synopses of movies to build their predictor, which not only predicts movie genre but also rating based on a five-star rating scale. This latter work manually categorized tens of genres into nine overarching genre labels, which introduces a subtle touch of subjectivity to the categorization of labels. This work also employed recurrent neural networks as a modeling technique, as has been successfully demonstrated in other previous works [16]. A large emphasis of this work was put on examining performance differences of machine-learning models rather than of feature-engineering techniques. The work of Bergstra et al [17] implements a genre predictor for musical pieces. This work does not expound upon the feature-engineering process of the files containing music but does offer a unique technique for minimizing genre labels in their dataset. Relying upon the work of Pachet and Cazaly [18], which presents an attempt at an objective taxonomy of musical genres, Bergstra et al reduce genre labels for musical pieces by replacing child nodes within this hierarchical taxonomy with their parent nodes so that a viable number of genres remains. Unfortunately, I could find no standard taxonomy of book genres to be used for this work. As shown in the work of Maharjan et al [19], many have attempted to manually reduce genres to a minimal set, but at the cost of injecting subjective bias. Maharjan et al, like Battu et al, put their emphasis on performance differences among machine-learning models rather than feature-engineering techniques.
Additionally, the aim of their work is to achieve a likability rating of a book, and the ability of their model to predict book genres is an inadvertent byproduct of this. Work has been done to refine the feature-engineering process for natural language processing solutions, however. Book2Vec, presented by Anvari and Amirkhani [20], builds upon document vectorization methods which focus on word tokens. Book2Vec represents books in a novel way by using low-dimensional numeric vectors containing data related to words in sentences of the book and finds that, while this feature-engineering technique may not be ideal for accurate performance, it does allow for the retention of the sentiment of a text or book. Like Anvari and Amirkhani, many have attempted to classify or predict text sentiment [16] [21]. Although this task is often performed on textual data, such data often differs in length and style from book text. Such differences ought to invite scrutiny of suggested correlations between successful feature-engineering techniques for the two, as suggested in the work of Tang et al, which limits its data particularly to short text instead of long text [22]. Many works have demonstrated the efficacy of the feature-engineering techniques employed in this work, such as Bag-of-Words [1] and TF-IDF [23]. A similarly large corpus of works supports the efficacy of the machine-learning techniques employed in this work [13] [24] [25] [26]. For the task of identifying authorship of a text and other stylometric endeavors, authors have varied much in the feature-engineering techniques, machine-learning algorithms, and focus of their classification pursuits. Rabinovich et al [3] emphasize the utility of classifying author gender in relation to further classification aims. Hong et al [27] classify based on author types rather than identities for online texts and show lexical and syntactic features that can be exploited to effectively identify such types. Other works focus on identifying authors for forensic or legal purposes, with an emphasis on comparing machine-learning models rather than feature-engineering techniques [28] [6] [29]. Obtaining Full-length Books for Dataset My first task was to obtain a large number of full-length books which would later contribute to my dataset. I used a collection of 62,162 books from Project Gutenberg [30]. After downloading this collection, I extracted the text files of the books from zipped folders within a complex series of subdirectories using bash through the command line terminal. This left me with 62,162 books stored as text files. The file names of such books indicated their directory identity and placement rather than the title or author of the book. To obtain the titles and authors of the books, I wrote a script in Python to iteratively search the contents of the text file of each book and extract the book title and author name of the given book. Because Project Gutenberg enforces a header revealing copyright information, publication details, et cetera, at the top of text files for books within their public library, extracting the book title was not trivial. Using a Regular Expression (Regex), I was able to filter through the file contents to locate the title of the book.
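A minimal sketch of the kind of Regex-based extraction described above, assuming the header contains "Title:" and "Author:" lines as Project Gutenberg files typically do (this is illustrative, not the thesis script):

    import re

    def extract_title_and_author(file_contents):
        """Pull the title and author out of a Project Gutenberg header."""
        title = re.search(r"^Title:\s*(.+)$", file_contents, re.MULTILINE)
        author = re.search(r"^Author:\s*(.+)$", file_contents, re.MULTILINE)
        return (title.group(1).strip() if title else None,
                author.group(1).strip() if author else None)

    header = "The Project Gutenberg eBook of Pride and Prejudice\nTitle: Pride and Prejudice\nAuthor: Jane Austen\n"
    print(extract_title_and_author(header))  # ('Pride and Prejudice', 'Jane Austen')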
In addition to extracting this information, I needed to remove Project-Gutenberg-enforced headers and content from the book files so that such data would not misrepresent the book content or skew results by injecting foreign content into the text of each book. The results of this milestone included obtaining over 40,000 books as text files (some books from the initial set were removed because of invalid text encoding or because they were written in a language other than English) and the associated authors and book titles. Scraping GoodReads for Genres My second task was to find the genres associated with each book in my dataset. To achieve this, I decided to use GoodReads [31], which has over 90 million registered members [32]. GoodReads lets its users rate and review books as well as tag books. Tags can be anything from "to-read" to "fiction". Most tags on books are genres, but not all. The reason I chose to use GoodReads as my source for obtaining genres associated with the books in my Gutenberg dataset was that, with over 90 million users, GoodReads offers an extensive set of book titles and their crowd-sourced tags. To assign genre tags manually to the books in the Gutenberg dataset, I would have to read each book, a task I certainly could not undertake. Additionally, such assignments would be subject to my own opinions and take on each book. Considering that associating genre tags with a given book is not only a subjective process but a laborious and time-consuming one, I decided that using GoodReads genre tags would be the most effective option for obtaining accurate genre tags. This method allowed for tens, or in some cases thousands, of votes to back up genre tags associated with each book. Popular books, like Don Quixote [33], by Miguel de Cervantes, have thousands of tags, some of which have thousands of votes to back them up, e.g., at the time this paper was written, Don Quixote had 10,081 people tagging the book with the "classics" tag. Other tags for this book include "to-read", "currently-reading", "fiction", "favorites", "classic", and "owned". The tags shown in this example represent some genres that may be accurately attributed to Don Quixote, but these are mixed among irrelevant, non-genre tags such as "to-read", "currently-reading", "favorites", and "owned". Additionally, GoodReads does not enforce standardized spelling or capitalization for tags. This results in redundant tags based on differences in spelling, capitalization, or plurality. The following tags may all exist for a book even though they all refer to the same idea: "classics," "Classics," "classic," and "Classic". Some of these potential problems and difficulties were seen in advance and some were realized during the process. I began by creating a web scraper program written in Python to iterate through every book title in my dataset and search it on GoodReads. GoodReads structures its pages and search results in such a way that I could programmatically create a URL that would produce the search results for a given title. For example, if I wanted to search for the book Pride and Prejudice [34], by Jane Austen, I would append the text "https://www.goodreads.com/search?q=" with the book title, replacing spaces with plus characters, as follows: "https://www.goodreads.com/search?q=pride+and+prejudice" Because there often was more than one result for each book, I had to find the best matching title to the actual title of the respective book searched.
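The URL construction just described can be sketched in a few lines (the helper name goodreads_search_url is my own, not from the thesis):

    def goodreads_search_url(title: str) -> str:
        """Build a GoodReads search URL by replacing spaces in the title with '+'."""
        base = "https://www.goodreads.com/search?q="
        return base + "+".join(title.lower().split())

    print(goodreads_search_url("Pride and Prejudice"))
    # https://www.goodreads.com/search?q=pride+and+prejudice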
To find the best match, I used FuzzyWuzzy [35], a Python package that uses Levenshtein distance to calculate the differences between character sequences. The Levenshtein distance is calculated by counting the minimum number of character changes necessary to convert one string to the other; for this calculation, a character change entails a removal, addition, or replacement of a character. Since, at most, the number of character changes would be the number of characters in the longer string, this distance can be converted into a ratio of the Levenshtein distance divided by the length of the longer string. The FuzzyWuzzy Python package implements the Levenshtein distance as described above to produce a similarity ratio score for the titles of books in the search results for a given query and returns the ratio as an integer from 0 to 100. The book with the highest matching score above a certain threshold was used as the source from which I would obtain tags, or genres. An example is shown in Figure 3, where a book title hypothetically scraped from a book in the Gutenberg dataset is compared against several results obtained by searching GoodReads with the title of the Gutenberg book. Figure 3: An example of the FuzzyWuzzy Python library computing string similarity ratios using the Levenshtein distance formula for variations of book titles. FuzzyWuzzy generates a string-matching score for each title, and the entry with the highest matching score above the tolerance threshold is selected as the target from which genre tags would be extracted. The tolerance for the match percentage score I used was 75. After experimenting with different tolerance values, I found that the value of 75 filtered out searched books for which there were no suitable matches but still allowed for the subtle differences that would sometimes exist between the titles of books scraped from my Gutenberg set and those found on GoodReads. Some books in my dataset at this stage either were not listed on GoodReads or had such different publication details, such as title, spelling, etc., that they were disregarded from the dataset. After finding a suitable match from the GoodReads results for a given query, I extracted the URL for the respective matched book and, after requesting the URL, was able to navigate to the page listing all tags used for the book. All tags were parsed using Beautiful Soup [36]. Beautiful Soup is a Python library that offers parsing abilities for a variety of markup languages such as HTML and XML. This is exceedingly helpful when navigating through documents written in such languages. For example, when navigating through the GoodReads HTML page of a given book to find the href attribute of the element linking to the page listing the genre tags of the given book, Beautiful Soup can be used to specify the element, id, class, and other aspects of markup components to narrow down the search results. This ability allowed for quick and easy navigation to the relevant elements of an HTML page for this project. The results from this milestone were a list of tags from GoodReads for each book in my dataset. Also, my dataset was reduced during this milestone due to books not producing suitable matches when searched in GoodReads. The number of books in my dataset was around 10,000 at this point.
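As a concrete illustration of the title-matching step described above (not the thesis scraper itself), here is a minimal sketch; the "a.bookTitle" selector is an assumption about the GoodReads search-results markup, which may differ and has likely changed since this work was done:

    from bs4 import BeautifulSoup
    from fuzzywuzzy import fuzz

    TOLERANCE = 75  # minimum title-similarity score accepted, as described above

    def best_match_link(searched_title, search_html):
        """Pick the search result whose title is closest to the searched title.

        `search_html` is an already-downloaded GoodReads search-results page.
        Returns the link of the best match, or None if nothing scores above TOLERANCE.
        """
        soup = BeautifulSoup(search_html, "html.parser")
        best_score, best_link = 0, None
        for link in soup.select("a.bookTitle"):  # assumed selector for result links
            score = fuzz.ratio(searched_title.lower(), link.get_text(strip=True).lower())
            if score > best_score:
                best_score, best_link = score, link.get("href")
        return best_link if best_score >= TOLERANCE else None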
Cleaning the Data The next step in my work was to clean and prepare my data to be feature engineered in preparation for classification algorithms to be applied to the dataset. I began by standardizing the tags obtained from GoodReads for each book. GoodReads, despite allowing for custom and non-genre tags, does have a listing of standard genres existing among the tags used for books. By using the FuzzyWuzzy package mentioned earlier, I was able to match and replace tags that were either misspelled or alternatives to a standard genre with the best-matched standard genre. Tags that didn't come close to matching any of the standard genres were disregarded from the tag list associated with their book. After this step was complete, I removed any books from my dataset that no longer had any tags associated with them. This resulted in a dataset of approximately 9,000 books and their associated, standardized genre tags from GoodReads. One additional step performed here was reducing the number of unique genre tags. Initially, there were over 80 unique tags, some of which were obscure or rarely used. Foreseeing the classification step ahead, I decided to reduce the number of unique tags down to ten. Figure 4: A chart showing the number of occurrences for each genre found in the dataset. Genre labels include those which will later be filtered out because of lack of relevance. I chose ten because it was a good balance between having enough genres to capture the different expressions and styles of the diverse books in my database and not having so many genre tags that classifying the genres would become less feasible. Label density, or the number of labels used in a multi-label classification problem, can lead to poor results when increased in excess [37]. The top ten genres within this set comprised 29,278, or 45.7%, of the 64,015 tags used in the set. I could not calculate the performance of machine-learning models for every combination and number of labels in the multi-label classification problem due to the intense level of resources such a process would demand. If this could have been done, the results would have allowed me to ascertain the optimal selection of genres and number of genres to use. However, to address the concern of exorbitant labels in my classification problem, which can lead to large obstacles in accurate classification [38], I considered 10 labels to be reasonable according to the data available. Figure 5: A chart showing the number of occurrences for each genre found in the dataset. Genre labels include only those that remained following the filtration process, which removed irrelevant and unpopular entries. Figure 4 shows the distribution of genre tags before this reduction took place. Figure 5 shows the distribution of standardized genre tags for the dataset after the reduction took place. You can see that fiction is attributed to more than half of the books in the dataset. Other dominating tags include classics and historical. The results of this milestone were a dataset with book titles, full book text, and genres from the list of top ten selected genres for each book, as shown in Figure 6, which shows the dataset as it existed at this point: the dataset contained a column of book titles, a second column of full book texts, and a third column containing a list of genres, but only genres found in the top ten selected genres described above. Figure 6: Five entries from the dataset as it existed at this point. A 'title' column contains book titles. A 'text' column contains full book texts. A 'genres' column contains a list of one or more associated genre labels for the given book.
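A minimal sketch of the tag-standardization step described above; the STANDARD_GENRES list is abbreviated and illustrative rather than the actual list, and the threshold mirrors the value of 75 used earlier:

    from fuzzywuzzy import process

    # Abbreviated, illustrative list of standard genres (the real list was larger).
    STANDARD_GENRES = ["fiction", "classics", "historical", "romance", "fantasy",
                       "poetry", "philosophy", "adventure", "mystery", "short-stories"]
    TOLERANCE = 75

    def standardize_tags(raw_tags):
        """Map raw GoodReads tags onto standard genres; drop tags with no close match."""
        standardized = set()
        for tag in raw_tags:
            match, score = process.extractOne(tag.lower(), STANDARD_GENRES)
            if score >= TOLERANCE:
                standardized.add(match)
        return sorted(standardized)

    print(standardize_tags(["Classics", "classic", "to-read", "favorites", "Historical"]))
    # expected: ['classics', 'historical'], assuming the unrelated tags score below 75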
Preprocessing and Feature Engineering the Text To prepare my data for use with classification algorithms, I had to feature engineer the text of each book in my dataset. As mentioned above, there are many methods used by developers to feature engineer natural text. In this work, I use three feature-engineering methods: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF). 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. spaCy vectorization spaCy is a natural language processing tool which offers a unique and effective method for vectorizing text in such a way as to retain the real value of the text [39]. spaCy is among the most capable natural language processing tools and has often been paired with other prominent packages and tools such as Gensim and Keras [40]. While other natural language processing tools offer different vectorization tools, I considered spaCy's method to be relevant for my task of predicting the genre of a text. spaCy's vectorization [41] relies on a pre-trained model. spaCy pipelines train a model on a given vocabulary obtained from a variety of sources. The model I used to produce my vectors via spaCy was en_core_web_lg. The nomenclature for en_core_web_lg describes several aspects of the model. The corresponding aspects are: English language; core type, consisting of vocabularies, syntax, entities, and vectors; genre of web, meaning obtained online from blogs, news, and comments; and size large. The size of this model ends up being 12MB, and it produces a vector with 300 dimensions, with 685k keys and 685k unique vectors. For example, the word "vector", when vectorized via spaCy's pretrained model, en_core_web_lg, will produce the following: array([ 1.8837e-02, -2.1752e-01, -1.8054e-01, -6.6371e-01, -6.2088e-02, 8.3114e-01, 1.8319e-01, 4.0326e-01, -1.7784e-01, -2.7251e-01, 5.6706e-01, -6.5128e-01, 4.0985e-01, 7.0762e-02, 1.7274e-02, 2.1063e-01, -6.9014e-01, 3.0686e+00, 3.1167e-02, 2.8181e-01, -2.2141e-01, 2.5380e-01, -7.9937e-01, -2.0017e-01, 1.4919e-01, … (50 hidden rows) -2.7692e-01, -2.1884e-01, -2.7855e-01, -3.8132e-01, 3.2371e-01, -6.5140e-02, 3.4043e-01, -5.1375e-01, -1.6814e-01, 2.6561e-01, -1.9013e-01, -1.5040e-01, -1.9021e-01, -5.5351e-02, -6.0245e-02, 2.3297e-01, -2.4317e-01, 3.5750e-01, -3.0022e-01, -3.5387e-01, -4.0000e-01, -1.2519e-01, 1.6943e-01, -3.5287e-01, 9.1983e-01], dtype=float32) The result is a 300-dimensional vector where each dimension represents an aspect of sentiment and meaning relevant to spaCy's configuration. Less common or nonsensical words such as "afskfsd", which are not real words and don't appear in dictionaries, produce a vector of 300 zeros, which reflects a practically meaningless word according to this paradigm [41].
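A minimal sketch of obtaining a document vector with spaCy, assuming the en_core_web_lg model has been downloaded beforehand; this is illustrative, not the thesis pipeline:

    import spacy

    # Requires a prior `python -m spacy download en_core_web_lg`.
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("The man walked up the mountain.")
    print(doc.vector.shape)   # (300,) -- one 300-dimensional vector per document
    print(doc.vector[:5])     # first few components of the averaged word vectors

    # Out-of-vocabulary tokens such as "afskfsd" carry no meaningful vector.
    token = nlp("afskfsd")[0]
    print(token.has_vector, token.is_oov)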
Term Frequency-Inverse Document Frequency (TF-IDF) TF-IDF is a statistic representing the frequency of a word or term relative to how important such word or term is in its contextual document or corpus. The idea originated from research done on term specificity, which involved weighting terms relative to their context [42]. This method has proven to be effective for natural language processing [43], particularly when the results are used by a Support Vector Machine [44]. The modern formula takes the frequency of a term's appearance in a document and multiplies it by the inverse document frequency, which is a logarithmically scaled ratio of the total number of documents in a corpus to the number of documents containing the respective word or term. Here is an example of TF-IDF. Consider a document of text containing 100 words. Suppose the word "run" appears 13 times in the document. The term frequency (TF) of the word "run" would be 13/100, which gives 0.13. The inverse document frequency (IDF) for the word "run" is calculated by taking the log of the inverse of the frequency at which it appears across the documents of a corpus, e.g., if there are 1,000 documents total in the corpus and 467 of those documents contain the word "run" at least once, the inverse document frequency of the word "run" becomes 1,000/467, which gives 2.14. Take the log of this to get 0.33. Now we calculate the term frequency-inverse document frequency by multiplying TF by IDF, 0.13*0.33, giving us 0.04. This number represents the frequency of a word in a document scaled by the inverse frequency for that same word in the corpus. Novel Feature-Engineering Techniques The novel, or non-traditional and uncommon, feature-engineering techniques applied in this work entail the following: 1. Reading complexity scores 2. Part-of-speech frequencies Reading complexity can be calculated via various methods, each employing a different formula to arrive at a score. I chose three scoring methods which have proven to be effective and well-used in professional and academic fields. These scoring methods are the Flesch reading ease, Flesch-Kincaid grade, and SMOG index. The Flesch reading ease score, which is one of the scoring methods I chose to use in this work, scores based on the average length of sentences and the average number of syllables per word. The SMOG index, presented by G. Harry McLaughlin in 1969 [45], takes 10 consecutive sentences near the beginning of the text, 10 near the middle, and 10 near the end, and counts every word with three or more syllables for each of the three sentence groups. The square root of the sum of these three counts is taken and rounded; the resulting number is added to 3 and results in a readability score that suggests the number of years of education a person needs to read and understand the content. For this work, I used Textstat [46], an open-source Python library that implements, among many others, the above-mentioned scoring methods for reading complexity. The resulting reading complexity scores for the books in the dataset were scaled and normalized, since the numeric values of each test have different interpretations depending on the test; e.g., the SMOG index suggests a value representing the number of years of education a person should have to understand the text, while the Flesch reading ease score works on a scale of 0 to 100, where 100 indicates the easiest reading level and 0 the most difficult.
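A minimal sketch of computing the three scores with Textstat; "book.txt" is a hypothetical path standing in for one book's full text:

    import textstat

    sample = open("book.txt").read()   # hypothetical path to one book's full text

    # The three readability formulas used in this work, as implemented by Textstat.
    scores = {
        "flesch_reading_ease": textstat.flesch_reading_ease(sample),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(sample),
        "smog_index": textstat.smog_index(sample),
    }
    print(scores)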
Using part-of-speech members or part-of-speech n-grams is not a new practice in natural-language-processing feature engineering, as shown in the work of Han [47] and Onan [48], but within the context of feature engineering the full-length text of books in an attempt to compare machine-learning model performance, this is a novel application. spaCy provides the ability to identify the part of speech for each token within a document. Parts of speech in the English language consist of syntactic categories words can be grouped into, e.g., one of the part-of-speech categories is a noun. A noun is a person, place, or thing. A preposition, which is another part-of-speech category, consists of words that are placed near a noun to establish positional relationships, e.g., "of", "by", and "with". The part-of-speech identifiers provided by spaCy are based on the Universal POS tag set [49], which is a list of 17 part-of-speech categories, namely adjectives, adpositions, adverbs, auxiliary verbs, coordinating conjunctions, determiners, interjections, nouns, numerals, particles, pronouns, proper nouns, punctuation marks, subordinating conjunctions, symbols, verbs, and one additional category called X, which subsumes all tokens that don't fall into any other category. The part-of-speech frequencies are normalized by counting the uses of each part-of-speech member and then dividing those counts by the total number of tokens in the whole document. Feature-Engineering Pipeline A pipeline was created to simplify and make efficient the process through which the full text of each book in the dataset would be transformed into machine-learning-friendly representations via each of the three feature-engineering techniques named above. The pipeline began by taking as input the full text of a given book and calculating the three reading complexity scores through the methods described above. Following this, a spaCy document was created based on the en_core_web_lg pretrained model, from which the spaCy document vector was obtained. To encourage the best results from the TF-IDF technique, text input was cleaned before being fed into a TF-IDF model. The cleaning process included lemmatization (removing word inflections and participles), normalizing capitalization within the text, removing non-alpha characters, and removing stop words, i.e., words that appear frequently but provide little or no meaning outside of context, such as "the", "but", or "a". For example, the sentence "Running is a great form of exercise. Exercising is a good way to stay active and fit." after being lemmatized would be "Run be a great form of exercise. Exercise be a good way to stay active and fit." After normalizing capitalization and removing non-alpha characters and stop words, the sentence becomes: "run great form exercise exercise good way stay active fit". While the result is much less sensible and readable to the human eye, it is better fitted for some feature-engineering methods which become less functional with noisy distractions such as capitalization, stop words, and inflections. Once text was cleaned as described above with the aid of spaCy's token attributes identifying relevant attributes of document members, the cleaned text was transformed into a TF-IDF vector representing the relative frequency of each word in the given document weighted by the importance of that word across the corpus. It should be noted that the TF-IDF representations for each book were based on a common vocabulary so that each dimension within the TF-IDF vector would have consistent significance and meaning across the whole dataset. Following this step in the pipeline, part-of-speech frequencies were calculated based on the spaCy document which had already been created. A list of the part-of-speech identifiers for each token within a document was created and then, for each unique identifier of the part-of-speech set, value counts were counted.
From this list, relative document frequencies for each part-of-speech member were calculated and returned for the given text along with the reading complexity scores and the spaCy and TF-IDF vectors. The pipeline allowed for all feature-engineering techniques to be applied to the text of the respective book given as input. The results were saved as variables and returned as a list, i.e., the preprocessing function of the pipeline took as input the book text and returned as output a list containing the spaCy vector, the TF-IDF vector, three columns for the three reading complexity scores, and 17 additional values representing the normalized part-of-speech frequencies. Challenges Faced with Feature Engineering After creating the pipeline as described above and performing all other necessary prerequisites, including testing that the pipeline worked as intended (which was done by running it on small subsets of the dataset), I came across the challenge of running out of computing resources. I will first describe the environment I was working in and the limitations of computing power and resources available there, and then describe how such limitations were met, the alternatives that were considered, and the solution I eventually arrived at. Besides the web crawler used to scrape GoodReads for the book genres, all the code used for this project was run in the Kaggle environment [50]. Kaggle is an online resource used by data scientists and offers remote computational environments consisting of 4 CPU cores, 16 Gigabytes of RAM, an optional GPU with 2 CPU cores and 13 Gigabytes of RAM, and an optional TPU with 4 CPU cores and 16 Gigabytes of RAM [51]. Kaggle notebooks, which take on a Jupyter [52]-like interface, allow for markdown-notated code notebooks written in Ruby, Python, R, and other languages. For this project, Python was used exclusively owing to the extensive set of libraries and packages available for it, e.g., Sci-kit Learn [53], Pandas [54], Numpy [55], and many others. A limitation to using the Kaggle kernel is that it limits notebook execution time to 9 hours and, as mentioned above, has a limit of 16 Gigabytes of RAM. Both these bounds became obstacles while running the whole dataset through the pipeline described above. I discovered that spaCy documents of full-length books ended up being several Gigabytes in size per book. Having thousands of books in my dataset, this became an immediate obstacle. I found that many of the features offered by default in spaCy's pretrained pipeline were computationally weighty and unnecessary for my purposes. Some of these features provided by default are named entity recognition, sentence dependency visualization via spaCy's displaCy visualizer, and complex morphology parsing. Named entity recognition is a feature whereby spaCy's pipeline can identify and tag words that refer to known entities with an entity token type. For example, the sentence "Apple and Google vie for aesthetic dominance for user interfaces" refers to two corporations, Apple and Google. spaCy's pretrained model is smart enough to identify these entities and will tag them as being entities. If this feature is enabled in the pipeline, a list of entities within a text can be obtained once the spaCy document has been created.
Because of the additional computation resources demanded by the spaCy pipeline to provide this feature, and because I didn't find entity recognition to be a worthwhile feature to exploit for feature-engineering purposes, I chose to turn off this feature in the pipeline. Sentence dependency visualization presented by displaCy, spaCy's visualizer, shows part-of-speech dependency relationships as a graphic, as shown in Figure 7. Figure 7: An example of spaCy's displaCy feature, which visualizes sentence token directional dependencies [63]. While this is a helpful tool in other contexts, for feature engineering it offers little or no relevance or utility. Other default features in the spaCy pipeline named above include morphological features, which include token and sentence tags identifying the mood, tense, and point of view of the writer for a given sentence of text. While this feature could potentially be utilized for feature-engineering purposes in a future work, I chose to avoid utilizing it for the sake of maintaining a simple pipeline and saving computation resources. I merely needed part-of-speech identification, tokenization, and a limited amount of morphology features. After adjusting the spaCy pretrained model to be more lightweight and suited for my pipeline, the memory demanded per spaCy document was reduced significantly but still exceeded the allotted 16 Gigabytes of RAM in the Kaggle environment. At that time, I was able to estimate the required memory and running time for my pipeline to process the whole dataset. My estimation, based on running small subsets of the whole dataset, was that I would need approximately 40 Gigabytes of RAM and roughly 30 hours of computing time. To meet these new demands, I considered using the Google Cloud Platform [56], which offers cloud computing services. This service, although costly, can provide virtual computer instances, or combinations of instances, to an impressive extent of preference and capability. Although this option would have worked, after examining other alternatives I decided to go with another solution. My first attempted solution was to serialize the spaCy document object for each text using Pickle [57], a Python package that allows for object serialization, and save such objects as external files. Not only was this difficult for organizational purposes, but it also exceeded the hard drive space within the Kaggle environment, which is 20 Gigabytes. A second attempted solution was to avoid saving the spaCy document objects for each text, since that was the most memory-intensive aspect of the process. As mentioned above, each spaCy document could potentially become as much as three or more Gigabytes in size, and multiplying that by the number of books in my dataset results in tens of thousands of Gigabytes that would have to be held in RAM. By combining several of the preprocessing steps, data cleaning, and feature engineering into one step that could be performed for each iteration in the dataset, I could simply create the spaCy document, use it for spaCy vectorization and for the document features necessary to prepare the natural text for the TF-IDF model (lemmatization, removing stop words, etc.), and then delete the document object to avoid accumulating massive memory demands. It is not typical to combine data cleaning, preprocessing, and feature engineering into a single step, but in this case, it provided a worthwhile benefit of saved memory.
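A condensed, illustrative sketch of this memory-saving approach; function and variable names are my own, and the disabled component names follow spaCy's conventions rather than the exact thesis code:

    from collections import Counter
    import spacy

    # Load the pretrained pipeline with the heavyweight components disabled;
    # POS tags and lemmas are still available for cleaning and frequency counts.
    nlp = spacy.load("en_core_web_lg", disable=["ner", "parser"])

    def engineer_book(text):
        """Create the spaCy document, extract everything needed, then discard it."""
        doc = nlp(text)

        spacy_vector = doc.vector              # 300-dimensional document vector

        # Lemmatized, lower-cased, alpha-only, stop-word-free text for the TF-IDF model.
        cleaned = " ".join(tok.lemma_.lower() for tok in doc
                           if tok.is_alpha and not tok.is_stop)

        # Relative part-of-speech frequencies, normalized by document length.
        pos_counts = Counter(tok.pos_ for tok in doc)
        total = sum(pos_counts.values())
        pos_freqs = {pos: count / total for pos, count in pos_counts.items()}

        del doc                                # free the multi-gigabyte Doc object
        return spacy_vector, cleaned, pos_freqs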
After testing this method on small subsets of data, I gained confidence to apply it to the whole dataset but still exceeded both the Kaggle memory and computing time limitations, although this time it was not by much. I merely needed to reduce the RAM demands by a few Gigabytes and the time by 10 or 15 hours. My successful solution which overcame these constraints was to split my dataset up into seven chunks, each containing approximately 1,500 books. This size approached, but did not exceed, the Kaggle memory and computing time limitations. After creating new notebooks associated with each of the seven chunks of the dataset, I was able to run all of them simultaneously and save the preprocessed, feature-engineered outputs of the books within each chunk as a serialized object saved to an external file. In a new notebook, I unpacked and combined the seven serialized objects into one object, a Pandas DataFrame, containing all 10,645 books from the original dataset, but now, instead of having the full text of each book, I had its representation as a spaCy vector, a TF-IDF vector, and the novel features described above. The DataFrame object which contained these thousands of books in their several feature-engineered forms took less than one Gigabyte of space in memory. Multilabel Classification With a feature-engineered dataset, I was now ready to build classification models to predict the various genres associated with each book in my dataset. There are different types of classification problems that vary in nature and complexity. For example, a simple type is binary classification. This is where a machine-learning model merely predicts a binary output for each example, such as "yes" or "no". A relevant example that differs from my classification objective would be to predict whether a book was fictional or not. Another, more complex, type of classification is multiclass classification. This is where the output is one of three or more of a finite set of classes or categories that are mutually exclusive. An example of this would be to predict whether an example animal is a cat, dog, fish, or snake. Such an output would be one and only one of the available categories. A third type of classification is multilabel classification, where the output for an example is zero or more of a finite set of labels or categories that are not necessarily mutually exclusive. For example, the objective of this work at this point could be used to demonstrate multilabel classification: predict which genre or genres apply to a given book. The output in this case would be one or more genres from a finite set of genres. Although technically zero labels could be an expected output in some multilabel problems, that would not occur in this dataset because of the filtration and design processes involved in its creation, which would have already disregarded any books which had zero associated genres. To mitigate the exponentially large complexity and difficulty which arises in multilabel classification problems as the number of distinct labels grows, I opted to use 10 genres as my labels, as described in the Cleaning the Data section. The number of labels in a multilabel classification problem can very much alter the efficacy of machine-learning models built upon the data, as shown in the work of Li [58].
This number of genres retains a good portion of the books in the dataset, disregarding whichever books did not have one or more of the ten genres, and presents a feasible classification problem for the machine-learning models. Building and Scoring Multilabel Classifiers Scoring multilabel classifiers can be quite a challenge. In fact, building multilabel classifiers can also be a challenge, and some of the solutions to one of these problems become part of the solution to the other. There are several common approaches to building multilabel classifiers, such as Classifier Chain, Label Powerset, and Binary Relevance. Classifier Chain builds a series of classifiers that are dependent upon the input data and output results of their prerequisite or parent classifiers. For example, a multilabel dataset with n labels is encoded so that the target column is split into n columns that respectively represent binary association with the n labels from the target column. This encoding process is referred to as One-Hot Encoding. An example of a multilabel dataset with three distinct labels is shown in Figure 8 being encoded and then being used to produce three classifiers exhibiting the classifier chain method described above. Figure 8: An example of how the Classifier Chain method works for classifying multilabel problems. Multiple classification models are built upon one another in a chain-like fashion. Example data for each model is shown in yellow. Label Powerset transforms the multilabel problem into a multiclass problem by representing each distinct combination of labels as a class. In Figure 9 there are four distinct combinations of labels in the multilabel dataset. The various encoded target labels are transformed into a single target column containing classes which refer to the corresponding combination of labels. Figure 9: An example of the Label Powerset method creating a classifier from multilabel data. Each unique combination of labels is represented by a class label. This method transforms multilabel problems into multiclass problems. Binary Relevance works by creating a binary classification problem for each label in the multilabel dataset. This is accomplished by creating an encoding for the labels like what is done with the Classifier Chain method. Once the target labels are encoded in this fashion, a classifier model is built for each column among the target columns, as shown in Figure 10. Here, n models are built for the n distinct labels in the dataset. Each model uses the same example data, shown in yellow in Figure 10, but the target data varies for each model (target columns shown in blue in Figure 10). To extract the multilabel prediction for a given example, Binary Relevance returns a binary value for each label indicating whether such label is predicted for the respective example. For example, the first example in the dataset, "x1", shown in Figure 10, is shown to be correlated with the labels "y1" and "y3". If this target were returned from a Binary Relevance solution, three classification models would exist, one for each distinct target label. The Binary Relevance solution would return binary values 1, 0, and 1, which respectively correlate with target labels "y1", "y2", and "y3". Figure 10: An example of multilabel data encoded according to One-Hot Encoding and then being split into several classification models, one for each distinct label in the dataset. Models return binary values indicating association with a given label. Examples are shown in yellow and targets in blue.
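The Binary Relevance strategy can be sketched with Sci-Kit Learn's OneVsRestClassifier, which likewise builds one classifier per label column; the feature values and genre sets below are invented and serve only as an illustration:

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import GaussianNB

    # Toy examples: feature vectors (invented) and their genre label sets.
    X = np.array([[0.2, 0.7], [0.9, 0.1], [0.4, 0.5], [0.8, 0.3]])
    genres = [["fiction", "classics"], ["historical"], ["fiction"], ["classics", "historical"]]

    # One-Hot Encode the label sets: one binary column per distinct genre.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(genres)          # shape (4, 3), columns ordered alphabetically

    # Binary relevance: one classifier per label column, all sharing the same examples.
    clf = OneVsRestClassifier(GaussianNB()).fit(X, Y)

    pred = clf.predict(np.array([[0.3, 0.6]]))
    print(mlb.classes_)                    # ['classics' 'fiction' 'historical']
    print(mlb.inverse_transform(pred))     # predicted genre set for the new example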
The scikit-multilearn library, which builds on Sci-Kit Learn, provides a BinaryRelevance class that implements this solution efficiently. Combining it with Sci-Kit Learn's MultiLabelBinarizer class, which implements One-Hot Encoding for multilabel and multiclass data, I was able to build multilabel classifier models.

Scoring multilabel classifiers, just like building them, is not a trivial task. One must decide whether to score output labels with simple true-or-false correctness (does the predicted set of labels exactly match the actual set or not?) or to give each correct label a partial score. What about inaccurately predicted labels? Should these be ignored in scoring, or should they count against the overall score for a given prediction? All these questions and more must be considered when choosing a scoring metric for multilabel classification.

I chose to use a combination of three scoring metrics: a per-label accuracy score, as implemented by Sci-kit Learn's accuracy_score; the Jaccard score; and a new metric built on a string-matching algorithm based on Levenshtein distance. The accuracy_score function of the Sci-kit Learn metrics library, applied to the encoded label vectors, gives a partial score for each predicted label. The partial score per label is equal to one divided by the total number of labels. For the ten genre labels in this dataset, a prediction gives some number of genres as output, which is restructured in the encoded fashion described above so that a one or zero is ascribed to each label; [fiction, romance, fantasy] might be represented as [1, 0, 0, 0, 1, 0, 0, 1, 0, 0], where each binary number in the list indicates whether the respective genre is predicted. This vector is then compared against the actual genre labels for that example. For each of the ten possible genres, the prediction is rewarded one-tenth for each correctly assigned label, and this applies to both affirmative and negative predictions of a given genre. Figure 11 shows an example in which eight of the ten genres were predicted correctly, giving one-tenth for each correct label and a total score of 0.8 out of 1.

Figure 11: An example of the accuracy score from Sci-Kit Learn's scoring metrics library being used on binary values indicating which of the 10 distinct labels in the dataset are associated with a given example. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The accuracy_score method returns a value from 0 to 1, 1 being 100% accuracy and 0 being 0% accuracy.

Figure 12: An example of the Jaccard score from Sci-Kit Learn's scoring metrics library being used on binary values indicating which of the 10 distinct labels in the dataset are associated with a given example. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The jaccard_score method returns a value from 0 to 1, 1 being 100% set similarity and 0 being 0%.

The second scoring metric used was the Jaccard score, or Jaccard similarity coefficient. This metric takes the size of the intersection of the predicted and actual label sets and divides it by the size of the union of the same two sets. It punishes mistaken predictions more harshly than the accuracy_score metric described above while still encouraging positive predictions.
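As a small, self-contained illustration of these two Sci-Kit Learn metrics on a single encoded example (the ten binary values are made up for this sketch, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, jaccard_score

# Ten binary genre indicators for one book: made-up values for illustration.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]

# Comparing each of the ten indicators separately gives one-tenth credit
# per correctly predicted label, whether that label is positive or negative.
print(accuracy_score(y_true, y_pred))   # 0.7 -> 7 of 10 labels match

# Jaccard: shared positives divided by the union of positives;
# correct 0s neither help nor hurt this score.
print(jaccard_score(y_true, y_pred))    # 2 shared positives / 5 in the union = 0.4
```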
An example of this metric in use is shown in Figure 12. The Jaccard scoring metric does not reward the accurate prediction of a negative value for a genre. For example, if there were an eleventh genre in the example of Figure 12 with a value of "0" in both the predicted and actual sets, the score would not change; if that value were "1" for both sets, however, the score would slightly improve.

The third and final scoring metric was a new, or non-traditional, metric I devised which works very similarly to the accuracy_score metric implemented by Sci-kit Learn. This method uses the Levenshtein distance to calculate the similarity between strings. By converting the predicted and actual sets of genres used in the previous examples to strings, the method can be applied to both. The package used to implement the Levenshtein distance in this way is FuzzyWuzzy, as shown in Figure 13. This scoring metric rewards correct negative genre predictions as well as positive ones and is the most generous of the metrics described. The reason for using it in addition to accuracy_score and jaccard_score from the Sci-Kit Learn library was to have a metric that explicitly rewards correct negative values for a label as well as positive ones. In this way I had three scoring metrics that rewarded or punished positive and negative label values differently but consistently, providing a well-rounded overall score.

Figure 13: An example of the FuzzyWuzzy library being used to compute string similarity for two strings. The strings here represent 10 binary values correlating with distinct labels in the dataset. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The ratio method returns a score from 0 to 100 representing how similar the two strings are.

Most Important Features

I will describe briefly how the most important features of the novel techniques were found. My brevity is because the models built for this objective proved mostly ineffective, due to challenges that will be described later. However, this step is still of some importance, if only for the sake of demonstrating depth and rigor of effort.

The most important features for the multilabel classification problem, meaning the features among those provided by the novel techniques that had the most significant correlation with the corresponding genre labels, were AUX, INTJ, SCONJ, SYM, and X, which respectively stand for auxiliary verb, interjection, subordinating conjunction, symbol, and other. Symbols include word-like entities that differ from ordinary words in form or function, such as $, %, and @. X (other) includes words that cannot be assigned to any other part-of-speech group; these might be nonsensical words such as xfgh, pdwl, or jsdfjkl. These features were obtained via recursive feature elimination and are shown in Figure 14. Because the novel feature-engineering techniques did not yield accurate predictors from any of the machine-learning models, the relevance of these important features is negligible. The same applies to the other technique employed to discover feature importance, namely correlation heatmaps, as shown in Figure 15.

Figure 14: A bar graph showing values indicating the ranking of each feature in terms of its estimated importance for predicting a particular column. The values here are a sum of the ranking positions of each feature across every label in the dataset. In other words, a ranking was assigned to each feature for the first label in the dataset, where rank 1 meant the most important feature; this was repeated for each label, so the overall best-ranked features ended up with the smallest total rank value. Five columns, highlighted in green, have the smallest total rank values, meaning they were consistently ranked as most important across the labels in the dataset.
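A hedged sketch of how per-label rankings like those summed in Figure 14 could be produced with Sci-Kit Learn's recursive feature elimination follows. The synthetic data and the logistic-regression estimator are assumptions made only for illustration (RFE needs an estimator that exposes coefficients or importances; the thesis does not specify which estimator was used).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the novel-feature matrix and one-hot genre labels.
X, Y = make_multilabel_classification(n_samples=200, n_features=15,
                                      n_classes=10, random_state=0)

# Rank every feature for every label, then sum the ranks; smaller totals
# mean the feature was consistently ranked as more important.
total_rank = np.zeros(X.shape[1])
for j in range(Y.shape[1]):
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
    rfe.fit(X, Y[:, j])
    total_rank += rfe.ranking_   # rank 1 = most important for label j

print(np.argsort(total_rank)[:5])  # indices of the five best-ranked features
```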
Challenges Faced with Multilabel Classification

After building machine-learning models, which included KNN, SVC, and Gaussian NB, I tuned the hyperparameters for each with GridSearchCV, a tool in the Sci-kit Learn library that allows lists or ranges of hyperparameter values to be tested and scored so that optimal hyperparameters can be selected. I then ran each model with its hyperparameters tuned for optimal accuracy according to each of the scoring metrics described in the Building and Scoring Multilabel Classifiers section. Initially, the results were promising, as shown in Figure 16, but I soon realized that the models that performed well were simply predicting positive values for the two or three most common genres in the dataset and negative values for everything else.

Figure 15: Correlation heatmaps for four of the ten genre labels from the dataset. These heatmaps show colors corresponding to the level of correlation each feature in the novel-feature-engineered dataset had with a given target label. Only four target labels are shown here: Fiction, Classics, Historical, and 20th-Century. The scale used for this graphic is -0.04 to 0.04. Almost no correlation values exceeded these minimum and maximum bounds, showing that no statistically significant correlation was found in this examination.

Because the dataset had not been normalized to prevent this, such predictions resulted in relatively good overall scores. But after redistributing the dataset so that each genre had roughly equal representation, the models showed no statistically significant performance in predicting genres; in fact, all the models produced predictions no better than random guesses. Attempts were made to overcome this poor performance, including further hyperparameter tuning, using fewer genre labels, reducing noise in the dataset by removing unimportant features, and even reducing the classification problem to a simple binary question of whether a book was fictional. None of these attempts significantly improved the predictive models.

Figure 16: Bar graphs comparing classification scores for the three feature-engineered datasets on three different machine-learning classification models. The first row shows results for the K-Nearest Neighbors classification model, the second for Support Vector Machine, and the third for Gaussian Naïve Bayes. The TF-IDF feature-engineered model scores are shown in blue, spaCy in orange, and novel in green. Three clusters of bars within each row represent the different scoring techniques employed: the left-hand cluster shows scores calculated via the FuzzyWuzzy method, the middle via Sci-Kit Learn's accuracy score, and the right-hand cluster via Sci-Kit Learn's Jaccard score implementation.
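As a rough sketch of the hyperparameter search described above (the parameter grid and toy data are illustrative, not the grid or data actually used in the thesis):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Toy data standing in for one of the per-label binary problems.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate hyperparameter values to test; chosen only for illustration.
param_grid = {
    "n_neighbors": [3, 5, 7, 11],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```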
Eventually, decomposition analysis and its visualizations led to the realization that the problem most likely resided in the dataset itself, not in the models or in any augmentation of the dataset. Even decomposition techniques such as Principal Component Analysis, Dictionary Learning, and Kernel Principal Component Analysis could not reduce the data into components that were discernible with respect to the associated genre labels. With this understanding, I reviewed the correlation heatmaps of the data for each genre and noticed that the correlation coefficients were almost never statistically significant for any feature and genre when using a significance level of α = 0.05.

Figure 17: Scatter plots showing two components of a Principal Component Analysis performed on the spaCy dataset for each of the distinct labels in the dataset. The dataset here is encoded via One-Hot Encoding, and a binary classification problem is extracted for each label in accordance with the Binary Relevance solution. Each binary problem is fed into a Principal Component Analysis, and two components are returned and plotted as shown. Ideally, components produced through this method show clear distinction and separation among classes, but here separation is not achieved for any label in the dataset.

Figure 17 shows the results of Principal Component Analysis (PCA) when run on the dataset structured as a binary classification problem for each of the ten genres. The ten genre labels are shown in the legend area of each plot. It can be clearly seen that there are no discernible components separating the positive and negative values of any genre label. This additional insight led me to conclude that the books and associated genre labels in this dataset constitute bad data. The reasons this data could be bad are various. One reason could be that the genre labels obtained from GoodReads, being crowdsourced, do not follow logical or consistent patterns in how they are attributed to each book. The idea that predictive models can be built for this multilabel scenario assumes that there are discernible patterns or differences in the text of the various genres. Such an assumption, while it may hold for smaller datasets with limited books, genres, or authors, appears to be false as the diversity of the dataset increases. More on this point will be discussed later.

Author Prediction: A New Approach

Having discovered substantial difficulties in predicting the GoodReads genres for books in the dataset, I considered alternative approaches that would still satisfy the initial research question of how my novel feature-engineering techniques compare to more established techniques. The new problem of predicting the author of each book proved satisfactory: it was not only possible with the current dataset, but it was also a suitable problem on which classification models could demonstrate performance for the various feature-engineering techniques. Authorship attribution, or the prediction of the author of a given text, has been studied and refined since the 19th century, when Mendenhall examined plays supposedly written by Shakespeare. Others examined "The Federalist Papers," which are attributed to John Jay, Alexander Hamilton, and James Madison, in an attempt to distinguish which segments were written by which author.
These attempts have been built upon ever since with increasingly sophisticated and successful methods and more powerful computational resources [59]. Such improvements have allowed deeper and more rigorous application of authorship-attribution methods to large corpora of texts that would have been infeasible to analyze through traditional manual methods. In addition to being a subject of academic interest, authorship attribution has yielded practical benefits in detecting plagiarism [60], forensic profiling [29], maintaining textual continuity in collaborative works [61], and many other areas of professional, educational, and legal benefit.

To prepare the existing dataset for this new problem, the author of each book had to be extracted. As structured, each book contained its title and authors at the beginning of the full-book text. Using Regular Expressions, this information was extracted and saved to the DataFrame, which then contained the full-book text of each book as the example and the associated author as the output.

An analysis of author prolificness, that is, how many books each author contributed, and of the distribution of books across authors was performed to understand what size of dataset could be used for the machine-learning models. This helped determine how many authors could and should be used to build the models as well as how many books each author could provide. The number of books per author and the number of authors used have an inverse relationship when building the dataset, because as the number of authors increases, the number of books used per author must stay close or equal to the minimum number of books attributed to any included author. This keeps the dataset normalized and equally distributed. If one author dominated the dataset, the models might simply resort to guessing that author for every example, which would not allow the true performance of each feature-engineering technique to be measured.

The raw counts of books per author showed that nearly 25% of the books in the dataset were written by one author, and the eleven most prolific authors comprised nearly 50% of the books in the dataset. The least prolific authors, with ten books each, represented only about 0.5% of the dataset per author.

Figure 18: A bar graph showing the number of books appearing in the dataset for each author. Al Haines dominates as the most prolific author in the dataset, largely skewing results with over 500 books under his name. All other authors save one have fewer than 100 books in the dataset under their name.

As shown in Figure 18, most authors had about 25 books in the dataset. The mean number of books per author was 27.26, but after removing the most prolific author, whose name revealed no actual authorship status when researched, the average dropped to 20.81. The median number of books per author was 15. Figure 19 shows an alternative visualization, a pie chart of the relative frequencies of books per author within the dataset. Following the same preprocessing and feature-engineering procedures described in the Preprocessing and Feature Engineering the Text section, I added columns to the DataFrame containing the three main feature-engineered representations, namely a spaCy vector, a TF-IDF vector, and the novel features described.
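A minimal sketch of the balancing step described above, assuming a hypothetical DataFrame with an "author" column (the column name and sample sizes are illustrative):

```python
import pandas as pd

def balanced_subset(books: pd.DataFrame, authors: list,
                    books_per_author: int, seed: int = 0) -> pd.DataFrame:
    """Return an equal number of books per author so that no single author
    dominates the training data."""
    subset = books[books["author"].isin(authors)]
    return (subset.groupby("author", group_keys=False)
                  .apply(lambda g: g.sample(n=books_per_author, random_state=seed)))

# Toy usage; in the thesis this would be, for example, 37 books each for three authors.
toy = pd.DataFrame({"author": ["A"] * 5 + ["B"] * 3 + ["C"] * 4,
                    "title": [f"book{i}" for i in range(12)]})
print(balanced_subset(toy, ["A", "B"], books_per_author=3))
```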
Figure 19: A pie chart showing authors' relative contributions to the books in the dataset. Each author's slice of the pie indicates the portion of the total books in the dataset under their name.

Figure 20: A table of scatterplots showing two components from a Principal Component Analysis performed on combinations of feature-engineered datasets and numbers of books per author, where the number of authors is three. The first row shows three scatterplots, one for each feature-engineering technique, where the dataset includes one book per author. The second row shows the same for four books per author, and so forth, up to the last row, which has 37 books per author.

Figure 20, Figure 21, and Figure 22 show various visualizations describing the data and the performance of each machine-learning model for each of the feature-engineering techniques applied to the data. There are several factors in this scenario worth examining. In addition to the three feature-engineering techniques used to represent the book data and the three machine-learning algorithms applied to each of them, other factors to consider are the number of books per author and the number of authors included in the dataset. The figures referenced above show how performance and results change as the book count for each author increases.

Figure 21: The average scores for each feature-engineering technique on the datasets with incrementing book counts for three authors. This line chart is derived from Figure 22, but instead of showing each of the three machine-learning model results for the three feature-engineering techniques, an average of all three models is shown for the three feature-engineering methods on the 13 rows from Figure 20.

Figure 22: Three line charts showing the accuracy of each machine-learning model on datasets of incrementing book count for three authors. This depicts the scores for the 13 rows found in Figure 20. Accuracy was computed as an average of the three scoring metrics described earlier.

For my dataset, the largest number of books I could supply for three authors was 37, because 37 was the number of books that the least prolific of the three authors contributed to the dataset. With a larger dataset, this number could be increased.

Figure 23: A table of scatterplots showing two components from a Principal Component Analysis performed on combinations of feature-engineered datasets and numbers of authors, when the number of books per author is 37. The first row shows three scatterplots, one for each feature-engineering technique, where the dataset includes two authors. Subsequent rows show the same for increasing numbers of authors, up to the last row, which includes five authors.

Figure 20 shows that, for the three authors selected in the example, as the number of books per author increases, the two PCA components form mostly distinct and separable clusters. To the human eye, the spaCy and novel feature-engineering techniques appear to allow better separation via PCA than the TF-IDF technique. This difference is borne out when examining the performance results of the three machine-learning models built from the three feature-engineering techniques.
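A rough sketch of producing two PCA components for one feature-engineered representation, in the spirit of Figures 20 and 23, follows. The data here is synthetic; in the thesis the input would be the spaCy, TF-IDF, or novel feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 37 "books" each for three "authors", 50 features apiece.
X = np.vstack([rng.normal(loc=i, size=(37, 50)) for i in range(3)])
authors = np.repeat(["author_1", "author_2", "author_3"], 37)

# Reduce the feature-engineered matrix to two components for plotting.
components = PCA(n_components=2).fit_transform(X)
print(components.shape)  # (111, 2) -> one (x, y) point per book

# Each (x, y) pair can then be scattered and colored by author to judge
# visually how separable the authors are under this representation.
```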
Figure 21, which shows the average performance of the three machine-learning algorithms for each feature-engineering technique in Figure 20, shows that the spaCy and novel feature-engineering techniques almost always lead to better performance than the TF-IDF model. Taking the five average performance scores that used the most books (the last five shown in Figure 22), the average accuracy scores were 75.19% for spaCy, 61.21% for TF-IDF, and 70.02% for the novel techniques. The reason for using only these last five averages is that the first few predictions, which were based on ten or fewer books per author, performed worse than average for every model; the performance of each model evened out after reaching about 19 books per author.

Figure 23 shows visualizations much like Figure 20, where two PCA components are plotted, but here, instead of incrementing the number of books per author, the number of unique authors in the dataset is incremented, each author having 30 books in the dataset. It is interesting to note that certain authors are more distinguishable from others according to the PCA component results for any given feature-engineering model. Further, while two authors may be well distinguished under one feature-engineering model, the same authors may be indistinguishable under another. This opens a discussion as to what factors or features of an author's writing style most distinguish them, and which feature-engineering technique is best suited to capture these. The most important features and the most effective feature-engineering methods for predicting text authorship would likely differ from those for another predictive problem, such as genre prediction or spam prediction. Because each natural-language classification problem is different and demands different features and information from the text, feature-engineering techniques cannot necessarily be given one performance ranking for all problems; there may not be a single best feature-engineering technique for natural-text classification. The technique that is best for a given classification problem depends on what data the classification most relies upon: for some problems this may be data best captured in a TF-IDF model, and for others the novel techniques may be best.

To be thorough in examining these principles, I will provide further results relevant to these points. Figure 24 and Figure 25 show the 20 possible three-author combinations of six prolific authors from the dataset, namely Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, two PCA components are plotted showing how separable each author is from the others.

Figure 24: The first ten of twenty combinations of six selected prolific authors: Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, results are shown for two components of Principal Component Analysis on a dataset including 37 books from each of the three authors in the given combination. Three scores are shown to the right of each scatterplot, giving the performance of three machine-learning models built on the dataset containing 37 books from each of the three authors. The scores correspond, from left to right, to K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. An average performance score of the three models is shown outside the bar graphs on the right of each row.
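The twenty combinations come from choosing three authors out of the six. A hedged sketch of how such an experiment loop might be organized follows; the evaluate function is a hypothetical placeholder standing in for building the balanced dataset, fitting the three models, and averaging the three scoring metrics.

```python
from itertools import combinations

AUTHORS = ["Mark Twain", "John Galsworthy", "George Manville Fenn",
           "Nathaniel Hawthorne", "Martin Ward", "William Dean Howells"]

def evaluate(author_triple):
    """Hypothetical placeholder: build the 37-books-per-author dataset,
    fit KNN, SVC, and GaussianNB, and return their average score."""
    raise NotImplementedError

triples = list(combinations(AUTHORS, 3))
print(len(triples))  # 20 combinations, as in Figures 24 and 25

# results = {triple: evaluate(triple) for triple in triples}
```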
Figure 25: The second ten of twenty combinations of six selected prolific authors: Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, results are shown for two components of Principal Component Analysis on a dataset including 37 books from each of the three authors in the given combination. Three scores are shown to the right of each scatterplot, giving the performance of three machine-learning models built on the dataset containing 37 books from each of the three authors. The scores correspond, from left to right, to K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. An average performance score of the three models is shown outside the bar graphs on the right of each row.

The performance of the three machine-learning models built upon each respective dataset shows that the particular authors in a classification situation hugely affect the classifier's ability to predict those authors accurately. Figure 26 shows just how much difference altering the group of authors can make for a given model. It shows the average scores for each of the 20 combinations of authors shown in Figure 24 and Figure 25: the x-axis represents the iteration over the 20 different author combinations, and the y-axis shows the average accuracy of the three models built on the dataset for those authors. Models performed, on average, substantially differently for different author groups. When using the authors George Manville Fenn, Mark Twain, and Nathaniel Hawthorne, the accuracy of the predictive models averaged only 49%. But when using the authors George Manville Fenn, Nathaniel Hawthorne, and William Dean Howells, the models performed with an accuracy of 100%.

Figure 26: Average scores for the 20 combinations of the six selected prolific authors shown in Figure 24 and Figure 25. The average is taken over the three machine-learning models' performance shown in Figure 24 and Figure 25. Twenty average scores are shown, ranging between 0 and 1, 1 being 100% accuracy.

It can be asked whether the difficulty of telling authors apart in these results can be reduced to some factor not described in the data. Did the authors with which the models had 49% accuracy share vernaculars, preferred writing genres, or influences? In the first example, which had 49% accuracy, George Manville Fenn (1831–1909) was an English novelist and journalist who wrote many historical-fiction novels for a young-adult audience. Mark Twain (1835–1910), the pen name of Samuel Langhorne Clemens, was an American writer who contributed mostly novels and short fiction. Nathaniel Hawthorne (1804–1864) was an American writer who contributed much fiction centered on New England, featuring moral metaphors expressed through a dark romanticism, as part of the Romantic movement.
Although these authors were contemporaneous, native English speakers, and influenced to some degree by the Romantic movement during which they were all born, the analyses shown above had difficulty distinguishing between them. What makes the case more curious is that the group of authors that achieved 100% accuracy included two of the three authors from this case and replaced Mark Twain with William Dean Howells. William Dean Howells (1837–1920), whose lifespan very closely overlapped Mark Twain's, was, like Twain, an American realist novelist, literary critic, and playwright. Mark Twain and William Dean Howells were not only contemporaries with much in common; they were friends who frequently spent time together. The worst- and best-performing author groups thus differed by only one author, and the two authors swapped were remarkably similar in writing style, time era, and preferred genres. This oddity suggests that features such as preferred genre, time era, and writing style, as expressed in the spaCy vector representation, can sometimes distinguish between very similar authors, namely Mark Twain and William Dean Howells, yet seem unable to distinguish among less similar authors: Mark Twain, George Manville Fenn, and Nathaniel Hawthorne. It should be noted that the differences in an author's writing style depicted in the PCA results above depend on the feature-engineering technique used, which in this case was the spaCy technique. The spaCy technique had the highest average performance in all other demonstrations, so it is a safe assumption that neither the novel feature-engineering techniques nor TF-IDF would provide new information regarding the PCA results.

Figure 27: Correlation heatmaps for eight authors found in the dataset. Colors indicate how correlated the features from the novel feature-engineered dataset are with each author. Correlation values greater than 0.05 or less than -0.05 indicate statistically significant correlation. The scale of correlation here goes from -1.0 to 1.0. Note that the author Al Haines is represented in this collection of heatmaps because the analysis was performed on all authors in the dataset before any were removed. Although Al Haines was removed for data-normalization purposes, representing his works in the heatmaps does not skew any results.

Figure 27 shows how, depending on the author, different features from the novel-techniques feature set are correlated. The only author having a statistically significant correlation with any of the reading complexity scores is Martin Ward. Mark Twain, on the other hand, has negligible correlation with any of the features here. Other authors vary in which features they are correlated with and by how much.

Conclusion

Based on the findings of this work, I conclude that there is no supreme feature-engineering technique. Although some techniques tend to lead to better machine-learning performance than others, not every classification problem demands the same information, and because of this, no feature-engineering technique among those compared here can be named "best". Different classification problems, such as genre prediction or author prediction, call on different aspects of the text. Those aspects may be represented better by one feature-engineering technique than another, but no single technique captures every distinction relevant to any given classification problem.
When choosing which feature-engineering technique to use for a given problem, one ought to compare several techniques and measure their performance to determine which technique truly optimizes performance for that classification problem. Although many popular techniques, such as the spaCy vectorization used in this work, have a high average score and are safe candidates for most classification problems, it cannot be known without rigorous testing that another technique would not be better suited to the problem. The answer to the research question is that the best feature-engineering technique cannot be generalized across all classification problems and must be tested for each case. In the case of genre prediction, the answer could not be obtained because of defects in the labels of the data. In the case of author prediction, the spaCy technique led to better predictive-model performance, on average. In addition to answering the research question, the results of this work led to interesting discoveries about authors' writing styles. Some authors, irrespective of their preferred genre, time era, or vernacular, are difficult to distinguish, while in other cases the opposite is true. Future work may include an analysis of authorship differences in light of these findings. Further, it would be useful to find new features for representing authors that may be more useful than those suggested in this work.

References

[1] R. J. Z.-H. Z. Yin Zhang, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, pp. 43-52, 2010.
[2] S. M. R. N. P. M. C. L. S. S. W. Ella Rabinovich, "Personalized Machine Translation: Preserving Original Author Traits," in EACL, Valencia, 2017.
[3] S. W. Ella Rabinovich, "Unsupervised Identification of Translationese," Transactions of the Association for Computational Linguistics, 2016.
[4] Y. I. L. M. Mahmoud Khonji, "Authorship Identification of Electronic Texts," IEEE Access, vol. 9, pp. 101124-101146, 2021.
[5] P. M. K. A. Mudit Bhargava, "Stylometric Analysis for Authorship Attribution on Twitter," in International Conference on Big Data Analytics, 2013.
[6] H. C. Ahmed Abbasi, "Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace," ACM Transactions on Information Systems, vol. 26, no. 2, pp. 1-29, 2008.
[7] B. Plank, "Predicting Authorship and Author Traits from Keystroke Dynamics," in Association for Computational Linguistics, New Orleans, 2018.
[8] J. L. H. Evelyn Fix, "Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Randolph Field, 1951.
[9] R. S. Unggul Widodo Wijayanto, "An Experimental Study of Supervised Sentiment Analysis Using Gaussian Naïve Bayes," International Seminar on Application for Technology of Information and Communication, pp. 476-481, 2018.
[10] I. Rish, "An Empirical Study of the Naïve Bayes Classifier," Université de Montréal, 2001.
[11] D. L. Haiyi Zhang, "Naïve Bayes Text Classifier," 2007 IEEE International Conference on Granular Computing (GRC 2007), p. 708, 2007.
[12] P. Majumder, "Gaussian Naive Bayes," OpenGenus IQ: Computing Expertise & Legacy, 2021. [Online]. Available: https://iq.opengenus.org/gaussian-naive-bayes/. [Accessed 20 October 2021].
Joachims, "Text categorization with Support Vector Machines: Learning with many relevant features," in European Conference on Machine Learning, Chemnitz, 1998. [14] K. K. Gabriel Barney, "Predicting Genre from Movie Posters". [15] V. B. R. R. R. M. K. R. R. M. Varshit Battu, "Predicting the Genre and Rating of a Movie Based on its Synopsis". [16] B. Q. T. L. Duyu Tang, "Document Modeling with Gated Recurrent Neural Network". [17] A. L. D. E. James Bergstra, "Predicting genre labels for artists using FreeDB". [18] D. C. François Pachet, "A Taxonomy of Musical Genres". [19] M. M.-y.-G. F. A. G. T. S. Suraj Maharjan, "A Genre-Aware Attention Model to Improve the Likability Prediction of Books". [20] H. A. Soraya Anvari, "Book2Vec: Representing Books in Vector Space without using the Contents," in 8th International Conference on Computer and Knowledge Engineering, Mashhad, 2018. [21] B. J. X. G.-i.-N. Victor Campos, "From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction". [22] K. B. A. W. K.-L. Yichen Tang, "Enriching feature engineering for short text samples by language time series analysis". 68 [23] D. S. S. C. P. K. Donghwa Kim, "Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec," Information Sciences, pp. 15-29, 477. [24] S. C. S. A. Keun Lee, "Web-based algorithm for cylindricity evaluation using support vector machine learning," Computers & Industrial Engineering, vol. 60, no. 2, pp. 228-235, 2011. [25] A. D. J. C. K. M. J. C. O. D. L. K. P. E. A. Clayton A. Turner, "Word2Vec inversion and traditional text classifiers for phenotyping lupus," BMC Medical Informatics and Decision Making, vol. 17, no. 1. [26] F. M. G. D. Khadim Dramé, "Large scale biomedical texts classification: a kNN and an ESA-based approaches," Journal of Biomedical Semantics, vol. 7, 2016. [27] R. T. F. S. T. Richmond Hong, "Authorship Identification for Online Text," in International Conference on Cyberworlds, 2010. [28] A. A. a. H. Chen, "Writeprints: A Stlometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace". [29] W. J. S. C. W. F. T. C. A. T. B. S. A. R. B. C. E. S. Anderson Rocha, "Authorship Attribution for Social Media Forensics," IEEE Transactions on Information Forensics and Security, vol. 12, no. 1, pp. 5-33, 2016. [30] "Project Gutenberg," Urbana, Illinois, [Online]. Available: https://www.gutenberg.org/about/. [Accessed May 2021]. 69 [31] "GoodReads," GoodReads, [Online]. Available: https://www.goodreads.com/. [Accessed May 2021]. [32] S. R. Department, "Number of registered members on Goodreads from May 2011 to July 2019," GoodReads, July 2019. [Online]. Available: https://www.statista.com/statistics/252986/number-of-registered-members-on-goodreadscom/. [Accessed 22 September 2021]. [33] M. d. C. Saavedra, Don Quixote, Francisco de Robles, 1605. [34] J. Austen, Pride and Prejudice, Simon & Schuster, 1797. [35] "FuzzyWuzzy," [Online]. Available: https://github.com/seatgeek/fuzzywuzzy. [Accessed 22 September 2021]. [36] L. Richardson, Beautiful Soup, April 2007. [Online]. Available: https://beautiful-soup-4.readthedocs.io. [Accessed June 2021]. [37] G. Tsoumakas, "Multi-label classification: an overview," International journal of data warehousing and mining, vol. 3, no. 3, p. 1, 2007. [38] J. K. Wei Bi, "Efficient Multi-label Classification with Many Labels," Proceedings of Machine Learning Research, vol. 28, pp. 405-413, 2013. [39] M. H. a. I. Montani, "spaCy," Explosion, [Online]. 
Available: https://spacy.io/. [Accessed July 2021]. 70 [40] B. Srinivasa-Desikan, "Natural Language Processing and Computation Linguistics," in Natural Language Processing and Computation Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras, Birmingham, Packt Publishing Ltd., 2018. [41] M. H. a. I. Montani, "spaCy: Word vectors and semantic similarity," Explosion, [Online]. Available: https://spacy.io/usage/linguistic-features#vectors-similarity. [Accessed July 2021]. [42] K. S. Jones, "Letters to the Editor," Journal of the American Society for Information Science , vol. 24, no. 2, pp. 166-167, 1973. [43] R. W. P. L. K. F. W. K. L. K. Ho Chung Wu, "Interpreting TF-IDF term weights as making relevance decisions," ACM Transactions on Information Systems, vol. 26, no. 3, pp. 1-37, 2008. [44] Ö. B. E. U. Đlker Nadi Bozkurt, "Authorship Attribution," in International International Symposium on Computer and Information Sciences, Ankara, 2007. [45] G. H. McLaughlin, "SMOG Grading—a New Readability Formula," Jounral of Reading, vol. 12, no. 8, pp. 639-646, 1969. [46] "Textstat," Textstat, [Online]. Available: https://textstat.readthedocs.io/en. [Accessed 4 October 2021]. 71 [47] Y. D. L. M. Y. W. S. L. X. Z. Xue Han, "A Novel Part of Speech Tagging Framework for NLP Based Business Process Management," IEEE International Conference on Web Services (ICWS), pp. 383-387, 2019. [48] A. Onan, "An ensemble scheme based on language function analysis and feature engineering for text genre classification," Journal of Information Science, vol. 44, no. 1, pp. 28-47, 2016. [49] "Universal POS tags," Universal Dependencies contributors, [Online]. Available: https://universaldependencies.org/docs/u/pos/. [Accessed 4 October 2021]. [50] "Kaggle," Google, [Online]. Available: https://www.kaggle.com. [Accessed 5 October 2021]. [51] "Kaggle Documentation technical specifications," Google, [Online]. Available: https://www.kaggle.com/docs/notebooks#technical-specifications. [Accessed 5 October 2021]. [52] F. Perez, "Project Jupyter," [Online]. Available: https://jupyter.org/about. [Accessed 5 October 2021]. [53] [Online]. Available: https://scikit-learn.org/stable/. [54] W. McKinney, "Pandas - Python Data Analysis and Manipulation Tool," [Online]. Available: https://pandas.pydata.org/. [Accessed 5 October 2021]. 72 [55] T. Oliphant, "Numpy," CZI, [Online]. Available: https://numpy.org/. [Accessed 5 October 2021]. [56] "Google Cloud Platform," Google, [Online]. Available: https://cloud.google.com/. [Accessed 5 October 2021]. [57] "Pickle—Python object serialization," [Online]. Available: https://docs.python.org/3/library/pickle.html. [Accessed 5 October 2021]. [58] J. O. X. Z. Ximing Li, "Supervised topic models for multi-label classification," Neurocomputing, no. 149, pp. 811-819, 2014. [59] E. Stamatatos, "A survey of modern authorship attribution methods," Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538-556, 2008. [60] B. S. M. K. Sven Meyer zu Eissen, "Plagiarism Detection Without Reference Collections," in Advances in Data Anlaysis, Berlin, 2007. [61] D. K. P. V. B. B. S. I. Jeff Collins, "Detecting Collaborations in Text Comparing the Authors' Rhetorical Language Choices in The Federalist Papers," Computers and the Humanities, vol. 38, pp. 15-26, 2004. [62] P. Majumder, "Gaussian Naive Bayes," OpenGenus IQ, [Online]. Available: https://iq.opengenus.org/gaussian-naive-bayes/. [Accessed 23 November 2021]. 73 [63] M. 
Honnibal, "spaCy Visualizations," Explosion, 2021. [Online]. Available: https://spacy.io/usage/visualizers. [Accessed 23 November 2021]. |
Format | application/pdf |
ARK | ark:/87278/s6xftxtn |
Setname | wsu_smt |
ID | 96852 |
Reference URL | https://digital.weber.edu/ark:/87278/s6xftxtn |