Title |
Russell, Michael_MCS_2021 |
Alternative Title |
A Comparison of Natural Language Feature Engineering Techniques
Creator |
Russell, Michael |
Collection Name |
Master of Computer Science |
Description |
An evaluation of feature-engineering techniques applied to natural text, examining how they affect the performance of machine-learning models. The feature-engineering techniques considered are spaCy vectorization, Term Frequency-Inverse Document Frequency (TF-IDF), and what this work refers to as novel features, namely various reading-complexity scores and part-of-speech frequencies. The machine-learning algorithms used in this work are K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. The results indicate that no feature-engineering technique generalizes its success beyond the specific context in which it is employed.
Abstract |
The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect a machine-learning algorithm's ability to predict accurately. Three feature-engineering techniques were applied to a large body of textual data representing books by various authors. The three feature-engineering techniques are:
1. spaCy vectorization.
2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.
3. Reading-complexity scores and relative part-of-speech (POS) frequencies, which I refer to as a "novel" feature-engineering technique throughout this paper.
The performance of three machine-learning algorithms was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were:
1. K-Nearest Neighbors (KNN).
2. Support Vector Machine Classifier (SVC).
3. Gaussian Naïve Bayes (GaussianNB).
The variables named above, feature-engineering techniques and machine-learning algorithms, were examined in varying permutations in order to discern possible relationships and effects among them; for example, the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the other two feature-engineering techniques, each likewise tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature engineered and the intended use of that data, different feature-engineering techniques will perform better than others; in general, there is no single "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which feature-engineering techniques are applied can drastically affect their performance. In some cases, the subtleties behind these performance differences arise from factors in the data that offer no apparent explanation for any, let alone drastic, performance differences.
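To make the permutation design concrete, the following is a minimal sketch of the comparison described above, assuming Python with scikit-learn and spaCy (en_core_web_sm installed). The four-excerpt toy corpus, the author labels, and the specific "novel" features (mean token length as a crude reading-complexity proxy, plus four POS frequencies) are illustrative stand-ins, not the thesis's actual data or feature set.

    import numpy as np
    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # en_core_web_sm ships no static word vectors; doc.vector falls back to
    # context tensors. en_core_web_md would be closer to true spaCy vectors.
    nlp = spacy.load("en_core_web_sm")

    # Hypothetical stand-in corpus: two excerpts per "author".
    texts = [
        "Call me Ishmael. Some years ago, never mind how long precisely.",
        "It is a truth universally acknowledged that a single man must want a wife.",
        "Whenever I find myself growing grim about the mouth, I account it high time to get to sea.",
        "She was convinced that she could have been happy with him.",
    ]
    authors = ["Melville", "Austen", "Melville", "Austen"]

    def spacy_features(docs):
        # spaCy vectorization: one document vector per book excerpt.
        return np.vstack([nlp(d).vector for d in docs])

    def tfidf_features(docs):
        # TF-IDF vectorization over the whole corpus, densified for GaussianNB.
        return TfidfVectorizer().fit_transform(docs).toarray()

    def novel_features(docs):
        # Stand-in "novel" features: mean token length as a crude
        # reading-complexity proxy, plus relative POS frequencies.
        rows = []
        for d in docs:
            tokens = [t for t in nlp(d) if not t.is_space]
            mean_len = np.mean([len(t.text) for t in tokens])
            pos_freqs = [sum(t.pos_ == p for t in tokens) / len(tokens)
                         for p in ("NOUN", "VERB", "ADJ", "ADV")]
            rows.append([mean_len, *pos_freqs])
        return np.array(rows)

    feature_sets = {
        "spaCy": spacy_features,
        "TF-IDF": tfidf_features,
        "novel": novel_features,
    }
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=1),
        "SVC": SVC(),
        "GaussianNB": GaussianNB(),
    }

    # Every feature-engineering technique paired with every classifier.
    for feat_name, extract in feature_sets.items():
        X = extract(texts)
        for model_name, model in models.items():
            acc = cross_val_score(model, X, authors, cv=2).mean()
            print(f"{feat_name:>7} + {model_name:<10} accuracy: {acc:.2f}")

Each feature matrix is passed to each classifier in turn; cross_val_score with a small cv merely keeps the sketch self-contained, whereas the thesis would use a proper evaluation protocol on its full corpus.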
Subject |
Algorithms; Computer science |
Keywords |
Natural Language Processing; Machine Learning; Feature Engineering; Genres; Authorship Attribution |
Digital Publisher |
Stewart Library, Weber State University |
Date |
2021 |
Medium |
Thesis |
Type |
Text |
Access Extent |
1.79 MB; 73-page PDF
Language |
eng |
Rights |
The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their thesis, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights.
Source |
University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
Format |
application/pdf |
ARK |
ark:/87278/s6xftxtn |
Setname |
wsu_smt |
ID |
96852 |
Reference URL |
https://digital.weber.edu/ark:/87278/s6xftxtn |