Title | Russell, Michael_MCS_2021 |
Alternative Title | A Comparison of Natural Language Feature Engineering Techniques
Creator | Russell, Michael |
Collection Name | Master of Computer Science |
Description | An evaluation of feature-engineering techniques applied to natural text, examining how such techniques affect machine-learning models' performance. The feature-engineering techniques evaluated are spaCy vectorization, Term Frequency-Inverse Document Frequency, and what this work refers to as novel features, namely various reading complexity scores and relative part-of-speech frequencies. The machine-learning algorithms used in this work are K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. The results indicate that no feature-engineering technique generalizes its success beyond the specific context in which it is employed.
Abstract | The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect machine-learning algorithms' ability to predict accurately. Three feature-engineering techniques were applied to a large collection of textual data representing books from various authors. The three feature-engineering techniques are: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. The performance of three machine-learning algorithms was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were: 1. K-Nearest Neighbors (KNN). 2. Support Vector Machine Classifier (SVC). 3. Gaussian Naïve Bayes (GaussianNB). The variables named above (feature-engineering techniques and machine-learning algorithms) were examined in varying permutations in order to discern possible relationships and effects among them; e.g., the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the next two feature-engineering techniques, each tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature engineered and the intended use of such data, different feature-engineering techniques will be better than others. Generally, there is no "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which feature-engineering techniques are applied can drastically affect their performance. In some cases, these performance differences arise from factors in the data that offer no apparent reason for causing any difference at all, let alone a drastic one.
Subject | Algorithms; Computer science |
Keywords | Natural Language Processing; Machine Learning; Feature Engineering; Genres; Authorship Attribution |
Digital Publisher | Stewart Library, Weber State University |
Date | 2021 |
Medium | Thesis |
Type | Text |
Access Extent | 1.79 MB; 73-page PDF
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
OCR Text | A Comparison of Natural Language Feature Engineering Techniques by Michael Russell A Thesis in the Field of Computer Science for the Degree of Master of Science in Computer Science Approved: Dr. Robert Ball Advisor/Committee Chair Dr. Abdulmalek Al-Gahmi Committee Member Joshua Jensen Committee Member WEBER STATE UNIVERSITY 2021 Abstract The goal of this thesis was to compare feature-engineering techniques commonly used on natural text to see how such techniques affect machine-learning algorithms' ability to predict accurately. Three feature-engineering techniques were applied to a large collection of textual data representing books from various authors. The three feature-engineering techniques are: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. The performance of three machine-learning algorithms was measured for each of the above-mentioned feature-engineering techniques. The three machine-learning algorithms compared were: 1. K-Nearest Neighbors (KNN). 2. Support Vector Machine Classifier (SVC). 3. Gaussian Naïve Bayes (GaussianNB). The variables named above (feature-engineering techniques and machine-learning algorithms) were examined in varying permutations in order to discern possible relationships and effects among them; e.g., the spaCy-vectorized books were tested on each machine-learning algorithm, followed by the next two feature-engineering techniques, each tested on each machine-learning algorithm. This work shows how crucial context is to the success of a feature-engineering technique. Depending on the data being feature engineered and the intended use of such data, different feature-engineering techniques will be better than others. Generally, there is no "best" feature-engineering technique. This work demonstrates how slight alterations in the context in which feature-engineering techniques are applied can drastically affect their performance. In some cases, these performance differences arise from factors in the data that offer no apparent reason for causing any difference at all, let alone a drastic one.
Table of Contents Introduction ..... 5 Related Work ..... 12 Obtaining Full-length Books for Dataset ..... 15 Scraping GoodReads for Genres ..... 16 Cleaning the Data ..... 20 Preprocessing and Feature Engineering the Text ..... 23 spaCy vectorization ..... 23 Term Frequency-Inverse Document Frequency (TF-IDF) ..... 24 Novel Feature-Engineering Techniques ..... 25 Feature-Engineering Pipeline ..... 27 Challenges Faced with Feature Engineering ..... 29 Multilabel Classification ..... 35 Building and Scoring Multilabel Classifiers ..... 36 Most Important Features ..... 42 Challenges Faced with Multilabel Classification ..... 44 Author Prediction: A New Approach ..... 48 Conclusion ..... 63 Introduction Natural language processing (NLP) refers to the branch of computer science which aims to allow computers to understand and deal with language data in a way similar to humans. Processing textual data is becoming increasingly important and relevant to data scientists, researchers, and social media companies, to name only a few. While intentions may differ for processing natural text via software programs, the rudimentary process remains the same. For textual data to be interpreted or be useful to a computer, it must be transformed into some sort of numerical representation. Techniques that produce this transformation are referred to as "feature engineering" in the literature. These techniques range in levels of computation and complexity as well as in their efficacy to represent all or certain aspects of the original text. The Bag-of-Words feature-engineering method works by taking all of the unique words in a text and counting the number of occurrences of each word within a given text segment. For example, a sentence such as: "The man walked up the mountain and he saw something he had not seen before" contains the unique set of words: {"the", "man", "walked", "up", "mountain", "and", "he", "saw", "something", "had", "not", "seen", "before"} Using this unique set of words as an index, an array representation of counts of words can be produced as follows: [2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1] This array representation could then be fed to a machine-learning algorithm as input.
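As a concrete illustration (not code from the thesis), the Bag-of-Words transformation just described can be reproduced with Python's standard library; the vocabulary is built in order of first appearance so the resulting array matches the walk-through above:

    from collections import Counter

    sentence = "The man walked up the mountain and he saw something he had not seen before"
    tokens = sentence.lower().split()

    # Build the vocabulary in order of first appearance, then count occurrences.
    vocabulary = list(dict.fromkeys(tokens))
    counts = Counter(tokens)

    bag_of_words = [counts[word] for word in vocabulary]
    print(vocabulary)
    # ['the', 'man', 'walked', 'up', 'mountain', 'and', 'he', 'saw', 'something', 'had', 'not', 'seen', 'before']
    print(bag_of_words)
    # [2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]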
Despite being one of the simplest feature-engineering methods for natural language, Bag-of-Words still achieves its purpose and is a proven and tested method still used in many applications today [1]. More sophisticated feature-engineering techniques such as term frequency-inverse document frequency, various vectorization techniques, and n-grams often employ similar tactics to this fundamental method. Certain pieces of information from the original text are lost during such a transformation, i.e., counts of words do not capture relational positioning between words, definitions of words, or sentiment. This occurs not only during the transformation process described but, to some degree, during human and machine translation of a text. Rabinovich describes in her work how text traits can successfully identify an author's gender, but when that same text is translated, such traits are obfuscated to the point that author gender becomes unpredictable [2]. Not only does modifying, translating, or transforming text leave an identifiable mark [3], but such processes cannot be performed while retaining all traits and aspects of the original text. Because certain feature-engineering techniques may be better than others at retaining certain information of a text, such as sentiment or meaning, one should consider what information from their text is most important to their objective. This research addresses differences in the abilities of three feature-engineering techniques in the task of identifying the correct author of a text using several different machine-learning algorithms. Such differences may reflect which pieces of information from a text are more useful in the task of identifying its author. Also, the results may indicate how well each feature-engineering technique captures relevant information of a given text. The task used to measure differences among feature-engineering techniques in this work is authorship attribution. Much work has been done on authorship attribution and other stylometric problems [4] [5] [6], such as author clustering, which is the task of successfully separating texts of n number of authors into n clusters; author verification, which compares pairs of texts and predicts which one belongs to a determined author; author profiling, which predicts demographic and other author information like gender, age, etc.; and author diarization, which clusters a single text written by n authors into n clusters, much like author clustering except it is performed on a single text written by more than one author [4]. The feature-engineering techniques in this work, excepting the readability scores, obtain data directly from the raw text of a book rather than by some intermediary processing of such text. For example, the readability scores process the text of the book via readability formulas and return a score based on such formulas. Additionally, some have claimed improved authorship identification results when first predicting the gender and age of the author of a work and then using such details as features in addition to the feature-engineered text to improve authorship prediction [7]. Because the aim of this work is to compare feature-engineering techniques' performance in a natural-language-processing context, such measures to improve predictive performance for authorship identification are irrelevant.
It should be noted here that my initial task differed from the one described above, namely using the task of predicting the correct author for a given text to measure differences between feature-engineering techniques. Initially, I planned to measure such differences between feature-engineering techniques via genre classification. Much of my early work described below will cover details pertaining to this initial method as well as why I shifted from genre classification to author classification. The results of my first method revealed useful information that contributes to the overall substance of this paper. My research question is the following: Which of the three feature-engineering techniques (spaCy vectorization, TF-IDF, and the novel techniques suggested in this work) produces the highest accuracy when used with the three machine-learning algorithms (K-Nearest Neighbors [8], Gaussian Naïve Bayes, and Support Vector Machine Classification) on a dataset consisting of a corpus of full-length book texts, where the objective is authorship attribution? Performance here is measured by the accuracy of the three machine-learning algorithms' predictions of correct authorship for given books. Variables such as the number of authors and books in the dataset will be considered and tested incrementally and systematically to assess relevant effects. The three feature-engineering techniques this work compares have been described above. The machine-learning algorithms that are used to examine the performance of each feature-engineering technique are briefly described here. K-Nearest Neighbors is a classification method that represents data in an n-dimensional space where n is the number of features in the data. To classify a given example, the algorithm assigns the most popular class of the k-nearest datapoints, where k is the number of nearest neighbors used. The distance between a datapoint and the rest of the datapoints can be calculated using the Euclidean distance formula. Figure 1 shows a 2D plot of datapoints belonging to three different classes. The different classes are represented by color and shape. An unidentified datapoint is shown as a question mark inside a circle. The K-Nearest Neighbors algorithm would classify this datapoint by finding the most popular class of the k-nearest neighbors. If we use a k of four, the datapoint would be assigned the class of the green triangles. This is because the four closest neighbors to the respective datapoint consist of three green triangles and one purple star. Green triangles are the most popular class. Figure 1: A visualization of datapoints in a 2-dimensional space with class designations denoted by color and shape; red circles represent one class, purple stars another, and green triangles a third. An unassigned datapoint is shown as a question mark. Gaussian Naïve Bayes has proven to be effective for classification problems involving natural language processing [9]. Bayes' Theorem calculates the conditional probability of an event given knowledge of a related event. The naïve aspect of the Naïve Bayes classifier is that it assumes the probability value of a given feature is independent of any other feature's value. This assumption, while often inaccurate, does not seem to disrupt the Naïve Bayes classifier's ability to classify well [10], particularly with text classification [11]. The Gaussian variant of the Naïve Bayes classifier makes an additional assumption that continuous data are distributed in accordance with the normal, or Gaussian, distribution.
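A minimal Sci-Kit Learn sketch of the two classifiers described so far; the datapoints are invented and only loosely mirror Figure 1, and this is illustrative rather than the thesis code:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB

    # Toy 2-D datapoints with three classes, analogous to Figure 1 (values are invented).
    X = np.array([[1.0, 1.2], [1.1, 0.9], [3.0, 3.1], [3.2, 2.8], [3.1, 3.3],
                  [5.0, 1.0], [5.2, 1.1], [4.9, 0.8]])
    y = np.array(["red_circle", "red_circle", "green_triangle", "green_triangle",
                  "green_triangle", "purple_star", "purple_star", "purple_star"])

    # K-Nearest Neighbors: Euclidean distance by default, majority vote among k neighbors.
    knn = KNeighborsClassifier(n_neighbors=4).fit(X, y)

    # Gaussian Naive Bayes: assumes feature independence and normally distributed features.
    gnb = GaussianNB().fit(X, y)

    unknown = np.array([[3.0, 2.9]])      # the "question mark" datapoint
    print(knn.predict(unknown))           # majority class among the 4 nearest neighbors
    print(gnb.predict(unknown))           # class with the highest Gaussian likelihood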
An example of how the probabilities of two events are calculated according to a Gaussian distribution is shown in Figure 2 [12]. Figure 2: A graphic showing the Gaussian probability curve for two events. Points are shown to indicate the joint probability of either event [62]. Figure 2 shows the Gaussian, or normal, probability curve for class A and class B, as well as the probability of an event from either class A or class B given that an event from the counterpart class has occurred. Probabilities generated by this method allow for classification based on such probabilities, i.e., a generated probability can be rounded to conclude whether a certain outcome or label is appropriate. The Support Vector Machine classifier works by finding the optimal hyperplane in an n-dimensional space where n is the number of features for the dataset. A hyperplane is optimal when it most distinctly separates the examples in a dataset into their classes while maximizing the margin between the boundary of the hyperplane and the nearest datapoints. The datapoints closest to this boundary are called support vectors. Using the boundary, the Support Vector Machine can classify the data accordingly. Support Vector Machine classifiers have proven to be very effective in applications for textual data in various stylometric tasks [13]. Related Work Having an initial objective of comparing feature-engineering techniques' predictive performance in the task of book genre classification, I found many related works which, instead of predicting book genres, predicted movie or music genres. These alternative media sources have non-textual formats which require entirely different feature-engineering methods. For example, Barney et al [14] use poster images for movies as input in their model predicting genres for movies. Battu et al [15] alternatively use synopses of movies to build their predictor, which not only predicts movie genre but also rating based on a five-star rating scale. This latter work manually categorized tens of genres into nine overarching genre labels, which introduces a subtle touch of subjectivity to the categorization of labels. This work also employed recurrent neural networks as a modeling technique, as has been successfully demonstrated in other previous works [16]. A large emphasis of this work was put on examining performance differences of machine-learning models rather than of feature-engineering techniques. The work of Bergstra et al [17] implements a genre predictor for musical pieces. This work does not expound upon the feature-engineering process of the files containing music but does offer a unique technique for minimizing genre labels in their dataset. Relying upon the work of Pachet and Cazaly [18], which presents an attempt at an objective taxonomy of musical genres, Bergstra et al reduce genre labels for musical pieces by replacing child nodes within this hierarchical taxonomy with their parent nodes so that a viable number of genres remains. Unfortunately, I could find no standard taxonomy of book genres to be used for this work. As shown in the work of Maharjan et al [19], many have attempted to manually reduce genres to a minimal set, but at the cost of injecting subjective bias. Maharjan et al, like Battu et al, put their emphasis on performance differences among machine-learning models rather than feature-engineering techniques.
Additionally, the aim of their work is to achieve a likability rating of a book, and the ability of their model to predict book genres is an inadvertent byproduct of this. Work has been done to refine the feature-engineering process for natural language processing solutions, however. Book2Vec, presented by Anvari and Amirkhani [20], builds upon document vectorization methods which focus on word tokens. Book2Vec represents books in a novel way by using low-dimensional numeric vectors containing data related to words in sentences of the book and finds that, while this feature-engineering technique may not be ideal for accurate performance, it does allow for the retention of the sentiment of a text or book. Like Anvari and Amirkhani, many have attempted to classify or predict text sentiment [16] [21]. Although this task is often performed on textual data, such data often differs in length and style from book text. Such differences ought to invite scrutiny of suggested correlations between successful feature-engineering techniques for the two, as suggested in the work of Tang et al, which limits its data particularly to short text instead of long text [22]. Many works have demonstrated the efficacy of the feature-engineering techniques employed in this work, such as Bag-of-Words [1] and TF-IDF [23]. A similarly large corpus of works supports the efficacy of the machine-learning techniques employed in this work [13] [24] [25] [26]. For the task of identifying authorship of a text and other stylometric endeavors, authors have varied much in the feature-engineering techniques, machine-learning algorithms, and focus of their classification pursuits. Rabinovich et al [3] emphasize the utility of classifying author gender in relation to further classification aims. Hong et al [27] classify based on author types rather than identities for online texts and show lexical and syntactic features that can be exploited to effectively identify such types. Other works focus on identifying authors for forensic or legal purposes, with an emphasis on comparing machine-learning models rather than feature-engineering techniques [28] [6] [29]. Obtaining Full-length Books for Dataset My first task was to obtain a large number of full-length books which would later contribute to my dataset. I used a collection of 62,162 books from Project Gutenberg [30]. After downloading this collection, I extracted the text files of the books from zipped folders within a complex series of subdirectories using bash through the command line terminal. This left me with 62,162 books stored as text files. The file names of such books indicated their directory identity and placement rather than the title or author of the book. To obtain the titles and authors of the books, I wrote a script in Python to iteratively search the contents of the text file of each book and extract the book title and author name of the given book. Because Project Gutenberg enforces a header revealing copyright information, publication details, et cetera, at the top of text files for books within their public library, extracting the book title was not trivial. Using a Regular Expression (Regex), I was able to filter through the file contents to locate the title of the book.
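A minimal sketch of the kind of Regex-based extraction described above, assuming the header contains "Title:" and "Author:" lines as Project Gutenberg files typically do (this is illustrative, not the thesis script):

    import re

    def extract_title_and_author(file_contents):
        """Pull the title and author out of a Project Gutenberg header."""
        title = re.search(r"^Title:\s*(.+)$", file_contents, re.MULTILINE)
        author = re.search(r"^Author:\s*(.+)$", file_contents, re.MULTILINE)
        return (title.group(1).strip() if title else None,
                author.group(1).strip() if author else None)

    header = "The Project Gutenberg eBook of Pride and Prejudice\nTitle: Pride and Prejudice\nAuthor: Jane Austen\n"
    print(extract_title_and_author(header))  # ('Pride and Prejudice', 'Jane Austen')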
In addition to extracting this information, I needed to remove Project-Gutenberg-enforced headers and content from the book files so that such data would not misrepresent the book content or skew results by injecting foreign content into the text of each book. The results of this milestone included obtaining over 40,000 books as text files (some books from the initial set were removed because of invalid text encoding or because they were written in a language other than English) and the associated authors and book titles. Scraping GoodReads for Genres My second task was to find the genres associated with each book in my dataset. To achieve this, I decided to use GoodReads [31], which has over 90 million registered members [32]. GoodReads lets its users rate and review books as well as tag books. Tags can be anything from "to-read" to "fiction". Most tags on books are genres, but not all. The reason I chose to use GoodReads as my source for obtaining genres associated with the books in my Gutenberg dataset was that, with over 90 million users, GoodReads offers an extensive set of book titles and their crowd-sourced tags. To assign genre tags manually to the books in the Gutenberg dataset, I would have to read each book, a task I certainly could not undertake. Additionally, such assignments would be subject to my own opinions and take on each book. Considering that associating genre tags with a given book is not only a subjective process but a laborious and time-consuming one, I decided that using GoodReads genre tags would be the most effective option for obtaining accurate genre tags. This method allowed for tens, or in some cases thousands, of votes to back up genre tags associated with each book. Popular books, like Don Quixote [33], by Miguel de Cervantes, have thousands of tags, some of which have thousands of votes to back them up, e.g., at the time this paper was written, Don Quixote had 10,081 people tagging the book with the "classics" tag. Other tags for this book include "to-read", "currently-reading", "fiction", "favorites", "classic", and "owned". The tags shown in this example represent some genres that may be accurately attributed to Don Quixote, but these are mixed among irrelevant, non-genre tags such as "to-read", "currently-reading", "favorites", and "owned". Additionally, GoodReads does not enforce standardized spelling or capitalization for tags. This results in redundant tags based on differences in spelling, capitalization, or plurality. The following tags may all exist for a book even though they all refer to the same idea: "classics," "Classics," "classic," and "Classic". Some of these potential problems and difficulties were seen in advance and some were realized during the process. I began by creating a web scraper program written in Python to iterate through every book title in my dataset and search it on GoodReads. GoodReads structures its pages and search results in such a way that I could programmatically create a URL that would produce the search results for a given title. For example, if I wanted to search for the book Pride and Prejudice [34], by Jane Austen, I would append the text "https://www.goodreads.com/search?q=" with the book title, replacing spaces with plus characters, as follows: "https://www.goodreads.com/search?q=pride+and+prejudice" Because there often was more than one result for each book, I had to find the best matching title to the actual title of the respective book searched.
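The URL construction just described can be sketched in a few lines (the helper name goodreads_search_url is my own, not from the thesis):

    def goodreads_search_url(title: str) -> str:
        """Build a GoodReads search URL by replacing spaces in the title with '+'."""
        base = "https://www.goodreads.com/search?q="
        return base + "+".join(title.lower().split())

    print(goodreads_search_url("Pride and Prejudice"))
    # https://www.goodreads.com/search?q=pride+and+prejudice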
To find the best match, I used FuzzyWuzzy [35], a Python package that uses Levenshtein distance to calculate the differences between character sequences. The Levenshtein distance is calculated by counting the minimum number of character changes necessary to convert one string to the other; for this calculation, a character change entails a removal, addition, or replacement of a character. Since, at most, the number of character changes would be the number of characters in the longer string, this distance can be converted into a ratio of the Levenshtein distance divided by the length of the longer string. The FuzzyWuzzy Python package implements the Levenshtein distance as described above to produce a similarity ratio score for the titles of books in the search results for a given query and returns the ratio as an integer from 0 to 100. The book with the highest matching score above a certain threshold was used as the source from which I would obtain tags, or genres. An example is shown in Figure 3, where a book title hypothetically scraped from a book in the Gutenberg dataset is compared against several results obtained by searching GoodReads with the title of the Gutenberg book. Figure 3: An example of the FuzzyWuzzy Python library computing string similarity ratios using the Levenshtein distance formula for variations of book titles. FuzzyWuzzy generates a string-matching score for each title, and the entry with the highest matching score above the tolerance threshold is selected as the target from which genre tags would be extracted. The tolerance for the match percentage score I used was 75. After experimenting with different tolerance values, I found that the value of 75 filtered out searched books for which there were no suitable matches but still allowed for the subtle differences that would sometimes exist between the titles of books scraped from my Gutenberg set and those found on GoodReads. Some books in my dataset at this stage either were not listed on GoodReads or had such different publication details, such as title, spelling, etc., that they were disregarded from the dataset. After finding a suitable match from the GoodReads results for a given query, I extracted the URL for the respective matched book and, after requesting the URL, was able to navigate to the page listing all tags used for the book. All tags were parsed using Beautiful Soup [36]. Beautiful Soup is a Python library that offers parsing abilities for a variety of markup languages such as HTML and XML. This is exceedingly helpful when navigating through documents written in such languages. For example, when navigating through the GoodReads HTML page of a given book to find the href attribute of the element linking to the page listing the genre tags of the given book, Beautiful Soup can be used to specify the element, id, class, and other aspects of markup components to narrow down the search results. This ability allowed for quick and easy navigation to the relevant elements of an HTML page for this project. The results from this milestone were a list of tags from GoodReads for each book in my dataset. Also, my dataset was reduced during this milestone due to books not producing suitable matches when searched in GoodReads. The number of books in my dataset was around 10,000 at this point.
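As a concrete illustration of the title-matching step described above (not the thesis scraper itself), here is a minimal sketch; the "a.bookTitle" selector is an assumption about the GoodReads search-results markup, which may differ and has likely changed since this work was done:

    from bs4 import BeautifulSoup
    from fuzzywuzzy import fuzz

    TOLERANCE = 75  # minimum title-similarity score accepted, as described above

    def best_match_link(searched_title, search_html):
        """Pick the search result whose title is closest to the searched title.

        `search_html` is an already-downloaded GoodReads search-results page.
        Returns the link of the best match, or None if nothing scores above TOLERANCE.
        """
        soup = BeautifulSoup(search_html, "html.parser")
        best_score, best_link = 0, None
        for link in soup.select("a.bookTitle"):  # assumed selector for result links
            score = fuzz.ratio(searched_title.lower(), link.get_text(strip=True).lower())
            if score > best_score:
                best_score, best_link = score, link.get("href")
        return best_link if best_score >= TOLERANCE else None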
Cleaning the Data The next step in my work was to clean and prepare my data to be feature engineered in preparation for classification algorithms to be applied to the dataset. I began by standardizing the tags obtained from GoodReads for each book. GoodReads, despite allowing for custom and non-genre tags, does have a listing of standard genres existing among the tags used for books. By using the FuzzyWuzzy package mentioned earlier, I was able to match and replace tags that were either misspelled or alternatives to a standard genre with the best-matched standard genre. Tags that didn't come close to matching any of the standard genres were disregarded from the tag list associated with their book. After this step was complete, I removed any books from my dataset that no longer had any tags associated with them. This resulted in a dataset of approximately 9,000 books and their associated, standardized genre tags from GoodReads. One additional step performed here was reducing the number of unique genre tags. Initially, there were over 80 unique tags, some of which were obscure or rarely used. Foreseeing the classification step ahead, I decided to reduce the number of unique tags down to ten. Figure 4: A chart showing the number of occurrences for each genre found in the dataset. Genre labels include those which will later be filtered out because of lack of relevance. I chose ten because it was a good balance between having enough genres to capture the different expressions and styles of the diverse books in my database and not having so many genre tags that classifying the genres would become less feasible. Label density, or the number of labels used in a multi-label classification problem, can lead to poor results when increased in excess [37]. The top ten genres within this set comprised 29,278, or 45.7%, of the 64,015 tags used in the set. I could not calculate the performance of machine-learning models for every combination and number of labels in the multi-label classification problem due to the intense level of resources such a process would demand. If this could have been done, the results would have allowed me to ascertain the optimal selection of genres and number of genres to use. However, to address the concern of exorbitant labels in my classification problem, which can lead to large obstacles in accurate classification [38], I considered 10 labels to be reasonable according to the data available. Figure 5: A chart showing the number of occurrences for each genre found in the dataset. Genre labels include only those that remained following the filtration process, which removed irrelevant and unpopular entries. Figure 4 shows the distribution of genre tags before this reduction took place. Figure 5 shows the distribution of standardized genre tags for the dataset after the reduction took place. You can see that fiction is attributed to more than half of the books in the dataset. Other dominating tags include classics and historical. The results of this milestone were a dataset with book titles, full book text, and genres from the list of top ten selected genres for each book, as shown in Figure 6, which shows the dataset as it existed at this point: the dataset contained a column of book titles, a second column of full book texts, and a third column containing a list of genres, but only genres found in the top ten selected genres described above. Figure 6: Five entries from the dataset as it existed at this point. A 'title' column contains book titles. A 'text' column contains full book texts. A 'genres' column contains a list of one or more associated genre labels for the given book.
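A minimal sketch of the tag-standardization step described above; the STANDARD_GENRES list is abbreviated and illustrative rather than the actual list, and the threshold mirrors the value of 75 used earlier:

    from fuzzywuzzy import process

    # Abbreviated, illustrative list of standard genres (the real list was larger).
    STANDARD_GENRES = ["fiction", "classics", "historical", "romance", "fantasy",
                       "poetry", "philosophy", "adventure", "mystery", "short-stories"]
    TOLERANCE = 75

    def standardize_tags(raw_tags):
        """Map raw GoodReads tags onto standard genres; drop tags with no close match."""
        standardized = set()
        for tag in raw_tags:
            match, score = process.extractOne(tag.lower(), STANDARD_GENRES)
            if score >= TOLERANCE:
                standardized.add(match)
        return sorted(standardized)

    print(standardize_tags(["Classics", "classic", "to-read", "favorites", "Historical"]))
    # expected: ['classics', 'historical'], assuming the unrelated tags score below 75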
Preprocessing and Feature Engineering the Text To prepare my data for use with classification algorithms, I had to feature engineer the text of each book in my dataset. As mentioned above, there are many methods used by developers to feature engineer natural text. In this work, I use three feature-engineering methods: 1. spaCy vectorization. 2. Term Frequency-Inverse Document Frequency (TF-IDF). 3. Reading complexity scores and relative part-of-speech (POS) frequencies, which I will be referring to as a "novel" feature-engineering technique throughout this paper. spaCy vectorization spaCy is a natural language processing tool which offers a unique and effective method for vectorizing text in such a way as to retain the real value of the text [39]. spaCy is among the most capable natural language processing tools and has often been paired with other prominent packages and tools such as Gensim and Keras [40]. While other natural language processing tools offer different vectorization tools, I considered spaCy's method to be relevant for my task of predicting the genre of a text. spaCy's vectorization [41] relies on a pre-trained model. spaCy pipelines train a model on a given vocabulary obtained from a variety of sources. The model I used to produce my vectors via spaCy was en_core_web_lg. The nomenclature for en_core_web_lg describes several aspects of the model. The corresponding aspects are: English language; core type, consisting of vocabularies, syntax, entities, and vectors; genre of web, meaning obtained online from blogs, news, and comments; and size large. The size of this model ends up being 12MB, and it produces a vector with 300 dimensions, with 685k keys and 685k unique vectors. For example, the word "vector", when vectorized via spaCy's pretrained model, en_core_web_lg, will produce the following: array([ 1.8837e-02, -2.1752e-01, -1.8054e-01, -6.6371e-01, -6.2088e-02, 8.3114e-01, 1.8319e-01, 4.0326e-01, -1.7784e-01, -2.7251e-01, 5.6706e-01, -6.5128e-01, 4.0985e-01, 7.0762e-02, 1.7274e-02, 2.1063e-01, -6.9014e-01, 3.0686e+00, 3.1167e-02, 2.8181e-01, -2.2141e-01, 2.5380e-01, -7.9937e-01, -2.0017e-01, 1.4919e-01, … (50 hidden rows) -2.7692e-01, -2.1884e-01, -2.7855e-01, -3.8132e-01, 3.2371e-01, -6.5140e-02, 3.4043e-01, -5.1375e-01, -1.6814e-01, 2.6561e-01, -1.9013e-01, -1.5040e-01, -1.9021e-01, -5.5351e-02, -6.0245e-02, 2.3297e-01, -2.4317e-01, 3.5750e-01, -3.0022e-01, -3.5387e-01, -4.0000e-01, -1.2519e-01, 1.6943e-01, -3.5287e-01, 9.1983e-01], dtype=float32) The result is a 300-dimensional vector where each dimension represents an aspect of sentiment and meaning relevant to spaCy's configuration. Less common or nonsensical words such as "afskfsd", which are not real words and don't appear in dictionaries, produce a vector of 300 zeros, which reflects a practically meaningless word according to this paradigm [41].
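A minimal sketch of obtaining a document vector with spaCy, assuming the en_core_web_lg model has been downloaded beforehand; this is illustrative, not the thesis pipeline:

    import spacy

    # Requires a prior `python -m spacy download en_core_web_lg`.
    nlp = spacy.load("en_core_web_lg")

    doc = nlp("The man walked up the mountain.")
    print(doc.vector.shape)   # (300,) -- one 300-dimensional vector per document
    print(doc.vector[:5])     # first few components of the averaged word vectors

    # Out-of-vocabulary tokens such as "afskfsd" carry no meaningful vector.
    token = nlp("afskfsd")[0]
    print(token.has_vector, token.is_oov)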
Term Frequency-Inverse Document Frequency (TF-IDF) TF-IDF is a statistic representing the frequency of a word or term relative to how important such word or term is in its contextual document or corpus. The idea originated from research done on term specificity, which involved weighting terms relative to their context [42]. This method has proven to be effective for natural language processing [43], particularly when the results are used by a Support Vector Machine [44]. The modern formula takes the frequency of a term's appearance in a document and multiplies it by the inverse document frequency, which is a logarithmically scaled ratio of the total number of documents in a corpus to the number of documents containing the respective word or term. Here is an example of TF-IDF. Consider a document of text containing 100 words. Suppose the word "run" appears 13 times in the document. The term frequency (TF) of the word "run" would be 13/100, which gives 0.13. The inverse document frequency (IDF) for the word "run" is calculated by taking the log of the inverse of the frequency at which it appears across the documents of a corpus, e.g., if there are 1,000 documents total in the corpus and 467 of those documents contain the word "run" at least once, the inverse document frequency of the word "run" becomes 1,000/467, which gives 2.14. Take the log of this to get 0.33. Now we calculate the term frequency-inverse document frequency by multiplying TF by IDF, 0.13*0.33, giving us 0.04. This number represents the frequency of a word in a document scaled by the inverse frequency for that same word in the corpus. Novel Feature-Engineering Techniques The novel, or non-traditional and uncommon, feature-engineering techniques applied in this work entail the following: 1. Reading complexity scores 2. Part-of-speech frequencies Reading complexity can be calculated via various methods, each employing a different formula to arrive at a score. I chose three scoring methods which have proven to be effective and well-used in professional and academic fields. These scoring methods are the Flesch reading ease, Flesch-Kincaid grade, and SMOG index. The Flesch reading ease score, which is one of the scoring methods I chose to use in this work, scores based on the average length of sentences and the average number of syllables per word. The SMOG index, presented by G. Harry McLaughlin in 1969 [45], takes 10 consecutive sentences near the beginning of the text, 10 near the middle, and 10 near the end, and counts every word with three or more syllables for each of the three sentence groups. The square root of the sum of these three counts is taken and rounded; the resulting number is added to 3 and results in a readability score that suggests the number of years of education a person needs to read and understand the content. For this work, I used Textstat [46], an open-source Python library that implements, among many others, the above-mentioned scoring methods for reading complexity. The resulting reading complexity scores for the books in the dataset were scaled and normalized, since the numeric values of each test have different interpretations depending on the test; e.g., the SMOG index suggests a value representing the number of years of education a person should have to understand the text, while the Flesch reading ease score works on a scale of 0 to 100, where 100 indicates the easiest reading level and 0 the most difficult.
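A minimal sketch of computing the three scores with Textstat; "book.txt" is a hypothetical path standing in for one book's full text:

    import textstat

    sample = open("book.txt").read()   # hypothetical path to one book's full text

    # The three readability formulas used in this work, as implemented by Textstat.
    scores = {
        "flesch_reading_ease": textstat.flesch_reading_ease(sample),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(sample),
        "smog_index": textstat.smog_index(sample),
    }
    print(scores)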
Using part-of-speech members or part-of-speech n-grams is not a new practice in natural-language-processing feature engineering, as shown in the work of Han [47] and Onan [48], but within the context of feature engineering the full-length text of books in an attempt to compare machine-learning model performance, this is a novel application. spaCy provides the ability to identify the part of speech for each token within a document. Parts of speech in the English language consist of syntactic categories words can be grouped into, e.g., one of the part-of-speech categories is a noun. A noun is a person, place, or thing. A preposition, which is another part-of-speech category, consists of words that are placed near a noun to establish positional relationships, e.g., "of", "by", and "with". The part-of-speech identifiers provided by spaCy are based on the Universal POS tag set [49], which is a list of 17 part-of-speech categories, namely adjectives, adpositions, adverbs, auxiliary verbs, coordinating conjunctions, determiners, interjections, nouns, numerals, particles, pronouns, proper nouns, punctuation marks, subordinating conjunctions, symbols, verbs, and one additional category called X, which subsumes all tokens that don't fall into any other category. The part-of-speech frequencies are normalized by counting the uses of each part-of-speech member and then dividing those counts by the total number of tokens in the whole document. Feature-Engineering Pipeline A pipeline was created to simplify and make efficient the process through which the full text of each book in the dataset would be transformed into machine-learning-friendly representations via each of the three feature-engineering techniques named above. The pipeline began by taking as input the full text of a given book and calculating the three reading complexity scores through the methods described above. Following this, a spaCy document was created based on the en_core_web_lg pretrained model, from which the spaCy document vector was obtained. To encourage the best results from the TF-IDF technique, text input was cleaned before being fed into a TF-IDF model. The cleaning process included lemmatization (removing word inflections and participles), normalizing capitalization within the text, removing non-alpha characters, and removing stop words, i.e., words that appear frequently but provide little or no meaning outside of context, such as "the", "but", or "a". For example, the sentence "Running is a great form of exercise. Exercising is a good way to stay active and fit." after being lemmatized would be "Run be a great form of exercise. Exercise be a good way to stay active and fit." After normalizing capitalization and removing non-alpha characters and stop words, the sentence becomes: "run great form exercise exercise good way stay active fit". While the result is much less sensible and readable to the human eye, it is better fitted for some feature-engineering methods which become less functional with noisy distractions such as capitalization, stop words, and inflections. Once text was cleaned as described above with the aid of spaCy's token attributes identifying relevant attributes of document members, the cleaned text was transformed into a TF-IDF vector representing the relative frequency of each word in the given document weighted by the importance of that word across the corpus. It should be noted that the TF-IDF representations for each book were based on a common vocabulary so that each dimension within the TF-IDF vector would have consistent significance and meaning across the whole dataset. Following this step in the pipeline, part-of-speech frequencies were calculated based on the spaCy document which had already been created. A list of the part-of-speech identifiers for each token within a document was created and then, for each unique identifier of the part-of-speech set, value counts were counted.
From this list, relative document frequencies for each part-of-speech member were calculated and returned for the given text along with the reading complexity scores and the spaCy and TF-IDF vectors. The pipeline allowed for all feature-engineering techniques to be applied to the text of the respective book given as input. The results were saved as variables and returned as a list, i.e., the preprocessing function of the pipeline took as input the book text and returned as output a list containing the spaCy vector, the TF-IDF vector, three columns for the three reading complexity scores, and 17 additional values representing the normalized part-of-speech frequencies. Challenges Faced with Feature Engineering After creating the pipeline as described above and performing all other necessary prerequisites, including testing that the pipeline worked as intended (which was done by running it on small subsets of the dataset), I came across the challenge of running out of computing resources. I will first describe the environment I was working in and the limitations of computing power and resources available there, and then describe how such limitations were met, the alternatives that were considered, and the solution I eventually arrived at. Besides the web crawler used to scrape GoodReads for the book genres, all the code used for this project was run in the Kaggle environment [50]. Kaggle is an online resource used by data scientists and offers remote computational environments consisting of 4 CPU cores, 16 Gigabytes of RAM, an optional GPU with 2 CPU cores and 13 Gigabytes of RAM, and an optional TPU with 4 CPU cores and 16 Gigabytes of RAM [51]. Kaggle notebooks, which take on a Jupyter [52]-like interface, allow for markdown-notated code notebooks written in Ruby, Python, R, and other languages. For this project, Python was used exclusively owing to the extensive set of libraries and packages available for it, e.g., Sci-kit Learn [53], Pandas [54], Numpy [55], and many others. A limitation to using the Kaggle kernel is that it limits notebook execution time to 9 hours and, as mentioned above, has a limit of 16 Gigabytes of RAM. Both these bounds became obstacles while running the whole dataset through the pipeline described above. I discovered that spaCy documents of full-length books ended up being several Gigabytes in size per book. Having thousands of books in my dataset, this became an immediate obstacle. I found that many of the features offered by default in spaCy's pretrained pipeline were computationally weighty and unnecessary for my purposes. Some of these features provided by default are named entity recognition, sentence dependency visualization via spaCy's displaCy visualizer, and complex morphology parsing. Named entity recognition is a feature whereby spaCy's pipeline can identify and tag words that refer to known entities with an entity token type. For example, the sentence "Apple and Google vie for aesthetic dominance for user interfaces" refers to two corporations, Apple and Google. spaCy's pretrained model is smart enough to identify these entities and will tag them as being entities. If this feature is enabled in the pipeline, a list of entities within a text can be obtained once the spaCy document has been created.
Because of the additional computation resources demanded by the spaCy pipeline to provide this feature, and because I didn't find entity recognition to be a worthwhile feature to exploit for feature-engineering purposes, I chose to turn off this feature in the pipeline. Sentence dependency visualization presented by displaCy, spaCy's visualizer, shows part-of-speech dependency relationships as a graphic, as shown in Figure 7. Figure 7: An example of spaCy's displaCy feature, which visualizes sentence token directional dependencies [63]. While this is a helpful tool in other contexts, for feature engineering it offers little or no relevance or utility. Other default features in the spaCy pipeline named above include morphological features, which include token and sentence tags identifying the mood, tense, and point of view of the writer for a given sentence of text. While this feature could potentially be utilized for feature-engineering purposes in a future work, I chose to avoid utilizing it for the sake of maintaining a simple pipeline and saving computation resources. I merely needed part-of-speech identification, tokenization, and a limited amount of morphology features. After adjusting the spaCy pretrained model to be more lightweight and suited for my pipeline, the memory demanded per spaCy document was reduced significantly but still exceeded the allotted 16 Gigabytes of RAM in the Kaggle environment. At that time, I was able to estimate the required memory and running time for my pipeline to process the whole dataset. My estimation, based on running small subsets of the whole dataset, was that I would need approximately 40 Gigabytes of RAM and roughly 30 hours of computing time. To meet these new demands, I considered using the Google Cloud Platform [56], which offers cloud computing services. This service, although costly, can provide virtual computer instances, or combinations of instances, to an impressive extent of preference and capability. Although this option would have worked, after examining other alternatives I decided to go with another solution. My first attempted solution was to serialize the spaCy document object for each text using Pickle [57], a Python package that allows for object serialization, and save such objects as external files. Not only was this difficult for organizational purposes, but it also exceeded the hard drive space within the Kaggle environment, which is 20 Gigabytes. A second attempted solution was to avoid saving the spaCy document objects for each text, since that was the most memory-intensive aspect of the process. As mentioned above, each spaCy document could potentially become as much as three or more Gigabytes in size, and multiplying that by the number of books in my dataset results in tens of thousands of Gigabytes that would have to be held in RAM. By combining several of the preprocessing steps, data cleaning, and feature engineering into one step that could be performed for each iteration in the dataset, I could simply create the spaCy document, use it for spaCy vectorization and for the document features necessary to prepare the natural text for the TF-IDF model (lemmatization, removing stop words, etc.), and then delete the document object to avoid accumulating massive memory demands. It is not typical to combine data cleaning, preprocessing, and feature engineering into a single step, but in this case, it provided a worthwhile benefit of saved memory.
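A condensed, illustrative sketch of this memory-saving approach; function and variable names are my own, and the disabled component names follow spaCy's conventions rather than the exact thesis code:

    from collections import Counter
    import spacy

    # Load the pretrained pipeline with the heavyweight components disabled;
    # POS tags and lemmas are still available for cleaning and frequency counts.
    nlp = spacy.load("en_core_web_lg", disable=["ner", "parser"])

    def engineer_book(text):
        """Create the spaCy document, extract everything needed, then discard it."""
        doc = nlp(text)

        spacy_vector = doc.vector              # 300-dimensional document vector

        # Lemmatized, lower-cased, alpha-only, stop-word-free text for the TF-IDF model.
        cleaned = " ".join(tok.lemma_.lower() for tok in doc
                           if tok.is_alpha and not tok.is_stop)

        # Relative part-of-speech frequencies, normalized by document length.
        pos_counts = Counter(tok.pos_ for tok in doc)
        total = sum(pos_counts.values())
        pos_freqs = {pos: count / total for pos, count in pos_counts.items()}

        del doc                                # free the multi-gigabyte Doc object
        return spacy_vector, cleaned, pos_freqs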
After testing this method on small subsets of data, I gained confidence to apply it to the whole dataset but still exceeded both the Kaggle memory and computing time limitations, although this time it was not by much. I merely needed to reduce the RAM demands by a few Gigabytes and the time by 10 or 15 hours. My successful solution which overcame these constraints was to split my dataset up into seven chunks, each containing approximately 1,500 books. This size approached, but did not exceed, the Kaggle memory and computing time limitations. After creating new notebooks associated with each of the seven chunks of the dataset, I was able to run all of them simultaneously and save the preprocessed, feature-engineered outputs of the books within each chunk as a serialized object saved to an external file. In a new notebook, I unpacked and combined the seven serialized objects into one object, a Pandas DataFrame, containing all 10,645 books from the original dataset, but now, instead of having the full text of each book, I had its representation as a spaCy vector, a TF-IDF vector, and the novel features described above. The DataFrame object which contained these thousands of books in their several feature-engineered forms took less than one Gigabyte of space in memory. Multilabel Classification With a feature-engineered dataset, I was now ready to build classification models to predict the various genres associated with each book in my dataset. There are different types of classification problems that vary in nature and complexity. For example, a simple type is binary classification. This is where a machine-learning model merely predicts a binary output for each example, such as "yes" or "no". A relevant example that differs from my classification objective would be to predict whether a book was fictional or not. Another, more complex, type of classification is multiclass classification. This is where the output is one of three or more of a finite set of classes or categories that are mutually exclusive. An example of this would be to predict whether an example animal is a cat, dog, fish, or snake. Such an output would be one and only one of the available categories. A third type of classification is multilabel classification, where the output for an example is zero or more of a finite set of labels or categories that are not necessarily mutually exclusive. For example, the objective of this work at this point could be used to demonstrate multilabel classification: predict which genre or genres apply to a given book. The output in this case would be one or more genres from a finite set of genres. Although technically zero labels could be an expected output in some multilabel problems, that would not occur in this dataset because of the filtration and design processes involved in its creation, which would have already disregarded any books which had zero associated genres. To mitigate the exponentially large complexity and difficulty which arises in multilabel classification problems as the number of distinct labels grows, I opted to use 10 genres as my labels, as described in the Cleaning the Data section. The number of labels in a multilabel classification problem can very much alter the efficacy of machine-learning models built upon the data, as shown in the work of Li [58].
This number of genres retains a good portion of the books in the dataset, disregarding whichever books did not have one or more of the ten genres, and presents a feasible classification problem for the machine-learning models. Building and Scoring Multilabel Classifiers Scoring multilabel classifiers can be quite a challenge. In fact, building multilabel classifiers can also be a challenge, and some of the solutions to one of these problems become part of the solution to the other. There are several common approaches to building multilabel classifiers, such as Classifier Chain, Label Powerset, and Binary Relevance. Classifier Chain builds a series of classifiers that are dependent upon the input data and output results of their prerequisite or parent classifiers. For example, a multilabel dataset with n labels is encoded so that the target column is split into n columns that respectively represent binary association with the n labels from the target column. This encoding process is referred to as One-Hot Encoding. An example of a multilabel dataset with three distinct labels is shown in Figure 8 being encoded and then being used to produce three classifiers exhibiting the classifier chain method described above. Figure 8: An example of how the Classifier Chain method works for classifying multilabel problems. Multiple classification models are built upon one another in a chain-like fashion. Example data for each model is shown in yellow. Label Powerset transforms the multilabel problem into a multiclass problem by representing each distinct combination of labels as a class. In Figure 9 there are four distinct combinations of labels in the multilabel dataset. The various encoded target labels are transformed into a single target column containing classes which refer to the corresponding combination of labels. Figure 9: An example of the Label Powerset method creating a classifier from multilabel data. Each unique combination of labels is represented by a class label. This method transforms multilabel problems into multiclass problems. Binary Relevance works by creating a binary classification problem for each label in the multilabel dataset. This is accomplished by creating an encoding for the labels like what is done with the Classifier Chain method. Once the target labels are encoded in this fashion, a classifier model is built for each column among the target columns, as shown in Figure 10. Here, n models are built for the n distinct labels in the dataset. Each model uses the same example data, shown in yellow in Figure 10, but the target data varies for each model (target columns shown in blue in Figure 10). To extract the multilabel prediction for a given example, Binary Relevance returns a binary value for each label indicating whether such label is predicted for the respective example. For example, the first example in the dataset, "x1", shown in Figure 10, is shown to be correlated with the labels "y1" and "y3". If this target were returned from a Binary Relevance solution, three classification models would exist, one for each distinct target label. The Binary Relevance solution would return binary values 1, 0, and 1, which respectively correlate with target labels "y1", "y2", and "y3". Figure 10: An example of multilabel data encoded according to One-Hot Encoding and then being split into several classification models, one for each distinct label in the dataset. Models return binary values indicating association with a given label. Examples are shown in yellow and targets in blue.
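The Binary Relevance strategy can be sketched with Sci-Kit Learn's OneVsRestClassifier, which likewise builds one classifier per label column; the feature values and genre sets below are invented and serve only as an illustration:

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import GaussianNB

    # Toy examples: feature vectors (invented) and their genre label sets.
    X = np.array([[0.2, 0.7], [0.9, 0.1], [0.4, 0.5], [0.8, 0.3]])
    genres = [["fiction", "classics"], ["historical"], ["fiction"], ["classics", "historical"]]

    # One-Hot Encode the label sets: one binary column per distinct genre.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(genres)          # shape (4, 3), columns ordered alphabetically

    # Binary relevance: one classifier per label column, all sharing the same examples.
    clf = OneVsRestClassifier(GaussianNB()).fit(X, Y)

    pred = clf.predict(np.array([[0.3, 0.6]]))
    print(mlb.classes_)                    # ['classics' 'fiction' 'historical']
    print(mlb.inverse_transform(pred))     # predicted genre set for the new example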
The scikit-multilearn library, which builds on Sci-Kit Learn, provides a BinaryRelevance class that implements this solution efficiently. Combining it with Sci-Kit Learn's MultiLabelBinarizer class, which implements One-Hot Encoding for multilabel and multiclass data, I was able to build multilabel classifier models.

Scoring multilabel classifiers, just like building them, is not a trivial task. One must decide whether to score output labels with simple true-or-false correctness (does the predicted set of labels exactly match the actual set or not?) or to give each correct label a partial score. What about inaccurately predicted labels? Should these be ignored in scoring, or should they count against the overall score for a given prediction? All these questions and more must be considered when choosing a scoring metric for multilabel classification.

I chose to use a combination of three scoring metrics: a per-label accuracy score, as implemented by Sci-kit Learn's accuracy_score; the Jaccard score; and a new metric built on a string-matching algorithm based on Levenshtein distance. The accuracy_score function of the Sci-kit Learn metrics library, applied to the encoded label vectors, gives a partial score for each predicted label. The partial score per label is equal to one divided by the total number of labels. For the ten genre labels in this dataset, a prediction gives some number of genres as output, which is restructured in the encoded fashion described above so that a one or zero is ascribed to each label; [fiction, romance, fantasy] might be represented as [1, 0, 0, 0, 1, 0, 0, 1, 0, 0], where each binary number in the list indicates whether the respective genre is predicted. This vector is then compared against the actual genre labels for that example. For each of the ten possible genres, the prediction is rewarded one-tenth for each correctly assigned label, and this applies to both affirmative and negative predictions of a given genre. Figure 11 shows an example in which eight of the ten genres were predicted correctly, giving one-tenth for each correct label and a total score of 0.8 out of 1.

Figure 11: An example of the accuracy score from Sci-Kit Learn's scoring metrics library being used on binary values indicating which of the 10 distinct labels in the dataset are associated with a given example. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The accuracy_score method returns a value from 0 to 1, 1 being 100% accuracy and 0 being 0% accuracy.

Figure 12: An example of the Jaccard score from Sci-Kit Learn's scoring metrics library being used on binary values indicating which of the 10 distinct labels in the dataset are associated with a given example. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The jaccard_score method returns a value from 0 to 1, 1 being 100% set similarity and 0 being 0%.

The second scoring metric used was the Jaccard score, or Jaccard similarity coefficient. This metric takes the size of the intersection of the predicted and actual label sets and divides it by the size of the union of the same two sets. It punishes mistaken predictions more harshly than the accuracy_score metric described above while still encouraging positive predictions.
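As a small, self-contained illustration of these two Sci-Kit Learn metrics on a single encoded example (the ten binary values are made up for this sketch, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, jaccard_score

# Ten binary genre indicators for one book: made-up values for illustration.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]

# Comparing each of the ten indicators separately gives one-tenth credit
# per correctly predicted label, whether that label is positive or negative.
print(accuracy_score(y_true, y_pred))   # 0.7 -> 7 of 10 labels match

# Jaccard: shared positives divided by the union of positives;
# correct 0s neither help nor hurt this score.
print(jaccard_score(y_true, y_pred))    # 2 shared positives / 5 in the union = 0.4
```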
An example of this metric in use is shown in Figure 12. The Jaccard scoring metric does not reward the accurate prediction of a negative value for a genre. For example, if there were an eleventh genre in the example of Figure 12 with a value of "0" in both the predicted and actual sets, the score would not change; if that value were "1" for both sets, however, the score would slightly improve.

The third and final scoring metric was a new, or non-traditional, metric I devised which works very similarly to the accuracy_score metric implemented by Sci-kit Learn. This method uses the Levenshtein distance to calculate the similarity between strings. By converting the predicted and actual sets of genres used in the previous examples to strings, the method can be applied to both. The package used to implement the Levenshtein distance in this way is FuzzyWuzzy, as shown in Figure 13. This scoring metric rewards correct negative genre predictions as well as positive ones and is the most generous of the metrics described. The reason for using it in addition to accuracy_score and jaccard_score from the Sci-Kit Learn library was to have a metric that explicitly rewards correct negative values for a label as well as positive ones. In this way I had three scoring metrics that rewarded or punished positive and negative label values differently but consistently, providing a well-rounded overall score.

Figure 13: An example of the FuzzyWuzzy library being used to compute string similarity for two strings. The strings here represent 10 binary values correlating with distinct labels in the dataset. "y_pred" refers to predictions made by the classification model. "y_true" refers to the actual values. The ratio method returns a score from 0 to 100 representing how similar the two strings are.

Most Important Features

I will describe briefly how the most important features of the novel techniques were found. My brevity is because the models built for this objective proved mostly ineffective, due to challenges that will be described later. However, this step is still of some importance, if only for the sake of demonstrating depth and rigor of effort.

The most important features for the multilabel classification problem, meaning the features among those provided by the novel techniques that had the most significant correlation with the corresponding genre labels, were AUX, INTJ, SCONJ, SYM, and X, which respectively stand for auxiliary verb, interjection, subordinating conjunction, symbol, and other. Symbols include word-like entities that differ from ordinary words in form or function, such as $, %, and @. X (other) includes words that cannot be assigned to any other part-of-speech group; these might be nonsensical words such as xfgh, pdwl, or jsdfjkl. These features were obtained via recursive feature elimination and are shown in Figure 14. Because the novel feature-engineering techniques did not yield accurate predictors from any of the machine-learning models, the relevance of these important features is negligible. The same applies to the other technique employed to discover feature importance, namely correlation heatmaps, as shown in Figure 15.

Figure 14: A bar graph showing values indicating the ranking of each feature in terms of its estimated importance for predicting a particular column. The values here are a sum of the ranking positions of each feature across every label in the dataset. In other words, a ranking was assigned to each feature for the first label in the dataset, where rank 1 meant the most important feature; this was repeated for each label, so the overall best-ranked features ended up with the smallest total rank value. Five columns, highlighted in green, have the smallest total rank values, meaning they were consistently ranked as most important across the labels in the dataset.
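A hedged sketch of how per-label rankings like those summed in Figure 14 could be produced with Sci-Kit Learn's recursive feature elimination follows. The synthetic data and the logistic-regression estimator are assumptions made only for illustration (RFE needs an estimator that exposes coefficients or importances; the thesis does not specify which estimator was used).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the novel-feature matrix and one-hot genre labels.
X, Y = make_multilabel_classification(n_samples=200, n_features=15,
                                      n_classes=10, random_state=0)

# Rank every feature for every label, then sum the ranks; smaller totals
# mean the feature was consistently ranked as more important.
total_rank = np.zeros(X.shape[1])
for j in range(Y.shape[1]):
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
    rfe.fit(X, Y[:, j])
    total_rank += rfe.ranking_   # rank 1 = most important for label j

print(np.argsort(total_rank)[:5])  # indices of the five best-ranked features
```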
Challenges Faced with Multilabel Classification

After building machine-learning models, which included KNN, SVC, and Gaussian NB, I tuned the hyperparameters for each with GridSearchCV, a tool in the Sci-kit Learn library that allows lists or ranges of hyperparameter values to be tested and scored so that optimal hyperparameters can be selected. I then ran each model with its hyperparameters tuned for optimal accuracy according to each of the scoring metrics described in the Building and Scoring Multilabel Classifiers section. Initially, the results were promising, as shown in Figure 16, but I soon realized that the models that performed well were simply predicting positive values for the two or three most common genres in the dataset and negative values for everything else.

Figure 15: Correlation heatmaps for four of the ten genre labels from the dataset. These heatmaps show colors corresponding to the level of correlation each feature in the novel-feature-engineered dataset had with a given target label. Only four target labels are shown here: Fiction, Classics, Historical, and 20th-Century. The scale used for this graphic is -0.04 to 0.04. Almost no correlation values exceeded these minimum and maximum bounds, showing that no statistically significant correlation was found in this examination.

Because the dataset had not been normalized to prevent this, such predictions resulted in relatively good overall scores. But after redistributing the dataset so that each genre had roughly equal representation, the models showed no statistically significant performance in predicting genres; in fact, all the models produced predictions no better than random guesses. Attempts were made to overcome this poor performance, including further hyperparameter tuning, using fewer genre labels, reducing noise in the dataset by removing unimportant features, and even reducing the classification problem to a simple binary question of whether a book was fictional. None of these attempts significantly improved the predictive models.

Figure 16: Bar graphs comparing classification scores for the three feature-engineered datasets on three different machine-learning classification models. The first row shows results for the K-Nearest Neighbors classification model, the second for Support Vector Machine, and the third for Gaussian Naïve Bayes. The TF-IDF feature-engineered model scores are shown in blue, spaCy in orange, and novel in green. Three clusters of bars within each row represent the different scoring techniques employed: the left-hand cluster shows scores calculated via the FuzzyWuzzy method, the middle via Sci-Kit Learn's accuracy score, and the right-hand cluster via Sci-Kit Learn's Jaccard score implementation.
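As a rough sketch of the hyperparameter search described above (the parameter grid and toy data are illustrative, not the grid or data actually used in the thesis):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Toy data standing in for one of the per-label binary problems.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate hyperparameter values to test; chosen only for illustration.
param_grid = {
    "n_neighbors": [3, 5, 7, 11],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```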
Eventually, decomposition analysis and its visualizations led to the realization that the problem most likely resided in the dataset itself, not in the models or in any augmentation of the dataset. Even decomposition techniques such as Principal Component Analysis, Dictionary Learning, and Kernel Principal Component Analysis could not reduce the data into components that were discernible with respect to the associated genre labels. With this understanding, I reviewed the correlation heatmaps of the data for each genre and noticed that the correlation coefficients were almost never statistically significant for any feature and genre when using a significance level of α = 0.05.

Figure 17: Scatter plots showing two components of a Principal Component Analysis performed on the spaCy dataset for each of the distinct labels in the dataset. The dataset here is encoded via One-Hot Encoding, and a binary classification problem is extracted for each label in accordance with the Binary Relevance solution. Each binary problem is fed into a Principal Component Analysis, and two components are returned and plotted as shown. Ideally, components produced through this method show clear distinction and separation among classes, but here separation is not achieved for any label in the dataset.

Figure 17 shows the results of Principal Component Analysis (PCA) when run on the dataset structured as a binary classification problem for each of the ten genres. The ten genre labels are shown in the legend area of each plot. It can be clearly seen that there are no discernible components separating the positive and negative values of any genre label. This additional insight led me to conclude that the books and associated genre labels in this dataset constitute bad data. The reasons this data could be bad are various. One reason could be that the genre labels obtained from GoodReads, being crowdsourced, do not follow logical or consistent patterns in how they are attributed to each book. The idea that predictive models can be built for this multilabel scenario assumes that there are discernible patterns or differences in the text of the various genres. Such an assumption, while it may hold for smaller datasets with limited books, genres, or authors, appears to be false as the diversity of the dataset increases. More on this point will be discussed later.

Author Prediction: A New Approach

Having discovered substantial difficulties in predicting the GoodReads genres for books in the dataset, I considered alternative approaches that would still satisfy the initial research question of how my novel feature-engineering techniques compare to more established techniques. The new problem of predicting the author of each book proved satisfactory: it was not only possible with the current dataset, but it was also a suitable problem on which classification models could demonstrate performance for the various feature-engineering techniques. Authorship attribution, or the prediction of the author of a given text, has been studied and refined since the 19th century, when Mendenhall examined plays supposedly written by Shakespeare. Others examined "The Federalist Papers," which are attributed to John Jay, Alexander Hamilton, and James Madison, in an attempt to distinguish which segments were written by which author.
These attempts have been built upon ever since with increasingly sophisticated and successful methods and more powerful computational resources [59]. Such improvements have allowed deeper and more rigorous application of authorship-attribution methods to large corpora of texts that would have been infeasible to analyze through traditional manual methods. In addition to being a subject of academic interest, authorship attribution has yielded practical benefits in detecting plagiarism [60], forensic profiling [29], maintaining textual continuity in collaborative works [61], and many other areas of professional, educational, and legal benefit.

To prepare the existing dataset for this new problem, the author of each book had to be extracted. As structured, each book contained its title and authors at the beginning of the full-book text. Using Regular Expressions, this information was extracted and saved to the DataFrame, which then contained the full-book text of each book as the example and the associated author as the output.

An analysis of author prolificness, that is, how many books each author contributed, and of the distribution of books across authors was performed to understand what size of dataset could be used for the machine-learning models. This helped determine how many authors could and should be used to build the models as well as how many books each author could provide. The number of books per author and the number of authors used have an inverse relationship when building the dataset, because as the number of authors increases, the number of books used per author must stay close or equal to the minimum number of books attributed to any included author. This keeps the dataset normalized and equally distributed. If one author dominated the dataset, the models might simply resort to guessing that author for every example, which would not allow the true performance of each feature-engineering technique to be measured.

The raw counts of books per author showed that nearly 25% of the books in the dataset were written by one author, and the eleven most prolific authors comprised nearly 50% of the books in the dataset. The least prolific authors, with ten books each, represented only about 0.5% of the dataset per author.

Figure 18: A bar graph showing the number of books appearing in the dataset for each author. Al Haines dominates as the most prolific author in the dataset, largely skewing results with over 500 books under his name. All other authors save one have fewer than 100 books in the dataset under their name.

As shown in Figure 18, most authors had about 25 books in the dataset. The mean number of books per author was 27.26, but after removing the most prolific author, whose name revealed no actual authorship status when researched, the average dropped to 20.81. The median number of books per author was 15. Figure 19 shows an alternative visualization, a pie chart of the relative frequencies of books per author within the dataset. Following the same preprocessing and feature-engineering procedures described in the Preprocessing and Feature Engineering the Text section, I added columns to the DataFrame containing the three main feature-engineered representations, namely a spaCy vector, a TF-IDF vector, and the novel features described.
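A minimal sketch of the balancing step described above, assuming a hypothetical DataFrame with an "author" column (the column name and sample sizes are illustrative):

```python
import pandas as pd

def balanced_subset(books: pd.DataFrame, authors: list,
                    books_per_author: int, seed: int = 0) -> pd.DataFrame:
    """Return an equal number of books per author so that no single author
    dominates the training data."""
    subset = books[books["author"].isin(authors)]
    return (subset.groupby("author", group_keys=False)
                  .apply(lambda g: g.sample(n=books_per_author, random_state=seed)))

# Toy usage; in the thesis this would be, for example, 37 books each for three authors.
toy = pd.DataFrame({"author": ["A"] * 5 + ["B"] * 3 + ["C"] * 4,
                    "title": [f"book{i}" for i in range(12)]})
print(balanced_subset(toy, ["A", "B"], books_per_author=3))
```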
Figure 19: A pie chart showing authors' relative contributions to the books in the dataset. Each author's slice of the pie indicates the portion of the total books in the dataset under their name.

Figure 20: A table of scatterplots showing two components from a Principal Component Analysis performed on combinations of feature-engineered datasets and numbers of books per author, where the number of authors is three. The first row shows three scatterplots, one for each feature-engineering technique, where the dataset includes one book per author. The second row shows the same for four books per author, and so forth, up to the last row, which has 37 books per author.

Figure 20, Figure 21, and Figure 22 show various visualizations describing the data and the performance of each machine-learning model for each of the feature-engineering techniques applied to the data. There are several factors in this scenario worth examining. In addition to the three feature-engineering techniques used to represent the book data and the three machine-learning algorithms applied to each of them, other factors to consider are the number of books per author and the number of authors included in the dataset. The figures referenced above show how performance and results change as the book count for each author increases.

Figure 21: The average scores for each feature-engineering technique on the datasets with incrementing book counts for three authors. This line chart is derived from Figure 22, but instead of showing each of the three machine-learning model results for the three feature-engineering techniques, an average of all three models is shown for the three feature-engineering methods on the 13 rows from Figure 20.

Figure 22: Three line charts showing the accuracy of each machine-learning model on datasets of incrementing book count for three authors. This depicts the scores for the 13 rows found in Figure 20. Accuracy was computed as an average of the three scoring metrics described earlier.

For my dataset, the largest number of books I could supply for three authors was 37, because 37 was the number of books that the least prolific of the three authors contributed to the dataset. With a larger dataset, this number could be increased.

Figure 23: A table of scatterplots showing two components from a Principal Component Analysis performed on combinations of feature-engineered datasets and numbers of authors, when the number of books per author is 37. The first row shows three scatterplots, one for each feature-engineering technique, where the dataset includes two authors. Subsequent rows show the same for increasing numbers of authors, up to the last row, which includes five authors.

Figure 20 shows that, for the three authors selected in the example, as the number of books per author increases, the two PCA components form mostly distinct and separable clusters. To the human eye, the spaCy and novel feature-engineering techniques appear to allow better separation via PCA than the TF-IDF technique. This difference is borne out when examining the performance results of the three machine-learning models built from the three feature-engineering techniques.
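A rough sketch of producing two PCA components for one feature-engineered representation, in the spirit of Figures 20 and 23, follows. The data here is synthetic; in the thesis the input would be the spaCy, TF-IDF, or novel feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 37 "books" each for three "authors", 50 features apiece.
X = np.vstack([rng.normal(loc=i, size=(37, 50)) for i in range(3)])
authors = np.repeat(["author_1", "author_2", "author_3"], 37)

# Reduce the feature-engineered matrix to two components for plotting.
components = PCA(n_components=2).fit_transform(X)
print(components.shape)  # (111, 2) -> one (x, y) point per book

# Each (x, y) pair can then be scattered and colored by author to judge
# visually how separable the authors are under this representation.
```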
Figure 21, which shows the average performance of the three machine-learning algorithms for each feature-engineering technique in Figure 20, shows that the spaCy and novel feature-engineering techniques almost always lead to better performance than the TF-IDF model. Taking the five average performance scores that used the most books (the last five shown in Figure 22), the average accuracy scores were 75.19% for spaCy, 61.21% for TF-IDF, and 70.02% for the novel techniques. The reason for using only these last five averages is that the first few predictions, which were based on ten or fewer books per author, performed worse than average for every model; the performance of each model evened out after reaching about 19 books per author.

Figure 23 shows visualizations much like Figure 20, where two PCA components are plotted, but here, instead of incrementing the number of books per author, the number of unique authors in the dataset is incremented, each author having 30 books in the dataset. It is interesting to note that certain authors are more distinguishable from others according to the PCA component results for any given feature-engineering model. Further, while two authors may be well distinguished under one feature-engineering model, the same authors may be indistinguishable under another. This opens a discussion as to what factors or features of an author's writing style most distinguish them, and which feature-engineering technique is best suited to capture these. The most important features and the most effective feature-engineering methods for predicting text authorship would likely differ from those for another predictive problem, such as genre prediction or spam prediction. Because each natural-language classification problem is different and demands different features and information from the text, feature-engineering techniques cannot necessarily be given one performance ranking for all problems; there may not be a single best feature-engineering technique for natural-text classification. The technique that is best for a given classification problem depends on what data the classification most relies upon: for some problems this may be data best captured in a TF-IDF model, and for others the novel techniques may be best.

To be thorough in examining these principles, I will provide further results relevant to these points. Figure 24 and Figure 25 show the 20 possible three-author combinations of six prolific authors from the dataset, namely Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, two PCA components are plotted showing how separable each author is from the others.

Figure 24: The first ten of twenty combinations of six selected prolific authors: Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, results are shown for two components of Principal Component Analysis on a dataset including 37 books from each of the three authors in the given combination. Three scores are shown to the right of each scatterplot, giving the performance of three machine-learning models built on the dataset containing 37 books from each of the three authors. The scores correspond, from left to right, to K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. An average performance score of the three models is shown outside the bar graphs on the right of each row.
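The twenty combinations come from choosing three authors out of the six. A hedged sketch of how such an experiment loop might be organized follows; the evaluate function is a hypothetical placeholder standing in for building the balanced dataset, fitting the three models, and averaging the three scoring metrics.

```python
from itertools import combinations

AUTHORS = ["Mark Twain", "John Galsworthy", "George Manville Fenn",
           "Nathaniel Hawthorne", "Martin Ward", "William Dean Howells"]

def evaluate(author_triple):
    """Hypothetical placeholder: build the 37-books-per-author dataset,
    fit KNN, SVC, and GaussianNB, and return their average score."""
    raise NotImplementedError

triples = list(combinations(AUTHORS, 3))
print(len(triples))  # 20 combinations, as in Figures 24 and 25

# results = {triple: evaluate(triple) for triple in triples}
```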
Figure 25: The second ten of twenty combinations of six selected prolific authors: Mark Twain, John Galsworthy, George Manville Fenn, Nathaniel Hawthorne, Martin Ward, and William Dean Howells. For each combination, results are shown for two components of Principal Component Analysis on a dataset including 37 books from each of the three authors in the given combination. Three scores are shown to the right of each scatterplot, giving the performance of three machine-learning models built on the dataset containing 37 books from each of the three authors. The scores correspond, from left to right, to K-Nearest Neighbors, Support Vector Machine, and Gaussian Naïve Bayes. An average performance score of the three models is shown outside the bar graphs on the right of each row.

The performance of the three machine-learning models built upon each respective dataset shows that the particular authors in a classification situation hugely affect the classifier's ability to predict those authors accurately. Figure 26 shows just how much difference altering the group of authors can make for a given model. It shows the average scores for each of the 20 combinations of authors shown in Figure 24 and Figure 25: the x-axis represents the iteration over the 20 different author combinations, and the y-axis shows the average accuracy of the three models built on the dataset for those authors. Models performed, on average, substantially differently for different author groups. When using the authors George Manville Fenn, Mark Twain, and Nathaniel Hawthorne, the accuracy of the predictive models averaged only 49%. But when using the authors George Manville Fenn, Nathaniel Hawthorne, and William Dean Howells, the models performed with an accuracy of 100%.

Figure 26: Average scores for the 20 combinations of the six selected prolific authors shown in Figure 24 and Figure 25. The average is taken over the three machine-learning models' performance shown in Figure 24 and Figure 25. Twenty average scores are shown, ranging between 0 and 1, 1 being 100% accuracy.

It can be asked whether the difficulty of telling authors apart in these results can be reduced to some factor not described in the data. Did the authors with which the models had 49% accuracy share vernaculars, preferred writing genres, or influences? In the first example, which had 49% accuracy, George Manville Fenn (1831–1909) was an English novelist and journalist who wrote many historical-fiction novels for a young-adult audience. Mark Twain (1835–1910), the pen name of Samuel Langhorne Clemens, was an American writer who contributed mostly novels and short fiction. Nathaniel Hawthorne (1804–1864) was an American writer who contributed much fiction centered on New England, featuring moral metaphors expressed through a dark romanticism, as part of the Romantic movement.
Although these authors were contemporaneous, native English speakers, and influenced to some degree by the Romantic movement during which they were all born, the analyses shown above had difficulty distinguishing between them. What makes the case more curious is that the group of authors that achieved 100% accuracy included two of the three authors from this case and replaced Mark Twain with William Dean Howells. William Dean Howells (1837–1920), whose lifespan very closely overlapped Mark Twain's, was, like Twain, an American realist novelist, literary critic, and playwright. Mark Twain and William Dean Howells were not only contemporaries with much in common; they were friends who frequently spent time together. The worst- and best-performing author groups thus differed by only one author, and the two authors swapped were remarkably similar in writing style, time era, and preferred genres. This oddity suggests that features such as preferred genre, time era, and writing style, as expressed in the spaCy vector representation, can sometimes distinguish between very similar authors, namely Mark Twain and William Dean Howells, yet seem unable to distinguish among less similar authors: Mark Twain, George Manville Fenn, and Nathaniel Hawthorne. It should be noted that the differences in an author's writing style depicted in the PCA results above depend on the feature-engineering technique used, which in this case was the spaCy technique. The spaCy technique had the highest average performance in all other demonstrations, so it is a safe assumption that neither the novel feature-engineering techniques nor TF-IDF would provide new information regarding the PCA results.

Figure 27: Correlation heatmaps for eight authors found in the dataset. Colors indicate how correlated the features from the novel feature-engineered dataset are with each author. Correlation values greater than 0.05 or less than -0.05 indicate statistically significant correlation. The scale of correlation here goes from -1.0 to 1.0. Note that the author Al Haines is represented in this collection of heatmaps because the analysis was performed on all authors in the dataset before any were removed. Although Al Haines was removed for data-normalization purposes, representing his works in the heatmaps does not skew any results.

Figure 27 shows how, depending on the author, different features from the novel-techniques feature set are correlated. The only author having a statistically significant correlation with any of the reading complexity scores is Martin Ward. Mark Twain, on the other hand, has negligible correlation with any of the features here. Other authors vary in which features they are correlated with and by how much.

Conclusion

Based on the findings of this work, I conclude that there is no supreme feature-engineering technique. Although some techniques tend to lead to better machine-learning performance than others, not every classification problem demands the same information, and because of this, no feature-engineering technique among those compared here can be named "best". Different classification problems, such as genre prediction or author prediction, call on different aspects of the text. Those aspects may be represented better by one feature-engineering technique than another, but no single technique captures every distinction relevant to any given classification problem.
When choosing which feature-engineering technique to use for a given problem, one ought to compare several techniques and measure their performance to determine which technique truly optimizes performance for that classification problem. Although many popular techniques, such as the spaCy vectorization used in this work, have a high average score and are safe candidates for most classification problems, it cannot be known without rigorous testing that another technique would not be better suited to the problem. The answer to the research question is that the best feature-engineering technique cannot be generalized across all classification problems and must be tested for each case. In the case of genre prediction, the answer could not be obtained because of defects in the labels of the data. In the case of author prediction, the spaCy technique led to better predictive-model performance, on average. In addition to answering the research question, the results of this work led to interesting discoveries about authors' writing styles. Some authors, irrespective of their preferred genre, time era, or vernacular, are difficult to distinguish, while in other cases the opposite is true. Future work may include an analysis of authorship differences in light of these findings. Further, it would be useful to find new features for representing authors that may be more useful than those suggested in this work.

References

[1] R. J. Z.-H. Z. Yin Zhang, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, pp. 43-52, 2010.
[2] S. M. R. N. P. M. C. L. S. S. W. Ella Rabinovich, "Personalized Machine Translation: Preserving Original Author Traits," in EACL, Valencia, 2017.
[3] S. W. Ella Rabinovich, "Unsupervised Identification of Translationese," Transactions of the Association for Computational Linguistics, 2016.
[4] Y. I. L. M. Mahmoud Khonji, "Authorship Identification of Electronic Texts," IEEE Access, vol. 9, pp. 101124-101146, 2021.
[5] P. M. K. A. Mudit Bhargava, "Stylometric Analysis for Authorship Attribution on Twitter," in International Conference on Big Data Analytics, 2013.
[6] H. C. Ahmed Abbasi, "Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace," ACM Transactions on Information Systems, vol. 26, no. 2, pp. 1-29, 2008.
[7] B. Plank, "Predicting Authorship and Author Traits from Keystroke Dynamics," in Association for Computational Linguistics, New Orleans, 2018.
[8] J. L. H. Evelyn Fix, "Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties," USAF School of Aviation Medicine, Randolph Field, 1951.
[9] R. S. Unggul Widodo Wijayanto, "An Experimental Study of Supervised Sentiment Analysis Using Gaussian Naïve Bayes," International Seminar on Application for Technology of Information and Communication, pp. 476-481, 2018.
[10] I. Rish, "An Empirical Study of the Naïve Bayes Classifier," Université de Montréal, 2001.
[11] D. L. Haiyi Zhang, "Naïve Bayes Text Classifier," 2007 IEEE International Conference on Granular Computing (GRC 2007), p. 708, 2007.
[12] P. Majumder, "Gaussian Naive Bayes," OpenGenus IQ: Computing Expertise & Legacy, 2021. [Online]. Available: https://iq.opengenus.org/gaussian-naive-bayes/. [Accessed 20 October 2021].
Joachims, "Text categorization with Support Vector Machines: Learning with many relevant features," in European Conference on Machine Learning, Chemnitz, 1998. [14] K. K. Gabriel Barney, "Predicting Genre from Movie Posters". [15] V. B. R. R. R. M. K. R. R. M. Varshit Battu, "Predicting the Genre and Rating of a Movie Based on its Synopsis". [16] B. Q. T. L. Duyu Tang, "Document Modeling with Gated Recurrent Neural Network". [17] A. L. D. E. James Bergstra, "Predicting genre labels for artists using FreeDB". [18] D. C. François Pachet, "A Taxonomy of Musical Genres". [19] M. M.-y.-G. F. A. G. T. S. Suraj Maharjan, "A Genre-Aware Attention Model to Improve the Likability Prediction of Books". [20] H. A. Soraya Anvari, "Book2Vec: Representing Books in Vector Space without using the Contents," in 8th International Conference on Computer and Knowledge Engineering, Mashhad, 2018. [21] B. J. X. G.-i.-N. Victor Campos, "From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction". [22] K. B. A. W. K.-L. Yichen Tang, "Enriching feature engineering for short text samples by language time series analysis". 68 [23] D. S. S. C. P. K. Donghwa Kim, "Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec," Information Sciences, pp. 15-29, 477. [24] S. C. S. A. Keun Lee, "Web-based algorithm for cylindricity evaluation using support vector machine learning," Computers & Industrial Engineering, vol. 60, no. 2, pp. 228-235, 2011. [25] A. D. J. C. K. M. J. C. O. D. L. K. P. E. A. Clayton A. Turner, "Word2Vec inversion and traditional text classifiers for phenotyping lupus," BMC Medical Informatics and Decision Making, vol. 17, no. 1. [26] F. M. G. D. Khadim Dramé, "Large scale biomedical texts classification: a kNN and an ESA-based approaches," Journal of Biomedical Semantics, vol. 7, 2016. [27] R. T. F. S. T. Richmond Hong, "Authorship Identification for Online Text," in International Conference on Cyberworlds, 2010. [28] A. A. a. H. Chen, "Writeprints: A Stlometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace". [29] W. J. S. C. W. F. T. C. A. T. B. S. A. R. B. C. E. S. Anderson Rocha, "Authorship Attribution for Social Media Forensics," IEEE Transactions on Information Forensics and Security, vol. 12, no. 1, pp. 5-33, 2016. [30] "Project Gutenberg," Urbana, Illinois, [Online]. Available: https://www.gutenberg.org/about/. [Accessed May 2021]. 69 [31] "GoodReads," GoodReads, [Online]. Available: https://www.goodreads.com/. [Accessed May 2021]. [32] S. R. Department, "Number of registered members on Goodreads from May 2011 to July 2019," GoodReads, July 2019. [Online]. Available: https://www.statista.com/statistics/252986/number-of-registered-members-on-goodreadscom/. [Accessed 22 September 2021]. [33] M. d. C. Saavedra, Don Quixote, Francisco de Robles, 1605. [34] J. Austen, Pride and Prejudice, Simon & Schuster, 1797. [35] "FuzzyWuzzy," [Online]. Available: https://github.com/seatgeek/fuzzywuzzy. [Accessed 22 September 2021]. [36] L. Richardson, Beautiful Soup, April 2007. [Online]. Available: https://beautiful-soup-4.readthedocs.io. [Accessed June 2021]. [37] G. Tsoumakas, "Multi-label classification: an overview," International journal of data warehousing and mining, vol. 3, no. 3, p. 1, 2007. [38] J. K. Wei Bi, "Efficient Multi-label Classification with Many Labels," Proceedings of Machine Learning Research, vol. 28, pp. 405-413, 2013. [39] M. H. a. I. Montani, "spaCy," Explosion, [Online]. 
Available: https://spacy.io/. [Accessed July 2021]. 70 [40] B. Srinivasa-Desikan, "Natural Language Processing and Computation Linguistics," in Natural Language Processing and Computation Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras, Birmingham, Packt Publishing Ltd., 2018. [41] M. H. a. I. Montani, "spaCy: Word vectors and semantic similarity," Explosion, [Online]. Available: https://spacy.io/usage/linguistic-features#vectors-similarity. [Accessed July 2021]. [42] K. S. Jones, "Letters to the Editor," Journal of the American Society for Information Science , vol. 24, no. 2, pp. 166-167, 1973. [43] R. W. P. L. K. F. W. K. L. K. Ho Chung Wu, "Interpreting TF-IDF term weights as making relevance decisions," ACM Transactions on Information Systems, vol. 26, no. 3, pp. 1-37, 2008. [44] Ö. B. E. U. Đlker Nadi Bozkurt, "Authorship Attribution," in International International Symposium on Computer and Information Sciences, Ankara, 2007. [45] G. H. McLaughlin, "SMOG Grading—a New Readability Formula," Jounral of Reading, vol. 12, no. 8, pp. 639-646, 1969. [46] "Textstat," Textstat, [Online]. Available: https://textstat.readthedocs.io/en. [Accessed 4 October 2021]. 71 [47] Y. D. L. M. Y. W. S. L. X. Z. Xue Han, "A Novel Part of Speech Tagging Framework for NLP Based Business Process Management," IEEE International Conference on Web Services (ICWS), pp. 383-387, 2019. [48] A. Onan, "An ensemble scheme based on language function analysis and feature engineering for text genre classification," Journal of Information Science, vol. 44, no. 1, pp. 28-47, 2016. [49] "Universal POS tags," Universal Dependencies contributors, [Online]. Available: https://universaldependencies.org/docs/u/pos/. [Accessed 4 October 2021]. [50] "Kaggle," Google, [Online]. Available: https://www.kaggle.com. [Accessed 5 October 2021]. [51] "Kaggle Documentation technical specifications," Google, [Online]. Available: https://www.kaggle.com/docs/notebooks#technical-specifications. [Accessed 5 October 2021]. [52] F. Perez, "Project Jupyter," [Online]. Available: https://jupyter.org/about. [Accessed 5 October 2021]. [53] [Online]. Available: https://scikit-learn.org/stable/. [54] W. McKinney, "Pandas - Python Data Analysis and Manipulation Tool," [Online]. Available: https://pandas.pydata.org/. [Accessed 5 October 2021]. 72 [55] T. Oliphant, "Numpy," CZI, [Online]. Available: https://numpy.org/. [Accessed 5 October 2021]. [56] "Google Cloud Platform," Google, [Online]. Available: https://cloud.google.com/. [Accessed 5 October 2021]. [57] "Pickle—Python object serialization," [Online]. Available: https://docs.python.org/3/library/pickle.html. [Accessed 5 October 2021]. [58] J. O. X. Z. Ximing Li, "Supervised topic models for multi-label classification," Neurocomputing, no. 149, pp. 811-819, 2014. [59] E. Stamatatos, "A survey of modern authorship attribution methods," Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538-556, 2008. [60] B. S. M. K. Sven Meyer zu Eissen, "Plagiarism Detection Without Reference Collections," in Advances in Data Anlaysis, Berlin, 2007. [61] D. K. P. V. B. B. S. I. Jeff Collins, "Detecting Collaborations in Text Comparing the Authors' Rhetorical Language Choices in The Federalist Papers," Computers and the Humanities, vol. 38, pp. 15-26, 2004. [62] P. Majumder, "Gaussian Naive Bayes," OpenGenus IQ, [Online]. Available: https://iq.opengenus.org/gaussian-naive-bayes/. [Accessed 23 November 2021]. 73 [63] M. 
Honnibal, "spaCy Visualizations," Explosion, 2021. [Online]. Available: https://spacy.io/usage/visualizers. [Accessed 23 November 2021]. |
Format | application/pdf |
ARK | ark:/87278/s6xftxtn |
Setname | wsu_smt |
ID | 96852 |
Reference URL | https://digital.weber.edu/ark:/87278/s6xftxtn |