Title | Romine, Samuel_MCS_2022 |
Alternative Title | Primary Texts and Their Effects on Sentiment and Emotional Analysis |
Creator | Romine, Samuel |
Collection Name | Master of Computer Science |
Description | This Master of Computer Science thesis shows that the source of sentiment annotations does not have a significant impact on the prediction quality of machine learning models.
Abstract | When performing sentiment analysis, it is common to derive sentiment from a multitude of sources, including lexicons, crowdsourcing, online tools, or even reading and analyzing the text yourself as the researcher. In my research I prove that you cannot simply derive your own sentiment from another person's text to test the validity of your sentiment analyzers, but that you must use the original author's sentiment as the basis for modeling the accuracy of various analyzers. To create such a set of data, I used MTurk, a surveying service provided by Amazon, to distribute various surveys to create primary annotated text. Using the basic emotion theory presented by Paul Ekman, I was able to create data with annotated sentiment and emotion. In this paper I create a method to compare sentiment sources rather than sentiment analyzers, and I show that the source of sentiment does not have any statistically significant impact on the prediction quality of various machine learning models. I also show that these models do not perform any better than the common person and can still be outclassed by subject matter experts. Finally, I present a brief exploration into emotional analysis and demonstrate that emotional analysis tools are still outclassed by humans performing the same task.
Subject | Algorithms; Computational linguistics; Computer science; Machine learning |
Keywords | machine learning, emotional intelligence, artificial intelligence, computer science
Digital Publisher | Stewart Library, Weber State University, Ogden, Utah, United States of America |
Date | 2022 |
Medium | Thesis |
Type | Text |
Access Extent | 70-page PDF; 1.83 MB
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records: Master of Computer Science. Stewart Library, Weber State University |
OCR Text | Primary Texts and Their Effects on Sentiment and Emotional Analysis
by Samuel Romine
A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE OF COMPUTER SCIENCE, WEBER STATE UNIVERSITY, Ogden, Utah, December 12, 2022
(signature) Dr. Robert Ball, Committee Chair
(signature) Dr. Sarah Herrmann, Committee Member
(signature) Joshua Jensen, Committee Member
(signature) Samuel Romine (Dec 20, 2022 08:44 MST)
Copyright 2022 Samuel Romine

Abstract
When performing sentiment analysis, it is common to derive sentiment from a multitude of sources, including lexicons, crowdsourcing, online tools, or even reading and analyzing the text yourself as the researcher. In my research I prove that you cannot simply derive your own sentiment from another person's text to test the validity of your sentiment analyzers, but that you must use the original author's sentiment as the basis for modeling the accuracy of various analyzers. To create such a set of data, I used MTurk, a surveying service provided by Amazon, to distribute various surveys to create primary annotated text. Using the basic emotion theory presented by Paul Ekman, I was able to create data with annotated sentiment and emotion. In this paper I create a method to compare sentiment sources rather than sentiment analyzers, and I show that the source of sentiment does not have any statistically significant impact on the prediction quality of various machine learning models. I also show that these models do not perform any better than the common person and can still be outclassed by subject matter experts. Finally, I present a brief exploration into emotional analysis and demonstrate that emotional analysis tools are still outclassed by humans performing the same task.

Table of Contents
Introduction ..... 5
Related Work ..... 7
The First Survey ..... 9
  Creation And Distribution ..... 9
    Survey Breakdown ..... 9
  Quantitative Analysis ..... 11
    Response Quality ..... 12
    Demographics ..... 13
  Pre-emptive Sentiment Breakdown ..... 14
    Positive Emotions ..... 15
    Negative Emotions ..... 17
Text Processing ..... 20
  Step 1 ..... 20
  Step 2 ..... 22
  Steps 3 & 4 ..... 25
  Sentiment Sources ..... 26
    VADER ..... 27
    AFINN ..... 30
    Researcher ..... 32
Sentiment Analysis ..... 35
  Naïve Bayes ..... 36
  Logistic Regression ..... 37
  KNN ..... 38
  Random Forest ..... 39
  Ensemble ..... 40
  Machine Learning Results ..... 41
Alternative Analysis ..... 41
  Second Survey (Public Crowdsourcing) ..... 42
  Third Survey (Private Crowdsourcing) ..... 45
  Results ..... 47
  Sentiment vs Emotion ..... 49
Emotional Analysis ..... 52
  Text2Emotion (T2E) ..... 53
  LeXmo ..... 54
  Second Survey (Public Crowdsourcing) ..... 55
  Third Survey (Private Crowdsourcing) ..... 56
  Results ..... 56
Conclusion ..... 61
References ..... 64

Introduction
As I began learning about sentiment analysis, I noticed a common practice among modern researchers of the topic. When the authors of a given paper want to determine the sentiment of their data, they do not use the primary source to determine the sentiment; instead, they use a wide variety of other methods to apply what they assume is the correct sentiment.
Whether the data is sourced from movie reviews, product reviews, tweets, Facebook posts, responses to surveys, or perfected datasets that already have the correct matching sentiment (like News Group Movie Review Sentiment Classification and Reuters Newswire Topic Classification), the researchers use methods such as:
• Lexicon-based approaches like AFINN or VADER [1, 2]
• Researcher-created sentiment [3]
• Metadata attached to the reviews, such as star ratings or scoring on a range from 1 to 10 [4, 5, 6, 7, 8, 9]
• RapidMiner, Nvivo, or other online tools [10, 11, 12]
• Algorithms such as naïve Bayes, random forests, support vector machines, and others [13, 14, 15, 16, 17, 18, 19]
• SentiWordNet or ensemble algorithms [20, 21]
• Crowdsourced sentiment from services like MTurk or pre-annotated/perfected datasets [22, 23]
• Semi-supervised learning, where a portion of the data uses one of the aforementioned methods [24, 25, 26, 27]
to determine the sentiment of the text they are analyzing. Plenty more examples are available, but this should help illustrate that a vast majority of papers create their own sentiment rather than identifying the author's original intent. Each of these methods has some form of drawback when used as the primary sentiment annotator. Lexicons generally operate on a word-by-word basis, meaning they can miss context such as "not happy" since each word is treated separately. Metadata does not always correlate correctly: many low-star reviews contain statements like "The product was great but arrived late for my daughter's birthday". The metadata of a 1-star rating would label this review as negative, despite the reviewer stating they loved the product. Online tools are generally proprietary or black box, meaning the researcher has no idea how the sentiment was created; the only option is to trust that the results are accurate. Machine-learning-based sentiment relies on large datasets of pre-annotated data, which raises the same question: "Where did that annotation come from?" Machine learning also tends to handle outliers poorly, such as unique words or uncommon phrases that are not within its training data; the models cannot extrapolate well. When I discovered this trend, I began to wonder if the process of creating your own sentiment impacts the accuracy and validity of the very research being performed. That is when I formulated my research question: Does the use of primary annotated text impact the results of various sentiment analysis methods? My original hypothesis and answer to this question before I started my research was simply: "Yes." My hypothesis was that there would be a significant and measurable difference in the results of lexicon-based and machine-learning-based sentiment analysis compared to the author's self-defined sentiment. I firmly believed that formulating and "guessing" at the sentiment of a given piece of text did not properly demonstrate the effectiveness of the given method in question. However, as I write this retrospectively, I have realized that using secondary sentiment does not have any major impact on prediction accuracies.

Related Work
Originally, sentiment was used by governments, both modern and old, to gauge the opinions of their peoples and governmental bodies [28]. One of the earliest known examples of this is from the Ancient Greeks, who were attempting to find opposition, such as spies from other territories and dangerous disgruntled persons within their own city-states [28][29].
Moving forward in time, in the late 1930s and early 1940s, the U.S. government reached out to universities to perform sentiment analysis studies to see how 8 the public would respond to a potential war with Germany [30]. These practices are still being done today. A study in 2019 utilized sentiment analysis of Facebook posts from news sources located in Indonesia to determine public opinion during the presidential elections of Indonesia [4]. However, as technology progressed, institutions outside of governmental bodies and universities were able to begin conducting their own sentiment-style studies. In 1978 an inventor by the name of John Decatur Williamson filed one of the first patents for an emotional state analyzer based on the tone of a person’s voice [31]. It is amazing to think that nearly 50 years ago people were already pioneering technologies like speech-to-text that still considered a technological marvel today and how the concept originated in sentiment analysis. I have also worked on the progression of sentiment and emotion analysis with my committee members. I previously executed research on the sentiment behind recipes and their descriptions. I created a survey that took key words from recipes such as “delicious” or “gross” and asked people to rate what emotion best matched this word. I then used these results to derive sentiment and emotion from pieces of text and compare these results to other sentiment and emotion analyzers. This paper acts as a continuation of my previous research with a focus on the sentiment and the impact of the sentiment’s source. 9 The First Survey Creation And Distribution Before I could begin to test my research question, I needed to source a sizeable amount of text that had been reviewed by the author and contains their own personal sentiment rating of their work. I met with my committee to discuss ideas to create such a corpus, and I determined that a survey would be the best way to approach this. Then came the discussion of how prompt surveyors to write stories that invoked such a response. Figure 1 First Survey Prompt. An example of the first survey’s prompt, and the question for “happy.” Survey Breakdown I had several ideas, but in the end, I decided to utilize Paul Ekman’s theory on basic emotions [32]. Within Ekman’s work, he theorized that human emotion routes back to one of six core emotions: Anger, Disgust, Fear, Happiness, Sadness and Surprise [33]. I produced a generalized pattern for our prompts: “Write about a time you felt 10 {EMOTION}, without using the word {EMOTION}”, where “emotion” is one of the six emotions listed previously. Figure 1 shows an example for the emotion happy. I also decided to instruct the surveyors not use the prompt’s emotion within their response to help promote more unique language within the survey results. I was worried that I would see many responses like “One time I felt happy when…” “A time I was scared was…”. By instructing them to not use the emotion word, I hoped to increase the variety of words I would see, creating a more diverse corpus to work with. Finally, I instructed the surveyor to rate the sentiment of their response using a Likert scale from 1 to 7, where 1 is “overall negative” and 7 is “overall positive”, visible in Figure 1 [34]. The survey was then hosted on by Amazon Web Services (AWS) using a service known as Amazon Mechanical Turk (MTurk) [35]. 
MTurk is a commonly used service to recruit research participants, giving researchers access to a large pool of surveyors (also called “Workers” by MTurk) [36]. Figure 2 Demographics questions given to the surveyors. 11 In total, six questions were given to each of the surveyors, one for each of the emotions listed previously. Surveyors were also asked to optionally answer two general demographics questions: age, and race/ethnicity as seen in Figure 2. Surveyors were required to answer at least three of the six questions for their response to be accepted by my manual review of their response. I requested two hundred responses, meaning I would obtain a minimum of 600 pieces of text (if every surveyor only did the three required of them) and a maximum of 1200 (assuming every surveyor answered all six prompts). Also, since this survey was distributed on MTurk, all surveyors received compensation of $1 for successfully completing the survey. The following is an example of the instructions that were given to the surveyors: “You will be prompted to write about six emotions: Happy, fear, anger, disgust, surprise, and sadness. You can write about an experience you had in your life, or something you read from a book, or a story you heard, or whatever else that caused you to feel that emotion. There are no wrong answers and please feel free to write as much as you think is necessary. (However, please write at least two sentences at a minimum) If you feel uncomfortable writing about a certain emotion or question, feel free to skip any questions related to that emotion and move onto the next one. However, you must answer at least 3 of the 6 questions for your response to be approved. After each short answer question you will be asked to review the sentiment of your own writing. There is a slider that goes from 1(one) to 7(seven), where 1 is: ‘What I wrote is overall negative" and 7 is: ‘What I wrote is overall positive’. Please select the number that best represents the sentiment of what you wrote for that question.” Quantitative Analysis There are several metrics to review from the first survey that will help to understand the data I gathered. However, I must explain how specific responses will be presented. All submissions on MTurk are anonymous. As the provider of the survey, I can only view a randomized ID of the surveyor. Any time I quote something a surveyor 12 said, I will cite it with their ID. Meaning, if you see something like “’When I see any spider’ (A2WV4ZZCI7982H)”, the origin of quote is from a surveyor within the data I gathered. Response Quality In total, there were two hundred responders accepted, and 514 rejected. Of those 200 responders, there were a total of 1108 answered questions. Each surveyor had to answer a minimum of three questions, but most answered all six. At first that may seem like an abnormally high rejection rate of 71%, but this is quite typical with MTurk. I rejected responses for a multitude of reasons, but the main reasons were the surveyor not answering a minimum of three questions or surveyors writing single word answers. This was my first realization that MTurk may not have been the best tool to use for distributing my survey. It appears a lot of the workers try to skirt the rules in any way they can. This will be repeated in the second survey further in this paper which also utilizes MTurk. Returning to the one-word answers, the answers almost always used the emotion word they were directly instructed not to use. 
The one-word responses that did not use the emotion word used seemingly random words. For example, Worker A28E3F67UGRIHM’s responses were rejected. They wrote the following for each of their answers: • Happy: “Enjoy” • Sad: “curry” • Anger: “happy” 13 • Fear: “cry” • Disgust: “augest” (sic) • Surprise: “price” Luckily, most of the accepted responses are of a much higher quality and provide meaningful data to analyze. Here is an example of a typical accepted response. These text responses come from Worker A2UTDZAZV1DF0N: • Happy: “My daughter received an award at school and I was elated. She was one of the highest rated students in her grade.” • Sad: “I was very depressed when my mom passed away. She suffered with cancer for many years.” • Anger: “I yelled at my daughter for staying out too late. She missed curfew and did not call us.” • Fear: “I was unsure if I would have enough money to pay my mortgage. I was very unsettled and afraid.” • Disgust: “My boss cut his finger tip off at work. I felt grossed out by the sight.” • Surprise: “My neighbor had a problem with our lawn guy. I was astonished that he yelled at him.” Demographics Table 1 The age responses from the first survey. Age Surveyors 18-24 16 25-34 93 35-44 45 45-54 28 55-64 9 14 Over 65 4 Total Responses 195 out of 200 For demographics, the raw results are presented in tables 1 and 2. Table 1 shows the age demographics of the accepted responses from the first survey. For my dataset specifically, it appears that the majority of Workers are under the age of 44, only 41 Workers stated they are over the age of 44 or 22%. The demographic questions were optional, 5 of the 200 surveyors chose to not answer. Table 2 The race/ethnicity responses from the first survey. Race/Ethnicity Surveyors White/Caucasian/European American 168 African/African American/Black 17 Hispanic/Latinx 9 Asian/Asian American 4 American Indian/Alaska native 3 Pacific Islander 1 Middle Eastern 0 Other 0 Total Responses 200 out of 200 Table 2 shows the race/ethnicity results. Peculiarly all two hundred responders answered what their race/ethnicity was, even the 5 that chose not to give their age. White/Caucasian/European makes up for an overwhelming 84% of the dataset, and 0 responders identifying as Middle Eastern or Other. Pre-emptive Sentiment Breakdown There are few insights one can gain simply looking at sentiment scores without performing any form of natural language processing or textual analysis. For example, when the sentiment scores from the surveyors are compared to their respective emotions, one can see some general trends within each emotion. The positive emotions happy and surprise are strongly correlated with a positive sentiment score (greater than 4), but the 15 negative emotions do not share this characteristic with negative sentiment. One can also see the overall quality of the MTurk responses. Positive Emotions Figure 3 shows the breakdown of each sentiment score (1 to 7) by their respective emotion. Essentially, every emotion is broken down into sentiment scores. For instance: 53.54% of all the responses for the happy prompt were given a sentiment score of 7 by the author. Figure 3 Sentiment Scores for all Emotions. This chart shows all of the emotions broken down by their related sentiment scores. At first, this chart might be seemingly random, but breaking down the emotions to “positive” (Happy and surprise) and “negative” (Sad, anger, disgust and fear), one can see some strong trends forming from the sentiment alone. 
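The Figure 3 breakdown described above is a straightforward aggregation. As a minimal illustration (not the thesis code), assuming the survey responses were loaded into a pandas DataFrame with hypothetical "emotion" and "sentiment" columns, the per-emotion percentages could be computed with a normalized crosstab:

```python
# Minimal sketch: per-emotion percentage of each 1-7 sentiment score,
# in the style of Figure 3. The DataFrame below is a made-up placeholder
# for the first-survey responses; column names are assumptions.
import pandas as pd

responses = pd.DataFrame({
    "emotion":   ["happy", "happy", "fear", "sad", "surprise", "happy"],
    "sentiment": [7, 6, 5, 2, 6, 7],   # author-assigned 1-7 Likert score
})

# Row-normalized crosstab: each row (emotion) sums to 100%.
breakdown = (
    pd.crosstab(responses["emotion"], responses["sentiment"], normalize="index") * 100
).round(2)
print(breakdown)
```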
16 Figure 4 Sentiment Scores for Happy and Surprise. This chart shows the emotions happy and sad broken down by their related sentiment scores. Figure 4 shows only the positive emotions happy and sad from Figure 3. The results in Figure 4 are generally consistent; the emotions happy and surprise are strongly correlated with positive sentiment scores. In fact, happy does not have a single response with a sentiment score below 3, 4 being neutral. However, that begs the question: Why are there responses to the happy prompt that have a negative sentiment, and what do they say? Table 2 shows the two responses: Table 3 Low Sentiment Happy Responses. This table shows the two responses for the negative sentiment happy responses. Also yes, the word “Throughout” is missing the letter t; that is how the response is written. Worker Response A2ZQG0M4X018HD hroughout the history of moral philosophy, there has been an oscillation between attempts to define morality in terms of consequences leading to happiness and attempts to define morality in terms that have nothing to do with happiness at all A2D762IOIR82GM Happiness remains a difficult term for moral philosophy. These responses appear to be somewhat random and only relate to the emotion word by usage alone, which they were instructed to not use in the first place. The first response appears to copy and pasted. I decided to locate the original source and found out it is a near direct copy/paste from an essay listed on the Course Hero website (Fun fact: if you go to the link listed in the citation, the title of the essay is spelled wrong as well; 17 “Topic: hapiness”) [37]. The second response is a direct citation from the Wikipedia article of the word “Happiness” [38]. I chose to keep these responses for a couple of reasons: Firstly, it was becoming increasing difficult to get responses that were even a single complete sentence. I rejected nearly three quarters of the responses I got, but these responses are still in the top 25% of quality. Secondly, aside from the spelling mistake, these pieces of text are grammatically correct and linguistically diverse. I might disagree with the sentiment score and the fact they did not produce the text themselves, but overall, their responses are adequate. On the other hand, this is an example of MTurk and my experiences using it. Negative Emotions Figure 5 Sentiment Scores for Sad, Anger, Disgust, and Fear. This chart shows the emotions sad, anger, disgust, and fear broken down by their related sentiment scores. Figure 5 shows the result for the negative emotions. While Figure 4 was easy to explain with a visualization, Figure 5 fails to do so as there is no obvious trend. The only semblance of a trend is that surveyors tended to stay away from neutral, 4, along the extremes of positive/negative, 1 and 7. Most of the negative emotion responses tended to stay in 2/3 and 5/6. 18 For any given negative emotion, roughly 40% of all the responses were rated at some level of positivity (5, 6, or 7). For the emotion “fear” specifically, 80 of the 181, or 44%, sentiment responses were positive. There are a number of responses that both do and do not make sense to have a positive fear score. Table 4 shows three of examples of what I consider to be good examples of “positive fear.” Table 4 Positive Fear Responses. This table shows three responses to the “fear” prompt that were given a positive sentiment. Worker Response A2W5X6R3FUZERZ I WASN'T ABLE TO SLEEP AFTER I PLAYED A HORROW GAME. 
I WAS SCARED RIDING THE BICYCLE FOR THE FIRST TIME EVEN THOUGH I HAD TRAINING WHEELS. A3RDHC556AFCOW I was watched a horror movie yesterday A39KSFTUVXX7KJ In fact, helps you instinctively protect yourself from harm. Your fear might help you to recognize when you're about to do something dangerous, and it could help you to make a safer choice. But, you might find yourself fearful of things that aren't actually dangerous, like public speaking. Each of the responses found in Table 4 have some merit as to why the author listed them as positive. Response 1 talks about playing a “horrow” (horror) video game and riding a bike. Both topics can be scary and/or fearful actions, but both can also be positive and/or fun. Number 2 is discussing a horror movie. Plenty of people enjoy that feeling of fear and can be seen as a positive experience. Number 3 seems to be discussing the word fear itself rather than an experience of it, but this can also be a positive. The text appears to be rationalizing the fears people feel on a daily basis. 19 Despite there being several examples of “positive fear,” there is also a large number of responses that I would argue whose sentiment does not correctly represent the text as shown in Table 5. Table 5 Positive Fear Responses. This table shows several responses to the fear prompt that have a positive sentiment score. Worker Sentiment Score Response A11LSO6D7BMY99 6 When i saw that cockroach in the kitchen my heart lept out of my body!! I despise those things, they just give me the creeps! A3HUBIDD9FTEK4 6 In this post, we have included 32 things for you to consider when you write about fear. A27O7H19C0WQ7T 6 I was terrified when I got in a car accident Every response found in Table 5 would conceivably merit a negative sentiment, aside from the clear copy/paste response, but the authors decided to rate their response as positive. Since I am not the author, nor was I there directly administering the survey, I can only guess at the reasons for these ratings. It’s possible they do not know what the word sentiment means. Maybe they didn’t read the instructions and only saw “overall positive”/ “overall negative” within the local question prompt in Figure 1. This leads me to the assumption that they rated their work positively regarding its quality rather than its sentimentality. Based on the number of responses that were completely blank or one-word or even copy/pasted, I also perceive surveyors just dragging the slider to a random number without thinking. These results are not exclusive to fear; they are repeated within the other negative emotions. There are few responses that make sense sentiment-wise, but the majority do not. Regardless, even if I 20 disagree with their responses, they are valid and follow the instructions, so I accepted them. I was not expecting this to be a problem. Text Processing Now, let us dive into the process I used to prepare the text for analyzing. The process for preparing text for sentiment analysis is nearly universal: 1. Tokenize the text, remove punctuation, remove stop words, remove any non-alpha characters, and lowercase all letters. Stop words are commonly used words like “the”, “and”, “for”, “is”. These words need to be removed to help sentiment analyzers to key in on more important terms. 2. Perform Parts-Of-Speech (POS) tagging on the remaining words and perform any necessary lemmatization or perform stemming instead. 
Lemmatization and stemming are two difference processes of reducing the complexity of a piece of text, without reducing its meaning. I will explain further when I discuss this step. 3. (Optional) Perform any process specific changes such as: retokenizing (preferred by lexicon approaches) or vectorization/Bag-Of-Words (ML Approaches). Vectorization is the process of converting the literal words within the text into a numerical representation for machine learning algorithms operate on. 4. Perform sentiment analysis. Step 1 Step 1 is a very direct process. The goal is to remove any unnecessary attributes from the text that could impact the sentiment analyzers. These algorithms rely on creating 21 associations between the words within the text and a given sentiment score. The process requires removing “meaningless” words and symbols, otherwise incorrect associations might be formed by the algorithms. For example: Let’s look at these pretend responses from surveyors who gave rated their text with 7 or positive: 1. “One time I went fishing and enjoyed myself.” 2. “One time my daughter was in a play, and she did amazing.” 3. “I remember when I ate more than once ice cream sandwich, I was so full.” The next step is to determine what words to associate with a “very positive” sentiment score. As far as the machine is concerned, the most important words are “I”, “one”, and “.”. The way the algorithm sees it, “I” appears four times, making it the most frequent word, and “one” appears 3 times, once in every sentence, this is the same for the “.” symbol. A machine does not understand that these words are meaningless for determining sentiment; all it knows is that these words occur nearly every time and therefore have a strong correlation with positive sentiment. Therefore, it is necessary to remove any portion of the text deemed non-essential, including punctuation, numbers (non-alpha characters), and stop words. Figure 6 Step One of Text Processing. This figure shows the python used to tokenize and remove stop-words, punctuation, and non-alpha characters. 22 Figure 6 shows how I went about this process. “nltk.RegexTokenizer(r’\w+’)” removes any non-alpha characters and punctuation. The attribute immediately after “tokenize(sentence.lower())” converts all the characters remaining in the text to lower case. Finally, the second line “[word for word in sentence if not word in s_words]” removes any stop-words such as “I”, “the”, “and”, “or”, “for”, etc. Running the code in Figure 6 on one of the example sentences, “One time my daughter was in a play, and she did amazing” becomes: “daughter play amazing”. The remaining keywords from the sentence may have an impact on the overall sentiment. Step 2 Step 2 involves the process of performing Parts-Of-Speech tagging (POS) and lemmatization or stemming. These two processes combine to formulate a way to reduce the complexity of the language in each piece of text. To demonstrate, let’s use the sentence “I remember when I ate more than one ice cream sandwich, I was so full.” Parts of speech tagging is the process of grammatically tagging each word with its positional meaning. Essentially, each word is tagged with its grammatical type, such as: noun, adjective, conjunction, determiner, etc. Using our example sentence, Figure 7 shows what parts of speech tagging can do. Figure 7 Parts of Speech Tagging. This figure shows an example sentence broken down using parts of speech tagging. 23 Figure 7 is a parse tree of the tagged words within the sentence. 
At the very top one can see the letter “S” for “Sentence”. This is the root of the tree with everything underneath being different parts of the sentence. The green text represents the original word, followed by the grammatical typing that matches the word. On the far left, the word “I” has been tagged with “PRP” or “Personal Pronoun”. Next, the word “remember” is tagged with “VBP” or “Verb Present Tense”. The blue text represents custom tags, which have a multitude of uses. They can emphasize specific parts of speech, or they can be used to create completely new tags like NP. “IN” and “VBD” are just highlighting specific types of speech. However, “NP” or “Noun-Phrase” shows the connection between nouns. The words “ice”, “cream”, and “sandwich” are all nouns with unique meanings, but when combined they formulate a completely new object; “ice cream sandwich”. I added these just as an example to show the power of parts of speech tagging. When I discussed why it is important to remove stop words, the same can be said for word variety. As far as an algorithm is concerned “amaze”, “amazing”, and “amazed” are all distinct words and therefore will be treated uniquely, but as human beings we know that all these words generally refer to root word “amaze”. This is not always the case as past/present tense have a strong impact on the meaning of the word, but there a lot of instances where these words maintain the same meaning. The simple solution would be to remove the suffixes of all these words, so they match, creating the word/symbol “amaz”. The machine does not care if this isn’t a real word because it will still associate them all together. This is a process knowing as 24 stemming. Rather than using logic-based rules or some other method of determining when to truncate a word, the stemmer just looks for any prefix/suffix within the text and removes it. This introduces several flaws than can occur when stemming. The biggest of these flaws comes into play with similar words. Take the words “selfish” and “selfless.” Stemming has the potential to convert both words to the same origin “self”, which would be an incorrect connection as the two words “selfish” and “selfless” have very different meanings. Some stemmers are smart enough to handle cases like this, but there so many possible combinations of prefix-word-suffix that stemmers often fail to get every word as intended. This is where lemmatization comes in and why I chose to use it over stemming. Lemmatization performs morphological analysis, or analysis on the type of word and its positioning relative to the words before and after the word in question. Rather than bluntly chopping of parts of the word, lemmatization utilizes annotated corpuses or lists of words and the parts of speech for the given word to determine what way to lemmatize. Given the text “I am amazed. I was amazed.,” lemmatization would convert this to “I be amazed. I be amaze.” Figure 8 reveals why the same word was treated differently. Figure 8 Parts Of Speech for Lemmatization. This figure shows the difference between the two words based on context. The key is “am” vs “was.” Parts of speech tagging was able to determine that in the first case the word “amazed” is a “JJ” or “Adjective”, whereas in the second case, 25 “amazed” is a “VBN” or “Verb Past Particle” (past tense verb). Without context one might assume that “amazed” is the same in every instance, but parts of speech tagging helps to determine the difference based on the context. 
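The first two preprocessing steps described above can be sketched with NLTK. This is a minimal approximation rather than the thesis's exact code: it mirrors the Figure 6 tokenization and stop-word removal and adds the POS tagging and WordNet lemmatization discussed in step 2; the Penn-to-WordNet tag mapping is my own helper.

```python
# Minimal sketch of steps 1 and 2 (an approximation, not the thesis code):
# tokenize, lowercase, drop stop words and non-alpha tokens, POS-tag, lemmatize.
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# One-time downloads, if the corpora are not already installed:
# nltk.download("stopwords"); nltk.download("wordnet")
# nltk.download("averaged_perceptron_tagger")

tokenizer = RegexpTokenizer(r"\w+")          # strips punctuation and symbols
s_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map Penn Treebank tags (from nltk.pos_tag) to WordNet POS constants.
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def preprocess(sentence):
    # Step 1: tokenize, lowercase, remove non-alpha tokens and stop words.
    tokens = [w for w in tokenizer.tokenize(sentence.lower())
              if w.isalpha() and w not in s_words]
    # Step 2: POS-tag the survivors, then lemmatize each word using its tag.
    tagged = nltk.pos_tag(tokens)
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]

print(preprocess("One time my daughter was in a play, and she did amazing."))
# Roughly ['one', 'time', 'daughter', 'play', 'amaze'] with NLTK's default
# stop-word list; the thesis example also dropped "one" and "time", and whether
# "amazing" reduces to "amaze" depends on the tag the tagger assigns.
```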
Looking back at step 1, the sentence “One time my daughter was in a play, and she did amazing.” became “daughter play amazing.” Adding in parts of speech tagging and lemmatization, this becomes “daughter play amaze.” In this case, lemmatization determined that “amazing” needed to become “amaze”, reducing the complexity of the sentence without reducing meaning. Steps 3 & 4 Step 3 is labelled as optional because it is specific to the given method of determining sentiment. In the case of this paper, I performed one of two actions. Either the text was detokenized, or a bag-of-words was created. Detokenization is the process of converting these individual words back into a single piece of text. Lexicon based approaches, such as VADER and AFINN which I explain later, require a piece of text or string, rather than a list of words. For these approaches, the text must be detokenized back into “daughter play amaze” for the algorithms to process. On the reverse, algorithms like naïve Bayes or random forests require the list object to be broken down further into something a machine can understand. This is where bag-of-words comes in. Let’s make two short sentences to use as an example: “I like ice cream” and “I like pie and I like ice cream.” Bag-of-words operates by gathering every unique word in our data (“I”, “like”, “pie”, “ice”, “and”, “cream”) and creating a count of 26 each word present in the sentence. A bag-of-words created from these two sentences is represented in table 5. Table 6. Bag Of Words Example. This table shows a bag of words break down of two sentences. Sentence I Like Pie Ice And Cream I like ice cream 1 1 0 1 0 1 I like pie and I like ice cream 2 2 1 1 1 1 Table 6 shows that the first sentence contains “I”, “like”, “ice”, and “cream”, but does not contain “pie” or “and”. The second sentence contains some duplicate words, meaning their count is two. By performing this breakdown, I created a sparce matrix which is easily understood by machines. The more times the word occurs, the stronger the correlation between the word and the given sentiment. This process was repeated on all of the responses to the first survey. Finally, step 4 is the actual sentiment analysis. I will go over each algorithm/method in detail in the next sections. Sentiment Sources The sentiment analysis I performed for the first survey has two key parts: The sentiment source, and the method of prediction. There is some overlap on these two categories. The main purpose of this analysis is to prove my initial research question that the use of secondary sentiment impacts the results of sentiment analyzers. Now that I have primary sentiment and cleaned text, I needed to derive multiple sets of secondary sentiment. I used arguably the two most popular sentiment lexicons: VADER and 27 AFINN. Then I personally annotated all 1108 responses with a sentiment score to use as secondary sentiment. I read every piece of text, then attempted to determine what sentiment score the author originally gave their text. I used the same scale of 1 to 7 that the surveyor had to use. This number was then reduced to 1106 as two pieces of text were just the word “when” and nothing else. VADER VADER is the first of the two sentiment lexicons I used to create secondary sentiment. VADER was created in 2014 as a rule-based lexicon to determine sentiment of a given piece of text [1]. A rule-based lexicon means to take in text and process each word by looking up the word within its internal dictionary. 
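Before turning to VADER's details, step 3 above can be sketched briefly. Detokenization is a simple join, and the bag-of-words can be built with scikit-learn's CountVectorizer; the thesis does not name a specific library for this step, so the vectorizer choice here is an assumption.

```python
# Minimal sketch of step 3 (not the thesis code): detokenize for the lexicon
# tools, and build a bag-of-words count matrix for the ML algorithms.
from sklearn.feature_extraction.text import CountVectorizer

tokens = ["daughter", "play", "amaze"]
detokenized = " ".join(tokens)               # lexicons like VADER/AFINN expect a string

corpus = ["I like ice cream", "I like pie and I like ice cream"]
vectorizer = CountVectorizer(token_pattern=r"\w+")   # keep one-letter tokens like "I"
counts = vectorizer.fit_transform(corpus)            # sparse per-sentence word counts

print(vectorizer.get_feature_names_out())    # unique vocabulary
print(counts.toarray())                       # reproduces the layout of Table 6
```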
There are roughly 7,500 annotated words within the VADER corpus. Each word was rated by a human being on a scale from -4 to 4, representing that word’s specific sentiment. Any word that is not present within the corpus is rated as a 0, or neutral. Once the string has been analyzed, VADER scales the final score down to some value between -1 and 1. -1 means “Overall negative sentiment” and 1 means “Overall positive sentiment”. Since the purpose of VADER was to create an alternative sentiment source to use in comparison to the original, I polarized the results to allow for direct comparison. Any score from VADER that was between 0 and 1 (inclusive) would become “positive” or 1. All of VADER’s scores less than 0 would become 0 or “negative”. This will be the sentiment scale used for the rest of this paper. The reason polarization is performed is due to the nature of the machine learning algorithms used for sentiment analysis. Nearly all machine learning algorithms used for 28 sentiment analysis are either binary classifiers (yes/no, 1/0) or categorical. Continuous based algorithms are seldom used for sentiment analysis as there are only three major categories: negative, neutral, and positive. Polarization only allows for two options, positive or negative, neutral is absorbed by positive. I believe this to be a flaw within the research of sentiment analysis itself. Not only am I removing one of the core options within sentiment, but I am also merging it into another category, skewing the results. In saying this, if I were to change this “standard” within my paper and used some other measurement method, my research would not be comparable to previous works. To maintain the purpose of my research question, I use this general approach to simply polarize the sentiment. For the use of training and comparison, the surveyor sentiment was polarized as well. Ratings of 1, 2, or 3 were converted to 0, and the remaining 4 (neutral), 5, 6, and 7 were converted to positive. Figure 9 Vader Polarized vs Emotion. This image depicts the results of VADER’s sentiment analysis, broken down by emotion. This is where the aforementioned overlap comes in. VADER is used to determine sentiment and can be the result or be used for machine learning purposes. Figure 9 shows 29 the raw results (after polarization) of VADER compared to the original sentiment of all 1106 responses. Overall, VADER got 67.51% of the sentiment correct, or 748 out of 1106. It’s worth noting that positive emotions, happy and surprise, had overwhelming success, with “happy” scoring 97.47%. VADER got 193 out of 198 of the happy responses correct. This will be a reoccurring theme, even into the emotional analysis. Figure 10 VADER Ranged. This image shows VADER’s non-polarized accuracy. I created Figure 10 to visualize the performance of VADER without polarization. Since VADER naturally operates on a scale of -1 to 1, the surveyors’ responses were scaled to match. The surveyors were given the option to respond with a value from 1 to 7 inclusive. When these values are scaled down to a range of -1 to 1 each value represents a step of 0.33. (1 = -1.0, 2 = -0.67, 3 = -0.33). The big difference between the two data pools is that VADER operates on a continuous scale, so it can produce any value between -1 and 1. To help represent categorical data within VADER, three options were created. For example, a surveyor gave their response a 5, or 0.33 on VADER’s original scale. 
If VADER was able to predict a score within a range of ± 0.33 or one step to the surveyors, 30 (Between 0 and 0.66), the accuracy was considered “great.” This was expanded by an additional step for an accuracy of “good” (±0.67, between -0.33 and 1). Anything beyond these two ranges was considered an accuracy of “poor”. VADER scored 41.52% responses as “great” or 460 out of 1106. Combining the “great” and “good” predictions leaves VADER with a total score of 67.78% or 751. This score is almost the same as the polarized score, having a difference of only 3, 751 for categorical vs 748 polarized. This eased my mind slightly about removing “neutral” as a prediction option since there was near zero information loss from “normalizing” the data. However, this is not always the case as will be explained later. AFINN AFINN is another popular lexicon for deriving sentiment and is similar to VADER [2]. VADER operates on a word rating range of -4 to 4, whereas AFINN uses a range of -5 to 5, giving AFINN a little bit more specificity per word. However, AFINN does not scale any of its results directly. Instead, AFINN adds and subtracts the score of each word within a given piece of text individually and presents the final value. To normalize this data, I divided the score by the number of words processed within the sentence. This produces some score between -5 and positive 5 inclusive. This value was then scaled by dividing by 5 again to decrease the range to our -1 to 1. This process allows polarization of the sentiment and to repeat the categorical test that was performed using VADER. 31 Figure 11 AFINN Polarized. This chart shows AFINN’s sentiment polarized by emotion. Figure 11 shows a similar result to Figure 9. AFINN has a slightly reduced accuracy of less than half of one percent, but this can be deceiving. AFINN’s dictionary is only 2,477 words, over 5000 less than VADER (~7,500). When AFINN does not recognize a word, it repeats the same thing that VADER does, it gives the word a score of zero. This means that when AFINN does not recognize a single word within the text, the resulting score is 0 or neutral. In total, 20% (224 out of 1106) of AFINN’s results are 0. When I remove neutral by associating it with positive with polarization, I artificially inflate the accuracy of AFINN. Figure 12 AFINN Ranged. This image shows AFINN’s non-polarized accuracy. 32 Figure 13 AFINN vs VADER. This chart shows the accuracies of AFINN and VADER. Figure 12 and Figure 13 demonstrate the artificial inflation. AFINN’s initial accuracy seems almost identical with VADER, but when the same score is presented as categorical rather than polarized, the “great” scores for AFINN compared to VADER decrease by nearly half. AFINN got 675 or 61.0% responses correct for “great” and “good” scores combined. VADER got 751 correct, meaning AFINN did 11% worse than VADER, not half a precent worse. Since AFINN has a dictionary less than half the size of VADER, false neutrals are more common. Polarizing creates literal “false positives”, inflating the accuracy of AFINN. Researcher The next set of secondary sentiment that I created was from the researcher directly. Within the following charts and throughout this paper, anytime the column is labelled as “Researcher” was derived from me. I randomized the text that a surveyor wrote, then I then read the piece of text and tried to predict what sentiment score the author gave their writing. 
I was not able to see the author’s score nor the emotion the text related to or any other potential information. The only piece of data I had to predict sentiment with was the text itself. 33 To be 100% transparent, during the surveying process I had to manually read and approve every person’s response. This means that I had technically seen the text with the author’s sentiment rating before my analysis of it, but I argue that it would not be possible for me to remember any of the responses individually with their matching scores. I had a total of 714 submissions of the survey, which translates to 4284 pieces of text that I had to manually check for approval. Although there exist people capable of remembering thousands of scores and texts after reading them only a single time, I am not one of them. Figure 14 Researcher Polarized. This image shows the results of the researcher sourced sentiment. Figure 14 shows my results. At a surface level, I performed better than both VADER and AFINN, getting a total score of 797 or 71.9%. This is 7% better than VADER polarized (748) and 6% better than AFINN polarized (752). The trend repeated where the positive emotions have much higher accuracies than the negative emotions. 34 Figure 15 AFINN vs VADER vs Researcher. This chart shows the categorical sentiment of AFINN vs VADER vs Researcher. Figure 16 shows the categorical comparison of these three sentiment sources. Researcher performed far better either AFINN or VADER, having a “great”/ “good” total of 888 or 80.0%. My score is 15% better than VADER (751) and 24% better than AFINN (675). Overall, Researcher performed substantially better than VADER and AFINN in both the polarized sentiment and the categorical. 35 Sentiment Analysis For the machine learning portion of the sentiment analysis, I chose 4 commonly used algorithms to derive sentiment with: naïve Bayes, logistic regression, K-nearest neighbor (KNN), and Random Forest. Also, I created an ensemble out of all four of these as well to see if that had any major impact. The original 1106 responses were randomly split into two groups, a training set and testing set. 884 of the responses, or 80%, were placed into the training set to be used for training and building each model. The remaining 222 responses, or 20%, functioned as the testing set. Once the models were trained on the training data, each model then predicted the sentiment for all 222 responses in the test set. The purpose of this thesis is to prove (or disprove) that the source of the sentiment for a given piece of text impacts the performance of sentiment analyzers. To do so, each algorithm was used to create four models. The 884 training responses were paired with one of the following sentiment sources: The original from the surveyor as a control, VADER’s sentiment, AFINN’s sentiment, or the Researcher’s sentiment. Then, each model was trained on their respective data/sentiment set. The results in the following figures show each model’s accuracy based on the sentiment source. It is also worth noting that all the chosen algorithms produce polarized results, meaning I have no representation of neutral. 36 Naïve Bayes Figure 16 Naïve Bayes Vs Surveyor. This chart depicts the results of the four naïve Bayes models compared to the author’s original sentiment The first algorithm I used is naïve Bayes. For a brief explanation, naïve Bayes operates on the principle that each observation is independent, where each word in the text is an observation in my case. 
naïve Bayes uses the frequency of each word compared to every other word in the sentence and the training set to determine its importance and correlation to sentiment. Figure 16 shows all four of the models generated with naïve Bayes. Each model was trained using the matching sentiment on the x-axis then had its prediction results compared back to the original given by the author of each text. The “Surveyor/Baseline” bar utilizes the author’s sentiment instead of any secondary sentiment for its training. The model derived from the Surveyor/Baseline test set provides something to compare the other three against. To measure for statistical significance, I chose to use two-tailed T-tests. All my results are independent as one model/prediction-method did not impact another, and all of 37 their results maintain similar variances. Also, I am comparing each of the models back to their respective Surveyor/Baseline, not to each other, meaning I am only comparing two groups within each test. I also chose to use a two-tailed T-test rather than a one-tailed because I only care if the two samples are different from one another, not the difference between population means. I used a 95% confidence interval in all tests as well. Does the use of alternative sentiment sources, other than the original, impact the overall performance of algorithms such as naïve Bayes? In this case, no. The baseline model scored 152 out of 222 correct predictions, or 68.5% accuracy. The worst performing model utilized AFINN as its sentiment source, getting 138 out of 222 or 62.16%. Utilizing a 95% confidence interval, a p-value of 0.16 is generated, meaning the results between Surveyor/Baseline and AFINN are not statistically significant. Logistic Regression Figure 17 Logistic Regression vs Surveyor. This chart depicts the results of the four logistic regression models compared to the author’s original sentiment. The second algorithm used for sentiment analysis is logistic regression. Briefly, logistic regression operates by attempting to predict the dependent variable, sentiment in this case, by focusing on the relationships between the independent variables, the words, 38 and their counts in our post-processed text. Figure 17 shows that the logistic regression baseline performs slightly worse than naïve Bayes, getting 147 out of 222 or 66.2% correct predictions. Once again AFINN is the worst performer, getting 136 out of 222 or 61.2%. Utilizing a 95% confidence interval, a p-value of 0.27 is generated, meaning the results between Surveyor/Baseline and AFINN are not statistically significant. KNN Figure 18 KNN vs Surveyor. This chart depicts the results of the four KNN models compared to the author’s original sentiment. Figure 18 shows the results of the third algorithm, K-nearest-neighbor, or KNN. KNN works by finding the minimum distance between like datapoints. KNN creates a plot in the nth dimension using the training data as the datapoints. KNN creates clusters based around the positioning and minimum distance between all the datapoints. KNN then plots the testing data and attempts to determine which category the test texts belong in. The KNN baseline is the worst performer so far, getting only 138 out of 222 or 62.2% correct. Interestingly, AFINN manages to not only be the best performer, but it also beats the baseline. Unfortunately, utilizing a 95% confidence interval, a p-value of 0.49 is 39 generated, meaning the results between Surveyor/Baseline and AFINN are not statistically significant. 
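Before moving on to the random forest, the evaluation loop shared by all of these models can be summarized in a short sketch. This is not the thesis code: scikit-learn and SciPy are assumed tooling, the variable names are placeholders, and the logic follows the description above (80/20 split, training on one sentiment source, scoring against the author's polarized sentiment, two-tailed t-test at the 95% level).

```python
# Minimal sketch of the shared evaluation procedure: train on one sentiment
# source's labels, score predictions against the authors' polarized sentiment,
# then compare two label sources with a two-tailed t-test.
import numpy as np
from scipy import stats
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def evaluate(texts, source_labels, author_labels):
    """Return a 0/1 correctness vector for the 20% test split."""
    X = CountVectorizer().fit_transform(texts)
    X_tr, X_te, y_tr, _, _, author_te = train_test_split(
        X, source_labels, author_labels, test_size=0.2, random_state=42)
    preds = MultinomialNB().fit(X_tr, y_tr).predict(X_te)
    # Compare back to the author's sentiment, regardless of the training source.
    return (preds == np.asarray(author_te)).astype(int)

# Hypothetical usage, with `texts`, `surveyor_sentiment`, and `afinn_sentiment`
# standing in for the real data:
# correct_baseline = evaluate(texts, surveyor_sentiment, surveyor_sentiment)
# correct_afinn    = evaluate(texts, afinn_sentiment,    surveyor_sentiment)
# t_stat, p_value  = stats.ttest_ind(correct_baseline, correct_afinn)
# print(p_value > 0.05)   # True -> the difference is not statistically significant
```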
Random Forest Figure 19 Random Forest vs Surveyor. This chart depicts the results of the four random forest models compared to the author’s original sentiment. The final model used was a random forest. Random forests operate by creating a multitude of uncorrelated decision trees which all act together to predict the outcome. Each decision tree uses a set of features, in this case a set of unique words, and then attempts to logically split all 222 pieces of text into distinct groups. The decision tree tries to make each group as different as possible, all while making sure that the text within a given group is as similar as possible. This process is repeated until all the features are exhausted within a given tree of the forest. Figure 19 shows the baseline random forest scored 142 out of 222, or 63.9%. Random forest performed worse than naïve Bayes and logistic regression but did better than KNN. In this case VADER outperformed the baseline. Utilizing a 95% confidence interval, a p-value of 0.84 is generated, meaning the results between the baseline and VADER are not statistically significant. Researcher performed the worst at 135 out of 40 222 or 60.8%. A p-value of .49 is generated, meaning that baseline vs the researcher is also not statistically significant. Ensemble Figure 20 Ensemble vs Surveyor. This chart depicts the results of the four ensemble models compared to the author’s original sentiment. After performing and evaluating the previous four algorithms, I realized that they all performed nearly equal, their baselines were all within 14 correct predictions of each other. I decided that I would create an ensemble algorithm from each of the sentiment categories to see if this would produce any meaningful results. Each algorithm was grouped by its sentiment source. For example, in Figure 20, the “Researcher” column, is a combination of the naïve Bayes, logistic regression, KNN, and random forest Researcher-based sentiment models. Of the four models, if the majority predicted positive, the prediction for the ensemble model was positive. Ties went to positive. The baseline scored 143 out of 222 or 64.4%. This places it higher than the random forest and KNN, but worse that naïve Bayes and logistic regression. The worst performer was VADER at 137 out of 222 or 61.7%. Utilizing a 95% confidence interval, 41 a p-value of 0.55 is generated, meaning the results between the baseline and worst performer, VADER, are not statistically significant. Machine Learning Results Figure 21 Machine Learning Sentiment Results. This figure shows the combined results of all four algorithms, along with the ensemble algorithm. Figure 21 shows figures 16-20 combined for comparison. Figure 21 shows that there is minimal variation in the prediction results across sentiment sources, and the previously mentioned T-tests also proves there is no statistical significance. The source of sentiment does not impact the results of sentiment analyzers. This led me on to a new question. Since everything performed approximately the “same,” is there some other way of predicting sentiment that could outperform any of these methods? This is where the second and third surveys come into play. Alternative Analysis 42 Stemming from success of the Researcher sentiment in Figure 15 and the near equal results in Figure 21, I decided to create a second survey to create crowdsourced sentiment. If I personally can get an accuracy of 71%, how well could hundreds of people do? 
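As a brief aside before the second survey, the majority-vote rule used for the ensemble above is simple to state in code. The prediction lists below are made-up placeholders, not real results.

```python
# Minimal sketch of the majority-vote ensemble: the four models' polarized
# predictions (0 = negative, 1 = positive) are combined per response,
# with 2-2 ties going to positive.
nb_preds  = [1, 0, 1, 0]
lr_preds  = [1, 0, 0, 0]
knn_preds = [0, 1, 1, 0]
rf_preds  = [1, 1, 0, 0]

ensemble = [1 if (nb + lr + knn + rf) >= 2 else 0     # majority vote, ties -> positive
            for nb, lr, knn, rf in zip(nb_preds, lr_preds, knn_preds, rf_preds)]
print(ensemble)   # [1, 1, 1, 0]
```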
Alternative Analysis

Stemming from the success of the Researcher sentiment in Figure 15 and the near-equal results in Figure 21, I decided to create a second survey to gather crowdsourced sentiment. If I personally can get an accuracy of 71%, how well could hundreds of people do? The second survey also prompted the creation of a third survey, which will also be discussed in this section.

Second Survey (Public Crowdsourcing)

Figure 22 Second Survey. This image shows the layout of the second survey. "${text_response}" was replaced with whatever piece of text was currently being displayed.

Figure 22 shows a screenshot of the second survey. For each surveyor, one of the 222 texts from the test set was shown. The surveyor was asked to select the sentiment score that best represented the text and to select a matching emotion as well. For the successful completion of a response, the surveyor was rewarded with $0.10. I would have liked to have all 1106 original texts annotated this way to create a fourth source of sentiment, but the surveys were not free, and those monetary constraints did not allow for all 1106 pieces of text to be analyzed.

The detailed instructions read as follows: "You may accept up to 10 (TEN) HITs. All additional hits will be rejected! Select the primary emotion that's expressed within this text. Then, rate the overall sentiment of the text using the slider. You must rate the sentiment and the emotion for your hit to be approved. If there are multiple emotions expressed, use your judgement and choose the one that is the strongest of the emotions. Not all texts were written by native English speakers. Some texts might be hard to understand, but please do your best to determine the primary emotion and overall sentiment."

For reference, a "HIT" (Human Intelligence Task) is what MTurk calls each individual task within a survey. I wanted all 222 pieces of text reviewed 20 times each; that was the scope of this survey. However, each Worker only operated on one piece of text, a single "HIT," at a time. I requested 4,440 responses (20 responses for each of the 222 texts from the test set). Out of the 8,540 responses received, 4,442 were accepted (52% approved). MTurk allowed two HITs to be completed at the exact same time, producing two extra responses; I did not realize this until I had already approved them. I decided to keep the two extra responses because, out of 4,442, they would have minimal impact, and I had already paid the Workers, so I might as well use their responses.

Regarding the HIT/response limit, I initially wanted to limit the number of responses a Worker could give (HITs) to help increase the diversity in responses. I did not want a single Worker to produce 5% of all the responses, since they could technically answer once for all 222 texts. I eventually removed this requirement as I had a repeat of what happened with the first survey. The Workers found a multitude of ways to "cheat" on this survey. Instead of leaving responses blank or using one-word answers, they would literally leave the sentiment slider on the starting position of "4," click "Anger" or "Surprise" (the top or bottom option), then hit submit. Since a Worker is technically allowed to respond to all 222 prompts once, people would do this exact action for all 222 responses. I began encountering the problem I had with the first survey, where it would have taken weeks to get enough "quality" responses, even with a task as simple as this. I maintained the 10-HIT limit for a while, but I eventually started allowing more than 10 HITs to Workers who were presenting honest answers. It is possible the "good" Workers were doing the same thing and randomly clicking/sliding instead of just clicking "Anger"/"Surprise," but if they at least submitted unique answers for their initial 10 responses, I would allow them to answer more.
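The low-effort pattern described above (slider left at its default of 4 and the first or last emotion option clicked) is easy to flag programmatically. The following is a hypothetical screening sketch, assuming the responses live in a pandas DataFrame with "worker_id", "sentiment", and "emotion" columns; the thesis does not describe implementing such a filter.

```python
# Hypothetical screening helper: list Workers whose every response matches the
# "slider left at 4, Anger or Surprise clicked" low-effort pattern.
import pandas as pd


def flag_low_effort(responses: pd.DataFrame):
    """Return worker IDs whose responses all look like the default-click pattern."""
    default_click = (responses["sentiment"] == 4) & (
        responses["emotion"].isin(["Anger", "Surprise"])
    )
    per_worker = default_click.groupby(responses["worker_id"]).all()
    return list(per_worker[per_worker].index)
```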
Figure 23 Crowdsourced Polarized. This image shows the results from the second survey.

Figure 23 displays the results of the crowdsourced survey. The results match the other sentiment sources as shown in Figure 9, Figure 11, and Figure 14. The process for deriving the scores is as follows:

1. All responses from the second survey were polarized (1-3 goes to 0, 4-7 goes to 1).
2. All the responses for each of the 222 pieces of text had their polarized scores totaled. If there were more 0s, the crowdsource predicted a negative sentiment; if there were more 1s, the prediction was positive. Ties went to positive.
3. The results were compared to the original author's polarized sentiment and displayed.

(A brief code sketch of these steps appears at the end of this subsection.)

The crowdsourced results reflect the trend I found in the other sentiment sources: the positive emotions have a high accuracy (for example, 100% for happy), while the negative emotions overall have a much lower accuracy. The crowdsourced data got a total of 157 correct sentiments out of 222, or 70.27%. Percentage-wise, this places its score above VADER/AFINN but just below Researcher, meaning the crowdsourced results beat out the lexicon analysis just slightly.

Figure 24 Crowdsource Ranged. The categorical results for the Workers' sentiment prediction.

Figure 24 shows the crowdsourced ranged results. The combined "great"/"good" score for the crowdsource is 3173 out of 4442, or 71.3%. The crowdsourced data beats out VADER/AFINN but loses to Researcher in both the polarized and categorical tests.
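As promised above, here is a minimal sketch of aggregation steps 1-3, under the assumption that the crowd responses are available per text as lists of 1-7 ratings; the variable names are illustrative rather than taken from the thesis.

```python
# Steps 1-3: polarize each 1-7 rating, take a per-text majority vote (ties go
# to positive), then score the votes against the authors' polarized sentiment.
def polarize(rating):
    return 0 if rating <= 3 else 1          # 1-3 -> negative (0), 4-7 -> positive (1)


def crowd_sentiment(ratings_per_text):
    """ratings_per_text: dict mapping text_id -> list of 1-7 crowd ratings."""
    votes = {}
    for text_id, ratings in ratings_per_text.items():
        positives = sum(polarize(r) for r in ratings)
        negatives = len(ratings) - positives
        votes[text_id] = 1 if positives >= negatives else 0   # tie counts as positive
    return votes


def accuracy(votes, author_sentiment):
    """author_sentiment: dict mapping text_id -> author's polarized sentiment (0/1)."""
    correct = sum(votes[t] == author_sentiment[t] for t in votes)
    return correct, correct / len(votes)
```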
Third Survey (Private Crowdsourcing)

Seeing the mediocre performance of the public crowdsourcing, I started thinking of different ways to get a better quality of responses. Ultimately, I decided on repeating the second survey, but instead of using MTurk, I used other researchers. I created an Excel sheet that contained the 222 test-set responses and sent it to my three thesis committee members. Each person was asked to read each piece of text and fill in the sentiment and emotion they thought best matched it. I had already completed a portion of this survey, as I had already annotated the sentiment for the 1106 original responses, but I had not annotated emotion at that time. I therefore read each response and filled in the emotion, but I kept my original sentiment (the "Researcher" sentiment source) instead of redoing it, to maintain consistency in my answers.

Figure 25 Committee Sentiment Results. This chart shows the results for the polarized sentiment for each committee member.

To match the theme of all the previous charts, Figure 25 shows the committee members' sentiment polarized and displayed, along with mine ("Researcher"). The average score from all four people was 155.5 out of 222, or 70.1%, placing the average better than all the machine learning models but worse than Researcher and the crowdsourced results. Committee Member 3 had the highest score at 170 out of 222, or 76.6%.

Figure 26 Committee Ranged. This chart shows the ranged (or categorical) sentiment instead of the polarized sentiment.

There are some interesting shifts that one can see when the sentiment is not polarized in Figure 26. Based on the average, the third-survey results greatly outperform AFINN (22.0%) and VADER (41.52%) but perform worse than Researcher (60.0%). The "great"/"good" score also does better, averaging 166 out of 222, or 74.8%. This beats VADER's score of 67.8% and AFINN's 61.0%. (Note: VADER and AFINN's results are based on all 1106 responses, not the test set alone.)

Results

Figure 27 Polarized Sentiment Results. This chart shows the results of the top performing models, the crowdsource, and each committee member's sentiment scores.

Figure 27 shows the summation of three surveys, four sources of sentiment, and 17 models. It contains each of the committee members' responses (the private crowdsource), the crowdsourced sentiment from the second survey, the accuracy of the lexicons VADER and AFINN on the 222-text test set, and the top performing model from each algorithm labeled with its respective sentiment source. Overall, Figure 27 shows that almost everything performed equally, but there are a couple of points worth noting.

The public crowdsourced results are not statistically significantly different from any other sentiment predictor. This can be seen in Table 7, where every sentiment predictor had a t-test performed between itself and the crowdsourced results.

Table 7 Crowd T-Tests. This table shows the resulting t-tests for crowdsourced vs. all other sentiment predictors.

Sentiment Source                    Resulting P-Value
VADER                               0.36
AFINN                               0.13
Logistic Regression (Researcher)    0.22
Naïve Bayes (Researcher)            0.22
KNN (AFINN)                         0.22
Random Forest (VADER)               0.18

Another interesting point is that VADER also performed better than all the models percentage-wise. This creates a muddied answer to the previously proposed question of "is there some other way of predicting sentiment that could outperform any of these algorithms?" Looking at the raw scores, yes, there are ways of creating sentiment that outperform the model and lexicon sentiment analyzers. The private crowdsourcing and public crowdsourcing beat all four top performing models presented in Figure 27, as well as VADER and AFINN. However, the crowdsourced responses are not statistically significantly different from any method or model. In other words, the average person cannot outperform the machine. Only a highly educated and skilled person in this field, such as the private crowdsourcing done by the thesis committee, can outperform these specialized tools. This also means the reverse is true: a highly specialized tool with an extreme amount of research and preparation performs equally to the average person and can still be beaten by a subject matter expert.

Sentiment vs Emotion

Recall that the positive emotions have a high accuracy and the negative emotions have a lower accuracy. Since the models are based on those same sentiment sources, it is clear that their predictions will result in a similar trend. What about the private and public crowdsourcing? I know that my own answers (Researcher) followed suit, but do the other human beings involved in this project repeat this?

Figure 28 Sentiment vs Emotion. This chart shows the breakdown of the top sentiment performers compared to their individual accuracies across the six emotions.

As it turns out, yes, they do repeat it. Figure 28 shows every sentiment source, human or not, repeating what was observed in the sentiment sources section of this paper. As a reminder, these results are not predicting the emotion; rather, these are the accuracies of the sentiment prediction broken down by emotion.
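The per-emotion breakdown in Figure 28 is a grouping of sentiment correctness by each text's prompt emotion. A hedged pandas sketch of that computation, with assumed column names rather than the thesis's actual code, is shown below.

```python
# Sketch: accuracy of one sentiment predictor broken down by the prompt emotion
# of each text. Column names are assumed for illustration.
import pandas as pd


def accuracy_by_emotion(df: pd.DataFrame, prediction_col: str) -> pd.Series:
    """df needs 'author_sentiment', 'emotion', and a 0/1 prediction column."""
    correct = (df[prediction_col] == df["author_sentiment"]).astype(int)
    return correct.groupby(df["emotion"]).mean()   # e.g. happy vs. the negative emotions
```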
Figure 29 Sentiment Accuracy for Happy. This chart is zoomed in on the happy emotion in Figure 28.

Figure 29 zooms in on the Happy results from Figure 28. Three analyzers got 100% accuracy, meaning they correctly predicted all the sentiments for the happy responses in the test set. All three of those 100% scores came from human beings: Researcher, the private crowdsourced survey, and the public crowdsourced survey. In fact, all but logistic regression got above 90%.

Figure 30 Sentiment Accuracy for Negative Emotions. This chart is zoomed in on the negative emotions in Figure 28.

Figure 30 shows that the negative emotions have an average accuracy between 50% and 60% across the board. This brings up a multitude of questions and opens the door to additional research. At this point I have completed all the original research I set out to do; I have successfully disproven my research question and shown that my hypothesis was wrong. However, I decided to carry out a very brief exploration into emotional analysis to see if it would help me understand why the negative emotions are performing so badly. Is this the fault of human beings for not understanding sentiment and how to measure it, or is this a sign that human beings do not understand certain emotions as well as others?

Emotional Analysis

As stated in the previous section, this is a brief exploration into emotional analysis. I realized I had a dataset capable of supporting this analysis, but as it is not the main focus of this paper, I performed a light study on the subject. In the first survey, Figure 1, 200 MTurk workers were prompted to answer at least 3 of 6 questions, each of which was based around one of six emotions: happy, sad, anger, fear, disgust, and surprise. At this point I was so familiar with and used to working with the test set that I performed this analysis on that data. Since every response came from one of those six prompts, I used the prompt as the relating emotion: all the responses to the happy prompt were tagged as happy, all the responses to the sad prompt were tagged as sad, and so on.

I used two analyzers, Text2Emotion (T2E) and LeXmo. T2E has its own internal setup for text processing, so I did not perform any of the steps I listed in the text processing section for this analysis; performing my own processing would interfere with T2E's. Because of this, I did not modify the original texts in any way. If I had, I believe I would have fouled my results, as the two analyzers would be receiving different sets of data, since LeXmo does not do its own pre-processing.

Text2Emotion (T2E)

Figure 31 Text2Emotion. This figure shows the emotion prediction accuracies of T2E.

T2E analysis works like VADER and AFINN. T2E contains a corpus of words, where each word is rated with the emotion, or emotions, that best match it. T2E gives a score ranging from 0 to 1 for each of the following five emotions: happy, sad, anger, fear, and surprise. This also means that T2E can produce a multi-emotion answer. For the purposes of analysis, the emotion with the highest score is what I used as T2E's prediction for that text. It is also worth mentioning that if T2E failed to produce any result, i.e., it did not give any emotion a score higher than 0, T2E's prediction was considered wrong.
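In practice this amounts to calling the text2emotion package and taking the highest-scoring emotion, treating an all-zero result as an automatic miss. The sketch below is an assumption about how this could look with the package's get_emotion function; the thesis notes the package needed patching to run, so exact behavior and label spelling may differ.

```python
# Hedged sketch of T2E-style prediction: the highest-scoring emotion wins, and an
# all-zero score dictionary counts as a wrong prediction.
import text2emotion as te


def t2e_predict(text):
    scores = te.get_emotion(text)              # e.g. {'Happy': 0.5, 'Sad': 0.0, ...}
    if all(value == 0 for value in scores.values()):
        return None                            # no emotion detected -> counted as wrong
    return max(scores, key=scores.get)         # a label such as 'Happy' or 'Angry'


def t2e_accuracy(texts, true_emotions):
    """true_emotions must use the same label spelling as text2emotion's output."""
    predictions = [t2e_predict(t) for t in texts]
    correct = sum(p == e for p, e in zip(predictions, true_emotions))
    return correct / len(texts)
```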
Figure 31 shows the breakdown of T2E's prediction results. T2E predicted 78 out of 222 emotions correctly, or 35.1%. Omitting the disgust responses, since T2E does not support disgust, T2E scores 78 out of 181, or 43.1%. This is an improvement but still fails to get even half of the texts' emotions correct. T2E does outperform the majority-class baseline of 1/6, or 16.66%, so its scores are still better than random guessing.

T2E does not repeat the trend seen with the emotions from the sentiment analyzers. It is important to once again make the distinction that the previous figures were showing the sentiment accuracy by emotion; they were not predicting the emotion directly. The question is whether this is a misunderstanding of sentiment or the inability of the sentiment analyzers to understand certain emotions as clearly as others. T2E's sadness prediction rate is nearly as high as happy, and anger/fear/surprise are all under 40%. It is unfortunate that disgust cannot also be compared, but performing emotional analysis with only these five emotions changes the results by emotion dramatically. As a result, the positive emotions are no longer the most accurate.

LeXmo

Figure 32 LeXmo. This figure shows the emotional accuracies of LeXmo.

LeXmo fundamentally works nearly identically to T2E. LeXmo has its own corpus of words, rated with a score for each emotion. The words are processed and a score is returned for each emotion. The two main differences between T2E and LeXmo are the following: LeXmo supports disgust, and LeXmo does not do any major preprocessing on the text it is given. In Figure 32, LeXmo gets an overall accuracy of 85 out of 222, or 38.29%. Curiously, LeXmo produces results differently than T2E. Instead of happy and sad being the strongest emotions, as with T2E, happy and anger are the strongest. LeXmo only got a single surprise response correct; it failed to predict the emotion surprise a massive thirty-four times. Also, unlike T2E, sad was the second worst performer for LeXmo, whereas it was the second best for T2E. So far this is not conclusive.

Second Survey (Public Crowdsourcing)

Figure 33 Crowdsourced Emotions. This figure shows the emotion prediction results from the second survey.

As a reminder, I asked participants of the second survey to rate the sentiment and to indicate what they thought was the strongest emotion. Figure 33 shows that the crowdsourced emotion analysis was able to get 2409 correct emotions out of 4442 prompts, or 54.2%. There is also a third set of unique outcomes. Like T2E, the strongest emotions are happy and sad, but T2E's worst emotion (setting aside disgust, which it cannot predict) was fear, while the crowdsourcing got fear correct over 60% of the time. The only major similarity between LeXmo and the crowdsourcing is that happy and fear performed fairly similarly for each; otherwise, all three of these predictors are fairly unique as well.

Third Survey (Private Crowdsourcing)

Figure 34 Committee Emotions. This chart shows the results from the third survey.

The emotion results from the third survey are shown in Figure 34. Since the third survey was a direct repeat of the second survey, I can also use it as a source of emotion analysis. Averaging a total accuracy of 79%, the committee's responses far outperformed all other analyzers in nearly every emotion.

Results

Figure 35 All Emotion Analyzers. This figure shows the total accuracy of each emotion analyzer.

Figure 35 makes it evident that the human beings involved in this process performed better at emotional analysis than either of the machine algorithms. All four humans have over double the accuracy of LeXmo and T2E. They also average 24% higher accuracy than the crowdsourcing. The crowdsourcing also beats T2E and LeXmo, just with tighter margins of 19% and 16% respectively.
These results demonstrate that human beings universally outperform these emotion analyzers. They are difficult to interpret definitively, however, as all the surveys for both the committee members and the public crowdsourcing were done remotely and without any direct observation. Still, I believe there is at least one provable takeaway here from my very limited exploration: emotions are complicated. T2E excels at determining happy and sad but did extremely poorly with fear and cannot determine disgust at all. LeXmo does very well with happy and anger but is abysmal at sad and surprise. The crowdsourcing did quite well with happy and decently with sad and fear; everything else was mediocre, falling under 50% accuracy. The committee performed excellently with happy, sad, anger, and fear, with only one of these falling under 80%. The committee also did decently with disgust and surprise, averaging 61% and 68% respectively.

When comparing back to the sentiment analysis done on the same texts in Figure 27, the crowdsourcing did a much better job at predicting sentiment, getting a 15% bump in accuracy compared to its emotion prediction. Conversely, the committee did better at predicting emotion, with each person doing roughly 10% better. Every emotion's prediction accuracy is all over the place, and the sentiment-versus-emotion comparisons conflict with each other. What does this mean?

When looking at text on a binary scale, positive or negative, it is relatively easy to say what that text represents. Look at VADER: just taking each word and assigning it a static sentiment score results in over 70% accuracy. Emotion analysis operates across a multitude of dimensions, and often a single emotion cannot be pinpointed. As an example, look at the response from worker A2R28I9SM3G0YL: "I am a pets lover. I raised a puppy whose name was Jack and it died one day." What emotion would you personally rate this with? I predicted sad; the committee predicted sad, sad, and happy. What did the original author say? Sad. What about sentiment? I predicted 5; the committee predicted 2, 4, and 7. The author gave this a 5. I got it spot on, while the committee had a range of responses. The emotions were very similar, but the sentiment was all over the place.

This initially sounds like a contradiction of the claim I just made about emotions being more complex than sentiment, but I think it actually proves it. As annotated by the author, this story is both sad and positive. The author misses his dog and is sad that Jack is dead, but overall the author says this story is positive, most likely due to the time the author got with Jack. I believe this to be evidence of why there was such an extreme variety in accuracies across the emotion analyzers. Sentiment is a one-dimensional spectrum, positive to negative. Emotions, on the other hand, exist in many dimensions: happy, sad, anger, fear, and so on are not limited to just six categories, and each of them can itself range from positive to negative. That story was sad but also positive; that sounds like a contradiction, but it is not. Think about the positive fear responses in Table 3: "I was watched a horror movie yesterday". This response was given a 6 by the author for sentiment but was written for the fear prompt. It was not in the test set, so the committee, me included, did not analyze it directly. At the time of writing these words, I would have rated it as happy or maybe even surprise.
I would struggle to decide which emotion best matches this text if I did not already know its matching emotion. I think the biggest limitation of emotion analysis is that to be "correct" in your prediction, you must match one singular emotion. I know that LeXmo and T2E are capable of reporting more than a single emotion, but I do not possess a way to measure these weightings for my data specifically. It is relatively straightforward to decide whether something is positive or negative, but this is not the case at all when it comes to emotion.

Limitations

Throughout the creation of this paper, I encountered two large limitations that I feel are necessary to discuss. First is the limitation of MTurk. The quality of the text created by the MTurk Workers was far below what I was expecting. MTurk allows you to list qualifications for your surveyors; I required that all surveyors be in the United States to help maintain some form of consistency within the cultural understanding of the emotions used within the prompts. I also required that the Workers be over the age of 18 so no minors were involved with the survey. This means that all my participants were legal adults located within the United States who were receiving monetary compensation to participate. Despite this, a large quantity of the Workers found ways to break the rules and effectively cheat. I would get the same response 10 or 15 times from 15 "unique" Workers; they would copy and paste their exact answers, with the same spelling mistakes and the same story, over and over on various accounts that are supposedly different people.

The survey had spell check enabled, but many of the surveyors (both accepted and rejected) submitted responses with spelling mistakes. This research is completely based on words and their meaning, and spelling mistakes can reduce the quality of my results. In a previous section I presented a response that contained the misspelled word "horrow" (horror). The lexicons VADER and AFINN will not be able to process this word, and the machine learning approaches will treat it as a unique word, distinct from "horror." This means spelling mistakes can dramatically reduce the accuracies and therefore my results. You have also already seen several examples of the Workers not creating their own text: they would copy something from an essay, or Wikipedia, or somewhere else. If I were to repeat this research, I would find an alternative to MTurk for deriving my annotated text.

The main redeeming quality of MTurk was the speed and volume of the text itself. I received 200 responses within 30 minutes of placing the survey live. By the time I finished the process of accepting and rejecting, I had 714 total unique responses. However, this does not replace the necessity of quality responses.

The other limitation I encountered was the lack of emotion analyzer tools. I only had two within this paper, Text2Emotion and LeXmo, and even finding those two was a struggle. Both are Python packages designed to analyze the emotions represented within a piece of text. Not only was it difficult to find them, but both were also so deprecated that I had to rewrite portions just so they would be usable. The rewriting was not the limitation itself, but it does reflect the overall environment of emotion analysis: there are a limited number of tools to perform this type of analysis, and those that do exist are often so old they no longer function out of the box.
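As a concrete illustration of the spelling-mistake limitation discussed above, the sketch below checks whether a token appears in VADER's lexicon at all; a misspelling such as "horrow" would simply contribute nothing to the score. This assumes the vaderSentiment package and is illustrative rather than part of the thesis's pipeline, and the example sentences are mine.

```python
# Hedged illustration: a misspelled token is absent from VADER's lexicon, so it
# adds no sentiment weight, while the correctly spelled word does.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for word in ("horror", "horrow"):
    print(word, "in VADER lexicon:", word in analyzer.lexicon)  # lexicon maps word -> valence

# The compound scores of the two variants will generally differ because the
# misspelled sentence loses the lexicon entry for "horror".
print(analyzer.polarity_scores("I watched a horror movie yesterday"))
print(analyzer.polarity_scores("I watched a horrow movie yesterday"))
```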
Conclusion

My original research question reads: "Does the use of primary annotated text impact the results of various sentiment analysis methods?" and my hypothesis for this question was a simple "yes." However, after the extensive research I did within this paper, I have proven that this is not the case. My hypothesis was wrong; using alternative sentiment sources does not affect the accuracy of these algorithms when using the author's sentiment rating as the baseline. That said, no prediction method was able to get 100% across the board. Figure 9 and Figure 11 show that lexicon approaches only managed to correctly predict the sentiment ~67% of the time. Figures 16-20 also yield comparable results. Regardless of where the sentiment came from, even the original source, the accuracies stayed between 60% and 70%. Figures 23 and 25 from the public and private crowdsourcing both landed on an average of 70%.

My other question asked, "is there some other way of predicting sentiment that could outperform any of these methods?" None of the sentiment sources, aside from individual respondents, were able to produce statistical significance. I interpret this to mean that these models and lexicons are not distinguishable from the average person. It is possible that these algorithms may pass a Turing test, as they perform equally to most human beings. The public crowdsource survey and the three committee members' results help to solidify this claim, as they all scored very similar accuracies and do not produce statistical significance when compared to the lexicons and models.

For future work, I would focus on two aspects specifically. First, I want to explore emotional analysis further. I want to find a way to create a multi-emotion test set: maybe something like repeating the first survey, but rather than using the emotion as the prompt, I would provide prompts like "Write about a time you were stressed" and then ask the Workers to select which emotion(s) best match their text. This would allow tools like LeXmo and T2E to be used to their full potential rather than forcing them to predict a singular emotion. The second route I would like to explore is finding subject matter experts to repeat this experiment. I do believe Committee Member 3's responses are evidence that human beings can perform better sentiment analysis than machines can, but I want to know whether that ability is unique to them or whether other subject matter experts can perform this well.

Ultimately, I was wrong; I successfully disproved my research question. But I believe I have opened the door for exploring alternative concepts, such as the creation of a multi-emotion analysis test set and the skill of subject matter experts in predicting sentiment. In saying that, I was able to derive a way to definitively demonstrate the accuracies of these algorithms. Rather than using some secondary source of sentiment and comparing a multitude of algorithms to each other, I formulated a way to compare sentiment source to sentiment source using said algorithms.

References

1. Hutto, C., & Gilbert, E., "VADER: A parsimonious rule-based model for sentiment analysis of social media text." In Proceedings of the international AAAI conference on web and social media, 2014, Vol. 8, No. 1, pp. 216-225.
2. Nielsen, F. Å., AFINN project, 2017.
3. Hu, M., & Liu, B., "Mining and summarizing customer reviews." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 168-177.
4. Haryanto, B., Ruldeviyani, Y., Rohman, F., TN, J. D., Magdalena, R., & Muhamad, Y. F., "Facebook analysis of community sentiment on 2019
5. Nasukawa, T., & Yi, J., "Sentiment analysis: Capturing favorability using natural language processing." In Proceedings of the 2nd international conference on Knowledge capture, 2003, pp. 70-77.
6. Whitelaw, C., Garg, N., & Argamon, S., "Using appraisal groups for sentiment analysis." In Proceedings of the 14th ACM international conference on Information and knowledge management, 2005, pp. 625-631.
7. Liu, Y., Bi, J. W., & Fan, Z. P., "Ranking products through online reviews: A method based on sentiment analysis technique and intuitionistic fuzzy set theory." Information Fusion, 2017, 36, 149-161.
8. Hemmatian, F., & Sohrabi, M. K., "A survey on classification techniques for opinion mining and sentiment analysis." Artificial Intelligence Review, 2019, 52(3), 1495-1545.
9. Morency, L. P., Mihalcea, R., & Doshi, P., "Towards multimodal sentiment analysis: Harvesting opinions from the web." In Proceedings of the 13th international conference on multimodal interfaces, 2011, pp. 169-176.
10. Sankaran, S., Dendale, P., & Coninx, K., "Evaluating the impact of the HeartHab app on motivation, physical activity, quality of life, and risk factors of coronary artery disease patients: multidisciplinary crossover study." JMIR mHealth and uHealth, 2019, 7(4), e10874.
11. Duwairi, R. M., & Qarqaz, I., "Arabic sentiment analysis using supervised classification." In 2014 International Conference on Future Internet of Things and Cloud, 2014, pp. 579-583. IEEE.
12. Raina, P., "Sentiment analysis in news articles using sentic computing." In 2013 IEEE 13th International Conference on Data Mining Workshops, 2013, pp. 959-962. IEEE.
13. Mullen, T., & Collier, N., "Sentiment analysis using support vector machines with diverse information sources." In Proceedings of the 2004 conference on empirical methods in natural language processing, 2004, pp. 412-418.
14. Pang, B., Lee, L., & Vaithyanathan, S., "Thumbs up? Sentiment classification using machine learning techniques," 2002, arXiv preprint cs/0205070.
15. Chen, T., Xu, R., He, Y., & Wang, X., "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN." Expert Systems with Applications, 2017, 72, 221-230.
16. Samuel, J., Ali, G. G., Rahman, M., Esawi, E., & Samuel, Y., "Covid-19 public sentiment insights and machine learning for tweets
17. Yadav, A., & Vishwakarma, D. K., "Sentiment analysis using deep learning architectures: a review." Artificial Intelligence Review, 2020, 53(6), 4335-4385.
18. Habernal, I., Ptáček, T., & Steinberger, J., "Sentiment analysis in Czech social media using supervised machine learning." In Proceedings of the 4th workshop on computational approaches to subjectivity, sentiment and social media analysis, 2013, pp. 65-74.
19. Narayanan, V., Arora, I., & Bhatia, A., "Fast and accurate sentiment classification using an enhanced Naive Bayes model." In International Conference on Intelligent Data Engineering and Automated Learning, 2013, pp. 194-201. Springer, Berlin, Heidelberg.
20. Denecke, K., "Using SentiWordNet for multilingual sentiment analysis." In 2008 IEEE 24th international conference on data engineering workshop, 2008, pp. 507-512. IEEE.
21. Ghosh, M., & Kar, A., "Unsupervised linguistic approach for sentiment classification from online reviews using SentiWordNet 3.0." Int J Eng Res Technol, 2013, 2(9), 1-6.
22. Jimenez, S., Gonzalez, F. A., & Gelbukh, A., "Soft cardinality in semantic text processing: experience of the SemEval international competitions." Polibits, 2015, 51, 63-72.
23. Chandler, J., Rosenzweig, C., Moss, A. J., Robinson, J., & Litman, L., "Online panels in social science research: Expanding sampling methods beyond Mechanical Turk." Behavior Research Methods, 51(5), 2019, 2022-2038.
24. He, Y., & Zhou, D., "Self-training from labeled features for sentiment analysis." Information Processing & Management, 2011, 47(4), 606-616.
25. Liu, Z., Dong, X., Guan, Y., & Yang, J., "Reserved self-training: A semi-supervised sentiment classification method for Chinese microblogs." In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 2013, pp. 455-462.
26. Mahyoub, F. H., Siddiqui, M. A., & Dahab, M. Y., "Building an Arabic sentiment lexicon using semi-supervised learning." Journal of King Saud University - Computer and Information Sciences, 2014, 26(4), 417-424.
27. Mizumoto, K., Yanagimoto, H., & Yoshioka, M., "Sentiment analysis of stock market news with semi-supervised learning." In 2012 IEEE/ACIS 11th International Conference on Computer and Information Science, 2012, pp. 325-328. IEEE.
28. Mäntylä, M. V., Graziotin, D., & Kuutila, M., "The evolution of sentiment analysis—A review of research topics, venues, and top cited papers." Computer Science Review, 2018, 27, 16-32.
29. Richmond, J. A., "Spies in ancient Greece." Greece and Rome (Second Series), 1998, vol. 45, no. 01, pp. 1-18.
30. Stagner, R., "The cross-out technique as a method in public opinion analysis." The Journal of Social Psychology, 1940, 11(1), 79-90.
31. Williamson, J. D., U.S. Patent No. 4,093,821, 1978. Washington, DC: U.S. Patent and Trademark Office.
32. Ekman, P., "Are there basic emotions?" 1992.
33. Ekman, P., "Facial expressions of emotion: New findings, new questions," 1992.
34. Joshi, A., Kale, S., Chandel, S., & Pal, D. K., "Likert scale: Explored and explained." British Journal of Applied Science & Technology, 7(4), 2015, 396.
35. Amazon Mechanical Turk [Online]. Available: https://www.mturk.com/. [Accessed: Throughout 2022]
36. Chandler, J., Rosenzweig, C., Moss, A. J., Robinson, J., & Litman, L., "Online panels in social science research: Expanding sampling methods beyond Mechanical Turk." Behavior Research Methods, 51(5), 2019, 2022-2038.
37. Pabile, "Topic: hapiness." Course Hero, (n.d.). Retrieved October 2022, from https://www.coursehero.com/file/89848825/Topic6docx/
38. Wikimedia Foundation, "Happiness." Wikipedia, 2022, September 20. Retrieved October 2022, from https://en.wikipedia.org/wiki/Happiness
Format | application/pdf |
ARK | ark:/87278/s6v7t7js |
Setname | wsu_smt |
ID | 96891 |
Reference URL | https://digital.weber.edu/ark:/87278/s6v7t7js |