Title | Reeder, Samuel_MCS_2020 |
Alternative Title | Effects of Explanations of Recommendation Engines |
Creator | Reeder, Samuel |
Collection Name | Master of Computer Science |
Description | Recommendation engines are a ubiquitous feature of the modern era. The advent of the internet as a means of buying and selling products has pushed the field of artificial intelligence to produce software and algorithms that make accurate predictions with respect to consumers' desires and buying tendencies. Producing accurate recommendation engines generally involves the use of complex algorithms and machine learning-driven models. The complexity of these models brings with it several disadvantages; one of these is a lack of human comprehensibility. To resolve this issue, research teams in the field are involved in programs that attempt to make artificial intelligence systems explainable. In the context of recommendation engines, an explainable system is one that provides users with a recommendation paired with an explanation that is relevant to the user. In this paper, research is presented that gives insight into what types of explanations are most effective in changing a user's trust in and understanding of recommendation engines. |
Subject | Computer science |
Keywords | Recommendation engines; Internet; Buying patterns |
Digital Publisher | Stewart Library, Weber State University |
Date | 2020 |
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
OCR Text | A Thesis in the Field of Computer Science for the Degree of Master of Science in Computer Science Weber State University December 2020 Copyright 2020 Samuel Walter Reeder Effects of Explanations of Recommendation Engines Abstract Recommendation engines are a ubiquitous feature of the modern era. The advent of the internet as a means of buying and selling products has pushed the field of artificial intelligence to produce software and algorithms that make accurate predictions with respect to consumers' desires and buying tendencies. Producing accurate recommendation engines generally involves the use of complex algorithms and machine learning-driven models. The complexity of these models brings with it several disadvantages; one of these is a lack of human comprehensibility. To resolve this issue, research teams in the field are involved in programs that attempt to make artificial intelligence systems explainable. In the context of recommendation engines, an explainable system is one that provides users with a recommendation paired with an explanation that is relevant to the user. In this paper, research is presented that gives insight into what types of explanations are most effective in changing a user's trust in and understanding of recommendation engines. Dedication I would like to dedicate this work to my family. My wife has been a never-ending well of support and inspiration to me; I never could have done this without her sacrifice for and tolerance of me during this process. I would also dedicate this work to my two daughters, who get to have their daddy back now that this work is finished. Acknowledgments I would like to acknowledge and thank the United States Air Force for their support of this work by providing me with the means and opportunity to complete this research and earn my master's degree. I would also like to thank my thesis committee for their dedication to helping me succeed. I would like to especially thank Robert Ball, who was always willing to answer my phone calls, give of his expertise, and, most importantly, help keep my vision and efforts focused during this project. Table of Contents Dedication .......................................................................................................................... iii Acknowledgments.............................................................................................................. iv List of Tables .................................................................................................................... ix List of Figures .....................................................................................................................x Chapter 1. Introduction ......................................................................................................21 1.1 What is a Recommendation Engine? ................................................................21 1.2 How do Recommendation Engines Work? ......................................................22 1.2.1 Content-Based Recommendation ......................................................23 1.2.2 Collaborative Filtering Recommendation Engines ...........................24 1.3 Realistic Application and the Issue of Complexity..........................................25 Chapter 2. Related Work....................................................................................................26 2.1 The Need for Explainable AI.
..........................................................................27 2.2 Challenges and Solutions to Making AI Explainable ......................................28 2.2.1 The Black Box Problem ....................................................................30 2.2.1.1 Explaining the Model: Global Explanations ......................30 2.2.1.2 Explaining the Outcome: Local Explanations ...................33 2.3 Explainable Recommendation Systems ...........................................................34 2.3.1 Explanation Methods ........................................................................35 2.3.1.1 Text-Based Explanations ...................................................35 2.3.1.2 Visual Explanations ...........................................................37 2.3.2 Explainable Models ..........................................................................38 2.4 Summary ..........................................................................................................40 Chapter 3. Research Questions ..........................................................................................41 3.1 Questions..........................................................................................41 Chapter 4. Experiment and Methodology ..........................................................................42 4.1 Experiment Administration ..............................................................................43 4.1.1 The Tutorial Stage.............................................................................44 4.1.2 The Recommendation Stage .............................................................46 4.1.3 Final Survey ......................................................................................46 4.2 The Recommendation System .........................................................................48 4.2.1 Simple Explanations .........................................................................48 4.2.2 Technical Explanations .....................................................................50 4.2.3 Visual Explanations ..........................................................................51 4.2.4 False Explanations ............................................................................53 4.2.5 How the Experimental Recommendation System Works .................55 4.2.5.1 Experimental Consistency .................................................55 4.2.5.2 The Challenge of Creating False Explanations ..................55 4.2.5.3 Focus on the Research Questions ......................................56 4.3 Experiment Design...........................................................................................56 4.3.1 The Experimental Premise ................................................................56 4.3.2 The purpose of the False Explanations .............................................57 4.3.3 Answering the Research Questions ..................................................58 4.3.3.1 Measuring Trust and Understanding ..................................59 4.3.3.2 Non-Survey Data Capture ..................................................62 4.3.3.3 The Final survey ................................................................62 Chapter 5.
Milestones ........................................................................................................63 5.1 Milestone 1: Gathering and Preparing Data.........................................63 5.2 Milestone 2: Database Creation ...........................................................63 5.3 Milestone 3: Create a Recommendation System. ................................64 5.3.1 The Choice to Contrive .........................................................64 5.3.2 Creating the Experimental Website ......................................66 5.4 Milestones 4 and 5: IRB Application, Participant Recruitment, and Execution of Experiment ...........................................................................68 5.5 Milestone 6: Analyze results and Thesis Document Creation .............69 Chapter 6. Results ..............................................................................................................70 6.1 Bias in the Data ................................................................................................70 6.1.1 Participant Demographics .................................................................70 6.1.2 Degrees of Separation from Participants ..........................................71 6.1.2 The Assumption of Trust ..................................................................71 6.2 Results: Main Findings ....................................................................................72 6.2.1 Trust ..................................................................................................72 6.2.2 Understanding ...................................................................................77 6.2.3 Analysis of Explanation types. .........................................................79 6.3 Anecdotal Findings ..........................................................................................81 6.3.1 Final Survey Question 1....................................................................82 6.3.2 Final Survey Questions 2 and 4 ........................................................84 6.4 Summary of Main Findings .............................................................................86 Chapter 7. Future Work .....................................................................................................87 7.1 Inter-Disciplinary Efforts .................................................................................88 7.2 Investigating Types of Explanations ................................................................88 7.3 Validating Research .........................................................................................89 Chapter 8. Conclusion ........................................................................................................90 8.1 Paper Review ...................................................................................................90 8.2 Final thoughts...................................................................................................91 Appendix 1: Survey Questions ..........................................................................................92 Appendix 2: Final Survey Questions .................................................................................94 References ..........................................................................................................................95 List of Tables Table 1. Experiment Demographics ................................................................................. 
71 List of Figures Figure 1. User Matrix .........................................................................................................23 Figure 2. An Example of Experimental Recommendation ................................................47 Figure 3. Example of Simple Explanation .........................................................................49 Figure 4. Another Example of Simple Explanation ...........................................................49 Figure 5. Example of a Technical Explanation ..................................................................50 Figure 6. Another Example of a Technical Explanation....................................................50 Figure 7. Example of Visual Explanation ..........................................................................52 Figure 8. Example of a False Simple Explanation .............................................................53 Figure 9. Example of a False Technical Explanation ........................................................53 Figure 10. Example of a False Visual Explanation ............................................................54 Figure 11. Survey Question 1 ............................................................................................59 Figure 12. Survey Question 2 ............................................................................................59 Figure 13. Survey Question 3 ............................................................................................60 Figure 14. Survey Question 4 ............................................................................................61 Figure 15. Survey Question 5 ............................................................................................61 Figure 16. Survey Question 6 ............................................................................................62 Figure 17. Average Response to Survey Question 1 .........................................................73 Figure 18. Average Response to Survey Question 2 .........................................................74 Figure 19. Interaction Between Gender and STEM Status of Participants for Survey Question 1 ..........................................................................................................................76 Figure 20. Interaction Between STEM Status and Reaction to False Explanations for Survey Question 1 ..............................................................................................................77 Figure 21. Average Response to Survey Question 3 .........................................................78 Figure 22. Average Response to Survey Question 4 .........................................................79 Chapter 1. Introduction In 1950 Alan Turing published his landmark paper titled "Computing Machinery and Intelligence," in which he posed the question "Can machines think?" [1]. Since that time, the field of artificial intelligence has been, and continues to be, a hotbed of research and discovery in computer science. Artificial intelligence (AI) has become a part of everyday life in the modern world, with AI-driven systems growing and evolving all around us. The use of AI-powered systems has taken many forms, ranging from machines that play chess better than humans [2] to complex networks that drive cars [3].
Fully surveying the many ways AI is being integrated into our world is beyond the scope of this work; instead, our efforts will be focused on a subset of AI-driven systems known as recommendation engines. This introductory chapter is made up of the following sections: first, a brief overview will be given of what a recommendation engine is; next, a short overview will be given of how recommendation engines work; and lastly, the concept of explainable AI within the context of recommendation engines will be introduced. 1.1 What is a Recommendation Engine? Imagine you are on vacation with some of your friends and you pass by a stand selling souvenirs. You stop and look at some of the different clothes that they have. You like a few of them, but you are unsure of what to buy. So, you ask your friends what you should buy. They make suggestions based on their knowledge of who you are and what you like, as well as their own experiences with similar products. This example may seem trite, but it is at the core of what recommendation engines driven by AI try to replicate. Broadly speaking, a recommendation engine is an AI system that suggests some form of content or product to a user based on analysis of data relevant to that user. To flesh this out, let us examine a recommendation service you are likely familiar with: Amazon.com. Amazon.com has millions of customers each day and moves hundreds of millions of products to these customers each year. One of the big reasons people use Amazon.com is that it provides access to a huge variety of product types and a large set of options for similar products. The challenge to you as a potential customer is finding what you want. Instead of forcing users to sift through millions of products and the millions of reviews that come with them, Amazon uses AI systems to recommend items to you based on your purchase history and the purchase history of other Amazon users. The question here is: how does a system like this work? 1.2 How do Recommendation Engines Work? At the most basic level, recommendation engines are filters. They take a large set of items and filter them down to those items which best suit a given user. The way this filtering is done can be visualized as a matrix completion puzzle. Consider figure 1 below. Figure 1. User Matrix A matrix showing a set of users and a set of items. Here we see a matrix of the first 4 users and first 4 items in a data set. For each item we know whether a user did or did not buy that item. A cell value of no data tells us that the user has never made a choice about that item. The job of the recommendation engine is to fill in this matrix and predict how each user will respond when given the option to buy or not buy every item. This filled-in matrix is used to provide recommendations to the user. So now the question becomes: how does a recommendation system fill in this matrix? The answer to this question is broad and complicated. For the purpose of this overview we are going to cover the concepts of content-based recommendation and collaborative filtering. 1.2.1 Content-Based Recommendation One of the most basic forms of recommendation engines is the content-based recommendation system. A good explanation of these systems can be found in [4]. At a high level, the idea that underpins this type of system is that a user will likely be interested in items that are similar to ones that the same user has already indicated a liking for.
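To make this idea concrete, the following is a minimal, illustrative sketch of content-based scoring using a simple attribute-overlap (Jaccard) similarity. It is not taken from the thesis software; the item attributes and liked items are invented for illustration, and techniques the thesis actually discusses, such as TF-IDF, are introduced next.

# Minimal sketch of content-based recommendation (illustrative data only).
# Items are described by sets of attributes; unseen items are scored by their
# overlap (Jaccard similarity) with items the user has already liked.

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

items = {
    "item_a": {"chocolate", "chewy", "nuts"},
    "item_b": {"oatmeal", "raisin", "chewy"},
    "item_c": {"lemon", "crisp"},
}
liked = ["item_a"]                                   # items the user already likes
candidates = [name for name in items if name not in liked]

# Score each candidate by its best similarity to any liked item, then rank.
scores = {c: max(jaccard(items[c], items[l]) for l in liked) for c in candidates}
print(sorted(scores.items(), key=lambda kv: -kv[1]))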
To use the analogy of buying products, a content-based recommendation system will lean on prior knowledge about your purchase history and will suggest items for you to buy that are similar to it in some way. There are a lot of ways to infer similarity. One common method mentioned in [4] is to use an item's description as a measure of similarity. There are many mathematical means for determining the similarity between documents. One example of this is Term Frequency-Inverse Document Frequency (TF-IDF). This strategy uses a well-known formula given in [4] to determine the frequency and importance of words in a document. These values can then be compared to other documents to determine how similar they are from an empirical perspective. This similarity score is then used to recommend items to a user. 1.2.2 Collaborative Filtering Recommendation Engines One of the issues with recommendation engines is access to data for individual users. To make accurate predictions about a specific person, the recommendation system must have a large amount of data on that person. One of the ways that this issue can be overcome is with the use of user profiles and leveraging other users' data. This strategy depends on the assumption that a prediction can be made for a user based on both the user's previous choices and the choices of users who are similar to the target user. For the purpose of this example, a user is defined as anyone who has used the recommendation system, and a target user is the person who is currently asking for a recommendation. The first step in this system is to create a profile for the target user. The profile is meant to store information about items the user has previously given a rating to. This profile can either be created over time as the user interacts with the system, or it can be created by providing the user with a series of items that they must rate to use the system. The purpose of creating a profile is to help overcome what is known as the cold start problem in recommendation engines. The cold start problem refers to the fact that until a target user of a recommendation engine has given some information about their preferences, the recommendation engine cannot make accurate predictions. As a user rates products, a profile is built over time. Once a target user has a profile, recommendations for new items are given based on the similarity of each item to others the target user has already rated, as well as the ratings given to the new item by users who are similar to the target user. A good practical example of this process is given in [5]. Some of the theoretical underpinning for this process is given in [6].
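As an illustration of the collaborative filtering idea just described, the following minimal sketch predicts a missing cell of a small user-item rating matrix as a similarity-weighted average of other users' ratings. The matrix and values are invented for illustration, and this is not the thesis's own software.

import numpy as np

# Rows are users, columns are items; np.nan marks the "no data" cells the
# engine must fill in, as in the matrix-completion view of recommendation.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 3.0, 1.0],
    [1.0, 2.0, 4.0, np.nan],
    [np.nan, 1.0, 5.0, 4.0],
])

def cosine(u, v):
    mask = ~np.isnan(u) & ~np.isnan(v)              # compare only co-rated items
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, user, item):
    sims, ratings = [], []
    for other in range(R.shape[0]):
        if other != user and not np.isnan(R[other, item]):
            sims.append(cosine(R[user], R[other]))
            ratings.append(R[other, item])
    sims, ratings = np.array(sims), np.array(ratings)
    return float(sims @ ratings / sims.sum()) if sims.sum() > 0 else float("nan")

print(predict(R, user=0, item=2))                   # predicted rating for the empty cell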
1.3 Realistic Application and the Issue of Complexity The recommendation system strategies that have been discussed so far represent the foundation of recommendation systems. In practice, these strategies are mixed and combined with many forms of advanced machine learning to create reliable recommendations. One of the more famous examples of this can be found in the "Netflix Prize" competition [7]. In 2006 the video streaming company Netflix announced a coding challenge. The premise was that Netflix wanted to see if a team could develop an algorithm better than their Cinematch recommendation system. Any team who could produce a system that made a 10% or more improvement over the Cinematch system would be awarded $1,000,000.00 USD. Details of the competition are given by Netflix in reference [7]. The winning team presented a series of papers at the end of the competition detailing their solution, some of which are given in the following references [8-10]. These references show in detail the extraordinarily complex algorithm that was developed to win this competition. The common thread among these papers, especially [9], is that to beat the native Netflix recommendation engine, many different statistical and machine learning algorithms were blended together to get the improved performance [8-10]. One fair question to ask about this system, and any modern recommendation system, is: can these complicated systems be decomposed such that individual recommendations can be explained? And furthermore, if they can be explained, does it matter to the users of the recommendation? The goal of this work is to provide some insight into the "does it matter" question. In the related work chapter, we will introduce research that makes it clear that explainable recommendation systems are possible. The remainder of the paper will be structured as follows. Chapter 2 will provide an overview of related work. Chapter 3 will give the research questions. Chapters 4 and 5 will describe the experiment, methodology, and milestones used to answer them. Chapter 6 will provide the results of the research. Chapters 7 and 8 will provide some commentary on future work and conclude the document. Chapter 2. Related Work In this chapter a review of relevant literature about explainable artificial intelligence and its application to recommendation engines is presented. The chapter will be divided up as follows. First, an exploration into the need for explainable artificial intelligence will be given. The next section will highlight challenges of making artificial intelligence models explainable and give an overview of some of the solutions that have been found in research. Lastly, I will provide some insight into the research surrounding explainable recommendation engines. 2.1 The Need for Explainable AI. The use of machine learning-powered artificial intelligence is becoming more and more ubiquitous in our day. One example that makes this clear is the advent of self-driving cars. In a report published by Grand View Research in March of 2020, it was estimated that by 2030 the demand for self-driving cars will grow by 63.1% compared to the demand seen in 2020 [11]. Another report published in February 2020 estimated that the market value of self-driving car technology would grow from 56.21 billion USD in 2020 to 220.44 billion USD in 2025 [12]. These reports make it clear that this technology is on the rise. Despite all the excitement around this technology, the use of artificial intelligence presents some serious challenges. One of the obvious concerns is safety; for example, in 2018 a woman was killed in Tempe, Arizona, when she was struck by a self-driving car that was conducting a road test [13]. This example is shocking, but the concerns about the use of artificial intelligence are more nuanced than just questions of life and death. In an article written for the Harvard Business Review, Dattner et al. spoke on some of the legal and ethical implications of using artificial intelligence-based systems to aid in the hiring process for businesses. The authors comment on the idea that hiring tools that employ AI can be problematic because "it is not always clear what they assess, whether their underlying hypotheses are valid, or why they may be expected to predict job candidates' performance" [14].
The article points out that since the underlying AI is not well understood, it is unclear if these systems comply with nondiscrimination laws, the ADA (Americans with Disabilities Act), and others [14]. Another area where AI accountability is particularly important is when these systems are used for military applications. A 2019 article published in AI Magazine, authored by David Gunning and David W. Aha, gives insight into the need for explainable artificial intelligence in military applications. The authors note that prolific growth in the machine learning world has given rise to AI systems that are being used by the military. The article outlines a 4-year study of research into creating systems that can account for the choices they make as well as inform users of their shortcomings [15]. Gunning and Aha explain that the focus of the study was to find ways to develop AI systems that could provide justification for the suggestions that were made. In addition to the justifications, the AI systems should also be able to inform the user of potential flaws with the suggestions. The purpose of this type of AI system was to provide a tool to the military that could help in making strategic choices while at the same time providing enough transparency to justify its use. More examples are readily available, but these should be sufficient to make the case that explainable AI is important. The next section discusses the challenges of making AI systems explainable and highlights some solutions to those problems. 2.2 Challenges and Solutions to Making AI Explainable Explaining why a decision was made by ourselves or others is a daunting task. A study of history will show that even with all or most of the facts present, different motivations and accompanying explanations are posited for why individuals or groups made the choices that they did. Even more difficult is the task of predicting future actions that a person or group of people will take. The main reason for the complexity of the task is that for any given choice the human brain considers a huge set of factors. This means trying to create an artificial model for human decision making necessitates complex models. In fact, the human brain itself is the inspiration for one of the most prominent forms of artificial intelligence, known as artificial neural networks. Neural networks attempt to model the human brain by creating artificial neurons that are connected to each other in layers. Each neuron takes one or more inputs from neurons in previous layers and produces a single output that it passes on to the next layer of neurons until it reaches the output layer. At this point the collection of outputs from each of the output neurons is used to make some type of prediction or classification, depending on what the artificial network is designed to do. The learning process for a neural network comes through a process called backpropagation. At a high level, this process involves passing the errors produced by the neural network's output back into the layers of the machine so that it can adjust the weights of each neuron. The adjustment of these weights will cause each neuron to act differently when presented with the same input. This adjustment process has the goal of tuning the new output to reduce the error in the prediction or classification the machine makes. For more details on the process see [16]. A good survey of the history of this type of artificial intelligence is given in [17].
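To make the forward-pass and weight-update ideas just described concrete, here is a minimal sketch of a single artificial neuron trained with a gradient-descent update on one example. It is illustrative only, with invented inputs and learning rate, and is not drawn from any of the systems cited above.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 1.0]        # inputs from the previous layer (illustrative values)
w = [0.1, -0.2]       # connection weights
b = 0.0               # bias term
target = 1.0          # desired output for this training example
lr = 0.5              # learning rate

for step in range(3):
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # forward pass
    error = y - target                                      # prediction error
    # Backpropagation-style update for a squared loss: each weight moves
    # against its error gradient, dE/dw_i = error * y * (1 - y) * x_i.
    grad = error * y * (1 - y)
    w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    b -= lr * grad
    print(step, round(y, 4))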
Neural networks are, by their nature, not explainable, and certain properties, such as the use of randomized neuron weights [18], as well as other factors, contribute to this. While it is true that much research has been devoted to interpreting neural networks with external tools and algorithms, such as is shown in references [19-20], these methods do not change the nature of neural networks to make them comprehensible, but rather find an acceptable method for analyzing the output of the machines to explain how they work. There are many other types of machine learning algorithms that suffer from this same issue in the context of comprehensibility. In the next section we will discuss the generalized issue of explaining AI known as the black box problem. 2.2.1 The Black Box Problem In general, the issue of comprehensibility in machine learning and AI is known as the "black box problem". In an exhaustive survey of this issue, Guidotti et al. define the black box problem as follows: "In recent years, many accurate decision support systems have been constructed as black boxes, that is as systems that hide their internal logic to the user. This lack of explanation constitutes both a practical and an ethical issue" [21]. In the following subsections I will discuss two components of the black box problem that are relevant to my thesis. These components are global explanations and local explanations. 2.2.1.1 Explaining the Model: Global Explanations The essence of the model explanation problem is to take a black box model and make an approximation of it using a model that is known to be explainable. This type of explanation can be thought of as a global explanation. A global explanation is one that gives a set of rules or some other type of instructions that can be used to replicate any outcome that is produced by an AI system. One example is to use decision trees as the approximating model. Decision trees are useful because they provide a global explanation of how a system modeled by the tree works. This is because the route to any leaf in a decision tree can be traced from the root along each node. Furthermore, the algorithms for developing decision trees are generally simple and well understood. One of the early examples of this kind of explanation method is presented by Mark Craven and Jude W. Shavlik in a paper titled "Extracting Tree-Structured Representations of Trained Networks" [22]. The authors describe an algorithm they developed called Trepan. The purpose of Trepan is to create a decision tree from a trained neural network. This is done inductively by presenting the network with input and determining the splits in the tree based on the network's output [22]. An interesting continuation of Craven and Shavlik's work is given in [23]. Reference [23] describes the work of Olcay Boz. Boz proposed an expansion of the Trepan algorithm where the generated tree is then pruned to make a simpler model that still retains fidelity with the neural network it describes. Another method for model explanation highlighted in the work of Guidotti et al. [21] is the concept of a rule set. Essentially, the idea is that a black box model could be explained by generating a set of human-understandable rules that dictate a prediction for a given input.
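As a rough illustration of the global-surrogate idea behind tree-extraction approaches like Trepan, the following sketch fits a shallow decision tree to the predictions of an opaque model and prints the resulting rules along with their fidelity to the original model. It uses scikit-learn and synthetic data; it is not the algorithm described in [22], only the general concept.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and an opaque model standing in for a "black box" (here a forest).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Global surrogate: train a small, readable tree on the black box's own outputs,
# so the tree approximates the model's behavior rather than the raw labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

print(export_text(surrogate, feature_names=[f"f{i}" for i in range(4)]))
print("fidelity to the black box:", (surrogate.predict(X) == black_box.predict(X)).mean())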
The rule-set concept is applied to neural networks by M. Gethsiyal Augasta and T. Kathirvalavakumar [24], who describe a process where a trained neural network used for classification is reverse engineered to produce a set of simple rules that can be applied to examples from a data set to determine their class. The algorithm described evaluates the neurons in the trained neural network, prunes out the less important ones, and then derives the rule set based on the neurons identified as the most impactful in predicting the class of an example. Support Vector Machines (SVM) can also be approximated using this rule set idea. SVM is an algorithm used for classification; at a high level, it treats data points in a data set as vectors that are used to define a hyperplane in an N-dimensional space, where N is the number of features in the set. The algorithm looks to define the hyperplane that has the greatest margin of distance between data points of different classes. A good explanation of the math involved, as well as a working example, is given by R. Gandhi in an online tutorial titled "Support Vector Machine — Introduction to Machine Learning Algorithm" [25]. SVM-based systems have proven to be accurate classifiers, but the underlying math makes explaining the predictions in a human-comprehensible way largely impossible. This is not to say that the math underpinning SVM is not well known, but rather that providing numbers or scores as an explanation does not give comprehension into why the prediction was made. N. Barakat and A. P. Bradley [26] note this issue and present a review of some algorithms that can extract a rule set from SVM-based machines. An interesting specific example of extracting rule sets from an SVM is given by D. Martens et al. [27], where they note that SVM-based classifiers are used in fields such as medical diagnosis and credit risk evaluation. In both areas the ability to account for the predictions is important and, in some applications, a legal requirement. They also provide a summary of how de-compositional and pedagogical strategies are used to extract rule sets that approximate the results of SVM predictions with little loss in fidelity. Another example of complex models that can be explained via a rule set is tree ensembles. A tree ensemble is a learning strategy where a set of decision trees is constructed from a data set and used collectively to make predictions. Some examples of tree ensembles are random forests, boosted trees, and AdaBoost predictors built from weak tree-based learners. An interesting example of this is given by H. Deng [28]. Deng describes a framework called inTrees. The inTrees framework examines all the trees in the ensemble and extracts and prunes common rules from them. The frequency of these common rules is used to provide a set of rules that approximates the ensemble, with the added benefit of the rule set providing explanations for why any given prediction is made. 2.2.1.2 Explaining the Outcome: Local Explanations In Guidotti et al. [21] the idea of local explanations is presented; they describe a locally explainable model as one that has the following property: "Is able to explain the prediction of the black box in understandable terms for humans for a specific instance or record" [21]. An interesting example of this type of explanation is given in the work of Xu et al. [29]. In this work a system that automatically generates captions describing an image is given. The system writes one word per pass over the picture.
The algorithm provides an explanation for why each word in the caption is included by showing the portion of the picture that was used to generate the word. This example should more fully highlight the difference between a local and a global explanation. In the local version, as shown in the work of Xu et al. [29], the explanation does not give a rule or approximate model for how future pictures would be captioned. Instead, all it does is explain why a given picture was assigned a given caption. Generally, a local explanation is an explanation that provides justification for only one outcome from an AI system. The explanation does not make any attempt to explain how a future outcome will occur. Instead it simply justifies each outcome independently of other outcomes. Another example of local explanations related to image processing and self-driving cars is given in the work of Bojarski et al. in a paper titled "VisualBackProp: Visualizing CNNs for autonomous driving" [30]. This work describes a system called VisualBackProp. VisualBackProp is designed to be used in concert with convolutional neural network (CNN) based steering systems for self-driving cars. VisualBackProp runs in real time with the steering system and is used to validate the system. This validation comes in the form of identifying the set of pixels in an image used by the steering system to make choices about how to direct the car. These identified pixels serve as a local explanation for how the CNN is helping the steering system work [30]. As a final example of local explanations, the work of Dino Pedreschi et al. [31] is given. In this paper the authors create a method of accountability for data-mined classification systems that could discriminate against their users. In the case of this work, discrimination is thought of in terms of social/legal discrimination as you might find defined in laws such as the United States Anti-Discrimination Act. The algorithm and process described in [31] provide local explanations for classifiers based on multiple types of black boxes. This reference is unique among the other examples in that it is an explanation method that can be generalized to fit many types of black box systems. 2.3 Explainable Recommendation Systems Now that the reader is sufficiently introduced to the challenges related to explainable artificial intelligence (XAI) and some of the literature related to overcoming these challenges, we turn our attention to literature related to recommendation engines that are explainable. To start off this discussion, reference [32] is given. Reference [32] is an exhaustive survey of explainable recommendation systems written by Yongfeng Zhang et al. in 2018. For the purpose of this thesis, sections 2 and 3 are the most useful. The paper breaks down XAI into two subgroups: the methods of explanation and the models that generate them. 2.3.1 Explanation Methods Zhang et al. [32] break down the types of explanations into five categories: explanations based on relevant users/items, feature-based explanations, textual explanations, visual explanations, and social explanations. This section will review text-based and visual explanations, since they are the most relevant to this work. 2.3.1.1 Text-Based Explanations Within the context of recommendation engines, text-based explanations often involve an attempt to leverage text from reviews about the items being recommended.
In the cases presented here, some type of natural language processing was employed with the goal of determining a set of features or sentiments that could be used to explain why the recommendation was given. A paper written in 2014 by Y. Zhang et al. [33] gives a good example of sentiment analysis. In this paper the authors describe a framework called Sentires. The Sentires framework takes reviews of a certain type of product and generates triplets in the form of aspect-opinion-sentiment. These triplets are used as an explanation for an individual recommendation. The key idea with sentiment analysis is to determine whether a user feels negatively or positively about a product. One of the challenges with doing this is a lack of labeled training data for recommendation systems to learn from. This problem is highlighted in an article written by X. Guan et al. [34], who introduce a method for deriving a larger labeled set of training data for algorithms to train on. They use a semi-supervised model to add sentiment labels to unlabeled data [34]. Instead of using the sentiments of a review to produce triplets or other ordered pairs, a more human-friendly approach can be taken by shaping explanations as sentences that express to the user why the recommendation was given. An example of this is given in the work of Y. Zhang et al. [35]. Here the authors note that the use of Latent Factor Models (LFM) has become more popular in recommendation systems because of their high prediction accuracy. However, the nature of the latent factors makes explanation of the recommendation difficult. To solve this issue, Y. Zhang et al. propose a new model referred to as the "Explicit Factor Model (EFM)" [35]. The main idea is that reviews of a product or type of product can be used to identify features of a product as well as the sentiment users have about them. The EFM model learns about these feature-sentiment pairs and makes recommendations to users based on features that a product performs well on that are also attractive to the user. This method of recommendation lends itself well to a sentence-style explanation. Interestingly, the system described in [35] can also produce anti-recommendations by notifying a user if a product performs poorly on a feature that is desired by the user. These explanations are given to the user in the form of a templated sentence. An example of a template is a sentence such as "You might be interested in [feature], on which this product performs well." [35]
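To illustrate what a templated, feature-sentiment explanation of this kind might look like in practice, here is a minimal sketch that fills a sentence template from per-feature sentiment scores. The scores, threshold, and wording are invented for illustration; this is not the EFM implementation from [35].

# Illustrative sketch: template-based explanations from feature-sentiment scores.
# The scores (invented here) might come from mining product reviews; positive
# values mean reviewers speak well of the feature, negative values mean they do not.
feature_scores = {"battery life": 0.8, "screen": 0.6, "speakers": -0.4}
user_interests = ["battery life", "speakers"]

def explain(feature_scores, user_interests, threshold=0.5):
    sentences = []
    for feature in user_interests:
        score = feature_scores.get(feature, 0.0)
        if score >= threshold:
            sentences.append(f"You might be interested in {feature}, on which this product performs well.")
        elif score < 0:
            # An "anti-recommendation" style note for features the product is weak on.
            sentences.append(f"Note that this product performs poorly on {feature}.")
    return sentences

for sentence in explain(feature_scores, user_interests):
    print(sentence)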
Explanations like the ones in the work of Y. Zhang et al. [35] are useful, but the templated sentence structure can feel mechanical. Consider getting a recommendation from a friend or co-worker. This type of recommendation is known colloquially as a word-of-mouth recommendation. These types of recommendations can be very convincing since they are often paired with a personalized explanation. Reference [36] shows an example of a recommendation system that tried to approximate word-of-mouth recommendations by generating more organic-feeling explanations. The paper describes a process that combines natural language processing and human input to create natural explanations for movies recommended from the MovieLens data set. Human workers are tasked with writing the explanations; the obvious issue is that no one person will have seen all the movies in the data set, and even with many workers there will still be many movies that are not seen by all of the workers. To solve this problem, a natural language processing algorithm was developed to extract quotes from user reviews and movie descriptions. These quotes were given to human workers along with tags that described the movie. The workers were then tasked with creating an explanation for why the movie should be viewed [36]. 2.3.1.2 Visual Explanations The concept of a visual explanation is that when a product is recommended, a justification is provided in a visual format. This can take the form of a picture, or a picture can be used to improve a text-based explanation. A paper written by Y. Wu and M. Ester gives an example of this principle in action. [37] describes an algorithm called Factorized Latent ModEL (FLAME) that attempts to "solve the problem of personalized latent aspect rating analysis" [37]. The FLAME model leverages users' reviews of a product to discover features of the product that users have commented on and the opinions the users have about them. The FLAME model uses these feature-opinion sets to make recommendations to users based on their previous history of choices related to the product and their similarity to other users. The thing that makes the FLAME model different from the previous examples is that it presents the explanation for the recommendation as a word cloud. The word clouds generated aim to give the user a sense of what features are most prominent for the recommended product [37]. Visual aids may also be used to help augment text-based explanations, as shown in the work of Y. Lin et al. [38]. In this paper Lin et al. tackle the problem of recommending clothing pairs via a recommendation engine. The authors propose a learning algorithm called neural outfit recommendation (NOR). The goal of NOR is to recommend a top and bottom to a user and provide an explanation for the recommendation. NOR does this by leveraging online communities that review clothing via images of clothes paired with text-based reviews. The explanations given by NOR are images of clothes paired with a sentence explaining the recommendation. 2.3.2 Explainable Models To conclude this chapter, a few examples of the models used in explainable recommendation engines will be given. One of the ways this is done is by using graph-based models for generating explanations. An article written in 2015 by X. He et al. [39] gives a working example of this. In [39] the authors propose a system that utilizes a tripartite graph to rank aspect-opinion data about items based on users' reviews. This graph is used to create the explanations that accompany recommendations. This approach was devised to try to address the shortcomings, where explanation is concerned, of the matrix factorization (MF) models that are often used in collaborative filtering. In the 2018 work of Xiang Wang et al. [40] we see an example of decision trees being used to generate recommendation explanations. [40] introduces an explanation model called the tree-enhanced embedding method (TEM). The premise of the TEM model is that there is a lot of ancillary data that is not used by collaborative filtering algorithms. Via the use of Gradient Boosted Decision Trees combined with Matrix Factorization, the TEM model produces recommendations as well as human-comprehensible explanations for them. This is done by TEM using a decision tree that is generated alongside the recommendation and used to explain it. Another approach to making matrix factorization-based recommendations explainable is given in the work of N. Wang et al. [41].
Here the authors present a multi-task learning process and describe it as follows: "We develop a multi-task learning solution for explainable recommendation. Two companion learning tasks of user preference modeling for recommendation and opinionated content modeling for explanation are integrated via a joint tensor factorization. As a result, the algorithm predicts not only a user's preference over a list of items, i.e., recommendation, but also how the user would appreciate a particular item at the feature level, i.e., opinionated textual explanation" [41]. The resulting model shown in [41] was tested using data from both Amazon and Yelp and was found, experimentally, to be effective at generating recommendations as well as legitimate explanations for them. Seo et al. give an interesting example of how deep learning can be applied to explaining recommendations [42]. The authors propose a system that uses convolutional neural networks (CNN) to both make the recommendation and generate the explanation. The CNN is used in [42] to analyze user review data and target important words, and these words are then used to construct the explanation. 2.4 Summary In this chapter I have given an overview of important literature regarding explainable artificial intelligence and its application to recommendation engines. It is clear from these works that there is a lot of interest in explainable recommendations. In the introductions to almost all of the papers referenced in this chapter, the authors assume that explainable recommendations are valuable because they increase a user's trust in the recommendation system. In the rest of this paper I will present research that attempts to provide some verification for this claim. Some of the key ideas from the research in this chapter, as it relates to my experiment, are as follows. My experiment was heavily impacted by the descriptions of text-based explainable recommendation systems shown in [33-35]. The concept of using the sentiment found in reviews to generate explanations made a lot of sense to me and was the inspiration for the simple and technical explanations in the experiment. Another important paper from this section was [37]. The concept of using a word cloud as a means of explaining a recommendation was especially useful to my experiment. I wanted to incorporate some form of visual explanation in my experiment, and [37] gave a great example of what a visual explanation might look like. Chapter 3. Research Questions The rest of the document will detail research that provides some insight into how explainable recommendations affect the experience a user has with a recommendation system, in the context of trust in the system and understanding of the system. 3.1 Questions Most of the references given in the related work chapter have a common assumption. The assumption is that if a recommendation system provides an explanation, the system will be more trusted by a user. The research in this paper formally investigates this via the first research question and associated hypothesis given below. Question 1: How is a recommendation system user's trust changed when a recommendation is paired with an explanation? Hypothesis: I believe that, in general, a user will have more trust in a recommendation that can be explained to them. Question 2: Can a recommendation system's users understand the method of recommendation with only the recommendation as reference?
Hypothesis: I believe that, in general, if a recommendation system can be reduced to a human-comprehensible explanation, a user will be able to understand more about the system than they did before receiving the explanation. These research questions and hypotheses may appear to the reader to be trivial. However, the answer to these questions is important because it can help guide and direct how explainable recommendations are constructed. To test these hypotheses, I conducted an experiment involving exposing a set of users to an explainable recommendation system and gauging the effects of the explanations via a series of surveys. The remainder of the paper is organized as follows. The next chapter, Experiment and Methodology, will describe the experiment as well as the way it was administered. The chapter after that, Milestones, will outline how each of the milestones in the thesis proposal was completed. The Results chapter then provides a discussion of the results of the experiment. The last two chapters will detail future work and give a summary of the paper. Chapter 4. Experiment and Methodology As stated in the previous chapter, the purpose of this research is to investigate how explanations in recommendation engines affect a user's trust in and comprehension of a recommendation system. To do this I designed an experiment that exposes participants to a series of recommendations where each recommendation is paired with a collection of explanations. After viewing each recommendation, a participant is asked to respond to a survey. The purpose of the survey is to gather information about how trust and comprehension are developing over time as the participant progresses through the experiment. After viewing all the recommendations, the participants are asked to respond to a final survey that asks some summary questions. In this chapter, the experiment will be described and the method of delivery will be discussed. Note that details such as how participants were recruited will not be covered in this chapter; instead, these details will be covered in the Milestones chapter. 4.1 Experiment Administration Before describing the recommendation engine, this section will give an overview of how the experiment was administered to each participant. The experiment was delivered via a website to each participant. This delivery method was not the originally intended method for conducting the experiment. In the early stages of the research the plan was to recruit university students at random in person and have them participate in the experiment. The experiment was originally going to be given to batches of participants. However, the experiment was conducted during the COVID-19 pandemic, during which Weber State University had closed its campus. This necessitated the change to an online format so that participants could take part without risking exposure to large groups. The experiment is broken up into three distinct parts. These parts are the tutorial stage, the recommendation stage, and the final survey. The following subsections describe the different stages of the experiment. 4.1.1 The Tutorial Stage The purpose of the tutorial section is twofold. The first goal is to make sure that each participant is familiar with the mechanics of the experiment. As noted above, each participant engaged with the experiment remotely. The tutorial was designed with the goal of answering any questions a participant might have about how the experiment was to be taken.
Each participant was given a phone number to call in the event they encountered any issues. The tutorial was delivered as a video that each participant was invited to watch. The video presented a live walk-through of the pages in the tutorial section as well as commentary on the purpose of the experiment. During the tutorial, the participant is told that the premise of the experiment is that they are helping to validate the performance of a recommendation engine designed to recommend cookie recipes. To help with this task, participants were introduced to a current user of the recommendation software named Steve. Steve is introduced via his user profile. The profile shows a list of Steve's favorite types of cookies, a list of his favorite ingredients, and a final list showing the names of Steve's favorite recipes that he has found through the recommendation software. Participants are told that they are supposed to imagine that they are Steve while they are viewing the recommendations and explanations. The purpose of this is to remove the need for each participant to have their own profile. The main feature of the tutorial is showing the participant how they are supposed to view recommendations as well as how to take the survey that accompanies each recommendation. During this part of the tutorial a demo recommendation is shown and each button on the screen is explained, with emphasis given to the buttons that show the explanations. The survey is explained question by question, with an explanation given for what individual questions mean as well as how the responses are to be given. The full text of each question is given in Appendix 1. The next part of the tutorial involves getting participant consent to be a part of the experiment. Normally, participants would be given this consent form by hand and would only be allowed to participate if they provided their signature. Since this experiment was administered online, participants were presented with the text of the consent form. To exit the consent form page participants had to click a continue button. Participants were notified that by clicking continue they were giving their consent to participate in the experiment. The last part of the tutorial involved collecting demographic data about each participant. Each participant was asked to provide their sex, age, and status as a technical or non-technical person. The definition of a technical person was given as any person who meets one or more of the criteria listed below; I decided these criteria. Each participant was asked to self-select whether they met the definition of a technical or non-technical person. • Earned a degree in a STEM field • Currently enrolled in a university-level computer science program • Currently employed in a STEM-related job. Each step of the tutorial was shown in the recorded video, but participants were required to visit each page of the tutorial section to make sure that they understood the content of the video. 4.1.2 The Recommendation Stage The next stage constitutes the actual experiment. During this stage users viewed 25 different recommendations. Each recommendation has three explanations given as justifications for the recommendation. An example of how a recommendation appears is given in figure 2 below. Participants were encouraged to view all the explanations for each recommendation, but the software did not require it.
After each recommendation participants were required to complete a six-question survey. Details of the survey are given later in the chapter.

4.1.3 Final Survey

The last step of the experiment involved asking participants to give some feedback on the recommendation software. This feedback was collected in the form of four free response questions. The questions are given in Appendix 2. The purpose of this last survey was to gather information about each participant's perspective on key elements of the experiment. The hope was that this feedback could give some insight and context to the results of the experiment.

Figure 2. An Example of an Experimental Recommendation. This is the layout of an experimental recommendation. Each recommendation displays a picture of the recommended cookie. Metadata about the cookie is given. Users can click on buttons below the nutrition facts section to see directions for the recipe and ingredients (not shown in the figure). Explanations are viewed by clicking the buttons on the right-hand side of the screen.

4.2 The Recommendation System

The recommendation system used in this experiment provides recommendations about cookie recipes to its users. The recipe data was extracted from an online recipe repository (details of this are given in the Milestone 1 section of the next chapter). For each recommendation three types of explanations are provided to the user. These explanations are categorized as simple, technical, and visual. These explanation types are described in the following subsections.

4.2.1 Simple Explanations

A simple explanation represents a basic type of explanation for a recommendation that a participant should be familiar with from using real recommendation engines. The explanation is one to two sentences long. The goal of the explanation is to show the user how the recommendation ties back to Steve's profile. Figures 3 and 4 below give examples of simple explanations that were used in the experiment.

Figure 3. Example of a Simple Explanation. This single sentence provides an explanation for a recipe called "Carmel Filled Chocolate Cookies". Steve's profile notes that he liked chocolate as a recipe and that he likes cookies that contain chocolate. In the case of this explanation the participant must recall Steve's profile to make the connection.

Figure 4. Another Example of a Simple Explanation. This explanation is provided for a recipe called "Delicious Raspberry Oatmeal Cookie Bars". As with Figure 3, the explanation makes a simple connection to Steve's profile. This explanation goes a step further and reminds the participant that the recipe is like one of Steve's favorite cookies.

The intent of this explanation type is to provide an explanation that is more understandable and substantive than something like a star rating that you might encounter when looking at suggested products on Amazon. The other goal of the simple explanation is to be basic enough that anyone would be able to understand it and connect it to Steve's profile. This type of explanation was inspired by some of the papers I included in the related works section, specifically references [34-35].

4.2.2 Technical Explanations

These explanations build off the simple explanations by attempting to give some insight into how the recommendation engine may have generated the explanation. These explanations are still text based. Figures 5 and 6 below provide examples of these explanations.

Figure 5.
Example of a Technical Explanation. This is the technical explanation given for the recipe "Carmel Filled Chocolate Cookies".

Figure 6. Another Example of a Technical Explanation. This is the technical explanation given for the recipe "Delicious Raspberry Oatmeal Cookie Bars".

The purpose of the technical explanation is to help the user understand how the recommendation may have been generated by exposing the internal mechanism of the software. In other words, these explanations try to open the black box of the recommendation system. The motivation for this explanation type stems from research around feature sentiment-based explanations such as those described in [33-35]. The other motivating factor is to provide the participants with some signals about the different types of data and analysis strategies that could have been used to generate the explanation. For instance, the mention of similar users hints at the use of some type of collaborative filtering method. The last sentence of each explanation gives a clue that some type of natural language processing might have been used as well.

4.2.3 Visual Explanations

The last type of explanation is the visual explanation. For this explanation type I took inspiration from reference [37] and used word clouds as visual explanations. Figure 7 gives an example of a word cloud used in the experiment. There are some drawbacks to this approach. For instance, the word clouds do not do a great job of associating a sentiment with the features. For example, the word cloud shown in Figure 7 gives the most emphasis to the words chewy, cookies, butter, and peanut when read from the top to the bottom of the image. Each of these words on its own could hold positive or negative sentiment for a participant, and it is not clear what the sentiment of the reviews that generated the word cloud is toward the emphasized features.

Figure 7. Example of a Visual Explanation. The word cloud shown here is meant to be an explanation for the recipe called "Chef John's Peanut Butter Cookies".

The main idea of the word clouds was to highlight the most common words that appeared in reviews by other users. The hope was that these common words would line up to some degree with key words in Steve's profile.

4.2.4 False Explanations

One of the key features of the experimental recommendation system was that 5 of the 25 recommendations provide false explanations. At a high level, the purpose of these false explanations is to test participants to see if being presented with false explanations for an otherwise correct recommendation had an impact on their trust and comprehension. For a recommendation that has false explanations, each of the three explanation types is an explanation written for a different recipe. Figures 8-10 show a set of false explanations.

Figure 8. Example of a False Simple Explanation. This simple explanation was given for a recipe called Beth's Spicy Oatmeal Raisin Cookies.

Figure 9. Example of a False Technical Explanation. This technical explanation was given for a recipe called Beth's Spicy Oatmeal Raisin Cookies.

Figure 10. Example of a False Visual Explanation. This visual explanation was given for a recipe called Beth's Spicy Oatmeal Raisin Cookies.

All of the false explanations shown in Figures 8-10 were inspired by a pulled pork recipe. The goal of these false explanations is to give the impression that the software was experiencing a bug. All of the recommended recipes are cookie recipes that Steve would be reasonably interested in.
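To make the structure described in sections 4.2.1 through 4.2.4 concrete, the following is a minimal, hypothetical Python sketch of how one recommendation entry and its three explanation types could be represented, along with a false entry whose explanations belong to a different recipe. This is only an illustration, not the code used in the experiment; the explanation strings and image paths are placeholders rather than the actual wording shown to participants.

# Hypothetical sketch (not the experiment's code) of one recommendation entry
# and its three explanation types. The second entry models a false explanation
# set: its explanations were written for a different (pulled pork) recipe.
RECOMMENDATIONS = [
    {
        "recipe": "Carmel Filled Chocolate Cookies",
        "explanations": {
            "simple": "Recommended because it contains chocolate, one of Steve's favorite ingredients.",
            "technical": "Users with profiles similar to Steve's rated this recipe highly.",
            "visual": "word_clouds/carmel_filled_chocolate.png",  # placeholder path to a word-cloud image
        },
        "false_explanations": False,
    },
    {
        "recipe": "Beth's Spicy Oatmeal Raisin Cookies",
        "explanations": {
            "simple": "Recommended because you enjoy slow-cooked pork dishes.",
            "technical": "Users who rated pulled pork recipes highly also rated this recipe highly.",
            "visual": "word_clouds/pulled_pork.png",
        },
        "false_explanations": True,
    },
]

def get_explanation(index, kind):
    """Return the 'simple', 'technical', or 'visual' explanation for a recommendation."""
    return RECOMMENDATIONS[index]["explanations"][kind]

print(get_explanation(0, "simple"))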
4.2.5 How the Experimental Recommendation System Works

Now that the functionality of the recommendation system has been described, it is important to cover how it works. In the early stages of the research the plan was to write a recommendation system utilizing existing libraries and frameworks, with the necessary additions to make the recommendations explainable. This plan changed over the course of the process, and the final system that was presented to the users is best described as an artificial recommendation system. This means that there is no machine learning algorithm that procedurally generates the recommendations and the explanations. Instead, each recommendation and the accompanying explanations are hard coded. This was so that the system used in the experiment would not introduce any additional variables to the experiment.

4.2.5.1 Experimental Consistency

The hard-coded set of recommendations allows the experiment to focus on the research questions without concern over recommendation variance or the accuracy of programmatically generated recommendations. If the experimental recommendation system generated a new set of recommendations for each participant, there would be no way to guarantee that each participant had the same experience. This variance would make the analysis of the results more difficult.

4.2.5.2 The Challenge of Creating False Explanations

The goal of any recommendation system is to produce as many accurate recommendations as possible. Vast amounts of research and development have been focused on developing machine learning models that produce highly accurate recommendations. As is shown in the related works chapter, this work has been extended to making these models and systems explainable. However, nowhere in the research that I was able to review is there any work dedicated to making a system that deliberately produces inaccurate recommendations or explanations. So far as I can tell, the only time a false explanation would be produced is in the case of a malfunction in an otherwise accurate system. The challenge in generating false explanations is that AI systems are optimized to produce correct outcomes as often as possible. Wrong answers are used in the training process of an AI system and are then discarded as the model evolves. Tracking these wrong answers would add new elements and behaviors to existing learning algorithms, and these new parameters would also change the way the explainable AI systems generate explanations. Solutions to these problems are not present in any research that I encountered.

4.2.5.3 Focus on the Research Questions

As mentioned previously, the aim of this research is to investigate how explanations affect a user's trust in, and comprehension of, a recommendation system. The method of implementation of the recommendation system is not an important factor since the research questions are universally applicable to any kind of recommendation system.

4.3 Experiment Design

4.3.1 The Experimental Premise

The premise of the experiment given to the participant is that they are being asked to help validate a recommendation engine that gives recommendations about cookie recipes. This premise is given during the tutorial phase of the experiment. The tutorial phase makes no attempt to explain how the software creates recommendations and does not give the participant criteria upon which they are to validate the system.
During the tutorial phase the participants are directed to view the explanations, but they are not explicitly told that they are supposed to be evaluating the software based on these explanations. The goal of presenting this scenario to participants is to encourage them to pay attention while at the same time masking the purpose of the experiment. Since the experiment is aimed at gauging trust, I presented the experiment in a way that provided as little bias as possible where the software is concerned. I hoped that asking the participants to validate the system would give them some feeling of obligation to view at least some of the explanations. At the same time, there is no explicit requirement to view them, and it is possible to get past a recommendation without clicking on any of the explanations. The open nature of the experiment was meant to give each participant the opportunity for their trust in and understanding of the system to develop as organically as possible.

4.3.2 The Purpose of the False Explanations

One of the assumptions of the experiment is that each participant has some level of trust in, and understanding of, recommendation engines. Given this assumption, I decided that some number of false explanations should be given to provide a way to better understand how explanations were changing participants' trust and understanding. The design of this part of the experiment went through several iterations.

The other purpose of the false explanations is to provide a check on whether the participants are paying attention to the explanations. The false explanations are obviously wrong, so if, when analyzing the results, none of the participants indicated that they had seen some questionable data, then I could conclude that they had not read the explanations.

4.3.3 Answering the Research Questions

Recall that the first research question is the following: How is a recommendation system user's trust changed when a recommendation is paired with an explanation? The purpose of this question is to investigate how trust changes when a recommendation is justified with an explanation. The way my experiment approaches testing this is by attempting to establish trust in the recommendations and then challenging that trust with a false explanation. To do this, the experimental system presents the first 5 recipe recommendations with true explanations. The hope is that the participant will develop some level of trust while viewing these recommendations. Then, on the 6th recipe recommendation, the participant is given a set of false explanations. After this point the remaining 4 false explanation sets are given at what appear to the participants to be random intervals.

The second research question is the following: Can a recommendation system's users understand the method for recommendation with only the recommendation as reference? For this question I expected that the main source of increased understanding would come from the technical explanations. The technical explanations were designed to provide some clues about how the recommendation engine might have produced the recommendation. My purpose in doing this was to give clues to the participants over time so that by the end of the experiment they may have developed an intuition about how the system works.

4.3.3.1 Measuring Trust and Understanding

In order to measure trust and comprehension, participants were presented with survey questions. After each recommendation, a participant was required to give an answer to a six-question survey.
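Before turning to the survey itself, the sketch below recaps the ordering described in section 4.3.3 as a fixed presentation schedule: true explanations for the first five recommendations and false explanation sets at later positions. The specific false positions used here (6, 8, 11, 16, and 23) are the ones reported in the Results chapter; the code is a hypothetical illustration, not the experiment's implementation.

# Hypothetical sketch of the fixed presentation schedule: 25 recommendations,
# true explanations for the first five, and false explanation sets at the
# positions reported in chapter 6 (6, 8, 11, 16, and 23).
TOTAL_RECOMMENDATIONS = 25
FALSE_POSITIONS = {6, 8, 11, 16, 23}  # 1-based positions with false explanation sets

schedule = [
    {"position": i, "explanation_set": "false" if i in FALSE_POSITIONS else "true"}
    for i in range(1, TOTAL_RECOMMENDATIONS + 1)
]

# The first five recommendations are guaranteed to carry true explanations.
assert all(step["explanation_set"] == "true" for step in schedule[:5])
print(sum(step["explanation_set"] == "false" for step in schedule))  # 5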
The survey questions are broken up into three categories; the first two questions are given below in Figures 11 and 12.

Figure 11. Survey Question 1

Figure 12. Survey Question 2

The general idea with these questions is that, over time, the different values reported after each completed survey can be interpreted as a trend for how the participant's trust changes. The first question places emphasis on the most recent recommendation the participant saw. With this question I was trying to capture a reaction from the participant about how each recipe recommendation and its explanation modified their trust in the context of a single recommendation. The second question was designed to capture the overall sense of trust that the participant had in the recommendation system as the experiment progressed. The next set of questions deal with understanding. These questions are given in Figures 13 and 14.

Figure 13. Survey Question 3

Figure 14. Survey Question 4

These questions have the same goals as the first two questions but are instead meant to track how a participant's understanding changes over time. The last two questions revisit trust and understanding by asking the participant which explanation method had the most impact on their understanding and trust. These questions are shown in Figures 15 and 16.

Figure 15. Survey Question 5

Figure 16. Survey Question 6

The purpose of these two questions is to get a sense of how participants are responding to the different types of explanations.

4.3.3.2 Non-Survey Data Capture

In addition to the survey, I collected other data about the behavior of the participants as they interacted with the recipe recommendations. Each time a participant clicked on any button that displayed an explanation it was tracked. Likewise, I tracked the number of times the participants clicked on the buttons for the ingredients and directions for the recipe. This means that after each recommendation the participant viewed, I recorded how many times the participant clicked on each of the important buttons on the page. The purpose of capturing this data was twofold. The first reason was that I felt that knowing how often participants looked at each type of explanation would provide interesting context for the way the survey questions were answered. The other purpose was to make sure that participants were viewing the explanations.

4.3.3.3 The Final Survey

At the end of the experiment each participant was asked to provide feedback about the experiment by answering four free response questions. The questions are given in Appendix 2. The purpose of this final survey was to gather impressions that participants had about the experiment in their own words. This type of data is obviously anecdotal; however, I felt that it might provide some context or insight into the empirical data that was also being collected.

Chapter 5. Milestones

In this chapter I will review the milestones that I set up in my thesis proposal. I will also explain how each of these milestones was completed during my research.

5.1 Milestone 1: Gathering and Preparing Data

As stated above, the data source for my experimental recommendation system is allrecipes.com. I was provided a data set of over 100,000 recipes, site users, and recipe reviews. The data was provided to me by my advisor Dr. Robert Ball, who had collected it via web scraping.
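The visual explanations described in section 4.2.3 were built from review text like that gathered in this milestone. The sketch below shows, under assumed data (the review strings and stop-word list are placeholders, not the allrecipes.com data), how the word frequencies behind such a word cloud could be extracted; an off-the-shelf word-cloud library could then render these frequencies as an image.

# Hypothetical sketch of reducing review text to the word frequencies that
# would drive a word-cloud visual explanation (section 4.2.3).
import re
from collections import Counter

reviews = [
    "These chewy peanut butter cookies are amazing",
    "Best peanut butter cookies I have made, perfectly chewy",
]

STOP_WORDS = {"the", "i", "are", "have", "these", "a", "is", "and"}

words = [
    w
    for review in reviews
    for w in re.findall(r"[a-z']+", review.lower())
    if w not in STOP_WORDS
]
top_words = Counter(words).most_common(10)
print(top_words)  # e.g. [('peanut', 2), ('butter', 2), ('cookies', 2), ('chewy', 2), ...]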
5.2 Milestone 2: Database Creation

In the early stages of the research, prior to the decision to create a contrived recommendation system as described in section 4.2.5, a database was to be used to hold the recommendations and explanations. In the final version of the experiment the database instead acts as a means of storing the answers to surveys and tracking clicks. The database used is MySQL. The reasons for using this flavor of SQL were twofold. First, MySQL is the default database system used by phpMyAdmin. This tool was used in the development and deployment of the experimental website, so MySQL was a natural choice. The second reason for using MySQL was that it provided an effective method for parameterizing insert statements via support for PDO (PHP Data Objects). The experimental website was written entirely in PHP, and because of this, MySQL again presented itself as a natural choice. Despite the benefits of MySQL, implementing and using the database proved to be one of the larger challenges I encountered during the process of developing the experimental website. My professional experience as a programmer has been largely focused on supporting legacy code written in C and FORTRAN. Before conducting this research, I had no practical experience with coding a database-supported web application. Fortunately, I had also worked as a database technician for a company early in my career, so I did not have to learn SQL on top of learning how to create and connect a database to a web-based application.

5.3 Milestone 3: Create a Recommendation System

5.3.1 The Choice to Contrive

Before making the choice to contrive the recommendation system, I had planned to write my own recommendation system to support the needs of the experiment. The first step I took to do this was to get my feet wet by writing a few simple recommendation systems as part of reading the book titled "Hands-On Recommendation Systems with Python". This book served as my first introduction to how recommendation systems work from a theoretical and practical perspective. This exposure to recommendation systems was very instructive, but it was also far too simplistic to support the experiment I wanted to conduct. My initial hope was that I could create a content-based recommendation system. I hoped this for a few reasons, the biggest of these being that content-based recommendation systems are simple in terms of their implementations. There is a wide selection of tutorials and libraries in Python that would have made the development of such a system straightforward. The challenge with this approach is that to get accurate recommendations these types of recommendation engines generally require the use of AI models that are not inherently explainable. It was shown in chapter 2 that this issue can be overcome by generating a second model that approximates the one used by the recommendation engine to provide the explanations. But this task is significantly more complicated than the task of writing the recommendation engine itself. This strategy was ultimately abandoned, and for a time I considered creating a hybrid system by combining content-based recommendation strategies with collaborative filters. The consideration of this new approach was motivated by my research: in every paper I read related to recommendation engines the software was a hybrid recommender; indeed, this method is the industry standard.
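As a rough illustration of the kind of simple content-based prototype described above, the following is a minimal sketch using scikit-learn, which is assumed to be available; the recipe texts and profile keywords are placeholders rather than data from the actual experiment.

# Minimal content-based sketch: score recipes against profile keywords with
# TF-IDF vectors and cosine similarity. Illustrative data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

recipes = {
    "Carmel Filled Chocolate Cookies": "chocolate cookie caramel butter sugar",
    "Delicious Raspberry Oatmeal Cookie Bars": "oatmeal raspberry jam butter brown sugar",
    "Chef John's Peanut Butter Cookies": "peanut butter cookie sugar egg",
}
profile_text = "chocolate peanut butter cookies"  # keywords drawn from a profile like Steve's

names = list(recipes)
matrix = TfidfVectorizer().fit_transform(list(recipes.values()) + [profile_text])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Rank recipes by similarity to the profile keywords.
for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {name}")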
Because of the ubiquity of the hybrid approach there are a lot of resources online to help guide the creation of these types of recommendation systems. But the same challenges that were present with the content-based system persisted with this approach.

After much discussion with Dr. Ball, we concluded that creating a system from scratch was not the best approach given my research questions. The rationale for this was that my research questions deal specifically with investigating trust and comprehension of an explainable recommendation system. The implementation of the system has no bearing on the research questions. If the participants of the experiment believed that they were working with a real recommendation engine, then it did not matter how it worked in reality. Another issue with creating a traditional recommendation system was the need for false explanations. Adding this requirement introduces additional technical difficulties that seemed to have a small return on the time investment that would be necessary to develop them. For these reasons I decided that the recommendation engine used in the experiment should be contrived. After this choice was made, I put a lot of focus on reading about how actual explainable recommendation engines function. The goal of this research was to get a firm understanding of what forms explanations take in working examples so that I could approximate them in the experiment. The ultimate benefit of the contrived system approach was that it allowed me to tailor the recommendations and explanations so that they supported my research questions.

5.3.2 Creating the Experimental Website

As I alluded to in section 5.2, prior to my involvement in this research I had no experience with web development outside of an undergraduate class I took in 2016. This meant that I had to learn how to develop and deploy a website so that my experiment could be conducted.

I was fortunate to have Joshua Jensen on my thesis committee. Joshua has a wealth of experience with web development as a professional and an educator. His guidance was invaluable and helped with my learning. Even with expert help, the creation of the website was perhaps the largest technical challenge of the entire project for me. I cycled through several different approaches before I settled on PHP. Initially I tried looking for some type of framework that could create boilerplate code. My hope was that this approach would help compensate for my lack of expertise. I investigated several Java and Python based frameworks as well as the Dreamweaver product from Adobe. The strongest candidate of these for me was Dreamweaver; this product offered a visual coding experience where I could drag and drop web elements and the supporting code would be generated by Dreamweaver. The main drawback was that this generated code was very convoluted and did not provide a good environment for creating the custom code that I needed to support the collection of data. For this reason, I ultimately abandoned the idea of using a framework.

The final solution was to use PHP. I did not know PHP, but after talking with Josh and doing my own research I chose to use it for the following reasons. The first of these reasons was phpMyAdmin. This tool provided an easy-to-use interface between the code of the website and the database. Another benefit was that I could treat my development environment as if it was also my production environment.
This meant that once I was happy with the site, I could transfer it directly to live server by simply passing the server my phpMyAdmin configurations. 68 The second reason that I used PHP was that of all the languages I investigated for website creation, this one seemed the most like languages I was familiar with such as PERL. While there was still a learning curve the intra-language familiarity was quite beneficial. Another motivating factor was that Josh has a significant amount of expertise with PHP and so it seemed prudent to lean on that strength. The site, of course, required some JavaScript, but this was thankfully minimal and not overly complex to implement. In terms of creating the visual part of the website I used the bootstrap CSS styles extensively. Bootstrap was especially useful because it helped provide a feeling of continuity to the user experience for the website. I used it hoping that it would minimize the impact my lack of web development skills such that it did not detract from the experience of the participants. 5.4 Milestones 4 and 5: IRB Application, Participant Recruitment, and Execution of Experiment As a necessity of conducting experiments with human participants I was required to complete the IRB training and then apply for IRB approval. I applied for IRB exemption under the premise that my experiment did not pose any significant risk of harm to my participants. The Weber State IRB board agreed with my assessment and granted the exemption I requested in my application. Once I received my IRB approval, I began the task of recruiting participants. My goal was to get 50 participants split evenly by sex. These two groups would also be split evenly between technical and non-technical persons. The main challenge to this was that I conducted my experiment during the COVID-19 pandemic in 2020. I turned to social media to look for participants. I was quite fortunate to have several extended family 69 members who were themselves connected to large networks of people who were willing to participate in my research. Since I anticipated difficulty in recruitment, I compensated my participants by providing each person who successfully completed the experiment with a $10.00 Amazon gift card. The cards were purchased by Weber State University and their use was noted in my IRB application. I had initially projected that my experiment would take a month or more to complete with the limiting factor being the rate at which I could recruit participants. To my great surprise, I was able to recruit my target participant distribution in less than a week. In fact, these efforts were so successful that I expanded my target number from 50 to 60. I did this because more data is always better, and I hoped that I would be able to achieve statistical significance in my results since my experiment was designed as a between-subject study. I formally began my experiment August 1, 2020. The method of execution was that each person who expressed interest in participating was asked to provide me their email address. I then sent them an email with a link to my experimental website. Participants were able to prove their completion by sending me an email containing a code that was uniquely generated at the end of the experiment. After this point, participants were provided the promised gift card after completing some paperwork confirming that they would receive the payment from Weber State. 
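The uniquely generated completion code mentioned above is simple to reproduce. As an illustration only (the experimental site itself was written in PHP, and the exact format of its codes is not specified here), a code of this kind could be generated along the following lines.

# Illustrative only: one way to generate a unique completion code for a participant.
# The real experiment generated its codes in the PHP site; this is not that code.
import secrets

def completion_code(participant_id: int) -> str:
    # Pair the participant id with a short random token so codes are easy to verify.
    return f"{participant_id}-{secrets.token_hex(4)}"

print(completion_code(17))  # e.g. "17-9f3a2c1b"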
5.5 Milestone 6: Analyze Results and Thesis Document Creation

The analysis of the data began almost immediately after the experiment concluded. The primary tool for the analysis was Python. The choice to use Python was driven by its extensive statistical libraries and its easy-to-use tools for creating graphs and figures. I used two main statistical methods for determining if the results had significance. These methods were ANOVA and Chi-squared tests. The ANOVA analysis was done with the data collected from questions 1-4 of the survey, and the Chi-squared analysis was done with the last two survey questions. The results of the analysis are given in the next chapter.

Chapter 6. Results

In this chapter I will give a discussion of the results of my experiment. First, I give a discussion of the demographics of the participants and expose any potential bias that I am aware of that is relevant to the results.

6.1 Bias in the Data

6.1.1 Participant Demographics

For my analysis I used a pool of 60 participants. As I said in chapter 5, I wanted to get an even division between male and female as well as between technical and non-technical people. A table summarizing the population of the participants is given below. It is important here to explain how I was able to get such evenly balanced groups of participants. When I closed the experiment, 71 people had participated. I elected to randomly remove participants until I had the splits shown in Table 1. I decided to take a random subset so that I did not need to account for disparities in group size when doing my analysis.

Table 1. Experiment Demographics
Gender    Total    Technical    Non-Technical
Male      30       15           15
Female    30       15           15

6.1.2 Degrees of Separation from Participants

As was described in chapter 5, I used family and friends as my initial set of participants. After that I asked them to reach out to friends and family to get more people. Most of my participants were directed to me through one of my cousins. I bring this up because I feel that my participants had a higher than average desire to help me be successful. This is a possible bias because the participants may have felt a greater willingness to pay attention and do "their best" than randomly selected participants with no personal connection to me would have. My original plan was to do this recruitment from random students at Weber State University, but the campus restrictions due to COVID-19 prevented that.

6.1.3 The Assumption of Trust

In my experiment I assumed that my participants have a baseline of trust in recommendation software. This assumption follows from the nature of the research questions, since they ask about how explanations affect trust and understanding. Because of this assumption, in my results I cannot make any claims about how trust and understanding evolved compared to the participants' previous levels. Instead I put the focus on how these feelings changed in the context of the experiment itself. However, it is important to realize that each participant will have had a different range of experience with recommendation software before the experiment. These variations in experience mean that each person comes with a different baseline of trust and understanding of recommendation systems.

6.2 Results: Main Findings

As I stated in the introduction to this chapter, the results of this experiment show that trust and understanding are affected by the presence of explanations.
To show this I will review the analysis of each survey question, starting with survey question 1.

6.2.1 Trust

The first survey question states, "How did the recommendation you just saw affect your trust in the recommendation software?" The participants were asked to answer this question by giving a rating of 1-5, where 1 represented a significant reduction in trust and 5 indicated a significant increase. The first thing to note is the general trend of how the participants answered this question. This trend is given below in Figure 17.

Figure 17. Average Response to Survey Question 1. This graph shows the response to survey question one for each recommendation averaged over all the participants.

The main feature of this graph is that it shows that false recommendations did affect trust. This can be seen in the dips at recommendations 6, 8, 11, 16, and 23. Each time a false recommendation was encountered, on average participants reported that they had less trust in the recommendation software. The follow-up question would be: does the presence of false explanations damage trust over time? Figure 18 gives some insight into this.

Figure 18. Average Response to Survey Question 2. This graph shows the response to survey question two for each recommendation averaged over all the participants.

Survey question two states, "Based on all the recommendations you have seen up to this point, how much do you trust the recommendation software?" We can see from Figure 18 that the trend in question 2 follows the trend in question 1 nearly identically. This leads to the conclusion that exposure to faulty recommendations does not impact trust negatively over the long term. This conclusion comes from the results of the experiment; it is important to note that I only ran the experiment once.

The next question is: what contributes to these trends? The ANOVA analysis gives us some insight into this question. In my analysis of this question I investigated several different possible interactions. These combinations are given in Table 2.

Table 2. ANOVA Combinations for Survey Question 1
Participant gender
Participant STEM status
Participant STEM status and gender
Participant reaction to bad explanations and gender
Participant reaction to bad explanations and STEM status
Participant reaction to bad explanations, STEM status, and gender

Of these I found that the interaction between participants' gender and STEM status was statistically significant, with F = 11.1979 and p = 0.0144. The interaction is visualized in Figure 19. The interaction graph gives insight into the difference between the genders with respect to trust. The graph shows that, on average, men have a larger variance in their trust than women do. It is also important to note that there is a statistically significant factor that contributed to participants identifying false explanations. The factor is STEM status, with F = 9.1811 and p = 0.0025. The interaction is shown in Figure 20. This is important to note because it gives more insight into the difference between men and women in the context of the experiment. In fact, this interaction between gender and education was consistent throughout the analysis of all the survey questions.

Figure 19. Interaction Between Gender and STEM Status of Participants for Survey Question 1.

Figure 20. Interaction Between STEM Status and Reaction to False Explanations for Survey Question 1.
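For readers who want to reproduce this kind of test, the sketch below shows a two-way ANOVA with a gender-by-STEM interaction term using pandas and statsmodels, which are assumed to be available. The data frame here is a tiny placeholder rather than the experiment's data, and the column names are illustrative.

# Hedged sketch of the two-way ANOVA used for survey questions 1-4 (section 5.5):
# test the gender x STEM interaction on the 1-5 responses to a question.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "q1_score": [4, 5, 2, 3, 4, 1, 5, 3],          # 1-5 response to survey question 1
    "gender":   ["M", "M", "F", "F", "M", "F", "F", "M"],
    "stem":     ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
})

model = ols("q1_score ~ C(gender) * C(stem)", data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)  # includes the gender:stem interaction term
print(anova_table)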
6.2.2 Understanding

Survey questions 3 and 4 relate to understanding. As in the previous section, I start by observing the trends for both questions. These trends are given in Figures 20 and 21. Survey question three states, "How did the explanation of the recommendation affect your understanding of the software?" The main result from this trend is that after participants see a false recommendation, their average understanding decreases. The highest average understanding occurs during the first 5 recommendations. Interestingly, after each recommendation that has a false explanation the understanding of future recommendations is impacted. This is especially true after recommendation 11. It appears that the false explanations create lingering confusion about how correct explanations were being generated. The trend for question 4 is similar to that of question 3, as shown in Figure 21.

Figure 20. Average Response to Survey Question 3. This graph shows the response to survey question three for each recommendation averaged over all the participants.

Figure 21. Average Response to Survey Question 4. This graph shows the response to survey question 4 for each recommendation averaged over all the participants.

In terms of the ANOVA analysis for these questions, I used the same set of combinations stated in Table 2. I found that the interaction between gender and STEM status was statistically significant, with F = 5.3545 and p = 0.0003 for question 3 and F = 4.1216 and p = 0.0425 for question 4. This result is consistent with the F scores from questions 1 and 2. This leads to the conclusion that for both trust and understanding the important factor was the interaction of gender and STEM status.

6.2.3 Analysis of Explanation Types

Now I will review the findings for questions 5 and 6. These questions differ from the previous four because they are categorical. First, we will look at question 5. Survey question 5 states, "Which Explanation increases your trust the most?" Participants are asked to select which of the three types of explanations helped the most; if none of them were helpful, they select the none option. Table 3 gives the results of the Chi-squared analysis.

Table 3. Survey Question 5 Chi-Squared Analysis
Gender       Female        Male
STEM         No     Yes    No     Yes
Simple       194    203    180    191
Technical    196    104    116    149
Visual       9      14     7      17
None         67     62     74     71
P value: 0.00034

This table gives a summary of how each participant answered question 5, broken down by gender and STEM status. The P-value here shows that the results are statistically significant. The table also shows us a few interesting things about the explanation types. The first is that the visual explanation was viewed as far less effective at increasing trust when compared to the simple and technical explanations. Despite this, it is interesting to note that both men and women who were considered STEM preferred the visual explanations more than their non-STEM counterparts did. We see a different outcome when we consider the technical explanations. Here we see that non-STEM women preferred the technical explanation more than STEM women did. In men it was the opposite: STEM men preferred the technical explanations more than non-STEM men. Simple explanations seemed to be preferred roughly equally between men and women as well as between their STEM and non-STEM subgroups.
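A test of this kind is straightforward to reproduce with SciPy, which is assumed to be available. The sketch below runs a Chi-squared test of independence on the counts from Table 3; the resulting p-value should come out close to the 0.00034 reported there.

# Hedged sketch of the Chi-squared analysis for survey question 5, using the
# Table 3 counts (columns: female non-STEM, female STEM, male non-STEM, male STEM).
from scipy.stats import chi2_contingency

table3 = [
    [194, 203, 180, 191],  # Simple
    [196, 104, 116, 149],  # Technical
    [  9,  14,   7,  17],  # Visual
    [ 67,  62,  74,  71],  # None
]

chi2, p_value, dof, expected = chi2_contingency(table3)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}, dof = {dof}")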
These findings continue the theme, present in the findings from the first four questions, that the main factor affecting participant trust and understanding is the interaction of gender and STEM status. These patterns also appear, with some variation, in the results of the Chi-squared analysis of question 6. Survey question 6 states, "Which explanation did you find the most useful when trying to understand the recommendation?" Table 4 gives the results of the analysis.

Table 4. Survey Question 6 Chi-Squared Analysis
Gender       Female        Male
STEM         No     Yes    No     Yes
Simple       181    106    140    148
Technical    213    189    115    208
Visual       15     36     62     18
None         57     52     60     54
P value: 5.20e-15

Again, we see that visual explanations were the least preferred type when participants were trying to understand the recommendation software. We again see that STEM women preferred the visual explanations more than non-STEM women did. However, for the men, we see that non-STEM men preferred the visual explanations more than STEM men did. In terms of the technical explanations, the results show that STEM men preferred technical explanations far more than non-STEM men. This preference is reversed for the STEM/non-STEM women. Interestingly, non-STEM women found the simple explanation much more useful than STEM women did; meanwhile, the men were roughly even, as they were in question 5.

6.3 Anecdotal Findings

In this section I will review some of the responses that participants gave to the final survey questions. Refer to Appendix 2 for the text of these questions. The concepts of trust and understanding are hard to quantify. Because of this, the comments made by participants can provide useful insight into the results in the previous section. There are a few things to note. I did not require that each participant answer these questions, and because of this I will not be providing any type of quantitative analysis of these responses. Instead I will provide a summary of the sentiment from the responses to each of these questions, broken down by gender and STEM status.

6.3.1 Final Survey Question 1

The final survey question 1 asks participants which explanation did the best job of increasing their trust and why. Of the STEM females who replied, the consensus was that the simple explanations were best. A few examples of STEM female responses are given below.

"Simple because it made it clear what it wanted."

"The simple explanation was very clear as to why I was being recommended something, which increased my trust the most."

"The simple explanation increased my trust the most in the software because it concisely stated why the recipe was recommended."

There were a few responses where STEM females noted that it was a combination of the simple and technical explanations, but only a few stated that the technical explanations were best at increasing their trust. Non-STEM females who responded were evenly split between the simple and technical. Here are some of their responses.

"I found technical more useful, it created a rating to help me gauge the worthiness of using that particular recipe. As opposed to going of just words that were similar to what I was interested in. "

"The simple explanation was generally the best because it was straight forward and gave the quickest answer as to why that recommendation was being given."

"It was probably a tie between the simple and technical explanations for me. Sometimes the simple explanation was more convincing and other times the technical explanation was more convincing.
Sometimes, there was information in the technical explanation that I didn't feel like was as relevant to the user profile I was basing my answers on, such as cookie texture or how much family members liked the cookies, which is why I cannot say definitively that the technical explanation increased my trust the most.” “Technical: The other methods seemed almost condescending, and the sentence extraction seemed more trustworthy.” This feedback is consistent with the outcome of the Cchi-squared analysis of question five. It also gives valuable context to what participants may have been thinking as they went through the experiment. For STEM males the responses were more split. However, the participants who responded seemed to favor the simple explanation as well. Below are some of the more emblematic responses. “Mostly the simple one, since it was straight on saying I am giving you this suggestion based on what you like or your favorite picks from what we know.” “the simple explanation was the best for trusting because it was simple and to the point in understanding why it was choosing it.” “Technical explanation - it provided more details about how it was comparing its understanding of my profile to other users' profiles to determine recommendations. I agree with this approach, so it increased my trust.” “The technical explanations were best at describing how the software arrived at the conclusion that the cookies should be recommended.” 84 Non-STEM males were also somewhat divided between technical and simple explanations, but they are leaned more towards the simple explanations. Here is some of their feedback. “The simple explanation was the best at gaining my trust, it told me what type of cookie it was and that it has an ingredient that is one of my favorites.” “The simple explanation built trust because it allowed me a quick glance at why it was recommended” “Technical helped my trust in the software because I could read it in detail” 6.3.2 Final Survey Questions 2 and 4 Survey question 2 asked participants to indicate which explanation method was most helpful in understanding the recommendations and question 4 asked the participants to give their best guess as to how the software was generating recommendations. STEM females who responded preferred the technical explanations generally. Almost all the participants who responded were able to give a high-level description of how hybrid recommendation system works. Here a few of the responses. “The software uses my stated preferences (ingredients, cookie type, etc.) and searches for recipes that fit those preferences. then using reviews of users with similar preferences, picks the best reviewed recipes of that group to recommend to me.” “The software takes the information provided by the user in what I am assuming is a user profile when starting the account. The software then analyzes similar user profiles and recipes for words that match the users preferences.” “I think it scans the profile and grabs a few key words and then searches for recipes with those words that have a high-ish rating (perhaps >4.5), whether or not they are cookie recipes.” 85 Non-STEM females also tended to prefer the technical explanations, the interesting thing here is the response to question 4. These participants were non-STEM and the assumption I had would be that they would struggle to understand the recommendation software. To my surprise, most of the participants who responded were able to provide a good guess. 
The following are the participants’ own words: “It looks at your favorites cookies and ingredients and compares that to the recipes collected. It searches for keywords and makes a recommendation based on those.” “first I think it tries to pull highly rated cookies from the favorites list. It also compares Steve to other users who have similar tastes and see what they rate other similar cookies that match his favorites list. Then it does a sentence level analysis to find a common theme to the reviews. Sometimes this sentence level analysis of reviews will pull up an incorrect recipe (like the pumpkin one that in the visual explanation used the word instead -- many recipe reviewers significantly change recipes they start with and explain how and why in the review. that could be what's leading to some of the mismatches). I think the mismatch on other explanations like pulled pork? Idk just technical mistakes in coding I guess.” “The software pulls key words from the users profile, I.e. chocolate, and then finds other highly rated recipes that include chocolate in them. However, I think the software can be confused and misguided if there is a comment or something else associated with the recommended recipe that is in the profile, for example the pumpkin and peanut butter recommendation.” STEM males had a response sentiment like the females in that they generally also preferred technical explanations. This group was also able to provide well educated guesses for how the software might be working. Below are some of these guesses. “Prior to getting the meat explanations, I would have thought that the simple test was just trying to find high-rated recipes that contained ingredients that matched Steve's liking, the technical was comparing my review comments to other people's comments on the provided recipe, and the visual was finding the most common words in the reviews of the provided recipe and seeing if any of them matched Steve's profile.” 86 “Based on past actions, including ratings and comments on recipes, the recommendation engine creates profile that includes preferred recipe types, preferred ingredients, and favorite recipes. The engine then ranks each of these things in terms of what it determines is most important to this user. Each recipe in the system is also cataloged with a set of keys from both the recipe and its ratings/comments. When generating a recommendation, the system finds recipes with numerous or strong matches between the recipe's cataloged keys and the user's profile.” Non-STEM males who responded, were more divided between technical and simple explanations being the best for increasing understanding. This was reflected in their responses to question 4. Most of the non-STEM males who responded were able to make a good guess, but some of them were less specific and in one case, a participant admits that they did not know. Here are some examples. “It appeared most of the recommendations came because of previous user ratings, comments, and user profiles, ratings, and favorites” “in my opinion the software takes recent options it received and uses it to make a selection that is in the same category as previously mentioned selections” “Based on the technical explanation, it seems that the software analyzes sentences from users most like Steve (I'm not sure what those metrics are, maybe favorite ingredients, cookie type, etc.). 
Those reviews that have words consistent with recipes that Steve enjoys get recommended."

6.4 Summary of Main Findings

In this chapter I have shown that the trust and understanding users have in recommendation engines are influenced by the presence of explanations. Specifically, I found that there is an interaction between a participant's STEM status and their gender. I also found that the presence of explanations was impactful to a user's trust of the system overall. The consistent appearance of the STEM status and gender interaction across the first 4 questions provides an explanation of what factors are important in influencing users' trust in and understanding of a recommendation system.

I also found that different explanation methods have different impacts on participants based on their STEM status and gender. In general, I found that simple explanations were best at increasing trust while technical explanations were best at aiding in understanding the recommendation system. It was also shown that visual explanations, as they were expressed in the experiment, were by far the least useful for increasing trust and understanding.

In short, this chapter has shown that explanations do influence the trust and understanding that users have in recommendation engines. I have shown that users are impacted differently depending on the interaction of their gender and STEM status as it was defined in the experiment. And finally, I have shown that different types of explanations have different levels of impact on people based on the combination of their gender and STEM status. With these results and the context of the anecdotal findings, I feel confident that the research questions were answered.

Chapter 7. Future Work

The work presented in this paper is in a lot of ways a proof of concept. I felt that it was important to do this research since a qualitative examination of how people respond to explainable AI was lacking in the research I reviewed prior to developing my experiment. With so much effort in industry and academia being focused on making AI explainable, it seemed especially important to research how human users of an explainable system react to the explanations. In this chapter I want to cover a few specific areas where I feel that this research could be carried on in the future.

7.1 Inter-Disciplinary Efforts

One thing that is clear to me is that future research into this topic should be an inter-disciplinary endeavor. The subject of explainable AI is firmly in the realm of Human-Computer Interaction. This means that any study such as mine would benefit from including experts in human behavior to assist in experimental design as well as in analyzing results. One concern I had with my results is that participants may have had a hard time separating the intent of question 1 from question 2, as well as the difference between questions 3 and 5 from the survey. This may have contributed to why the trends for these sets of questions were so similar. This variable might have been avoided or controlled if I had included consultation from experts in that area. Another reason that I feel future work should reach beyond computer science experts is that the results of my experiments try to quantify human emotions and cognitive capabilities. Because of this, my results make the statement that something is definitely happening when users are exposed to explanations in a recommendation engine, but they struggle to quantify why this happens.
Including experts in why humans act as they do would help future work produce more insightful conclusions.

7.2 Investigating Types of Explanations

In my experiment I selected three different types of explanations with the goal of getting a sense for how they interacted with one another. A main goal for future research should be to focus on one specific explanation strategy and explore all its nuances. For instance, in chapter 6 I noted that the visual explanations were universally the least preferred method for helping users develop trust and understanding. This leaves me with a lot of questions that could be the subject of future works. For example, was the problem that I used a word cloud at all, or was it the manner in which the word cloud was generated? Questions like these could form many useful research efforts. Another area that would be interesting to investigate is the importance of understanding versus trust when it comes to recommendation engines. One of the prevailing themes that I encountered when reading the anecdotal feedback was that my participants appreciated simple, easy-to-digest answers when it came to building trust. When it came to understanding, the quantitative and qualitative results made it clear that the technical explanations were the best method for increasing system understanding. Since simple explanations are generally easier to create in live systems, it would be important to research how much a user's experience is impacted by understanding the system providing the recommendations as compared to trusting it.

7.3 Validating Research

The final area I want to cover in this chapter is the idea of conducting more research to validate my findings. My analysis set was limited to 60 individuals. Further research should be conducted to validate my findings. This research could take the form of repeating the experiment but with a much wider audience. Another important measure would be to conduct experiments that could help establish a general baseline of trust and understanding that people have in recommendation engines. Knowing this would help give context to my research by giving insight into the magnitude of the change that different methods of explanation produce in trust and understanding.

Chapter 8. Conclusion

8.1 Paper Review

In this paper I have introduced the concept of explainable artificial intelligence from a theoretical perspective as well as giving numerous examples of how explainable artificial intelligence is being pursued in current research. In chapters 1 and 2 I give the reader an introduction to the challenges of taking complex machine learning algorithms and deriving explanations from them, by way of approximation with explainable models or by finding ways to use explainable models directly in the AI system. I also introduce the reader to the foundational ideas of recommendation engines and provide some useful resources for furthering the understanding of how recommendation engines can be implemented. In chapters 4 and 5 I describe my experiment from a practical perspective. I describe the design process for developing the experimental recommendation system, with the key idea being the use of a contrived recommendation system designed to address the research questions. I also give a full explanation of how my experiment was conducted so that any future efforts will be aware of how I came to my results and conclusions. In chapter 6 I give an overview of the development of my research questions, the construction of the experiment, and the analysis of the results.
91 Chapter 7 gives the reader my summary findings. I show that explanations do influence the trust and understanding users have with recommendation engines. I also show that my results indicate a driving factor of this influence is an interaction of the gender and type of education a user has. Finally, in chapter 8 I give my thoughts on the directions that future work could take based on the experience I had in doing the research and analyzing the results. 8.2 Final thoughts AI will continue to become an ever-increasing factor in our day to day experience. In this paper I have focused solely on recommendation engines, but the ubiquity of AI reaches into all aspects of the modern experience. Because of the complex nature of machine learning it is important to make sure that transparency into how AI system learn and make decisions keeps pace with the types of decisions we ask AI systems to make. My research adds to this effort by helping to provide some insight into how people experience explanations in an AI system. This insight can help guide the development of transparent AI systems towards explanation methods that provide the biggest benefit for the people who use them to make important choices in their daily lives. 92 Appendix 1: Survey Questions Each question in the survey is meant to be answered on a scale of 1-5 where 1 is the low and 5 is the high. The following are screenshots of my survey questions that were asked after every recommendation: Question 1 Question 2 Question 3 Question 4 93 Question 5 Question 6 Appendix 2: Final Survey Questions Each question in the survey is meant to be answered on a scale of 1-5 where 1 is the low and 5 is the high. The following are screenshots of my final survey questions that were asked after all the recommendations were finished: Question 1 Question 2 Question 3 Question 4 95 References 1. A. M. Turing, "Computing Machinery and Intelligence," Mind, vol. 59, (236), pp. 433-460, 1950 2. D. Silver, T. Hubert, J. Schrittwieser and I. Antonoglou, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, vol. 362, no. 6419, pp. 1140-1144, 2007. 3. S. Grigorescu et al, "A survey of deep learning techniques for autonomous driving," Journal of Field Robotics, 2019. 4. K. Luk, “Introduction to TWO approaches of Content-based Recommendation System,” Towards Data Science, 2 February 2019. [Online]. Available: https://towardsdatascience.com/introduction-to-two-approaches-of-content-based-recommendation- system-fc797460c18c. [Accessed 14 September 2020]. 5. A. Ajitsaria, “Build a Recommendation Engine With Collaborative Filtering,” Real Python, [Online]. Available: https://realpython.com/build-recommendation-engine- collaborative-filtering/#:~: text=Collaborative%20filtering%20is%20a%20technique,similar%2 0to%20a%20particular%20user. [Accessed 14 September 2020]. 6. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web (WWW '01). Association for Computing Machinery, New York, NY, USA, 285–295. DOI:https://doi.org/10.1145/371920.372071 7. Netflix, “Netflix Prize,” Netflix,[Online]. Available: https://www.netflixprize.com/. [Accessed 14 September 2020]. 8. Y. Koren, "The BellKor Solution to the Netflix Grand Prize," August 209. [Online]. Available: https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf. [Accessed 14 September 2020]. 9. M. Piotte and M. 
Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize," August 2009. [Online]. Available: https://www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf. [Accessed 14 September 2020]. 10. A. T¨oscher and M. Jahrer, "The BigChaos Solution to the Netflix Grand Prize," 4 September 2009. [Online]. Available: https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf. [Accessed 14 September 2020]. 11. Grand View Research, "Autonomous Vehicle Market Size, Share & Trends Analysis Report By Application (Transportation, Defense), By Region (North America, Europe, Asia Pacific, South America, MEA), And Segment Forecasts, 2021 - 2030," March 2020. [Online]. Available: https://www.grandviewresearch.com/industry-analysis/autonomous-vehicles-market. [Accessed 21 September 2020]. 12. Market Data Forecast, "Self-Driving Cars Market Segmented by Application (Taxi, Civil, Public Transport, Heavy Duty Trucks, Ride Shares, Ride Hail), by 96 Component (Hardware, Software and Services), by Automation Level (Level 3, Level 4, Level 5), Regional analysis- Global Indust," February 2020. [Online]. Available: https://www.marketdataforecast.com/market-reports/self-driving-cars-market. [Accessed 21 September 2020]. 13. D. Wakabayashi, "Self-Driving Uber Car Kills Pedestrian in Arizona, Where Robots Roam," The New York Times, 19 March 2018. [Online]. Available: 2020. [Accessed 21 September 2020]. 14. B. Dattner, T. Chamorro-Premuzic, R. Buchband and L. Schettler, "The Legal and Ethical Implications of Using AI in Hiring," 25 April 2019. [Online]. Available: https://egn.com/dk/wp-content/uploads/sites/3/2020/06/The-legal-and-ethical-implications- of-using-AI-in-Hiring.pdf. [Accessed 21 September 2020]. 15. Gunning, David, and David W. Aha. “DARPA’s Explainable Artificial Intelligence Program.” AI Magazine, vol. 40, no. 2, Summer 2019, p. 44+. Gale In Context: College, https://link-gale-com. hal.weber.edu/apps/doc/A594831613/CSIC?u=ogde72764&sid=CSIC&xid= 3440e18d. Accessed 8 May 2020 16. T. Yiu, "Understanding Neural Networks," 2 June 2019. [Online]. Available: https://towardsdatascience.com/understanding-neural-networks-19020b758230. [Accessed 28 September 2020]. 17. J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015. 18. W. Cao et al, "A review on neural networks with random weights," Neurocomputing (Amsterdam), vol. 275, pp. 278-287, 2018. 19. W. Cao et al, "A review on neural networks with random weights," Neurocomputing (Amsterdam), vol. 275, pp. 278-287, 2018. 20. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In <i>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</i> (<i>KDD '16</i>). Association for Computing Machinery, New York, NY, USA, 1135–1144. DOI:https://doi.org/10.1145/2939672.2939778 21. Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A Survey of Methods for Explaining Black Box Models. <i>ACM Comput. Surv.</i> 51, 5, Article 93 (January 2019), 42 pages. DOI:https://doi-org.hal.weber.edu/10.1145/3236009 22. Mark Craven and Jude W. Shavlik. 1996. Extracting tree-structured representations of trained networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 24–30 23. Olcay Boz. 2002. Extracting decision trees from trained neural networks. 
|
Format | application/pdf |
ARK | ark:/87278/s62wfnq7 |
Setname | wsu_smt |
ID | 96827 |
Reference URL | https://digital.weber.edu/ark:/87278/s62wfnq7 |