Title | Liang, Ying_MCS_2021 |
Alternative Title | Applying Knowledge Graph and Natural Language Processing to Question Answer System |
Creator | Liang, Ying |
Collection Name | Master of Computer Science |
Description | A knowledge graph is typically built on top of existing databases to link all data together at web scale, combining both structured and unstructured information. Apache Jena is composed of several APIs that interact to process relational database (RDB) data and generate a knowledge graph. Natural language processing (NLP) is a subfield of linguistics concerned with the interaction between computers and human language. The main goal of this project is to combine a knowledge graph with NLP to retrieve answers to questions. |
Subject | Natural language processing (Computer science); Computational linguistics; Computer science |
Keywords | NLP (Computer science); Knowledge graph; Relationship database data; RDB; Linguistics |
Digital Publisher | Stewart Library, Weber State University |
Date | 2021 |
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Education in Curriculum and Instruction. Stewart Library, Weber State University |
OCR Text | TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT
INTRODUCTION
Background and Related Research
Generating Knowledge Graph
Natural Language Processing
LIMITATIONS
Relational Database (RDB) VS Knowledge Graph (RDF)
RESULTS
Bibliography

LIST OF FIGURES
Figure 1.1 Ontology Class Entity – this plot clearly shows the ontology entity concept
Figure 1.2 N-triple mapping process
Figure 1.3 Apache Jena – this plot is a workflow chart for Apache Jena
Figure 2.1 SQL query asking the question “Who is teaching Asp.net MVC Capstone”
Figure 2.2 SPARQL query asking the question “Who is teaching Asp.net MVC Capstone”
Figure 2.3 SPARQL query based on reasoning rules
Figure 3.1 System User Interface
Figure 3.2 System Testing Results (240 questions)
Figure 3.3 Questions and answers
Figure 3.4 JSON Data
Figure 3.5 System Testing Answer
Figure 3.6 Collected questions and Correct Answers
Figure 3.7 System Question Answers
Figure 3.8 System Testing Results (50 questions)
Figure 3.9 System Question Answer – Question “What classes does Yong Zhang teach?”
Figure 3.10 System Question Answer – Question “Is cs6610 an online class?”
Figure 3.11 System Question Answer – Question “The email of Time Fowers”
Figure 3.12 System Question Answer – If the answer is not included in the system, the system shows a corresponding alert message

LIST OF TABLES
Table 1.1 Course – This table describes class information, including some class details
Table 1.2 Course_to_Professor – This table shows the connection between courses and teachers
Table 1.3 Professor – This table describes professors’ personal information
Table 1.4 Course_to_Semester – This table describes the connection between courses and semesters
Table 1.5 Semester – This table describes semester information
Table 2.1 spaCy Categories – This table describes the different classifications used by spaCy

ACKNOWLEDGMENTS
I would like to thank my committee chair, __________, and my committee members, __________, __________, __________, and __________, for their guidance and support throughout the course of this research. In addition, I would also like to thank my friends, colleagues, and the department faculty and staff for making my time at Weber State University a positive experience.

ABSTRACT
A knowledge graph is typically built on top of existing databases to link all data together at web scale, combining both structured and unstructured information. Apache Jena is composed of several APIs that interact to process relational database (RDB) data and generate a knowledge graph. Natural language processing (NLP) is a subfield of linguistics concerned with the interaction between computers and human language. The main goal of this project is to combine a knowledge graph with NLP to retrieve answers to questions.

INTRODUCTION
As school course and faculty databases grow, students and teachers need more time to find course or professor information. A website that supports natural-language search for this information would therefore be helpful.
Basic course information is already available in a structured JSON file. The JSON data and ontology file were provided by Dr. Zhang. A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge. A knowledge graph can be viewed as similar to a relational database, with the difference that it is more powerful and supports reasoning. Apache Jena is the tool used in this project to build the knowledge graph. Constructing the knowledge graph takes several steps. First, store the JSON data in a relational database based on its relationships. Second, generate the N-triple files. Third, load the N-triple file into Apache Jena in TDB format to generate the knowledge graph. Fourth, configure the reasoning rules to support Apache Jena reasoning.
Natural language processing is a widely used technology, applied in areas such as search engines, Gmail, and text analysis. A natural language processing pipeline can contain a tokenizer, lemmatizer, tagger, dependency parser, entity recognizer, text categorizer, matcher, phrase matcher, entity ruler, and sentencizer.

Background and Related Research
Related publications such as “Knowledge Graph Refinement: A survey of approaches and evaluation methods” [1] helped me learn the basic concepts of knowledge graphs. The Knowledge Graph is a knowledge base used by Google and its services to enhance its search engine’s results with information gathered from a variety of sources. The structured JSON data and ontology file were gathered from students who attended Weber State University between 2014 and 2018. The JSON data includes basic course and professor information. An ontology typically provides a vocabulary describing a domain of interest and a specification of the meaning of terms in that vocabulary [2]. The first consideration is to map our data to the corresponding ontology. The N-triple file is then produced to generate the knowledge graph.
Many existing knowledge graphs are either available as linked open data, or they can be exported as RDF (Resource Description Framework) datasets enhanced with background knowledge in the form of an OWL 2 ontology [3]. Several platforms support knowledge graphs or RDF:
(1) D2RQ
D2RQ is a popular RDB-to-RDF mapping platform that supports mapping relational databases to RDF and posing SPARQL queries to these relational databases. [4]
(2) Neo4j
Neo4j is a native graph database platform specifically optimized to map, store and traverse networks of highly connected data to reveal invisible contexts and hidden relationships. By intuitively analyzing data points and the connections between them, Neo4j powers intelligent, real-time applications that tackle today’s toughest enterprise challenges. [5]
(3) Apache Jena
Apache Jena is an open source semantic web framework for Java. It provides an API to extract data from and write to RDF graphs. [6]
Natural language processing tasks include part-of-speech tagging, chunking, named-entity recognition, and semantic role labeling [7]. Fortunately, some NLP libraries include those algorithms and could be used in this project:
(1) spaCy
spaCy focuses on providing software for production usage. spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc. [8]
(2) NLTK
Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
[9]
(3) Stanford CoreNLP
The Stanford CoreNLP toolkit is an extensible pipeline that provides core natural language analysis. It is widely used both in the NLP community and among commercial and government users of open source NLP technology. [10]
For knowledge graph reasoning, the related literature describes several approaches. For Jena, the necessary steps to use a reasoner are finding a reasoner, applying the reasoner to data, creating an inference model, and accessing the inferences [11]. The output generated after performing inference is displayed on the console in Turtle format. When the reasoner is applied after merging both schemas, a list of all classes in the added schema must be generated in the output, along with the properties of each class.
In a knowledge graph, entities and their relations can be represented by nodes and edges, so based on this concept an algorithm can be set up to learn to use the reasoning rules. Because a question answering system has no annotations for such reasoning steps, the QA system has to learn them only from question-answer pairs. The model expresses the likelihood of an answer a being correct given the topic entity y and the question q as P(a | y, q) [12]. Also, since the topic entity in the question is not annotated, it is natural to formulate the problem by treating the topic entity y as a latent variable. In addition, several models for knowledge graph reasoning have been published. For example, TransE is an energy-based model that produces knowledge base embeddings. Relationships are represented as translations in the embedding space: if (h, l, t) holds, the embedding of the tail entity t should be close to the embedding of the head entity h plus some vector that depends on the relationship l. [13]

Generating Knowledge Graph
As mentioned before, the structured JSON data includes basic course and professor information.
The first step in generating a knowledge graph is to build the N-triple file based on the ontology file.
(1) JSON to RDB
Store the structured JSON data in a MySQL database. There are five tables, and pymysql is used to import the data into the MySQL database.
Table 1.1 Course – This table describes class information, including some class details
Course
Row 1 CourseID
Row 2 CourseTitle
Row 3 CourseCategory
Row 4 CourseCreditsNumber
Row 5 CourseTextBook
Row 6 CourseSyllabus
Row 7 CourseRoom
Row 8 DepartmentName
Row 9 CourseCode
Row 10 CourseOnline
Table 1.2 Course_to_Professor – This table shows the connection between courses and teachers
Course_to_Professor
Row 1 CourseID
Row 2 TeacherID
Table 1.3 Professor – This table describes professors’ personal information
Professor
Row 1 TeacherID
Row 2 TeacherEmail
Row 3 TeacherOfficehour
Row 4 TeacherOfficeLocation
Row 5 Teacherphone
Table 1.4 Course_to_Semester – This table describes the connection between courses and semesters
Course_to_Semester
Row 1 CourseID
Row 2 SemesterID
Table 1.5 Semester – This table describes semester information
Semester
Row 1 SemesterID
Row 2 Academicyear
Row 3 Semester
(2) RDB to Knowledge Graph
Figure 1.1 shows the ontology information; our data, such as “course” information, is mapped onto the corresponding “course” entity. There are two ways to do this: direct mapping, or using an RDB-to-RDF mapping language. I used the second so that the mapping better matches our data and relationships. The mapping language is used to generate our N-triple file. Apache Jena can load the N-triple file, convert it into TDB data, and generate the final knowledge graph. The mapping process is shown in Figure 1.2.
Figure 1.1 Ontology Class Entity – this plot clearly shows the ontology entity concept
Figure 1.2 N-triple mapping process
Apache Jena provides the TDB, Rule-Reasoner and Fuseki components. TDB is the component Jena uses to store RDF data; it is a storage technology.
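As a concrete illustration of the kind of N-triple file that is loaded into TDB, the following minimal Python sketch turns a few relational rows into N-Triples lines. It is not the mapping used in this project: the base URI `http://example.org/`, the predicate names, the helper names `escape_literal` and `rows_to_ntriples`, and the sample values are all hypothetical, and a real mapping would follow the ontology file and the RDB-to-RDF mapping language.

```python
# Minimal sketch: emit N-Triples lines from relational rows.
# BASE, the predicate names and any sample data are illustrative assumptions,
# not the ontology or mapping used in the project.

BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(value):
    """Escape backslashes, double quotes and line breaks for an N-Triples string literal."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(courses, course_to_professor):
    """Return N-Triples lines for Course rows and for course-teacher link rows.

    IDs are assumed to be IRI-safe values such as integers.
    """
    lines = []
    for course in courses:
        subject = f"<{BASE}course/{course['CourseID']}>"
        title = escape_literal(course["CourseTitle"])
        code = escape_literal(course["CourseCode"])
        # Column values become literals attached to the course resource.
        lines.append(f'{subject} <{BASE}courseTitle> "{title}" .')
        lines.append(f'{subject} <{BASE}courseCode> "{code}" .')
    for link in course_to_professor:
        subject = f"<{BASE}course/{link['CourseID']}>"
        teacher = f"<{BASE}professor/{link['TeacherID']}>"
        # A link-table row becomes an edge between two resources.
        lines.append(f"{subject} <{BASE}isTaughtBy> {teacher} .")
    return lines


if __name__ == "__main__":
    demo = rows_to_ntriples(
        [{"CourseID": 1, "CourseTitle": "Asp.net MVC Capstone", "CourseCode": "X1"}],
        [{"CourseID": 1, "TeacherID": 7}],
    )
    print("\n".join(demo))
```

Each output line is one subject, predicate, object statement terminated by " .", which is the line-oriented shape that makes N-Triples easy to generate row by row from a database.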
Rule-Reasoner can perform simple rule reasoning and lets users customize reasoning rules. Fuseki is the SPARQL server provided by Jena; it supports the SPARQL language for retrieval and can run efficiently both stand-alone and on the server side. The Apache Jena framework is shown in Figure 1.3. Once the knowledge graph is built, the SPARQL language can be used to query the information.
Figure 1.3 Apache Jena – this plot is a workflow chart for Apache Jena.

Natural Language Processing
To answer questions from users, the system must understand what the question is asking and looking for. This requires an important technology, Natural Language Processing (NLP). The connected world holds an abundant volume of natural language text containing a large amount of knowledge, but it is becoming increasingly difficult for a human to discover the knowledge/wisdom in it, especially within a given time limit [14]. In this project, we separate natural language processing into several steps.
(1) Name-Entity Recognition
spaCy is an important tool in the Name-Entity Recognition process. spaCy is used to perform tokenization, PoS tagging [15], Name-Entity recognition and dependency parsing, taking into account existing annotations if available. spaCy also already has a relatively mature package for identifying Name-Entities in English sentences. The spaCy entity categories are listed in Table 2.1. To improve the recognizer, a JSON dictionary was added so that Name-Entities are recognized more accurately.
Table 2.1 spaCy Categories – This table describes the different classifications used by spaCy
Type: Description
PERSON: People, including fictional
NORP: Nationalities or religious or political groups
FAC: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Named documents made into laws.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including “%”.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: “first”, “second”, etc.
CARDINAL: Numerals that do not fall under another type.
LANGUAGE: Any named language.
(2) Attributes Connect
Analyze the keywords in a sentence and treat the keywords contained in the question as the target attribute. After the analysis, the target attribute can be classified by part of speech (verb, noun, subject, and so on), or keyword sets can be defined for the necessary target attributes.
(3) Rule-based Answer Reference
Once the target attribute and Name-Entity have been extracted from the sentence, the Python Refo library is used to set up fuzzy matching rules.

LIMITATIONS
The current limitations of this system can be summarized as follows:
(1) The system relies on fuzzy matching rules to process natural language, and those rules are based on questions that I collected from different people. Because different people may phrase the same question in different ways, the system sometimes cannot give the correct answer if the sentence pattern is not in my collection.
(2) The system only supports questions that contain one main Name-Entity. For example, in “Who is the teacher for Advance Computer”, “Advance Computer” is the most significant Name-Entity, so we cannot add another Name-Entity and ask “Who are the teachers for Advance Computer and Computer Architecture?” For such questions with multiple Name-Entities, the system will respond with wrong answers.

Relational Database (RDB) VS Knowledge Graph (RDF)
A relational database is a collection of data items with pre-defined relationships between them.
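To make “pre-defined relationships” concrete, here is a minimal sketch that uses Python's built-in sqlite3 module in place of the project's MySQL database. It creates cut-down versions of the Course, Course_to_Professor and Professor tables from Tables 1.1 to 1.3 and answers a “who teaches this course” question with a join. The sample rows, the e-mail address, and the helper names `build_demo_db` and `teachers_of` are invented for illustration; this is not the query shown in the thesis figures.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with cut-down versions of three thesis tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Course (CourseID INTEGER PRIMARY KEY, CourseTitle TEXT);
        CREATE TABLE Professor (TeacherID INTEGER PRIMARY KEY, TeacherEmail TEXT);
        -- The link table is the pre-defined relationship between the two entities.
        CREATE TABLE Course_to_Professor (
            CourseID INTEGER REFERENCES Course(CourseID),
            TeacherID INTEGER REFERENCES Professor(TeacherID)
        );
        """
    )
    # Hypothetical sample data, not taken from the project's JSON file.
    conn.execute("INSERT INTO Course VALUES (1, 'Asp.net MVC Capstone')")
    conn.execute("INSERT INTO Professor VALUES (7, 'teacher@example.org')")
    conn.execute("INSERT INTO Course_to_Professor VALUES (1, 7)")
    return conn


def teachers_of(conn, course_title):
    """Return the e-mail addresses of the teachers linked to the given course title."""
    rows = conn.execute(
        """
        SELECT p.TeacherEmail
        FROM Course AS c
        JOIN Course_to_Professor AS cp ON cp.CourseID = c.CourseID
        JOIN Professor AS p ON p.TeacherID = cp.TeacherID
        WHERE c.CourseTitle = ?
        ORDER BY p.TeacherEmail
        """,
        (course_title,),
    ).fetchall()
    return [email for (email,) in rows]


if __name__ == "__main__":
    print(teachers_of(build_demo_db(), "Asp.net MVC Capstone"))
```

Even this small question needs two joins across three tables, because the relationship between a course and its teacher lives in the schema (the link table) rather than in the data itself.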
Tables are used to hold information about the objects in the database. Unlike a relational database, a graph database is structured entirely around data relationships. Graph databases treat relationships not as a schema structure but as data, like other values. A relational database is much faster when operating on huge numbers of records. In a graph database, each record must be examined individually during a query to determine the structure of the data, whereas this is known ahead of time in a relational database. Relational databases use less storage space, because they do not have to store all those relationships. However, graph databases are more powerful at searching information based on stored relationships. A relational database depends more on SQL queries, and when we need to search complex relationships, SQL queries are more error-prone. For example, suppose we want to know “who is teaching Asp.net MVC Capstone?”. In a relational database, the SQL query looks like Figure 2.1.
Figure 2.1 SQL query asking the question “who is teaching Asp.net MVC Capstone”
In Apache Jena, the same question can be asked with a simple SPARQL query, as in Figure 2.2. In addition, we can add reasoning rules and then ask as in Figure 2.3 to get the same results. In the future, if our system needs to support more complex questions, the knowledge graph should be our first choice for improving system accuracy.
Figure 2.2 SPARQL query asking the question “who is teaching Asp.net MVC Capstone”
Figure 2.3 SPARQL query based on reasoning rules

RESULTS
The structured data was loaded into the MySQL database, and the N-triple file was generated using the mapping language. The N-triple file provides comprehensive data and relationship information. I tried using Neo4j to generate the knowledge graph, but Neo4j still lacks a reasoning mechanism at the current stage, and its time cost is high.
Therefore, instead of Neo4j, I used Apache Jena, which supports the Resource Description Framework, another form of knowledge graph. It not only allows retrieving the data we need, but can also generate a reasoning file to support relationship reasoning, such as “teaches” as the inverse of “is taught by”. This project aims to help students and teachers get information more efficiently and easily, so natural language processing is the most important part. The spaCy English core library and our JSON dictionary file cover most of the Name-Entities we want to recognize. So far, the system is complete and can retrieve data using natural language processing. To improve the user experience, I built a user interface with HTML and CSS. The system front end is shown in Figure 3.1. Before the system runs, the Jena Fuseki server must also be active so that the knowledge graph can be reached. Once the Fuseki server is running, run the main.py file and open a browser to “http://127.0.0.1:666/”. After a question is entered in the text box and searched, the system answers it if the answer is included in our database. If not, the system shows an error message. So far, the knowledge graph is complete and includes all course, professor and relationship information.
For testing, I separated the work into general testing and more specific testing. The general testing data included 240 questions; the questions and correct answers were generated automatically from the JSON data, so they have limited variety in sentence patterns. The total running time is 670.789296 s. The average running time per question is 2.79 s. For precision, 238 of the 240 questions were answered correctly; the final result is shown in Figure 3.2. The two incorrect answers occurred because the answer for that information is empty in the JSON data. So, the accuracy is 99.17%. The questions are like those in Figure 3.3.
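The figures reported for the 240-question test can be re-derived with a few lines of Python. This is only an arithmetic check of the totals quoted above; the variable names are chosen here for illustration.

```python
# Recompute accuracy and mean time per question from the totals reported in the text.
total_questions = 240
correct_answers = 238
total_seconds = 670.789296

accuracy_percent = 100 * correct_answers / total_questions
mean_seconds = total_seconds / total_questions

print(f"accuracy: {accuracy_percent:.2f}%")
print(f"mean time per question: {mean_seconds:.2f} s")
```

Rounded to two decimals the accuracy is 99.17% (truncating instead of rounding gives 99.16%), and the mean time per question is 2.79 s.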
The correct answers are based on each block of JSON data, as in Figure 3.4, so they should be included in the results, as in Figure 3.5.
Figure 3.1 System User Interface
Figure 3.2 System Testing Results (240 questions) [bar chart “QA-System Answer (total 240 questions)”: correct vs. incorrect]
Figure 3.3 Questions and answers
Figure 3.4 JSON Data
Figure 3.5 System Testing Answer
To test the system accuracy in more depth, I collected a set of questions in which the same question is asked in different ways, and then generated the correct answers from the JSON file. The questions and answers are shown in Figure 3.6. There are 50 questions; the answers returned by the system are shown in Figure 3.7. The accuracy is 49/50; for one question the system could not find the answer. The total running time is 183.55178 s, and the running time per question is about 2.94 s. The results are plotted in Figure 3.8.
Figure 3.6 Collected questions and Correct Answers
Figure 3.7 System Question Answers
Figure 3.8 System Testing Results (50 questions) [bar chart “QA-System Answer (total 50 questions)”: correct vs. incorrect]
The system can successfully retrieve answers, as shown in Figure 3.9, Figure 3.10 and Figure 3.11. If the answer is not included in the system, the system shows an error message, as in Figure 3.12.
Figure 3.9 System Question Answer – Question “What classes does Yong Zhang teach?”
Figure 3.10 System Question Answer – Question “Is cs6610 an online class?”
Figure 3.11 System Question Answer – Question “The email of Time Fowers”
Figure 3.12 System Question Answer – If the answer is not included in the system, the system shows a corresponding alert message

Bibliography
[1] H. Paulheim, “Knowledge graph refinement: A survey of approaches and evaluation methods,” Semantic Web, vol. 8, no. 3, pp. 489–508, 2016. Available: 10.3233/sw-160218.
[2] Jérôme Euzenat, Pavel Shvaiko, Ontology Matching, second ed., Springer, 2013. ISBN: 978-3-642-38720-3.
[3] M. Arenas, B. Cuenca Grau, E. Kharlamov, S.
Marciuska and D. Zheleznyakov, "Faceted Search Over RDF-Based Knowledge Graphs", SSRN Electronic Journal, 2016. Available: 10.2139/ssrn.3199228.
[4] Vadim Eisenberg, "D2RQ/update: updating relational data via virtual RDF," Proceedings of the 21st International Conference on World Wide Web, April 2012, pp. 497–498. https://doi.org/10.1145/2187980.2188095.
[5] "Graph Database Platform | Graph Database Management System | Neo4j", Neo4j Graph Database Platform, 2021. [Online]. Available: http://neo4j.org. [Accessed: 21-Apr-2021].
[6] "Apache Jena", Jena.apache.org, 2021. [Online]. Available: https://jena.apache.org/. [Accessed: 21-Apr-2021].
[7] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa, "Natural Language Processing (Almost) from Scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[8] "spaCy - Wikipedia", En.wikipedia.org, 2021. [Online]. Available: https://en.wikipedia.org/wiki/SpaCy. [Accessed: 21-Apr-2021].
[9] "Natural Language Toolkit - NLTK 3.6.2 documentation", Nltk.org, 2021. [Online]. Available: https://www.nltk.org/. [Accessed: 21-Apr-2021].
[10] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, David McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, pp. 55–60, June 23-24, 2014.
[11] Ayesha Ameen, Khaleel Ur Rahman Khan and B. Padmaja Rani, "Reasoning in Semantic Web Using Jena," Computer Engineering and Intelligent Systems, vol. 5, no. 4, 2014. ISSN 2222-1719 (Paper), ISSN 2222-2863 (Online).
[12] Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander J. Smola, Le Song, "Variational Reasoning for Question Answering with Knowledge Graph," College of Computing, Georgia Institute of Technology, 27 Nov 2017.
[13] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, Oksana Yakhnenko, "Translating Embeddings for Modeling Multi-relational Data," Curran Associates, Inc., 2013.
[14] K. R. Chowdhary, Fundamentals of Artificial Intelligence, 2020. https://doi.org/10.1007/978-81-322-3972-7_19.
[15] N. Colic and F. Rinaldi, "Improving spaCy dependency annotation and PoS tagging web service using independent NER services", Genomics & Informatics, vol. 17, no. 2, p. e21, 2019. Available: 10.5808/gi.2019.17.2.e21.
[16] D. Yogish, T. N. Manjunath and R. S. Hegadi, "Survey on trends and methods of an intelligent answering system," 2017 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), 2017, pp. 346-353, doi: 10.1109/ICEECCOT.2017.8284526.
[17] Sanglap Sarkar, Venkateshwar Rao, Baala Mithra SM, Subrahmanya VRK Rao, "NLP Algorithm Based Question and Answering System", 2015 Seventh International Conference on Computational Intelligence, Modelling and Simulation, doi: 10.1109/CIMSim.2015.29. |
Format | application/pdf |
ARK | ark:/87278/s6e9x5xe |
Setname | wsu_smt |
ID | 96838 |
Reference URL | https://digital.weber.edu/ark:/87278/s6e9x5xe |