Title | Carter, Anthony_MCS_2020 |
Alternative Title | Applying Machine Learning to Predicting Human Life Expectancy, and Distance from Birth Location to Death's Location Using Genealogical Data |
Creator | Carter, Anthony |
Collection Name | Master of Computer Science |
Description | I focused on the following research questions: can I predict, better than the majority class, whether a person will reach adulthood, and can I predict within reasonable bounds a person's age at time of death and the distance from the location of their birth to the location of their death? Finally, for the previous questions, does having generational information about a person's parents and grandparents make the prediction more accurate, or is family history an unimportant factor? I used multiple machine learning algorithms for these predictions, chosen based on whether the prediction is classification or regression: classification for the question of whether a person reaches adulthood, and regression for the questions about the age at time of death and the distance from the location of birth to the location of death. For classification, I used decision trees, k-nearest neighbor, naïve Bayes, and neural networks. For regression, I used linear regression, regression trees, and neural networks. I show that for classification, I was able to get within several percentage points of beating the majority class but always fell short. For regression, I was not able to reduce the root mean square error to less than 20% of the mean result. |
Subject | Computer science |
Keywords | Family history; Machine learning algorithms |
Digital Publisher | Stewart Library, Weber State University |
Date | 2020 |
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
OCR Text | Applying Machine Learning to Predicting Human Life Expectancy, and Distance from Birth Location to Death's Location Using Genealogical Data
by Anthony Carter
A thesis/project submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE OF COMPUTER SCIENCE, WEBER STATE UNIVERSITY, Ogden, Utah, November 25, 2020.
Faculty Advisor, Dr. Robert Ball, Committee Chair (signature). Second Committee member, Dr. Kyle Feuz, Committee Member (signature). Second Committee member, Dr. Hugo Valle, Committee Member (signature). Student, Anthony Carter (signature).
Copyright 2020 Anthony Carter

Abstract
I focused on the following research questions: can I predict, better than the majority class, whether a person will reach adulthood, and can I predict within reasonable bounds a person's age at time of death and the distance from the location of their birth to the location of their death? Finally, for the previous questions, does having generational information about a person's parents and grandparents make the prediction more accurate, or is family history an unimportant factor? I used multiple machine learning algorithms for these predictions, chosen based on whether the prediction is classification or regression: classification for the question of whether a person reaches adulthood, and regression for the questions about the age at time of death and the distance from the location of birth to the location of death. For classification, I used decision trees, k-nearest neighbor, naïve Bayes, and neural networks. For regression, I used linear regression, regression trees, and neural networks. I show that for classification, I was able to get within several percentage points of beating the majority class but always fell short. For regression, I was not able to reduce the root mean square error to less than 20% of the mean result.

Dedication
I am dedicating this paper to Robert Carter, my father. I grew up as my father went from serving in the military as military police to earning his master's degree in business. Before my father received his master's degree, it felt like our family was poorer than the neighbors; after he got his master's degree, our quality of life noticeably improved. This left an impression on me: it is easy to say that having a college degree can improve my life, but it is another thing to see the changes that came from going from close to poverty on a military salary to having a master's degree. He helped me get my bachelor's degree as well. I had a part-time job at Walmart, which was not sufficient to pay for all of the tuition, class fees, and book costs, and left little in the way of savings. He helped me pay for my tuition with the restriction that every class had to be at least a B- or higher. This support allowed me to focus on my schooling and not have to spend a considerable amount of time working. After I got my bachelor's degree, I went for my master's degree, not because my job required it, but out of a personal desire to continue my education to the level my father has.
I have lived my entire life doing school, and because of my father, I am capable of continuing my education still. Table of Contents Dedication .......................................................................................................................... iii List of Tables ..................................................................................................................... vi List of Figures .................................................................................................................... ix 1.0 Introduction ..................................................................................................................11 2.0 Related Work ...............................................................................................................16 3.0 Data Manipulation .......................................................................................................19 3.1 Formatting and inserting the records ...............................................................21 3.2 Cleaning the data ..............................................................................................23 3.3 Splitting the database .......................................................................................27 4.0 Understanding the language of Python, the base language ..........................................30 5.0 Child Mortality.............................................................................................................32 5.1 Child Mortality: Decision Tree ........................................................................37 5.2 Child Mortality K-Nearest Neighbor ...............................................................44 5.3 Child Mortality Naïve Bayes ...........................................................................48 5.4 Child Mortality Neural Network ......................................................................52 6.0 Age at Time of Death ...................................................................................................56 6.1 Age at Time Linear Regression .......................................................................59 6.2 Age at Time of Death Regression Tree ............................................................63 6.3 Age at Time of Death Neural Network ............................................................67 6.4 Age at Time of Death by Year .........................................................................71 6.5 Age at Time of Death Subset ...........................................................................73 7.0 Distance Traveled ........................................................................................................77 7.1 Distance Traveled Linear Regression ..............................................................81 7.2 Distance Traveled Regression Tree .................................................................84 7.3 Distance Traveled Neural Network .................................................................88 7.4 Distance Traveled by Year ...............................................................................92 8.0 Result Findings ............................................................................................................94 9.0 Demonstrable Console Application .............................................................................97 10.0 Conclusion .................................................................................................................99 11.0 
References ................................................................................................................101 List of Tables Table 1 Decision Tree Not Balanced Generation 1 .......................................................... 38 Table 2 Decision Tree Random Under Sampling Generation 1 ....................................... 39 Table 3 Decision Tree SMOTE Generation 1 ................................................................... 39 Table 4 Perfectly balanced dataset .................................................................................... 40 Table 5 Decision Tree Not Balanced Generation 2 .......................................................... 40 Table 6 Decision Tree SMOTE Generation 2 ................................................................... 40 Table 7 Decision Tree Random Under Sampling Generation 2 ....................................... 41 Table 8 Decision Tree Perfectly Balanced Generation 2 .................................................. 41 Table 9 Decision Tree Not Balanced Generation 3 .......................................................... 41 Table 10 Decision Tree SMOTE Generation 3 ................................................................. 41 Table 11 Decision Tree Random Under Sampling Generation 3 ..................................... 42 Table 12 Decision Tree Perfectly Balanced Generation 3 ................................................ 42 Table 13 Decision Tree Summary Results ........................................................................ 42 Table 14 K-Nearest Neighbor Not Balanced Generation 1 .............................................. 44 Table 15 K-Nearest Neighbor SMOTE Generation 1 ....................................................... 44 Table 16 K-Nearest Neighbor Random Under Sampling Generation 1 ........................... 44 Table 17 K-Nearest Neighbor Perfectly Balanced Generation 1 ...................................... 44 Table 18 K-Nearest Neighbor Not Balanced Generation 2 .............................................. 45 Table 19 K-Nearest Neighbor Smote Generation 2 .......................................................... 45 Table 20 K-Nearest Neighbor Random Under Sampling Generation 2 ........................... 45 Table 21 K-Nearest Neighbor Perfectly Balanced Generation 2 ...................................... 45 Table 22 K-Nearest Neighbor Not Balanced Generation 3 .............................................. 45 Table 23 K-Nearest Neighbor SMOTE Generation 3 ....................................................... 46 Table 24 K-Nearest Neighbor Random Under Sampling Generation 3 ........................... 46 Table 25 K-Nearest Neighbor Perfectly Balanced Generation 3 ...................................... 46 Table 26 K-Nearest Neighbor Summary Results .............................................................. 46 Table 27 Naïve Bayes Not Balanced Generation 1 .......................................................... 48 Table 28 Naïve Bayes SMOTE Generation 1 ................................................................... 48 Table 29 Naïve Bayes Random Under Sampling Generation 1 ....................................... 48 Table 30 Naïve Bayes Perfectly Balanced Generation 1 .................................................. 48 Table 31 Naïve Bayes Not Balanced Generation 2 .......................................................... 49 Table 32 Naïve Bayes SMOTE Generation 2 ................................................................... 
49 Table 33 Naïve Bayes Random Under Sampling Generation 2 ....................................... 49 Table 34 Naïve Bayes Perfectly Balanced Generation 2 .................................................. 49 Table 35 Naïve Bayes Generation 3 ................................................................................. 49 Table 36 Naïve Bayes SMOTE Generation 3: .................................................................. 50 Table 37 Naïve Bayes Random Under sampling Generation 3 ........................................ 50 Table 38 Naïve Bayes Perfectly Balanced Generation 3 .................................................. 50 Table 39 Naïve Bayes Summary Results .......................................................................... 50 Table 40 Neural Network Not Balanced Generation 1 ..................................................... 52 Table 41 Neural Network SMOTE Generation 1 ............................................................. 52 Table 42 Neural Network Random Under Sampling Generation 1 .................................. 52 Table 43 Neural Network Perfectly Balanced Generation 1 ............................................ 52 Table 44 Neural Network Not Balanced Generation 2 ..................................................... 53 Table 45 Neural Network SMOTE Generation 2 ............................................................. 53 Table 46 Neural Network Random Under Sampling Generation 2 .................................. 53 Table 47 Neural Network Perfectly Balanced Generation 2 ............................................ 53 Table 48 Neural Network Not Balanced Generation 3 ..................................................... 53 Table 49 Neural Network SMOTE Generation 3 ............................................................. 54 Table 50 Neural Network Random Under Sampling Generation 3 .................................. 54 Table 51 Neural Network Perfectly Balanced Generation 3 ............................................ 54 Table 52 Child Mortality Summary Results ..................................................................... 54 Table 53 Age at time Of Death Summary Results ............................................................ 67 Table 54 Distance Traveled Distance Traveled Summary Results ................................... 88 Table 55 Child Mortality Summary .................................................................................. 94 Table 56 Age at Time of Death Summary ........................................................................ 95 Table 57 Distance Traveled Summary .............................................................................. 96 Table 58 GED Fields ........................................................................................................ 97 List of Figures Figure 1 Age at Time of Death Linear Regression Prediction .......................................... 60 Figure 2 Age at Time of Death Linear Regression Actual ............................................... 60 Figure 3 Age at Time of Death Linear Regression Prediction - Actual ............................ 61 Figure 4 Age at Time of Death Linear Regression Difference ......................................... 62 Figure 5 Age at Time of Death Regression Tree Prediction ............................................. 64 Figure 6 Age at Time of Death Regression Tree Actual .................................................. 64 Figure 7 Age at Time of Death Regression Tree Prediction - Actual ............................... 
65 Figure 8 Age at Time of Death Regression Tree Difference ............................................ 66 Figure 9 Age at Time of Death Neural Network Prediction ............................................. 68 Figure 10 Age at Time of Death Neural Network Actual ................................................. 68 Figure 11 Age at Time of Death Neural Network Prediction - Actual ............................. 69 Figure 12 Age at Time of Death Neural Network Difference .......................................... 70 Figure 13 Age at Time of Death Neural Network Over Years Before 1900 .................... 71 Figure 14 Age at Time of Death Neural Network Over Years After 1900 ....................... 72 Figure 15 Age at Time of Death Neural Network Age Under 5 Prediction ..................... 73 Figure 16 Age at Time of Death Neural Network Age Under 5 Actual ........................... 74 Figure 17 Age at Time of Death Neural Network Age Over 5 Prediction ....................... 75 Figure 18 Age at Time of Death Neural Network Age Over 5 Actual ............................. 75 Figure 19 Age at Time of Death Neural Network Over Years Limited to Before 1900 .. 76 Figure 20 Distance Traveled Linear Regression Prediction ............................................. 82 Figure 21 Distance Traveled Linear Regression Actual ................................................... 82 Figure 22 Distance Traveled Linear Regression Prediction – Actual ............................... 83 Figure 23 Distance Traveled Linear Regression Difference ............................................. 83 Figure 24 Distance Traveled Regression Tree Prediction ................................................ 85 Figure 25 Distance Traveled Regression Tree Actual ...................................................... 85 Figure 26 Distance Traveled Regression Tree Prediction – Actual .................................. 86 Figure 27 Distance Traveled Regression Tree Difference ................................................ 87 Figure 28 Distance Traveled Neural Network Prediction ................................................. 89 Figure 29 Distance Traveled Neural Network Actual ...................................................... 90 Figure 30 Distance Traveled Neural Network Prediction – Actual .................................. 91 Figure 31 Distance Traveled Neural Network Difference ................................................ 91 Figure 32 Distance Traveled Over the Years Via Regression Tree Before 1500 ............. 92 Figure 33 Distance Traveled Over the Years Via Regression Tree After 1500 ............... 93 1.0 Introduction Genealogy is the study of families and family histories. Genealogists use oral interviews, historical records, genetic analysis, and other records to obtain information about a family and to demonstrate kinship and pedigrees of its members. This data allows genealogists to gather a better understanding of the populace’s lifestyles, biographies, and motivations [1]. A purpose for better understanding a historical populace is to compare and ultimately understand the current one. Important parts of genealogy are birth, death, and where those events were located. Standard ways for machine learning algorithms to compare and determine success are based on if the theoretical answer being predicted is number based (called regression) or definite, distinct possible answers, like true and false, based (called classification). 
One standard for testing the accuracy of regression machine learning algorithms is to use root mean square error. A standard for testing classification machine learning algorithms is to compare against the majority class. Root mean square error is the standard deviation of the residuals (the algorithm's errors). Residuals are a measure of how far from the regression line data points actually are; root mean square error is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit [2, 3, 4, 5]. Classification has two or more definite, distinct possibilities (such as a problem that has true or false as an answer). The possibility that occurs most often is called the majority class. To have an algorithm that is better than the majority class is to have an algorithm that is more accurate than unintelligently choosing the most popular answer. For example, of the 2,224 people who were on the Titanic, over 1,500 people died when it struck an iceberg and sank, while 705 individuals survived. On a proposed problem of "will I survive the Titanic?", at best 31% of the people survived, so the majority class would be that a person did not survive the Titanic [6].
Combining machine learning techniques with genealogy could result in mathematical predictions that lead to a better understanding of the relationships in the genealogical data. My research question is the following: Is it possible to use up to three generations of family history data to predict, within a reasonable root mean square error, a person's age at time of death and the distance between the locations of birth and death, and to predict better than the majority class whether the person is at least able to reach eighteen? My hypothesis is that I will be able to predict the person's age at time of death within reasonable bounds. I am aiming for reasonable bounds to be a root mean square error of 20% or less of the average age at time of death for that generation. For instance, generation 1's mean age at time of death is 61.356 years, so the prediction should have an error of 12.271 years or less. Genetics are handed down through generational trees, and genetics have an impact on the lifespan of individuals. Even with no direct knowledge of which families have genetic diseases that will shorten their lifespans or which have won the genetic lottery and are extremely healthy, enough records should indicate a relationship that reduces the predicted age in the first situation and increases it in the second.
My hypothesis on the results is that I will not be able to beat the majority class in determining if the person will reach adulthood. The reason for this is that the proportion of people that reach adulthood is extremely high, at 92% in the database I am working with, and that is from 1000 C.E. to 2010 C.E. Because of that, the majority class will be difficult to beat. My hypothesis on the results is also that I will not be able to predict the distance between the locations of birth and death. While there is research in applying statistics to ancestors' locations [7], my intuition is that there will not be a comprehensive enough relationship in the family history data to find out how far people travel between where they were born and where they died. By "comprehensive enough", the goal will be to have a root mean square error of 20% or less of the overall average for that generation.
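As a rough illustration of these two evaluation standards, a minimal sketch in Python with scikit-learn is shown below; the numbers are invented for illustration only and are not results from this data.

import numpy as np
from sklearn.metrics import mean_squared_error

# Regression standard: root mean square error (RMSE) compared against 20% of the mean.
actual = np.array([61.0, 70.0, 45.0, 80.0, 52.0])     # invented ages at time of death
predicted = np.array([58.0, 66.0, 50.0, 75.0, 60.0])  # invented model predictions
rmse = np.sqrt(mean_squared_error(actual, predicted))
target = 0.20 * actual.mean()                         # the "reasonable bounds" threshold
print(f"RMSE = {rmse:.3f}, target = {target:.3f}, within bounds: {rmse <= target}")

# Classification standard: the baseline to beat is the majority class.
reached_adulthood = np.array([True, True, True, True, False])  # invented outcomes
majority_class_accuracy = max(reached_adulthood.mean(), 1 - reached_adulthood.mean())
print(f"Majority class baseline accuracy = {majority_class_accuracy:.2f}")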
For instance, generation 1 has an average distance of 14.261 miles, so I predict that the error will not be equal to or smaller than 2.852 miles.
The data I used for the machine learning analysis is from FamiLinx. "FamiLinx is a scientific resource from tens of millions of people mostly from the last 500 years. Different from traditional studies, this resource is the product of an ultra-crowd-sourcing approach and is based on collaborative work on genealogy enthusiast around the world who documented and shared their family stories" [8]. FamiLinx is useful for genealogical research due to the large amount of records they have accumulated. They have collected around 86 million individuals and have genealogical data on 43 million of those individuals. This gives researchers a large number of family trees to work with. They were able to gather these records by working with MyHeritage's Geni.com. Geni.com is a website for entering family history and cross-checking that data with comparable data already entered in order to combine family trees. The actions I will be performing on the FamiLinx data are similar to the work in the study Quantitative analysis of population-scale family trees with millions of relatives [9].
I used multiple machine learning algorithms to answer my research question. For classification, I used decision trees, K-nearest neighbor, naïve Bayes, and neural networks. For regression, I used linear regression, regression decision trees, and neural networks. Decision tree learning for both classification and regression is effectively a nested set of if/else statements, with each branching path having more if/else statements about the data that eventually lead to the prediction. A new input is tested by simply going through the nested if/else statements to find the result. K-nearest neighbors uses the idea that similar inputs will have similar results. It works by plotting the inputs against their outputs and measuring the distance between these points. Any new input is plotted against the existing inputs to determine which neighborhood or output it most closely matches, and that is used as its answer. Unlike K-nearest neighbor or decision trees, naïve Bayes looks at each of the features separately to predict the outcome. It determines the probability and impact of each feature to determine the output. With a new input, it uses each of the features to separately predict the outcome and combines those per-feature predictions into the most likely answer. Neural networks are multi-step algorithms that use the output of the first layer to feed the second layer, which uses those outputs to feed the third layer, which then generates the final prediction. I used a three-layered neural network: one hidden layer, with the other two being the input and output layers. This algorithm works similarly to how brain cells work, with multiple neurons sending data to even more neurons that do their own designated calculations and move the findings forward. Linear regression is an algorithm that creates a line that best represents the output for all of the inputs; this line has the smallest overall distance from the line to the points. The points are the individual records of data: the input features and output result of each record form a point on a graph.
2.0 Related Work
The concept of comparing many different and unrelated features to determine a result occurs often, sometimes with extremely accurate results.
The study Predicting Titanic Survivors using Artificial Neural Network [6] was done on the Titanic passengers to see if their characteristics could be used to predict whether they survived the Titanic sinking in April 1912. This study had 12 demographic variables, including passenger class, gender, and age. Several fields were extremely impactful in determining if a passenger survived: passengers from Class 1 had a 63% survival rate, and females survived 74% of the time compared to males at 18%. With that basic information, the study found it could predict whether a person survived with 99.28% accuracy. To have only a 0.72% failure rate with demographic data is extremely impressive. Another study that used the same concepts as predicting the Titanic survivors, with a more real-world application, is Applying Deep Learning to Public Health: Using Unbalanced Demographic Data to Predict Thyroid Disorder [10]. This study used a deep neural network to predict whether there were natural tendencies towards disease, specifically thyroid disorder, from demographic information alone. Their data had 747,301 samples with 13 demographic variables and had a lot of missing information. Some of the features included age, education, gender, income, marital status, and race. Many of the features had unknown data for a significant portion of the samples; one such example is the education and income variables, where data was missing for almost one-third of the sample population. Unknown data is a problem for machine learning algorithms. The solution this study used was to give unknown values a specific code to attempt to let the neural network learn from them. The result of this algorithm is to predict the classification of the thyroid disorder. One of the complications faced was that the samples were highly imbalanced, with the prevalence of thyroid disorder at 6.1%. To help solve the class imbalance, the study used both down-sampling and up-sampling methods, utilizing both bootstrap and SMOTE as up-sampling techniques. Their results were that "minimal demographic data can provide some insight on detecting thyroid disorder. Using our deep neural network model, the effectiveness of targeting the potential patients can increase up to 140% with 20th percentile of the population, as compared to random selection" [10].
The prior two studies were able to use demographic data for predictions, but neither of them used heritability or longevity, which are intrinsically linked to my research questions. The study Familial Excess Longevity in Utah Genealogies [11] found that the influence of heritability on the longevity of a person after the age of 65 was approximately 15%. The study Heritability Analysis of Life Span in a Semi-isolated Population Followed Across Four Centuries Reveals the Presence of Pleiotropy Between Life Span and Reproduction [12] used samples from isolated alpine communities in Italy and found results similar to the Familial Excess Longevity in Utah Genealogies study, with a heritability of 15% even before the age of 65, and that the heritability increases from 20% to 35% as people get significantly older than the average age of death.
With two studies using Utah and Italy locations as test subjects for their research, the study Regional hot spots of exceptional longevity in Germany [13] looked at Germany to see if there was "support for the application of spatial analysis techniques in investigating regional variation in exceptional longevity". Other studies that tested heritability are Genetics of healthy aging and longevity [14], which put the relationship at about 25%, and Inheritance of longevity evinces no secular trend among members of six New England families born 1650–1874 [15], which found the relationship to have an upper limit in the range of 33%–41%. The studies Genetic Signatures of Exceptional Longevity in Humans [16] and The genetics of extreme longevity: lessons from the New England Centenarian Study [17] tested specifically for the relationship in individuals who were 100+ years old. Genetic influence on human lifespan and longevity [18] found that identical twins, with their 100% identical genetics, lived an additional 0.18 years for each year their twin was alive after age 60, compared to fraternal twins with their 50% identical genes. The Estimates of the Heritability of Human Longevity Are Substantially Inflated due to Assortative Mating [19] study challenges the 15% heritability figure, finding that it may be as low as 7%, based on research showing that longevity had more to do with the environmental factors around the people one marries than with genetics. Environmental factors contributing to longevity have been researched as well. The relationship between income and longevity was studied in The Association Between Income and Life Expectancy in the United States [20]. These studies show that there is a link between heritability and longevity, but my research questions are focused on seeing whether the family tree and its demographics alone are able to find that relationship.
3.0 Data Manipulation
FamiLinx's data is split between two files: 86 million individual records and 43 million genealogy records. The individual records file is 14.6 GB, and the genealogy data file is 876 MB. The individual records file holds each person's information, such as where they were born and when they were baptized. The genealogy data file holds the relationships between people, such as person A was married to person B. The first step was to insert the data into an alternative format in order to analyze it. Afterwards, I had to clean the data because of missing data, errors, etc. Lastly, I had to split the data into multiple databases to handle each research question separately. I briefly explain the changes that needed to be made to the FamiLinx data here but go into more depth in the next sections.
The data needed to be in a format that was easy to analyze, modify, and ultimately build machine learning models from. The format of the FamiLinx files is csv, or comma-separated values: each record is a line and the fields are separated by commas. This format is effective at holding large amounts of data but terrible for manipulation. A tool that is able to read and modify large data sets is Microsoft SQL Server. SQL Server is bad at keeping the data condensed into an easily transferable file but excels at reading and writing efficiently. Because of the shortcomings of csv and the benefits of SQL Server, I transferred the records into this new format.
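The thesis performed this transfer in C# with the Entity Framework library, as described in the next section; purely as an illustrative analogue, a minimal Python sketch of the same chunked csv-to-SQL-Server idea is shown below. The file name, table name, and connection string are placeholders, not the ones used in the thesis.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string, file name, and table name.
engine = create_engine("mssql+pyodbc://localhost/FamiLinx?driver=ODBC+Driver+17+for+SQL+Server")

# Stream the very large csv in chunks so the whole 14.6 GB file never has to fit in RAM,
# appending each chunk to a SQL Server table as it is read.
for chunk in pd.read_csv("profiles.csv", chunksize=100_000):
    chunk.to_sql("Individuals", engine, if_exists="append", index=False)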
In the context of data science and machine learning, data cleaning means filtering and modifying data such that it is easier to explore, understand, and model. I filtered out the parts I did not want or need so that I did not have to look at or process them. I modified the parts that I did need but that were not in the format I needed them to be in, so that I could properly use them. One part of the data had to be modified because the algorithm cannot understand English. I had to turn the names of the cities where people were born into unique numbers; for example, London would no longer be that word but would become an id of 2, and New York City would become an id of 3. This way the algorithm doesn't have to deal with the words, but just understands that there is a difference between the ids (the difference between the cities themselves). So, if a certain city improves or worsens the age at time of death, the algorithm might be able to understand that, but doesn't have to try to understand language itself. The last modification to the database I needed to make was to split the data into a database for each research question. Up to this point I had been working on a single database while attempting to answer all three questions. I created separate databases to be able to do specific cleaning based on the individual research questions.
3.1 Formatting and inserting the records
The first issue was to put the data into a format that was easier to use on the hardware I had available. A csv file is useful for a normal amount of records, but it cannot be used when there are more records than can be held in RAM (Random Access Memory) for the tasks that I needed to do. At 86 million records, I needed a tool that I am skilled in and that is capable of handling an extremely large amount of records. Because of these requirements, I decided to move the dataset to a Microsoft SQL Server database. I used C# as the programming language to move and clean the data. I chose this language due to my prior experience with handling csv files and manipulating databases using the Entity Framework library. I started with the standard C# brute-force approach of creating a new object for each record, connecting to the database, and sending the request to the SQL server for each record. I recorded the amount of time it took to save the records over a period of one hour. At the rate the brute-force approach was able to read a record from the csv file and save it into the database, the entire csv file would have taken about a year to move over to the database. Since this was unacceptable, instead of creating a new object for each record, I created a stored procedure that would get called with parameters for each of the record's fields, using a constant connection to the database. Again, I recorded the amount of time it took to save the records over a period of one hour. While this solution was faster than the brute force, it was still expected to take about a month to move over. The solution that I finally used to move the data to the database within an acceptable time period was to parallelize the prior solution. Instead of having a single thread that read and wrote every single record, I had multiple threads where each thread was tasked with reading and writing only 5 million records. I used the standard C# library to handle the thread pooling to determine how many threads the system had available to work with.
I could not easily skip each thread to its designated section to read/write because the csv file did not have identical line lengths: each record could be practically empty except for commas, or have long sentences for each field. What this meant was that each thread had to skip a number of lines to get to its target record. I knew this was going to be a bottleneck with this solution, as it meant that spinning up a new thread would be extremely costly. As before, I let this run for one hour and recorded the amount of records saved to the database. The rate indicated it would be done with the entire csv file in three days. While three days is a long amount of time, it was capable of running independently, so I decided that the speed was acceptable and let it run during the weekday in a cool place to ensure it didn't overheat while being left alone at 100% CPU and memory usage for multiple days. The next option I had to speed up reading/writing the csv file to the SQL database, if the prior solution did not work, was to change how the threads found which line they were to start at: a scouting thread would record how many bytes it took to get to each thread's starting point (as each thread was to read 5 million records, I would record the byte count to get to the 5 millionth line, the 10 millionth line, etc.). From there, each following thread would then be able to skip directly to that byte count. Once I was successful with the demographic data, I had to do the same for the genealogical data of 43 million records to handle each individual's parents and children.
3.2 Cleaning the data
Now that I had the database in a format that was easy to read and query against, I needed to clean it [21]. The first thing I did was manipulate the data so it was in a consistent format. Dates were saved as strings in the file in a number of various forms, such as DD/MM/YY, MM DD, YYYY or MMMM YYYY. For dates that were in a format that was completely unknown, I set those fields to null. For dates that were partially complete, such as just the year, I used January 1st as the month and day. An interesting aspect of FamiLinx's data is that most records do not have both a baptism date and a birth date. The same is true for burial and death dates. While a birth date and baptism date are not the same day, they are often only a few days to a few months apart [22]. I require birth and death dates to answer two of my research questions. Because of that, I used the baptism date and burial date as the birth and death dates, respectively, when the birth and death dates were not available. As these records did not have fields that directly answered my questions, I had to create new fields for the age at time of death in days, a Boolean for whether the person reached adulthood, and the distance traveled in miles. I then went through each record to fill in these fields, using the same parallelization design that I created for the reading/writing of the csv file. Because I only used supervised machine learning algorithms, I could not use any record that had no answer for any of my questions, and so I removed them. Only one of my questions used any of the records of people who are still living. Because child mortality requires me to know either that the living person has reached adulthood or that they died before they had the chance to, I could not use any record of a living person who had not yet reached 18.
I could not judge whether that person might make it to 18, and so I had to delete all records of people who still have the ability to reach adulthood but haven't done so yet. I then removed the records that must have been incorrectly entered, for instance, people whose birth date was after their death date. The other group that was removed was individuals who were far too old when they died. The oldest person to have ever lived was Jeanne Louise Calment at the age of 122 [23]. If a person was older than 122, they must have been entered incorrectly or they would have had the world record instead of Ms. Calment. A restriction of the machine learning algorithms that I used is that no fields may contain nulls. Because of this, I had to remove columns or records that contained nulls, depending on how many records would be impacted overall. A purpose of this research is to see how these questions are better answered using information from the parents, grandparents, etc. Because of this, individuals with only one parent were not able to be used, so I removed everyone who had only one parent. This caused other people to then have only one parent; these people were then removed as well. This was done recursively until no people were left with only one parent. About 2 million records were removed because of this. For the same reason, I could not use records that had neither parents nor children, as heredity is a significant contributing factor in this document, and so I removed all of those records. This left about 38 million records to be analyzed. For the last question, will the person live to adulthood, if the person is alive and not yet an adult, we do not know enough to decide whether they reached adulthood: there is the possibility they will live until they are eighteen, or a chance they could die that day.
Another restriction of the algorithms is that words and sentences (strings) are not able to be used. A string is understandable to humans and so has meaning. To a machine, words have no meaning, but they can be mapped to distinct numbers, so the differences between them have meaning even if the words themselves do not. Because the data type was strings, I had to change all the strings to distinct enumerated numbers. The idea is that the newly enumerated fields will have a small number of distinct values. This was true for each of the fields except one: cause of death. To give cause of death some more useful characteristics, I created a new field to group the cause of death into simpler terms: cancer, medical, murder, unknown, etc. The reason for this was to reduce the many different cancers, causes of death, and differently worded forms of unknown. At this point I had removed and cleaned up all the shared fields for the three questions. I added SQL indexes on several of the relevant fields, which sped up access for all following queries. I had prior to this point determined to use heredity information to see whether multi-generational information can make a better prediction. I figured out how many records I had at each depth of the generational tree by building a SQL script to determine how many records of people with n generations of data I had. While I did have a family that reported 261 generations, the number of useful records diminished quickly with depth. In the end, I decided on three generations to see if there was a trend that could prove heredity useful in prediction.
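A minimal sketch of the string-enumeration step described above, using pandas; the column and value names are hypothetical, and the thesis performed this step in its own C#/SQL pipeline rather than in Python.

import pandas as pd

# Hypothetical birth-city column; the thesis stores such ids in fields like Birth_location_citysId.
df = pd.DataFrame({"birth_city": ["London", "New York City", "London", "Ogden"]})

# Enumerate each distinct string as an integer id so the algorithms see numbers, not words.
df["birth_city_id"] = df["birth_city"].astype("category").cat.codes
print(df)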
3.3 Splitting the database
All the records that I had created up to this point do not use heredity, so I needed to create the ability to handle generational information. The tables for generation two and generation three hold all the records for the child, their parents' data, and their grandparents' data. The table that was already created by this point handles generation one. For the second generation, I copied all the records from the generation one table to the new generation two table. I then went over each record, copied their parents' data, and placed it into new fields. I did not use mother or father labels because I did not enforce that the first parent be the father and the second parent be the mother, or vice versa. I gave the parent fields a consistent naming convention to ensure I knew what was relevant about the person and what was relevant from their parents. The naming convention was secondGenerationFirstParent_XXX and secondGenerationSecondParent_XXX, where XXX is the name of the field being copied over. Generation three's table has the same structure as generation two's except it has information about the parents' parents as well. To build it, I created a copy of the generation two data structure and copied all the data over. I then went over each record and either deleted it if it did not have a grandparent or copied over the grandparent's data if it did. As with generation two, I kept a consistent naming convention to ensure I knew what was relevant about the person, the person's parents, and the person's grandparents. The naming convention was secondGenerationFirstParent_ThirdGeneration_FirstParent_XXX and secondGenerationFirstParent_ThirdGeneration_SecondParent_XXX.
However, the data was not yet set up to answer any of the specific research questions. I needed to split this database into three separate databases and re-clean each one with its specific question in mind. This is where I encountered a hardware limitation: even after removing extraneous data, the SQL files were too large to have multiple copies located on my c:/ drive. To solve this, I kept backup databases on an external hard drive; I dealt with each database separately and saved the file as a backup on the external hard drive. For each new database, I had even more records or fields to remove due to nulls or because the field conflicted with that specific research question. All the tables needed to be modified, but the parent and grandparent information was unaffected. For instance, the databases for age at time of death and child mortality could not have the field for "distance traveled". What I found was that for most records where "distance traveled" is not null, the entire record would be complete; effectively, only completely filled out records were likely to have both the location where the person was born and the location where the person died. Because of this, "distance traveled" was a smaller subset of age at time of death and child mortality. There was another field that needed to be removed for the age at time of death and child mortality databases: the death date had to be removed so the algorithm would not follow the same arithmetic I did to reach the person's age (which would invalidate the underlying question of whether other fields can be used to determine the answer). For the age at time of death and distance traveled databases, another group of records that needed to be removed was every individual recorded as still living.
Both of these questions require the person to be dead in order to determine their age as well as how far they died from where they were born. The change made to the child mortality database was to remove the "is alive" Boolean field that indicated whether the person was alive or not; this was true for both the individual and the generational fields. The reason for needing to delete this field is that I had already deleted all underage living individuals, which meant the only living people left in the database were older than eighteen. I needed to prevent the algorithms from being able to tell whether a person reached adulthood simply by checking a single Boolean value.
4.0 Understanding the language of Python, the base language
Knowing that the work to accomplish the research questions would require multiple different machine learning algorithms, I needed the tools to create models for decision trees, K-nearest neighbor, naïve Bayes, linear regression, and neural networks. One library that is capable of creating all of these models is called Scikit-Learn (also called sklearn). Scikit-Learn is an open-source library in Python that is built upon other Python libraries such as NumPy. Without prior experience in Python, I built a simple application to make sure I understood the similarities and differences with the other languages I know: a simple application displaying the words "Hello World" to ensure I had the environment working and could understand the basic structure of the language. The next priority with Python was that sklearn requires the records of data to be in a NumPy object, an efficient multi-dimensional array. With the data in a Microsoft SQL database, I needed to build the connection so the tables that I had built were available in Python. I had never used this library before, so I used a test application of the classification decision tree to make sure I understood how to use the library correctly. The classification decision tree test application determines the Iris flower species based on petal and sepal widths and lengths. I used this to understand how the syntax and outputs are formed. The last piece I needed to understand better before working on the larger problem was feature elimination and recursive feature elimination with cross-validation [24]. A feature is one of the data fields that the record holds, such as a petal's length. The number of features in one individual record can be quite large, and not every feature has a significant impact on the output. For example, using the prior example of determining the species of Iris flowers, the age of the plant's owner or their name would be unimportant features. Feature elimination removes a feature and then determines the output using a scoring metric to see whether the impact was minimal. It does this for every feature in order to put the features in a ranking of importance. Feature elimination can be combined with recursive feature elimination with cross-validation to determine the optimal number of features, and which features those are. This tool works by using feature elimination to determine the ranking of important features, then recursively removes them from least important to most important and records the impact. This allows it to return the number of features that should be used, which can then be fed back into feature elimination to determine which features to use.
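A minimal sketch of recursive feature elimination with cross-validation applied to the Iris test application described above; the estimator and cross-validation settings are assumptions, since the exact parameters are not listed here.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFECV

X, y = load_iris(return_X_y=True)

# RFECV repeatedly drops the least important feature (ranked by the tree's
# feature_importances_) and keeps the subset with the best cross-validated score.
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=5)
selector.fit(X, y)

print("optimal number of features:", selector.n_features_)
print("selected feature mask:", selector.support_)
print("feature ranking:", selector.ranking_)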
I used the examples in the sklearn documentation as good examples for the models I needed to create.
5.0 Child Mortality
With the records cleaned and split to handle the supervised learning algorithms, I moved on to using the machine learning algorithms for child mortality. Specifically, I was looking at the question of whether I can predict, better than the majority class, if a person will reach eighteen, using the classification algorithms decision tree, naïve Bayes, k-nearest neighbors, and neural networks. While I had many fields given to me by FamiLinx, not every field is important to the outcome of child mortality. For example, in the prior example of the Titanic sinking, the color of a person's hair would have a low to nonexistent impact on the person's survival, while the wealth and status of the person would have a more significant impact on their survival. The scikit-learn documentation defines recursive feature elimination (RFE) as "Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features." [25].
Applying feature elimination as in the previous Iris flower example did not work, as the default scoring was not effective. Scoring is the metric that the feature elimination attempts to maximize with smaller and smaller subsets of features. The default scoring is accuracy, or how often the prediction matches the result. The problem that arose was that the database was so heavily imbalanced that the feature elimination was attempting to reduce the algorithm to just predicting the majority class, effectively removing every feature. At this point I had to look at more complicated scoring metrics to use. Scoring metrics use a matrix of predictions with four different possibilities and then use various mathematics to determine effectiveness. The four different possibilities are: true positive (tp), predicting that it would be positive (for child mortality, that the individual will survive to adulthood) and being correct; false positive (fp), predicting that it would be positive and being incorrect; true negative (tn), predicting that it would be negative (the individual did not survive to adulthood) and being correct; and false negative (fn), predicting that it would be negative and being incorrect. These are important as recall and precision are used in the calculation for determining the f1 score. Recall is tp / (tp + fn); recall is intuitively the ability of the classifier to find all the positive samples. Precision is tp / (tp + fp); precision is intuitively the ability of the classifier not to mark a negative sample as positive. Both recall and precision are important, as the scoring mechanism that I used was a weighted harmonic average between them called the f1 score. I chose this metric as I needed the overall accuracy to be high, as with recall, but wanted more than just to replicate the majority class, which I gained with precision [26, 27, 28, 29]. Having chosen the f1 score as the scoring for the feature elimination, I then used feature elimination on the first-generation child mortality data. The fields that I was left with were:
Generation One:
Birth_year
Birth_month
Birth_day
With choosing f1_score as the scoring for the feature elimination, I then used feature elimination on the first-generation child mortality. The fields that I was left with were: Generation One: Birth_year Birth_month Birth_day 34 GendersId Baptism_location_citiesId Death_location_country_codesId Death_location_countriesId Death_location_statesId Death_location_citiesId Death_location_place_namesId Cause_of_deathsId Birth_location_citysId Birth_location_statesId Birth_location_countrysId Birth_location_country_codesId Birth_location_place_namesId Birth_location_resolved_extern_typesId Burial_location_place_namesId Burial_location_citysId Burial_location_statesId Burial_location_countrysId Burial_location_country_codesId SurvivedToEighteen I did the same thing with generation two and generation three. The following fields are a subset of the fields that I was left with (I am only displaying to only 1 parent from second generation and third generation). Generation Two: Generation_Two_First_Parent_Birth_year Generation_Two_First_Parent_Birth_month Generation_Two_First_Parent_Birth_day Generation_Two_First_Parent_Death_year Generation_Two_First_Parent_Death_month Generation_Two_First_Parent_Death_day Generation_Two_First_Parent_GendersId Generation_Two_First_Parent_Baptism_location_citiesId Generation_Two_First_Parent_Baptism_location_statesId Generation_Two_First_Parent_Death_location_country_codesId Generation_Two_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Death_location_statesId Generation_Two_First_Parent_Death_location_citiesId Generation_Two_First_Parent_Death_location_resolved_extern_typesId Generation_Two_First_Parent_Death_location_place_namesId Generation_Two_First_Parent_Cause_of_deathsId 35 Generation_Two_First_Parent_Baptism_location_resolved_extern_typesId Generation_Two_First_Parent_Baptism_location_place_namesId Generation_Two_First_Parent_Baptism_location_countrysId Generation_Two_First_Parent_Baptism_location_country_codesId Generation_Two_First_Parent_Birth_location_citysId Generation_Two_First_Parent_Birth_location_statesId Generation_Two_First_Parent_Birth_location_countrysId Generation_Two_First_Parent_Birth_location_country_codesId Generation_Two_First_Parent_Birth_location_place_namesId Generation_Two_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_place_namesId Generation_Two_First_Parent_Burial_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_citysId Generation_Two_First_Parent_Burial_location_statesId Generation_Two_First_Parent_Burial_location_countrysId Generation_Two_First_Parent_Burial_location_country_codesId Generation_Two_First_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_SurvivedToEighteen Generation Three: Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_year Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_month Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_day Generation_Two_First_Parent_Generation_Three_First_Parent_Death_year Generation_Two_First_Parent_Generation_Three_First_Parent_Death_month Generation_Two_First_Parent_Generation_Three_First_Parent_Death_day Generation_Two_First_Parent_Generation_Three_First_Parent_GendersId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_citiesId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_statesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_country_codesId 
Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_statesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_citiesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_place_namesId Generation_Two_First_Parent_Generation_Three_First_Parent_Cause_of_deathsId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_place_namesId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_countrysId Generation_Two_First_Parent_Generation_Three_First_Parent_Baptism_location_country_codesId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_citysId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_statesId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_countrysId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_country_codesId 36 Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_place_namesId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_place_namesId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_citysId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_statesId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_countrysId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_country_codesId Generation_Two_First_Parent_Generation_Three_First_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_Generation_Three_First_Parent_SurvivedToEighteen 37 5.1 Child Mortality: Decision Tree Numpy, a Python numerical library, creates a multi-dimensional array the size of the records being retrieved, and the number of fields per record. The Iris python example held all the records in RAM as an array and fed the entire array into the algorithm to be trained. For my data, with an array of 2,809,463x23, it exceeded my 16GB of RAM and caused the application to fail. Because of this, I could not use the entirety of my records to train the algorithm using the same technique as the iris dataset [30]. My first goal was to use the Python code on the child mortality database generation 1 table. Attempting to query the database for the records and train the algorithm failed. This is due to a hardware limitation due to the amount of records and Sklearn. This was not a problem for testing the records, and so I was able to use 20% of the total database. At that point, I started by not using all the records intended to be used for training but instead grabbed recursively an increasingly smaller percentage of them until I can hold the records in RAM. The records were grabbed randomly within the first 80% of the database, leaving the training data alone. For generation one, I was able to hold 561,175 records. This is ~19.97% of the total 2,809,463 records. For generation two, I was able to hold 264,964 records. This is ~15.98% of the total 1,657,634 records. For generation three, I was able to hold 113,629 records. 
This is 80% of the total 142,036 records.

The base result, as seen in Table 1, is 89% accuracy while the majority class is 92.069%. The accuracy is high because the precision for the positive class is good; the issue is with the negative results, where both precision and recall are extremely low.

Table 1 Decision Tree Not Balanced Generation 1
          precision  recall  f1-score
FALSE     0.22       0.14    0.17
TRUE      0.93       0.96    0.94
accuracy                     0.89

In classification, when one of the answers is significantly more common than the other, the data is said to be imbalanced. For example, in this dataset, a person's chance to reach adulthood is overwhelmingly high at 92%. This is potentially a problem because various machine learning algorithms, including decision trees, rely on the minority class to help determine where the boundaries between the outputs should be. There are techniques that modify the records to attempt to correct imbalanced data [31, 32, 33, 34, 35, 36].

The first option is random under sampling [37, 38]. This technique randomly removes records from the overrepresented class until both classes are evenly matched. Given 1,000 "A" records and 1,000,000 "B" records, it would randomly remove 999,000 "B" records. The drawback is that nothing ensures the important features of "B" remain consistent with the larger pool, but as long as a significant number of "A" and "B" records are left, the features may still support a useful model. The result of under sampling generation 1 with the decision tree, as seen in Table 2, was that it increased the recall of negatives by 54 percentage points but decreased the recall of positives by 37 percentage points, dropping the overall accuracy from 89% to 59%.

Table 2 Decision Tree Random Under Sampling Generation 1
          precision  recall  f1-score
FALSE     0.12       0.68    0.21
TRUE      0.95       0.59    0.73
accuracy                     0.59

The second option is SMOTE (Synthetic Minority Over-sampling Technique) [39, 40]. This technique artificially adds records for the underrepresented class until both classes are evenly matched. It does this by analyzing the similarities among the features and creating new records that are similar, but not identical, to those that already exist. Given 1,000 "A" records and 1,000,000 "B" records, it would add 999,000 synthetic "A" records. SMOTE is "smarter" than random under sampling because it attempts to build new records, but the downside is that the new records are simulated, so if there is a problem with the records being created it could lead the algorithm astray. The result of SMOTE on generation 1 with the decision tree, as seen in Table 3, was that it increased the negative class's recall by 27 percentage points and decreased its precision by 8 percentage points, but also decreased the positive class's recall by 17 percentage points, dropping the overall accuracy by 13 percentage points.

Table 3 Decision Tree SMOTE Generation 1
          precision  recall  f1-score
FALSE     0.14       0.41    0.21
TRUE      0.94       0.79    0.86
accuracy                     0.76

To test the extremes of the imbalanced-data problem, I created a subset of the data where the two classes are even, with 222,820 records of people surviving to adulthood and 222,820 records of those who did not. Table 4 shows that in this reduced, balanced state the decision tree algorithm is finally capable of beating the majority class of 50%.
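Both balancing techniques above are available off the shelf; the imbalanced-learn library is a common implementation (the text does not name the library it used). A minimal sketch of applying each to a training split, with toy arrays standing in for the real generation tables, might look like this.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Placeholder training split: 900 majority ("survived") vs 100 minority records.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = np.array([1] * 900 + [0] * 100)

# Random under sampling: discard majority records until the classes match.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority records from the neighborhoods of existing ones.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X_train, y_train)

print(np.bincount(y_under))  # both classes reduced to the minority count
print(np.bincount(y_smote))  # both classes raised to the majority count
```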
Table 4 Perfectly balanced dataset precision recall f1-score FALSE 0.66 0.71 0.68 TRUE 0.68 0.63 0.65 accuracy 0.67 For generation 2 without any balancing techniques is shown in Table 5, the results that I had was an accuracy of 80% while the majority class is ~86.756%. In Table 6, I used SMOTE on Generation two Decision tree. In Table 7, I used random-under-sampling balancing on decision tree generation 2. I used the same technique as Generation 1 to create a perfectly balanced dataset for generation 2 for Table 8. SMOTE created a pretty high accuracy, but none of the balancing techniques were better than leaving the dataset imbalanced. Table 5 Decision Tree Not Balanced Generation 2 precision recall f1-score FALSE 0.25 0.27 0.26 TRUE 0.89 0.88 0.88 accuracy 0.79 Table 6 Decision Tree SMOTE Generation 2 precision recall f1-score FALSE 0.2 0.26 0.23 TRUE 0.88 0.85 0.86 accuracy 0.77 41 Table 7 Decision Tree Random Under Sampling Generation 2 precision recall f1-score FALSE 0.18 0.62 0.28 TRUE 0.91 0.58 0.7 accuracy 0.58 Table 8 Decision Tree Perfectly Balanced Generation 2 precision recall f1-score FALSE 0.2 0.66 0.31 TRUE 0.92 0.6 0.73 accuracy 0.61 For generation 3 without any balancing techniques is shown in Table 9. The results that I had was an accuracy of 80% while the majority class is ~ 84.771%. I then applied the various balancing techniques to Generation 3. Table 10 was SMOTE being applied which reduced the accuracy by ~4%, compared to Table 11 random under sampling which reduced the accuracy by ~19% but did manage to have the recall to be fairly balanced, whereas SMOTE did not. I then tested what if the data was perfectly balanced in Table 12 which fit in line with the random under sampling. Table 9 Decision Tree Not Balanced Generation 3 precision recall f1-score FALSE 0.35 0.37 0.36 TRU E 0.8 9 0.88 0.88 accuracy 0.8 Table 10 Decision Tree SMOTE Generation 3 precision recall f1-score FALSE 0.27 0.36 0.31 TRUE 0.88 0.83 0.85 accuracy 0.76 42 Table 11 Decision Tree Random Under Sampling Generation 3 precision recall f1- score FALSE 0.23 0.66 0.34 TRUE 0.91 0.6 0.72 accuracy 0.61 Table 12 Decision Tree Perfectly Balanced Generation 3 precision recall f1- score FALSE 0.22 0.66 0.34 TRUE 0.91 0.59 0.72 accuracy 0.6 In the best cases presented, I was not able to predict better than the majority class on if a person will reach adulthood of the age of eighteen or not for any of the generations. In all the base cases I could not do any better of detecting if a person WON’T reach the age of eighteen better than 37% percent of the time. If I used random under sampling then my accuracy was 70% but then I had only a 59% overall accuracy. The summary of all the results for Decision tree is in Table 13. Table 13 Decision Tree Summary Results Gen 1 Gen 2 Gen 3 Decision Tree No Balancing 0.89 0.79 0.8 Decision Tree SMOTE 0.76 0.77 0.76 Decision Tree Under sampling 0.6 0.58 0.61 Decision Tree 50/50 0.63 0.61 0.6 Majority Class 0.92069 0.86756 0.84771 Records used 561175 264,964 113,629 Records Total 2,809,463 1,657,634 142,036 43 Record Percentage Used 19.974% 15.984% 80% 44 5.2 Child Mortality K-Nearest Neighbor For K-nearest neighbor, I used the same fields and same number of records being used as I did with decision trees. For generation 1 without any balancing techniques is shown seen in Table 14. The results that I had was an accuracy of 91% while the majority class is ~92.069%. 
I applied SMOTE balancing in Table 15, random under sampling in Table 16, and forced the dataset to be perfectly balanced in Table 17. Table 14 K-Nearest Neighbor Not Balanced Generation 1 precision recall f1-score FALSE 0.27 0.06 0.1 TRU E 0.9 2 0.99 0.95 accuracy 0.91 Table 15 K-Nearest Neighbor SMOTE Generation 1 precision recall f1-score FALSE 0.14 0.43 0.21 TRUE 0.94 0.78 0.85 accuracy 0.75 Table 16 K-Nearest Neighbor Random Under Sampling Generation 1 precision recall f1-score FALSE 0.12 0.73 0.2 TRU E 0.9 6 0.53 0.68 accuracy 0.55 Table 17 K-Nearest Neighbor Perfectly Balanced Generation 1 precision recall f1-score FALSE 0.13 0.62 0.22 TRU E 0.9 5 0.65 0.77 accuracy 0.65 For generation 2 without any balancing techniques is shown in Table 18. The results that I had was an accuracy of 85% while the majority class is ~86.756. I applied smote 45 balancing in Table 19, random under sampling in Table 20, and forced the dataset to be perfectly balanced in Table 21. Table 18 K-Nearest Neighbor Not Balanced Generation 2 precision recall f1-score FALSE 0.29 0.1 0.15 TRU E 0.8 7 0.96 0.92 accuracy 0.85 Table 19 K-Nearest Neighbor Smote Generation 2 precision recall f1-score FALSE 0.17 0.62 0.27 TRU E 0. 9 0.55 0.68 accuracy 0.56 Table 20 K-Nearest Neighbor Random Under Sampling Generation 2 precision recall f1-score FALSE 0.17 0.66 0.27 TRUE 0.91 0.5 0.65 accuracy 0.53 Table 21 K-Nearest Neighbor Perfectly Balanced Generation 2 precision recall f1-score FALSE 0.18 0.7 0.29 TRUE 0.92 0.53 0.67 accuracy 0.55 For generation 3 without any balancing techniques is show in Table 22. The results that I had was an accuracy of 83% while the majority class is ~ 84.771%. I applied SMOTE balancing in Table 23, random under sampling in Table 24, and forced the dataset to be perfectly balanced in Table 25. Table 22 K-Nearest Neighbor Not Balanced Generation 3 precision recall f1-score 46 FALSE 0.38 0.23 0.29 TRU E 0.8 7 0.93 0.9 accuracy 0.83 Table 23 K-Nearest Neighbor SMOTE Generation 3 precision recall f1-score FALSE 0.2 0.67 0.3 TRUE 0.89 0.51 0.65 accuracy 0.53 Table 24 K-Nearest Neighbor Random Under Sampling Generation 3 precision recall f1-score FALSE 0.2 0.69 0.3 TRUE 0.9 0.49 0.63 accuracy 0.52 Table 25 K-Nearest Neighbor Perfectly Balanced Generation 3 precision recall f1-score FALSE 0.2 0.7 0.31 TRUE 0.9 0.5 0.64 accuracy 0.53 While these numbers are much closer than that of the decision trees, they still are not enough to beat the majority class. A summary of the results will be in Table 26. Table 26 K-Nearest Neighbor Summary Results Gen 1 Gen 2 Gen 3 K-Nearest Neighbor No Balancing 0.91 0.85 0.83 K-Nearest Neighbor SMOTE 0.75 0.56 0.53 K-Nearest Neighbor Under Sampling 0.55 0.53 0.52 K-Nearest Neighbor 50/50 0.65 0.55 0.53 47 Majority Class 0.92069 0.86756 0.84771 Records used 561175 264,964 113,629 Records Total 2,809,463 1,657,634 142,036 Record Percentage Used 19.974% 15.984% 80% 48 5.3 Child Mortality Naïve Bayes For Naïve Bayes, I used the same fields and same number of records being used as I did with decision trees. For generation 1 without any balancing techniques is shown in Table 27, the results that I had was an accuracy of 84% while the majority class is ~92.069%. I applied SMOTE balancing in Table 28, random under sampling in Table 29, and forced the dataset to be perfectly balanced in Table 30. 
Table 27 Naïve Bayes Not Balanced Generation 1 precision recall f1-score FALSE 0.14 0.19 0.16 TRU E 0.9 3 0.9 0.91 accuracy 0.84 Table 28 Naïve Bayes SMOTE Generation 1 precision recall f1-score FALSE 0.1 0.77 0.17 TRU E 0.9 5 0.36 0.53 accuracy 0.4 Table 29 Naïve Bayes Random Under Sampling Generation 1 precision recall f1-score FALSE 0.1 0.77 0.18 TRU E 0.9 5 0.39 0.56 accuracy 0.42 Table 30 Naïve Bayes Perfectly Balanced Generation 1 precision recall f1-score FALSE 0.1 0.77 0.18 TRU E 0.9 5 0.4 0.56 accuracy 0.43 For generation 2 without any balancing techniques is shown in Table 31. The results that I had was an accuracy of 57% while the majority class is ~86.756%. I applied 49 smote balancing in Table 32, random Under sampling in Table 33, and forced the dataset to be perfectly balanced in Table 34. Table 31 Naïve Bayes Not Balanced Generation 2 precision recall f1-score FALSE 0.18 0.61 0.27 TRU E 0. 9 0.56 0.69 accuracy 0.57 Table 32 Naïve Bayes SMOTE Generation 2 precision recall f1-score FALSE 0.16 0.62 0.25 TRUE 0.89 0.5 0.64 accuracy 0.51 Table 33 Naïve Bayes Random Under Sampling Generation 2 precision recall f1-score FALSE 0.17 0.71 0.27 TRUE 0.91 0.46 0.61 accuracy 0.5 Table 34 Naïve Bayes Perfectly Balanced Generation 2 precision recall f1-score FALSE 0.17 0.72 0.27 TRUE 0.91 0.45 0.6 accuracy 0.48 For generation 3 without any balancing techniques is shown in Table 35. The results that I had was an accuracy of 45% while the majority class is ~ 84.771%. I applied SMOTE balancing in Table 36, random under sampling in Table 37, and forced the dataset to be perfectly balanced in Table 38. Table 35 Naïve Bayes Generation 3 precision recall f1-score 50 FALSE 0.18 0.75 0.29 TRU E 0. 9 0.4 0.55 accuracy 0.45 Table 36 Naïve Bayes SMOTE Generation 3: precision recall f1-score FALSE 0.18 0.65 0.28 TRUE 0.88 0.45 0.59 accuracy 0.48 Table 37 Naïve Bayes Random Under sampling Generation 3 precision recall f1-score FALSE 0.18 0.79 0.29 TRUE 0.9 0.34 0.49 accuracy 0.41 Table 38 Naïve Bayes Perfectly Balanced Generation 3 precision recall f1-score FALSE 0.18 0.78 0.29 TRUE 0.9 0.36 0.51 accuracy 0.42 These results were currently worse compared to the majority class over decision tree or k-nearest neighbor. A summary of the results will be in Table 39. Table 39 Naïve Bayes Summary Results Gen 1 Gen 2 Gen 3 Naïve Bayes No Balancing 0.84 0.57 0.45 Naïve Bayes SMOTE 0.4 0.51 0.48 Naïve Bayes Under sampling 0.42 0.5 0.41 Naïve Bayes 50/50 0.43 0.48 0.42 Majority Class 0.92069 0.86756 0.84771 Records used 561175 264,964 113,629 51 Records Total 2,809,463 1,657,634 142,036 Record Percentage Used 19.974% 15.984% 80% 52 5.4 Child Mortality Neural Network For neural networks, I used the same fields and same number of records being used as I did with decision trees. For generation 1 without any balancing techniques is shown in Table 40. The results that I had was an accuracy of 92% while the majority class is ~92.069%. I applied SMOTE balancing in Table 41, random under sampling in Table 42, and forced the dataset to be perfectly balanced in Table 43. 
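The per-class precision, recall, and f1 values reported in Tables 40 through 43 follow the layout that scikit-learn's classification_report prints. As an illustration of that training-and-evaluation step for a neural network classifier, the sketch below uses placeholder data and an assumed network size; it is not the configuration actually used for the FamiLinx tables.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder stand-in for an imbalanced generation-1 table (not the real fields).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = (rng.random(2000) < 0.92).astype(int)   # roughly 92% majority class, like SurvivedToEighteen

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0)
clf.fit(X_train, y_train)

# Per-class precision / recall / f1 plus overall accuracy, in the style of Tables 40 through 43.
# With uninformative placeholder features the network typically just reproduces the majority class.
print(classification_report(y_test, clf.predict(X_test), target_names=["FALSE", "TRUE"]))
```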
Table 40 Neural Network Not Balanced Generation 1 precision recall f1-score FALSE 0.48 0.02 0.03 TRU E 0.9 2 1 0.96 accuracy 0.92 Table 41 Neural Network SMOTE Generation 1 precision recall f1-score FALSE 0.15 0.57 0.24 TRU E 0.9 5 0.73 0.83 accuracy 0.72 Table 42 Neural Network Random Under Sampling Generation 1 precision recall f1-score FALSE 0.15 0.75 0.25 TRU E 0.9 7 0.62 0.76 accuracy 0.63 Table 43 Neural Network Perfectly Balanced Generation 1 precision recall f1-score FALSE 0.16 0.77 0.26 TRU E 0.9 7 0.64 0.77 accuracy 0.65 For generation 2 without any balancing techniques is shown in Table 44. The results that I had was an accuracy of 86% while the majority class is ~86.756%. I applied SMOTE 53 balancing in Table 45, random under sampling in Table 46, and forced the dataset to be perfectly balanced in Table 47. Table 44 Neural Network Not Balanced Generation 2 precision recall f1-score FALSE 0.4 0.07 0.13 TRU E 0.8 7 0.98 0.93 accuracy 0.86 Table 45 Neural Network SMOTE Generation 2 precision recall f1-score FALSE 0.24 0.35 0.28 TRUE 0.89 0.83 0.86 accuracy 0.77 Table 46 Neural Network Random Under Sampling Generation 2 precision recall f1-score FALSE 0.21 0.65 0.32 TRUE 0.92 0.63 0.75 accuracy 0.63 Table 47 Neural Network Perfectly Balanced Generation 2 precision recall f1-score FALSE 0.24 0.74 0.36 TRUE 0.94 0.63 0.76 accuracy 0.65 For generation 3 without any balancing techniques is shown in Table 48. The results that I had was an accuracy of 82% while the majority class is ~ 84.771%. I applied SMOTE balancing in Table 49, random Under sampling in Table 50, and forced the dataset to be perfectly balanced in Table 51. Table 48 Neural Network Not Balanced Generation 3 precision recall f1-score FALSE 0.38 0.26 0.31 TRUE 0.87 0.92 0.9 54 accuracy 0.82 Table 49 Neural Network SMOTE Generation 3 precision recall f1-score FALSE 0.3 0.39 0.34 TRUE 0.88 0.83 0.86 accuracy 0.77 Table 50 Neural Network Random Under Sampling Generation 3 precision recall f1-score FALSE 0.23 0.67 0.34 TRUE 0.91 0.6 0.72 accuracy 0.61 Table 51 Neural Network Perfectly Balanced Generation 3 precision recall f1-score FALSE 0.24 0.63 0.35 TRUE 0.91 0.65 0.75 accuracy 0.64 The results of the neural network are interesting in that it matches the majority class. Generation 1’s model was off by 0.069% and generation 2 was off by ~0.75%. these models are extremely close to the majority class. A summary of all results for child mortality will be in Table 52. Table 52 Child Mortality Summary Results Gen 1 Gen 2 Gen 3 Decision Tree No Balancing 0.89 0.79 0.8 Decision Tree SMOTE 0.76 0.77 0.76 Decision Tree Under sampling 0.6 0.58 0.61 Decision Tree 50/50 0.63 0.61 0.6 55 K-Nearest Neighbor No Balancing 0.91 0.85 0.83 K-Nearest Neighbor SMOTE 0.75 0.56 0.53 K-Nearest Neighbor Under sampling 0.55 0.53 0.52 K-Nearest Neighbor 50/50 0.65 0.55 0.53 Naïve Bayes No Balancing 0.84 0.57 0.45 Naïve Bayes SMOTE 0.4 0.51 0.48 Naïve Bayes Under sampling 0.42 0.5 0.41 Naïve Bayes 50/50 0.43 0.48 0.42 Neural Net No Balancing 0.92 0.86 0.82 Neural Net SMOTE 0.72 0.77 0.77 Neural Net Under sampling 0.63 0.63 0.61 Neural Net 50/50 0.65 0.65 0.64 Majority Class 0.92069 0.86756 0.84771 Records used 561175 264,964 113,629 Records Total 2,809,463 1,657,634 142,036 Record Percentage Used 19.974% 15.984% 80% 56 6.0 Age at Time of Death With classification algorithms created, I moved to the regression algorithms. 
Specifically, predicting what the age of the person will be when they die using three different machine learning algorithms: linear regression, regression tree, and neural network. Age of time of death is a field of how many days the person was alive, so ~365 days would be about a year. With the change from classification to regression, scoring for recursive field elimination changes as well. Unlike f1 score, I was able to use a scoring for root mean square error which the overall result is being compared against. Root means square error is a risk metric corresponding to the expected value of the error. To calculate root means square error, is to take the difference in the expected outcome and the actual outcome (i.e. residuals) for each record being tested and square the result. Take the mean of all the errors and then square root it. This scoring is useful in emphasizing larger errors in the algorithm. The goal for these algorithms is to have a sensible root mean squared error. By sensible, I want to have less than around 10% to 20% of the average age at the time of death to be the error. The average age for generation 1 is 61.356 years with the target root mean square error should be 12.271 or less. The average distance traveled for generation 2 is 57.263 years with the target being 11.453 or less. Generation 3’s average distance is 56.71 years with a target root mean square error should be 11.342 or less. I was able to apply the recursive feature elimination on the three generations in the same way as the child mortality to end up with these fields: Generation 1: 57 Birth_year Birth_month Birth_day GendersId Baptism_location_citiesId Baptism_location_statesId Death_location_country_codesId Death_location_countriesId Death_location_statesId Death_location_citiesId Death_location_resolved_extern_typesId Death_location_place_namesId Cause_of_deathsId Baptism_location_resolved_extern_typesId Baptism_location_place_namesId Baptism_location_countrysId Baptism_location_country_codesId Birth_location_citysId Birth_location_statesId Birth_location_countrysId Birth_location_country_codesId Birth_location_place_namesId Birth_location_resolved_extern_typesId Burial_location_place_namesId Burial_location_resolved_extern_typesId Burial_location_citysId Burial_location_statesId Burial_location_countrysId Burial_location_country_codesId AgeAtTimeOfDeath Generation 2: Generation_Two_First_Parent_Birth_year Generation_Two_First_Parent_Birth_month Generation_Two_First_Parent_Birth_day Generation_Two_First_Parent_Death_year Generation_Two_First_Parent_Death_day Generation_Two_First_Parent_GendersId Generation_Two_First_Parent_Death_location_country_codesId Generation_Two_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Death_location_statesId Generation_Two_First_Parent_Death_location_citiesId Generation_Two_First_Parent_Death_location_resolved_extern_typesId Generation_Two_First_Parent_Cause_of_deathsId 58 Generation_Two_First_Parent_Baptism_location_resolved_extern_typesId Generation_Two_First_Parent_Baptism_location_place_namesId Generation_Two_First_Parent_Baptism_location_country_codesId Generation_Two_First_Parent_Birth_location_citysId Generation_Two_First_Parent_Birth_location_countrysId Generation_Two_First_Parent_Birth_location_country_codesId Generation_Two_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_place_namesId Generation_Two_First_Parent_Burial_location_resolved_extern_typesId 
Generation_Two_First_Parent_Burial_location_statesId Generation_Two_First_Parent_AgeAtTimeOfDeath Generation 3: Generation_Two_First_Parent_Birth_year Generation_Two_First_Parent_Birth_month Generation_Two_First_Parent_Birth_day Generation_Two_First_Parent_Death_year Generation_Two_First_Parent_Death_month Generation_Two_First_Parent_Death_day Generation_Two_First_Parent_GendersId Generation_Two_First_Parent_Death_location_country_codesId Generation_Two_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Cause_of_deathsId Generation_Two_First_Parent_Baptism_location_countrysId Generation_Two_First_Parent_Birth_location_countrysId Generation_Two_First_Parent_Birth_location_country_codesId Generation_Two_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_citysId Generation_Two_First_Parent_Burial_location_statesId Generation_Two_First_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_year Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_month Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_day Generation_Two_First_Parent_Generation_Three_First_Parent_Death_year Generation_Two_First_Parent_Generation_Three_First_Parent_Death_month Generation_Two_First_Parent_Generation_Three_First_Parent_Death_day Generation_Two_First_Parent_Generation_Three_First_Parent_GendersId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_citiesId Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_Burial_location_resolved_extern_typesId Generation_Two_First_Parent_Generation_Three_First_Parent_AgeAtTimeOfDeath 59 6.1 Age at Time Linear Regression As was the same with child mortality, I could not hold the entirety of the database in RAM and had to recursively reduce the amount I used to train with. For generation one, I was able to hold 500,000 records. This is ~ 22.761% of the total 2,196,711 records. For generation two, I was able to hold 225000 records. This is ~ 13.736% of the total 1638023 records. For generation three, I was able to hold 112520 records. This is ~80% of the total 140711 records. Unlike classification that is heavily impacted by the data being balanced, regression does not have these issues. Because of this, I did not have to focus on testing with the regression’s equivalent tools to SMOTE and random under sampling. The results that I got was for generation 1 was a root means square error of ~23.996 in years. Generation 2 and generation 3 data did no better. Generation 2 had a root means square error of 26.706 years, and generation 3 had an error of 27.886 years. These values are at best twice the targeted value error of ~11-12(depending on the generation) years and so not useful enough to be confident on its usability. I created a bar graph showing how the linear regression algorithm for generation 1 is predicting in Figure 1 vs how the actual data spread looks in Figure 2. The findings are interesting in that each of the predictions so heavily lean to the ages of 55-70. Instead of such a massive spike in the prediction tables, the actual results are instead a gradual increase till the peak of the 75 to 80. 
One oddity in the actual results was the number of records between 0 and 5.

Figure 1 Age at Time of Death Linear Regression Prediction
Figure 2 Age at Time of Death Linear Regression Actual

Seeing the graphs of what the algorithm predicted versus the actual results, I created graphs to show the differences. Figure 3 shows the predicted result minus the actual result for each record, grouped by frequency. If the predicted result minus the actual is positive, then the algorithm predicted the person would live longer than they did; if the number is negative, then the algorithm undershot how long the person would live. Figure 4 is an accompanying graph of just the absolute difference, showing how far off the algorithm is regardless of over- or undershooting. There are many records that were overshot by 60 to 70 years, but with the large number of actual records at 0 to 5 years of age and predicted ages almost exclusively at 60 to 65 years of age, this spike of overshooting is understandable.

Figure 3 Age at Time of Death Linear Regression Prediction - Actual
Figure 4 Age at Time of Death Linear Regression Difference

6.2 Age at Time of Death Regression Tree

With the regression tree, the only change was which algorithm was being used; the same records with the same fields were used. The root mean square errors using this algorithm were: Generation 1 was 26.683 years, Generation 2 was 32.094 years, and Generation 3 was 31.273 years. Compared to linear regression, this algorithm had a worse root mean square error for all three generations of data. The goal was a root mean square error of ~11-12 years (depending on the generation), and even the best generation was twice that target, so the model is not accurate enough to be useful. I created bar graphs showing how the regression tree algorithm predicted for generation 1 in Figure 5 against the actual results in Figure 6. The prediction graph is different in that it has far more variance than guessing strictly between the ages of 55 and 70. The actual results graph is identical to the one for linear regression because the same data is being compared against.

Figure 5 Age at Time of Death Regression Tree Prediction
Figure 6 Age at Time of Death Regression Tree Actual

As with linear regression, I created Figure 7, which shows the predicted result minus the actual result, grouped by frequency. If the predicted result minus the actual is positive, then the algorithm predicted the person would live longer than they did; if the number is negative, then the algorithm undershot how long the person would live. Figure 8 is an accompanying graph of just the absolute difference, showing how far off the algorithm is regardless of over- or undershooting. Unlike linear regression, this algorithm still did not predict as many records between 0 and 5 as the actual results contain, but it did not produce another spike of differences at 60 to 70 years; instead it ended up with a graph looking much closer to a normal distribution.

Figure 7 Age at Time of Death Regression Tree Prediction - Actual
Figure 8 Age at Time of Death Regression Tree Difference

6.3 Age at Time of Death Neural Network

With the neural network algorithm, the only change was which algorithm was being used; the same records with the same fields were used.
The results using the different algorithm with root means square error were: Generation 1 was 23.157 years, Generation 2 was 25.73 years, and Generation 3 was 27.382 years. The results were only better than the linear regression by reducing the root mean square error by half to a full year. The target goal is to have a root mean square error of 11-12 years (depending on the generation) and as the same for linear regression and regression tree with being ~twice the error than the target goal. Similar to linear regression and regression tree, the results of this model are insufficient. This model is the best of the three with the lowest errors, but only lowered them by one to four years depending on generation and the other model being compared to. A summary of the results will be shown in Table 53. Table 53 Age at time Of Death Summary Results As with linear regression and regression tree, I created the bar graphs showing how the neural network algorithm is predicting in Figure 9 Age at Time of Death Neural Gen 1 Gen 2 Gen 3 Linear Regression 23.99609255 26.70617789 27.88610487 Regression Tree 26.68342486 32.0940887 31.2732002 Neural Net 23.15698592 25.7296628 27.38210708 Overall Average 61.35624405 57.26308988 56.71003494 Target Value 12.27124881 11.45261798 11.34200699 Records used 500000 225000 112520 Records Total 2,196,711 1638023 140711 Record Percentage Used 22.76130087 13.73607086 79.96531899 68 Network Prediction for generation 1 vs how the actual data spread looks in Figure 10 Age at Time of Death Neural Network Actual. The findings are similar to linear regression and regression tree in that the predictions so heavily lean to the ages of 55-70. The actual result is identical to the prior algorithm as they use the same data to compare against. Figure 9 Age at Time of Death Neural Network Prediction Figure 10 Age at Time of Death Neural Network Actual 69 As with linear regression and regression tree, I created Figure 11, which shows the predicted results subtracted by the actual results and then grouped by frequency. If predicted results subtracted by actual is positive, then the algorithm predicted the person would live longer than they did. If the number is negative, then the algorithm undershot how long the person would live. Figure 12 is to show the absolute difference to just show the difference of the prediction vs the actual results regardless of over/under shooting. Because neural networks predictions are nearly identical to linear regression, Figure 11 looks similar to Figure 3 and Figure 12 look similar to Figure 4 as well. Figure 11 Age at Time of Death Neural Network Prediction - Actual 70 Figure 12 Age at Time of Death Neural Network Difference 71 6.4 Age at Time of Death by Year Another way to visualize the age at time of death is the average age of death over the years. I used the neural network algorithm as it had the lowest root mean square error, put the average prediction, average actual and average absolute difference in a line graph over the 1,010 years. I split the table into before 1900 in Figure 13 and after 1900 in Figure 14. I chose to split the tables at the year 1900 because from years 1000 to 1900, the lines are relatively static but after 1900 is when the graph changes. Effectively after 1900 the average person is no longer capable of living to old age and so the average prediction and actual is capable of following the trend till about 1980. 
At 1980, the algorithm no longer understood the data and the absolute difference spiked, as the average prediction became a negative number even though no actual record was negative (no person dies before they are born).

Figure 13 Age at Time of Death Neural Network Over Years Before 1900
Figure 14 Age at Time of Death Neural Network Over Years After 1900

6.5 Age at Time of Death Subset

Looking at the linear regression and neural network graphs, there was a large number of records with an age at time of death between 0 and 5 years. I wanted to know if the algorithms would be significantly improved if they did not have to deal with that large number of outliers. I split the data and created two neural network models, one for under 5 and the other for over 5. The results were disappointing, with the under-5 model having a root mean squared error of 478.092 days (1.31 years). The prediction bar graph is in Figure 15 and the actual results are in Figure 16. The reason the result was disappointing is that the predictions were predominantly from years 1 to 2, with only a small number of predictions from 0 to 1. In contrast, the actual results were mostly from 0 to 1, with a small number of records in the other years.

Figure 15 Age at Time of Death Neural Network Age Under 5 Prediction
Figure 16 Age at Time of Death Neural Network Age Under 5 Actual

The over-5 model reduced its root mean squared error to 18.858 years from 23.157, but the graph in Figure 17 still shows the algorithm making the same type of predictions it made without splitting the database. The actual results, shown in Figure 18, are understandably identical to the various previous graphs minus the ages of 0 to 5. Overall, splitting the data to try to fix the oddity of the 0-to-5 records did not change how the algorithm makes predictions; it improved the root mean square error only because the algorithm no longer had to deal with the records from 0 to 5 years of age.

Figure 17 Age at Time of Death Neural Network Age Over 5 Prediction
Figure 18 Age at Time of Death Neural Network Age Over 5 Actual

Another subset I was interested in was how large an error was created after the year 1900. After 1980, the algorithm was completely off on the predictions versus the actual results, but the graph began to change after 1900 compared to its static lines before 1900. I reran the neural network with all training and test data restricted to before 1900, produced the root mean squared error, and recreated the year graph to see if there were any noticeable changes. The results were lackluster. The root mean square error went to 23.059 from 23.586, a difference of about half a year. The graph, shown in Figure 19, is fairly unchanged.

Figure 19 Age at Time of Death Neural Network Over Years Limited to Before 1900

7.0 Distance Traveled

With the Age at Time of Death regression finished, I moved to Distance Traveled. Specifically, am I capable of predicting the distance between where a person was born and where that person died? I used three different machine learning algorithms: linear regression, regression tree, and neural network. The field I tested against was DistanceTraveledAtDeath, which is computed by applying the Pythagorean theorem to the difference in latitude and longitude between the two locations and then multiplying by the rough average of 69 miles per degree.
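As a concrete sketch of that calculation (the function name is hypothetical, not taken from the data pipeline), the field can be reproduced as follows.

```python
import math

MILES_PER_DEGREE = 69.0  # rough average used for both latitude and longitude

def distance_traveled_at_death(birth_lat, birth_lon, death_lat, death_lon):
    # Pythagorean (straight-line) distance in degrees, scaled to approximate miles.
    return math.hypot(death_lat - birth_lat, death_lon - birth_lon) * MILES_PER_DEGREE

# About 37.8 miles for points 0.09 degrees of latitude and 0.54 degrees of longitude apart.
print(distance_traveled_at_death(41.22, -111.97, 41.31, -111.43))
```

This flat-grid approximation ignores the fact that a degree of longitude shrinks away from the equator, which is consistent with the rough 69-miles-per-degree average the text describes.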
The goal for these algorithms is to have a sensible root mean squared error. By sensible, I want to have less than around 10 to 20% of the average distance traveled be the error. The average distance traveled for generation 1 is 14.262 with the target root mean square error should be 2.852 or less. The average distance traveled for generation 2 is 10.811 with the target being 2.162 or less. Generation 3’s average distance is 6.352 with a target root mean square error should be 1.270467874 or less. The scoring used with the Age at Time of Death algorithm models, root mean square error, is going to be used with distance traveled. With this scoring being used, I was able to apply the Recursive Field Elimination on the three generations in the same way as the Age at Time of Death to end up with these fields: Generation 1: Birth_year Birth_day Birth_location_latitude Birth_location_longitude Death_year GendersId Baptism_location_statesId Death_location_country_codesId Death_location_countriesId 78 Death_location_statesId Death_location_resolved_extern_typesId Death_location_place_namesId Cause_of_deathsId Baptism_location_resolved_extern_typesId Birth_location_citysId Birth_location_statesId Birth_location_countrysId Birth_location_country_codesId Birth_location_place_namesId Burial_location_resolved_extern_typesId Burial_location_statesId Burial_location_countrysId Burial_location_country_codesId AgeAtTimeOfDeath DistanceTraveledAtDeath Generation 2: Generation_Two_First_Parent_Birth_year Generation_Two_First_Parent_Birth_month Generation_Two_First_Parent_Death_year Generation_Two_First_Parent_Death_day Generation_Two_First_Parent_Death_location_country_codesId Generation_Two_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Death_location_statesId Generation_Two_First_Parent_Death_location_citiesId Generation_Two_First_Parent_Cause_of_deathsId Generation_Two_First_Parent_Baptism_location_resolved_extern_typesId Generation_Two_First_Parent_Baptism_location_place_namesId Generation_Two_First_Parent_Birth_location_statesId Generation_Two_First_Parent_Birth_location_countrysId Generation_Two_First_Parent_Birth_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_resolved_extern_typesId Generation_Two_First_Parent_Burial_location_statesId Generation_Two_First_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_Birth_location_latitude Generation_Two_First_Parent_Birth_location_longitude Generation_Two_First_Parent_Death_location_latitude Generation_Two_First_Parent_Death_location_longitude Generation_Two_First_Parent_DistanceTraveledAtDeath Generation_Two_Second_Parent_Birth_year Generation_Two_Second_Parent_Birth_month 79 Generation_Two_Second_Parent_Death_year Generation_Two_Second_Parent_Death_month Generation_Two_Second_Parent_Death_day Generation_Two_Second_Parent_Death_location_country_codesId Generation_Two_Second_Parent_Death_location_countriesId Generation_Two_Second_Parent_Death_location_statesId Generation_Two_Second_Parent_Death_location_resolved_extern_typesId Generation_Two_Second_Parent_Death_location_place_namesId Generation_Two_Second_Parent_Baptism_location_country_codesId Generation_Two_Second_Parent_Birth_location_countrysId Generation_Two_Second_Parent_Burial_location_countrysId Generation_Two_Second_Parent_Burial_location_country_codesId Generation_Two_Second_Parent_Birth_location_latitude Generation_Two_Second_Parent_Birth_location_longitude Generation_Two_Second_Parent_Death_location_latitude 
Generation_Two_Second_Parent_Death_location_longitude Generation_Two_Second_Parent_DistanceTraveledAtDeath Generation 3: Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_year Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_month Generation_Two_First_Parent_Generation_Three_First_Parent_Birth_day Generation_Two_First_Parent_Generation_Three_First_Parent_Death_year Generation_Two_First_Parent_Generation_Three_First_Parent_Death_month Generation_Two_First_Parent_Generation_Three_First_Parent_Death_day Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_countriesId Generation_Two_First_Parent_Generation_Three_First_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_Generation_Three_First_Parent_Death_location_longitude Generation_Two_First_Parent_Generation_Three_Second_Parent_Birth_year Generation_Two_First_Parent_Generation_Three_Second_Parent_Death_year Generation_Two_First_Parent_Generation_Three_Second_Parent_AgeAtTimeOfDeath Generation_Two_First_Parent_Generation_Three_Second_Parent_Death_location_longitude Generation_Two_Second_Parent_Generation_Three_First_Parent_Birth_year Generation_Two_Second_Parent_Generation_Three_First_Parent_Death_year Generation_Two_Second_Parent_Generation_Three_First_Parent_AgeAtTimeOfDeath Generation_Two_Second_Parent_Generation_Three_First_Parent_Birth_location_latitude Generation_Two_Second_Parent_Generation_Three_First_Parent_Birth_location_longitude Generation_Two_Second_Parent_Generation_Three_First_Parent_Death_location_longitude Generation_Two_Second_Parent_Generation_Three_First_Parent_DistanceTraveledAtDeath Generation_Two_Second_Parent_Generation_Three_Second_Parent_Birth_year Generation_Two_Second_Parent_Generation_Three_Second_Parent_Birth_month Generation_Two_Second_Parent_Generation_Three_Second_Parent_Death_year Generation_Two_Second_Parent_Generation_Three_Second_Parent_Death_month 80 Generation_Two_Second_Parent_Generation_Three_Second_Parent_AgeAtTimeOfDeath Generation_Two_Second_Parent_Generation_Three_Second_Parent_Death_location_latitude 81 7.1 Distance Traveled Linear Regression As was the same with all previous algorithms, I could not hold the entirety of the database in RAM and had to recursively reduce the amount I used to train with. For generation one, I was able to hold 500,000 records. This is ~ 72.696% of the total 687,796 records. For generation two, I was able to hold 200000 records. This is ~ 56.357% of the total 354,881 records. For generation three, I was able to hold 13618 records. This is ~ 79.791% of the total 17,067 records. As similar to age at time of death, distanced from birth location to death location being a regression algorithm does not have as significant problems with imbalanced data compared to classification algorithms. Because of this, I did not focus on testing with the regression’s equivalent tools to SMOTE and random under sampling. The results of using linear tree for distance from location at birth to death is the error in miles. For generation 1, I got was a root means square error of 33.148. Generation 2 had a root means square error of 25.538 miles, and generation 3 had an error of 20.637. While the error seems low, it’s still twice or more than the average result which is far too high of an error to be usable. Similar to age at time of death, I created bar graphs to visualize the algorithms predicted in Figure 20, and what the actual results were in Figure 21. 
The algorithm predicted primarily with a distance of 10 to 20 miles, but with a decent amount of 0 to 10 miles and 20 to 30 miles. The algorithm did not seem to understand the data it was working with very well, with it predicting people moving a negative number of miles. In reality, the actual result was almost exclusively between 0 and 10 but a small amount of records even up to distance of 340 miles. 82 Figure 20 Distance Traveled Linear Regression Prediction Figure 21 Distance Traveled Linear Regression Actual I also created two bar graphs showing the differences between the algorithm’s prediction and actual results. Figure 22 is predicted results subtracted by the actual results. This shows if the algorithm is overshooting (the number is positive) or undershooting (the 83 number is negative) and by how much and how often it is doing so. Figure 23 is absolute difference to show how far the algorithm is regardless of over/under shooting. The graphs show that it was predominantly undershooting and not was far off in those cases, but in no case did it ever overshoot by more than 40 miles but had undershot by up to 380 miles. Figure 22 Distance Traveled Linear Regression Prediction – Actual Figure 23 Distance Traveled Linear Regression Difference 84 7.2 Distance Traveled Regression Tree With regression tree, the only change was in which algorithm was being used, the same records with the same fields were used. Regression tree’s results were a dramatic increase over linear regression, but still do not get below 20% of the average distance. The regression tree’s root mean square error for Generation 1 was 11.9 miles, generation 2 was 16.39, and generation 3 was 17.158. Compared to the average distance, this is 83.4% for generation 1, 151.6% for generation 2, and 270.1%. Because of this, while generation 1 does it best, all of them are still short for getting to a relatively useful result. I created the bar graphs to visualize the results of the regression tree algorithm. Figure 24 displays the algorithm’s predictions vs the actual results being in Figure 25. These tables are cleaner versions of the linear regression tables. For predictions, the algorithm didn’t predict negative miles due to its impossibility, it also predicted far less 10 to 30 miles which matches closer to the actual results. Unlike linear regression, regression tree algorithm also predicts some of the records to have a large distance traveled. Regression tree predicts some records at up to 310 miles while linear regression’s highest was 60 miles. 85 Figure 24 Distance Traveled Regression Tree Prediction Figure 25 Distance Traveled Regression Tree Actual The two difference bar graphs are much cleaner as well. Figure 26 shows the predicted results subtracted by the actual results, and Figure 27 is the absolute value of the differences. These tables match the prediction vs actual table with them almost exclusively 86 being off by 0 to 10 miles. This algorithm does overshoot in a few instances by up to 310 miles and undershoot by 360 miles. What this means is that for some of the records with a high distance, the algorithm will predict a low number, which is not much of a surprise as linear regression did the same. But the change is that for some records with a low distance the algorithm will predict a high distance. That last part is a change from linear tree as it means that the algorithm has attempted try and understand what causes a person to have a large distance. 
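The signed and absolute difference graphs used throughout Sections 6 and 7 can be reproduced from any model's test-set output. The sketch below uses placeholder arrays and matplotlib purely for illustration; the text does not say how its figures were actually drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder arrays; in practice these would be the model's test-set predictions
# and the corresponding actual DistanceTraveledAtDeath values.
predicted = np.array([12.0, 8.5, 22.0, 3.0, 15.0])
actual = np.array([4.0, 9.0, 140.0, 2.5, 11.0])

signed_diff = predicted - actual        # positive = overshoot, negative = undershoot
absolute_diff = np.abs(signed_diff)     # how far off, regardless of direction

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(signed_diff, bins=10)          # in the style of the Prediction - Actual figures
ax2.hist(absolute_diff, bins=10)        # in the style of the Difference figures
plt.show()
```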
Figure 26 Distance Traveled Regression Tree Prediction – Actual 87 Figure 27 Distance Traveled Regression Tree Difference 88 7.3 Distance Traveled Neural Network With the neural network algorithm, the only change was in which algorithm was being used, the same records with the same fields were used. The results using the neural network as the algorithm had root means square errors of: Generation 1 was 18.526, Generation 2 was 16.737, and Generation 3 was 17.861. Similar to linear regression, and regression tree, the errors of this algorithm model are too far above the average to be reliable, to the point of almost guessing. An interesting thing that arose was that regression tree generation 1 was the algorithm with the least error. All three algorithms were inconsistent on which generation had the least errors as seen in Table 54. Linear regression had the least errors amongst the three generations with generation 3 at 20.637 miles. Neural networks had the least errors with generation 2 at 16.737 miles, and regression tree at generation 1 was 11.9. Table 54 Distance Traveled Distance Traveled Summary Results As with the other algorithms, I created bar graphs for this algorithm to help visualize the results. Figure 28 shows how the algorithm predicted compared to Figure 29, which is the actual results. These graphs show that the algorithm is predicting similarly to Gen 1 Gen 2 Gen 3 Linear Regression 33.14769279 25.53787699 20.63700616 Regression Tree 11.90007388 16.38979802 17.15797492 Neural Net 18.52598332 16.73708869 17.8613377 Overall Average 14.26166139 10.8110739 6.352339372 Target Value 2.852332277 2.162214781 1.270467874 Records used 500000 200000 13618 Records Total 687796 354881 17067 Record Percentage Used 72.69597381 56.35691964 79.79141032 89 the linear regression algorithm with some small interesting differences. The algorithm predicted more often from 0 to 10 miles distance, which would explain its smaller root mean squared error. The algorithm also made a similar issue to linear regression that regression tree was able to recognize, which is that no actual record goes below 0, which I would expect to be one reason that regression tree got a lower root mean squared error. A key difference between the neural network algorithm’s predictions and the linear regression algorithm’s predictions is that the linear regression algorithm only predicted up to 60 miles, while neural network predicted records to have a distance as high as 300 miles. Figure 28 Distance Traveled Neural Network Prediction 90 Figure 29 Distance Traveled Neural Network Actual I then created the two bar graphs that show the differences between the algorithm’s prediction and actual results. Figure 30 is the predicted results subtracted by the actual results, and Figure 31 is the absolute value of the difference. These two graphs are to be expected with the similarities to the linear regression graphs. Both graphs show that the root mean square error of neural network is lower than linear regression but the prediction and actual results are similar. 91 Figure 30 Distance Traveled Neural Network Prediction – Actual Figure 31 Distance Traveled Neural Network Difference 92 7.4 Distance Traveled by Year As with age at time of death, I created a line graph to visual the average distance traveled over the years. I used regression tree because it had the lowest root means square error. The line graph shows the average prediction, average actual results, and average absolute difference from the year 1000 to 1990. 
I split the graphs at the year 1500, purely because there is too much to display in one graph and 1500 is the middle of the timeframe; the years before 1500 are shown in Figure 32 and the years after 1500 in Figure 33. In the prior graphs, showing age at time of death, the lines were fairly static, while this graph seems to have no consistent pattern based on the year. Interestingly, between 1500 and 1930 the average predictions and average actuals are extremely close, but before 1500 the graph is extremely erratic and also off significantly.

Figure 32 Distance Traveled Over the Years Via Regression Tree Before 1500
Figure 33 Distance Traveled Over the Years Via Regression Tree After 1500

8.0 Result Findings

In this section I condense the results from the prior sections to the most successful for each. For predicting whether an individual is able to reach 18, the results are condensed in Table 55. The scoring for each generation and algorithm is accuracy.

Table 55 Child Mortality Summary
                                    Gen 1      Gen 2      Gen 3
Decision Tree No Sampling           0.89       0.79       0.8
Decision Tree SMOTE                 0.76       0.77       0.76
Decision Tree Under sampling        0.6        0.58       0.61
Decision Tree 50/50                 0.63       0.61       0.6
K-Nearest Neighbor No Sampling      0.91       0.85       0.83
K-Nearest Neighbor SMOTE            0.75       0.56       0.53
K-Nearest Neighbor Under sampling   0.55       0.53       0.52
K-Nearest Neighbor 50/50            0.65       0.55       0.53
Naïve Bayes No Sampling             0.84       0.57       0.45
Naïve Bayes SMOTE                   0.4        0.51       0.48
Naïve Bayes Under sampling          0.42       0.5        0.41
Naïve Bayes 50/50                   0.43       0.48       0.42
Neural Net No Sampling              0.92       0.86       0.82
Neural Net SMOTE                    0.72       0.77       0.77
Neural Net Under sampling           0.63       0.63       0.61
Neural Net 50/50                    0.65       0.65       0.64
Majority Class                      0.92069    0.86756    0.84771
Records used                        561,175    264,964    113,629
Records Total                       2,809,463  1,657,634  142,036
Record Percentage Used              19.974%    15.984%    80%

For predicting the age of an individual at time of death, the results are condensed in Table 56. The scoring is root mean square error in years.

Table 56 Age at Time of Death Summary
                          Gen 1        Gen 2        Gen 3
Linear Regression         23.99609255  26.70617789  27.88610487
Regression Tree           26.68342486  32.0940887   31.2732002
Neural Net                23.15698592  25.7296628   27.38210708
Overall Average           61.35624405  57.26308988  56.71003494
Target Value              12.27124881  11.45261798  11.34200699
Records used              500,000      225,000      112,520
Records Total             2,196,711    1,638,023    140,711
Record Percentage Used    22.76130087  13.73607086  79.96531899

For predicting the distance between the location of the individual's birth and the location of that individual's death, the results are condensed in Table 57. The scoring is root mean square error in miles.

Table 57 Distance Traveled Summary
                          Gen 1        Gen 2        Gen 3
Linear Regression         33.14769279  25.53787699  20.63700616
Regression Tree           11.90007388  16.38979802  17.15797492
Neural Net                18.52598332  16.73708869  17.8613377
Overall Average           14.26166139  10.8110739   6.352339372
Target Value              2.852332277  2.162214781  1.270467874
Records used              500,000      200,000      13,618
Records Total             687,796      354,881      17,067
Record Percentage Used    72.69597381  56.35691964  79.79141032

9.0 Demonstrable Console Application

The last milestone was to see whether the models created from the algorithms can be applied to a GEDCOM file, that is, how well they perform on a different dataset. A GEDCOM file is a data file that stores family history and genealogical event data in the standard GEDCOM genealogy format. The GEDCOM file used contains the British Royal Family history. After creating a program that reads the GEDCOM file, the program then parses it into the separate fields.
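GEDCOM is a plain-text, line-based format in which each line carries a level number, an optional cross-reference ID, a tag, and a value. The sketch below is an illustrative reading of that format, not the console application's actual code; the file name and the subset of tags handled are placeholders.

```python
# Minimal GEDCOM reader: collects level-0 INDI/FAM records and a few level-1 tags.
# Dates actually appear on level-2 DATE lines beneath BIRT/DEAT; skipped here for brevity.
def parse_gedcom(path):
    individuals, families, current = {}, {}, None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            parts = raw.strip().split(" ", 2)
            if not parts or not parts[0].isdigit():
                continue
            level = int(parts[0])
            if level == 0:
                current = None
                if len(parts) == 3 and parts[1].startswith("@"):
                    xref, tag = parts[1], parts[2]
                    current = {"id": xref}
                    if tag == "INDI":
                        individuals[xref] = current
                    elif tag == "FAM":
                        families[xref] = current
                    else:
                        current = None
            elif level == 1 and current is not None and len(parts) >= 2:
                tag = parts[1]
                value = parts[2] if len(parts) > 2 else ""
                if tag in ("NAME", "SEX", "BIRT", "DEAT", "MARR"):
                    current[tag] = value
    return individuals, families

# individuals, families = parse_gedcom("royal.ged")   # hypothetical file name
# print(len(individuals), "individuals,", len(families), "family relationships")
```

Reading the raw records is the straightforward part; the difficulty described next is mapping those values onto the encoded fields the models expect.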
It ended with 3,010 individuals and 1,422 family relationships. The fields in a GEDCOM file are shown in Table 58.

Table 58 GED Fields
Individual:                    Family relationships:
string ID                      bool Divorce
string Name                    Individual[] Children
string Title                   string MarriedDate
char Sex                       string MarriedLocation
string BirthDay                Individual Husband
string BirthLocation           Individual Wife
string DeathDay
string DeathLocation
string BurialDate
string BurialLocation
string BaptismDay
string BaptismLocation

The issue appears when these fields are compared to the fields the algorithms require. Several of the fields translate easily, such as the birth date and death date. Several others would be far more time intensive to translate over, specifically the various location details. The locations are more problematic because the field holds only the name of the location and not the formats the models require, such as the latitude or longitude of the location. This is still manageable by looking up those details and allowing them a margin of error. A field that is not available in the GEDCOM file is cause of death, which every one of the algorithms takes as a parameter. Without cause of death, none of the algorithms can be used. While it is possible to do additional research to find the 3,010 individuals' causes of death, that is outside the scope of this thesis, as the demonstrable console application was intended only to show whether the data could be used if it was available.

10.0 Conclusion

There are three different but related research questions that I had. My hypothesis was that I would not be able to predict whether an individual will reach eighteen years of age better than the majority class, would not be able to predict the age of an individual at the time of death within reasonable bounds, and would not be able to predict the distance between the location an individual was born at and the location that the individual died at. The original hypothesis for child mortality was that I would not be able to predict better than the majority class because the dataset is so imbalanced, at ~92%, that building a model with higher precision from such indirect data would be extremely difficult. When I reduced the data to a purely balanced dataset, I was able to beat the majority class with 67% accuracy (the majority class of a balanced dataset being 50%), as found in Table 4 shown earlier. This shows that I was able to do more than completely guess. Otherwise, the best I was able to do was with the neural network on generation 1, as shown in Table 40 earlier. I lost to the majority class by less than 1%, and I did this by effectively turning the algorithm into the majority class. I would speculate that, at the much larger scale and with the imbalanced data, even though there may have been a small relationship between the data fields I used as inputs and the outcome I was predicting, it was too insignificant to make a noticeable impact over simply recreating the majority class. The generational significance was also skewed by the data. Generation 1 had a majority class of ~92%, generation 2 had a majority class of ~86.756%, and generation 3 was ~84.771%. Because the relationship between the inputs and the outcome is not significant and the models lean toward the majority class, it is difficult to see whether the generations have a positive or negative impact on the prediction. The data size between the generations was not even, either.
The data also seemed to be biased in some way, as it makes no sense for the chance of surviving to adulthood to be lower when data about the parents and grandparents is available than when it is not. The data size is also reduced dramatically from generation to generation, so it is difficult to compare the generations on equal terms.

My hypothesis for predicting the age at time of death and for predicting the distance between the locations of birth and death was that I would be unable to do so within reasonable bounds. My original intuition was that the family history data would not hold a strong enough relationship to create a useful model. For age at time of death, the closest I came was with the neural network on generation 1, with an error of 8,614.657 days, or ~23.157 years. Compared to the average lifespan in the data of 61.356 years, this does not reach the threshold of 20% or less of the mean actual result; it is 37.742% of the mean, or about two times the goal. For distance traveled, the best result was the regression tree on generation 1, with an error of 11.900 miles. Compared to the average distance of 14.262 miles, this does not reach the threshold of 20% or less of the mean actual result; it is 83.438% of the mean, or about four times the goal.

11.0 References

[1] S. Hall and P. Du Gay, Questions of Cultural Identity, SAGE Publications, 1996.
[2] T. Chai and R. R. Draxler, "Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature," Geoscientific Model Development, vol. 7, pp. 1247–1250, 2014.
[3] C. J. Willmott and K. Matsuura, "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance," Climate Research, vol. 30, pp. 79–82, 2005.
[4] G. Brassington, "Mean absolute error and root mean square error: which is the better metric for assessing model performance?," EGUGA, p. 3574, 2017.
[5] W. Wang and Y. Lu, "Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model," in IOP Conference Series: Materials Science and Engineering, 2018.
[6] A. M. Barhoom, A. J. Khalil, B. S. Abu-Nasser, M. M. Musleh and S. S. A. Naser, "Predicting Titanic Survivors using Artificial Neural Network," 2019.
[7] R. Ball and P. Beck, Automatically Recreating Probabilistic History through Genealogy, 2017.
[8] Y. Erlich, FamiLinx, 2020.
[9] J. Kaplanis, A. Gordon, T. Shor, O. Weissbrod, D. Geiger, M. Wahl, M. Gershovits, B. Markus, M. Sheikh, M. Gymrek and others, "Quantitative analysis of population-scale family trees with millions of relatives," Science, vol. 360, pp. 171–175, 2018.
[10] Y. Attiga, S.-Y. Chen, J. LaGue, A. Ovalle, N. Stott, T. Brander, A. Khaled, G. Tyagi and P. Francis-Lyon, "Applying deep learning to public health: Using unbalanced demographic data to predict thyroid disorder," in 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2018.
[11] R. A. Kerber, E. O'Brien, K. R. Smith and R. M. Cawthon, "Familial excess longevity in Utah genealogies," The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, vol. 56, pp. B130–B139, 2001.
[12] M. Gögele, C. Pattaro, C. Fuchsberger, C. Minelli, P. P. Pramstaller and M. Wjst, "Heritability analysis of life span in a semi-isolated population followed across four centuries reveals the presence of pleiotropy between life span and reproduction," Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences, vol. 66, pp. 26–37, 2011.
[13] S. Klüsener and R. D. Scholz, "Regional hot spots of exceptional longevity in Germany," Vienna Yearbook of Population Research, pp. 137–163, 2013.
[14] A. R. Brooks-Wilson, "Genetics of healthy aging and longevity," Human Genetics, vol. 132, pp. 1323–1338, 2013.
[15] P. J. Mayer, "Inheritance of longevity evinces no secular trend among members of six New England families born 1650–1874," American Journal of Human Biology, vol. 3, pp. 49–58, 1991.
[16] P. Sebastiani, N. Solovieff, A. T. DeWan, K. M. Walsh, A. Puca, S. W. Hartley, E. Melista, S. Andersen, D. A. Dworkis, J. B. Wilk and others, "Genetic signatures of exceptional longevity in humans," PLoS ONE, vol. 7, p. e29848, 2012.
[17] P. Sebastiani and T. T. Perls, "The genetics of extreme longevity: lessons from the New England centenarian study," Frontiers in Genetics, vol. 3, p. 277, 2012.
[18] I. Iachine, A. Skytthe, J. W. Vaupel, M. McGue, M. Koskenvuo, J. Kaprio, N. L. Pedersen, K. Christensen and others, "Genetic influence on human lifespan and longevity," Human Genetics, vol. 119, p. 312, 2006.
[19] J. G. Ruby, K. M. Wright, K. A. Rand, A. Kermany, K. Noto, D. Curtis, N. Varner, D. Garrigan, D. Slinkov, I. Dorfman and others, "Estimates of the heritability of human longevity are substantially inflated due to assortative mating," Genetics, vol. 210, pp. 1109–1124, 2018.
[20] R. Chetty, M. Stepner, S. Abraham, S. Lin, B. Scuderi, N. Turner, A. Bergeron and D. Cutler, "The association between income and life expectancy in the United States, 2001–2014," JAMA, vol. 315, pp. 1750–1766, 2016.
[21] J. W. Osborne, Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, Sage, 2013.
[22] B. M. Berry and R. S. Schofield, "Age at baptism in pre-industrial England," Population Studies, vol. 25, pp. 453–463, 1971.
[23] D. Goldenberg, "Why the Oldest Person in the World Keeps Dying," FiveThirtyEight, May 26, 2015.
[24] X.-W. Chen and J. C. Jeong, "Enhanced recursive feature elimination," in Sixth International Conference on Machine Learning and Applications (ICMLA 2007), 2007.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[26] P. A. Flach, "The geometry of ROC space: understanding machine learning metrics through ROC isometrics," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003.
[27] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation," 2011.
[28] D. M. Powers, "Recall & Precision versus The Bookmaker," 2003.
[29] M. Buckland and F. Gey, "The relationship between recall and precision," Journal of the American Society for Information Science, vol. 45, pp. 12–19, 1994.
[30] H.-F. Yu, C.-J. Hsieh, K.-W. Chang and C.-J. Lin, "Large linear classification when data cannot fit in memory," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 5, pp. 1–23, 2012.
[31] P. J. G. Lisboa, A. Vellido and H. Wong, "Bias reduction in skewed binary classification with Bayesian neural networks," Neural Networks, vol. 13, pp. 407–410, 2000.
[32] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker and G. D. Tourassi, "Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance," Neural Networks, vol. 21, pp. 427–436, 2008.
[33] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse and A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 40, pp. 185–197, 2009.
[34] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince and F. Herrera, "A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, pp. 463–484, 2011.
[35] B. Das, N. C. Krishnan and D. J. Cook, "RACOG and wRACOG: Two probabilistic oversampling techniques," IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 222–234, 2014.
[36] K. D. Feuz and D. J. Cook, "Modeling Skewed Class Distributions by Reshaping the Concept Space," in AAAI, 2017.
[37] J. Prusa, T. M. Khoshgoftaar, D. J. Dittman and A. Napolitano, "Using random undersampling to alleviate class imbalance on tweet sentiment data," in 2015 IEEE International Conference on Information Reuse and Integration, 2015.
[38] M. A. Tahir, J. Kittler and F. Yan, "Inverse random under sampling for class imbalance problem and its application to multi-label classification," Pattern Recognition, vol. 45, pp. 3738–3750, 2012.
[39] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[40] H. Han, W.-Y. Wang and B.-H. Mao, "Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning," in International Conference on Intelligent Computing, 2005. |
Format | application/pdf |
ARK | ark:/87278/s6qe37wc |
Setname | wsu_smt |
ID | 96826 |
Reference URL | https://digital.weber.edu/ark:/87278/s6qe37wc |