Title | Wilson, Jacob_MCS_2020 |
Alternative Title | Applying Machine Learning to Improve Curriculum Design Across a Variety of Disciplines |
Creator | Wilson, Jacob |
Collection Name | Master of Computer Science |
Description | A study of the graduation and retention rates among college students using features gathered from academic records and demographic information. Machine learning techniques are employed on university student data to predict graduation and retention rates across several different learning disciplines. Analyses are made from the result sets showing how the features relate to graduation and retention outcomes. |
Subject | Computer science |
Keywords | Machine learning techniques; Graduation and retention rates |
Digital Publisher | Stewart Library, Weber State University |
Date | 2020 |
Language | eng |
Rights | The author has granted Weber State University Archives a limited, non-exclusive, royalty-free license to reproduce their theses, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. The author retains all other rights. |
Source | University Archives Electronic Records; Master of Computer Science. Stewart Library, Weber State University |
OCR Text | Applying Machine Learning to Improve Curriculum Design Across a Variety of Disciplines By Jacob Wilson A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE OF COMPUTER SCIENCE WEBER STATE UNIVERSITY Ogden, Utah November 23, 2020

Signatures (Nov 23, 2020):
Faculty Advisor, Committee Chair: Kyle Feuz (kylefeuz@weber.edu)
Committee Member: Robert Ball (robertball@weber.edu)
Committee Member: Yong Zhang (yongzhang@weber.edu)
Student: Jacob Wilson (jacobwilson@weber.edu)

Abstract

A study of the graduation and retention rates among college students using features gathered from academic records and demographic information. Machine learning techniques are employed on university student data to predict graduation and retention rates across several different learning disciplines. Analyses are made from the result sets showing how the features relate to graduation and retention outcomes.
Table of Contents

Introduction .......... 4
Related Work .......... 6
Data Extraction .......... 9
Determining Critical Courses .......... 10
Data Preprocessing .......... 13
Machine Learning Techniques .......... 17
Machine Learning Algorithm Performance .......... 19
General Analysis .......... 23
Computer Science Analysis .......... 30
Business Administration Analysis .......... 34
Elementary Education Analysis .......... 38
Nursing Analysis .......... 43
Microbiology Analysis .......... 47
Interpretations of Results .......... 51
Conclusion .......... 55
Appendix A Features and Descriptions .......... 58
Appendix B Sample of Random Forest decision tree .......... 60
Appendix C Numeric results from algorithm comparison .......... 61
Appendix D Comparison of results from additional preprocessing .......... 62
References .......... 64

Introduction

Universities around the country are coming under increased scrutiny to improve their graduation rates and the time a student needs to graduate. Objective data is needed in college academics to assist in curriculum design and advising to improve these outcomes. Without objective data, deans and faculty must rely only on personal intuition and anecdotal evidence, which can vary greatly and lead to inconsistency in learning patterns and outcomes. Providing objective data can help these groups reliably design curriculum and learning paths to maximize the learning potential and graduation rates of the students in their respective programs. Much work has already been done on this topic by other institutional researchers and university personnel. This thesis in particular covers a small spectrum of different fields of study and will identify any information that applies to each of them individually, as well as what is generally common among them. Machine learning is one of the most useful techniques for analyzing data to produce accurate predictive outcomes and meaning. [1] In preparation for employing machine learning, data will be collected from several learning disciplines, including Computer Science, Business Administration, Elementary Education, Nursing, and Microbiology.
Using the data collected from each of the different learning disciplines, we can identify and categorize important features and measure how impactful each feature is on the retention and graduation outcome. The results can then help determine what kind of impact or relationship each feature has with the outcome being measured.

The steps of this method include:

1. Extraction of data sets from WSU's Banner database, consisting of individual data sets for students who are declared in each respective learning discipline, and a combined data set that includes all students enrolled in any of the five learning disciplines.
2. Determining critical courses and milestones for the disciplines of Computer Science, Business Administration, Education, Nursing and Microbiology.
3. Processing of the student data through machine learning techniques and software.
4. Analysis of results from machine learning, determining feature importance, comparative algorithm performance, and accuracy of classification.
5. Summary and visualizations made to explain findings and identify the features that were shown to have the highest impact on graduation rates.

Related Work

Particular credit for the base of this study goes to the work of Weber State University's own Computer Science Department faculty. Their publication, "Applying Machine Learning to Improve Curriculum Design," [2] provided the basis for conducting this study, and I have been able to use the same machine learning techniques already prepared by them and build from their work as a starting point. As noted in their abstract and conclusion, the belief was that while they only employed machine learning analysis on computer science students, the same methods should be applicable to any curriculum. This study is an exploration into broadened curriculum application beyond just computer science, as well as a look into what to expect from a general analysis involving several curricula.
Several other studies conducted by other researchers explore similar ideas, using machine learning or data mining techniques to try to discover cause-and-effect relationships between student data and graduation outcomes. The work of Dursun Delen in "A comparative analysis of machine learning techniques for student retention management" [3] shows that employing data mining and machine learning techniques on student data can yield promising results. Analysis conducted on predicting a graduation or retention outcome based on data gathered about a student's university record showed that the outcome can be predicted with a high rate of accuracy. Furthermore, the results of the analysis and the relationships between the student data and the outcome allow for more informed decision making in addressing problems with student attrition. [4]

The work of Dmitri Rogulkin at California State University Fresno [5] shows how a decision tree technique and model applied to student data can reveal the correlation of specific student features with a graduation outcome and how impactful each feature is to that outcome. This can direct particular attention to the data that carry the most influence and weight when making decisions about how to help students or devise the most worthwhile and effective programs and strategies. The methods used in this study are also machine learning techniques that use the decision tree model as a basis for the analysis, from which the impact or weight a particular feature has on the outcome is measured.

There are many other studies conducted by institutional researchers regarding the topic of measuring student data as it relates to the outcome of retention and graduation rates. Many are preliminary studies exploring the usefulness and potential prediction accuracy.
[6-12] The feasibility of machine learning techniques can be demonstrated with prediction models on different student datasets, using accuracy metrics obtained from the output of various machine learning algorithms. Other studies include the use of data mining to determine some of the best predictors to use with a prediction model. [13-17] Especially as new study and teaching paradigms are introduced, there is a drive to determine how these new methods affect student retention. [18] It is of particular interest in the education industry to be able to determine through these models which factors are determinants for a student graduating or dropping out.

Additional research and studies have applied these techniques to try to identify more specific and targeted outcomes or information, such as predicting success in STEM fields by using academic data from a student's first terms. [9, 19-21] Or the focus may be on a particular subset of the student population, such as non-traditional students. [22] Each of these studies uses a combination of data extraction and analysis, and then applies machine learning and artificial intelligence mechanisms to generate a prediction model which can be used with academic or behavioral data to predict graduation and retention outcomes. These help identify students' attributes as high-risk or high-achieving, to assist educators in meeting students' needs, especially in cases where intervention may be needed. This study differs in technique in that the data will be gathered for a full 6-year undergraduate period, and from a subset of student data from a variety of learning disciplines, including some that are non-STEM. The prediction model will also be applied both individually for each discipline and generally on a dataset combining the disciplines to give a comparative analysis.
Data Extraction

Data has been gathered from a subset of students who attended Weber State University (hereafter referred to as WSU) between 2008 and 2014: a total of 20,178 students from the disciplines of Computer Science, Business Administration, Elementary Education, Microbiology, and Nursing. The disciplines were chosen to represent a variety of learning styles, ranging from high to low math requirements, engineering and applied sciences vs theory and lab sciences, and humanities and soft skills vs technical and analytical skills.

A learning discipline denotes the major for which a student is declared at the university. Students must declare a major before they graduate in order to receive their degree, so a declared major is a prerequisite for the graduation outcome. A student is considered to belong to a major if they ever completed a semester at WSU while declared in that particular major, and they are considered retained if they graduated with the degree for the major they first declared. Features gathered are taken from student demographics, transcripts, tests, and financial aid information. A complete list with a description for each of the features gathered is found in Appendix A.

Determining Critical Courses

Determining critical courses involved an SQL query on the student information system database of course information. In some learning disciplines there are courses described as "hump" or "gateway" courses: courses perceived as relatively difficult or involved, that might make or break a student's academic career in that discipline, and that may open up registration access to the upper division set of courses in that discipline. A student's grade in a critical path course may greatly impact the graduation outcome because not passing the course might mean retaking it, and thus another term spent at the university.
Such courses can be identified from the course attribute data in the student information system by the prerequisite relationships they have with other courses. If the courses from a learning discipline were visualized in tree form, with lines indicating prerequisite relationships between courses, we would see that critical courses have a large number of lines going to and from them. In other words, a critical course requires a relatively large number of prerequisite courses, and a large number of courses require it as a prerequisite. These two values were summed to create a value index of how indicative it is that the course is part of the critical path of the curriculum for that learning discipline. A sample from the Computer Science discipline is shown in Figure 1.

Figure 1. Critical Course Identification Method Example. The results of the SQL query for the computer science discipline. The number to the right of the course number is the value index: the total of the number of prerequisites the course has and the number of times it is listed as a prerequisite for another course, as per the student information system's course attribute tables. From tables like this, courses with a higher index value, such as CS 2420, are identified as critical courses. As shown, the values for critical courses tend to stand out by being significantly higher relative to other courses in that discipline.
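The value index computation described above can be sketched in a few lines of Python. The (prerequisite, course) pairs below are illustrative stand-ins for the student information system's course attribute tables, not the actual WSU data.

```python
from collections import Counter

# Illustrative (prerequisite, course) pairs; in the study, the real pairs
# come from the course attribute tables via SQL.
prereq_edges = [
    ("CS 1400", "CS 1410"),
    ("CS 1410", "CS 2420"),
    ("CS 1410", "CS 2550"),
    ("CS 2420", "CS 3100"),
    ("CS 2420", "CS 3230"),
    ("CS 2420", "CS 3550"),
]

# Number of prerequisites each course requires.
prereqs_of = Counter(course for _, course in prereq_edges)
# Number of times each course is listed as a prerequisite for others.
listed_as = Counter(prereq for prereq, _ in prereq_edges)

# Value index: the sum of both counts, as described above.
courses = set(prereqs_of) | set(listed_as)
value_index = {c: prereqs_of[c] + listed_as[c] for c in courses}

# CS 2420 stands out: one prerequisite plus three dependents gives index 4.
print(max(value_index, key=value_index.get))  # → CS 2420
```

In the real study these two counts came directly from the SQL query; only the summation and ranking are shown here.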
It can also be inferred that a course listed as a prerequisite a large number of times can have a higher impact on the graduation outcome, because the course contains foundational concepts or practice that will be used extensively thereafter, so better grades in such a course could correlate to continued success in further coursework.

Data Preprocessing

Once the data was collected from the student information system, five different techniques were used to preprocess the data and prepare it for machine learning analysis.

1. One-hot encoding. Fields that included non-binary data or multiple field values were separated into their own binary fields. One new field was generated per option in the original field, with the possible values of 1 or 0 indicating yes or no. Gender was divided into Male and Female fields; marital status was separated into Married, Divorced, Single and Other fields. Ethnicity was already one-hot encoded in the database tables into the different categories because a student may identify as any number of them.

2. Numeric conversion. Non-numeric fields do not process well with particular machine learning techniques such as linear regression. To account for this, non-numeric fields were converted to numeric equivalents. Some of this was already done by way of one-hot encoding, with the different possible values becoming their own field with a binary value, such as with gender and marital status. Course grades contained letters or withdrawal codes, which were changed to a numeric equivalent based on the impact they have. The values were as follows:

Table 1. Index of possible course grade values.
A 15, A- 14, B+ 13, B 12, B- 11, C+ 10, C 9, C- 8, D+ 7, D 6, D- 5, I 4, E 3, W 2, UW 1, Unknown 0 (7.5)
Possible course grade values are indexed into a numeric value ordered by their relative performance equivalent.

3. Fill-in missing values. Some fields contained blank or unknown values.
Each field was handled a bit differently to provide meaningful substitutes for its missing values. Missing age values were substituted with the average age of the other rows in the dataset. Unknown gender values were substituted with a 0 value in both the Male and Female columns. Missing transfer credits were also given a 0 value, as the absence of transferred credits is treated the same as no credit in the student information system. Course grades had a large number of missing values. This is due to the way a student's discipline is defined: because a student must only spend one term declared in a particular major to be included in that dataset, there is a high likelihood that a student may not have taken any of the courses in that major before leaving it. Values were initially set at 0 as with other fields; however, this was found to bias the results of the machine learning toward the negative. To correct this bias, the middle value of 7.5 was chosen, as seen in the parentheses in Table 1.

4. Normalization. Because the field values do not have equal magnitude and the non-binary fields all have different ranges, normalization was necessary to even out the weights of all the fields. Age, for example, could be any number from 0 to over 100, while high school GPA has values between 0.0 and 4.0, and ACT test scores have a range of 0 to 36. Normalization removes the range discrepancy so that larger numerical values do not represent larger magnitudes of impact, which would skew the results.

5. Outlier removal. The dataset contained some values that were nonsensical or beyond the range a value should be, such as age or GPA being a negative number. These were due to clerical errors or faults with the data in the source database. These rows were simply removed from the dataset before processing to prevent them from interfering with the machine learning processes.
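A condensed sketch of the encoding, fill-in, and normalization steps on a single hypothetical record. The field names, ranges, and helper functions here are illustrative; the actual pipeline ran over the full extracted datasets.

```python
# Table 1's grade index; missing or unknown grades fall back to the middle
# value 7.5 to avoid biasing predictions toward the negative.
GRADE_INDEX = {
    "A": 15, "A-": 14, "B+": 13, "B": 12, "B-": 11, "C+": 10, "C": 9,
    "C-": 8, "D+": 7, "D": 6, "D-": 5, "I": 4, "E": 3, "W": 2, "UW": 1,
}
MISSING_GRADE = 7.5

def encode_gender(gender):
    # One-hot encoding: an unknown gender leaves both fields at 0.
    return {"male": int(gender == "M"), "female": int(gender == "F")}

def encode_grade(grade):
    # Numeric conversion with fill-in for missing values.
    return GRADE_INDEX.get(grade, MISSING_GRADE) if grade else MISSING_GRADE

def normalize(value, lo, hi):
    # Min-max normalization so wide-ranged fields do not dominate.
    return (value - lo) / (hi - lo)

record = {"gender": "F", "course_grade": "B+", "age": 24, "hs_gpa": 3.4}
row = {
    **encode_gender(record["gender"]),
    "course_grade": normalize(encode_grade(record["course_grade"]), 0, 15),
    "age": normalize(record["age"], 0, 100),
    "hs_gpa": normalize(record["hs_gpa"], 0.0, 4.0),
}
print(row)
```

After this step every field lies in a comparable range, which is what the normalization step above requires before the machine learning algorithms are applied.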
It is possible some bad values exist in the dataset that were not so easily detected. Validation of fields was limited to checking that values fell within proper ranges, plus the internal database constraints that exist on the tables from which the data was extracted.

Further techniques were tried, but the results of that processing were found to have less accuracy than the results without them, so the results produced from them were not used for the final analysis. Those attempts included removing fields whose value can be inferred from other related fields, and giving weighted averages to missing values rather than the values described previously. The majority of students in the dataset are female, so the related Male field was removed because it could be inferred that if a student was not female, they must be male. There were instances where the Female and Male fields both contained 0, for students whose gender was unknown or not declared, but only 2 existed in the dataset. For marital status, the Other field was removed, as it could be inferred from the other related fields. The weighted average processing produced results that were skewed and biased in favor of the positive, which had the machine learning predicting false positives more frequently. This was especially true for the missing values in course grades, where a large number of missing values existed and the weighted average value for those fields was relatively high. The handling of the different versions of preprocessing is explained in the next section.

Machine Learning Techniques

The aforementioned features have been extracted from WSU's student information system database using SQL queries tailored to get the best representation possible from what was available. The data has also been preprocessed to allow for functionality and ease of use with the various machine learning algorithms that were employed against it.
The chosen machine learning algorithms include the Decision Tree classifier (DT), Logistic Regression classifier (LR), Adaboost (ADA), Random Forest (RF) and Majority Class (MC). Each was implemented in the Python programming language using the scikit-learn library of machine learning functionality. [23] The following is a brief description of each algorithm.

DT – A decision tree is built using the supplied features and traversed sequentially several times and in different layouts to determine the optimal path for the given outcome. Branches are weighted based on the likelihood of the outcome given the value of the node (feature). Predictions are based on the statistical likelihood given the optimal path of the tree for the instance's features. [24]

LR – Logistic regression of a set of non-linear features. The algorithm attempts to determine a dividing line (or function curve) between instances with the two different outcomes. Predictions are made based on the function output value of the instance features: whether it lies inside or outside the dividing line. [25]

ADA – Combination classifier. The results of several machine learning classifier instances are combined and outcomes are predicted. Each classifier "votes" on the predicted outcome, and the majority vote from the classifiers is the output. [26]

RF – Combination classifier using only decision trees. Several different decision trees are constructed and traversed for optimal paths, with each tree constructed from a subset of the features. Each tree votes its predicted outcome, and the majority vote of all trees is the output prediction. A sample tree formed by the subset of features used is found in the appendix. [27]

MC – Simple algorithm that takes whichever outcome from the dataset is the most common and predicts that outcome every time.
Used as a baseline control to measure the comparative, and hopefully improved, accuracy and information gained from the other classifiers.

Each of the algorithms was used on each of the datasets and measured for its performance, the details of which can be found in the next section.

Machine Learning Algorithm Performance

Figures 2 and 3 show the comparative accuracy performance of each of the machine learning algorithms for each of the datasets.

Figure 2. Prediction Accuracy of Machine Learning Algorithms by Discipline for Graduation. [Bar chart of the DT, LR, ADA, RF and MC classifiers over the COMBINED, BUSIADMIN, COMPSCI, ELEMEDUC, NURSING and MICROBIO datasets, on a 0.0 to 1.0 scale.] Prediction accuracy as the fraction of correct predictions for each classifier: 1 (100%) indicates the prediction was correct every time, and 0 means an incorrect prediction every time. Results are for predictions of whether a student graduated from the university or not.

Figure 3. Prediction Accuracy of Machine Learning Algorithms by Discipline for Program Retention. [Bar chart of the same classifiers and datasets, on the same scale.] Prediction accuracy as the fraction of correct predictions for each classifier, as in Figure 2. Results are for whether a student graduated in their originally declared discipline.

As shown by the information in the figures, the machine learning algorithms showed some promise in predicting the graduation outcome with a fairly high degree of accuracy. There were three versions of each of the datasets processed with the machine learning algorithms. The first was the version created immediately after the preprocessing described in the previous section. Each version was also processed with two different outcome fields, "graduated" and "grad_in_pos," which correspond respectively to whether a student graduated with any 4-year degree and whether they graduated in the discipline (i.e., program of study) they originally declared. The two subsequent versions were created in an attempt to optimize the performance of the algorithms.
As mentioned in the previous section, one attempt used weighted average values for missing values, and another attempt removed the fields inferable from one-hot encoding, such as removing the Male field, since it can be inferred that if a student is not male they must be female (in the datasets there were no instances of other gender definitions). These are hereafter labeled as the "Weighted" and "OH -1" datasets for both of the corresponding outcomes. See Appendix D for an accuracy comparison between the different versions of preprocessing on each of the datasets. Because the results from the original graduated and grad_in_pos versions were the most accurate, they were chosen for the final analysis in this study. As can be seen from the table, the original graduated and grad_in_pos versions have higher accuracies than the Weighted and OH -1 versions. The attempts to further optimize the results of the algorithms proved to reduce the overall accuracy for each dataset. The accuracy reduction is most likely due to the weighted average being too close to one side of the spectrum, which biases the predictions.

It is encouraging to see that the accuracy is relatively high for each of the algorithms on each of the datasets. Predictions show accuracies from around 70% to as high as 95%. Also, in each case nearly every algorithm performed better than the Majority Class baseline algorithm, which shows that there is at least some information value gained from running the machine learning techniques. The only exception was the LR algorithm for the Elementary Education discipline. Appendix C contains the exact numeric values for the charts to demonstrate this discrepancy.

In the analysis sections, the accuracy results here can be kept in mind when seeing which features rise to the top for a given algorithm in terms of the importance factor percentages.
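The algorithm comparison above can be reproduced in miniature with scikit-learn. This sketch uses a synthetic dataset (the WSU student records are not public) and default classifier parameters, not the study's exact settings; it also shows how the Random Forest importance factors discussed in the analysis sections are obtained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for a student dataset: 10 features, binary outcome.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "ADA": AdaBoostClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "MC": DummyClassifier(strategy="most_frequent"),  # Majority Class baseline
}

accuracy = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    accuracy[name] = clf.score(X_te, y_te)
    print(f"{name}: {accuracy[name]:.4f}")

# Importance factors as in the analysis sections: the Random Forest's
# feature_importances_ sum to 1, so each is a share of 100%.
rf = classifiers["RF"]
ranked = sorted(enumerate(rf.feature_importances_), key=lambda p: -p[1])
for idx, imp in ranked[:3]:
    print(f"feature {idx}: {imp:.2%}")
```

On informative data, every learned classifier should beat the most-frequent-class baseline, mirroring the result reported above for the real datasets.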
It is also shown from the figures that in most cases the Random Forest algorithm had the most accurate performance for most of the datasets. So the information from the output of this algorithm should be particularly useful for analysis as compared with the others. 23 General Analysis The combined dataset was taken by combining the 5 discipline’s datasets into one larger dataset, totaling 20,178 students. The following table provides a statistical breakdown of some of the key features from the combined dataset: Table 2a. Statistics of features from Combined Dataset. Combined Total Graduated Retained Male 8232 5243 1779 Female 11945 8164 3522 Married 10343 7920 3308 Single 7100 3948 1395 Divorced 791 508 197 Dev Math 7026 2985 905 Dev Engl 3448 1378 476 Veteran 431 221 95 Black 288 102 27 Pacific Islander 106 46 13 Native American 255 147 50 Hispanic 1516 851 289 International 26 13 1 First Gen 7286 4713 1723 Course Grade Counts Course1 4969 4109 1975 Course2 4918 4686 2875 Course3 5908 5257 2903 Course4 4117 3846 2461 Total Students 20178 13407 5301 This table shows how many instances of each of the features therein occurred in the dataset. The labels for features for critical courses were changed because the particular courses would not be found across the disciplines that don’t include it. They were 24 relabeled as Course1, Course2, Course3, and Course 4, respectively to how far in the curriculum for the discipline they are. For example, Course1 would be the course from that discipline’s determined critical courses that was earliest in the curriculum, while Course4 was later in the curriculum. For discipline datasets that had more than 4 determined critical courses, any additional courses after the first 4 were excluded from this dataset. That is to say, the courses that were latest in the curriculum were not part of the combined dataset. Each of the result tables shows the features in ascending order of importance. 
The features most impactful on graduation or retention rates appear at the top. Importance is expressed as a percentage, meaning that the impactful features together share a whole 100%; a larger percentage indicates the feature was more impactful on the outcome. Reading the Logistic Regression tables is somewhat different. From an LR table we can discern both how impactful a feature is and whether that impact is positive or negative. The impact is not measured as a percentage but as a numeric magnitude: a larger number indicates a larger impact, and whether the impact is positive or negative is indicated by the sign of the number. [28] Due to the nature of the RF classifier, with its decision trees built from subsets of the features, there is less difference in percentage between features than in the results from the other classifiers. [29] Also, considering that the RF classifier most often scored the most accurate predictions, it seems that concentrating too much importance on too few features does not lead to better outcomes. This is likely due to overfitting, meaning the algorithm's results are skewed to make a correct prediction for one particular case, but in doing so cause other similar cases to be predicted incorrectly. [30]

Table 2b. Decision Tree Importance Factors for Combined Dataset.

Graduation                                | Program Retention
INC_DEV_MATH 67.77%                       | COURSE4 57.66%
MARRIED 9.53%                             | INC_DEV_MATH 21.86%
COURSE2 8.83%                             | MALE 6.77%
COURSE1 5.01%                             | AGE_MATRICULATED 4.00%
MALE 4.18%                                | FEMALE 3.85%
AGE_MATRICULATED 2.04%                    | SINGLE 2.37%
COURSE4 1.36%                             | COURSE2 1.64%
COURSE3 0.66%                             | COURSE1 1.36%
FEMALE 0.63%                              | STU_HS_GPA 0.40%
STU_HOURS_EARNED_TRANSFER 0.05%           | MARRIED 0.05%

Result table for the combined dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 2c. Random Forest Importance Factors for Combined Dataset.
Graduated                                 | Program Retention
INC_DEV_MATH 10.67%                       | COURSE4 10.71%
AGE_MATRICULATED 8.67%                    | COURSE2 9.53%
COURSE2 7.49%                             | AGE_MATRICULATED 8.02%
COURSE3 7.33%                             | COURSE3 7.42%
STU_HS_GPA 6.95%                          | STU_HS_GPA 6.72%
COURSE4 5.65%                             | COURSE1 5.76%
COURSE1 5.00%                             | ACT_COMPOSITE 5.26%
ACT_COMPOSITE 4.92%                       | STU_HOURS_EARNED_TRANSFER 4.58%
MARRIED 4.42%                             | FEMALE 4.19%
INC_DEV_ENGL 4.39%                        | INC_DEV_MATH 4.18%
SINGLE 3.85%                              | MARRIED 4.09%
STU_HOURS_EARNED_TRANSFER 3.76%           | NRSG 4.07%
OFFERED_PELL 3.60%                        | OFFERED_PELL 3.74%
FEMALE 3.50%                              | MALE 3.37%
NRSG 3.30%                                | EDUC 3.16%
MALE 2.97%                                | FIRST_GEN_IND 2.87%
FIRST_GEN_IND 2.77%                       | SINGLE 2.76%
BSAD 1.98%                                | BSAD 2.62%
EDUC 1.80%                                | CS 1.68%
CS 1.78%                                  | REPORTED_LOW_INCOME 1.42%
REPORTED_LOW_INCOME 1.71%                 | MICR 0.92%
TERM_RACE_HISPANIC_IND 0.90%              | INC_DEV_ENGL 0.78%
MICR 0.67%                                | TERM_RACE_HISPANIC_IND 0.58%
DIVORCED 0.48%                            | DIVORCED 0.49%
AP_CREDIT 0.37%                           | AP_CREDIT 0.47%
VETERAN 0.30%                             | VETERAN 0.25%
TERM_RACE_BLACK_IND 0.27%                 | TERM_RACE_NATIVE_AMERICAN_IND 0.10%
TERM_RACE_NATIVE_AMERICAN_IND 0.19%       | TERM_RACE_BLACK_IND 0.07%
OTHER_MARRIAGE 0.16%                      | CLEP_CREDIT 0.06%
TERM_RACE_PACIFIC_ISLANDER_IND 0.08%      | OTHER_MARRIAGE 0.05%
CLEP_CREDIT 0.04%                         | TERM_RACE_PACIFIC_ISLANDER_IND 0.03%
ROLE_INTERNATIONAL_STUDENT_IND 0.02%      | ROLE_INTERNATIONAL_STUDENT_IND 0.01%

Result table for the combined dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 2d. Adaboost Importance Factors for Combined Dataset.
Graduated                                 | Program Retention
INC_DEV_MATH 12%                          | COURSE4 16%
COURSE2 10%                               | COURSE2 10%
AGE_MATRICULATED 8%                       | BSAD 8%
FEMALE 8%                                 | FEMALE 8%
STU_HS_GPA 8%                             | SINGLE 8%
NRSG 6%                                   | CS 6%
COURSE3 6%                                | STU_HS_GPA 6%
EDUC 4%                                   | INC_DEV_MATH 6%
MALE 4%                                   | COURSE3 6%
ACT_COMPOSITE 4%                          | AGE_MATRICULATED 4%
STU_HOURS_EARNED_TRANSFER 4%              | EDUC 4%
MARRIED 4%                                | NRSG 4%
INC_DEV_ENGL 4%                           | ACT_COMPOSITE 4%
COURSE1 4%                                | TERM_RACE_HISPANIC_IND 2%
COURSE4 4%                                | FIRST_GEN_IND 2%
BSAD 2%                                   | MALE 2%
CS 2%                                     | REPORTED_LOW_INCOME 2%
DIVORCED 2%                               | INC_DEV_ENGL 2%
OFFERED_PELL 2%                           |
REPORTED_LOW_INCOME 2%                    |

Result table for the combined dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 2e. Logistic Regression Importance Factors for Combined Dataset.

Graduated                                 | Program Retention
INC_DEV_MATH -1.53E+09                    | INC_DEV_MATH -2.41E+05
INC_DEV_ENGL -1.31E+05                    | COURSE4 52286.99
SINGLE -1.34E+04                          | MALE -1753.03
MARRIED 10623.06                          | SINGLE -1653.83
MALE -532.383                             | STU_HS_GPA -218.821
COURSE2 159.2031                          | INC_DEV_ENGL -215.944
OFFERED_PELL 102.65                       | MARRIED 79.5166
COURSE3 31.89799                          | AGE_MATRICULATED -61.3966
COURSE4 25.91541                          | FIRST_GEN_IND -46.5928
AGE_MATRICULATED -11.1563                 | REPORTED_LOW_INCOME -11.4427

Result table for the combined dataset using the Logistic Regression classifier. Only the ten most impactful features are included. Negative coefficients indicate an inverse relationship and positive coefficients a direct relationship; a coefficient of 0 would indicate no relationship. Note that for this dataset and all other datasets, the importance-factor results only show features whose values were not 0; if a feature did not impact the outcome according to the algorithm, it does not appear in the result tables. For the combined dataset, the features that stand out are the developmental math and English requirements, marital status, gender, and course grades. The LR results show that the impact of developmental courses, age, and being single is negative.
Being required to take extra developmental courses at the university stage intuitively means a lower likelihood of graduating. The results also show that as age increases, graduation rates decrease. Being single having a negative impact while being married has a positive one is an interesting relationship with graduation rates; why and how marital status affects these rates may be worth further investigation. Gender is also an interesting factor: the LR results show it can be one of the more significant features, and the importance of gender to the outcome appears in each of the other result sets as well. Though no details in this data explain the result, there is clearly a possible discrepancy in graduation and retention rates between the genders. High school GPA shows a significant impact, and its negative sign in the LR results could indicate a "double-edged sword": if a student is over-prepared, they are less likely to graduate here. Whether that is because they drop out, or perhaps transfer to a more prestigious university, unfortunately cannot be discerned from this data. Ethnicity, as well as veteran and international status, did not appear to have much impact on the results, possibly because the population of students with those indicators is fairly low in this dataset. We can also see that these features have little impact because the graduated count is very near half of the total for each of them, which means the feature itself is not very deterministic of the outcome. We can expect this trend of low impact to persist in the other datasets as well. Course grades show the opposite situation. While a lot of data is missing from the course-grade features, a majority of students with grade data appear in the graduated and retained columns. This means course data can be expected to trend as highly impactful in the other datasets as well.
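The two kinds of importance values discussed above — percentage shares from the tree-based models and signed coefficients from LR — can be read off scikit-learn models as shown below. This is a sketch on synthetic data, not the thesis pipeline: two made-up features, with only the first driving the outcome.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: two features, only the first determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
lr = LogisticRegression().fit(X, y)

# Tree-based importances are non-negative shares of a whole (they sum
# to 1), so they show magnitude only.
print(rf.feature_importances_)

# LR coefficients are signed: a positive coefficient means a direct
# relationship with the outcome, a negative one an inverse relationship.
print(lr.coef_[0])
```
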
Computer Science Analysis

The computer science dataset includes 2,462 students who finished at least one term while declared as a Computer Science major. The statistical breakdown for this dataset is as follows:

Table 3a. Statistics of features from Computer Science Dataset.

CS                     Total   Graduated   Retained
Male                    2188        1260        265
Female                   274         166         24
Married                 1089         710        135
Single                  1097         587        130
Divorced                  53          27          2
Dev Math                 807         318         68
Dev Engl                 400         156         43
Veteran                   97          38          9
Black                     36          11          0
Pacific Islander          11           7          0
Native American           39          12          5
Hispanic                 175         104         31
International              1           1          0
First Gen                842         478        104
Course Grade Counts
CS_1400_GRADE           1773        1143        288
CS_2420_GRADE           1122         949        286
CS_2550_GRADE           1320        1055        287
CS_2810_GRADE            330         255        145
CS_3100_GRADE            811         774        289
CS_3750_GRADE            717         708        289
Total Students          2462        1426        289

This table shows how many instances of each feature occurred in the dataset.

Table 3b. Decision Tree Importance Factors for Computer Science Discipline.

Graduation                                | Program Retention
CS_3750_GRADE 44.76%                      | CS_3750_GRADE 59.97%
INC_DEV_MATH 15.18%                       | CS_2810_GRADE 21.29%
MALE 10.17%                               | CS_1400_GRADE 8.13%
CS_2550_GRADE 9.61%                       | AGE_MATRICULATED 7.73%
CS_2420_GRADE 4.95%                       | STU_HOURS_EARNED_TRANSFER 1.99%
CS_1400_GRADE 4.71%                       | STU_HS_GPA 0.89%
SINGLE 2.25%                              |
AGE_MATRICULATED 2.00%                    |
STU_HS_GPA 1.92%                          |
MARRIED 1.92%                             |
CS_2810_GRADE 1.37%                       |
CS_3100_GRADE 0.76%                       |
REPORTED_LOW_INCOME 0.38%                 |

Result table for the computer science dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 3c. Random Forest Importance Factors for Computer Science Discipline.
Graduated                                 | Program Retention
CS_3750_GRADE 13.37%                      | CS_3750_GRADE 16.43%
CS_2550_GRADE 10.60%                      | CS_2810_GRADE 11.30%
CS_2420_GRADE 8.54%                       | CS_3100_GRADE 11.16%
CS_3100_GRADE 7.59%                       | MALE 7.46%
AGE_MATRICULATED 7.51%                    | AGE_MATRICULATED 6.90%
STU_HS_GPA 6.60%                          | CS_2420_GRADE 6.63%
MALE 5.97%                                | STU_HS_GPA 6.43%
CS_1400_GRADE 5.50%                       | CS_1400_GRADE 5.32%
CS_2810_GRADE 5.09%                       | CS_2550_GRADE 5.17%
INC_DEV_MATH 4.27%                        | ACT_COMPOSITE 4.62%
ACT_COMPOSITE 4.10%                       | OFFERED_PELL 3.55%
SINGLE 3.79%                              | STU_HOURS_EARNED_TRANSFER 3.06%
OFFERED_PELL 3.06%                        | SINGLE 2.46%
MARRIED 2.75%                             | MARRIED 2.27%
STU_HOURS_EARNED_TRANSFER 2.72%           | FIRST_GEN_IND 1.63%
FIRST_GEN_IND 2.27%                       | REPORTED_LOW_INCOME 1.33%
INC_DEV_ENGL 1.54%                        | INC_DEV_MATH 1.00%
REPORTED_LOW_INCOME 1.20%                 | FEMALE 0.73%
FEMALE 1.15%                              | INC_DEV_ENGL 0.55%
TERM_RACE_HISPANIC_IND 0.66%              | CLEP_CREDIT 0.55%
AP_CREDIT 0.58%                           | TERM_RACE_HISPANIC_IND 0.54%
VETERAN 0.54%                             | VETERAN 0.43%
TERM_RACE_BLACK_IND 0.20%                 | AP_CREDIT 0.27%
CLEP_CREDIT 0.12%                         | TERM_RACE_NATIVE_AMERICAN_IND 0.16%
DIVORCED 0.11%                            | DIVORCED 0.05%
TERM_RACE_NATIVE_AMERICAN_IND 0.10%       | TERM_RACE_PACIFIC_ISLANDER_IND 0.01%
TERM_RACE_PACIFIC_ISLANDER_IND 0.05%      |
OTHER_MARRIAGE 0.03%                      |
ROLE_INTERNATIONAL_STUDENT_IND 0.01%      |

Result table for the computer science dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 3d. Adaboost Importance Factors for Computer Science Discipline.
Graduated                                 | Program Retention
CS_3750_GRADE 18%                         | CS_2810_GRADE 14%
AGE_MATRICULATED 10%                      | CS_3100_GRADE 14%
STU_HS_GPA 8%                             | STU_HS_GPA 12%
CS_2550_GRADE 8%                          | AGE_MATRICULATED 10%
CS_2810_GRADE 8%                          | CS_1400_GRADE 10%
MALE 6%                                   | CS_3750_GRADE 10%
ACT_COMPOSITE 6%                          | MALE 8%
CS_2420_GRADE 6%                          | ACT_COMPOSITE 6%
CS_3100_GRADE 6%                          | STU_HOURS_EARNED_TRANSFER 4%
MARRIED 4%                                | CS_2420_GRADE 4%
CS_1400_GRADE 4%                          | CS_2550_GRADE 4%
TERM_RACE_HISPANIC_IND 2%                 | FEMALE 2%
TERM_RACE_NATIVE_AMERICAN_IND 2%          | REPORTED_LOW_INCOME 2%
SINGLE 2%                                 |
AP_CREDIT 2%                              |
CLEP_CREDIT 2%                            |
OFFERED_PELL 2%                           |
INC_DEV_ENGL 2%                           |
INC_DEV_MATH 2%                           |

Result table for the computer science dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 3e. Logistic Regression Importance Factors for Computer Science Discipline.

Graduated                                 | Program Retention
CS_3750_GRADE 385.26                      | CS_3750_GRADE 2404.114
AGE_MATRICULATED -172.993                 | AGE_MATRICULATED -163.805
CS_2550_GRADE 77.59914                    | CS_3100_GRADE 31.54723
CS_2420_GRADE 70.68672                    | CS_2810_GRADE 19.48314
CS_3100_GRADE 39.84163                    | CS_2420_GRADE 9.378467
INC_DEV_MATH -14.2907                     | STU_HS_GPA -7.50229
CS_2810_GRADE -13.877                     | STU_HOURS_EARNED_TRANSFER 4.681611
CS_1400_GRADE 5.214355                    | CS_1400_GRADE 4.515934
INC_DEV_ENGL -4.99                        | ACT_COMPOSITE -2.57486
SINGLE -4.89128                           | CS_2550_GRADE 2.122836

Result table for the computer science dataset using the Logistic Regression classifier. Only the ten most impactful features are included. Negative coefficients indicate an inverse relationship and positive coefficients a direct relationship; a coefficient of 0 would indicate no relationship. For the computer science discipline it is apparent that course grades make the most difference in both graduation and retention rates. Age follows closely behind, showing the same relationship as in the combined dataset: as age increases, graduation and retention rates decrease. There is a gender gap in this dataset, as about 9 out of 10 students are male.
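The gender ratio quoted above can be checked directly from counts in the style of the statistics tables; this is a small sketch with hypothetical records, since the real data is not reproduced here:

```python
import pandas as pd

# Hypothetical student records with one-hot style indicator columns.
df = pd.DataFrame({
    "MALE":      [1, 1, 1, 0],
    "GRADUATED": [1, 0, 1, 1],
    "RETAINED":  [0, 1, 1, 0],
})

# Per-feature counts in the style of the statistics tables: total with
# the indicator, and how many of those graduated/retained.
male = df[df["MALE"] == 1]
row = {"Total": len(male),
       "Graduated": int(male["GRADUATED"].sum()),
       "Retained": int(male["RETAINED"].sum())}
male_share = df["MALE"].mean()  # fraction of students who are male

print(row)         # {'Total': 3, 'Graduated': 2, 'Retained': 2}
print(male_share)  # 0.75
```
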
This may have contributed to the Male feature having a slightly higher impact than in the other datasets.

Business Administration Analysis

The business administration dataset includes 4,561 students who finished at least one term while declared as a Business Administration major. The statistical breakdown for this dataset is as follows:

Table 4a. Statistics of features from Business Administration Dataset.

BSAD                   Total   Graduated   Retained
Male                    3392        2116        605
Female                  1169         746        161
Married                 2241        1710        489
Single                  1692         827        192
Divorced                 125          69         22
Dev Math                1762         726        141
Dev Engl                 938         358         93
Veteran                   92          41         16
Black                    108          41         12
Pacific Islander          26          10          4
Native American           52          33          3
Hispanic                 361         172         45
International             17           9          0
First Gen               1613        1013        232
Course Grade Counts
BSAD_2899               1339        1257        547
BSAD_4780               1444        1431        766
ACTG_2020               2052        1770        753
ECON_2020               1805        1649        723
FIN_3200                1505        1458        759
Total Students          4561        2862        766

This table shows how many instances of each feature occurred in the dataset.

Table 4b. Decision Tree Importance Factors for Business Administration Discipline.

Graduation                                | Program Retention
BSAD_4780_GRADE 46.79%                    | BSAD_4780_GRADE 88.86%
INC_DEV_MATH 25.99%                       | BSAD_2899_GRADE 6.99%
MARRIED 11.57%                            | STU_HOURS_EARNED_TRANSFER 3.58%
BSAD_2899_GRADE 5.89%                     | ECON_2020_GRADE 0.56%
ECON_2020_GRADE 3.83%                     |
MALE 1.92%                                |
ACTG_2020_GRADE 1.43%                     |
ACT_COMPOSITE 1.31%                       |
INC_DEV_ENGL 1.26%                        |

Result table for the business administration dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 4c. Random Forest Importance Factors for Business Administration Discipline.
Graduated                                 | Program Retention
BSAD_4780_GRADE 11.81%                    | BSAD_4780_GRADE 20.30%
ACTG_2020_GRADE 9.05%                     | FIN_3200_GRADE 10.71%
INC_DEV_MATH 7.85%                        | ACTG_2020_GRADE 9.97%
AGE_MATRICULATED 7.27%                    | ECON_2020_GRADE 7.56%
STU_HS_GPA 6.62%                          | AGE_MATRICULATED 7.34%
ECON_2020_GRADE 6.54%                     | BSAD_2899_GRADE 6.09%
FIN_3200_GRADE 6.01%                      | STU_HS_GPA 5.48%
MARRIED 5.62%                             | MALE 5.30%
MALE 5.17%                                | ACT_COMPOSITE 4.68%
BSAD_2899_GRADE 5.17%                     | MARRIED 4.28%
INC_DEV_ENGL 5.04%                        | OFFERED_PELL 3.76%
ACT_COMPOSITE 4.90%                       | STU_HOURS_EARNED_TRANSFER 3.22%
SINGLE 4.15%                              | FIRST_GEN_IND 2.31%
OFFERED_PELL 3.65%                        | SINGLE 2.22%
STU_HOURS_EARNED_TRANSFER 2.70%           | FEMALE 1.89%
FIRST_GEN_IND 2.53%                       | INC_DEV_MATH 1.87%
FEMALE 2.03%                              | REPORTED_LOW_INCOME 1.03%
REPORTED_LOW_INCOME 1.33%                 | INC_DEV_ENGL 0.54%
TERM_RACE_HISPANIC_IND 0.79%              | TERM_RACE_HISPANIC_IND 0.44%
DIVORCED 0.43%                            | AP_CREDIT 0.27%
AP_CREDIT 0.42%                           | DIVORCED 0.26%
TERM_RACE_BLACK_IND 0.39%                 | VETERAN 0.25%
VETERAN 0.20%                             | TERM_RACE_BLACK_IND 0.10%
TERM_RACE_NATIVE_AMERICAN_IND 0.17%       | TERM_RACE_PACIFIC_ISLANDER_IND 0.04%
OTHER_MARRIAGE 0.06%                      | TERM_RACE_NATIVE_AMERICAN_IND 0.04%
ROLE_INTERNATIONAL_STUDENT_IND 0.05%      | ROLE_INTERNATIONAL_STUDENT_IND 0.03%
CLEP_CREDIT 0.03%                         | OTHER_MARRIAGE 0.01%
TERM_RACE_PACIFIC_ISLANDER_IND 0.02%      | CLEP_CREDIT 0.01%

Result table for the business administration dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 4d. Adaboost Importance Factors for Business Administration Discipline.
Graduated                                 | Program Retention
BSAD_4780_GRADE 18%                       | BSAD_4780_GRADE 30%
BSAD_2899_GRADE 8%                        | STU_HOURS_EARNED_TRANSFER 10%
AGE_MATRICULATED 6%                       | BSAD_2899_GRADE 10%
FIRST_GEN_IND 6%                          | MALE 6%
MALE 6%                                   | ACTG_2020_GRADE 6%
STU_HS_GPA 6%                             | FIN_3200_GRADE 6%
ACTG_2020_GRADE 6%                        | FIRST_GEN_IND 4%
TERM_RACE_BLACK_IND 4%                    | FEMALE 4%
FEMALE 4%                                 | STU_HS_GPA 4%
DIVORCED 4%                               | ACT_COMPOSITE 4%
MARRIED 4%                                | OFFERED_PELL 4%
INC_DEV_ENGL 4%                           | INC_DEV_MATH 4%
INC_DEV_MATH 4%                           | ECON_2020_GRADE 4%
ECON_2020_GRADE 4%                        | AGE_MATRICULATED 2%
FIN_3200_GRADE 4%                         | MARRIED 2%
TERM_RACE_NATIVE_AMERICAN_IND 2%          |
ACT_COMPOSITE 2%                          |
STU_HOURS_EARNED_TRANSFER 2%              |
SINGLE 2%                                 |
AP_CREDIT 2%                              |
OFFERED_PELL 2%                           |

Result table for the business administration dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 4e. Logistic Regression Importance Factors for Business Administration Discipline.

Graduated                                 | Program Retention
BSAD_4780_GRADE 9496.426                  | BSAD_4780_GRADE 183021.5
BSAD_2899_GRADE -623.135                  | AGE_MATRICULATED -174.427
INC_DEV_MATH -531.04                      | BSAD_2899_GRADE -35.8275
ACTG_2020_GRADE 91.4362                   | ACTG_2020_GRADE 33.33284
SINGLE -91.163                            | STU_HS_GPA -18.0797
INC_DEV_ENGL -84.6354                     | FIN_3200_GRADE 10.75852
MARRIED 83.16365                          | INC_DEV_MATH -7.09579
AGE_MATRICULATED -33.9698                 | ECON_2020_GRADE 5.836036
ECON_2020_GRADE 19.85393                  | SINGLE -3.73919
OFFERED_PELL 12.49723                     | INC_DEV_ENGL -2.68817

Result table for the business administration dataset using the Logistic Regression classifier. Only the ten most impactful features are included. Negative coefficients indicate an inverse relationship and positive coefficients a direct relationship; a coefficient of 0 would indicate no relationship. For business administration, the results were similar to those for computer science: both coursework and age stood out as significant. The developmental math requirement showed more of an impact here than in computer science.
The statistical likelihood that course data is highly impactful shows strongly here, especially for the BSAD 4780 course: of the 1,444 records with that course grade, only 13 did not graduate. It is worth noting that BSAD 4780 is among the final courses a business administration student will take, so it stands to reason that a student who has made it that far in the discipline is likely to graduate from that program. The consistently very high impact of this course may have skewed the percentage results somewhat.

Elementary Education Analysis

The elementary education dataset includes 3,211 students who finished at least one term while declared as an Elementary Education major. The statistical breakdown for this dataset is as follows:

Table 5a. Statistics of features from Elementary Education Dataset.

EDUC                   Total   Graduated   Retained
Male                     279         166         81
Female                  2932        2204       1120
Married                 1752        1418        777
Single                  1078         697        296
Divorced                 117          98         49
Dev Math                1108         576        219
Dev Engl                 491         243        103
Veteran                   37          24         12
Black                     18           8          0
Pacific Islander           4           3          0
Native American           24          17          2
Hispanic                 186         132         48
International              3           2          0
First Gen               1166         838        377
Course Grade Counts
EDUC_2010_GRADE          388         354        187
EDUC_3210_GRADE          306         304        267
EDUC_4210_GRADE          308         306        271
EDUC_3230_GRADE            4           4          4
EDUC_3270_GRADE          339         334        267
EDUC_4350_GRADE           16          16         10
MATH_2020_GRADE         1367        1345       1082
Total Students          3211        2370       1201

This table shows how many instances of each feature occurred in the dataset.

Table 5b. Decision Tree Importance Factors for Elementary Education Discipline.
Graduation                                | Program Retention
INC_DEV_MATH 46.33%                       | MATH_2020_GRADE 67.11%
MATH_2020_GRADE 34.68%                    | EDUC_4350_GRADE 14.90%
EDUC_2010_GRADE 4.53%                     | EDUC_4210_GRADE 4.93%
MALE 4.26%                                | EDUC_3230_GRADE 4.50%
EDUC_4210_GRADE 3.03%                     | EDUC_2010_GRADE 2.28%
INC_DEV_ENGL 2.69%                        | FEMALE 1.76%
EDUC_3230_GRADE 2.43%                     | MALE 1.09%
MARRIED 1.83%                             | SINGLE 0.95%
FIRST_GEN_IND 0.10%                       | FIRST_GEN_IND 0.78%
STU_HOURS_EARNED_TRANSFER 0.08%           | STU_HS_GPA 0.61%
EDUC_3210_GRADE 0.03%                     | ACT_COMPOSITE 0.59%
FEMALE 0.01%                              | INC_DEV_MATH 0.49%

Result table for the elementary education dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 5c. Random Forest Importance Factors for Elementary Education Discipline.

Graduated                                 | Program Retention
MATH_2020_GRADE 13.58%                    | MATH_2020_GRADE 31.97%
INC_DEV_MATH 11.26%                       | EDUC_3210_GRADE 6.08%
STU_HS_GPA 6.68%                          | AGE_MATRICULATED 6.06%
AGE_MATRICULATED 6.40%                    | EDUC_4210_GRADE 5.57%
EDUC_3230_GRADE 5.07%                     | EDUC_2010_GRADE 5.18%
ACT_COMPOSITE 5.01%                       | EDUC_4350_GRADE 5.16%
EDUC_4350_GRADE 4.79%                     | EDUC_3230_GRADE 4.67%
FEMALE 4.50%                              | STU_HS_GPA 4.37%
EDUC_2010_GRADE 4.39%                     | ACT_COMPOSITE 4.17%
EDUC_3270_GRADE 4.31%                     | EDUC_3270_GRADE 3.96%
EDUC_3210_GRADE 4.25%                     | INC_DEV_MATH 3.81%
EDUC_4210_GRADE 4.16%                     | FEMALE 3.69%
OFFERED_PELL 4.10%                        | STU_HOURS_EARNED_TRANSFER 2.63%
MARRIED 3.69%                             | MARRIED 2.57%
INC_DEV_ENGL 3.45%                        | OFFERED_PELL 2.25%
SINGLE 3.33%                              | FIRST_GEN_IND 2.23%
STU_HOURS_EARNED_TRANSFER 3.23%           | SINGLE 2.02%
FIRST_GEN_IND 2.76%                       | TERM_RACE_HISPANIC_IND 0.71%
REPORTED_LOW_INCOME 1.62%                 | REPORTED_LOW_INCOME 0.65%
MALE 1.43%                                | INC_DEV_ENGL 0.57%
TERM_RACE_HISPANIC_IND 0.70%              | AP_CREDIT 0.52%
AP_CREDIT 0.43%                           | MALE 0.50%
DIVORCED 0.27%                            | DIVORCED 0.38%
VETERAN 0.20%                             | VETERAN 0.18%
TERM_RACE_NATIVE_AMERICAN_IND 0.13%       | OTHER_MARRIAGE 0.06%
OTHER_MARRIAGE 0.11%                      | TERM_RACE_NATIVE_AMERICAN_IND 0.01%
TERM_RACE_BLACK_IND 0.09%                 | TERM_RACE_BLACK_IND 0.01%
ROLE_INTERNATIONAL_STUDENT_IND 0.03%      |
TERM_RACE_PACIFIC_ISLANDER_IND 0.02%      |
CLEP_CREDIT 0.01%                         |

Result table for the elementary education dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 5d. Adaboost Importance Factors for Elementary Education Discipline.

Graduated                                 | Program Retention
MATH_2020_GRADE 18%                       | MATH_2020_GRADE 22%
EDUC_4210_GRADE 10%                       | EDUC_4210_GRADE 10%
AGE_MATRICULATED 8%                       | ACT_COMPOSITE 8%
FIRST_GEN_IND 8%                          | EDUC_4350_GRADE 8%
STU_HS_GPA 8%                             | AGE_MATRICULATED 6%
ACT_COMPOSITE 6%                          | EDUC_3270_GRADE 6%
EDUC_2010_GRADE 6%                        | FIRST_GEN_IND 4%
STU_HOURS_EARNED_TRANSFER 4%              | STU_HS_GPA 4%
INC_DEV_ENGL 4%                           | STU_HOURS_EARNED_TRANSFER 4%
INC_DEV_MATH 4%                           | MARRIED 4%
EDUC_3210_GRADE 4%                        | SINGLE 4%
EDUC_3230_GRADE 4%                        | EDUC_2010_GRADE 4%
TERM_RACE_HISPANIC_IND 2%                 | VETERAN 2%
MALE 2%                                   | FEMALE 2%
DIVORCED 2%                               | MALE 2%
MARRIED 2%                                | DIVORCED 2%
OFFERED_PELL 2%                           | AP_CREDIT 2%
REPORTED_LOW_INCOME 2%                    | REPORTED_LOW_INCOME 2%
EDUC_3270_GRADE 2%                        | INC_DEV_MATH 2%
EDUC_4350_GRADE 2%                        | EDUC_3230_GRADE 2%

Result table for the elementary education dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 5e. Logistic Regression Importance Factors for Elementary Education Discipline.

Graduated                                 | Program Retention
MATH_2020_GRADE 8367.433                  | MATH_2020_GRADE 3445664
INC_DEV_MATH -216.189                     | EDUC_3230_GRADE -1122.31
EDUC_3230_GRADE -150.925                  | EDUC_4350_GRADE -469.729
EDUC_4350_GRADE -53.8601                  | EDUC_2010_GRADE -64.8478
INC_DEV_ENGL -14.9765                     | EDUC_4210_GRADE 20.18127
SINGLE -8.40308                           | INC_DEV_MATH -14.8852
MARRIED 8.035705                          | EDUC_3210_GRADE 14.56117
EDUC_2010_GRADE 5.701304                  | AGE_MATRICULATED -9.2618
OFFERED_PELL 5.340651                     | STU_HS_GPA -7.22455
EDUC_4210_GRADE 2.722026                  | SINGLE -6.21037

Result table for the elementary education dataset using the Logistic Regression classifier. Only the ten most impactful features are included. Negative coefficients indicate an inverse relationship and positive coefficients a direct relationship; a coefficient of 0 would indicate no relationship.
For the elementary education discipline it is very clear that math requirements are the most significant factor in determining graduation. The discipline has its own unique math courses, with Math 2020 being the last, and success in this course seems to be the make-or-break indicator of graduation and retention. Other coursework was also highly impactful, though interestingly some courses showed an inverse relationship, which would seem to imply that success in them means a lower likelihood of finishing the program. In this instance, however, the number of missing grades for those courses was very high: as the statistics for this dataset show, the EDUC 3230 course had only 4 grades total, and EDUC 4350 only 16. Unfortunately, the information from these two courses is therefore very unreliable, as the sample sizes are far too small. This dataset also shows another gender gap, so as with computer science, gender had a slightly higher impact here; in this case it is female students that show the higher impact.

Nursing Analysis

The nursing dataset includes 8,543 students who finished at least one term while declared as a Nursing major. The statistical breakdown for this dataset is as follows:

Table 6a. Statistics of features from Nursing Dataset.

NRSG                   Total   Graduated   Retained
Male                    1479        1005        441
Female                  7063        4697       2071
Married                 4563        3495       1614
Single                  2718        1483        599
Divorced                 437         277        101
Dev Math                2967        1166        397
Dev Engl                1472         547        199
Veteran                  167         101         46
Black                    110          35         13
Pacific Islander          51          22          5
Native American          115          71         35
Hispanic                 699         380        136
International              1           0          0
First Gen               3182        2016        861
Course Grade Counts
NRSG_2200_GRADE          545         539        440
NRSG_2500_GRADE         1319        1316       1031
NRSG_3100_GRADE         1361        1358       1063
NRSG_3300_GRADE         1339        1339       1056
Total Students          8543        5702       2512

This table shows how many instances of each feature occurred in the dataset.

Table 6b. Decision Tree Importance Factors for Nursing Discipline.

Graduation                                | Program Retention
INC_DEV_MATH 71.08%                       | NRSG_3300_GRADE 48.82%
STU_HOURS_EARNED_TRANSFER 8.12%           | INC_DEV_MATH 28.26%
AGE_MATRICULATED 7.90%                    | AGE_MATRICULATED 14.28%
MARRIED 5.57%                             | SINGLE 2.59%
NRSG_3300_GRADE 5.57%                     | OFFERED_PELL 1.86%
INC_DEV_ENGL 0.90%                        | MARRIED 1.18%
ACT_COMPOSITE 0.87%                       | FIRST_GEN_IND 1.18%
                                          | STU_HOURS_EARNED_TRANSFER 0.93%
                                          | FEMALE 0.83%
                                          | ACT_COMPOSITE 0.08%

Result table for the nursing dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 6c. Random Forest Importance Factors for Nursing Discipline.

Graduated                                 | Program Retention
INC_DEV_MATH 13.87%                       | AGE_MATRICULATED 11.19%
AGE_MATRICULATED 10.02%                   | NRSG_3300_GRADE 10.53%
STU_HS_GPA 7.46%                          | NRSG_3100_GRADE 9.02%
STU_HOURS_EARNED_TRANSFER 7.02%           | NRSG_2500_GRADE 8.25%
NRSG_3300_GRADE 6.31%                     | STU_HS_GPA 7.62%
FEMALE 6.07%                              | STU_HOURS_EARNED_TRANSFER 7.12%
ACT_COMPOSITE 5.89%                       | FEMALE 6.79%
NRSG_3100_GRADE 5.63%                     | NRSG_2200_GRADE 6.19%
NRSG_2200_GRADE 5.40%                     | ACT_COMPOSITE 5.98%
NRSG_2500_GRADE 5.25%                     | INC_DEV_MATH 5.35%
INC_DEV_ENGL 4.83%                        | MARRIED 4.63%
MARRIED 4.62%                             | OFFERED_PELL 4.36%
OFFERED_PELL 3.68%                        | FIRST_GEN_IND 3.17%
SINGLE 3.65%                              | SINGLE 2.88%
FIRST_GEN_IND 3.16%                       | MALE 1.97%
REPORTED_LOW_INCOME 2.15%                 | REPORTED_LOW_INCOME 1.66%
MALE 1.52%                                | INC_DEV_ENGL 0.82%
TERM_RACE_HISPANIC_IND 1.20%              | TERM_RACE_HISPANIC_IND 0.65%
DIVORCED 0.73%                            | DIVORCED 0.64%
AP_CREDIT 0.33%                           | AP_CREDIT 0.52%
VETERAN 0.29%                             | VETERAN 0.25%
TERM_RACE_BLACK_IND 0.27%                 | TERM_RACE_NATIVE_AMERICAN_IND 0.15%
TERM_RACE_NATIVE_AMERICAN_IND 0.26%       | OTHER_MARRIAGE 0.11%
OTHER_MARRIAGE 0.22%                      | TERM_RACE_BLACK_IND 0.10%
TERM_RACE_PACIFIC_ISLANDER_IND 0.12%      | CLEP_CREDIT 0.04%
CLEP_CREDIT 0.06%                         | TERM_RACE_PACIFIC_ISLANDER_IND 0.02%
ROLE_INTERNATIONAL_STUDENT_IND 0.01%      |

Result table for the nursing dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 6d. Adaboost Importance Factors for Nursing Discipline.
Graduated                                 | Program Retention
NRSG_3300_GRADE 16%                       | INC_DEV_MATH 12%
AGE_MATRICULATED 14%                      | SINGLE 10%
STU_HOURS_EARNED_TRANSFER 12%             | NRSG_3300_GRADE 10%
INC_DEV_MATH 12%                          | FEMALE 8%
FEMALE 8%                                 | STU_HS_GPA 8%
STU_HS_GPA 8%                             | MARRIED 6%
INC_DEV_ENGL 6%                           | NRSG_2200_GRADE 6%
MALE 4%                                   | NRSG_3100_GRADE 6%
ACT_COMPOSITE 4%                          | AGE_MATRICULATED 4%
MARRIED 4%                                | MALE 4%
OTHER_MARRIAGE 4%                         | ACT_COMPOSITE 4%
OFFERED_PELL 4%                           | STU_HOURS_EARNED_TRANSFER 4%
DIVORCED 2%                               | NRSG_2500_GRADE 4%
REPORTED_LOW_INCOME 2%                    | VETERAN 2%
                                          | TERM_RACE_HISPANIC_IND 2%
                                          | TERM_RACE_NATIVE_AMERICAN_IND 2%
                                          | DIVORCED 2%
                                          | OFFERED_PELL 2%
                                          | REPORTED_LOW_INCOME 2%
                                          | INC_DEV_ENGL 2%

Result table for the nursing dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 6e. Logistic Regression Importance Factors for Nursing Discipline.

Graduated                                 | Program Retention
INC_DEV_MATH -2.88E+05                    | INC_DEV_MATH -1585.44
INC_DEV_ENGL -815.104                     | NRSG_3300_GRADE 1070.681
NRSG_3300_GRADE 619.3245                  | SINGLE -71.8435
MARRIED 359.829                           | AGE_MATRICULATED -45.1624
SINGLE -331.862                           | NRSG_3100_GRADE 40.3325
NRSG_3100_GRADE 27.22321                  | INC_DEV_ENGL -30.3714
NRSG_2500_GRADE 27.10035                  | STU_HS_GPA -22.4405
NRSG_2200_GRADE -26.7144                  | FEMALE -17.8527
STU_HOURS_EARNED_TRANSFER 17.65846        | NRSG_2200_GRADE -16.8154
FEMALE -16.3778                           | NRSG_2500_GRADE 16.39836

Result table for the nursing dataset using the Logistic Regression classifier. Only the ten most impactful features are included. Negative coefficients indicate an inverse relationship and positive coefficients a direct relationship; a coefficient of 0 would indicate no relationship. As with the elementary education discipline, the developmental math requirement was the highest factor here. Among coursework, the NRSG 3300 course appears to carry the most significance; success in that course in particular seems to go hand in hand with success overall, once any developmental requirements have been met. The trend of single female students having lower graduation and retention rates was also shown in the results.
It should be acknowledged that the gender gap in this dataset is quite wide, as it was with computer science and elementary education: there are roughly 7 female students for every male in this program.

Microbiology Analysis

The microbiology dataset includes 1,401 students who finished at least one term while declared as a Microbiology major. The statistical breakdown for this dataset is as follows:

Table 7a. Statistics of features from Microbiology Dataset.

MICR                   Total   Graduated   Retained
Male                     894         696        387
Female                   507         351        146
Married                  698         587        293
Single                   515         354        178
Divorced                  59          37         23
Dev Math                 382         199         80
Dev Engl                 147          74         38
Veteran                   38          17         12
Black                     16           7          2
Pacific Islander          14           4          4
Native American           25          14          5
Hispanic                  95          63         29
International              4           1          1
First Gen                483         368        149
Course Grade Counts
CHEM_1210_GRADE          924         816        513
CHEM_2310_GRADE          727         686        525
MICR_2054_GRADE          867         768        529
MICR_3053_GRADE          639         599        533
MICR_3154_GRADE          596         568        512
Total Students          1401        1047        533

This table shows how many instances of each feature occurred in the dataset.

Table 7b. Decision Tree Importance Factors for Microbiology Discipline.

Graduation                                | Program Retention
INC_DEV_MATH 36.41%                       | MICR_3053_GRADE 69.98%
CHEM_1210_GRADE 25.55%                    | AGE_MATRICULATED 15.85%
AGE_MATRICULATED 10.31%                   | CHEM_2310_GRADE 6.41%
MARRIED 5.58%                             | MICR_3154_GRADE 5.71%
SINGLE 5.29%                              | FIRST_GEN_IND 1.11%
STU_HS_GPA 4.67%                          | SINGLE 0.47%
MICR_3154_GRADE 3.77%                     | OTHER_MARRIAGE 0.46%
ACT_COMPOSITE 2.35%                       |
MALE 2.10%                                |
MICR_3053_GRADE 2.04%                     |
TERM_RACE_PACIFIC_ISLANDER_IND 0.90%      |
FEMALE 0.90%                              |
INC_DEV_ENGL 0.14%                        |

Result table for the microbiology dataset using the Decision Tree classifier. Only features with non-zero values are included.

Table 7c. Random Forest Importance Factors for Microbiology Discipline.
Graduated                                 | Program Retention
CHEM_1210_GRADE 10.13%                    | MICR_3053_GRADE 29.44%
INC_DEV_MATH 8.72%                        | MICR_3154_GRADE 17.77%
MICR_2054_GRADE 8.34%                     | MICR_2054_GRADE 8.13%
AGE_MATRICULATED 8.23%                    | AGE_MATRICULATED 6.07%
MICR_3053_GRADE 8.09%                     | CHEM_2310_GRADE 5.30%
MICR_3154_GRADE 5.87%                     | CHEM_1210_GRADE 5.18%
STU_HS_GPA 5.81%                          | STU_HS_GPA 4.18%
CHEM_2310_GRADE 5.66%                     | ACT_COMPOSITE 3.79%
ACT_COMPOSITE 5.27%                       | OFFERED_PELL 3.45%
MARRIED 4.47%                             | MALE 2.71%
OFFERED_PELL 4.10%                        | STU_HOURS_EARNED_TRANSFER 2.67%
MALE 4.05%                                | FEMALE 2.18%
SINGLE 3.91%                              | FIRST_GEN_IND 2.11%
STU_HOURS_EARNED_TRANSFER 3.55%           | INC_DEV_MATH 1.74%
FEMALE 3.27%                              | SINGLE 1.31%
FIRST_GEN_IND 2.34%                       | MARRIED 1.21%
INC_DEV_ENGL 2.23%                        | REPORTED_LOW_INCOME 0.68%
REPORTED_LOW_INCOME 1.52%                 | DIVORCED 0.59%
TERM_RACE_HISPANIC_IND 0.84%              | INC_DEV_ENGL 0.43%
VETERAN 0.82%                             | AP_CREDIT 0.43%
AP_CREDIT 0.67%                           | TERM_RACE_HISPANIC_IND 0.26%
DIVORCED 0.57%                            | OTHER_MARRIAGE 0.10%
OTHER_MARRIAGE 0.56%                      | VETERAN 0.07%
TERM_RACE_NATIVE_AMERICAN_IND 0.39%       | CLEP_CREDIT 0.06%
TERM_RACE_BLACK_IND 0.31%                 | TERM_RACE_PACIFIC_ISLANDER_IND 0.05%
TERM_RACE_PACIFIC_ISLANDER_IND 0.18%      | TERM_RACE_NATIVE_AMERICAN_IND 0.04%
ROLE_INTERNATIONAL_STUDENT_IND 0.05%      | ROLE_INTERNATIONAL_STUDENT_IND 0.03%
IB_CREDIT 0.02%                           | TERM_RACE_BLACK_IND 0.02%
CLEP_CREDIT 0.01%                         |

Result table for the microbiology dataset using the Random Forest classifier. Only features with non-zero values are included.

Table 7d. Adaboost Importance Factors for Microbiology Discipline.
Graduated                               Program Retention
MICR_3154_GRADE                    14%  MICR_3154_GRADE             20%
MICR_3053_GRADE                    12%  MICR_3053_GRADE             14%
CHEM_2310_GRADE                    10%  AGE_MATRICULATED            12%
AGE_MATRICULATED                    8%  MALE                        10%
STU_HS_GPA                          8%  MICR_2054_GRADE              8%
ACT_COMPOSITE                       8%  FEMALE                       6%
INC_DEV_MATH                        6%  STU_HS_GPA                   6%
CHEM_1210_GRADE                     6%  ACT_COMPOSITE                6%
FIRST_GEN_IND                       4%  CHEM_1210_GRADE              4%
OFFERED_PELL                        4%  CHEM_2310_GRADE              4%
MICR_2054_GRADE                     4%  TERM_RACE_HISPANIC_IND       2%
VETERAN                             2%  STU_HOURS_EARNED_TRANSFER    2%
TERM_RACE_PACIFIC_ISLANDER_IND      2%  DIVORCED                     2%
TERM_RACE_HISPANIC_IND              2%  MARRIED                      2%
FEMALE                              2%  OFFERED_PELL                 2%
MALE                                2%
MARRIED                             2%
OTHER_MARRIAGE                      2%
INC_DEV_ENGL                        2%

Result table for the microbiology dataset using the Adaboost classifier. Only features with non-zero values are included.

Table 7e. Logistic Regression Importance Factors for Microbiology Discipline.

Graduated                          Program Retention
AGE_MATRICULATED       -57.1476    MICR_3053_GRADE            10697.47
MICR_2054_GRADE         55.21896   MICR_3154_GRADE             2172.838
CHEM_1210_GRADE         34.16446   AGE_MATRICULATED           -1457.22
MICR_3053_GRADE         19.76172   MICR_2054_GRADE               21.46865
MICR_3154_GRADE         10.47177   CHEM_2310_GRADE                5.388148
INC_DEV_MATH            -5.83623   CHEM_1210_GRADE                5.304366
CHEM_2310_GRADE          3.401559  STU_HS_GPA                    -5.14329
MARRIED                  3.131125  ACT_COMPOSITE                 -2.42237
SINGLE                  -2.17146   INC_DEV_MATH                  -2.3384
INC_DEV_ENGL            -2.16964   STU_HOURS_EARNED_TRANSFER      2.097548

Result table for the microbiology dataset using the Logistic Regression classifier. Only the ten features with the largest values by magnitude are included. Negative coefficients indicate an inverse relationship with the outcome, positive coefficients indicate a direct relationship, and a coefficient of zero indicates no relationship.

In the microbiology discipline, coursework stands out as the top factor. The preliminary chemistry courses seem to take precedence over the developmental math requirement for this discipline. Age also seems to have more of an impact than in the other disciplines.
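The rankings in Tables 7b through 7e can be generated directly from fitted scikit-learn models (the library used in this study; see reference 23). The following is a minimal sketch on synthetic data: the feature names are placeholders rather than the study's actual columns, and the printed values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student feature matrix; names are placeholders.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
names = ["FEAT_A", "FEAT_B", "FEAT_C", "FEAT_D", "FEAT_E"]

# Tree-based models expose feature_importances_ (fractions summing to 1),
# which is how the percentages in Tables 7b-7d are produced.
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=100, random_state=0),
              AdaBoostClassifier(n_estimators=50, random_state=0)):
    model.fit(X, y)
    ranked = sorted(zip(names, model.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    print(type(model).__name__)
    for name, imp in ranked:
        if imp > 0:  # the tables list only non-zero importances
            print(f"  {name:8s} {imp:6.2%}")

# Logistic regression exposes signed coefficients (as in Table 7e): rank by
# absolute magnitude; the sign gives the direction of the relationship.
lr = LogisticRegression(max_iter=1000).fit(X, y)
coefs = lr.coef_[0]
for i in np.argsort(-np.abs(coefs)):
    direction = "direct" if coefs[i] > 0 else "inverse"
    print(f"{names[i]:8s} {coefs[i]:+.4f} ({direction})")
```

This reproduces only the mechanics of the rankings; the study's actual values depend on its real feature matrix and preprocessing.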
Interpretations of Results

Caution should be exercised in drawing specific conclusions from the data and information in these analyses. While the objective data presented can show that specific features impact graduation and retention rates, it does not include details as to why the impact occurred. Preconceived notions from social contexts must be bridled. For example, the fact that race and ethnicity showed little to no impact on overall graduation rates does not mean that disparities in the personal academic experiences and struggles of individuals who belong to a racial or ethnic minority do not exist. We can only infer from the statistical information that there were perhaps too few instances to impact graduation rates directly in this particular subset of student data.

The data represented here should be examined as if it were a blip detected on a radar scan. A feature that shows significant impact indicates a general area where a pain point or important factor exists, but it does not describe the nature of the impact. Deeper, more specific investigations and qualitative studies may be merited to uncover the reasons behind the impact on graduation rates.

As another example, showing that single students graduate at a lower rate than married students does not necessarily mean that single students struggle more in their schooling. One could just as easily suppose that single students are more likely to move on to another university because they lack the familial factors that keep married students in place. Again, only a deeper investigation would reveal the specific factors; they cannot be concluded from the evidence in this study.

Here are some concrete conclusions that may be drawn from the data.

a) Having to take remedial math or English courses has a large negative impact on graduation rates.
The significance of the remedial math requirement in particular cannot be overstated. It was consistently one of the strongest factors in the graduation outcome. Even in disciplines with fewer math course requirements, math proves a stumbling block. It is worth noting that the educator-specific math requirement (Math 2020) was, by a wide margin, the greatest factor impacting retention in the Elementary Education discipline. In general, a greater focus on improving math preparedness and skills should significantly improve overall graduation rates at the college level.

b) Course grades from particular courses are a common and high-impact indicator of both graduation and retention rates.

In most cases, course grades were among the most significant factors in determining the graduation outcome. Generally, the better the grade, the more likely the student is to graduate and stay with a particular program. This is an intuitive, common-knowledge conclusion, but here we see objective evidence to back it up. Specific curriculum studies using similar machine learning techniques on the entire set of courses in a discipline could assist in designing curriculum that consistently leads to higher graduation and retention rates for that program.

c) Marital status is a significant indicator of eventual graduation. In particular, being single has a negative impact while being married has a positive impact.

Again, the specific reasons or causes behind this cannot be concluded here. It could be supposed from cultural contexts that single students struggle due to the lack of a close social support system, or are more volatile and likely to move on or change course; or perhaps married students are more committed and motivated to complete their degree to support their families. More specific studies would have to be conducted before concluding any specific reasons.
d) Age is a significant factor.

It was shown that as the student's age at matriculation goes up, the likelihood of graduating goes down. This held true especially for the scientific and more math-oriented disciplines; the impact of age was greater there than in the humanities.

e) A higher high school GPA indicates a lower likelihood of staying with a program.

In every case of retention to a particular discipline, high school GPA had a negative impact on staying in that discipline. Curiously, it was not a factor in determining the graduation outcome, but it was significant in determining whether students stuck with a particular program. Again, specifics cannot be determined from these results, but it can be suggested that a higher GPA means more volatility. Perhaps students with higher GPAs are more likely to explore multiple fields of study before settling on one in particular; higher high school academic success could indicate that the student feels they have more options and freedom at the university level.

f) Gender can impact outcomes.

In many of the datasets, being male showed a significant impact on both graduating and staying with a particular discipline. For the combined dataset, the impact of being male was negative. It is worth noting that in that dataset there were more students identifying as female than as male (roughly 12,000 female to 8,000 male), so the female student population held the majority. Conversely, being female had a negative impact in the Nursing discipline, where females are a much greater majority; the gender gap in Nursing was very wide (roughly 7,000 females to 1,000 males). However, this did not hold true in the education dataset, where the gender gap was even greater. Despite the wider gender gap in the education discipline, the impact of gender was not as high as it was in nursing.
It is reasonable to suggest that overall, gender does play a significant role, especially when noticeable gender gaps are present.

Conclusion

As shown by the results and tables in this study, the use of machine learning on datasets of university students shows promise in finding indicators that can help predict graduation and retention with a reasonably high degree of accuracy. Early intervention programs or other student outreach efforts may find data such as this useful in deciding how to support students in the most meaningful ways.

The results of this study show promise, but they are not perfect. Machine learning algorithms themselves have their flaws and weaknesses. And, as with all computing practices, the output can only be as high in quality as the input. While machine learning can provide a metaphorical microscope for examining student data and its relationships with graduation and retention outcomes, it is not an oracle that produces exact meaning and guidance. Results must be taken with some healthy skepticism and measured against common intuition, but they can lead to more informed decision making and encourage further examination of meaningful areas of improvement.

I would add my voice to that of Dursun Delen: the richness and quality of the data on which these methods are employed correlate directly with the quality of the results, and the results of such a study can themselves provide guidance as to where and how we can improve the gathering and organization of student data so that this method yields more meaningful results. [2, sec 5]

It was useful to find that there is a disparity in the results across different disciplines of university study. Not all programs are created equal, and different disciplines do show some significant differences in the importance of the factors measured in this study.
The results, however, represent the student body from only a limited set of disciplines, and may not generalize to the entire university. Certain factors, such as the Developmental Math requirement, did stand out consistently across the entire result set. The results for the individual disciplines revealed some helpful indicators that can inform decisions to improve student success in those particular areas. Elementary Education is a good example: the unique math courses required for that discipline proved to be a highly significant factor in determining both retention and graduation.

This study covered only a few of the many different learning disciplines, and only one particular university. The study can certainly be expanded to include other learning disciplines, and it need not be limited to the university level of academics, or even to education systems. Future work could include such an expansion. There is also value in applying these techniques to specific disciplines for more in-depth analysis, where areas need or desire improvement and objective information is needed to guide those decisions.

All things considered, machine learning techniques applied across a variety of disciplines can be an effective method for finding useful information to guide decisions that increase graduation and retention rates, as well as for providing insight into how particular factors affect those rates. The goal of finding objective information to give an objective analysis can be met in this way.

Appendix A
Features and Descriptions

VETERAN: Indicates whether a student has any veteran status on record at WSU.
TERM_RACE_BLACK_IND: Indicates a student has identified as Black.
TERM_RACE_PACIFIC_ISLANDER_IND: Indicates a student has identified as belonging to a Pacific Islander race.
TERM_RACE_HISPANIC_IND: Indicates a student has identified as Hispanic.
TERM_RACE_NATIVE_AMERICAN_IND: Indicates a student has identified as Native American.
ROLE_INTERNATIONAL_STUDENT_IND: Indicates international student status, foreign exchange or otherwise.
FIRST_GEN_IND: Indicates a student is a first-generation college student, meaning neither of the student's parents has completed any form of post-secondary education program.
FEMALE: Indicates the student identifies as female.
MALE: Indicates the student identifies as male.
STU_HS_GPA: The student's final GPA as recorded on their high school or primary educational institution's official transcript. Range of 0.0 to 4.0.
ACT_COMPOSITE: The student's highest ACT composite test score. Range of 0 to 36.
STU_HOURS_EARNED_TRANSFER: Higher education credits a student transferred from another institution or earned prior to attending WSU.
DIVORCED: Indicates a student's marital status is Divorced.
MARRIED: Indicates a student's marital status is Married.
SINGLE: Indicates a student's marital status is Single.
OTHER_MARRIAGE: Indicates a student's marital status is unknown or belongs to a less common category such as separated, life-partner, widowed, etc.
AP_CREDIT: Total credits a student earned taking Advanced Placement courses.
CLEP_CREDIT: Total credits a student earned taking College Level Examination Program tests.
IB_CREDIT: Total credits a student earned taking International Baccalaureate courses.
OFFERED_PELL: Indicates a student qualified for financial aid and was offered a Pell grant as part of financial assistance from WSU. This does not indicate whether the assistance was accepted or rejected.
REPORTED_LOW_INCOME: The student was reported as having low-income status on their financial aid application.
INC_DEV_MATH: Indicates the student was required to take remedial math courses (courses below the 1000 level) before being able to register for college-level math courses.
INC_DEV_ENGL: Indicates the student was required to take remedial English courses (courses below the 1000 level).
<course>_XXXX_GRADE: The latest grade a student received in the given course at WSU. A numeric value between 0 and 15, with 15 indicating the highest grade of an 'A', 14 an 'A-', and so on. Unknown or missing grades used the middle value of 7.5. For the combined dataset, a generic name of COURSE# is used for the corresponding course from each specific discipline.

Appendix B
Sample of Random Forest decision tree

Appendix C
Numeric results from algorithm comparison

Table 8. Prediction Accuracy of Machine Learning Algorithms (Graduation).

Graduated   COMBINED  BUSIADMIN  COMPSCI  ELEMEDUC  NURSING  MICROBIO
DT            0.7392     0.6575   0.7364    0.7696   0.7367    0.7723
LR            0.7377     0.7060   0.7372    0.7294   0.7501    0.7730
ADA           0.7607     0.7027   0.7356    0.7820   0.7191    0.7816
RF            0.7588     0.7244   0.8595    0.8362   0.7203    0.8344
MC            0.6644     0.6275   0.5792    0.7381   0.6674    0.7473

Prediction accuracies by algorithm. Numbers in bold are the most accurate from each discipline. The MC control classifier was the least accurate in nearly every case, which indicates that knowledge is gained by the use of each classifier. The only exception was the Logistic Regression classifier for the Elementary Education discipline.

Table 9. Prediction Accuracy of Machine Learning Algorithms (Program Retention).

Prog Ret    COMBINED  BUSIADMIN  COMPSCI  ELEMEDUC  NURSING  MICROBIO
DT            0.7806     0.8323   0.8911    0.8315   0.7428    0.8680
LR            0.7882     0.8435   0.8891    0.8228   0.7791    0.8930
ADA           0.7824     0.8351   0.9208    0.8564   0.7497    0.9073
RF            0.7735     0.9066   0.9525    0.9206   0.7810    0.9408
MC            0.7373     0.8321   0.8826    0.6260   0.7060    0.6196

Prediction accuracies by algorithm. Numbers in bold are the most accurate from each discipline. In each case the classifiers showed a higher accuracy than the control, meaning some knowledge was gained in applying the algorithm.
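The MC control rows in Tables 8 and 9 correspond to a majority-class predictor. One way to implement such a control (an assumption for illustration; the thesis does not name a specific implementation) is scikit-learn's DummyClassifier. A minimal sketch on synthetic, imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the student records
# (roughly 70% of samples in the majority class).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.7],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Majority-class control: always predicts the most frequent training label.
mc = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mc_acc = mc.score(X_te, y_te)
rf_acc = rf.score(X_te, y_te)
print(f"MC baseline accuracy: {mc_acc:.4f}  RF accuracy: {rf_acc:.4f}")
```

If a trained classifier does not beat this baseline, its accuracy number reflects class imbalance rather than learned signal, which is how the MC rows in the tables above are used.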
Appendix D
Comparison of results from additional preprocessing

Table 10: Comparison of Accuracy from Additional Preprocessing.

BUSINESS ADMINISTRATION
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT   0.657530841  0.550956492  0.604243371  0.83227119   0.829203433   0.826351402
LR   0.705987106  0.635824048  0.700284035  0.843458043  0.831616146   0.838414103
ADA  0.702697392  0.553367908  0.562358847  0.835122067  0.810573463   0.825694228
RF   0.724387022  0.581218064  0.593715325  0.906602487  0.875461665   0.885107472
MC   0.627493947  0.627494031  0.627494031  0.832054534  0.832054454   0.832054454

COMPUTER SCIENCE
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT   0.73640314   0.599894537  0.720153887  0.891143489  0.880985909   0.891142073
LR   0.737187289  0.639305131  0.731917253  0.889110144  0.882615886   0.886269471
ADA  0.735580237  0.648240298  0.71325025   0.920805092  0.886677461   0.916734005
RF   0.859450189  0.694126734  0.78147609   0.952477778  0.921611063   0.941103453
MC   0.579203978  0.579203925  0.579203925  0.882615973  0.882615886   0.882615886

EDUCATION
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT   0.76955083   0.622202443  0.610990398  0.831512139  0.732791725   0.862039146
LR   0.729372151  0.738399202  0.729056025  0.822792304  0.632822558   0.819997033
ADA  0.781998324  0.604443397  0.64181843   0.856419238  0.667084653   0.810630587
RF   0.836188427  0.634970956  0.649610665  0.920581096  0.801915699   0.844888028
MC   0.738088109  0.738087966  0.738087966  0.62597346   0.625973338   0.625973338

NURSING
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT   0.736735125  0.589809531  0.59753425   0.742840699  0.67330414    0.75056017
LR   0.750092704  0.731832838  0.743538438  0.779123472  0.701042493   0.776429522
ADA  0.719059861  0.58548246   0.596719566  0.749741416  0.656801372   0.714853683
RF   0.720344423  0.585949679  0.628331214  0.781001736  0.709938364   0.735922209
MC   0.667447033  0.667447024  0.667447024  0.705958125  0.7059581     0.7059581

MICROBIOLOGY
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT
     0.772310625  0.578158458  0.640256959  0.867994408  0.655246253   0.805139186
LR   0.773029995  0.751605996  0.769450393  0.892963904  0.648108494   0.883654532
ADA  0.781601423  0.66095646   0.663811563  0.907267412  0.680228408   0.826552463
RF   0.834425521  0.620985011  0.755174875  0.940780376  0.774446824   0.908636688
MC   0.747323335  0.74732334   0.74732334   0.619557702  0.619557459   0.619557459

COMBINED
     Graduated    Weighted     OH -1        Grad_in_pos  Weighted pos  OH -1 pos
DT   0.739219834  0.71280698   0.703736743  0.780554391  0.502094385   0.693081574
LR   0.737733226  0.707554967  0.735404896  0.788233654  0.730003697   0.775696303
ADA  0.760679105  0.686941111  0.714094558  0.782387975  0.518104802   0.688026564
RF   0.758846049  0.698378055  0.720438101  0.773467909  0.543526637   0.696699376
MC   0.664436515  0.664436515  0.664436515  0.737288137  0.737288137   0.737288136

The comparison shows, for each dataset, that the original preprocessing performed best when compared against the same function with additional preprocessing. Original results are the Graduated and Grad_in_pos columns; the additional results are to their right. Weighted and Weighted pos are the results when missing course grade values were replaced with weighted averages for those fields. OH -1 and OH -1 pos are the results when inferred fields were removed after one-hot encoding.

References

1. Angra, S., Ahuja, S., "Machine learning and its applications: A review," in International Conference on Big Data Analytics and Computational Intelligence (ICBDAC), Pages 57-60, IEEE.org, https://ieeexplore.ieee.org/document/8070809, 2017.

2. Ball, R., Duhadway, L., Feuz, K., Jensen, J., Rague, B., & Weidman, D., "Applying Machine Learning to Improve Curriculum Design," in Proceedings of the 50th ACM Technical Symposium on Computer Science Education (pp. 787-793), February 2019.

3.
Delen, D., "A comparative analysis of machine learning techniques for student retention management," in Decision Support Systems, Volume 49, Issue 4, Pages 498-506, https://doi.org/10.1016/j.dss.2010.06.003 (accessed Sept 30, 2020), November 2010.

4. Delen, D., "Predicting Student Attrition with Data Mining Methods," in Journal of College Student Retention: Research, Theory & Practice, Sage Journals, https://doi.org/10.2190/CS.13.1.b, August 2011.

5. Rogulkin, D., "Predicting 6-Year Graduation and High-Achieving and At-Risk Students," in California State University Fresno document archives, http://www.csufresno.edu/academics/oie/documents/documents-research/2011/data%20mining%20report1.pdf (accessed Sept 30, 2020), May 2011.

6. Zohair, L., "Prediction of Student's performance by modelling small dataset size," in International Journal of Educational Technology in Higher Education, Volume 16, Article 27, https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-019-0160-3, August 2019.

7. Jia, JW., Mareboyana, M., "Machine learning algorithms and predictive models for undergraduate student retention," in Proceedings of the World Congress on Engineering and Computer Science 2013, Vol I, https://d1wqtxts1xzle7.cloudfront.net/34349420/WCECS2013_pp222-227.pdf?1407058670=&response-content-disposition=inline%3B+filename%3DMachine_Learning_Algorithms_and_Predicti.pdf&Expires=1605466532&Signature=OflYsZtlmEpe95CSHnjnJ913vuG2KEq78ekeBSeynho2z8UlKg5pfvH4y9Z~Fhy~a2KUCVlUf1Wg-Vf2ioZRzPmrawUaIcMR2EmNE1E~szdrKDAnzi-O6dNYxznwVa5CbFkZFAeIWoSvoHpFQzfZu4~UL~5jnBi7RfvXjPyTRlJ6jtqMzcSFcxzfBdBwdiUFJrjUd7GgtaqQu26l9WIJYM3D-YfwQjRPmMzcki6jlX3VaR8f4hM9zPA8CWlf0-U7GdPyjPomXAjStQilmGOAOTnd1IUT6BVUZv~5c6m4bnpHTOrpjCjC0G1ghHUkFDmeCKPNGrLZergJpoDZmRVzmA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA, Oct 2013.

8.
Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, M., Loumos, V., "Dropout prediction in e-learning courses through the combination of machine learning techniques," in Computers & Education, Vol 53, Issue 3, Pages 950-965, sciencedirect.com, https://doi.org/10.1016/j.compedu.2009.05.010, November 2009.

9. Aulck, L., Velagapudi, N., Blumenstock, J., West, J., "Predicting Student Dropout in Higher Education," presented at the ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, Cornell University, https://arxiv.org/abs/1606.06364, June 2016.

10. Yadav, S., Bharadwaj, B., Pal, S., "Mining Education Data to Predict Student's Retention: A comparative Study," in International Journal of Computer Science and Information Security, Vol. 10, pp. 113-117, Cornell University, https://arxiv.org/abs/1203.2987, March 2012.

11. Chai, K., Gibson, D., "Predicting the Risk of Attrition for Undergraduate Students with Time Based Modelling," presented at the International Association for Development of the Information Society, https://eric.ed.gov/?id=ED562154, October 2015.

12. Herzog, S., "Estimating Student Retention and Degree-Completion Time: Decision Trees and Neural Networks Vis-à-Vis Regression," in New Directions for Institutional Research, http://www.fisme.science.uu.nl/staff/christianb/downloads/work_by_claudia/Articles%20about%20data%20mining%20and%20log%20files/Herzog.pdf, 2006.

13. Lin, SH., "Data Mining For Student Retention Management," in Journal of Computing Sciences in Colleges, researchgate.net, https://www.researchgate.net/profile/Linda_Werner/publication/262368922_Know_your_students_to_increase_diversity_results_of_a_study_of_community_college_women_and_men_in_computer_science_courses/links/564deaed08aeafc2aab0b94c/Know-your-students-to-increase-diversity-results-of-a-study-of-community-college-women-and-men-in-computer-science-courses.pdf#page=103, 2012.

14. Zhang, Y., Oussena, S., Clark, T.
and Hyensook, K., "Using data mining to improve student retention in HE: a case study," in ICEIS - 12th International Conference on Enterprise Information Systems, https://eprints.mdx.ac.uk/5808/, June 2010.

15. Nandeshwar, A., Menzies, T., Nelson, A., "Learning Patterns of university student retention," in Expert Systems with Applications, Vol 38, Issue 12, Pages 14984-14996, sciencedirect.com, https://doi.org/10.1016/j.eswa.2011.05.048, Nov-Dec 2011.

16. Lauría, E., Baron, J., Devireddy, M., Sundararaju, V., Jayaprakash, S., "Mining academic data to improve college student retention: An open source perspective," in Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Pages 139-142, ACM digital library, https://doi.org/10.1145/2330601.2330637, April 2012.

17. de Freitas, S., Gibson, D., Du Plessis, C., Halloran, P., Williams, E., Ambrose, M., Dunwell, I., Arnab, S., "Foundations of dynamic learning analytics: Using university student data to increase retention," in British Journal of Educational Technology, Vol 46, Issue 6, British Education Research Association, https://doi.org/10.1111/bjet.12212, October 2014.

18. Alkhasawneh, R., Hobson Graves, R., "Developing a Hybrid Model to Predict Student First Year Retention in STEM Disciplines Using Machine Learning Techniques," in Journal of STEM Education: Innovations and Research, jstem.org, https://www.jstem.org/jstem/index.php/JSTEM/article/download/1805/1627, 2014.

19. Alkhasawneh, R., Hobson, R., "Modeling student retention in science and engineering disciplines using neural networks," IEEE Global Engineering Education Conference (EDUCON), Pages 660-663, https://ieeexplore.ieee.org/document/5773209, 2011.

20.
Yu, CH., DiGangi, S., Jannasch-Pennell, A., Kaprolet, C., "A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year," in Journal of Data Science 8, Pages 307-325, researchgate.net, https://www.researchgate.net/profile/Chong_Ho_Yu/publication/228684382_A_Data_Mining_Approach_for_Identifying_Predictors_of_Student_Retention_from_Sophomore_to_Junior_Year/links/55810ecf08aed40dd8cd39d5/A-Data-Mining-Approach-for-Identifying-Predictors-of-Student-Retention-from-Sophomore-to-Junior-Year.pdf, 2010.

21. Chen, Y., Johri, A., Rangwala, H., "Running out of STEM: a comparative study across STEM majors of college students at-risk of dropping out early," in Proceedings of the 8th International Conference on Learning Analytics and Knowledge, Pages 270-279, ACM digital library, https://doi.org/10.1145/3170358.3170410, March 2018.

22. Huo, H., Cui, J., Hein, S., Padgett, Z., Ossolinski, M., Raim, R., Zhang, J., "Predicting Dropout for Nontraditional Undergraduate Students: A Machine Learning Approach," in Journal of College Student Retention: Research, Theory & Practice, Sage Journals, https://doi.org/10.1177/1521025120963821, October 2020.

23. SKLearn Python library, version 0.23.2, accessed May 2020, scikit-learn.org, [online], Available: https://scikit-learn.org/stable/user_guide.html.

24. Chowdary, D., "Decision Trees Explained With A Practical Example," Online, https://towardsai.net/p/programming/decision-trees-explained-with-a-practical-example-fe47872d3b53, May 2020.

25. Kambria.io, "Logistic Regression For Machine Learning and Classification," Online, https://kambria.io/blog/logistic-regression-for-machine-learning/, July 2019.

26. Kurama, V., "A Guide to AdaBoost: Boosting to Save the Day," Online, https://blog.paperspace.com/adaboost-optimizer/, February 2020.

27.
Donges, N., "A Complete Guide to the Random Forest Algorithm," Online, https://builtin.com/data-science/random-forest-algorithm, June 2019 (updated September 2020).

28. Bhattacharyya, S., "'Logit' of Logistic Regression: Understanding the Fundamentals," Online, https://towardsdatascience.com/logit-of-logistic-regression-understanding-the-fundamentals-f384152a33d1, October 2018.

29. Dietterich, T., "Ensemble Methods in Machine Learning," in Multiple Classifier Systems, Lecture Notes in Computer Science vol 1857, https://doi.org/10.1007/3-540-45014-9_1, 2000.

30. Kong, E. and Dietterich, T., "Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms," in Oregon State University course notes, http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf, 1995.
|
Format | application/pdf |
ARK | ark:/87278/s6m5ptvm |
Setname | wsu_smt |
ID | 96829 |
Reference URL | https://digital.weber.edu/ark:/87278/s6m5ptvm |