USING DATA MINING TO PREDICT FRESHMEN OUTCOMES
Nora Galambos, PhD
Senior Data Scientist
Office of Institutional Research, Planning & Effectiveness
A general data mining diagram for running a modeling method with k-fold cross-validation can be seen in Figure 5. Filters can be applied to select the proper groups for the validation and training sets for each fold; the training and validation sets are then sent to the modeling nodes, where the same modeling method is run for each of the five training sets. The model is then applied to each validation set to calculate the error. A model comparison node provides the relevant model evaluation statistics for each of the five folds.
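The workflow above, splitting the data into five folds, fitting the same method on each training portion, and scoring its held-out fold, was run in SAS Enterprise Miner; a minimal Python/scikit-learn sketch of the same idea is shown below (the predictors and outcome here are simulated stand-ins, not the study's data):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # stand-in predictors
y = 0.5 * X[:, 0] + rng.normal(size=200)   # stand-in outcome (e.g., GPA)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_ase, valid_ase = [], []
for train_idx, valid_idx in kf.split(X):
    model = DecisionTreeRegressor(max_depth=4).fit(X[train_idx], y[train_idx])
    # ASE is the mean of the squared prediction errors
    train_ase.append(mean_squared_error(y[train_idx], model.predict(X[train_idx])))
    valid_ase.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))

# Average the five validation ASEs and compare with the training ASEs
# to check for overfitting, as described in the text.
print(np.mean(train_ase), np.mean(valid_ase))
```

A large gap between the two averages signals overfitting; a small gap with a low validation ASE is what the model comparison step looks for.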
The five methods used to develop predictive models were CHAID² (chi-square automatic interaction detection), BFOS-CART (the classification and regression tree method; Breiman, Friedman, Olshen, and Stone, 1984), a general decision tree, gradient boosting, and linear regression. Each model was developed to predict the first-semester GPA of the first-time, full-time fall 2014 freshmen cohort. The average squared errors (ASE) of the five validation samples for each method were averaged and compared with the average errors of the training samples to check for overfitting and to find the method with the smallest error.

2. The CHAID and CART methods have been closely approximated by using Enterprise Miner settings. SAS Institute Inc. 2014. SAS® Enterprise Miner™ 13.2: Reference Help. Cary, NC: SAS Institute Inc. pp. 755-758.
Figure 5. A general data-mining diagram for running 5-fold cross-validation to evaluate the accuracy of a model.
With the exception of linear regression, the methods tested were decision-tree-based. The CART method begins with an exhaustive search for the best binary split: it splits categorical predictors into a smaller number of groups or finds the optimal cut point in numerical measures. Each resulting node is again split in two until no further splits are possible. The result is a tree of maximum possible size, which is then pruned back algorithmically. For interval targets the variance is used to assess the splits; for nominal targets the Gini impurity measure is used. Pruning starts with the split that contributes least to the model, and missing data is assigned to the largest node of a split. This method creates a set of nested binary decision rules to predict an outcome.
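CART's grow-then-prune strategy can be approximated outside Enterprise Miner; scikit-learn's decision tree is in the same CART family (variance-based splits for interval targets, cost-complexity pruning), though its missing-value handling differs. A hedged sketch on simulated data (the predictor name and pruning penalty are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
hs_gpa = rng.uniform(75, 100, size=300)            # hypothetical predictor
gpa = 0.03 * hs_gpa + rng.normal(scale=0.3, size=300)

# Grow a large tree, then prune it back via cost-complexity pruning,
# mirroring CART's grow-to-maximum-size-then-prune approach.
# Splits on an interval target are judged by the reduction in variance.
tree = DecisionTreeRegressor(ccp_alpha=0.001).fit(hs_gpa.reshape(-1, 1), gpa)
print(export_text(tree, feature_names=["hs_gpa"]))
```

The printed rules are the "nested binary decision rules" the text describes: each path from root to leaf is one rule predicting an average outcome.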
Unlike CART with binary splits evaluated by the variance or misclassification measures, the CHAID algorithm uses the chi-square test (or the F test for interval targets) to determine significant splits and finds independent variables with the strongest association with the outcome. A Bonferroni correction to the p-value is applied prior to the split. CHAID may find multiple splits in continuous variables, and allows splitting of categorical data into more than two categories. This may result in very wide trees with numerous nodes at the first level. As with CART, CHAID allows different predictors for different sides of a split. The CHAID algorithm will halt when statistically significant splits are no longer found in the data.
The software was also configured to run a general decision tree that does not conform to or approximate mainstream methods found in the literature. To control the large number of nodes at each level, the model was restricted to at most four-way splits (4 branches), as opposed to CHAID, which finds and utilizes all significant splits, and CART, which splits each node in two.
The F test was used to evaluate the variance of the nodes, and the depth of the overall tree was restricted to 6 levels. Missing values were assigned to produce an optimal split, with the ASE used to evaluate the subtrees. The software's cross-validation option was selected in order to perform the cross-validation procedure for each subtree, producing a sequence of estimates that uses the cross-validation method explained earlier to select the optimal subtree.
The final tree method was gradient boosting, which uses a partitioning algorithm developed by Jerome Friedman. At each iteration a random sample is drawn without replacement from the training data set and used to update the model. The successive resampling results in a weighted average of the re-sampled data, with the weights assigned at each iteration improving the accuracy of the predictions. The result is a series of decision trees, each one adjusted with new weights to improve the accuracy of the estimates or to correct the misclassifications of the previous tree. Because the results at each stage are weighted and combined into a final model, there is no resulting tree diagram; however, the scoring code that is generated allows the model to be used to score new data for predicting outcomes.
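Friedman's stochastic variant of gradient boosting, subsampling without replacement at each iteration, is available in scikit-learn by setting `subsample` below 1.0. A minimal sketch on simulated data (all names and parameter values here are illustrative, not the study's configuration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)

# subsample < 1.0 gives the stochastic variant: each of the 200 small
# trees is fit to a fresh sample drawn without replacement, and each new
# tree corrects the residual errors of the accumulated ensemble.
gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                               subsample=0.5, max_depth=2,
                               random_state=0).fit(X, y)

# There is no single tree diagram to display, but the fitted model can
# still score new observations, which is what the generated scoring
# code is used for.
preds = gb.predict(X[:5])
print(preds)
```

This illustrates the trade-off noted in the Results below: the ensemble predicts well, but no single tree diagram exists to visualize or explain.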
The final method tested was linear regression. The discussion that follows highlights some of the difficulties in implementing linear regression in a data mining environment.
Decision tree methods are able to handle missing values by combining them with another category or using surrogate rules to replace them. Linear regression, on the other hand, will listwise delete observations with missing values. Data in this study was obtained from multiple campus sources, and as such, many students did not have any records for some predictors. For example, students who did not apply for financial aid will have missing data on financial aid measures, a small percentage of the entering freshmen do not have SAT scores, and some students may not have courses utilizing the LMS. These measures result in an excessive amount of data being listwise deleted. The software has an imputation node that can be configured to impute missing data. For this study the distribution method was used, whereby replacement values are calculated from random percentiles of the distributions of the predictors. There are many imputation methods, and a thorough study of missingness for such a large number of variables is very time consuming. If the linear regression method appeared promising, other imputation methods would be explored and studied in greater detail. A further concern in the linear regression analysis was multicollinearity, which can also take time to address thoroughly.
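The distribution-based imputation described above can be approximated in a few lines: each missing value is replaced by the value at a randomly drawn percentile of the predictor's observed distribution. This is a hedged sketch of the idea, not Enterprise Miner's exact implementation, and the SAT values are invented:

```python
import numpy as np

def impute_from_distribution(x, rng):
    """Replace NaNs with values drawn at random percentiles of the
    observed (non-missing) distribution of the predictor."""
    x = x.astype(float).copy()
    observed = x[~np.isnan(x)]
    n_missing = int(np.isnan(x).sum())
    # Draw a random percentile in [0, 100) for each missing value and
    # map it onto the empirical distribution of the observed data.
    pct = rng.uniform(0, 100, size=n_missing)
    x[np.isnan(x)] = np.percentile(observed, pct)
    return x

rng = np.random.default_rng(3)
sat = np.array([np.nan, 1180, 1250, np.nan, 1390, 1460, 1100])
filled = impute_from_distribution(sat, rng)
print(filled)
```

Because replacements are drawn from the whole empirical distribution rather than set to a single mean, this approach preserves the predictor's spread instead of shrinking its variance.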
For this analysis clustering was employed to reduce multicollinearity. With a large volume of predictors, it would be difficult and time consuming to evaluate all of the potential multicollinearity issues, so the software clustering node was used to group highly correlated variables. In each cluster, the variable with the highest correlation coefficient was retained and entered into the modeling process, and the others were eliminated.
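One way to mimic that variable-clustering step, grouping highly correlated predictors and keeping a single representative from each group, is hierarchical clustering on a 1 − |r| distance. This is a plausible stand-in for the software's clustering node, not its exact algorithm; the simulated predictors below form two deliberately correlated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)
base = rng.normal(size=(500, 2))
# Six predictors: three noisy copies of each latent factor, giving two
# groups of highly correlated variables.
X = np.column_stack([base[:, i // 3] + 0.1 * rng.normal(size=500)
                     for i in range(6)])

corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)                 # distance: 1 - |correlation|
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=2, criterion="maxclust")

# Keep one representative per cluster: the variable most correlated,
# on average, with the other members of its cluster.
keep = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    avg_corr = np.abs(corr[np.ix_(members, members)]).mean(axis=1)
    keep.append(int(members[np.argmax(avg_corr)]))
print(labels, keep)
```

Only the retained representatives enter the regression, which removes most of the multicollinearity without hand-checking every pair of predictors.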
Results

Gradient boosting had the smallest average ASE, followed by that of CART (Table 1).
Additionally, gradient boosting and BFOS-CART had, on average, the smallest differences between the validation and training errors: both absolute differences were approximately 0.02, while for the other methods the difference was greater than 0.1. Gradient boosting had the lowest average validation error, 0.375, while CHAID and linear regression had the highest at 0.49.

Table 1. Average Squared Error (ASE) Results for the Five Data Mining Methods

Though gradient boosting had the lowest average validation ASE, the CART method was chosen for the modeling process. Close inspection of the CART results did not show evidence of any problems with the fit of the model, and it had a relatively low average ASE. The main reason for choosing the CART model is that gradient boosting, without an actual tree diagram, would make the results much more difficult to explain, use, and visualize. Having a set of student characteristics assigned to each node, as well as the ability to graphically display the decision tree, adds to the utility of the CART model. Once the CART method was selected, the model was run again using all of the data, and scoring output was created.
The score distribution table (Figure 2), which is part of the decision tree output, allows us to view the frequencies of the model predictions. Twenty bins, the prediction ranges, are created by evenly dividing the interval between the lowest and highest predictions, 1.30 and 3.76. (Intervals without students are not listed.) The model score is the mid-point of the prediction range. The average GPA column contains the average GPA of the N students in the data who fall within the given range. The table can aid in choosing GPA cut points for different interventions, since it shows the number of students at the various prediction levels.
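The binning behind the score distribution table is straightforward to reproduce: divide the span between the lowest and highest predictions into 20 equal ranges and score each range by its midpoint. A sketch with invented predictions (only the endpoints 1.30 and 3.76 come from the paper):

```python
import numpy as np

# Hypothetical model predictions; the true table is built from the
# full cohort's scored output.
preds = np.array([1.30, 2.05, 2.40, 2.41, 3.10, 3.55, 3.76])

lo, hi = preds.min(), preds.max()      # 1.30 and 3.76 in the paper
edges = np.linspace(lo, hi, 21)        # 20 equal-width prediction ranges
mids = (edges[:-1] + edges[1:]) / 2    # model score = mid-point of range
counts, _ = np.histogram(preds, bins=edges)

for m, n in zip(mids, counts):
    if n:                              # intervals without students are skipped
        print(f"score {m:.3f}: {n} students")
```

Reading off the counts at each score level is what makes the table useful for choosing intervention cut points.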
Table 3. Variable Importance Table.
Table 3 lists the relative importance measure for variables that were entered into the modeling process. The relative importance measure is evaluated by using the reduction in the sum of squares that results when a node is split, summing over all of the nodes.³ In the variable importance calculation, when variables are highly correlated they both receive credit for the sum-of-squares reduction, so the relative importance of highly correlated variables will be about the same. For that reason, some measures may rank high on the variable importance list but not appear as predictors in the decision tree.
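The same summed-reduction-in-sum-of-squares idea underlies scikit-learn's impurity-based feature importances, which can serve as a hedged illustration of the calculation (the data and coefficients below are simulated; Enterprise Miner's exact formula may differ in detail):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
# Outcome driven mostly by the first predictor, weakly by the second.
y = 1.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.5, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
# Each predictor's importance is its summed reduction in the node sum
# of squares over all splits that use it, normalized to sum to 1.
importance = tree.feature_importances_
ranking = np.argsort(importance)[::-1]
print(importance, ranking)
```

Note that if two predictors were near-duplicates, whichever one the tree happened to split on would absorb the credit, which is the correlation caveat described above.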
In Table 3, high school GPA is highest on the variable importance list for predicting freshmen GPA when modeled mid-semester, followed by whether or not a student received a scholarship. Next are AP STEM and non-STEM courses accepted for credit, and then LMS system logins. A student's combined SAT Math and Critical Reading exam score is 15th on the list, just behind the high school average score for the combined SAT Math, Critical Reading, and Writing exam. Other measures that exceeded SAT scores in relative importance include whether a student has a declared major and the geographic area of residence when admitted.
The decision tree generated by the model is presented in two parts in Figures 6 and 7. The CART method, employing only binary splits as previously discussed, selected high school GPA for the first branch of the tree modeling first semester freshmen GPA. High school GPA was split into two nodes, less than or equal to 92.0, and greater than 92.0 or missing. Figure 6 displays the portion of the decision tree with high school GPA less than or equal to 92.0 and Figure 7 has the portion of the tree with high school GPA greater than 92.0 or missing.
3. SAS Institute Inc. 2014. SAS® Enterprise Miner™ 13.2: Reference Help. Cary, NC: SAS Institute Inc. p. 794.
Figure 6. Part 1 of the CART Decision Tree Model Predicting Freshmen GPA for Students Having a High School GPA ≤ 92.0.
Figure 7. Part 2 of the CART Decision Tree Model Predicting Freshmen GPA for Students Having a High School GPA > 92.0 or Missing.
The next branch for the lower high school GPA group is the non-STEM course LMS logins during weeks 2 through 6. Average high school SAT scores appear at the next level.
Figure 7 displays the section of the tree having the students with a high school GPA greater than 92.0 or missing. A small number of students, some of them international students, do not have a high school GPA in their records. The CART algorithm has combined those observations with the node having high school GPA greater than 92.0; in that way, those observations remain in the model and are not listwise deleted as they would be in a standard linear regression analysis. The next two levels differ from those for the lower high school GPA students. The split after high school GPA is whether or not the student received a scholarship. For those who received a scholarship, another high school GPA node follows that splits the students into groups above and below 96.5, while for those without a scholarship, LMS non-STEM logins during weeks 2 through 6 are most important. Examining both sections of the tree in Figures 6 and 7, we see that LMS logins factored into numerous splits, confirming that students' interactions with the college environment play a role in their academic success. We also observe the differences in the decision rules for students in the higher high school GPA group as compared to those in the lower high school GPA group.
The actual GPA predictions can be found in the nodes in the right-most column of the tree and are the average GPAs of the students represented by the characteristics of each particular node. The characteristics associated with the GPA predictions can be ascertained by tracing the paths from the high school GPA node on the left to the desired average GPA node on the right. For example, the top right average GPA = 3.63 node in Figure 6 represents students with high school GPA ≤ 92, LMS logins per non-STEM course in weeks 2 to 6 ≥ 11.3 or missing, high school average SAT Critical Reading ≥ 570, SAT Math and Critical Reading combined score ≥ 1360, and, finally, credit for 1 or more AP STEM courses. The prediction, 3.63, is the actual average GPA of students in the fall 2014 cohort having the characteristics just listed. Hence, we can say that students with the characteristics represented in the final nodes have, on average, the GPA listed in the node.
The average GPA nodes have been color-coded to assign estimated risk to the GPA levels. The red nodes have average GPAs of 2.20 or less and are at the highest risk of receiving a low GPA. The orange nodes represent high-risk students, with average GPAs above 2.20 up to 2.75. Yellow nodes, with average GPAs above 2.75 up to 3.0, represent moderate risk; white nodes represent neutral risk, with average GPAs from above 3.0 to below 3.5; and the green nodes are low-risk students who, on average, have GPAs of 3.5 and above. The given risk levels can be adjusted based on university outcomes and the number of students assigned to the various planned interventions.