«USING DATA MINING TO PREDICT FRESHMEN OUTCOMES Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness ...»

A general data mining diagram for running a modeling method with k-fold cross validation can be seen in Figure 5. Filters can be applied to select the proper groups for the validation and training sets for each fold, then the training and validation sets are sent to the modeling nodes where the same modeling method is run for each of the five training sets. The model is then run on each validation set for calculating the error. A model comparison node provides the relevant model evaluation statistics for each of the five folds.

The five different methods used to develop predictive models were: CHAID² (chi-square automatic interaction detection), BFOS-CART (the classification and regression tree method; Breiman, Friedman, Olshen, and Stone, 1984), a general decision tree, gradient boosting, and linear regression. Each model was developed to predict the first semester GPA of the first-time full-time fall 2014 freshmen cohort. The average squared errors (ASE) of the five validation samples for each method were averaged and compared with the average errors of the training samples to check for overfitting and to find the method with the smallest error.

² The CHAID and CART methods have been closely approximated by using Enterprise Miner settings. SAS Institute Inc. 2014. SAS® Enterprise Miner™ 13.2: Reference Help. Cary, NC: SAS Institute Inc. p. 755-758.

Figure 5. A general data-mining diagram for running 5-fold cross-validation to evaluate the accuracy of a model.
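The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Enterprise Miner diagram: the function names `ase` and `k_fold_ase`, the random fold assignment, and the stand-in "predict the training mean" model are all assumptions made for the example.

```python
import random
from statistics import mean

def ase(actual, predicted):
    """Average squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

def k_fold_ase(xs, ys, fit, predict, k=5, seed=0):
    """Return (mean training ASE, mean validation ASE) over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal groups
    train_errs, valid_errs = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training rows only, then score both sets.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        train_errs.append(ase([ys[i] for i in train],
                              [predict(model, xs[i]) for i in train]))
        valid_errs.append(ase([ys[i] for i in held_out],
                              [predict(model, xs[i]) for i in held_out]))
    return mean(train_errs), mean(valid_errs)

# Hypothetical stand-in model: predict the training mean for everyone.
def fit_mean(xs, ys):
    return mean(ys)

def predict_mean(model, x):
    return model
```

A validation ASE that is much larger than the training ASE would suggest overfitting, which is the comparison the study makes for each method.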

With the exception of linear regression, the methods tested were decision tree-based methods. The CART method begins by doing an exhaustive search for the best binary split. It then splits categorical predictors into a smaller number of groups or finds the optimal split in numerical measures. Each resulting node is again split in two until no further splits are possible. The result is a tree of maximum possible size, which is then pruned back algorithmically. For interval targets the variance is used to assess the splits; for nominal targets the Gini impurity measure is used. Pruning starts with the split that has the smallest contribution to the model, and missing data is assigned to the largest node of a split. This method creates a set of nested binary decision rules to predict an outcome.
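For an interval target such as GPA, the exhaustive search for the best binary split on one numeric predictor might look like the following sketch. It assumes the split criterion is the reduction in the sum of squared errors (equivalent to a variance criterion); the function names are hypothetical, and pruning and missing-value handling are left out.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_binary_split(x, y):
    """Exhaustively search thresholds on one numeric predictor.

    Returns (threshold, reduction) for the split `x <= threshold` that
    most reduces the sum of squared errors of the interval target y,
    or (None, 0.0) if no split helps.
    """
    parent = sse(y)
    best_t, best_gain = None, 0.0
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent observed values
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Growing a full tree would repeat this search over every predictor within each resulting node.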

Unlike CART with binary splits evaluated by the variance or misclassification measures, the CHAID algorithm uses the chi-square test (or the F test for interval targets) to determine significant splits and finds independent variables with the strongest association with the outcome. A Bonferroni correction to the p-value is applied prior to the split. CHAID may find multiple splits in continuous variables, and allows splitting of categorical data into more than two categories. This may result in very wide trees with numerous nodes at the first level. As with CART, CHAID allows different predictors for different sides of a split. The CHAID algorithm will halt when statistically significant splits are no longer found in the data.
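The test behind a CHAID-style split on a nominal target can be illustrated as below. This is a sketch under assumptions: it computes a plain Pearson chi-square statistic, assumes every row and column total is positive, and applies a simple Bonferroni adjustment that multiplies the p-value by a caller-supplied number of comparisons. The multiplier actually used by CHAID or by Enterprise Miner is not given in the text.

```python
import math

def chi_square_stat(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df

def chi_square_pvalue(stat, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    h = stat / 2
    if df % 2:               # odd df: start from the df = 1 tail
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:                    # even df: start from the df = 2 tail
        a, q = 1.0, math.exp(-h)
    while a < df / 2:        # each step raises df by 2
        q += h ** a * math.exp(-h) / math.gamma(a + 1)
        a += 1
    return q

def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)
```

A candidate split would be kept only if its adjusted p-value stays below the chosen significance level, which is why the algorithm halts when no significant splits remain.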

The software was also configured to run a general decision tree that does not conform to or approximate mainstream methods found in the literature. To control for the large number of nodes at each level, the model was restricted to up to four-way splits (4 branches), as opposed to CHAID, which finds and uses all significant splits, and CART, which splits each node in two.

The F test was used to evaluate the variance of the nodes and the depth of the overall tree was restricted to 6 levels. Missing values were assigned to produce an optimal split with the ASE used to evaluate the subtrees. The software’s cross validation option was selected in order to perform the cross validation procedure for each subtree. That results in a sequence of estimates using the cross validation method explained earlier to select the optimal subtree.

The final tree method was gradient boosting which uses a partitioning algorithm developed by Jerome Friedman. At each level of the tree the data is resampled a number of times without replacement. A random sample is drawn at each iteration from the training data set and the sample is used to update the model. The successive resampling results in a weighted average of the re-sampled data. The weights assigned at each iteration improve the accuracy of the predictions. The result is a series of decision trees, each one adjusted with new weights to improve the accuracy of the estimates or to correct the misclassifications present in the previous tree. Because the results at each stage are weighted and combined into a final model, there is no resulting tree diagram. However, the scoring code that is generated allows the model to be used to score new data for predicting outcomes.
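To make the idea of a series of trees concrete, here is a minimal sketch of gradient boosting for a squared-error target. It is a generic textbook form and not the Enterprise Miner implementation: each "tree" is a one-split stump on a single numeric predictor, each stump is fit to the current residuals on a random subsample drawn without replacement, and the learning rate, subsample fraction, and function names are all assumptions.

```python
import random

def fit_stump(x, r):
    """Best single-threshold split of x for residuals r (least squares)."""
    best = None
    cuts = sorted(set(x))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left value, right value)

def boost(x, y, n_trees=50, rate=0.1, frac=0.6, seed=0):
    """Fit an additive model of stumps to squared-error residuals."""
    rng = random.Random(seed)
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # Random subsample of rows, drawn without replacement.
        rows = rng.sample(range(len(y)), max(2, int(frac * len(y))))
        xs = [x[i] for i in rows]
        rs = [y[i] - pred[i] for i in rows]  # current residuals
        if len(set(xs)) < 2:
            continue  # no split is possible on this subsample
        t, lv, rv = fit_stump(xs, rs)
        stumps.append((t, lv, rv))
        pred = [p + rate * (lv if xi <= t else rv) for p, xi in zip(pred, x)]
    return base, rate, stumps

def score(model, xi):
    """Score a new observation with the fitted sequence of stumps."""
    base, rate, stumps = model
    return base + rate * sum(lv if xi <= t else rv for t, lv, rv in stumps)
```

The `score` function plays the role of the generated scoring code: the fitted model is a sum of many small corrections, so there is no single tree to draw.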

The final method tested was linear regression. The discussion that follows highlights some of the difficulties in implementing linear regression in a data mining environment.

Decision tree methods are able to handle missing values by combining them with another category or using surrogate rules to replace them. Linear regression, on the other hand, will listwise delete observations with missing values. Data in this study was obtained from multiple campus sources, and as such, many students did not have any records for some predictors. For example, students who did not apply for financial aid will have missing data on financial aid measures, a small percentage of the entering freshmen do not have SAT scores, and some students may not have courses utilizing the LMS. These gaps result in an excessive amount of data being listwise deleted. The software has an imputation node that can be configured to impute missing data. For this study the distribution method was used, whereby replacement values are calculated from random percentiles of the distributions of the predictors. There are many imputation methods, and a thorough study of missingness for such a large number of variables is very time consuming. If the linear regression method appeared promising, other imputation methods would be explored and studied in greater detail. Another concern in the linear regression analysis was multicollinearity, which can also take time to address thoroughly.
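One way to read the distribution method is sketched below: each missing value is replaced by the value found at a randomly chosen percentile of the observed values of the same predictor. The linear interpolation between neighboring observed values and the function name are assumptions; the software's exact procedure may differ.

```python
import random

def impute_from_distribution(values, seed=0):
    """Replace None entries with values drawn at random percentiles of
    the observed (non-missing) distribution of the same predictor."""
    rng = random.Random(seed)
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to impute from")

    def draw():
        # Random position along the sorted observed values.
        pos = rng.random() * (len(observed) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(observed) - 1)
        return observed[lo] + (pos - lo) * (observed[hi] - observed[lo])

    return [draw() if v is None else v for v in values]
```

Because the replacements follow the observed distribution, the imputed predictor keeps roughly the same spread, and no rows are dropped before the regression is fit.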

For this analysis clustering was employed to reduce multicollinearity. With a large volume of predictors, it would be difficult and time consuming to evaluate all of the potential multicollinearity issues, so the software clustering node was used to group highly correlated variables. In each cluster, the variable with the highest correlation coefficient was retained and entered into the modeling process, and the others were eliminated.
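The variable-reduction step can be approximated with a simple correlation-based grouping. This sketch is not the software's clustering node: it greedily groups variables whose absolute Pearson correlation with a cluster's first member meets a threshold, and it assumes the retained variable is the one with the highest average absolute correlation with the other members of its cluster. The threshold, the selection rule, and the function names are all assumptions, and constant columns are not handled.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def reduce_collinear(columns, threshold=0.8):
    """columns: dict of name -> list of values. Returns retained names."""
    clusters = []
    for name in columns:
        for cluster in clusters:
            # Join the first cluster whose seed variable is highly correlated.
            if abs(pearson(columns[name], columns[cluster[0]])) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])

    def avg_corr(name, cluster):
        others = [m for m in cluster if m != name]
        if not others:
            return 0.0
        return sum(abs(pearson(columns[name], columns[m]))
                   for m in others) / len(others)

    # Keep one representative per cluster.
    return [max(c, key=lambda n: avg_corr(n, c)) for c in clusters]
```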

Results

Gradient boosting had the smallest average ASE, followed by that of CART (Table 1). Additionally, gradient boosting and BFOS-CART, on average, had the smallest differences between the validation and training errors. Those absolute differences were both approximately 0.02, while for the other methods they were greater than 0.1. Gradient boosting had the lowest average validation error, 0.375, while CHAID and linear regression had the highest at 0.49. Though gradient boosting had the lowest average validation ASE, the CART method was chosen for the modeling process. Close inspection of the CART results did not show evidence of any problems with the fit of the model, and it had a relatively low average ASE. The main reason for choosing the CART model is that gradient boosting, without an actual tree diagram, would make the results much more difficult to explain, use, and visualize. Having a set of student characteristics assigned to each node, as well as the ability to graphically display the decision tree, adds to the utility of the CART model. Once the CART method was selected, the model was run again using all of the data, and scoring output was created.

Table 1. Average Squared Error (ASE) Results for the Five Data Mining Methods

The score distribution table, Figure 2, which is part of the decision tree output, allows us to view the frequencies of the model predictions. Twenty bins, the prediction ranges, are created by evenly dividing the interval between the lowest and highest predictions, 1.30 and 3.76. (Intervals without students are not listed.) The model score is calculated by taking the mid-point of the prediction range. The average GPA column contains the average GPA of the N students in the data that fall within the given range. The table can aid us in choosing GPA cut points for different interventions since it shows the number of students at the various prediction levels.
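The construction of the prediction ranges can be reproduced directly from the two endpoints given above. The function name `score_bins` is hypothetical; the endpoints 1.30 and 3.76 and the count of twenty bins come from the text.

```python
def score_bins(lowest, highest, n_bins=20):
    """Evenly divide [lowest, highest] into n_bins prediction ranges and
    return (lower, upper, midpoint) for each range."""
    width = (highest - lowest) / n_bins
    bins = []
    for i in range(n_bins):
        lower = lowest + i * width
        upper = lowest + (i + 1) * width
        bins.append((lower, upper, (lower + upper) / 2))
    return bins

# Ranges for the lowest and highest predictions reported in the study.
bins = score_bins(1.30, 3.76)
```

Each range is (3.76 - 1.30) / 20 = 0.123 GPA points wide, and the model score for a range is its midpoint.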


Table 3. Variable Importance Table.

Table 3 lists the relative importance measure for variables that were entered into the modeling process. The relative importance measure is evaluated by using the reduction in the sum of squares that results when a node is split, summing over all of the nodes.³ In the variable importance calculation, when variables are highly correlated they will both receive credit for the sum of squares reduction; hence the relative importance of highly correlated variables will be about the same. For that reason some measures may rank high on the variable importance list but do not appear as predictors in the decision tree.
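The core of that calculation, summing the sum-of-squares reduction over every node split on a variable, can be sketched as follows. Scaling the totals so that the top variable scores 1 is an assumption, as is the input layout; the credit given to correlated variables that do not appear in the tree is not modeled here, and the exact formula in the cited reference may differ.

```python
def relative_importance(splits):
    """splits: iterable of (variable, sse_parent, sse_left, sse_right),
    one tuple per split node. Returns {variable: importance in [0, 1]}."""
    totals = {}
    for var, parent, left, right in splits:
        # Reduction in the sum of squares achieved by this split.
        totals[var] = totals.get(var, 0.0) + (parent - left - right)
    top = max(totals.values())
    return {var: total / top for var, total in totals.items()}
```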

In Table 3, high school GPA is highest on the variable importance list for predicting freshmen GPA when modeled mid-semester, followed by whether or not a student received a scholarship. Next are AP STEM and non-STEM courses accepted for credit, and then LMS logins. A student’s combined SAT Math and Critical Reading Exam Score is 15th on the list, just behind the high school average score for the combined SAT Math, Critical Reading, and Writing exam. Some other measures that exceeded SAT scores in relative importance are whether a student has a declared major and the geographic area of residence when admitted.

The decision tree generated by the model is presented in two parts in Figures 6 and 7. The CART method, employing only binary splits as previously discussed, selected high school GPA for the first branch of the tree modeling first semester freshmen GPA. High school GPA was split into two nodes, less than or equal to 92.0, and greater than 92.0 or missing. Figure 6 displays the portion of the decision tree with high school GPA less than or equal to 92.0 and Figure 7 has the portion of the tree with high school GPA greater than 92.0 or missing.

³ SAS Institute Inc. 2014. SAS® Enterprise Miner™ 13.2: Reference Help. Cary, NC: SAS Institute Inc. p. 794.

Figure 6. Part 1 of the CART Decision Tree Model Predicting Freshmen GPA for Students Having a High School GPA ≤ 92.0.

Figure 7. Part 2 of the CART Decision Tree Model Predicting Freshmen GPA for Students Having a HS GPA > 92.0 or missing.

The next branch for the lower high school GPA group is the non-STEM course LMS logins during weeks 2 through 6. Average high school SAT scores appear at the next level.

Figure 7 displays the section of the tree having the students with a high school GPA greater than 92.0 or missing. A small number of students, some of them international students, do not have a high school GPA in their records. The CART algorithm has combined those observations with the node having high school GPA greater than 92.0. In that way, those observations remain in the model and are not listwise deleted as they would be in a standard linear regression analysis. The next two levels are different from those for the lower high school GPA students. The next split after high school GPA is whether the students received a scholarship or not. For those who received a scholarship, another high school GPA node follows that splits the students into groups above and below 96.5, while for those without a scholarship, LMS non-STEM logins during weeks 2 through 6 is most important.

Examining both sections of the tree in Figures 6 and 7, we see that LMS logins factored in numerous splits, confirming that students’ interactions with the college environment play a role in their academic success. We also observe the differences in the decision rules for students in the higher high school GPA group as compared to the students in the lower high school GPA group.

The actual GPA predictions can be found in the nodes in the right-most column of the tree and are the average GPAs of the students represented by the characteristics of each particular node. The characteristics associated with the GPA predictions can be ascertained by tracing the paths from the high school GPA node on the left to the desired average GPA node on the right. For example, to determine the characteristics for the students represented in the top right average GPA = 3.63 node in Figure 6, we have students with high school GPA ≤ 92.0, LMS logins per non-STEM course in weeks 2 to 6 = 11.3 or missing, high school average SAT critical reading 570, SAT Math – Critical Reading combined score 1360, and finally, receiving credit for 1 or more AP STEM courses. The prediction, 3.63, is the actual average GPA of students in the fall 2014 cohort having the characteristics just listed. Hence, we can say that students with characteristics represented in the final nodes have, on average, the GPA that is listed in the node.

The average GPA nodes have been color-coded to assign estimated risk to the GPA levels. The red nodes have average GPAs of 2.20 or less and are at the highest risk of receiving a low GPA. The orange nodes represent high risk students and on average have GPAs of above 2.20 to 2.75. Yellow nodes with average GPAs of above 2.75 to 3.0 represent moderate risk, white nodes represent neutral risk with average GPAs ranging from above 3.0 to below 3.5, and the green nodes are low risk students who, on average, have GPAs of 3.5 and above. The given risk levels can be adjusted based on university outcomes and the number of students assigned to various planned interventions.
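The color coding amounts to a small lookup on the node's average GPA. In the sketch below the cut points are the ones stated above, exposed as keyword arguments so they can be adjusted; the function and parameter names are hypothetical.

```python
def risk_color(avg_gpa, red_max=2.20, orange_max=2.75,
               yellow_max=3.0, green_min=3.5):
    """Map a node's average GPA to the risk color described in the text."""
    if avg_gpa <= red_max:
        return "red"      # highest risk
    if avg_gpa <= orange_max:
        return "orange"   # high risk
    if avg_gpa <= yellow_max:
        return "yellow"   # moderate risk
    if avg_gpa < green_min:
        return "white"    # neutral risk
    return "green"        # low risk
```

For instance, the 3.63 node discussed earlier falls in the green band under the default cut points.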
