

USING DATA MINING TO PREDICT FRESHMEN OUTCOMES




Nora Galambos, PhD

Senior Data Scientist

Office of Institutional Research, Planning & Effectiveness

Stony Brook University


Data mining is used to develop models for the early prediction of freshmen GPA. Since student engagement has long been associated with student success, the use of service utilization and transactional data is examined along with more traditional student factors. Factors entered into the data mining models include advising visits, freshmen course-taking activity, interactions with the college learning management system, and college activity participation, along with SAT scores, high school GPA, demographics, and financial aid. In models predicting first semester freshmen GPA, factors associated with students' interactions with the campus environment were stronger predictors than SAT scores.

Introduction

The goal is to develop a model that identifies at-risk first-time full-time freshmen as early as possible in their college careers in order to assist them with interventions. Traditional methods of logistic and linear regression are often good at identifying factors significantly associated with an outcome, but are not always able to make accurate predictions. Linear and logistic regression use one set of predictors to model the outcomes of all of the students in the data and do not assign separate sets of predictors to students having very different characteristics. For example, first-time freshmen entering college with high SAT scores may have very different retention and college GPA predictors than those entering with a low high school GPA and low SAT scores.

Inevitably, when using any model, some students will be incorrectly assigned: some will be misidentified as being at risk, and some who are at risk will not be identified as such by the model. There is an allocation trade-off when resources are expended on students not really in need of interventions or when students who would potentially benefit from interventions do not receive them. Methods capable of more accurate predictions will result in more effective utilization of resources and higher retention and graduation rates. For that reason the decision was made to explore data mining: it offers a variety of methods for utilizing different types of data, it has few assumptions to satisfy relative to traditional hypothesis-driven methods, and it is able to handle a great volume of data with hundreds of predictors.

At our institution, poor academic performance by first-time full-time freshmen in the first semester has a negative impact on graduation and retention outcomes. Figure 1 illustrates that only 11% of students in the lowest GPA decile graduate in four years, and fewer than 29% of students in that group graduate in six years. For the second decile the four-year rate increases to 26% and the six-year rate improves to 53%. Those rates, though higher, are still very low relative to the top half of the freshmen class.

Approximately 30% of first-time full-time freshmen received a GPA below 2.5 in their first semester (Figure 2). Almost 84% of those students returned in year two; however, by the next year the retention rate had dropped substantially, with only 64% returning for year three and only 48% graduating in six years. In contrast, over 77% of students receiving a GPA of 2.5 or greater in their first semester graduated in six years.

Figure 1. Four and Six Year Graduation Rates of First-Time Full-Time Freshmen by GPA Deciles*


*The fall freshmen cohorts of 2006 through 2008 were combined.

Figure 2. Comparison of Graduation and Retention Rates of First-time Full-time Freshmen by First Semester GPA Above and Below 2.5*



*The fall freshmen cohorts of 2006 through 2008 were combined.

Even when evaluating results for students above and below 3.0, the differences are dramatic (Figure 3). Only 34% of students with a first semester GPA below 3.0 (approximately the median) graduated in four years, which is almost 27 percentage points lower than the rate for students above the median.

Figure 3. Comparison of Graduation and Retention Rates of First-time Full-time Freshmen by First Semester GPA Above and Below 3.0*



*The fall freshmen cohorts of 2006 through 2008 were combined.

Given these results, we see that it would greatly benefit at-risk students if they could be identified as early as possible. In order for the programs to be cost effective and, more importantly, a good match for the needs of the students, the model must be able to make very accurate predictions. The difficulty of this task lies in the fact that there are not many university-level academic measures available on or before the middle of the first semester of the freshman year. For that reason we have explored the development of a data mining model that combines transactional data, such as learning management system (LMS) logins, and service utilization, such as advising and tutoring center visits, with other more traditional measures in an attempt to identify at-risk students before any grades appear on their transcripts.

Literature Review

The study has cast a wide net in assembling a variety of data for use in studying academic, social, and economic factors to determine elevated risk of a low GPA, which can translate to increased risk of early attrition or longer time to degree. Consistent with the retention study of Tinto (1987), we evaluate many types of data representing students’ interactions with their campus environment to determine if higher levels of campus engagement are predictive of improved freshmen outcomes. These measures of engagement include interactions with the learning management system, intramural sports and fitness class participation, and academic advising and tutoring center visits. It appears that students who are identified to be at risk in their first term and remain at the institution continue to be at risk, with greater numbers leaving in the subsequent term (Singell and Waddell 2010). This is consistent with the results at our institution, which are presented in Figures 1, 2, and 3. Methods capable of more accurate predictions will result in more effective utilization of campus resources and higher retention and graduation rates. Course-taking behavior is also important, particularly math readiness. Herzog (2005) found math readiness to be “more important than aid in explaining freshmen dropout and transfer-out during both first and second semesters.” Herzog also focused on both merit and need-based aid and the role that the interaction of aid and academic preparedness plays in student retention. Johnson (2008) explored living within a 60-mile radius of the institution, the percentage of students at a high school who take the SAT, and the percentage at the high school receiving free lunches, underlining the need to examine the role of the secondary school and socio-economic factors in developing a model.
Persistence increases among students closer to the institution and, not surprisingly, decreases among those who were from schools having a high percentage of students receiving free school lunches. A study of the differing stop-out patterns exhibited by grant, work-study, and loan recipients (Johnson 2010) demonstrated that grants have the highest positive effect on persistence, but that their effect decreases more than that of loans after controlling for other factors. Resource utilization was studied (Robbins et al. 2009) using a tracking system. Services and resources were grouped into academic services, recreational resources, social measures, and advising sessions, with all but social measures demonstrating positive associations with GPA even after controlling for other demographic and risk factors. These papers demonstrate that researchers are examining a range of factors in studying and modeling risk. This research underlines the fact that student success is the result of complex interactions between student engagement, academic service utilization, financial metrics, and demographics, which are combined with student academic characteristics that include high school GPA and SAT scores. Data mining is ideal for developing a model with a large and diverse set of predictors.

Data Sources

An attempt was made to include as many types of data as possible, so learning management system logins, not previously explored by our institution, were included. Building the dataset began with the traditional measures such as demographics (gender, ethnicity, and geographic area of residence when admitted), to which were added high school GPA and SAT scores. In order to control for high school GPA, the average SAT scores of the high schools were incorporated. Because we are modeling the freshmen GPA at the mid-semester point, the only college academic characteristics available are the fall semester courses in which the students are enrolled, the area of the major, whether a major has been declared, and how many college credits were accepted by the institution upon admission. The number of AP credits received was also captured, with those credits separated into STEM and non-STEM totals.

To explore the effect of high failure rate courses on student outcomes, courses with enrollments of 70 or more students having 10% or more D, F, or W grades were identified and categorized as STEM or non-STEM courses. The total number of high DFW-rate courses and the highest DFW rate for each student (by STEM indicator) were included in the model. The percentage of freshmen in each DFW course was also tabulated, and that percentage for the corresponding course was added as well. The rationale for examining the percentage of freshmen in these difficult courses is that if the courses are populated by large numbers of upper level students, it may make the course even more difficult for freshmen, who are less experienced.
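The course screening described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (course, grade pairs), the function names, and the course codes are hypothetical, and plus/minus grades and the percentage-of-freshmen measure are left out for brevity. Only the two thresholds (70 students, 10% DFW) come from the text.

```python
from collections import defaultdict

# Thresholds taken from the text: 70+ enrolled, 10%+ D/F/W grades.
MIN_ENROLLMENT = 70
MIN_DFW_RATE = 0.10
DFW_GRADES = {"D", "F", "W"}


def high_dfw_courses(grade_records):
    """Return {course: DFW rate} for courses meeting both thresholds.

    grade_records: iterable of (course_id, grade) pairs (hypothetical layout).
    """
    enrolled = defaultdict(int)
    dfw = defaultdict(int)
    for course, grade in grade_records:
        enrolled[course] += 1
        if grade in DFW_GRADES:
            dfw[course] += 1
    return {
        course: dfw[course] / n
        for course, n in enrolled.items()
        if n >= MIN_ENROLLMENT and dfw[course] / n >= MIN_DFW_RATE
    }


def student_dfw_features(schedule, dfw_rates, stem_courses):
    """Count a student's high-DFW courses and find the highest rate, by STEM flag."""
    features = {"stem_count": 0, "stem_max_rate": 0.0,
                "nonstem_count": 0, "nonstem_max_rate": 0.0}
    for course in schedule:
        if course not in dfw_rates:
            continue  # not a high-DFW course
        prefix = "stem" if course in stem_courses else "nonstem"
        features[prefix + "_count"] += 1
        features[prefix + "_max_rate"] = max(
            features[prefix + "_max_rate"], dfw_rates[course])
    return features
```

A student's fall schedule would be passed to `student_dfw_features` together with the screened course rates, giving four predictors per student.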

Since student engagement has long been associated with student success, the use of service and academic utilization data was included to determine if it resulted in improved models. Student interactions with the university’s learning management system, academic advising, tutoring center visits, intramural sports, and fitness classes have been incorporated in the analysis to evaluate the association of GPA with students’ engagement in the university environment.

Much of the data pertaining to interactions with student services and learning management system logins has not been stored long term. In fact, the LMS login data was not available for any fall semester prior to fall 2014. As a result, part of the data mining process has included the initial collecting, saving, and storing of the data. Programs are being developed to automate the formatting and aggregation of the transactional data so it can easily be merged with student records and utilized in the data mining process. For modeling use of the LMS logins, only one login per course per hour was counted, so an individual course can have at most 24 logins per day. This eliminated multiple logins that occurred just a few minutes, and sometimes a few seconds, apart. Further, the courses were categorized as STEM or non-STEM. Next, the STEM and non-STEM logins were totaled for week 1 and separately for weeks 2 through 6. Finally, the STEM and non-STEM logins were divided by their respective STEM and non-STEM course totals to obtain per-course login rates.
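The login aggregation steps above might look roughly like the following sketch. It assumes that "one login per course per hour" means one per clock hour and that week 1 begins on a known first day of classes; the function name, input layout, and course codes are hypothetical and not taken from the institution's programs.

```python
from collections import defaultdict
from datetime import datetime


def login_rates(logins, stem_courses, enrolled_courses, term_start):
    """Per-course login rates by STEM flag for week 1 and weeks 2-6.

    logins: iterable of (course_id, datetime) pairs for one student.
    enrolled_courses: the student's fall courses, without duplicates.
    term_start: datetime for midnight on the first day of classes.
    """
    # Keep at most one login per course per clock hour.
    distinct = {(course, ts.replace(minute=0, second=0, microsecond=0))
                for course, ts in logins}

    totals = defaultdict(int)
    for course, hour in distinct:
        week = (hour - term_start).days // 7 + 1
        if week == 1:
            period = "week1"
        elif 2 <= week <= 6:
            period = "weeks2_6"
        else:
            continue  # outside the modeling window
        kind = "stem" if course in stem_courses else "nonstem"
        totals[(kind, period)] += 1

    n_stem = sum(1 for c in enrolled_courses if c in stem_courses)
    n_courses = {"stem": n_stem, "nonstem": len(enrolled_courses) - n_stem}

    # Divide each total by the number of courses of that type.
    rates = {}
    for kind in ("stem", "nonstem"):
        for period in ("week1", "weeks2_6"):
            count = n_courses[kind]
            rates[f"{kind}_{period}"] = (
                totals[(kind, period)] / count if count else 0.0)
    return rates
```

Collapsing timestamps to the hour before counting is what enforces the cap of 24 logins per course per day.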

Financial aid data was also assembled. The measures captured are the expected family contribution, adjusted gross income (AGI), and the types and amounts of disbursed aid (athletics aid, loans, grants, scholarships, and work-study). Indicators for Pell Grant and Tuition Assistance Program (TAP) recipients were also added to the model.

Because the data mining initiative is new and many data sources are being collected and explored for the first time, research and evaluation of the methods for summarizing and using the data in the model are ongoing. The expectation is that additional data sources will be added. A detailed list of the data elements can be found in the appendix.

Methodology

Different models were compared to find the ones that provide the most accurate prediction of the first semester GPA with the lowest average squared error (ASE), where ASE = SSE/N, the sum of squared errors divided by N, the number of observations. In developing data mining models it is advisable to partition the data into training and validation sets. The training set is used for model development, then the model is run on the validation set to check its accuracy and calculate the prediction error. It is also important to avoid overfitting, that is, developing an overly complex model. If the model is too complex it can be influenced by random noise, and if there are outliers an overly complex model may be fit to them.

Unfortunately, when such a model is used on new data, its ability to accurately predict the outcomes will be diminished. One way of detecting overfitting is to compare the ASE of the training and validation data. A large increase in the ASE when running the model on the validation data may be a sign of overfitting. However, with fewer than 3,000 subjects and over 50 variables to predict the GPAs of the bottom 20% of the class, setting aside 40% of the data, as is typical for a validation set, is not practical because it would not leave enough of the lower GPA students for building the model. As an alternative, k-fold cross validation was used. It works with limited amounts of data, and its initial steps are similar to traditional analysis. The entire dataset is used to choose the predictors, and the error is estimated by averaging the error of the k test samples. In subsequent years, when more than one semester of LMS data has been collected, the easier-to-implement training-validation partitioning method can be used.
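As a small illustration of the error measure and the overfitting check, the sketch below computes ASE = SSE/N for a training set and a validation set. The function name and the GPA values are made up for the example and are not results from the study.

```python
def average_squared_error(actual, predicted):
    """ASE = SSE / N for paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sse / len(actual)


# Made-up GPAs for illustration only.
train_ase = average_squared_error([3.0, 2.0, 3.5, 2.5], [3.0, 2.5, 3.5, 2.0])
valid_ase = average_squared_error([3.2, 1.8], [2.2, 2.8])

# A validation ASE far above the training ASE may be a sign of overfitting.
print(f"training ASE = {train_ase:.3f}, validation ASE = {valid_ase:.3f}")
```

How large an increase counts as a warning sign is a judgment call that the text does not quantify.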

To implement k-fold cross validation, the dataset is divided into k equal groups, or folds. In this case five folds were used. Four groups are taken together and used to train the model, and one is used for validation. The procedure is repeated five times, each time leaving out a different fold for validation, as in Figure 4. The model error is estimated by averaging the errors of the five validation samples.
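The five-fold procedure can be sketched as follows. This is a minimal illustration, assuming a random assignment of students to folds and at least as many observations as folds; the `fit_mean` model is a placeholder that stands in for whichever modeling method is being evaluated, and all names are hypothetical.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_ase(xs, ys, fit, k=5, seed=0):
    """Average the validation ASE over k folds.

    fit(train_xs, train_ys) must return a function mapping x to a prediction.
    """
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Train on the other k - 1 folds, validate on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sse = sum((ys[i] - predict(xs[i])) ** 2 for i in held_out)
        fold_errors.append(sse / len(held_out))
    return sum(fold_errors) / k


def fit_mean(train_xs, train_ys):
    """Placeholder model: always predict the mean of the training outcomes."""
    mean = sum(train_ys) / len(train_ys)
    return lambda x: mean
```

Calling `cross_validated_ase(xs, ys, fit_mean)` returns one error estimate; swapping in a different `fit` function allows competing methods to be compared on the same folds.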

Figure 4: K-fold cross-validation sampling design.

Five different modeling methods were tested and compared using k-fold cross validation.
