Ask your data

Predicting with the Caret Package in R

7/3/2017

The caret package ("caret" stands for Classification And REgression Training) provides functions that support the model training process for complex regression and classification problems. It is widely used in machine learning, which we have dealt with earlier this year; see this blog post from spring if you are interested in an overview. Alongside other, more advanced algorithms, regression approaches are an established element of machine learning. There are tons of resources on regression, so we will not cover the basics in this post; there is, however, an earlier post if you are looking for a brief introduction.

In the following, the HTML output of a knitr report was pasted onto the blog. We use made-up data with donor lifetime value as the dependent variable and the initial donation, the last donation and donor age as predictors.

In short: the explanatory and predictive power of the model is low (for which the dummy data is partly to blame). However, the code below illustrates important concepts of machine learning in R:
  • Data management to enable algorithms to work properly
  • Splitting the data into a training and test dataset
  • Applying predictions after fitting the model
  • Correlation plots as a nice add-on.

library(knitr)
lifetimedata <- read.table("C:/Users/johan/Documents/lifetime.txt", header = TRUE) ## here we read in the data
dim(lifetimedata) ## 4,000 cases in 5 columns
## [1] 4000    5
str(lifetimedata) ## str gives us a summary of the dataframe structure
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: Factor w/ 2729 levels "10,01","10,05",..: 2209 2557 2110 1352 2044 575 2091 177 620 1730 ...
##  $ LastDonation   : Factor w/ 3264 levels "10","10,05","10,08",..: 3260 3038 2854 2794 2758 2679 2570 2515 2484 2467 ...
##  $ Age            : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : Factor w/ 3975 levels "100","100,24",..: 2450 2404 1881 1663 1 956 3875 1619 1301 2100 ...

Preparing Training and Test Data

We see that InitialDonation, LastDonation and Lifetimesum are factors (the raw values use German decimal commas), so let's prepare the data.

lifetimedata$InitialDonation <- as.numeric(lifetimedata$InitialDonation) ## InitialDonation to numeric
lifetimedata$LastDonation <- as.numeric(lifetimedata$LastDonation)  ## LastDonation to numeric
lifetimedata$Age <- as.numeric(lifetimedata$Age) ## Age to numeric
lifetimedata$Lifetimesum <- as.numeric(lifetimedata$Lifetimesum) ## Lifetimesum to numeric
str(lifetimedata) ## another look at the file
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
lifetimedata <- lifetimedata[,2:5] ## DonorId is not needed ... we keep columns 2 to 5
str(lifetimedata) ## check whether subset worked
## 'data.frame':    4000 obs. of  4 variables:
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
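A caveat worth flagging (not part of the original post): as.numeric() applied directly to a factor returns the internal level codes, not the printed values, which is why the converted amounts above jump into the thousands. Assuming the raw file uses German decimal commas, a safer conversion could look like this sketch:

```r
## convert a comma-decimal factor to true numeric values:
## factor -> character first, then swap the decimal comma for a dot
to_numeric <- function(x) as.numeric(gsub(",", ".", as.character(x)))

donations <- factor(c("10,01", "25,50", "100"))
as.numeric(donations)  # level codes, not amounts
to_numeric(donations)  # 10.01 25.50 100.00
```

Note that going through as.character() is the essential step; as.numeric() alone silently produces plausible-looking but wrong numbers.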

Now that we have a usable dataset, we go ahead and load the prominent machine learning package caret (Classification And REgression Training).

Splitting the Data

We split the data into a training set (75% of cases) and a test set using caret's createDataPartition().

library(caret) ## we load the caret package
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=lifetimedata$Lifetimesum, p=0.75, list=FALSE) ## createDataPartition splits into training and test sets
training <- lifetimedata[inTrain,]
testing <- lifetimedata[-inTrain,]
dim(training) ## the training set contains 3,000 cases (75% of 4,000)
## [1] 3000    4
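A note on reproducibility (our own addition, not in the original code): createDataPartition() samples at random, so without a seed every run produces a different split. The same 75/25 idea in base R looks like the sketch below; unlike createDataPartition(), plain sample() does not stratify on the outcome.

```r
set.seed(42)                                     # fixes the random draw
n <- 4000                                        # same size as lifetimedata
train_idx <- sample(seq_len(n), size = 0.75 * n) # 3,000 row indices
test_idx  <- setdiff(seq_len(n), train_idx)      # the remaining 1,000
length(train_idx)  # 3000
length(test_idx)   # 1000
```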

Intercorrelation

Before we fit the model, let's have a look at the intercorrelations.

library(corrplot) # we load the library corrplot
## Warning: package 'corrplot' was built under R version 3.3.3
corrplotbase <- cor(training)
corrplot(corrplotbase, method="pie") ## apply corrplot with method "pie"
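As a minimal illustration of what cor() feeds into the plot (toy data, our own sketch): the result is a symmetric matrix with ones on the diagonal, and each off-diagonal cell is the pairwise correlation that corrplot turns into a pie.

```r
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- toy$x + rnorm(200, sd = 0.5)  # y strongly related to x
toy$z <- rnorm(200)                    # z unrelated to both
m <- cor(toy)
round(m, 2)  # diagonal = 1; cor(x, y) high, cor(x, z) near zero
```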

Fitting the model

Now the moment has come to fit the (regression) model:

## now we fit the linear model from Caret
lmfit <- train(Lifetimesum ~ .,
               data = training, 
               method = "lm")
summary(lmfit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2100.18 -1003.85     3.03   972.65  2093.47 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1856.10651   75.92703  24.446  < 2e-16 ***
## InitialDonation   -0.01945    0.02652  -0.733    0.463    
## LastDonation       0.09741    0.02213   4.403 1.11e-05 ***
## Age                0.01976    0.99975   0.020    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1145 on 2996 degrees of freedom
## Multiple R-squared:  0.006653,   Adjusted R-squared:  0.005658 
## F-statistic: 6.688 on 3 and 2996 DF,  p-value: 0.0001696
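To read the coefficient table: a prediction is simply the intercept plus each estimate times its predictor value. As a worked example with the estimates printed above (the donor values are made up, and since the underlying data are dummy values, the result is only an illustration of the arithmetic):

```r
## intercept, InitialDonation, LastDonation, Age (from the summary above)
coefs <- c(1856.10651, -0.01945, 0.09741, 0.01976)
donor <- c(1, 1500, 2000, 45)  # leading 1 multiplies the intercept
sum(coefs * donor)             # about 2022.6
```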
finalmodel <- lmfit$finalModel
testing$pred <- predict(finalmodel, testing) ## add the prediction column to the test set; predict(lmfit, testing) on the caret object also works and applies any pre-processing
require("lattice")
xyplot(testing$Lifetimesum ~ testing$pred, data = testing, auto.key = TRUE, pch = 20, col = 'red')

The scatterplot illustrates how poor the prediction is. So: there is still some work (data collection, modelling) to be done :-)
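Beyond the scatterplot, the fit can be quantified with RMSE and R-squared on the test set; caret's postResample(pred, obs) reports both. A self-contained sketch of the calculation on toy numbers (our own example, not the donor data):

```r
set.seed(7)
obs  <- rnorm(50, mean = 100, sd = 15)  # stand-in for testing$Lifetimesum
pred <- obs + rnorm(50, sd = 10)        # stand-in for testing$pred

rmse <- sqrt(mean((obs - pred)^2))  # typical size of a prediction error
rsq  <- cor(obs, pred)^2            # share of variance "explained"
c(RMSE = rmse, Rsquared = rsq)
```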

