Ask your data

Predicting with the Caret Package in R

7/3/2017

The caret package ("caret" stands for Classification And REgression Training) provides functions that support the model training process for complex regression and classification problems. It is widely used in machine learning, which we have dealt with earlier this year; see this blog post from spring if you are interested in an overview. Alongside other, more advanced algorithms, regression approaches are an established element of machine learning. There are tons of resources on regression, so we will not cover the basics in this post; there is, however, an earlier post if you are looking for a brief introduction.

In the following, the HTML output of a knitr report was pasted onto the blog. We use made-up data with donor lifetime value as the dependent variable and the initial donation, the last donation and donor age as predictors.

In short: the explanatory and predictive power of the model is low (for which the dummy data is partly to blame). However, the code below illustrates important concepts of machine learning in R:
  • Data management to enable algorithms to work properly
  • Splitting the data into a training and test dataset
  • Applying predictions after fitting the model
  • Correlation plots as a nice add-on.

library(knitr)
lifetimedata <- read.table("C:/Users/johan/Documents/lifetime.txt", header = TRUE) ## here we read in the data
dim(lifetimedata) ## 4,000 cases in 5 columns
## [1] 4000    5
str(lifetimedata) ## str gives us a summary of the dataframe structure
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: Factor w/ 2729 levels "10,01","10,05",..: 2209 2557 2110 1352 2044 575 2091 177 620 1730 ...
##  $ LastDonation   : Factor w/ 3264 levels "10","10,05","10,08",..: 3260 3038 2854 2794 2758 2679 2570 2515 2484 2467 ...
##  $ Age            : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : Factor w/ 3975 levels "100","100,24",..: 2450 2404 1881 1663 1 956 3875 1619 1301 2100 ...

Preparing Training and Test Data

We see that InitialDonation, LastDonation and Lifetimesum are factors (the raw values use German decimal commas), so let's prepare the data.

lifetimedata$InitialDonation <- as.numeric(lifetimedata$InitialDonation) ## InitialDonation to numeric
lifetimedata$LastDonation <- as.numeric(lifetimedata$LastDonation)  ## LastDonation to numeric
lifetimedata$Age <- as.numeric(lifetimedata$Age) ## Age to numeric
lifetimedata$Lifetimesum <- as.numeric(lifetimedata$Lifetimesum) ## Lifetimesum to numeric
str(lifetimedata) ## another look at the file
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
lifetimedata <- lifetimedata[,2:5] ## DonorId is not needed ... we keep columns 2 to 5
str(lifetimedata) ## check whether subset worked
## 'data.frame':    4000 obs. of  4 variables:
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
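A caveat worth flagging (not part of the original post): as.numeric() applied directly to a factor returns the internal level codes, not the printed values, which is why the converted amounts above jump into the thousands. Assuming the raw file uses German decimal commas, a safer conversion could look like this sketch:

```r
## convert a comma-decimal factor to true numeric values:
## factor -> character first, then swap the decimal comma for a dot
to_numeric <- function(x) as.numeric(gsub(",", ".", as.character(x)))

donations <- factor(c("10,01", "25,50", "100"))
as.numeric(donations)  # level codes, not amounts
to_numeric(donations)  # 10.01 25.50 100.00
```

Note that going through as.character() is the essential step; as.numeric() alone silently produces plausible-looking but wrong numbers.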

Now that we have a usable dataset, we go ahead and load the prominent machine learning package caret (Classification And REgression Training).

Splitting the Data

We split the data into a training set (75% of cases) and a test set using caret's createDataPartition().

library(caret) ## we load the caret package
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=lifetimedata$Lifetimesum, p=0.75, list=FALSE) ## createDataPartition splits into training and test sets
training <- lifetimedata[inTrain,]
testing <- lifetimedata[-inTrain,]
dim(training) ## the training set contains 3,000 cases (75% of 4,000)
## [1] 3000    4
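A note on reproducibility (our own addition, not in the original code): createDataPartition() samples at random, so without a seed every run produces a different split. The same 75/25 idea in base R looks like the sketch below; unlike createDataPartition(), plain sample() does not stratify on the outcome.

```r
set.seed(42)                                     # fixes the random draw
n <- 4000                                        # same size as lifetimedata
train_idx <- sample(seq_len(n), size = 0.75 * n) # 3,000 row indices
test_idx  <- setdiff(seq_len(n), train_idx)      # the remaining 1,000
length(train_idx)  # 3000
length(test_idx)   # 1000
```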

Intercorrelation

Before we fit the model, let's have a look at the intercorrelations.

library(corrplot) # we load the library corrplot
## Warning: package 'corrplot' was built under R version 3.3.3
corrplotbase <- cor(training)
corrplot(corrplotbase, method="pie") ## apply corrplot with method "pie"
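As a minimal illustration of what cor() feeds into the plot (toy data, our own sketch): the result is a symmetric matrix with ones on the diagonal, and each off-diagonal cell is the pairwise correlation that corrplot turns into a pie.

```r
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- toy$x + rnorm(200, sd = 0.5)  # y strongly related to x
toy$z <- rnorm(200)                    # z unrelated to both
m <- cor(toy)
round(m, 2)  # diagonal = 1; cor(x, y) high, cor(x, z) near zero
```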

Fitting the model

Now the moment has come to fit the (regression) model:

## now we fit the linear model from Caret
lmfit <- train(Lifetimesum ~ .,
               data = training, 
               method = "lm")
summary(lmfit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2100.18 -1003.85     3.03   972.65  2093.47 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1856.10651   75.92703  24.446  < 2e-16 ***
## InitialDonation   -0.01945    0.02652  -0.733    0.463    
## LastDonation       0.09741    0.02213   4.403 1.11e-05 ***
## Age                0.01976    0.99975   0.020    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1145 on 2996 degrees of freedom
## Multiple R-squared:  0.006653,   Adjusted R-squared:  0.005658 
## F-statistic: 6.688 on 3 and 2996 DF,  p-value: 0.0001696
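To read the coefficient table: a prediction is simply the intercept plus each estimate times its predictor value. As a worked example with the estimates printed above (the donor values are made up, and since the underlying data are dummy values, the result is only an illustration of the arithmetic):

```r
## intercept, InitialDonation, LastDonation, Age (from the summary above)
coefs <- c(1856.10651, -0.01945, 0.09741, 0.01976)
donor <- c(1, 1500, 2000, 45)  # leading 1 multiplies the intercept
sum(coefs * donor)             # about 2022.6
```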
finalmodel <- lmfit$finalModel
testing$pred <- predict(finalmodel, testing) ## add the prediction column to the test set; predict(lmfit, testing) on the caret object also works and applies any pre-processing
require("lattice")
xyplot(testing$Lifetimesum ~ testing$pred, data = testing, auto.key = TRUE, pch = 20, col = 'red')

The scatterplot illustrates how poor the prediction is. So: there is still some work (data collection, modelling) to be done :-)
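Beyond the scatterplot, the fit can be quantified with RMSE and R-squared on the test set; caret's postResample(pred, obs) reports both. A self-contained sketch of the calculation on toy numbers (our own example, not the donor data):

```r
set.seed(7)
obs  <- rnorm(50, mean = 100, sd = 15)  # stand-in for testing$Lifetimesum
pred <- obs + rnorm(50, sd = 10)        # stand-in for testing$pred

rmse <- sqrt(mean((obs - pred)^2))  # typical size of a prediction error
rsq  <- cor(obs, pred)^2            # share of variance "explained"
c(RMSE = rmse, Rsquared = rsq)
```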

