Predicting with the Caret Package in R

7/3/2017

The Caret package ("Caret" stands for Classification And REgression Training) includes functions to support the model training process for complex regression and classification problems. It is widely used in the context of machine learning, which we have dealt with earlier this year; see this blog post from spring if you are interested in an overview. Alongside other, more advanced algorithms, regression approaches are an established element of machine learning. There are plenty of resources on regression, so we will not cover the basics in this post; there is, however, an earlier post if you are looking for a brief introduction.

In the following, the HTML output of a knitr report was pasted into the blog post. We used made-up data with donor lifetime value as the dependent variable and the initial donation, the last donation and donor age as predictors.

In short: the explanatory and predictive power of the model is low (for which the dummy data is partly to blame). However, the code you will find below aims to illustrate important concepts of machine learning in R:
  • Data management to enable algorithms to work properly
  • Splitting the data into a training and test dataset
  • Applying predictions after fitting the model
  • Correlation plots as a nice add-on.

library(knitr)
lifetimedata <- read.table("C:/Users/johan/Documents/lifetime.txt", header = TRUE) ## here we read in the data
dim(lifetimedata) ## it's 4,000 cases in 5 columns
## [1] 4000    5
str(lifetimedata) ## str gives us a summary of the dataframe structure
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: Factor w/ 2729 levels "10,01","10,05",..: 2209 2557 2110 1352 2044 575 2091 177 620 1730 ...
##  $ LastDonation   : Factor w/ 3264 levels "10","10,05","10,08",..: 3260 3038 2854 2794 2758 2679 2570 2515 2484 2467 ...
##  $ Age            : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : Factor w/ 3975 levels "100","100,24",..: 2450 2404 1881 1663 1 956 3875 1619 1301 2100 ...
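Side note: the amount columns come in as factors because the raw file apparently uses a decimal comma (the factor levels look like "10,01"). If that is the case, read.table() can parse them as numbers right away via its dec argument; a minimal sketch, reusing the path from above:

lifetimedata <- read.table("C:/Users/johan/Documents/lifetime.txt",
                           header = TRUE, dec = ",") ## comma as decimal separator, amounts arrive as numeric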

Preparing Training and Test Data

We see that InitialDonation, LastDonation and Lifetimesum are factors, so let's prepare the data.

lifetimedata$InitialDonation <- as.numeric(lifetimedata$InitialDonation) ## InitialDonation to numeric
lifetimedata$LastDonation <- as.numeric(lifetimedata$LastDonation)  ## LastDonation to numeric
lifetimedata$Age <- as.numeric(lifetimedata$Age) ## Age to numeric
lifetimedata$Lifetimesum <- as.numeric(lifetimedata$Lifetimesum) ## Lifetimesum to numeric
str(lifetimedata) ## another look at the file
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
lifetimedata <- lifetimedata[,2:5] ## DonorId is not needed ... we subset columns 2 to 5
str(lifetimedata) ## check whether subset worked
## 'data.frame':    4000 obs. of  4 variables:
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
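A word of caution on this conversion: as.numeric() applied to a factor returns the internal level codes, not the original amounts (the converted values above are exactly the factor codes from the first str() output). If you want the true donation values from factors like these, go through the character representation and replace the decimal comma first; a minimal sketch:

to_numeric <- function(x) as.numeric(gsub(",", ".", as.character(x))) ## "10,01" -> 10.01
lifetimedata$InitialDonation <- to_numeric(lifetimedata$InitialDonation)
lifetimedata$LastDonation <- to_numeric(lifetimedata$LastDonation)
lifetimedata$Lifetimesum <- to_numeric(lifetimedata$Lifetimesum)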

Now that we have a decent dataset, we go ahead and load the prominent machine learning package Caret (Classification And REgression Training).

Data cleaning

So-called near-zero-variance variables (i.e. predictors that are (almost) constant across all observations) should be removed before modelling, since they carry hardly any information.
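Caret ships a helper for exactly this check, nearZeroVar(); a minimal sketch of how it could be applied here (caret is loaded again in the report below, it is repeated only to keep the sketch self-contained):

library(caret) ## nearZeroVar() lives in caret
nzv <- nearZeroVar(lifetimedata, saveMetrics = TRUE) ## frequency ratio, percent unique and nzv flag per column
nzv ## inspect the diagnostics
lifetimedata <- lifetimedata[, !nzv$nzv] ## drop any flagged near-zero-variance columns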

library(caret) ## we load the Caret package
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=lifetimedata$Lifetimesum, p=0.75, list=FALSE) ## createDataPartition splits in train and test set
training <- lifetimedata[inTrain,]
testing <- lifetimedata[-inTrain,]
dim(training) ## the training set contains 3,000 cases
## [1] 3000    4
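Since createDataPartition() draws a random sample, setting a seed beforehand makes the split (and everything downstream) reproducible; a minimal sketch, with an arbitrary seed value:

set.seed(42) ## arbitrary seed, for reproducibility only
inTrain <- createDataPartition(y = lifetimedata$Lifetimesum, p = 0.75, list = FALSE)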

Intercorrelation

Before we fit the model, let's have a look at the intercorrelations.

library(corrplot) # we load the library corrplot
## Warning: package 'corrplot' was built under R version 3.3.3
corrplotbase <- cor(training)
corrplot(corrplotbase, method="pie") ## apply corrplot with method "pie"
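The underlying numbers can also be printed directly if you prefer them over the plot; a minimal sketch:

round(corrplotbase, 2) ## pairwise correlations, rounded to two decimals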

Fitting the model

Now we fit the (regression) model:

## now we fit the linear model from Caret
lmfit <- train(Lifetimesum ~ .,
               data = training, 
               method = "lm")
summary(lmfit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2100.18 -1003.85     3.03   972.65  2093.47 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1856.10651   75.92703  24.446  < 2e-16 ***
## InitialDonation   -0.01945    0.02652  -0.733    0.463    
## LastDonation       0.09741    0.02213   4.403 1.11e-05 ***
## Age                0.01976    0.99975   0.020    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1145 on 2996 degrees of freedom
## Multiple R-squared:  0.006653,   Adjusted R-squared:  0.005658 
## F-statistic: 6.688 on 3 and 2996 DF,  p-value: 0.0001696
finalmodel <- lmfit$finalModel
testing$pred <- predict(finalmodel, testing) ## we add the prediction column to the test dataset
require("lattice")
xyplot(testing$Lifetimesum ~ testing$pred, data = testing, auto.key = TRUE, pch = 20, col = 'red')
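To complement the scatterplot with a single number, caret's postResample() computes RMSE and R-squared for the hold-out predictions against the observed values; a minimal sketch:

postResample(pred = testing$pred, obs = testing$Lifetimesum) ## RMSE and R-squared on the test set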

The scatterplot illustrates the relatively poor predictive power of the model. So: there is still some work (data collection, modelling) to be done :-)
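On the modelling side, one low-effort next step would be to let train() estimate performance via repeated k-fold cross-validation instead of its default bootstrap resampling; a minimal sketch:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5) ## 10-fold CV, repeated 5 times
lmfit_cv <- train(Lifetimesum ~ .,
                  data = training,
                  method = "lm",
                  trControl = ctrl)
lmfit_cv$results ## resampled RMSE and R-squared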
