
Data Science 2018

12/21/2017


The year 2017 is drawing to a close. For many people, the days between Christmas and New Year are a time to reflect on the opportunities and challenges the coming year will bring, and the same holds for trends in Data Science. We did some research on Data Science outlooks for 2018.


Machine Learning

Machine Learning has been one of the buzzwords of the recent past, and we published an introductory blog post on the topic earlier this year. One might say the term is over-hyped; however, machine learning is applied in academia and across industries, as a very recent survey by KDnuggets shows.

Dataconomy, a Berlin-based Data Science company, elaborates on the promising field of machine learning but also mentions where organizations are starting from. Regardless of Data Science concepts and tools, 77 percent of German companies still rely on “small” data tools like Excel and Access. Many of them might still have plenty of homework to do to transform into what Dataconomy calls data-driven organizations. At the same time, a recent KPMG study (available upon request in German) shows that 60 percent of companies have been able to benefit in different forms (reduction of costs or risks and/or increase of revenues) from Data Science – which of course includes Machine Learning and Artificial Intelligence.

Artificial Intelligence (AI)

Speaking of artificial intelligence: for many, 2018 will be the year when AI breaks through. The prominent research company Gartner lists AI as one of the most important strategic technology trends for 2018. They refer to a recent survey showing that 6 out of 10 organizations are in the process of evaluating AI strategies, whereas the remaining 4 have already started adopting respective technologies.

The analytics company absolutedata goes as far as to speak of AI-powered marketing and formulates certain predictions regarding what 2018 will bring in this context:
  • Real Time Behavioral Indicators
  • Dynamic Segmentation
  • Customer Directed Marketing
  • Fewer Campaigns, Higher ROI
  • More Creative Thinking, Less Routine Work
  • Marketing Tech Stack Consolidation
  • The Emergence of New Buying Journeys
  • Better Understanding of When to STOP Marketing

Bill Vorhies, Editorial Director for Data Science Central, is a bit more hesitant when it comes to AI. He predicts that – regardless of the hype – the diffusion of techniques and tools from the field of AI and deep learning will be slower and more gradual than many expect. One already visible manifestation of the spread of AI is chat bots, which are increasingly used in web and mobile contexts. Chat bots essentially process natural language and thereby engage customers and prospects in an interaction. The implementation of facial and gesture recognition currently looks like the next big thing, as possible applications at the point of sale seem vast.

How should nonprofits deal with AI? Steve MacLaughlin, author and Vice President of Data & Analytics at Blackbaud, underlines the vast opportunities for nonprofits but also puts the buzz around AI into perspective. MacLaughlin explains that AI for nonprofits requires the availability of the right data, contextual expertise, and continuous learning. Given these factors, AI can help nonprofits be impactful, particularly through fundraising.

Big Data

We also dealt with Big Data – probably the buzziest of all buzzwords from the field – in February this year. We quite liked a recent blog post by KDnuggets that starts as follows:

There's no denying that the term Big Data is no longer what it used to be [...]. We now all just assume and understand that our everyday data is huge. There is, however, still value in treating Big Data as an entity or a concept which needs to be properly managed, an entity which is distinct from much smaller repositories of data in all sorts of ways.

What follows is the gist of interviews with various experts from the field talking about their expectations for 2018 and beyond – well worth a holiday read.

Data Protection

The EU General Data Protection Regulation (GDPR) will be enforced from May 25th, 2018. It remains to be seen to what extent customers and donors will, for instance, actively insist on the Right to be Forgotten – which might have implications for the availability of donor data for advanced modelling as well.

People Needed!

More and more organizations seem to be developing an interest in Big Data experts, Data Architects, Data Scientists, etc. Without any doubt, advanced analytics and Data Science efforts need motivated and skilled women and men to succeed. Florian Teschner recently analyzed Data Science job offers. While there is still scope for absolute growth in the number of available positions, his analysis shows roughly a fivefold increase since 2015.

Stay tuned to Data Science in 2018

We think it will be worthwhile to keep one’s eyes open in 2018.

If you are interested in Data Science as well as the conceptual and technological developments in the field and you are not on Twitter yet, following some influential thinkers from the area might be an interesting starting point. Maptive.com provides a list of potential influencers.

If you want to dive a little deeper, some experts' GitHub accounts might be a place to go if you are looking for papers, code etc. – just follow the overview provided by analyticsvidhya.com.

Last, but not least ...

... we wish you and your dear ones a happy and relaxed (data-free) Christmas and a good start into a successful and dynamic 2018. See you next year on this blog!







Predicting with the Caret Package in R

7/3/2017

The Caret package ("Caret" stands for Classification And REgression Training) includes functions to support the model training process for complex regression and classification problems. It is widely used in the context of Machine Learning, which we have dealt with earlier this year; see this blog post from spring if you are interested in an overview. Apart from other, even more advanced algorithms, regression approaches represent an established element of machine learning. There are tons of resources on regression, so we will not focus on the basics in this post; there is, however, an earlier post if you are looking for a brief introduction.

In the following, the output of a knitr report rendered to HTML was pasted onto the blog. We used made-up data with donor lifetime value as the dependent variable and the initial donation, the last donation and donor age as predictors.

In short: the explanatory and predictive power of the model is low (for which the dummy data is partly to blame). However, the code below illustrates important concepts of machine learning in R:
  • Data management to enable algorithms to work properly
  • Splitting the data into a training and test dataset
  • Applying predictions after fitting the model
  • Correlation plots as a nice add-on.

library(knitr)
lifetimedata <- read.table("C:/Users/johan/Documents/lifetime.txt", header = TRUE) ## here we read in the data
dim(lifetimedata) ## it's 4000 cases in 5 columns
## [1] 4000    5
str(lifetimedata) ## str gives us a summary of the dataframe structure
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: Factor w/ 2729 levels "10,01","10,05",..: 2209 2557 2110 1352 2044 575 2091 177 620 1730 ...
##  $ LastDonation   : Factor w/ 3264 levels "10","10,05","10,08",..: 3260 3038 2854 2794 2758 2679 2570 2515 2484 2467 ...
##  $ Age            : int  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : Factor w/ 3975 levels "100","100,24",..: 2450 2404 1881 1663 1 956 3875 1619 1301 2100 ...

Preparing Training and Test Data

We see that InitialDonation, LastDonation and Lifetimesum are factors ... so let's prepare the data.

lifetimedata$InitialDonation <- as.numeric(lifetimedata$InitialDonation) ## InitialDonation to numeric
lifetimedata$LastDonation <- as.numeric(lifetimedata$LastDonation)  ## LastDonation to numeric
lifetimedata$Age <- as.numeric(lifetimedata$Age) ## Age to numeric
lifetimedata$Lifetimesum <- as.numeric(lifetimedata$Lifetimesum) ## Lifetimesum to numeric
str(lifetimedata) ## another look at the file
## 'data.frame':    4000 obs. of  5 variables:
##  $ DonorId        : int  166 1749 2500 1024 858 2130 446 2571 2119 139 ...
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
lifetimedata <- lifetimedata[,2:5] ## DonorId is not needed ... we keep columns 2 to 5
str(lifetimedata) ## check whether subset worked
## 'data.frame':    4000 obs. of  4 variables:
##  $ InitialDonation: num  2209 2557 2110 1352 2044 ...
##  $ LastDonation   : num  3260 3038 2854 2794 2758 ...
##  $ Age            : num  18 18 18 18 18 18 18 18 18 18 ...
##  $ Lifetimesum    : num  2450 2404 1881 1663 1 ...
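A caveat if you reuse this code on real data: as.numeric() applied to a factor returns the internal level codes, not the underlying amounts. Since the raw values here use a decimal comma (e.g. "10,01"), a conversion that preserves the actual amounts could look like the following sketch, applied to the original factor columns (not part of the report above; for our made-up data the level codes are good enough to demonstrate the workflow):

to_numeric <- function(x) as.numeric(gsub(",", ".", as.character(x))) ## replace the decimal comma, then convert
lifetimedata$InitialDonation <- to_numeric(lifetimedata$InitialDonation)
lifetimedata$LastDonation <- to_numeric(lifetimedata$LastDonation)
lifetimedata$Lifetimesum <- to_numeric(lifetimedata$Lifetimesum)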

As we have a decent dataset now, we go ahead and load the prominent machine learning package Caret (Classification And Regression Training).

Data cleaning

So-called near-zero-variance variables (i.e. variables whose values are (nearly) all the same) should be removed before modelling, as they carry hardly any information; our report proceeds without removing any.
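Caret offers the nearZeroVar() function for this check; a minimal sketch of what it could look like on our data:

library(caret) ## if not already loaded; nearZeroVar() comes with the Caret package
nzv <- nearZeroVar(lifetimedata, saveMetrics = TRUE) ## one row of diagnostics per variable
nzv ## inspect the zeroVar and nzv columns; TRUE flags candidates for removal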

library(caret) ## we load the Caret package
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y=lifetimedata$Lifetimesum, p=0.75, list=FALSE) ## createDataPartition splits in train and test set
training <- lifetimedata[inTrain,]
testing <- lifetimedata[-inTrain,]
dim(training) ## the training set contains 3,000 cases
## [1] 3000    4

Intercorrelation

Before we fit the model, let's have a look at the intercorrelations.

library(corrplot) # we load the library corrplot
## Warning: package 'corrplot' was built under R version 3.3.3
corrplotbase <- cor(training)
corrplot(corrplotbase, method="pie") ## apply corrplot with method "pie"

Fitting the model

Now we fit the (regression) model:

## now we fit the linear model from Caret
lmfit <- train(Lifetimesum ~ .,
               data = training, 
               method = "lm")
summary(lmfit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2100.18 -1003.85     3.03   972.65  2093.47 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1856.10651   75.92703  24.446  < 2e-16 ***
## InitialDonation   -0.01945    0.02652  -0.733    0.463    
## LastDonation       0.09741    0.02213   4.403 1.11e-05 ***
## Age                0.01976    0.99975   0.020    0.984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1145 on 2996 degrees of freedom
## Multiple R-squared:  0.006653,   Adjusted R-squared:  0.005658 
## F-statistic: 6.688 on 3 and 2996 DF,  p-value: 0.0001696
finalmodel <- lmfit$finalModel
testing$pred<- predict(finalmodel, testing) ## we add the prediction column to the test dataset
require("lattice")
xyplot(testing$Lifetimesum ~ testing$pred, data = testing, auto.key = TRUE, pch = 20, col = 'red')

The scatterplot illustrates the relatively poor quality of the prediction. So: still some work (data collection, modelling) to be done :-)
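To put a number on that impression, Caret's postResample() compares predictions against observed values; a quick sketch, assuming the objects created above (note that calling predict(lmfit, testing) on the train object directly is the more common Caret idiom and should yield the same predictions here):

postResample(pred = testing$pred, obs = testing$Lifetimesum) ## accuracy measures such as RMSE and R-squared on the test set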


What fundraisers can learn from machines

2/19/2017

[Image: Daintree, Australia]
Artificial intelligence (AI) has been an influential concept not only in the scientific community but also in popular culture. Depending on one's attitude towards AI, one might associate it with characters such as the inhibited but likeable android Data from Star Trek or the neurotic and vicious HAL 9000 from the Space Odyssey movies. The form of AI those two represent is called General AI, i.e. the intelligence of a machine that enables it to perform intellectual tasks as well as or better than a human. General AI is – at least for the moment – still a topic for science fiction. What is interesting for various industries is the notion of Narrow AI (also termed Weak AI): technologies that perform specific, well-defined tasks in an automated manner just as well as or even better than humans could.

In the context of data, machine learning can be seen as an approach to achieving artificial intelligence. Machine learning is about analyzing data, learning from it and using the insights gained for decision making or predictions about something.

The learning in machine learning can be attempted in two generic forms:
  • The first is so-called supervised learning. Supervised learning algorithms aim to make predictions based upon a given set of data examples for which properties are known. A machine learning algorithm can for instance look for patterns in historic stock prices and take any possibly relevant information into account that is available with it, be it days of the week, indicators that describe company performance, the state of the economy or even the weather. As soon as an algorithm has found the relationship between the stock price (dependent variable) and the independent variables (the ones that allegedly influence stock prices), it can be used to predict future outcomes by plugging in values for the variables. This might ring a bell for many who have come across regression models in their education; in fact, linear regression is said to be the most popular machine learning model (see the sketch after this list).
  • In unsupervised learning, data comes without pre-defined labels, i.e. classifications or categorizations are not included. The goal of unsupervised learning algorithms is to discover hidden structures. A practical example is the functionality in Apple's iOS that tries to sort photos by the actual faces on them. Unsupervised learning seems much harder, as we ask the algorithm to do something for us without telling it (or knowing) how to do it.
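To make the two forms concrete, here is a minimal sketch in base R with made-up data (illustrative only – the variable names and numbers are invented):

set.seed(42) ## reproducible made-up data
stocks <- data.frame(price = rnorm(100, mean = 100, sd = 10))
stocks$price_next <- 0.8 * stocks$price + rnorm(100, mean = 20, sd = 5) ## a known relationship plus noise

## Supervised learning: learn the relationship from labelled examples ...
fit <- lm(price_next ~ price, data = stocks)
## ... and predict an unseen case by plugging in a value
predict(fit, newdata = data.frame(price = 105))

## Unsupervised learning: k-means looks for structure without any labels
clusters <- kmeans(scale(stocks), centers = 3)
table(clusters$cluster) ## sizes of the groups the algorithm discovered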
Going through all possibly relevant algorithms would go beyond the scope of this post, which is intended as an introductory one (more specific ones are planned, though). As I have started diving deeper into machine learning recently, I can recommend this cheat sheet by Mithun Sridharan before you start your desk research and do the googling. You might also find Laura Hamilton's overview of the pros and cons of the most popular algorithms helpful for a start. If you can't wait to start trying things, you might find this blog post on KDnuggets worth a look, as it offers heaps of links to machine learning cheat sheets for different platforms. As a fan of R, I can particularly recommend this overview by Yanchang Zhao.

What can fundraisers do? I have to say that I did not come across many practice-sharing posts or articles when I researched this post. I doubt that either the availability of data or the competence portfolio of analysts and data scientists limits the possibilities of fundraising organizations when it comes to machine learning techniques. However, there might be a certain level of insecurity regarding where and how to start. I found an inspiring blog post by Stephen W. Lambert in which he explains that all you basically need to start diving into machine learning techniques is a computer, a database with relevant data and your brain. I think Lambert's text invites fundraising organizations to do their theoretical and conceptual homework, process and prepare their data accordingly and start experimenting with machine learning techniques. So – go ahead and try.




