Playing around the Net | A short introductory example to Neural Networks in R

8/7/2018

So-called Artificial Neural Networks (ANN) are a family of popular Machine Learning algorithms that have contributed to advances in data science, e.g. in processing speech, vision and text. In essence, a Neural Network can be seen as a computational system that provides predictions based on existing data. Neural Networks are comparable to non-linear regression models (such as logit regression); their potential strength lies in the ability to process a large number of model parameters.

Neural Networks are good at learning non-linear functions. Moreover, multiple outputs can be modelled.

Artificial Neural Networks are loosely inspired by the biological neural networks within animal and human brains. They consist of the following key components:

Input layer: The descriptive variables that are fed into the prediction.
Hidden layers: A user-defined number of layers with a specified number of so-called neurons in each layer.
Output layer: The variable(s) we are trying to predict. The output could be the label of an image or a binary 0/1 (TRUE/FALSE) value.
Weights: Each neuron in any given layer is potentially linked to every neuron in the neighbouring layer. The weights reflect the importance of the respective connection.
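
To make the interplay of these components more tangible, here is a minimal sketch of a single forward pass in base R. It assumes 4 inputs, one hidden layer with 4 neurons, one output and a logistic (sigmoid) activation; all weights are invented for illustration and do not come from a fitted model.

```r
# Minimal forward pass: 4 inputs -> 4 hidden neurons -> 1 output.
# All weights are made-up illustrative values, not fitted ones.
sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(0.2, 1.0, 0.5, 0.1)   # one (already scaled) input record

# rows: inputs, columns: hidden neurons
W_hidden <- matrix(seq(-0.4, 0.35, length.out = 16), nrow = 4)
b_hidden <- rep(0.1, 4)                   # bias weights into the hidden layer
w_out    <- c(0.5, -0.3, 0.25, -0.1)      # hidden -> output weights
b_out    <- 0.05                          # bias weight into the output

# weighted sum of the inputs per hidden neuron, then the activation
hidden <- sigmoid(as.vector(x %*% W_hidden) + b_hidden)

# weighted sum of the hidden activations, then the activation
output <- sigmoid(sum(hidden * w_out) + b_out)
output
```

Training a network amounts to adjusting these weights so that the output gets as close as possible to the observed values.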

In a simplified manner, a Neural Network can be visualized as follows:

For the simplified application example below, we produced an example dataset with some 140,000 records. Imagine that we start with a relatively large dataset of sporadic donors and have come up with a straightforward definition of the dependent churn variable, e.g. one based on the recency of the last donation.

The features (variables) we included were:

AgeAtEntry: Age of respective supporter at the time of database entry (typically first donation) [Integer]
Title: Is there any title like PhD, Dr., Professor etc. [Binary]
IncomeEst: Externally enriched income estimation on person level [Integer]
FirstDon: Initial donation of supporter [Integer]
Churn: This is our binary (0/1) dependent variable.

We start with loading the relevant R packages, reading in our base dataset and some data pre-processing.

An essential step in setting up Neural Networks is data normalization, i.e. rescaling the features to comparable ranges so that no variable dominates merely because of its units. See for instance this link for some brief conceptual considerations and information on the scale function in R.
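
As a rough sketch of what such a scaling step can look like, the snippet below applies the base R function scale to a tiny made-up data frame that borrows the feature names from above. Whether z-scores or min-max scaling is preferable depends on the data; both variants are shown here as assumptions, not as the exact pre-processing used for this post.

```r
# Tiny made-up sample using the feature names from the list above
donors <- data.frame(
  AgeAtEntry = c(25, 40, 55, 70),
  Title      = c(0, 1, 0, 0),
  IncomeEst  = c(20000, 35000, 50000, 65000),
  FirstDon   = c(10, 20, 50, 100)
)

# z-score scaling: subtract the column mean, divide by the column SD
donors_scaled <- as.data.frame(scale(donors))

# min-max scaling to the range [0, 1] as an alternative
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
donors_minmax <- as.data.frame(lapply(donors, min_max))

round(donors_scaled, 2)
```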

We then split the dataset into a training and test dataset using a 70% split.
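
A 70/30 split of this kind can be sketched in base R as follows. The data frame, its size and the seed are placeholders chosen for illustration.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 1000      # made-up number of records

# placeholder data: an id and a synthetic 0/1 churn flag
donors <- data.frame(id = seq_len(n), Churn = rbinom(n, 1, 0.3))

# draw 70% of the row indices at random for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train <- donors[train_idx, ]
test  <- donors[-train_idx, ]   # the remaining 30%

c(train = nrow(train), test = nrow(test))
```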

Now we are ready to fit the model. We use the package nnet with one hidden layer containing 4 neurons. We run a maximum of 5,000 iterations using the code shown in code snippet number 4:
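
For orientation, a call to nnet with these settings might look like the sketch below. It is not the original snippet: the training data is synthetic, the seed is arbitrary, and coding Churn as a factor (which makes nnet fit a classification model) is our assumption.

```r
library(nnet)  # a recommended package that ships with standard R installations

set.seed(1)    # arbitrary seed; nnet starts from random initial weights
n <- 500       # made-up sample size for illustration

# synthetic, already-scaled stand-ins for the four features
train <- data.frame(
  AgeAtEntry = rnorm(n),
  Title      = rbinom(n, 1, 0.1),
  IncomeEst  = rnorm(n),
  FirstDon   = rnorm(n)
)
# synthetic churn flag loosely tied to age, purely for demonstration
train$Churn <- factor(rbinom(n, 1, plogis(-1 - 0.8 * train$AgeAtEntry)))

fit <- nnet(
  Churn ~ AgeAtEntry + Title + IncomeEst + FirstDon,
  data  = train,
  size  = 4,      # one hidden layer with 4 neurons
  maxit = 5000,   # upper bound on the number of iterations
  trace = FALSE   # suppress the iteration log
)

# (4 inputs + bias) * 4 hidden + (4 hidden + bias) * 1 output = 25 weights
length(fit$wts)
```

The argument maxit is only an upper bound; the optimiser usually stops earlier once it has converged.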

After fitting the model, we plot our neural net object. The neuron B1 in the illustration below is a so-called bias unit. This is an additional neuron added to each pre-output layer (in our case one). Bias units are not connected to any previous layer and therefore do not represent an "activity". They can still have outgoing connections and may contribute to the outputs that way. There is a compact post on Quora with a more detailed discussion.

When it comes to modelling in a data science context, it is quite common to look at the variable importance within the respective model. For neural nets, there is a convenient way to do this using the function olden from the package NeuralNetTools. For readers interested in the conceptual foundations of this function, we can recommend this paper.

This is the chart that we get:

It stands out that the variable Age at entry has a strongly negative importance for the output, whereas Estimated Income shows some degree of positive variable importance.
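
The idea behind this kind of importance measure (the connection weights approach) can be sketched in a few lines of base R: for each input, the products of the input-to-hidden and hidden-to-output weights are summed over all hidden neurons, so the sign of the result is preserved. The weights below are invented for illustration, use only two hidden neurons, leave out the bias weights and are not the weights of the model discussed here.

```r
# Connection-weights idea: for each input, sum over the hidden neurons of
# (input -> hidden weight) * (hidden -> output weight).
# All weights are made up; bias weights are left out.
W_in_hidden <- matrix(
  c(-1.2,  0.4,
     0.1,  0.0,
     0.3,  0.5,
     0.0, -0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("AgeAtEntry", "Title", "IncomeEst", "FirstDon"),
    c("H1", "H2")
  )
)
w_hidden_out <- c(H1 = 1.5, H2 = -0.5)

importance <- as.vector(W_in_hidden %*% w_hidden_out)
names(importance) <- rownames(W_in_hidden)

sort(importance)   # negative values push the output down, positive ones up
```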

We finally turn to running the neural net model for predictive purposes on our test data set and plot our results in a confusion matrix-like manner:

The result of the code above looks as follows:

The table above cross-tabulates the actual and predicted outcomes of churned and non-churned donors. Let's now evaluate the predictive power of our example neural net. For this, we can recommend this nice guide to interpreting confusion matrices, which can be found here.
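
A cross-tabulation of this kind can be produced with the base R function table. The sketch below uses eight made-up predicted probabilities and actual outcomes; the cut-off of 0.5 is an assumption.

```r
# Made-up predicted churn probabilities and actual outcomes for 8 donors
prob   <- c(0.10, 0.80, 0.35, 0.60, 0.20, 0.55, 0.05, 0.45)
actual <- c(0,    1,    1,    0,    0,    1,    0,    1)

# turn probabilities into 0/1 predictions (0.5 is an assumed cut-off)
predicted <- as.integer(prob > 0.5)

# rows: actual outcome, columns: predicted outcome
conf_mat <- table(Actual = actual, Predicted = predicted)
conf_mat
```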

Overall accuracy (how often is the classifier correct?), i.e. (True Positives + True Negatives) / Total = (330 + 31,564) / 43,967, is relatively high at 0.725.
Specificity (True Negatives divided by all actual Negatives, i.e. the correctly predicted No-Churns) is 31,564 / (31,564 + 356) = 0.989. The share of predicted No-Churns that are correct (negative predictive value) is 0.73 [31,564 / (31,564 + 11,717)].
However, Sensitivity (True Positive Rate) is really low: True Positives (correctly predicted churn) divided by the total of actual churned donors is 330 / 12,047 = 0.0274.
Precision [True Positives over total predicted Positives, i.e. 330 / (330 + 356)] is also weak at 0.48.
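
The four cells implied by the figures above are 330 true positives, 356 false positives, 11,717 false negatives and 31,564 true negatives, which add up to the 43,967 test records. The sketch below recomputes the metrics from these cells. Note that dividing the true negatives by true negatives plus false negatives (the 0.73 figure) yields the negative predictive value, whereas specificity divides by true negatives plus false positives. The last line, the share of non-churners in the test data, is our addition as a naive point of comparison.

```r
# Confusion matrix cells taken from the figures in the text
TP <- 330      # churn predicted, churn observed
FP <- 356      # churn predicted, no churn observed
FN <- 11717    # no churn predicted, churn observed
TN <- 31564    # no churn predicted, no churn observed
total <- TP + FP + FN + TN

metrics <- c(
  accuracy    = (TP + TN) / total,
  sensitivity = TP / (TP + FN),    # true positive rate
  specificity = TN / (TN + FP),    # true negative rate
  precision   = TP / (TP + FP),    # positive predictive value
  npv         = TN / (TN + FN),    # negative predictive value
  baseline    = (TN + FP) / total  # accuracy of always predicting "no churn"
)
round(metrics, 3)
```

The accuracy of 0.725 is practically identical to the share of non-churners (0.726), i.e. to what a model that never predicts churn would achieve.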

In the light of our data and the example model described above, we can conclude that further model tuning is definitely needed. Tuning would focus on the hyperparameters used. At the same time, we would recommend running a "benchmark model" such as a logit regression against which to compare the neural net's performance.
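
Such a benchmark can be set up with the base R function glm. The following is a minimal sketch on synthetic data with two of the features; the coefficients used to simulate churn, the seed and the cut-off of 0.5 are all assumptions made for illustration.

```r
set.seed(7)    # arbitrary seed, only for reproducibility
n <- 1000      # made-up number of records

# synthetic, already-scaled features and a simulated churn flag
donors <- data.frame(AgeAtEntry = rnorm(n), IncomeEst = rnorm(n))
donors$Churn <- rbinom(
  n, 1, plogis(-1 - 0.8 * donors$AgeAtEntry + 0.3 * donors$IncomeEst)
)

# 70/30 split into training and test data
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- donors[train_idx, ]
test  <- donors[-train_idx, ]

# logit regression as the benchmark model
logit_fit <- glm(Churn ~ AgeAtEntry + IncomeEst, data = train,
                 family = binomial)

# predicted churn probabilities on the test data, cut at 0.5
test_prob <- predict(logit_fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)

benchmark_accuracy <- mean(test_pred == test$Churn)
benchmark_accuracy
```

The same confusion matrix metrics as above can then be computed for both models on the identical test set.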

As further reading we can recommend:

This introduction into Neural Nets from Towards Data Science.
This tutorial by David Selby that also elaborates on the underlying math.
A post on RPubs by Capt Spencer Butt which includes an extensive list of references.

As always, we look forward to your shares, likes, comments and thoughts.
Have a nice, hopefully long (rest of) summer!
