Ask your data

An Introduction to Machine Learning Interpretability with R

6/13/2021

Over the last decades, a lot of effort has gone into developing machine learning algorithms and methods. These methods are now widely used across companies and let us extract meaningful insights from raw data to solve complex problems that could hardly be solved otherwise. They make our lives (and our jobs) easier, but at what cost?

There is a good reason why machine learning methods are known as "black boxes": they have become so complex that it is hard to know what exactly is going on inside them. However, understanding how models work and making sure our predictions make sense is an important issue in any business environment. We need to trust our model and our predictions before applying them to business decisions. Understanding the model also helps us debug it and potentially detect bias, data leakage and other unwanted behaviour.

Towards interpretability: The importance of knowing our data

Whenever we talk about modelling, we should keep in mind that a lot of work on data preparation and understanding happens behind the scenes. Starting with the client's needs or interests, these have to be translated into a proper business question, around which we then design an experiment. That design should specify not just the desired output and a suitable model, but also – and more importantly – the data needed for it. That data needs to exist, be queryable and have sufficient quality. Of course, the data also needs to be explored and useful variables (i.e. variables related to the output of interest) selected.
In other words: modelling is not an isolated process, and its results cannot be understood without first understanding the data used to produce them, as well as its relationship with the predicted outcome.

Interpretable vs. non-interpretable models

So far we have only talked about black-box models. But are all models actually hard to interpret? The answer is no. Some models are simpler and intrinsically interpretable, including linear models and decision trees. But since that decrease in complexity usually comes at a cost in performance, we tend to use more complex models, which are hard to interpret. Or are they?

In fact, intensive research has gone into developing model interpretability methods, and two main types of methods exist:
  • Model-specific methods, which only work for a particular model class. Feature importance plots are a good example: many R packages contain their own functions for calculating feature importance within a model, and the results are not directly comparable with the way feature importance is calculated by a different package/model.
  • Model-agnostic methods, which can be applied to any machine learning model after training. Their results are therefore comparable between different models. The model-agnostic equivalent of feature importance is the permutation feature importance method.

These methods can also be grouped, depending on the scope of the predictions they explain, into:
  • Global interpretability methods, which try to explain how predictions in general are affected by parts of the model (features).
  • Local interpretability methods, which try to explain how the model arrives at specific individual predictions or small groups of predictions.
  • In addition, and even though this knowledge alone is not enough to interpret our predictions, it is important to have a general overview of how algorithms learn from our input data to create models. This is known as algorithm interpretability or transparency.

Some global interpretability examples

As previously mentioned, probably the most widely used method is the calculation of feature importance, and many packages have their own functions to calculate it. For instance, the caret package has the function varImp(), which we have used to produce the following example. There, we can see that the features "gender-male" and "age" seem to be the most important ones for predicting survival probability on the Titanic (yes, we have used the famous Kaggle Titanic dataset to build our models).
[Figure: model-specific feature importance for the Titanic model, computed with caret's varImp()]
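For readers who want to reproduce this kind of plot, a minimal sketch of the setup is shown below. It assumes a local copy of the Kaggle Titanic training file; the file name, feature selection and model choice are illustrative, not the exact code behind the figure.

```r
# Minimal sketch: train a caret model on the Titanic data and plot its
# model-specific feature importance. File name and model choice are assumptions.
library(caret)

titanic <- read.csv("titanic_train.csv")   # hypothetical local copy of the Kaggle file
titanic <- na.omit(titanic[, c("Survived", "Pclass", "Sex", "Age", "Fare", "Embarked")])
titanic$Survived <- factor(titanic$Survived, labels = c("No", "Yes"))
titanic$Sex      <- factor(titanic$Sex)
titanic$Embarked <- factor(titanic$Embarked)

fit <- train(Survived ~ ., data = titanic, method = "rf",   # any caret-supported model works
             trControl = trainControl(method = "cv", number = 5))

plot(varImp(fit))   # model-specific feature importance, as in the plot above
```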
Partial dependence plots are also widely used. These plots show how the predicted output changes when we change the values of a given predictor variable. In other words, they show the effect of single features on the predicted outcome, controlling for the values of all other features.
To build them, the function partial() from the pdp package can be used. For instance, in the following partial dependence plot we can see that paying a low fare seems to have a positive effect on survival – which makes sense, knowing for instance that children had priority on the boats!
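A sketch of how such a plot can be produced with pdp, reusing the hypothetical caret model fit from the snippet above (which.class = 2 selects the "survived" class):

```r
library(pdp)

# Partial dependence of the predicted survival probability on Fare
pd_fare <- partial(fit, pred.var = "Fare", prob = TRUE, which.class = 2)
plotPartial(pd_fare)
```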

Some local interpretability examples

Local interpretability techniques can be explored with the packages DALEX and modelStudio, which provide a very nice interactive dashboard where we can choose which methods and which observations we are most interested in.
​
[Figure: interactive modelStudio dashboard]
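A minimal sketch of how such a dashboard can be launched, again assuming the caret model and cleaned Titanic data from the snippets above:

```r
library(DALEX)
library(modelStudio)

# Wrap the model in a DALEX explainer (predictors only in 'data', binary target in 'y')
explainer <- explain(fit,
                     data  = titanic[, -1],
                     y     = titanic$Survived == "Yes",
                     label = "Titanic model")

modelStudio(explainer)   # opens the interactive dashboard in the browser
```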
One of the best methods included are the so-called break-down plots, which show how the contributions attributed to individual explanatory variables shift the mean model prediction towards the actual prediction for a particular single observation. In the following example of a 30-year-old male travelling in 2nd class, who paid 24 pounds and boarded in Cherbourg, we can see that the boarding port and the age had a positive contribution to the survival prediction, whereas his gender and the class had a negative one. In this way, we can study each of the observations we want or have to focus on – for instance, if we think that the model is not working properly for them.
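A sketch of how such a break-down plot can be generated with DALEX for the passenger described above; the passenger values follow the text, the code itself is illustrative and reuses the explainer from the previous snippet.

```r
# The single observation described in the text (factor levels taken from the training data)
passenger <- data.frame(Pclass   = 2,
                        Sex      = factor("male", levels = levels(titanic$Sex)),
                        Age      = 30,
                        Fare     = 24,
                        Embarked = factor("C", levels = levels(titanic$Embarked)))

bd <- predict_parts(explainer, new_observation = passenger, type = "break_down")
plot(bd)
```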

SHAP values are a similar method: each feature's contribution is obtained by looking at combinations of the remaining features and checking how adding that feature to each combination changes the prediction.
In the following example, for the same observation we just analysed, we can see that the results are very similar: gender shows the largest negative contribution, while the boarding port has the largest positive effect on the survival prediction for that specific passenger.

[Figure: SHAP value contributions for the selected passenger]
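With the same explainer and passenger, a SHAP-style decomposition can be sketched like this (B is the number of sampled feature orderings):

```r
sh <- predict_parts(explainer, new_observation = passenger, type = "shap", B = 25)
plot(sh)
```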
Finally, if we are interested in how individual observations' predictions change when feature values change, we can study individual conditional expectation (ICE) plots. Even though they can only display one feature at a time, they give us a feeling for how predictions react to changes in a feature. For instance, in the following example we can see that increasing age has a negative effect on the predicted survival of the Titanic passengers.
[Figure: individual conditional expectation plot for age]
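A sketch of how such ICE curves for age can be drawn with pdp, once more reusing the hypothetical model from above:

```r
# One curve per passenger: how the predicted survival probability changes with Age
ice_age <- partial(fit, pred.var = "Age", ice = TRUE, prob = TRUE, which.class = 2)
plotPartial(ice_age)
```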
Some last words
​

In this post we have given a brief introduction to the interpretability of machine learning models, explained why it is important to actually be able to interpret our results, and shown some of the most widely used methods. But just as a reminder: for similar performance, we should always prefer simpler models that are interpretable per se over highly complex machine learning ones!

Your Mind Plays Tricks on You: The relevance of Cognitive Biases in Data Science

1/27/2021

The diffusion of digital technologies has brought data into countless areas of our professional and private lives. As this big shift also impacts all types of businesses, many of them take measures and, for instance, try to facilitate a data-driven culture, formulate data strategies or invest in technology. Data analysts and data scientists play key roles in many modern organizations. It can be assumed that a large number of people – this of course includes managers and all kinds of "data people" – perceive themselves as rational and logical decision makers. The inconvenient truth is: they / we are not! A lot of human thinking is influenced by so-called cognitive biases. We will take a closer look at them in this blog post.

Cognitive biases are systematic patterns of deviation from rationality in judgment. They are the subject of research in fields like psychology and behavioral economics. What we call cognitive biases are mechanisms that developed within an evolutionary process. They helped our ancestors make fast decisions when needed, with limited information-processing capabilities. To a certain degree, these biases are an essential building block of our "gut feeling" and our intuition. This is what Daniel Kahneman, Nobel Prize winner in economics in 2002, has called System 1, the area of unconscious and fast decision making in our minds. The speed and ease of this system come at a price, as biases can lead to irrational and counter-factual decisions. Biases can affect human judgment in a professional context and in personal life.

Presumably rational and fact-oriented people like analysts and data scientists are not safe from cognitive biases either. Some authors even argue that they are more prone to bias due to the experimental and research-oriented nature of their work. As biases are essentially part of human nature and they are everywhere, it is important to be aware of them. This might enable us to give better advice to others and make more informed decisions ourselves. We will try to provide a light introduction, some hints for prevention and some interesting sources for further reading. Let us look at the most relevant cognitive biases one by one.
​

Confirmation Bias

The challenge: One could say that the confirmation bias is the "flagship" of cognitive biases. The underlying idea is that we favor data that confirm our existing beliefs and hypotheses. As everybody wants to be right, confirmation bias is literally everywhere. The mechanism is highly relevant for data scientists: one will tend to interpret results of analyses or model predictions as support for prior assumptions. It will not be possible to avoid confirmation bias completely – but being aware of it helps a lot.

What you can try: Write down your relevant hypotheses, ideas etc. before you run an actual analysis. As soon as you have results, go back to your notes and cross-check them against your prior beliefs.
​


Anchoring Effect

The challenge: As the metaphorical name suggests, the anchoring effect implies that the first quantitative judgment a person makes has an impact on subsequent judgments. When a salesperson offers you a more expensive product or tells you a higher price at the beginning of your talk, he or she is trying to anchor you to the expensive price point. You are expected to use this anchor as a benchmark for the subsequent prices, making them look cheaper.

Anchoring was researched intensively by Kahneman and Tversky in the 1960s and 1970s. However, there must have been awareness of this bias for centuries – just picture the bargaining on ancient markets.

Data scientists might be influenced by anchoring when groups or the impact of several new features are compared. Here, too, the very first interpretations will set the frame for the following ones.
​
What you can try: Define an anchor level a priori. If, for instance, you define a 5% lift in accuracy as a meaningful improvement, you can go back to this unbiased benchmark as soon as you have results.



Availability Bias

The challenge: We tend to take cognitive shortcuts by relying on information and prior experience that is available and easily accessible. This over-reliance on what is there may result in neglecting additional sources of data that would potentially improve predictions. If you are a data scientist developing a churn model for regular gifts, it is straightforward to take sociodemographic and behavioural data into consideration. The existence of the availability bias makes it worthwhile to think outside the box, particularly in terms of feature selection, knowing that there will be relevant constraints such as data protection rules, availability of data in the company or on the market and, last but not least, data quality.

What you can try: Invite others who can add different perspectives and ideas. Organize brainstorming sessions and jointly formulate hypotheses.

​


Curse of Knowledge

The challenge: Knowing too much can be a challenge when you try to communicate your knowledge to others. The curse of knowledge strikes when something is completely clear to you and you assume that it is obvious to everyone else. For data scientists, this bias can be a big obstacle when they are supposed to present results to stakeholders. Data scientists often invest a lot of time and energy to build analyses, models and lines of argumentation step by step. In many cases, everything finally makes sense and you have developed a big picture. Others have not gone through this process and did not have the chance to develop the level of understanding that you have. The curse of knowledge comes in many forms, such as using too many unexplained technical terms or jargon, as well as too little elaboration and "storytelling".

What you can try: "If you can't explain it simply, you don't understand it well enough" is a quote attributed to Albert Einstein. In the context of data science projects, this could mean investing time and effort in developing comprehensible, concise and well-structured presentations, reports etc. Focus on actionable information and key results, and provide additional background information if necessary or asked for.
​


Narrative Fallacy

The challenge: We all need stories. Humankind has used them to convey information wrapped in plots for thousands of years. As storytelling is so deeply human, we look for stories literally everywhere. The narrative fallacy refers to our limited ability to look at sequences of potentially unrelated and random facts, events etc. without finding logical links between them. Analysts and also decision makers might tend to connect dots, i.e. analytical insights, in a way that seems plausible. Nassim Taleb, author of "The Black Swan", says that explanations bind facts together. The facts then make more sense to us and are more easily remembered. The key question is whether these presumed relationships are really fact-based.

What you can try: Discuss whether a signal in the data is strong enough not to be noise. If this is the case, you might quickly develop a story. Try to formulate an "alternative narrative" that also leads to your results and is consistent with your data. Be aware that your story is one of many interpretations and that there might be unobserved or even unobservable relationships.

So what?

You can get it if you really want. But you must try, try and try.
Jimmy Cliff, "You Can Get It If You Really Want"

Overcoming cognitive biases completely might be almost impossible. However, raised awareness of how our minds try to trick us will already lead to noticeable improvements in judgment. If you are interested in the topic, we can recommend the following readings.

Books
  • Thinking, Fast and Slow by Daniel Kahneman
  • The Art of Thinking Clearly by Rolf Dobelli
  • Predictably Irrational. The Hidden Forces that Shape Our Decisions by Dan Ariely

Blog Posts
  • Practical Psychology for Data Scientists via Towards Data Science
  • Dealing with Cognitive Biases via Towards Data Science
  • YourBias


All the best and read you soon!


An introduction to Natural Language Processing (NLP)

11/20/2020


​by Fabian Pfurtscheller, BSc
Data Scientist at joint systems

In 2017, The Economist published a headline that has become famous among data aficionados: data is the new oil. It referred to the many new opportunities businesses and organisations can seize when beneficially utilising the data their customers or partners newly generate every day. In many cases, we have tried and tested statistical and mathematical methods that have seen use for decades. If data is neatly stored in a relational database and takes the form of numbers and the occasional categorical variable, we can use a multitude of statistical methods to analyse it and gain insight from it.

However, the real world is always messier than we would like it to be, and this applies to data as well: Forbes magazine estimates that 90% of the data generated every day is unstructured data – that is, text, images, videos and sounds; in short, all data that we cannot utilise using standard methods. Natural Language Processing (NLP) is a way of gaining insight into one category of unstructured data. The term describes all methods and algorithms that deal with processing, analysing and interpreting data in the form of (natural) language.
One key field of NLP is Sentiment Analysis – that is, analysing a text for the sentiment it contains or the emotions it evokes. Sentiment Analysis can be used to classify a text according to some metric such as polarity – is the text negative or positive? – or subjectivity – is the text written subjectively or objectively? While traditionally its applications have been focussed on social media – classifying, for instance, a brand's output on social media or analysing the public's reaction to its web presence – they are certainly not limited to the digital sphere. An NGO communicating with its supporters through channels such as direct mailing could, for example, use Sentiment Analysis to examine its mailings and investigate the responses they result in.
A frequently employed yet quite simple way of undertaking Sentiment Analysis is to use a so-called Sentiment Dictionary: simply a dictionary of words, each receiving a score according to some sentiment-related metric. For instance, to score a text for its polarity one would use a dictionary allocating to each word a continuous score in some symmetric interval around zero, where the ends of the interval correspond to strictly negative or strictly positive words, and more neutral words group around the midpoint. There are very different versions of Sentiment Dictionaries, depending on context and, most importantly, language – while there is some choice for English, the selection diminishes for other languages. A collection of dictionaries in the author's native German, for example, can be found here.
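As a toy illustration of the idea in Python (the mini-dictionary below is invented for the example; a real analysis would load a full resource such as SentiWS):

```python
# Toy dictionary-based polarity scoring; the word scores are made up for illustration.
sentiment_dict = {"great": 0.8, "thank": 0.5, "help": 0.4,
                  "crisis": -0.6, "terrible": -0.9}

def polarity(text: str) -> float:
    """Average the scores of all dictionary words found in the text."""
    tokens = text.lower().split()
    scores = [sentiment_dict[t] for t in tokens if t in sentiment_dict]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("thank you for your great help during the crisis"))
```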
As a brief example of Sentiment Analysis in action, consider the following diagram, showing a combined polarity score for some direct mailing activities of an NGO in a German-speaking country over one year, whereby each mailing received a score between -1 (very negative) and +1 (very positive). This scoring was undertaken using the SentiWS dictionary of the University of Leipzig.
Sentiment scores for certain mailings from an NGO throughout one year. Higher scores indicate a positive sentiment, scores below zero a negative sentiment in the mailing.

A very different approach relies on the use of word embeddings. Word embedding is a technique whereby the individual words in a text are represented as high-dimensional vectors, ideally in such a way that certain operations or metrics in the vector space correspond to some relation on the semantic level. This means, for example, that the embedding groups words occurring in similar contexts or carrying similar meaning or sentiment close together in the vector space, that is to say, their distance measured by some metric is small. Another illustrative example is the following: consider an embedding ϕ and the words "king", "woman" and "queen"; then, ideally, ϕ("king") + ϕ("woman") ≈ ϕ("queen"). To put this in words, the sum of the vectors for king and woman should be roughly the same as the vector for queen, who is of course in some ways a "female king". The semantic relation thus corresponds to an arithmetic relation in the vector space. To better visualise these relationships on the vector level, consider the following illustrations, taken from Google's machine learning crash course, which show the vector representations of some words in different semantic contexts, represented for better readability in a three-dimensional space (actual embeddings usually map into much higher-dimensional vector spaces, upwards of a few hundred dimensions).
[Figure: Example visualisation of word embeddings in three-dimensional space]
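A small sketch of how such relations can be inspected with pre-trained vectors, here using one of gensim's downloadable GloVe models (the model name is an assumption for illustration, not the embedding behind the figure):

```python
import gensim.downloader as api

# Downloads ~66 MB of pre-trained 50-dimensional GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-50")

# Words closest to the sum of the "king" and "woman" vectors - ideally "queen" ranks high
print(vectors.most_similar(positive=["king", "woman"], topn=3))
print(vectors.similarity("king", "queen"))   # cosine similarity of the two word vectors
```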
There are multiple ways to construct such an embedding, i.e. to obtain an optimal vector representation for every word in the text corpus, most of which rely on neural nets. Due to the huge computational cost involved, many users opt for transfer learning. This means using a pre-trained model and finishing training on the data set at hand – in an NLP setting, the text corpus – which is both more efficient and produces more accurate results than training a model architecture from scratch. Python users might find this resource, providing a huge number of pre-trained models and frameworks to continue training from, very interesting.
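As one hedged example of this transfer-learning workflow, the Hugging Face transformers library (one widely used option, not necessarily the resource linked above) wraps pre-trained models behind a simple pipeline:

```python
from transformers import pipeline

# Loads a pre-trained sentiment model; it can later be fine-tuned on one's own corpus
classifier = pipeline("sentiment-analysis")
print(classifier("We are grateful for your continued support!"))
```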

Word embedding is principally a way to take unstructured text data and transform it into a structured form – vectors on which computers can perform operations. One can then go on and use this vectorised representation of the text with different methods to gain insight from the written text. For instance, when performing Sentiment Analysis, one can use a machine learning model of choice – typically a neural net – to classify the text according to some metric of sentiment, say polarity, using the vector representation of the text corpus.
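A minimal sketch of this idea, averaging the word vectors from the gensim example above into one feature vector per text and feeding those vectors to a simple classifier (texts and labels are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text):
    """Average the pre-trained vectors of all in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

texts = ["thank you for your wonderful support",
         "we are sad about the terrible situation",
         "your help made us very happy",
         "this crisis is a disaster"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

X = np.vstack([embed(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([embed("thank you for your help")]))
```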

The field of NLP has in recent years become one of the fastest-moving and most researched fields of data science – the methods we can employ today have evolved and improved tremendously compared to just a short time ago, and are bound to do so again in the near future. For a data nerd, this is interesting in and of itself; but different actors, especially in the not-for-profit sector, are well advised to keep an eye on these developments and consider in which ways they can help them in their work.

Four Weeks of Data Storytelling

9/28/2020


If you are interested in the art of Data Storytelling and wish to improve your skills and knowledge in a month-long sprint, you will probably like the following list:

Week 1
  • Monday – 📝 Blog – Daydreaming Numbers: bit.ly/2S928qS
  • Tuesday – 🎬 Video – TED Talk by David McCandless: bit.ly/2HFgBZy
  • Wednesday – 📝 Blog – Visual Capitalist: bit.ly/3kU0gON
  • Thursday – 📰 Article – HBR: A Data Scientist's real job: bit.ly/3jcEbL6
  • Friday – 🎬 Video – TED Talk by Hans Rosling: bit.ly/30fz19v

Week 2
  • Monday – 📰 Article – Narrative Visualization: stanford.io/2HFEK27
  • Tuesday – 📝 Blog – Make it conversational: bit.ly/3cIsuJx
  • Wednesday – 🎬 Video – TED Talk by Tommy McCall: bit.ly/3cC0tn5
  • Thursday – 🎨 Gallery – Collection of infographics: bit.ly/3kZt4p5
  • Friday – 💻 PPT – Berkeley: Data Journalism: bit.ly/30j7Z1g

Week 3
  • Monday – 🎬 Video – Data Storytelling: bit.ly/33arivv
  • Tuesday – 📝 Blog – Impact on Audience: bit.ly/338dIbQ
  • Wednesday – 🎨 Gallery – Juice Analytics Gallery: bit.ly/2G6nL8I
  • Thursday – 📕 Book – Data Journalism Handbook: bit.ly/2S94Hcd
  • Friday – 🎬 Video – TED Talk by Aaron Koblin: bit.ly/2EFZWDY

Week 4
  • Monday – 🎨 Gallery – DataViz Catalogue: bit.ly/34mdy0b
  • Tuesday – 📝 Blog – Data Visualization Checklist: bit.ly/3cQ0d45
  • Wednesday – 🎬 Video – TED Talk by Chris Jordan: bit.ly/3kTaaQT
  • Thursday – 📝 Blog – Toolbox for Data Storytelling: bit.ly/3mZrd5H
  • Friday – 🎬 Video – Storytelling with Data: bit.ly/3jd2W9Q

Must-watch movies for analytics lovers and data aficionados

9/26/2020


​Autumn is coming closer in many parts of the world. As days are getting darker and shorter, lots of  people like getting comfortable on their sofas to watch an interesting movie. If you are into data analytics, statistics and artificial intelligence, we have some recommendable picks for you.

Moneyball
​
​Released: 2011

Big names: Brad Pitt, Philip Seymour Hoffman

IMDB Rating: 76%
​

Plot in a nutshell: The movie is based on the book Moneyball: The Art of Winning an Unfair Game by Michael Lewis. Its main protagonist is Billy Beane, who started as General Manager of the baseball club Oakland Athletics in 1997. Beane was confronted with the challenge of building a team with very limited financial resources and introduced predictive modelling and data-driven decision making to assess the performance and potential of players. Beane and his peers were successful and managed to reach the Major League Baseball playoffs several times in a row.
​
​​Why you should watch this movie: Moneyball highlights the importance of communication skills and persistence for people aiming to drive change using data science.

​

The Imitation Game
​
​Released: 2014

Big names: Benedict Cumberbatch, Keira Knightley

IMDB Rating: 80%
​

Plot in a nutshell: The Imitation Game is based upon the real-life story of British mathematician Alan Turing, who is known as the father of modern computer science and for the test named after him. The film centres on Turing and his team of code-breakers working hard to decipher Enigma, the Nazi German military encryption system. To crack the code, Turing creates a primitive computer system that can consider permutations at a much faster speed than any human could. The code breakers at Bletchley Park succeeded and thereby not only helped the Allied forces ensure victory over the Wehrmacht but also contributed to shortening the horrors of the Second World War.
​
​​Why you should watch this movie: It is a (too) late tribute to Alan Turing. Turing was prosecuted for his homosexuality after WWII and eventually committed suicide. The film is also about the power of machines and ethical perspectives in analytics​

​
Margin Call 
​
​Released: 2011

Big names: Paul Bettany, Stanley Tucci, Demi Moore

IMDB Rating: 71%
​

Plot in a nutshell: Margin Call is set during the first days of the 2008 global financial crisis. A junior analyst at a large Wall Street investment bank discovers a major flaw in the bank's risk evaluation model. The story unfolds over one night as the young employee informs senior managers that the bank is close to financial disaster, knowing that the bankruptcy of the firm would lead to a dramatic chain reaction in the market – and millions of lives would be affected.

​
​​Why you should watch this movie: The film depicts to what extent algorithms dominate decision making in the financial industry. It also portrays the interplay between supposedly objective models and human beings driven by emotions and interests.​

​
21
​
​Released: 2008

Big names: Kate Bosworth, Laurence Fishburne

IMDB Rating: 68%
​

Plot in a nutshell: Six students of the renowned Massachusetts Institute of Technology (MIT) get trained in card counting and rip off Las Vegas casinos at various blackjack tables. The film is based upon a true story.

​
​​Why you should watch this movie: It is an entertaining and fun movie. In addition to that, it contains some interesting mathematical concepts such as the Fibonacci Series and the Monty Hall Problem.​
​

We hope our tips are valuable for you and that you enjoy the flicks. 📺🎬🍿☕🍷
Take care and all the best on behalf of joint systems,
Johannes


Can we predict future sporadic donations using data on the general economic climate?

8/13/2020



by Fabian Pfurtscheller, BSc
Data Science & Analytics at joint systems

Particularly in uncertain times like these, organizations strive to predict the future as well as possible. We have already explored several times how to forecast future income from the past income trajectory, for instance in these blog posts. We now want to go a step further and investigate the relationship between fundraising income and the general economic climate, exploring whether it is possible to infer extra information from economic indicators and use them to improve income forecasting tools.

Introduction

More specifically, we will look at correlations between the amount of sporadic donations to a charitable non-profit organization over the period from 2000 to 2019 and three economic data sets from the country the NPO is based in: the national unemployment rate, the national stock market index and an index of economic activity in retail, published by the national bureau of statistics. We chose these data sets for their ability to paint a picture of the general economic climate, their relatively easy accessibility down to the monthly level, and the fact that, to the extent that they exhibit seasonality, they do so in the same yearly rhythm as the amount of sporadic donations, simplifying the statistical analyses. The statistical analyses and models employed in this blog post were all implemented in Python, making heavy use of the statsmodels library.

The main mathematical tool used for the analyses is the time series: a set of data points collected over discrete, ordered time. We have already looked at time series in detail in a past entry on this blog. An important property of time series is stationarity, which we discussed here. As in that post, we use the Dickey-Fuller test to investigate whether or not our time series are stationary, which can be done in Python like this:
Dickey-Fuller test

[Code screenshot: Dickey-Fuller test function and results]
We have defined a function that returns a dataframe with the test statistic, the critical value at a confidence level of 5% and the p-value of the Dickey-Fuller test for each time series. The p-values of the test results are above a significance level of 0.05, leading us to keep the null hypothesis that the time series are non-stationary. In cases like these, differencing the time series – that is, subtracting the previous value from the current one – can help in making the series stationary. However, because our time series exhibit strong yearly seasonality, in our case it also makes sense to take the time series' seasonal differences, subtracting the values from the year before.
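The original code is only visible as a screenshot; a sketch along the lines described above might look like this (series names, the DataFrame layout and the differencing call are assumptions):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def adf_table(series_dict):
    """Dickey-Fuller test statistic, 5% critical value and p-value per series."""
    rows = {}
    for name, series in series_dict.items():
        stat, pvalue, _, _, crit, _ = adfuller(series.dropna())
        rows[name] = {"test statistic": stat,
                      "critical value (5%)": crit["5%"],
                      "p-value": pvalue}
    return pd.DataFrame(rows).T

# Hypothetical monthly series: donations plus the three economic indicators
# adf_table({"donations": donations, "unemployment": unemployment,
#            "stocks": stocks, "retail": retail})

# First difference plus seasonal (12-month) difference, as described above
# donations_stationary = donations.diff().diff(12).dropna()
```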

Granger causality

Having prepared our time series, we can look at a first test of interdependencies. Using a version of Pearson's chi-squared test, we examine the (now stationary) series' Granger causality, which tells us whether data from one series can help in forecasting the future trajectory of another series. The test is applied pairwise to two different time series, with the null hypothesis being that the second time series does not Granger-cause the first. The following function returns a data frame showing the p-values of the tests investigating whether the column variable Granger-causes the row variable. Especially the first row – indicating (no) Granger causality between the economic data sets on the one hand and the amount of sporadic donations on the other – is interesting to us. At a significance level of 0.05, we keep the null hypothesis of no Granger causality between the economic indicators and the sporadic donations for all three time series.
Granger causality

[Code screenshot: matrix of pairwise Granger causality p-values]
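A sketch of how such a p-value matrix could be built with statsmodels (the original function is only shown as a screenshot; the column layout and the choice of the chi-squared p-value are assumptions):

```python
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def granger_matrix(df, maxlag=12):
    """p-values testing whether the column variable Granger-causes the row variable."""
    out = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for caused in df.columns:           # row: the series being forecast
        for causing in df.columns:      # column: the candidate cause
            if caused == causing:
                continue
            res = grangercausalitytests(df[[caused, causing]], maxlag=maxlag, verbose=False)
            # smallest chi-squared p-value over all tested lags
            out.loc[caused, causing] = min(r[0]["ssr_chi2test"][1] for r in res.values())
    return out

# granger_matrix(stationary_df)   # 'stationary_df' holds the four differenced series
```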

Cointegration

A second useful property to look at is cointegration. We consider a set of time series cointegrated if there exists a linear combination of the time series that is stationary. Cointegrated time series share a common, long-term equilibrium, and we can use them to predict each other's future trajectory using a process called vector autoregression (VAR). A common test for cointegration is the Johansen cointegration test. In the following, we define a function that returns a dataframe with the test statistic and the critical values of the Johansen test, leading to these results:
Johansen cointegration test

[Code screenshot: Johansen test statistics and critical values]
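A sketch of the corresponding Johansen test with statsmodels (again, the original code is only a screenshot; the deterministic-term and lag parameters are illustrative):

```python
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_table(df, det_order=0, k_ar_diff=12):
    """Trace statistic and 95% critical value for each cointegration rank."""
    res = coint_johansen(df, det_order, k_ar_diff)
    return pd.DataFrame({"trace statistic": res.lr1,
                         "critical value (95%)": res.cvt[:, 1]},
                        index=[f"r <= {i}" for i in range(len(res.lr1))])

# johansen_table(stationary_df)
```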
The test statistic is below the critical value for all four of our time series, meaning that we have to keep the null hypothesis of no cointegration and cannot assume the time series to be cointegrated. We can thus not assume that there is an underlying equilibrium between the four time series, and the test results do not support the hypothesis that we can use the time series to forecast each other’s future trajectory.

Vector autoregression (VAR)

Had the test results been different and had we been able to reject our null hypothesis, we could have attempted to construct a VAR model. If we ignore our test results for a moment and do so anyway, we can immediately see that the model falls catastrophically short. The black line in the graphs below shows the actual time series of sporadic donations. Using data up to 2016 as our training set, we constructed a VAR model with the Python library statsmodels and used it to forecast the sporadic donations from 2017 to 2019 – the red line in the graph – using the actual data for these years for evaluation. As we can see, the model is not able to forecast the amount of sporadic donations very well: it captures the seasonality but fails to accurately predict the trend and overall trajectory.
Vector autoregression

[Figure: VAR forecast (red) vs. actual sporadic donations (black)]
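A sketch of the VAR forecasting step (the original code is only a screenshot; 'train' and 'test' are assumed to be DataFrames of the four stationary series, split at the end of 2016):

```python
from statsmodels.tsa.api import VAR

model = VAR(train)
fitted = model.fit(maxlags=12, ic="aic")   # lag order chosen by AIC
forecast = fitted.forecast(train.values[-fitted.k_ar:], steps=len(test))
# 'forecast' still has to be re-integrated (the differencing undone)
# before it can be compared with the actual donation amounts.
```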

ARIMA and ARIMAX

Using a different model does not yield much better results. We attempted to construct an ARIMAX model, a generalisation of the ARIMA model, which we have used previously in this blog, that also takes into account data from external variables – in our case, the economic time series. ARIMA models are composed of three parts: AR for autoregression, indicating a regression on the time series' past values; I for integration, signifying differencing terms in the case of a non-stationary time series; and MA for the moving-average model, a regression on past white-noise terms. ARIMAX takes all three of those terms and adds data from external variables – a different time series – to better forecast the time series at hand. Both ARIMA and ARIMAX are implemented in Python as part of the statsmodels library, while the pmdarima library comes with an auto_arima function modelled on R's auto.arima, allowing for a quick search through the possible parameters of the ARIMA(X) model.

We used all four time series to construct an ARIMAX model, using the economic data to help forecast the amount of sporadic donations. Again we used the data up to 2016 as our training set, with the data from 2017 to 2020 as a test set to evaluate the results. We also used a standard ARIMA model to construct a forecast for sporadic donations based only on the time series' own historical data. Interestingly, the models' forecasts did not differ much from each other:
ARIMAX

[Figure: ARIMAX and ARIMA forecasts of sporadic donations compared with actual values]
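A sketch of the ARIMAX vs. ARIMA comparison with pmdarima (the original code is only shown as a screenshot; 'y_train'/'y_test' are the donation series and 'X_train'/'X_test' the matching economic indicator columns):

```python
import pmdarima as pm

# With exogenous economic regressors (ARIMAX) ...
arimax = pm.auto_arima(y_train, X=X_train, seasonal=True, m=12, suppress_warnings=True)
# ... and without them (plain seasonal ARIMA)
arima = pm.auto_arima(y_train, seasonal=True, m=12, suppress_warnings=True)

pred_arimax = arimax.predict(n_periods=len(y_test), X=X_test)
pred_arima = arima.predict(n_periods=len(y_test))
```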
It thus seems that the economic data we provided – the unemployment rate, the stock index and a retail index – did not add much extra information for forecasting the amount of sporadic donations. Upon closer inspection, the ARIMA model, relying solely on the historical data of the time series itself, even performed slightly better. In light of the previous test results – the lack of Granger causality and cointegration – this suggests that these economic indicators have little measurable effect on the development of sporadic donations and thus cannot be used to improve forecasting models for the amount of sporadic donations in the future. However, we believe that the approach of using time series analysis for fundraising income predictions hand in hand with open (economic) data deserves further attention, as other base data, time frames, data sources etc. might lead to different results!

