David Weber, BA
Data Science & Analytics
“If your only tool is a hammer, then every problem looks like a nail.” – Unknown.
Today’s data science landscape is a great example where we need hammers, but we also need screwdrivers, wrenches and pliers. Even though R and Python are the most used programming languages for data science, it is important to expand the toolset with other great utensils.
Today’s blog post introduces a tool that lets you leverage the benefits of data science without being fluent in coding: KNIME Analytics Platform.
KNIME (Konstanz Information Miner) provides a workflow-based analytics platform that lets you focus fully on your domain knowledge, such as fundraising processes. The intuitive and automatable environment enables guided analytics without requiring you to write code. This blog post gives you a hands-on demonstration of the key concepts.
Important KNIME Terms
Before we start with a walkthrough of a relevant example, we need to define some of the most important KNIME terms.
Nodes: A node represents a processing point for a certain action. There are many nodes for different tasks, for example reading a CSV-file. You can find further explanations about different nodes on Node Pit.
Workflow: A workflow within KNIME is a set of nodes or a sequence of actions you take to accomplish your particular task. Workflows can easily be saved and used by other colleagues. Even collapsing workflows into a single meta-node is possible. That makes them reusable in other workflows.
Extensions: KNIME Extensions are an easy and flexible way to extend the platform’s functionalities by providing nodes for certain tasks (connecting to databases, processing data, API requests, etc.).
Sample Dataset and Data Import
For demonstration, we are going to use a telecom dataset for churn prediction. This dataset could easily be replaced with a fundraising dataset containing information about churned donors. Customer or donor churn, also known as attrition, is a critical metric for every business and especially in the non-profit sector (e.g. donors cancelling their regular donations). For more information on donor churn, visit our previous blog posts.
The dataset consists of two tables: one with call data from customers and one with customer contract data. Using some of the information available, we will try to predict whether a customer will cancel their subscription or not. Churn is represented by a binary variable (0 = no churn, 1 = churn). For visualization purposes, we are going to use a decision tree classifier, although other classification algorithms might well perform better.
First, we use an Excel Reader and a File Reader to import both files. To make things easier, we use a Joiner node to join both tables on a common key. The result is a single table that is ready for exploration and further analysis.
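For readers who do code, the join step can be sketched in a few lines of Python. This is only an illustration of joining two tables on a common key: the column names and values are made up, and the choice of an inner join (keeping only customers present in both tables) is our assumption.

```python
# Hypothetical sample rows; column names and values are assumptions,
# not the real telecom dataset.
calls = [
    {"customer_id": 1, "day_minutes": 265.1},
    {"customer_id": 2, "day_minutes": 161.6},
    {"customer_id": 3, "day_minutes": 243.4},
]
contracts = [
    {"customer_id": 1, "intl_plan": 0, "churn": 0},
    {"customer_id": 2, "intl_plan": 1, "churn": 1},
    {"customer_id": 4, "intl_plan": 0, "churn": 0},
]


def inner_join(left, right, key):
    """Combine rows from both tables that share the same key value."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the columns of both rows into one record.
            joined.append({**row, **match})
    return joined


customers = inner_join(calls, contracts, "customer_id")
for row in customers:
    print(row)
```

Customers 3 and 4 appear in only one of the two tables and are therefore dropped by the inner join.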
Feature Engineering is the process of analyzing your data and deciding which parameters have an impact on the label we want to predict – in this case whether a customer will quit or not. In other words: Making your data insightful.
But before we look at correlations between the label and the features, a general exploration of our data is advisable. The Data Explorer node is perfect for some basic information. One thing we notice is that we need to convert our churn label to a string to make it interpretable for our classifier later. This can be done with the Number to String node. Now it’s time for some correlation matrices. We can see some correlation between various features and our churn label, whereas other features hardly correlate with it at all. We decide to get rid of those.
Now let’s start with the training of our model. But before we can do that, we need to partition our data into a training and a test dataset. The training set (usually around 60-80% of the data) is used to train the model. The remaining data is used to test the model and to make sure it has predictive power. We can verify this with certain metrics. In this case, we set the partition percentage to 80%, which is a common choice. The training data is then fed into our decision tree learner.
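The idea of keeping only features that correlate with the label can be sketched as follows. The feature names, the sample values and the cut-off of 0.3 are assumptions for illustration; we use the Pearson correlation coefficient here, which may differ from the measure a given KNIME node reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: churn label (0/1) and two candidate features.
churn = [0, 0, 1, 1, 0, 1]
features = {
    "service_calls": [1, 0, 4, 5, 1, 3],
    "account_length": [90, 120, 100, 110, 95, 105],
}

correlations = {name: pearson(values, churn) for name, values in features.items()}

# Keep features whose absolute correlation reaches an (assumed) threshold.
THRESHOLD = 0.3
kept = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]

for name, r in correlations.items():
    print(f"{name}: {r:.3f}")
print("kept:", kept)
```

In this toy data, the number of service calls correlates strongly with churn, while the account length does not and would be dropped.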
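A random 80/20 partition can be sketched like this. The fixed seed and the use of plain row numbers as stand-ins for data records are assumptions that keep the example reproducible.

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into a training and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1000))  # stand-in for 1000 customer records
train, test = partition(rows)
print(len(train), len(test))
```

Shuffling before splitting matters whenever the rows are sorted in some way, for example by signup date; otherwise the test set would not be representative.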
After some computing time, our finished decision tree looks like this:
In order to make the model reusable and available for predictions with new datasets, we can save it with the PMML Writer node for later use. PMML is a format for sharing and reusing built models. If we want to, we can read the model later on with a PMML Reader node to make predictions with a new, unknown dataset. But before we use our model on a regular basis, we need to evaluate it with our test dataset, which we split off earlier.
Model Prediction and Evaluation
Now, testing our new model and evaluating its performance is one of the most important steps. If we can’t be sure that our model predicts correctly to a reasonable extent, deploying it would be risky. So we feed the Decision Tree Predictor node with our test dataset. This lets us see how the model performs.
We have several metrics within our KNIME workflow to evaluate the model. First, we use a Scorer node to get the confusion matrix and some other important statistics. Our confusion matrix gives us a hint about threshold tuning, but the accuracy of 84% already looks pretty good. Our model predicted 29 cases as ‘no churn’ although they were actually churning. This number is rather high, so we should consider tuning our model parameters.
Next up is the ROC (Receiver Operating Characteristic) curve. It plots the true positive rate against the false positive rate. One of the results is the AUC (Area Under the Curve), which comes to a very good score of 0.914. A score of 0.5 (the diagonal in the chart below) represents a model without any predictive value, because it is equivalent to predicting randomly.
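To make the terms concrete, here is a small sketch of how a confusion matrix, the accuracy and the number of missed churners are derived from actual and predicted labels. The ten records are made up and do not reproduce the figures from our workflow.

```python
from collections import Counter

# Hypothetical test-set labels and model predictions.
actual = ["churn", "churn", "churn"] + ["no churn"] * 7
predicted = ["churn", "churn", "no churn",
             "no churn", "no churn", "no churn", "no churn", "no churn",
             "churn", "no churn"]

# The confusion matrix counts each (actual, predicted) combination.
matrix = Counter(zip(actual, predicted))

correct = sum(count for (a, p), count in matrix.items() if a == p)
accuracy = correct / len(actual)

# Churners the model missed (predicted 'no churn' but actually churned).
false_negatives = matrix[("churn", "no churn")]
# Share of actual churners the model caught.
recall = matrix[("churn", "churn")] / actual.count("churn")

print(dict(matrix))
print(f"accuracy={accuracy:.2f}, false negatives={false_negatives}, recall={recall:.2f}")
```

The example also shows why accuracy alone can mislead: with few churners in the data, a model can look accurate overall while still missing a notable share of them.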
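One way to read the AUC: it is the probability that a randomly chosen churner receives a higher churn score than a randomly chosen non-churner. The following sketch computes it by comparing all such pairs; the labels and scores are invented for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn labels (1 = churn) and predicted churn probabilities.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

print(f"AUC = {auc(labels, scores):.3f}")
# A model that gives everyone the same score cannot rank at all:
print(f"AUC of constant scores = {auc(labels, [0.5] * 7):.1f}")
```

The pairwise approach is easy to follow but grows quadratically with the data size; it is meant as an explanation, not as a replacement for the ROC Curve node.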
Additional metrics would be the Lift Chart and Lift Table, but an explanation would be beyond the scope of today’s blog. We think it’s time to summarize and draw a conclusion.
Too good to be true?
KNIME is a powerful platform, which provides various possibilities to extract, transform, load and analyze your data. However, simplicity has its limitations. In direct comparison with various R packages, visualizations are not as neat and configurable. And the ‘simplistic’ approach to data science limits possibilities in some way or another, requiring the user to have a thorough understanding of the data science pipeline and process. Further, most real life cases are more complicated and need more feature engineering and analysis beforehand – creating the model itself is mostly one of the smallest challenges.
Nevertheless, we think that KNIME is an awesome tool for data engineering / exploration and workflow automation (and building fun stuff with social media and web scraping). But if you are looking for complex models supporting your business decisions – KNIME probably won’t be the platform you are searching for.
We hope you liked this month’s blog post and we would love to get in touch if you are interested in achieving advanced insights with your data or just want to dive deeper into the topic. If you want to know what Joint Systems can offer you concerning data analytics, this page will provide you with more information.
In last month's blogpost, we referred to the impactful Economist article stating that data is the new oil. One interpretation of this metaphor is that data can be seen as "fuel" for today's economy. This also applies to the nonprofit sector. The reference to oil does not, however, necessarily mean that "digging for data" implies high costs in terms of acquiring and collecting it. There is quite a lot of open data around that might be useful in a fundraising context. We see a broad range of possible use cases when it comes to open data in a fundraising context:
We will rush through two hands-on examples to illustrate how to obtain, process and visualize open data with possible value for fundraising decision makers and analysts.
Example 1: Visualizing income data on regional level
Geographical disparities can be of relevance in the context of certain fundraising practices such as events or contact with High Net Worth Individuals (HNWI). Some regions are "wealthier" than others in terms of the respective average income levels. This information allows some conclusions about the overall fundraising potential in a given area.
An aggregation level that we find useful and a good "common denominator" is the set of so-called NUTS regions used by the European Union. NUTS sounds like an English acronym; however, it is a French abbreviation for Nomenclature des unités territoriales statistiques, in other words regional statistical units.
The European Union’s statistical office is called Eurostat. It offers a huge database that can be accessed online and is free of charge in most cases. We not only download data on the regional distribution at the level of NUTS2 areas but also use the dedicated R package eurostat. NUTS2 areas are quite "intuitive" in our eyes. In Austria, for instance, they reflect the federal provinces whereas Germany with its 16 provinces is split into 38 regions. For the sake of completeness, we show you how we searched for respective income data in the code snippet below. Our code is, by the way, inspired by this recommended tutorial on eurostat.
Code Snippet 1.1: Load libraries and search for data
The query above shows us 3 tables that contain income data. We decided to use the table with the index tgs00026, which contains data on disposable household income at regional level.
Code Snippet 1.2: Obtain both income and geospatial data
We now have two dataframes, one for the income data with a regional variable and one for the actual geospatial data. We merge the two and dive into the visualization immediately:
Code Snippet 1.3: Merge datasets into one and visualize data
As we already used a little French today, we can now say voilà, as the overview visualization we were striving for is finished and presentable:
Example 2: What about large companies and their CEOs?
The name Forbes might ring a bell if you think of lists of the wealthiest people on the planet. Forbes also publishes data on the largest companies on an annual basis. We signed up at the platform data world - which we can also recommend - and obtained the 2018 dataset for the 2.000 largest corporations (here - signup is necessary). Luckily, this data contains country information and also names the respective CEO - but let us go step by step. As we are working with a flat file, we did some data prep in it before loading it into R; we end up with a dataset that contains the following variables:
Code Snippet 2.1.: Read the file and select one country
So far, so good. We now have a condensed dataframe in R that contains the Dutch corporations that are listed in the Forbes 2000, i.e. 22 firms. We prepare a bubble chart with the company revenues and profits on the axes. The market value (mostly in shares) shall be reflected by the actual size of the bubble.
Code Snippet 2.2.: Draw the bubble chart
This is our result:
We can also spot a cluster of companies with high profit volumes in the lower right of the bubble chart. This cluster contains "big names" like Unilever but also contains firms that are not as widely known:
Even though it might not be so easy to meet some of the CEOs immediately, it might be worthwhile researching whether those big and powerful corporations have CSR departments, foundations etc. one can get in touch with.
This was it for this month’s post. We hope you stay "open" over the summer break not only towards data but also our blog :-).
It was in 2017 when the renowned magazine The Economist wrote: "The world’s most valuable resource is no longer oil, but data."
We set up our blog askyourdata.co in the same year. The topics of our articles vary from illustrative applications of models to fundraising data, discussions of topics like Artificial Intelligence for Fundraising to data visualization. The common denominator for our content is the context of charitable non-profit organizations. We think that the tools and methods of advanced analytics and data science can contribute to the effectiveness and efficiency of fundraising organizations. This is why we started wondering whether there are empirical findings on the state of data science and advanced analytics in the non-profit sector. Good news: We found results that are both recent and interesting.
Data Science starts with data. Back in 2016, the American software company EveryAction surveyed some 460 professionals (presumably mainly in the US; the complete study can be downloaded here) from non-profits about their habits, culture, and outlook on the state of data at their organizations. Some key findings:
A bit more than two-thirds of the respondents conduct ad hoc or (ex post) hindsight analysis (descriptive, diagnostic) whereas the remaining third has stepped into advanced analytics (predictive, prescriptive, AI / cognitive). Having mentioned the term “advanced”, one should not forget about the speed of developments. What might have been considered “advanced” a few years ago might turn into the "standard level" sooner or later.
Probably the most interesting insight from the survey is that respondents with more advanced analytics capabilities reported higher effectiveness than others on three important metrics (Improve performance against mission, Achieve operating efficiencies, Improve staff productivity). In other words: Non-profits with deeper data capabilities see stronger impact, transparency and decision making.
Like the aforementioned study by EveryAction, the IBM survey asked for primary barriers to advancing data and analytics. These are:
What about data science in the non-profit sector? We did not come across specific studies on the state of data science in the non-profit sector in general or fundraising more particularly. However, we had a closer look at the industry segmentations in two major data science surveys.
JetBrains polled more than 1.600 people involved in Data Science in the US, Europe, Japan and China (download full study here). One of the questions asked for the industry for which the respondents analyse data. The non-profit sector ranked quite low (5% compared to Accounting / Finance / Insurance and Science with 16% each). This finding can be seen as consistent with the results of the aforementioned studies that diagnose scope for advances in the field of non-profit analytics.
Presumably the most comprehensive study on the state of data science and machine learning is the annual Machine Learning and Data Science Survey conducted by the platform kaggle.com. Kaggle surveyed almost 24.000 people in October 2018, so the results are quite recent. In line with open data thinking, the raw data containing the survey results can be downloaded from the site; there is also a data analysis competition attached to it.
The non-profit related results in a nutshell: Of some 24.000 respondents, only a tiny fraction of 0.79% (189 respondents) answered that they work or recently worked for an employer in the non-profit industry. Regardless of the fact that the sample might not be fully representative across industries, this figure shows that data science is still in its beginnings in the so-called third sector.
When it comes to the regional distribution of non-profit data scientists, it is striking that a third of the respondents who said that they work for nonprofits reside in the United States. India is the runner-up with some 10 percent. The remaining respondents are quite evenly distributed across the other countries.
Our conclusion: Non-profit representatives and decision makers are largely aware of the potential benefits that use of advanced analytics and data science might imply for their organizations. Building competencies, structures and systems is often challenged by scarce resources (most prominently budget but also expertise). The answers regarding the presence of data scientists surveyed by JetBrains and Kaggle reflect this state. The good news is: Nonprofits need not feel alone in their advanced analytics endeavours, as outlined by IBM who advocate establishing an insight ecosystem.
If you are interested in building your own insight ecosystem, you might wish to learn more on what joint systems can offer in this regard.
Let us stay in touch! We wish you a nice rest of spring.
Visualizing data is an integral part of analysts’ and data scientists’ day-to-day life. Visualizations are not produced for the sake of beauty and design – at least not exclusively. One could say the common denominator for data visualization is to make it easier for the human brain, and therefore for the recipients, to process information. This might lead to better decision making (this is what is often called actionable insights), meaningful storytelling (e.g. in the area of data journalism in general but maybe also particularly in the context of NPOs) and an increase in so-called data literacy. One may like it or not, but we live in a highly quantified society, which means that "non-quantitative" professionals across industries are often required to consider data as well.
The final products that recipients ask analysts for are often charts and infographics. Visualizations also play an important role in the course of data science projects. In line with CRISP DM thinking, it is often data visualizations that help develop the so-called data understanding. Modern tools such as good old Excel or more integrated and holistic solutions like Power BI make it possible to process large amounts of data from different sources with relative ease and in a short time. We can therefore draw a preliminary conclusion: The need for visualization of data will persist and steadily grow, and modern tools make life significantly easier in this regard. But what does it actually mean to come up with good data visualizations?
The good news is: There are various sources and thinkers one can turn to for inspiration and recommendations in the context of data visualization and information design. In this month’s blog post, we will take a close look at the work of Edward Tufte. Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University. He is one of the most influential contemporary thinkers in the field of information design and data visualization. The New York Times went as far as to call him the “Da Vinci of Data” in 1998. More than 35 years ago, Tufte published the first edition of The Visual Display of Quantitative Information, which has become a classic on statistical graphics, charts and tables. Tufte is also known for some easy to remember quotes such as:
"If the statistics are boring,
then you've got the wrong numbers."
Tufte has coined the idea of Graphical Excellence. Graphical Excellence means the efficient communication of complex quantitative ideas towards recipients. This requires clarity, precision and efficiency. What does efficiency mean in this context? The viewer should be given the greatest number of ideas in the shortest time with the least ink in the smallest space. You could say this is the application of a minimalist and “less is more” philosophy in the context of data visualization.
Graphical excellence is the well-designed presentation of interesting data - a matter of underlying data, of statistics and of design. One could say in data visualization, data is not everything but without the appropriate data, everything is nothing. Data and the messages derived from it have to be correct. Tufte uses the term integrity in this regard. There is a plethora of sources on how to lie with statistics. Data has to be relevant for the respective viewer. As mentioned above, Tufte went as far as to say that if the statistics are perceived as boring, then you've got the wrong numbers.
When it comes to design, Tufte suggests the following things that graphics should do:
"Design cannot rescue failed content!"
... is another striking quote by Tufte. We tried to put together an interesting (hopefully!) slideshow with some "evergreen" data visualizations and inspiring works from the recent past.
Have a look, enjoy and read you next time!
Recently, the average media consumer could get the impression that Artificial Intelligence (AI) is the next big thing and that it will shape our daily lives in the near future. Numerous developments are currently discussed in newspapers and magazines, TV debates, blogs etc. Some of these developments are said to bring about significant, sometimes even radical changes to different industries, be it health, finance, production etc. The question is:
Will charitable fundraising remain unaffected by these big shifts, and will everything stay as it is? We have our doubts but definitely see the need for a differentiated view at the same time. This month’s blog post will provide a quick introduction to artificial intelligence and reflect on its current and possible future role in the context of fundraising.
What is Artificial Intelligence
So, what is Artificial Intelligence? Even a quick web search delivers numerous definitions, many of which overlap to some degree. Let us proceed with one definition by digital evangelist Ray Kurzweil, who said that AI is the art of creating machines that fulfil tasks that – if they were carried out by humans – would require intelligence. The pictures that come to people’s minds should not be underestimated when it comes to a broader understanding of AI. It can be assumed that many think of so-called strong artificial intelligence when they hear the term AI. Strong artificial intelligence implies by definition machines that are actually intelligent, just like Data from Star Trek, C3PO from Star Wars or Bender from Futurama. The real-life form of artificial intelligence is so-called weak artificial intelligence: machines that show intelligent behaviour to some degree. Weak AI essentially means rule-based systems that have capacity for machine learning.
AI history in a nutshell
The beginnings of AI go back to the early days of modern information technology. You might know Alan Turing from your studies and/or the movie The Imitation Game (showing how Turing and his fellow experts significantly contributed to the Allied victory in WWII). In 1950, he suggested a test, now named after him, to find out whether one’s counterpart in a conversation is a machine or a human.
In the same year, Isaac Asimov published the book I, Robot and suggested the Three Laws of Robotics. Asimov’s work is a good example of the overlap between technological advances in AI and their dramatization in popular culture.
From the beginning of the 1950s onwards, AI technology evolved gradually. It was, for instance, as early as 1974, i.e. 45 years ago, when the Stanford AI lab introduced the first prototype of a self-driving car. A growing audience saw Deep Blue beat chess world champion Garry Kasparov in 1997. A bit more than a decade later, from 2011 onwards, the voice assistants Siri, Google Now and Cortana were introduced within a few years of each other.
What about the future? There are developments that can be foreseen. Sooner or later self-driving cars will be introduced, and translation algorithms as well as image and text recognition will gradually become better. When it comes to the long-term perspective, there are different positions. Elon Musk of Tesla, for example, goes as far as to call AI a potential threat to the existence of the human race. Andrew Ng, a machine learning evangelist, says that fearing the rise of killer robots is like worrying about overpopulation on Mars.
What are the functions of AI?
To put it in an anthropomorphic, i.e. human-like sense, AI can nowadays do the following:
From a more functional perspective, AI capabilities can be summarized as follows:
It has to be moreover noted that AI is not a monolithic and distinct technology but a bundle of different technologies, methods and algorithms. This is reflected by the different sub-areas:
Although AI is an ambiguous field, there are certain research areas and big topics that can be identified:
What AI is able to do - and how it might change the game
Many think tanks, consultancies and companies deal with the future potential and development of AI across industries. A quite recent study by McKinsey deals with the impact of AI on different sectors and functions.
AI can do an impressive lot of things nowadays, be it agriculture, medicine, finance, law etc. Business Insider have put together a compact list with some 50 examples, all with a link for further reading.
Expectation levels are definitely high when it comes to the potential of AI – which brings about the risk of exaggerated hopes and fears at the same time. Inspired by a blogpost on datasciencecentral, a blog we can really recommend, we find it worthwhile to reflect on 6 common myths regarding AI:
So what? The case of charitable fundraising organizations
To put it very generically, FR organizations can be seen as entities that link needs (be it children at risk, endangered animals or the environment) with supporters. As a consequence, there are three major pillars to which AI might contribute: the need, the organization and the supporters (or, to be exact, the communication and interaction with them).
We found three inspiring examples from different organizations how AI – also at small scale and in a hands-on manner – can contribute close to the need of charitable organizations.
The organization and its processes
This is the “internal” view and it is therefore hard to find prominent good practice examples on the web. Generally speaking, AI has the potential to help improve payment processes, fulfilment processes, etc. These processes are often rather “generic” to a certain extent and therefore comparable to the profit sector – which is an opportunity when it comes to available tools, expertise etc.
Donor Communication and Fundraising
Many topics we have covered in this blog so far, be it churn analyses, donor clustering, dealing with unstructured data, data visualization etc., can be attributed to artificial intelligence in a broader sense and can potentially be applied in a fundraising context. AI in fundraising might therefore mean using advanced algorithms and data science methods for donor data.
There are also other applications of AI in today’s fundraising context.
The American version of Amazon’s Alexa is now able to trigger donations through voice commands. Numerous organizations have started using chatbots to enhance their touchpoints with donors.
So, what to do now? Public Enemy already rapped “Don’t believe the hype” in 1988. So, should the fundraising sector lean back and watch other industries chase after supposed AI innovation? We do not think so. Instead of thinking about big investments in the first place, we recommend gradually building know-how, running tests and creating prototypes at a smaller scale. If you need support or resources, get in touch with a reliable partner that knows your organization and with whom you have a sustainable relationship. As far as conferences, blogs, books etc. about AI are concerned, there is nothing holding you back - so why not start dealing with AI for fundraising in 2019?
Speaking of 2019: As the year is still quite young, we wish you and your colleagues a happy, healthy and successful new year!
P.S. for joint systems clients: This blog post is an extract of a keynote I have delivered recently. Please do not hesitate to get in touch if you are interested in diving deeper into the topic or learning about our service and product portfolio.
Have you heard about Giving Tuesday? Giving Tuesday is the equivalent of Black Friday and Cyber Monday – but for charitable donations. It was started in 2012 during the American Christmas and holiday season. Thanksgiving traditionally falls on the fourth Thursday in November, and Giving Tuesday follows five days later, i.e. in late November or early December. This year, Giving Tuesday was November 27th, 2018. Since its start some years ago, not only has the amount of funds collected in the US risen continuously, but so have the popularity and perceived relevance of Giving Tuesday for fundraising organizations all over the world. So, is Giving Tuesday a global movement already? We decided to investigate this by analysing Twitter data using the statistical software R.
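As a small aside, the date rule can be written down in a few lines. This is a sketch in Python (our analysis itself was done in R) that assumes Giving Tuesday is always the Tuesday directly following the fourth Thursday in November; the function name is our own.

```python
from datetime import date, timedelta


def giving_tuesday(year):
    """Return the Tuesday following US Thanksgiving for the given year."""
    first_of_november = date(year, 11, 1)
    # weekday(): Monday is 0, so Thursday is 3.
    days_to_first_thursday = (3 - first_of_november.weekday()) % 7
    # The fourth Thursday is three weeks after the first one.
    thanksgiving = first_of_november + timedelta(days=days_to_first_thursday + 21)
    # Thursday plus five days is the following Tuesday.
    return thanksgiving + timedelta(days=5)


for year in (2012, 2018, 2019):
    print(year, giving_tuesday(year).isoformat())
```

For 2018 this yields November 27th, and for 2019 it yields December 3rd, which is an example of the "early December" case.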
Obtaining data from Twitter
To scrape data from Twitter, one has had to apply for a Twitter developer account since July 2018 and create a Twitter app, which has to be approved by the platform. You of course also need a Twitter account for that. We went through this quite simple process, in which certain information on one’s plans has to be provided.
We used the package rtweet to set up the connection from R to Twitter. After having successfully registered a Twitter app, it is recommendable to embed the necessary credentials (Consumer Key, Consumer Secret, Access Token, Access Secret) in the code.
After having established the connection, we used the command search_tweets to obtain the tweet data about Giving Tuesday. We started with a search limit of 100.000 tweets without retweets. Because the Twitter programming interface (API) restricts how much data can be requested at a time, we set the parameter retryonratelimit to TRUE, which allowed for reconnects and a stepwise data download.
Code Snippet 1: Get rtweet package, connect to API and obtain data
We ended up with 53.653 records in the data frame rt. As the R package in combination with the Twitter API allows the extraction of data from the last 6 to 9 days, our data pretty much reflects the Twitter activities regarding Giving Tuesday 2018.
Drawing upon our initial question, we now turned to researching how “international” Giving Tuesday has become. The package rtweet allows the extraction of longitude and latitude data for the respective tweets. This then enabled us to draw a world map of tweets. A warning message told us that no such data could be obtained for 51.000 of the 53.000 tweet records. Regardless of that and assuming that the 2.000 tweets for which we got the data are somewhat representative, we used the package leaflet for a plot. The map shows that tweeting about Giving Tuesday predominantly happens in the US and the UK.
Code Snippet 2: Map
In contrast to the scarce geographical information, language information was mostly complete for the tweet data and therefore allows us to analyse the international relevance of Giving Tuesday even better. The vast majority of tweets are in English. Apart from some 500 tweets with undefined language, French and Spanish are the runners-up.
Code Snippet 3: Barplot languages
What about the timing of the tweets? We extracted the creation dates of the tweets and visualized them in a barplot. Not surprisingly, the 27th of November, i.e. Giving Tuesday itself, shows by far the highest number of tweets. There is also notable activity the day before and after with some 6.000 tweets.
Code Snippet 4: Barplot Dates
In order to analyse the virality of the Giving Tuesday tweets, we measured and plotted the number of retweets. Our plot below shows that some 38.000 tweets did not get retweeted at all, some 8.000 received 1 retweet and so on (up to 6 retweets).
Code Snippet 5: Barplot Retweets
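The counting logic behind such a retweet barplot, and behind finding the most viral tweet, can be sketched in a language-neutral way. The following Python example uses a handful of invented tweet records; the field names id and retweet_count are assumptions for illustration, and the analysis in this post itself was done in R.

```python
from collections import Counter

# Hypothetical tweet records, not taken from the real dataset.
tweets = [
    {"id": "a", "retweet_count": 0},
    {"id": "b", "retweet_count": 0},
    {"id": "c", "retweet_count": 1},
    {"id": "d", "retweet_count": 0},
    {"id": "e", "retweet_count": 3},
    {"id": "f", "retweet_count": 1},
]

# How many tweets received 0, 1, 2, ... retweets?
distribution = Counter(tweet["retweet_count"] for tweet in tweets)

# A text-only "barplot": one '#' per tweet.
for retweets in sorted(distribution):
    print(f"{retweets} retweets: {'#' * distribution[retweets]}")

# The most viral tweet is the one with the highest retweet count.
most_viral = max(tweets, key=lambda tweet: tweet["retweet_count"])
print("most viral:", most_viral["id"])
```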
We then extracted and ordered the respective data to find out which one was the most viral tweet in the dataset. It is by Fred Guttenberg, an American activist against gun violence. His 14-year-old daughter Jaime Guttenberg was killed in the Stoneman Douglas High School shooting on February 14, 2018. This is his tweet:
Our example shows, on the one hand, that social media analyses can be conducted relatively easily. On the other hand, we saw that Giving Tuesday is about to become a truly global phenomenon. So prepare for Giving Tuesday 2019, which might fit well into your Christmas fundraising. Speaking of Christmas:
On behalf of joint systems I wish our dear readers a very merry Christmas 2018 and a good start into a successful, healthy and happy 2019 - read you there! :-)
Clustering means grouping similar things together. In this month’s post we take a closer look at how the so-called k-means algorithm as a clustering method might help to develop an even deeper understanding of established donor segments.
Many fundraising organizations segment their donor bases to enable target-group oriented communication and appeals. Working with segments also allows the development of more differentiated analyses on group migrations and income development. The example organization whose data we are playing around with has an up-and-running segmentation model that evaluates the payment behaviour of donors in terms of Recency (when was the last payment), Frequency (how often payments are made) and Monetary Value (sum of donations). This RFM method is commonly used in database-related direct marketing, both in the profit and non-profit sector.
In the following, we take a closer look at the best donor group. The segmentation model already tells us that the group under scrutiny consists of “good donors” in terms of payment behaviour. In this regard, the segment as such can be seen as a homogeneous group. Can a clustering algorithm like k-means generate additional insights regarding the “inner structure” of this group?
K-means is an unsupervised machine learning algorithm. In unsupervised learning, the data comes without pre-defined labels, i.e. it does not include any classifications or categorizations. The goal of unsupervised learning algorithms is to discover hidden structures in data. The basic idea of k-means is to assign a number of records (n) to a certain number of clusters (k) so that the records within a cluster are similar to each other. The number of clusters k is either pre-defined or can be jointly defined by the respective analyst / fundraising manager.
The following walkthrough was inspired by Kimberly Coffey’s blog post on k-means Clustering for Customer Segmentation. It is a highly recommended read and can be found here.
Data extract and preprocessing
Our example organization has an established RFM-based segmentation model that yields 4 core groups. We defined the “best” of those groups to be the subject of a k-means clustering attempt. The dataset we extract is straightforward: it contains a unique identifier and the three RFM variables. Recency is measured as the number of days between December 31st, 2017 and the day of the last donation per person. Frequency reflects the number of payments for the respective record in 2017, whereas Monetary Value shows the donation sum.
K-means clustering requires continuous variables and it works best with (relatively) normally-distributed, standardized input variables. We therefore apply a logarithm (log) to the variables to reduce positive skew and then standardize them. Standardizing variables means re-scaling them to have a mean of zero and a standard deviation of one, i.e. putting them on the scale of the standard normal distribution. The following code snippet illustrates the loading and transformation of the data.
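To make the three variables concrete, here is a minimal sketch in Python (not the R code used for this post) of how payment records could be aggregated into one RFM row per donor. The payment records, the function name rfm_table and the field names are made up for illustration, and we assume for simplicity that all payments fall into 2017.

```python
from collections import defaultdict
from datetime import date

# Hypothetical payment records: (donor_id, payment_date, amount)
payments = [
    ("D1", date(2017, 3, 15), 50.0),
    ("D1", date(2017, 11, 30), 70.0),
    ("D2", date(2017, 6, 1), 20.0),
    ("D2", date(2017, 12, 1), 20.0),
    ("D2", date(2017, 12, 21), 25.0),
]

REFERENCE_DATE = date(2017, 12, 31)


def rfm_table(payments, reference_date=REFERENCE_DATE):
    """Aggregate payments to one Recency/Frequency/Monetary row per donor."""
    last_payment = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for donor_id, paid_on, amount in payments:
        # Track the most recent payment date per donor
        if donor_id not in last_payment or paid_on > last_payment[donor_id]:
            last_payment[donor_id] = paid_on
        frequency[donor_id] += 1
        monetary[donor_id] += amount
    return {
        donor_id: {
            "recency_days": (reference_date - last_payment[donor_id]).days,
            "frequency": frequency[donor_id],
            "monetary": monetary[donor_id],
        }
        for donor_id in last_payment
    }


rfm = rfm_table(payments)
for donor_id, row in rfm.items():
    print(donor_id, row)
```

Note that a lower Recency value is "better" here, as it means the last donation was made more recently.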
Code Snippet 1: Load packages and data, then transform data
This is what our dataset looks like after the aforementioned transformation. Columns 5 to 7 contain the logs of Recency, Frequency and Monetary Value, whereas the standardized (z-)values are in columns 8 to 10.
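The transformation described above can be sketched as follows. This is an illustrative Python version with made-up values, not the R code used for this post; the function name log_standardize is our own. It assumes strictly positive inputs, so a Recency of 0 days would need an offset before taking the log.

```python
import math
import statistics


def log_standardize(values):
    """Log-transform positive values, then rescale to mean 0 and standard deviation 1."""
    logs = [math.log(v) for v in values]
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (n - 1 in the denominator)
    return [(x - mean) / sd for x in logs]


# Hypothetical, right-skewed donation sums
monetary = [20.0, 35.0, 50.0, 120.0, 900.0]
z_monetary = log_standardize(monetary)
print([round(z, 2) for z in z_monetary])
```

The log step pulls in the long right tail, while the z-step puts Recency, Frequency and Monetary Value on a common scale so that no single variable dominates the distance calculations in k-means.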
A quick exploratory view
To initially dive into the data, we plot the log-transformed Monetary Value as well as log-transformed Frequency of donations and use the log-transformed Recency for colouring by using the code in Code Snippet 2.
Code Snippet 2: Exploratory plot of RFM variables
Two things are striking: the high general density of observations (which is due to the large amount of data) and the different shades of blue that reflect a certain heterogeneity in terms of Recency.
Running the K-means Algorithm
We now turn to running the k-means algorithm. The following code contains a loop that runs k-means for up to j clusters (in this example j = 10). It writes the cluster membership on donor level back into the dataset, creates two-dimensional plots (see example for 3 and 7 clusters below) and collects the model information.
Code Snippet 3: K-means clustering
These are two output examples of the code using 3 and 7 clusters.
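For readers who want to see the mechanics of the algorithm, here is a minimal, self-contained sketch of k-means (Lloyd's algorithm) in Python. It is not the R code used for this post. The data points are made up, and we use a deterministic farthest-point initialization purely to keep the sketch reproducible; this is an assumption of ours, as k-means implementations commonly use random starts instead.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=100):
    """Return (labels, centers, total within-cluster sum of squares)."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # memberships are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster runs empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    withinss = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, withinss


# Hypothetical standardized (Recency, Frequency, Monetary Value) triples
points = [
    (-1.2, -1.0, -0.9), (-1.0, -1.1, -1.2), (-0.9, -0.8, -1.0),
    (0.0, 0.1, 1.5), (0.2, -0.1, 1.4), (0.1, 0.0, 1.6),
    (1.3, 1.2, -0.2), (1.1, 1.4, 0.0), (1.2, 1.0, -0.1),
]

# Loop over several cluster counts and collect the model information
results = {}
for k in range(1, 6):
    labels, centers, withinss = kmeans(points, k)
    results[k] = {"labels": labels, "withinss": withinss}
    print(k, round(withinss, 3))
```

The collected within-cluster sums of squares are the raw material for choosing the number of clusters.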
So how do we choose the “optimal” number of clusters now? The graphs below both aim to detect the number of clusters beyond which adding a cluster adds only little additional explanatory power. In other words, we look for bends in the so-called elbow chart. In our example it looks as if this would be at 4. The same decision could have also been made or at least influenced by a business decision regarding the "feasible" number of clusters. Adding additional clusters always adds explanatory power. In practice, however, 4 groups are easier to handle (e.g. in the context of a direct marketing test) than 10 or more clusters.
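The elbow logic can also be expressed as a simple rule of thumb. The sketch below is in Python and uses made-up within-cluster sums of squares, not the results of this post. The function name elbow and the 5% threshold are arbitrary assumptions of ours; in practice the bend is usually judged visually and together with business considerations.

```python
# Hypothetical total within-cluster sums of squares for k = 1..8
withinss = [3000.0, 1800.0, 1200.0, 800.0, 720.0, 660.0, 610.0, 570.0]

# With a single cluster, the within-cluster sum of squares equals the total sum of squares
total_ss = withinss[0]
variance_explained = [1 - w / total_ss for w in withinss]


def elbow(withinss, min_gain=0.05):
    """Return the largest k whose extra share of variance explained is still >= min_gain."""
    total_ss = withinss[0]
    k_best = 1
    for k in range(2, len(withinss) + 1):
        gain = (withinss[k - 2] - withinss[k - 1]) / total_ss
        if gain < min_gain:
            break
        k_best = k
    return k_best


for k, share in enumerate(variance_explained, start=1):
    print(k, round(share, 3))
print("suggested number of clusters:", elbow(withinss))
```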
Results and interpretation
Let us now take a closer look at the results. Click on the picture on the left to get to an interactive 3d-graph of the 4-cluster solution for which the R-code can be found below. The 4-cluster solution yields 4 ellipsoids aiming to reflect the areas with high observation densities for the clusters. These ellipsoids should make the graph easier to read; the actual observations are still represented by differently coloured dots just like in the 2-dimensional plot we used for exploration. The three "upper clusters" in the picture share a comparable level of Monetary Value and Recency. The dark blue ellipsoid stands out of the three as it reflects higher Frequency. The lower ellipsoid reflects observations that rank relatively low on all of the three RFM variables (remember, the higher the recency, the "worse" - knowing that we are working with a dataset of good donors). The video below contains a fixed-axis rotation.
Code Snippet 4: 3d graph
So, what can we conclude from our clustering attempt? K-means is a widely-used and straightforward algorithm that can be applied relatively easily in practice. It is, however, worthwhile to dive into the underlying concepts of the algorithm and consider the related diagnostics (variance explained, "withins"). Due to the derived possibilities in terms of data visualization, results can be directly communicated to fundraising decision makers. These decision makers should be involved in the process at an early stage. Although there are "objective" measures for the numbers of clusters, application-oriented considerations (e.g. for further analyses of test designs) should not be left out.
We hope you liked this month's post and wish you a nice beginning of autumn. Do not hesitate to share, comment, recommend etc. Read / hear / see you soon!
So-called Artificial Neural Networks (ANN) are a family of popular Machine Learning algorithms that has contributed to advances in data science, e.g. in processing speech, vision and text. In essence, a Neural Network can be seen as a computational system that provides predictions based on existing data. Neural Networks are comparable to non-linear regression models (such as logit regression); their potential strength lies in the ability to process a large number of model parameters.
Neural Networks are good at learning non-linear functions. Moreover, multiple outputs can be modelled.
Artificial Neural Networks are generically inspired by the biological neural networks within animal and human brains. They consist of the following key components:
For the simplified application example below, we produced an example dataset with some 140,000 records. Imagine that we start with a relatively large dataset of sporadic donors and have come up with a straightforward definition of the dependent churn variable, e.g. a definition based on the recency of the last respective donation.
The features (variables) we included were:
We start with loading the relevant R packages, reading in our base dataset and some data pre-processing.
Code Snippet #1: Loading packages and data
An essential step in setting up Neural Networks is data normalization. This implies the scaling of the data. See for instance this link for some brief conceptual considerations and information on the scale function in R.
Code Snippet #2: Scaling
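As an illustration of what such scaling does, here is a small Python sketch (not the R code used for this post). It centers each column to a mean of zero and divides by the column's sample standard deviation, which is what R's scale function does by default. The feature values and the function name scale_columns are made up.

```python
import statistics


def scale_columns(rows):
    """Center each column to mean 0 and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]
    return scaled, means, sds


# Hypothetical features per donor: age at entry (years), estimated income (per year)
donors = [[34, 28000], [51, 41000], [67, 36000], [45, 52000]]
scaled, means, sds = scale_columns(donors)
for row in scaled:
    print([round(value, 2) for value in row])
```

Without this step, a feature measured in tens of thousands (income) would dwarf a feature measured in tens (age) when the network's weights are fitted.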
We then split the dataset into a training and test dataset using a 70% split.
Code Snippet #3: Training and test set
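A 70% split can be sketched in a few lines. The following Python version is an illustration only (not the R code used for this post); the records, the function name train_test_split and the seed are hypothetical.

```python
import random


def train_test_split(rows, train_share=0.7, seed=42):
    """Shuffle row indices reproducibly and cut them at the requested share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_share * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Hypothetical donor records with a churn flag
records = [{"donor_id": i, "churned": i % 3 == 0} for i in range(10)]
train, test = train_test_split(records)
print(len(train), "training records,", len(test), "test records")
```

Shuffling before cutting matters whenever the dataset is sorted, e.g. by entry date, because otherwise the test set would systematically differ from the training set.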
Now we are ready to fit the model. We use the package nnet with one hidden layer containing 4 neurons. We run a maximum of 5,000 iterations using the code shown in code snippet number 4:
Code Snippet #4: Fitting Neural Net Model
After fitting the model, we plot our neural net object. The neuron B1 in the illustration below is a so-called bias unit. This is an additional neuron added to each pre-output layer (in our case one). Bias units are not connected to any previous layer and therefore do not represent an "activity". Bias units can still have outgoing connections and might contribute to the outputs in doing so. There is a compact post on Quora with a more detailed discussion.
When it comes to modelling in a data science context, it is quite common to look at the variable importance within the respective model. For neural nets, there is a comfortable way to do this using the function olden from the package NeuralNetTools. For our readers interested in the conceptual foundations of this function, we can recommend this paper.
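The role of bias units becomes clearer when looking at how a prediction is computed. The following Python sketch (not the R code used for this post) passes two inputs through one hidden layer with 4 neurons. We assume logistic activations; all weights and bias values are made up rather than taken from the fitted model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer; each bias acts like a constant input of 1 with its own weight."""
    hidden = [
        sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical parameters: 2 scaled inputs, 4 hidden neurons, 1 output
hidden_weights = [[0.8, -0.4], [-1.1, 0.3], [0.2, 0.9], [-0.5, -0.7]]
hidden_biases = [0.1, -0.2, 0.05, 0.3]
output_weights = [1.2, -0.6, 0.4, -0.9]
output_bias = -0.1

churn_probability = forward([0.5, -1.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(round(churn_probability, 3))
```

If all input weights were zero, the output would still depend on the bias terms, which is exactly the sense in which bias units contribute to the output without having incoming connections.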
Code Snippet #5: Function olden for variable importance
This is the chart that we get:
It stands out that the variable Age at entry has a high negative importance on the output whereas Estimated Income shows some degree of positive variable importance.
We finally turn to running the neural net model for predictive purposes on our test data set and plot our results in a confusion matrix-like manner:
Code Snippet #6: Run prediction and show results
The result of the code above looks as follows:
The table above cross-tabulates the actual and predicted outcomes of churned and non-churned donors. Let's now evaluate the predictive power of our example neural net. In doing so, we can recommend this nice guide to interpreting confusion matrices which can be found here.
In the light of our data and the example model described above, we can conclude that further model tuning would definitely be needed. Tuning would focus on the hyperparameters used. At the same time, we would recommend running a "benchmark model" such as a logit regression to compare the neural net's model performance with.
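The most common measures derived from such a table can be computed as follows. This is a Python sketch with made-up labels, not the results of our model; the function names are our own and churn (1) is treated as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def evaluate(counts):
    """Accuracy, precision and recall; undefined ratios are returned as NaN."""
    tp, tn, fp, fn = (counts[key] for key in ("tp", "tn", "fp", "fn"))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Hypothetical actual and predicted churn flags for ten donors
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = evaluate(counts)
print(counts)
print(metrics)
```

Accuracy alone can be misleading when churners are rare, which is why precision (how many predicted churners actually churned) and recall (how many actual churners were found) are worth reporting alongside it.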
As further reading we can recommend:
As always, we look forward to your shares, likes, comments and thoughts.
Have a nice, hopefully long (rest of) summer!
Every four years, football is omnipresent as national teams compete in the world championship. Like it or not, football is a global obsession on the one hand and big business on the other hand. About a year ago, we first talked about the datafication of football. Data nowadays is used not only to optimize the performance of players and teams but also to predict the results of football matches and whole tournaments. In our last football-related post, we already mentioned Martin Eastwood's presentation from 2014 in which he discussed the application of a Poisson regression model with an adjustment inspired by the 1997 paper by Dixon and Coles published in the Journal of the Royal Statistical Society. Another interesting blog post applying a linear model on Premier League data is from David Sheehan.
It is not surprising that many data scientists made an effort to set up prediction models for the coming football world champion that is currently searched for in Russia. Ritchie King, Allison McCann and Matthew Conlen already tried to predict the 2014 world champion (ok, Brazil did not finally make it ...) on fivethirtyeight.com, the site founded by Nate Silver (well known for predicting baseball and election results).
Two interesting approaches we found for the current World Cup in Russia use the same data set from kaggle.com containing international football results from 1872 to 2017. While Gerald Muriuki applies a logistic regression model and comes up with Brazil as next world champion, Estefany Torres applies decision trees and predicts Spain to win the tournament. Torres additionally included data on individual player performance and market value in her analysis.
James Le took an even closer look at individual player data and investigated the optimal line up for some of the high profile teams in the tournament. We particularly liked the descriptive analyses of the Fifa18 player dataset. Le’s personal prediction is France by the way.
We also came across a post on KDnuggets that mentions additional data sources (FIFA world rankings, Elo ratings, TransferMarkt team value and betting odds). The prediction model there – our Northern neighbours might be happy to hear that – sees Germany beating Brazil in the final.
We will see on July 15th ... and wish you nice summer days / holidays in the meantime.
Read you soon!
We have discussed various topics from the area of data science and analytics on our blog in the past months. Data visualization was focused on quite frequently (as we find it inspiring and fun). For instance, we talked about the power of data visualization in our last blog post. We also mentioned Power BI, a state-of-the-art tool for data integration, analytics and visualization there. Power BI was released by Microsoft some years ago and is comparable to other well-known tools from the field like Tableau and Qlik.
On askyourdata.co, we constantly try to discuss possible applications of data science and analytics in the context of fundraising. Today’s communication mix and fundraising cannot be imagined without the use of digital channels such as display advertising, Search Engine Marketing (SEM) and Optimization (SEO), social media as well as good old websites and blogs, to name just a few. Those “newer” channels (compared to more traditional ones such as direct mailing) bring new analytical and visualization possibilities about. We therefore decided to bring it all together and present a Power BI visualization showcase in this month’s blog post. Which data would have been more suitable for that than the one from our blog askyourdata.co?
So, voilà - our little showcase can be found below - we invite you to have a closer look! You can either use the navigation page or use arrows in the footer to browse the pages. The second page contains some hints on how to use the visualizations interactively. We embedded the showcase as an iframe below; you can also access the original version using this link.
We have Google Analytics implemented on this site, which just requires some lines of additional HTML code. Google Analytics is one of the various services Google offers. It is a web analytics service that allows tracking and reporting website traffic. Power BI allows accessing Google Analytics accounts through its API. Google Analytics and add-on products such as Google Data Studio include data visualization features themselves. However, Power BI brings even broader possibilities and allows the integration of various data sources such as data from SQL servers (which many CRM systems and BI solutions are based upon) or tools like Adobe Analytics. It starts getting really interesting for analysts when data from the actual digital fundraising channels and CRM data that reflects supporter care and behaviour (typically from CRM or BI systems) are looked at in an integrated manner. This is a topic we plan to discuss in-depth in one of our upcoming blog posts.
We used sessions to measure the traffic on this site. Besides a general overview of sessions over time, we looked at where traffic has been coming from in terms of channels and locations. We were also interested in the devices (desktops, mobiles, tablets) our readers use to visit askyourdata.co. Although our showcase is a simplified example, we think it gives an idea of the possibilities that modern visualization tools imply for both data from digital and "classic" channels.
As always, we appreciate your comments and feedback - and look forward to welcoming you again on our blog. We wish you a pleasant start into a hopefully nice summer.