Every four years, football is omnipresent as national teams compete in the World Cup. Like it or not, football is a global obsession on the one hand and big business on the other. About a year ago, we first talked about the datafication of football. Data nowadays is used not only to optimize the performance of players and teams but also to predict the results of football matches and whole tournaments. In our last football-related post, we already mentioned Martin Eastwood's presentation from 2014 in which he discussed the application of a Poisson regression model with an adjustment inspired by the 1997 paper by Dixon and Coles published in the Journal of the Royal Statistical Society. Another interesting blog post applying a linear model to Premier League data is by David Sheehan. It is not surprising that many data scientists have made an effort to set up prediction models for the next football world champion, currently being determined in Russia. Ritchie King, Allison McCann and Matthew Conlen already tried to predict the 2014 world champion (well, Brazil did not make it in the end ...) on fivethirtyeight.com, the site founded by Nate Silver (well known for predicting baseball and election results). Two interesting approaches we found for the current World Cup in Russia use the same data set from kaggle.com containing international football results from 1872 to 2017. While Gerald Muriuki applies a logistic regression model and comes up with Brazil as the next world champion, Estefany Torres applies decision trees and predicts Spain to win the tournament. Torres additionally included data on individual player performance and market value in her analysis. James Le took an even closer look at individual player data and investigated the optimal line-up for some of the high-profile teams in the tournament. We particularly liked the descriptive analyses of the FIFA 18 player dataset. Le's personal prediction is France, by the way.
We also came across a post on KDnuggets that mentions additional data sources (FIFA world rankings, Elo ratings, Transfermarkt team values and betting odds). The prediction model there – our Northern neighbours might be happy to hear that – sees Germany beating Brazil in the final. We will see on July 15th ... and wish you nice summer days and holidays in the meantime. Read you soon!
We have discussed various topics from the area of data science and analytics on our blog in the past months. Data visualization was focused on quite frequently (as we find it inspiring and fun :)). We for instance talked about the power of data visualization in our last blog post. We also mentioned Power BI there, a state-of-the-art tool for data integration, analytics and visualization. Power BI was released by Microsoft some years ago and is comparable to other well-known tools from the field like Tableau and Qlik.
On askyourdata.co, we constantly try to discuss possible applications of data science and analytics in the context of fundraising. Today's communication mix and fundraising cannot be imagined without the use of digital channels such as display advertising, Search Engine Marketing (SEM) and Optimization (SEO), social media as well as good old websites and blogs, to name just a few. Those "newer" channels (compared to more traditional ones such as direct mailing) bring about new analytical and visualization possibilities. We therefore decided to bring it all together and present a Power BI visualization showcase in this month's blog post. What data could be more suitable for that than the data from our blog askyourdata.co? So, voilà – our little showcase can be found below – we invite you to take a closer look! You can either use the navigation page or the arrows in the footer to browse the pages. The second page contains some hints on how to use the visualizations interactively. We embedded the showcase as an iframe below; you can also access the original version using this link. We have Google Analytics implemented on this site – which just requires a few lines of additional HTML code. Google Analytics is one of the various services Google offers; it is used for web analytics and allows tracking and reporting website traffic. Power BI allows accessing Google Analytics accounts through its API. Google Analytics and add-on products such as Google Data Studio include data visualization features themselves. However, Power BI brings even broader possibilities and allows the integration of various data sources such as data from SQL servers (which many CRM systems and BI solutions are based upon) or tools like Adobe Analytics. It starts getting really interesting for analysts when data from the actual digital fundraising channels and CRM data that reflects supporter care and behaviour (typically from CRM or BI systems) are looked at in an integrated manner.
This is a topic we plan to discuss in depth in one of our upcoming blog posts. We used sessions to measure the traffic on this site. Alongside a general overview of sessions over time, we looked at where traffic has been coming from in terms of channels and locations. We were also interested in the devices (desktops, mobiles, tablets) our readers use to visit askyourdata.co. Although our showcase is a simplified example, we think it gives an idea of the possibilities that modern visualization tools offer for data from both digital and "classic" channels. As always, we appreciate your comments and feedback – and look forward to welcoming you again on our blog. We wish you a pleasant start into a hopefully nice summer. About a year ago, we promised to dedicate a significant share of the content on this blog to data visualization. Let's look back together at how we did – and add up :)! In January 2017, we claimed that data visualization has the potential to help us see and convey concepts and facts in a graspable and interesting way. Data that is appropriately visualized might be easier to understand and process for recipients. This in turn might contribute to better decision making in different types of organizations. At the same time, data visualization can essentially be seen as a means of storytelling which might be driven by an underlying agenda such as catalyzing social change or raising awareness for certain topics. The history of data visualization started long before the computer age – some might even see the paintings on the walls of Lascaux as the first data visualizations. In April 2017, we dived a little deeper, took a closer look at what good data visualization would actually be about and recommended some potential sources of inspiration. A highly influential thinker we can recommend in this context is Edward Tufte, an American statistician and professor emeritus of political science, statistics, and computer science at Yale University.
Tufte coined the concept of graphical excellence: "Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space." This implies an attractive display of statistical information; these visualizations:
A recommendable read is Tufte's book The Visual Display of Quantitative Information. One form of data visualization we are fond of is maps – so we dedicated a blog post to them in August 2017. For visualization aficionados, the fun begins as soon as less common and, to a certain extent, special charts come into play. In September 2017 we took a closer look at Sankey and Chord diagrams. In our view, it is still the sound concept and user-oriented design of data visualizations that essentially make a difference. However, there are powerful tools around nowadays. What they have in common is the fact that they are on the one hand capable of integrating various different data sources and on the other hand offer various functionalities in terms of visualization, navigation and interactivity. Our latest video, which you can find on the joint systems YouTube channel, takes an exemplary look at Power BI from Microsoft. We will keep on visualizing and reflecting on it. Anyway, we wish you relaxing Easter holidays and a nice spring. See / hear / read you soon!
Do you sometimes find yourself wishing to predict the future? Well, let's stay down-to-earth, nobody can (not even fundraisers or analysts :). However, there are established statistical methods in the area of time series that we find potentially interesting in the context of fundraising analytics. Our first blog post of 2018 will take a closer look ...

Forecasting with ARIMA

It seems that forecasting future sales, website traffic etc. has become quite an imperative in a business context. In methodical terms, time series analyses represent a quite popular approach to generate forecasts. They essentially use historical data to derive predictions for possible future outcomes. We thought it worthwhile to apply a so-called ARIMA (AutoRegressive Integrated Moving Average) model to fundraising income data from an exemplary fundraising charity.

The data used

The data for the analysis was taken from a medium-sized example fundraising charity. It comprises income data between January 1st, 2015 and February 7th, 2018. We therefore work with some 3 years of income data, coming along as accumulated income sums on the level of booking days. The source of income in our case is regular givers within a specific fundraising product. We know that the organization has grown both in terms of supporters and derived income in the mentioned segment.

Preparing the data

After having loaded the required R packages, we import our example data into R, take a look at the first couple of records, format the date accordingly and plot it. It has to be noted that the data was directly extracted from the transactional fundraising system and essentially comes along as "income per booking day". Code Snippet 1: Loading R packages and data + plot
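The snippet itself is not reproduced here, so the steps just described can only be sketched; the file name and the column names BookingDay and IncomeSum below are assumptions, not taken from the original analysis:

```r
# Load the packages used throughout this post
library(forecast)
library(tseries)
library(ggplot2)

# Import the example data (file and column names are placeholders)
income <- read.csv("income_per_booking_day.csv", stringsAsFactors = FALSE)
head(income)

# Format the booking-day column as a proper date and plot the series
income$BookingDay <- as.Date(income$BookingDay)
ggplot(income, aes(x = BookingDay, y = IncomeSum)) +
  geom_line() +
  labs(x = "Booking day", y = "Income", title = "Income per booking day")
```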
To overcome potential difficulties in modelling at a later stage of the time series analyses, we decided to shorten the date variable and to aggregate on the level of a new date-year variable. The code and the new plot we generated look as follows: Code Snippet 2: Date transformation + new plot
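A hedged sketch of such a date transformation, assuming the data frame and column names from the previous sketch, would aggregate the daily sums to year-month level:

```r
# Shorten the date to year-month and aggregate the daily income sums
income$YearMonth <- format(income$BookingDay, "%Y-%m")
monthly <- aggregate(IncomeSum ~ YearMonth, data = income, FUN = sum)

# Plot the aggregated series
ggplot(monthly, aes(x = YearMonth, y = IncomeSum, group = 1)) +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(x = "Month", y = "Income")
```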
The issue with stationarity

Fitting an ARIMA model requires a time series to be stationary. A stationary time series is one whose properties are independent of the point on the timeline at which its data is observed. From a statistical standpoint, a time series is stationary if its mean, variance, and autocovariance are time-invariant. Time series with underlying trends or seasonality are not stationary. This requirement makes intuitive sense: as ARIMA uses previous lags of a time series to model its behavior, modeling stable series with consistent properties implies lower uncertainty. The form of the plot above indicates that the mean is actually not time-invariant – which would violate the stationarity requirement. What to do? We will use the log of the time series for the later ARIMA.

Decomposing the data

Seasonality, trend, cycle and noise are generic components of a time series. Not every time series will necessarily have all of these components (or even any of them). If they are present, a deconstruction of the data can set the baseline for building a forecast. R includes a comfortable method to decompose a time series: stl (from the built-in stats package). It splits the data (which is plotted first using a line chart) into a seasonal component, a trend component and the remainder (i.e. noise). After having transformed the data into a time series object with ts, we apply stl and plot. Code Snippet 3: Decompose time series + plot
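The decomposition step could look roughly like this (the start date follows the data description above; monthly is the assumed aggregate from the previous sketch):

```r
# Transform the monthly sums into a time series object
income_ts <- ts(monthly$IncomeSum, start = c(2015, 1), frequency = 12)

# Decompose into seasonal, trend and remainder components and plot
decomp <- stl(income_ts, s.window = "periodic")
plot(decomp)

# Remove the seasonal component for the later modelling steps
deseasonal <- forecast::seasadj(decomp)
```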
Dealing with stationarity

The augmented Dickey-Fuller (ADF) test is a statistical test for stationarity. The null hypothesis assumes that the series is non-stationary. We now apply – as mentioned earlier – a log transformation to the deseasonalized income and test the data for stationarity using the ADF test. Code Snippet 4: Apply ADF test to logged time series
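A sketch of the test, using adf.test from the tseries package on the log of the deseasonalized series assumed above:

```r
library(tseries)

# Log-transform the deseasonalized income and test for stationarity
log_income <- log(deseasonal)
adf.test(log_income, alternative = "stationary")
```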
The computed p-value is 0.0186, i.e. the null hypothesis of non-stationarity can be rejected at the usual 5% significance level: the data is stationary and ARIMA can be applied.

Fitting the ARIMA model

We now fit the ARIMA model using auto.arima from the package forecast and plot the residuals. Code Snippet 5: Fitting the ARIMA
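The fitting step could look like this (fit is the object name the post itself refers to later on):

```r
library(forecast)

# Let auto.arima search for a suitable (p,d,q) specification
fit <- auto.arima(log_income)
fit

# Inspect the residuals: time plot, ACF and PACF
tsdisplay(residuals(fit), main = "Residuals of the fitted model")
```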
The ACF plot (Autocorrelation and Cross-Correlation Function Estimation) of the residuals (lower left in the picture above) shows that all lags are within the blueish dotted confidence bands. This implies that the model fit is already quite good and that there are no apparently significant autocorrelations left. The model that auto.arima fit was ARIMA(2,1,0) with drift.

Forecasting

We finally apply the command forecast from the respective package to the object fit that contains our ARIMA model. The parameter h represents the number of time series steps to be forecast. In our context this implies predicting the income development for the next 12 months. Code Snippet 6: Forecasting
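A sketch of that forecasting call with h = 12 as described; since the model was fit on logged data, point forecasts need to be back-transformed with exp:

```r
# Forecast the next 12 months and plot point forecasts with intervals
fcast <- forecast(fit, h = 12)
plot(fcast)

# Back-transform the point forecasts to the original income scale
exp(fcast$mean)
```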
Outlook and Further reading
We relied on auto.arima, which does a lot of tweaking under the hood. There are also ways to modify the ARIMA parameters within the code. We went through our example with data from regular giver income for which we knew a priori that growth and a certain level of seasonality due to debiting procedures were present. Things might look a little different if we, for instance, worked with campaign-related income or bulk income from a certain channel such as digital. In case you want to take a deeper dive into time series, we recommend the book Time Series Analysis: With Applications in R by Jonathan D. Cryer and Kung-Sik Chan. A free digital textbook called Forecasting: Principles and Practice by Rob J. Hyndman (author of the forecast package) and George Athanasopoulos can also be found on the web. We also found Ruslana Dalinina's blog post on the foundations of time series worth reading. The same goes for Suresh Kumar Gorakal's introduction to the forecast package in R. Now it is TIME to say "See you soon" in this SERIES :)!

The year 2017 is coming to its close soon. We guess that for many people the days between the years are a time to reflect on the opportunities and challenges the year that is about to start will bring. One can expect that this is also true when it comes to trends in Data Science. We did some research for Data Science outlooks 2018.

Machine Learning
Machine Learning has been one of the buzzwords in the recent past. We also published an introductory blog post on the topic earlier this year. One might say the term is overhyped; however, machine learning is applied in academia and across industries, as a very recent survey by KDnuggets shows. Dataconomy, a Berlin-based Data Science company, elaborates on the promising field of machine learning but also mentions where organizations are starting from: regardless of Data Science concepts and tools, 77 percent of German companies still rely on "small" data tools like Excel and Access. Many of them might still have plenty of homework to do to transform into what Dataconomy calls data-driven organizations. At the same time, a recent KPMG study (available upon request in German) shows that 60 percent of companies have been able to benefit in different forms (reduction of costs or risks and/or increase of revenues) from Data Science – which of course includes Machine Learning and Artificial Intelligence.

Artificial Intelligence (AI)

Speaking of artificial intelligence – for many, 2018 will be the year when AI breaks through. The prominent research company Gartner defines AI as one of the most important strategic technology trends for 2018. They refer to a recent survey showing that 6 out of 10 organizations are in the process of evaluating AI strategies whereas the remaining 4 have started adopting respective technologies. The analytics company absolutedata goes as far as to speak of AI-powered marketing and formulates certain predictions regarding what 2018 will bring in this context:
Bill Vorhies, Editorial Director for Data Science Central, is a bit more hesitant in the context of AI. He predicts that – regardless of the hype – the diffusion of techniques and tools from the field of AI and deep learning will be slower and more gradual than expected by many. One already visible manifestation of the spread of AI are chat bots, which are increasingly used in a web and mobile context. Chat bots essentially process natural language and thereby involve customers and prospects in an interaction. The implementation of facial and gesture recognition currently looks like the next big thing, as possible applications at the point of sale seem vast. How should nonprofits deal with AI? Steve MacLaughlin, author and Vice President of Data & Analytics at Blackbaud, underlines the vast opportunities for nonprofits but also relativizes the buzz around AI. MacLaughlin explains that AI for nonprofits requires the availability of the right data, contextual expertise, and continuous learning. Given these factors, AI can support nonprofits to be impactful, particularly through fundraising.

Big Data

We also dealt with Big Data – probably the buzziest of all buzzwords from the field – in February this year. We quite liked a recent blog post by KDnuggets that starts as follows: "There's no denying that the term Big Data is no longer what it used to be [..] We now all just assume and understand that our everyday data is huge. There is, however, still value in treating Big Data as an entity or a concept which needs to be properly managed, an entity which is distinct from much smaller repositories of data in all sorts of ways." What follows next is the gist of interviews with various experts from the field talking about their expectations for 2018 and beyond – a really recommendable holiday read.

Data Protection

The EU General Data Protection Regulation (GDPR) will be enforced from May 25th, 2018.
It is yet to be seen to what extent customers and donors will, for instance, actively insist on the Right to be Forgotten – which might have implications for the availability of donor data for advanced modelling as well.

People Needed!

More and more organizations seem to develop an interest in Big Data experts, Data Architects, Data Scientists etc. Without any doubt, advanced analytics and Data Science efforts need motivated and skilled women and men to succeed. Florian Teschner recently conducted an analysis of Data Science job offers. Although there is still scope for absolute growth in available positions, there has been some five-fold increase since 2015.

Stay tuned to Data Science in 2018

We think it will be worthwhile to keep one's eyes open in 2018. If you are interested in Data Science as well as the conceptual and technological developments in the field and you are not on Twitter yet, following some influential thinkers from the area might be an interesting starting point. Maptive.com provides a list of potential influencers. If you want to dive a little deeper, some experts' GitHub accounts might be a place to go if you look for papers, code etc. – just follow the overview provided by analyticsvidya.com.

Last, not least ...

... we wish you and your dear ones a happy and relaxed (data-free) Christmas and a good start into a successful and dynamic 2018. See you next year on this blog!

This month's blog post illustrates how to extract, analyze and visualize data from Facebook using the software R. We use Rfacebook, a specific package for social media mining. To get started, you have to set up a developer account on www.developers.facebook.com, which is really simple given that you have an existing Facebook account yourself. After that, there are some steps to follow to connect R to the programming interface (API) of Facebook and make the authentication process reproducible in your R code.
There are lots of walk-through descriptions available on the web; the ones on listendata.com and thinktostart.com helped us a lot to prepare this post.

Getting connected and exploring data

In order to provide a hopefully interesting example for analytical practice, we take a closer look at the Facebook pages of the renowned international charities Unicef and Save the Children. After having connected to the Facebook API via R, we scrape the last 100 posts published on the Unicef page: Code Snippet 1: Scrape data from Facebook page
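The authentication and scraping steps might be sketched as follows; the app id and secret are placeholders you obtain from your own developer account, and getPage is the Rfacebook function discussed in the text:

```r
library(Rfacebook)

# Authenticate against the Facebook Graph API (placeholder credentials)
fb_oauth <- fbOAuth(app_id = "YOUR_APP_ID", app_secret = "YOUR_APP_SECRET")
save(fb_oauth, file = "fb_oauth")  # makes the authentication reproducible

# Scrape the last 100 posts of the Unicef page into a data frame
unicef <- getPage(page = "unicef", token = fb_oauth, n = 100)
```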
The command getPage() creates a data frame which we can use for further analysis. It contains the following variables:
We take a look at the mentioned data frame using summary(), reformat the column created_time into a date and run the descriptive statistics again: Code Snippet 2: Look at data frame and reformat created_time
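A sketch of these two steps (created_time arrives from the API as a character timestamp):

```r
# Descriptive overview of the scraped posts
summary(unicef)

# Reformat the character timestamp into a date and look again
unicef$created_time <- as.Date(unicef$created_time)
summary(unicef$created_time)
```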
Using the summary() command we see that the last 100 posts were published in a period of almost 2 weeks between November 6th and November 21st, 2017 (minimum and maximum of created_time). So-called likes and shares on Facebook are indicators that reflect the reception and virality of posted content. Content might come in the form of photos, videos or posted links. Let us start with a simple histogram of like counts for recent posts on the page under scrutiny. We therefore embed the data frame unicef in a command within the popular visualization package ggplot2: Code Snippet 3: Histogram of Likes for last 100 posts
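The histogram could be built along these lines (the binwidth is an arbitrary choice, not taken from the original snippet):

```r
library(ggplot2)

# Histogram of like counts for the last 100 posts
ggplot(unicef, aes(x = likes_count)) +
  geom_histogram(binwidth = 500) +
  labs(x = "Likes per post", y = "Number of posts")
```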
This is the plot we end up with: As we are curious what type of posted content reached a significant number of likes and shares on our example page, we use the data frame unicef and create a scatterplot of likes and shares for the last 100 posts. This is how the respective code looks: Code Snippet 4: Plot Likes and Shares for last 100 Posts
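A sketch of that scatterplot, colouring points by post type (photo, video, link), which matches the interpretation given in the text:

```r
# Likes vs. shares for the last 100 posts, coloured by content type
ggplot(unicef, aes(x = shares_count, y = likes_count, colour = type)) +
  geom_point(size = 3) +
  labs(x = "Shares", y = "Likes", colour = "Post type")
```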
We see that the two recent posts with the highest numbers of likes and shares were both using photos (see the two orange points in the upper right corner of the plot below).

Text Mining

In order to dive deeper into content-related analyses of social media, we apply some basics of text mining using several packages combined. Text mining is a wide field with a large number of applications in different areas. One of our future blog posts will try to dig deeper and provide background information in this regard. For the sake of brevity and because of the focus on the Facebook API, we focus on the basics this time. Word clouds, which are also called tag clouds, are commonly used to visualize text data. Typically, a conglomerate of words from the text under scrutiny is visualized using word sizes in relation to the frequency of appearance in the respective text. Word clouds can be created in R with relative ease using the package wordcloud but require some preparatory steps for data extraction and preparation. For this purpose we use the packages tm for text mining and SnowballC for text stemming. Our code is inspired by Alboukadel Kassambara's blog post on text mining basics. To illustrate the use of text mining in social media analyses, we take a closer look at the text data within the last 100 posts from the international charities Unicef and Save the Children. The subsequent steps needed can be summarized as follows:
This is how those steps look in R language: Code Snippet 5: Text Mining and Word Cloud Creation
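A typical tm/wordcloud pipeline for those steps might be sketched as follows (the exact transformations in the original snippet may differ):

```r
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

# Build a corpus from the post messages and clean it up
corpus <- Corpus(VectorSource(unicef$message))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Count term frequencies and draw the word cloud
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```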
The generated word cloud for the scraped most recent Facebook posts of Unicef looks as follows: It is not too surprising that the term World Children's Day is so prominently positioned in the word cloud of recent posts by Unicef. November 20th is United Nations Universal Children's Day; it is the date when the UN adopted the Declaration of the Rights of the Child (1959) as well as the Convention on the Rights of the Child. For the sake of comparison, we rerun the code for data extraction, data processing and visualization and generate a word cloud for the last 100 posts of Save the Children: Whereas Unicef obviously related to World Children's Day within their Facebook communication, which might be connected to their strong ties to the United Nations, Save the Children literally kept on putting the terms children and child in the center of their content.
Conclusion

There are numerous social media analytics tools around (both free and paid; see for example this blog post for a quick overview). However, using the Facebook API with R opens up the possibility of using the power and flexibility of R to gain additional insights from accessible Facebook data. Have a nice fall and stay connected (via Facebook or any other means :)!

This month's blog post was contributed by my colleague Susanne Berger. Thank you, Susanne, for the highly interesting read on attrition (churn) for committed givers and how to investigate it with methods and tools from data science. Johannes

What this post is about

The aim of this post is to gain some insights into the factors driving customer attrition with the aid of a stepwise modeling procedure:
The problem

Service-based businesses in the profit as well as in the nonprofit sector are equally confronted with customer attrition (or customer churn). Customer attrition occurs when customers decide to terminate their commitment (i.e., regular donation), and finds expression in absent interactions and donations. Any business whose revenue depends on a continued relationship with its customers should focus on this issue. We will try to gain some insights into customer attrition for a selected committed giving product from a fundraising NPO. Acquisition activities for the product took place for six years in total and ended some years ago. Essentially, we are concerned with how long a commitment lasts and what actions can be undertaken to prolong the survival time of commitments. We worked with a pool of some 50,000 observations. To be more specific, we are interested in whether commitments from individuals with a certain socio-demographic profile or other characteristics tend to survive longer. Survival analysis is a tool that helps us to answer these questions. However, it might be the case that the underlying decision process of an individual to terminate a commitment has different influence factors than the decision to start supporting. The approach chosen is to first ask whether an individual with one of the respective commitments has started paying, and if she has, we will have a look at the duration of the commitment. Both of these questions are depicted in a separate model:
The data

Let's catch a first glimpse of the data. The variables ComValidFrom and ComValidUntil depict the start and end date of a commitment, ComValidDiff gives the length of a commitment in days, and censored shows whether a commitment has a valid end date: it is equal to one if the commitment has been terminated, and equal to zero if the commitment has not been terminated (until 2017-10-14). R Code – Snippet 1
Age and AmountYearly (the yearly amount payable) are, among other variables such as PaymentRhythm, Region and Gender, included in the analysis, as they potentially affect the survival time of a commitment or the probability of an individual starting to donate. At the moment, no further explanatory variables are included in the analysis.

The analysis

After having briefly introduced the most important variables in our data, let's take a look at the results: Has an individual even begun to donate within the committed gift? First, we need to construct a binary variable that allocates a FALSE if the lifetime sum of a commitment (variable ComLifetimeSum) is zero. Individuals who started to donate, i.e. whose lifetime sum is larger than zero, get assigned a TRUE. The variable DonorStart distinguishes between these two cases. R Code – Snippet 2
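In R, constructing such a variable is a one-liner (dat is an assumed name for the commitment data frame, not taken from the post):

```r
# TRUE if the individual ever paid anything within the commitment
dat$DonorStart <- dat$ComLifetimeSum > 0
table(dat$DonorStart)
```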
A popular way of modeling such binary responses is logistic regression, a mathematical model that can be employed to estimate the probability of an individual starting to donate, controlling for certain explanatory variables, in our case Gender, Age, AmountYearly, PaymentRhythm, and Region. Thus, we fit an exemplary logistic regression model, assuming that Age and AmountYearly have a nonlinear effect on the probability to start paying. In addition, interactions of Age and Gender, as well as of Age and PaymentRhythm, are included. For example, the presence of the first interaction effect indicates that the effect of Age on the probability to start to donate is different for male and female individuals. All estimated coefficients are significant at least at the 5% significance level (except two of the Region coefficients and the main gender effect). R Code – Snippet 3
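The exact model specification is not shown in the post; a hedged sketch using natural splines for the nonlinear effects and the two interactions described above could read:

```r
library(splines)

# Logistic regression: nonlinear Age and AmountYearly effects via splines,
# plus Age:Gender and Age:PaymentRhythm interactions (specification assumed)
m <- glm(DonorStart ~ ns(Age, 3) * Gender + ns(Age, 3) * PaymentRhythm +
           ns(AmountYearly, 3) + Region,
         data = dat, family = binomial)
summary(m)
```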
Usually, the results are presented in tables with a lot of information (such as p-values, standard errors, and test statistics). Furthermore, the estimated coefficients are difficult to translate into an intuitive interpretation, as they are on a log-odds scale. For these reasons, the summary table of the model will not be presented here. An alternative way to show the results is to use some typical values of the explanatory variables to get predicted probabilities of whether an individual started to donate. An effects plot (from the effects package) plots the results such that we can gain insights into how predicted probabilities change if we vary one explanatory variable while keeping the others constant. R Code – Snippet 4
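With the effects package, such plots require little code (m stands for a fitted logistic regression model like the one described above):

```r
library(effects)

# Predicted probabilities for all model terms, including the interactions,
# holding the remaining explanatory variables at typical values
plot(allEffects(m))
```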
In the subsequent panel, the effects plot for the interaction between Age and Gender is depicted. For all ages, females have a higher probability to start to donate than males. In addition, regardless of gender, the older the individual (up to age 70-75), the higher the probability to start to donate. For individuals in their twenties, the steep slope indicates a larger change in probability for each additional year of age than for individuals older than approximately 30 years. We use the following code to create the effects plot: R Code – Snippet 5
We were also interested in the interaction between Age and PaymentRhythm. As most of the individuals in the sample decided for either yearly or half-yearly payment terms, one should not put too much confidence in the other two categories, quarterly and monthly. Nevertheless, for individuals with a yearly payment rhythm, the probability to start to donate increases with age up to 70-75 years. For individuals with a half-yearly or quarterly payment rhythm, rising age (up to 70-75 years) does not seem to affect the probability of starting to donate much. And for a monthly payment rhythm, the probability even seems to decrease slightly with rising age. The figure below illustrates these findings: R Code – Snippet 6
Now, let's go one step further and try to investigate which factors have an influence on the duration of a commitment, once an individual has started to donate. The question we ask: What affects the duration of a commitment, given the individual actually started paying for it? Survival analysis analyzes the time to an event, in our case the time until a commitment is terminated. First, we will have a look at a non-parametric Kaplan-Meier estimator, then continue with a semi-parametric Cox proportional hazards (PH) model. For the sake of brevity, we will neither go into the computational details, nor will we discuss model diagnostics and selection. To start the analysis, we first have to load the survival package and create a survival object. R Code – Snippet 7
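Creating the survival object might look like this; since the variable censored is coded 1 = terminated (as described above), it serves directly as the event indicator:

```r
library(survival)

# Survival object: time in days, event = commitment terminated
surv_obj <- Surv(time = dat$Survdays, event = dat$censored)
head(surv_obj)
```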
The variable Survdays reflects the duration of commitments in days. The "+" sign after some observations indicates commitments that are still active (censored) and thus do not have a valid end date (ComValidUntil is set to 2017-10-14 for these cases, as the last valid end date is 2017-10-13): R Code – Snippet 8
The illustration below shows Kaplan-Meier survival curves for different groups; the tick marks on the curves represent censored observations. It can be observed that 50% of individuals terminate their commitment approximately within the first 1.5 years. We formed groups using maximally selected rank statistics. The groups exhibit different survival patterns:
The limit of Kaplan-Meier curves is reached when several potential variables are available that we believe contribute to survival. Thus, in a next step we employ a Cox regression model to be able to control statistically for several explanatory variables. Again, we are interested in whether the variables Age, yearly amount payable, Gender, payment rhythm, and Region are related to survival, and if so, in what manner. The relationship of Age as well as of the yearly amount payable to the log hazard (the hazard here is the instantaneous risk – not probability – of attrition) is modelled with a penalized spline in order to account for potential nonlinearities.

A Cox proportional hazards model

First, we fit one Cox model for all data, including individuals who never started paying their commitments. The upper right panel in the figure below shows that the log hazard decreases (which means a lower relative risk of attrition) with rising age, with a slightly upward turn after age 60. But again, the data are rather sparse at older ages, which is reflected by the wide confidence interval and should thus be interpreted with caution. In the upper left panel, we see that the relative risk of attrition is higher for male than for female individuals. In addition, the lower left panel depicts a nonlinear relationship for the yearly amount payable: the log hazard decreases the higher the yearly amount (with a small upward peak at 100 Euro), and increases again for yearly amounts higher than about 220 Euro (though data are again sparse for amounts higher than 200 Euro).

Include only individuals who started to donate ...

Next, we exclude individuals who never started to donate (i.e. those with a lifetime sum of zero) from the data, and then fit one Cox model for early data (with a commitment duration < 6 months) and one Cox model for later data (with a commitment duration > 6 months). Interestingly, some of the results undergo fundamental changes.
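A hedged sketch of such a Cox model with penalized splines; the formula is an assumption based on the variables named above, and termplot draws panels like the ones just discussed:

```r
library(survival)

# Cox PH model: penalized splines for Age and AmountYearly (formula assumed)
cox_fit <- coxph(Surv(Survdays, censored) ~ pspline(Age) +
                   pspline(AmountYearly) + Gender + PaymentRhythm + Region,
                 data = dat)

# Termplots of the (possibly nonlinear) covariate effects on the log hazard
termplot(cox_fit, se = TRUE, rug = TRUE)
```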
...and a commitment duration of > 6 months

In the following, we present the term plots of a Cox model including only individuals with a commitment duration of more than 6 months and a lifetime sum larger than zero. The gender effect is reversed compared to the model in the figure above. In addition, the shape of the non-linear effect of the yearly amount payable changes when only the "later" data are considered.

...and a commitment duration of <= 6 months

Last but not least, the term plots of a further Cox model including only individuals with a commitment duration of less than or equal to 6 months and a lifetime sum larger than zero are shown. Here, neither age nor the yearly amount payable seems to exhibit strong non-linearities (in fact, the effect of the yearly amount payable is linear). Furthermore, age does not seem to have an effect on the risk of attrition at all.

Conclusion and outlook
Today I want to introduce three different kinds of charts and how they can be applied to data on donor segments and income. You might have come across these kinds of visualizations before, though maybe in different contexts such as the illustration of voter streams after elections or the visualization of genome data. All charts were generated with R, an open-source statistical software. I would go as far as to call it an ecosystem, as there is a vast amount of additional packages and a large community constantly developing R's features. The following charts are all interactive HTML visualizations, so please click on the respective pictures below to open them and take a closer look.

The first visualization is a so-called Sankey diagram; I have also seen them called river plots. Sankey diagrams are special forms of flow charts. They are named after the Irish captain Matthew Sankey, who first used one in the late 19th century to show the energy efficiency of steam engines. Comparable visualizations are typically used to illustrate detailed election results when it comes to the (calculated) streams of voters from one party to another. If you are a regular reader of this blog, you might have seen a very famous Sankey diagram by Charles Joseph Minard that illustrates Napoleon Bonaparte's Russian campaign of 1812. The R code used is relatively simple and builds on the googleVis package; an example can be found here (scroll towards the end of the page). The example to the left attempts to visualize the migration of donors between defined segments over time. The thickness of a line represents the number of donors having moved from one group to another. The interactive version of the chart (click the picture) gives you a tooltip showing the exact numbers.

The visualization to the right is a so-called chord diagram. Wikipedia defines chord diagrams as a graphical method of displaying the interrelationships between data in a matrix.
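Back to the Sankey example for a moment: a minimal googleVis sketch of such a diagram could look as follows (the segment names and counts below are made up):

```r
library(googleVis)

# Made-up donor migrations between segments from year 1 to year 2
flows <- data.frame(
  from   = c("Major donors (Y1)", "Regular donors (Y1)", "Regular donors (Y1)"),
  to     = c("Major donors (Y2)", "Regular donors (Y2)", "Lapsed (Y2)"),
  weight = c(120, 800, 150)
)

sankey <- gvisSankey(flows, from = "from", to = "to", weight = "weight")
# plot(sankey)   # opens the interactive HTML chart in your browser
```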
Chord diagrams are quite frequently used to visualize genome data; the New York Times published a beautiful example back in 2007. There are several ways to generate chord diagrams in R; I used the tool by Matt Flor that can be found on GitHub. The idea behind the sample data I used is again based on the migration of donors from segments in year 1 to segments in year 2 within a defined segmentation model.

The third chart makes your data move: it is a so-called motion chart. I used sample data in a flat file containing income per donor segment (or "business area") per year from 2013 to 2016. Again, with some simple lines of R code using the googleVis package, the data can be visualized and animated in a bubble chart (with several different options) as well as in more "classical" bar and line charts. It is worth mentioning that no actual data is transmitted to Google when using the API; the rendering happens locally in the browser. The examples I created worked fine in Internet Explorer 11 and Edge under Windows 10.

If you are interested in the R code / sample data I used, or in a chat, please feel encouraged to get back to me via this site, email, or social media. I hope you have / had a nice end of summer. See you next month :)!
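PS: for reference, a sketch of the motion chart call on made-up income figures (segment names and numbers are invented to mimic the flat-file layout):

```r
library(googleVis)

# Made-up income (thousand Euro) per donor segment per year, 2013-2016
income <- data.frame(
  segment = rep(c("Major donors", "Regular donors", "Legacies"), each = 4),
  year    = rep(2013:2016, times = 3),
  income  = c(410, 430, 390, 450, 900, 950, 980, 1010, 120, 80, 200, 150)
)

motion <- gvisMotionChart(income, idvar = "segment", timevar = "year")
# plot(motion)   # rendering happens locally in the browser
```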
Visualizing data in maps can be a bit of a challenge, although widespread tools like MS Excel recently gained improved functionality in this context (see for example our blog post called Where the donors are from earlier this year). There are two R packages that are definitely worth a look when it comes to producing maps with data.

The first one is called Leaflet. Leaflet is an open-source JavaScript library for creating and customizing interactive maps. These maps can be used directly from the R console, within RStudio, as well as in so-called Shiny apps and R Markdown documents. Leaflet is becoming more and more popular, probably also because of its use in online media (see these examples from the NY Times and the Washington Post). Using Leaflet is fairly easy. A very simple example shows some of the major cities I have visited on a world map and attaches a little comment popup to them. The underlying data is straightforward (longitudes, latitudes, comments) and the code is quite self-explanatory. Click on the image below to see the interactive result:

Another recommendable package is Plotly (also spelt Plot.ly, like its URL). Plotly is a data analytics and visualization platform that invites you to play around. Plotly provides libraries for languages like Python, Perl and MATLAB and, last but not least, a package for R. I have created a simple example with made-up data on the number of potential beneficiaries per country on a world map. This is the result (click the image below):

There are sources with far more impressive showcases than mine, e.g. in the R Graph Gallery. Examples I found inspiring in a broader fundraising and charity context are maps illustrating the Nepal earthquakes as well as a London income map.
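A minimal sketch of the Leaflet example described above (the cities and comments here are made up):

```r
library(leaflet)

# A few visited cities: longitude, latitude and a popup comment
cities <- data.frame(
  lng  = c(-0.13, 2.35, 13.40),
  lat  = c(51.51, 48.86, 52.52),
  note = c("London", "Paris", "Berlin")
)

m <- leaflet(cities) %>%
  addTiles() %>%                                     # default OpenStreetMap tiles
  addMarkers(lng = ~lng, lat = ~lat, popup = ~note)
# m   # print the widget in RStudio / the viewer to explore the map
```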
The possibilities with Leaflet and Plotly seem limitless. I have found that working with the two – after some initial tweaking of the data and getting used to them – is really fun!
The caret package ("caret" stands for Classification And REgression Training) includes functions to support the model training process for complex regression and classification problems. It is widely used in the context of machine learning, which we have dealt with earlier this year; see this blog post from spring if you are interested in an overview. Apart from other, even more advanced algorithms, regression approaches represent an established element of machine learning. There are tons of resources on regression if you are interested, and we will not focus on the basics in this post. There is, however, an earlier post if you are looking for a brief introduction.
In the following, the HTML output of a knitr report was pasted onto the blog site. We used made-up data with donor lifetime value as dependent variable and the initial donation, the last donation and donor age as predictors. In short: the explanatory and predictive power of the model is low (for which the dummy data is partly to blame). However, the code below aims to illustrate important concepts of machine learning in R:
Preparing Training and Test Data

We see that InitialDonation, LastDonation and Lifetimesum are factors ... so let's prepare the data.
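A sketch of this preparation step on made-up data (the report's actual data is not public; the simulation below simply mimics the amount columns arriving as factors):

```r
library(caret)

set.seed(42)
# Made-up donor data; the amount columns arrive as factors, as in the report
donors <- data.frame(
  InitialDonation = factor(sample(10:100, 300, replace = TRUE)),
  LastDonation    = factor(sample(10:100, 300, replace = TRUE)),
  Age             = sample(20:80, 300, replace = TRUE),
  Lifetimesum     = factor(sample(100:5000, 300, replace = TRUE))
)

# Convert the factor columns to numeric via character, so values are preserved
num_cols <- c("InitialDonation", "LastDonation", "Lifetimesum")
donors[num_cols] <- lapply(donors[num_cols],
                           function(x) as.numeric(as.character(x)))

# 80/20 split into training and test data
idx   <- createDataPartition(donors$Lifetimesum, p = 0.8, list = FALSE)
train <- donors[idx, ]
test  <- donors[-idx, ]
```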
As we now have a decent dataset, we go ahead and load the prominent machine learning package caret (Classification And REgression Training).

Data cleaning

So-called near-zero-variance variables (i.e. variables whose observations are (almost) all the same) are removed.
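This step boils down to caret's `nearZeroVar()`; a tiny self-contained illustration:

```r
library(caret)

# Toy data frame with one informative and one constant column
df  <- data.frame(x1 = rnorm(100), x2 = rep(1, 100))

nzv <- nearZeroVar(df)                          # indices of near-zero-variance columns
if (length(nzv) > 0) df <- df[, -nzv, drop = FALSE]  # here: drops the constant x2
```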
Intercorrelation

Before we fit the model, let's have a look at intercorrelations between the predictors.
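With caret, highly intercorrelated predictors can be flagged with `findCorrelation()`; a small sketch on simulated data:

```r
library(caret)

set.seed(7)
x1    <- rnorm(100)
preds <- data.frame(x1 = x1,
                    x2 = x1 + rnorm(100, sd = 0.01),  # almost identical to x1
                    x3 = rnorm(100))

corr_mat <- cor(preds)
high     <- findCorrelation(corr_mat, cutoff = 0.9)   # columns worth dropping
if (length(high) > 0) preds <- preds[, -high, drop = FALSE]
```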
