In 2017, the Economist published a headline that has become famous among data aficionados: Data is the new oil. It referred to the many new opportunities businesses and organisations can seize when they make good use of the data newly generated every day by their customers or partners. In many cases, we can rely on tried and tested statistical and mathematical methods that have been in use for decades. If data is neatly stored in a relational database and takes the form of numbers and the occasional categorical variable, we can use a multitude of statistical methods to analyse and gain insight from the data.

However, the real world is always messier than we would like it to be, and this applies to data as well: Forbes magazine estimates that 90% of the data generated every day is unstructured data, that is text and images, videos and sounds; in short, all data that we cannot utilise using standard methods. Natural Language Processing (NLP) is a way of gaining insight into one category of unstructured data. The term describes all methods and algorithms that deal with processing, analysing and interpreting data in the form of (natural) language.

One key field of NLP is Sentiment Analysis, that is, analysing a text for the sentiment it contains or the emotions it evokes. Sentiment Analysis can be used to classify a text according to some metric such as polarity (is the text negative or positive?) or subjectivity (is the text written subjectively or objectively?). While its applications have traditionally focussed on social media – classifying for instance a brand's output on social media or analysing the public's reaction to its web presence – they are certainly not limited to the digital sphere.
An NGO communicating with its supporters through channels such as direct mailing could, for example, use Sentiment Analysis to examine its mailings and investigate the responses they elicit.

A frequently employed yet quite simple way of undertaking Sentiment Analysis is to use a so-called Sentiment Dictionary. This is simply a dictionary of words, each of which receives a score according to some metric related to sentiment. For instance, to score a text for its polarity one would use a dictionary allocating to each word a continuous score in some symmetric interval around zero, where the ends of the interval correspond to strictly negative or strictly positive words, and more neutral words cluster around the midpoint. Sentiment Dictionaries differ widely depending on context and, most importantly, language: while there is some choice for English, it diminishes for other languages. A collection of dictionaries in the author's native German, for example, can be found here.

As a brief example of Sentiment Analysis in action, consider the following diagram, showing a combined polarity score for some direct mailing activities of an NGO in a German-speaking country over one year, whereby each mailing received a score between -1 (very negative) and +1 (very positive). This scoring was undertaken using the SentiWS dictionary of the University of Leipzig.

A very different approach relies on word embedding. Word embedding is a technique whereby the individual words in a text are represented as high-dimensional vectors, ideally in such a way that certain operations or metrics in the vector space correspond to some relation on the semantic level. This means, for example, that the embedding places words occurring in similar circumstances or carrying similar meaning or sentiment close together in the vector space, that is to say, their distance measured by some metric is small.
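To make the idea of "closeness" in the vector space a little more tangible, here is a minimal sketch using cosine similarity as the metric. The three-dimensional vectors are invented purely for illustration (real embeddings are learned from a corpus and have far more dimensions), and cosine similarity is only one of several metrics one could choose.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embedding = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "terrible": [-0.9, -0.7, 0.1],
}

sim_close = cosine_similarity(toy_embedding["good"], toy_embedding["great"])
sim_far = cosine_similarity(toy_embedding["good"], toy_embedding["terrible"])
print(f"good vs great:    {sim_close:.3f}")
print(f"good vs terrible: {sim_far:.3f}")
```

In this toy setting, words with similar sentiment end up with a similarity close to 1, while words with opposite sentiment get a negative value.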
Another illustrative example is the following: Consider an embedding ϕ and the words “king”, “man”, “woman” and “queen”; then ideally, ϕ(“king”) − ϕ(“man”) + ϕ(“woman”) ≈ ϕ(“queen”). To put this in words, taking the vector for king, subtracting the vector for man and adding the vector for woman should give roughly the vector for queen, who is of course in some ways a “female king”. The semantic relation thus corresponds to an arithmetic relation in the vector space. To better visualise these relationships on the vector level, consider the following illustrations, taken from Google's Machine Learning crash course, that show the vector representations of some words in different semantic contexts, represented for better readability in a three-dimensional space (actual embeddings usually map into much higher-dimensional vector spaces, upwards of a few hundred dimensions).

There are multiple ways to construct such an embedding, i.e. to obtain an optimal vector representation for every word in the text corpus, most of which rely on neural nets. Due to the huge computational cost involved, many users opt for transfer learning. This means taking a pre-trained model and finishing its training on the data set at hand (in an NLP setting, the text corpus), which is more efficient and typically produces more accurate results than training a model architecture “from scratch”. Python users might find this resource, which provides a huge number of pre-trained models and frameworks to continue training, very interesting.

Word embedding is principally a way to take unstructured text data and transform it into a structured form: vectors on which computers can perform operations. One can then take this vectorised representation of the text and apply different methods to gain insight from the written text.
For instance, when performing Sentiment Analysis, one can use a Machine Learning model of choice, typically a neural net, to classify the text according to some metric of sentiment (say, polarity) using the vector representation of the text corpus.

The field of NLP has in recent years become one of the fastest-moving and most researched fields of Data Science. The methods we can employ today have evolved and improved tremendously compared to just a short time ago, and are bound to do so again in the near future. For a data nerd, this is interesting in and of itself; but different actors, especially in the not-for-profit sector, are well advised to keep a close eye on these developments and consider in which ways they can help them in their work.
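For readers who would like to see the simpler, dictionary-based approach from earlier in this post in code, here is a minimal sketch. The mini-dictionary and its scores are made up for illustration (a real dictionary such as SentiWS contains thousands of entries), and averaging the scores of all matched words is just one assumed way of combining them into a polarity score for a text.

```python
import re

# Hypothetical mini-dictionary: word -> polarity score in [-1, 1].
# Real sentiment dictionaries contain thousands of scored entries.
POLARITY = {
    "thank": 0.6,
    "generous": 0.8,
    "hope": 0.5,
    "crisis": -0.7,
    "suffering": -0.9,
}

def polarity_score(text, dictionary=POLARITY):
    """Average the scores of all words found in the dictionary (0.0 if none)."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    scores = [dictionary[w] for w in words if w in dictionary]
    return sum(scores) / len(scores) if scores else 0.0

mailing = "Thank you for your generous support during this crisis."
print(round(polarity_score(mailing), 3))
```

Words that are not in the dictionary are simply ignored here; a production setup would also need to handle inflected forms and negations.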
## If you are interested in the art of |

| Week | Day | Type | Content & Link |
|---|---|---|---|
| 1 | Monday | 📝 Blog | Daydreaming Numbers: bit.ly/2S928qS |
| 1 | Tuesday | 🎬 Video | TED Talk by David McCandless: bit.ly/2HFgBZy |
| 1 | Wednesday | 📝 Blog | Visual Capitalist: bit.ly/3kU0gON |
| 1 | Thursday | 📰 Article | HBR: A Data Scientist's real job: bit.ly/3jcEbL6 |
| 1 | Friday | 🎬 Video | TED Talk by Hans Rosling: bit.ly/30fz19v |
| 2 | Monday | 📰 Article | Narrative Visualization: stanford.io/2HFEK27 |
| 2 | Tuesday | 📝 Blog | Make it conversational: bit.ly/3cIsuJx |
| 2 | Wednesday | 🎬 Video | TED Talk by Tommy McCall: bit.ly/3cC0tn5 |
| 2 | Thursday | 🎨 Gallery | Collection of infographics: bit.ly/3kZt4p5 |
| 2 | Friday | 💻 PPT | Berkeley: Data Journalism: bit.ly/30j7Z1g |
| 3 | Monday | 🎬 Video | Data Storytelling: bit.ly/33arivv |
| 3 | Tuesday | 📝 Blog | Impact on Audience: bit.ly/338dIbQ |
| 3 | Wednesday | 🎨 Gallery | Juice Analytics Gallery: bit.ly/2G6nL8I |
| 3 | Thursday | 📕 Book | Data Journalism Handbook: bit.ly/2S94Hcd |
| 3 | Friday | 🎬 Video | TED Talk by Aaron Koblin: bit.ly/2EFZWDY |
| 4 | Monday | 🎨 Gallery | DataViz Catalogue: bit.ly/34mdy0b |
| 4 | Tuesday | 📝 Blog | Data Visualization Checklist: bit.ly/3cQ0d45 |
| 4 | Wednesday | 🎬 Video | TED Talk by Chris Jordan: bit.ly/3kTaaQT |
| 4 | Thursday | 📝 Blog | Toolbox for Data Storytelling: bit.ly/3mZrd5H |
| 4 | Friday | 🎬 Video | Storytelling with Data: bit.ly/3jd2W9Q |

**Moneyball**

**Released:** 2011

**Big names:** Brad Pitt, Philip Seymour Hoffman

**IMDB Rating:** 76%

**Plot in a nutshell:** The movie is based on the book *Moneyball: The Art of Winning an Unfair Game* by Michael Lewis. Its main protagonist is Billy Beane, who started as General Manager of the baseball club Oakland Athletics in 1997. Beane was confronted with the challenge of building a team with very limited financial resources and introduced predictive modelling and data-driven decision making to assess the performance and potential of players. Beane and his peers were successful and managed to reach the playoffs of Major League Baseball several times in a row.

**Why you should watch this movie:** *Moneyball* highlights the importance of communication skills and persistence for people aiming to drive change using data science.

__The Imitation Game__

**Released:** 2014

**Big names:** Benedict Cumberbatch, Keira Knightley

**IMDB Rating:** 80%

**Plot in a nutshell:** *The Imitation Game* is based upon the real-life story of British mathematician Alan Turing, who is known as the father of modern computer science and for the test named after him. The film is centered around Turing and his team of code-breakers working hard to decipher the Nazi German military encryption Enigma. To crack the code, Turing creates a primitive computer system that can consider permutations at a much faster speed than any human could. The code breakers at Bletchley Park succeeded and thereby not only helped Allied forces ensure victory over the *Wehrmacht* but also contributed to shortening the horrors of the Second World War.

**Why you should watch this movie:** It is a (too) late tribute to Alan Turing. Turing was prosecuted for his homosexuality after WWII and eventually committed suicide. The film is also about the power of machines and ethical perspectives in analytics.

__Margin Call__

**Released:** 2011

**Big names:** Paul Bettany, Stanley Tucci, Demi Moore

**IMDB Rating:** 71%

**Plot in a nutshell:** *Margin Call* is set during the first days of the global financial crisis of 2008. A junior analyst at a large Wall Street investment bank discovers a major flaw in the risk evaluation model of the bank. The story develops during the night as the young employee informs senior managers that the bank is close to a financial disaster, knowing that the bankruptcy of the firm would lead to a dramatic chain reaction in the market, and that millions of lives would be affected.

**Why you should watch this movie:** The film depicts to what extent algorithms dominate decision making in the financial industry. It also portrays the interplay between supposedly objective models and human beings driven by emotions and interests.

__21__

**Released:** 2008

**Big names:** Kate Bosworth, Laurence Fishburne

**IMDB Rating:** 68%

**Plot in a nutshell:** Six students of the renowned Massachusetts Institute of Technology (MIT) get trained in card counting and rip off Las Vegas casinos at various blackjack tables. The film is based upon a true story.

**Why you should watch this movie:** It is an entertaining and fun movie. In addition to that, it contains some interesting mathematical concepts such as the Fibonacci Series and the Monty Hall Problem.

## We hope our tips are valuable for you and that you enjoy some of the flicks. 📺🎬 🍿☕🍷

**Take care and all the best on behalf of joint systems**

Johannes

## Particularly in uncertain times like these, organizations strive to predict the future in the best possible way. We have previously explored several times **how** to forecast future income using the past income trajectory, for instance in these blog posts. We now want to go a step further and investigate the **relationship between fundraising income and the general economic climate**, exploring whether it is possible to improve income forecasting tools by drawing extra information from economic indicators.

**Introduction**

We use the **amount of sporadic donations** to a charitable non-profit organization from the **period of 2000 to 2019** and **three economic data sets from the country the NPO is based in**: the **national unemployment rate**, the **national stock market index** and an **index of economic activity in retail**, authored by the national bureau of statistics. We chose these data sets for their ability to paint a picture of the general economic climate, their relatively easy accessibility down to the monthly level, and the fact that, to the extent that they exhibit seasonality, they do so in the same yearly rhythm as the amount of sporadic donations, which simplifies the statistical analyses. The statistical analyses and models employed in this blog post were all implemented in **Python**, making heavy use of the *statsmodels* **library**.

Each of these data sets is a **time series**, that is, a set of data points collected at discrete, ordered points in time. We have already looked at time series in detail in a past entry on this blog. An important property of time series is **stationarity**, which we discussed here. As in that post, we use the **Dickey-Fuller test** to investigate whether or not our time series are stationary; in Python, the test is available in the *statsmodels* library.

**Dickey-Fuller test**

If a time series is not stationary, **differencing** it, which means subtracting the previous value from the current one, can help in making the series stationary. However, because our time series exhibit strong yearly patterns, in our case it also makes sense to take the time series' **seasonal differences**, subtracting the values from the year before.
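Both kinds of differencing boil down to the same operation with a different lag. The sketch below uses plain Python and a made-up monthly series (a linear trend plus a December peak) to show the effect; the function name `difference` is hypothetical.

```python
def difference(series, lag=1):
    """Subtract the value `lag` steps earlier from each value."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Hypothetical monthly series: linear trend plus a December peak.
monthly = [100 + 2 * t + (50 if t % 12 == 11 else 0) for t in range(36)]

# First differences remove the linear trend but keep the December spikes.
first_diff = difference(monthly, lag=1)
# Seasonal differences (lag 12) remove the yearly pattern; the trend
# turns into a constant.
seasonal_diff = difference(monthly, lag=12)

print(first_diff[:12])
print(seasonal_diff[:5])
```

With a pandas Series, the same result can be obtained with `series.diff(1)` and `series.diff(12)`.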

**Granger causality**

Using **Pearson's Chi-Squared test**, we examine the (now stationary) series' **Granger causality**, which tells us whether data from one series can help in forecasting the future trajectory of another series. The test is applied pairwise to two time series, with the null hypothesis being that the second time series does not Granger-cause the first. We use a function that returns a data frame showing the p-values of the tests investigating whether the column variable Granger-causes the row variable. The first row, indicating (no) Granger causality between the economic data sets on the one hand and the amount of sporadic donations on the other, is especially interesting to us. At a significance level of 0.05, we retain the null hypothesis of no Granger causality between the economic indicators and the sporadic donations for all three time series.

**Granger causality**

**Cointegration**

Next, we examine the time series for **cointegration**. We consider a set of time series **cointegrated** if there exists a linear combination of the time series that is stationary. Cointegrated time series share a common, long-term equilibrium, and we can use them to predict each other's future trajectory using a process called **Vector autoregression (VAR)**. A common test for cointegration is the **Johansen cointegration test**. We define a function that returns a dataframe with the test statistic and the critical values of the Johansen test.

**Johansen cointegration test**

## Vector autoregression (VAR)

Given these test results, constructing a **VAR-model** is not well justified. If we ignore our test results for a moment and do so anyway, we can immediately see that the model falls catastrophically short. The black line in the graphs below shows the actual time series of sporadic donations. Using data until 2016 as our training set, we constructed a **VAR-model** with the Python library *statsmodels* and used it to forecast the sporadic donations from 2017 to 2019 (the red line in the graph), evaluating it against the actual data for these years. As we can see, the model is not able to forecast the amount of sporadic donations very well: it captures the seasonality but fails to accurately predict the trend and overall trajectory.

**ARIMA and ARIMAX**

An alternative is the **ARIMAX**-model, a generalisation of the **ARIMA**-model, which we have used previously in this blog, that also takes into account data from external variables, in our case the economic time series.

**ARIMA** models are composed of three parts: **AR** for autoregression, indicating a regression on the time series' past values, **I** for integration, signifying differencing terms in the case of a non-stationary time series, and **MA** for the moving-average model, a regression on past white noise terms.

**ARIMAX** takes all three of those terms and adds data from external variables (a different time series) to better forecast the time series at hand. Both **ARIMA** and **ARIMAX** are implemented in Python as part of the *statsmodels* library, while the *pmdarima* library comes with an auto_arima function modelled on R's auto.arima function, allowing for a quick search through the possible parameters of the **ARIMA(X)** model.

We have used all four time series to construct an **ARIMAX**-model, using the economic data to help forecast the amount of sporadic donations. Again we used the data until 2016 as our training set, with the data from 2017 to 2019 as a test set to evaluate results. We have also used a standard **ARIMA**-model to construct a forecast for sporadic donations based only on the time series' historical data. Interestingly, the models' projected forecasts did not differ much from each other.

**It thus seems that the economic data we have provided – the unemployment rate, the stock index and a retail index – did not add much extra information to better forecast the amount of sporadic donations**. Upon closer inspection, the **ARIMA**-model, relying solely on historical data from the time series itself, even performed slightly better. In light of the previous test results (the lack of Granger causality and cointegration), this suggests that those economic indicators have little measurable effect on the development of sporadic donations, and thus cannot be used to improve forecasting models for the amount of sporadic donations in the future.

**However, we believe that the approach of using time series analysis for fundraising income predictions hand in hand with open (economic) data deserves further focus as other base data, time frames, data sources etc. might lead to different results!**

**COVID-19 impacts on fundraising - the case of Face-to-Face**

The ongoing crisis caused by the Corona pandemic has brought huge challenges for many people all over the globe and dislocation in all types of industries. The evident impacts, and maybe the ones yet to come, pose serious threats to numerous fundraising nonprofit organizations. The pandemic has significantly affected the conditions under which widespread fundraising channels can be used. Considering lockdowns all over the world leading to drastically reduced mobility, Corona has most obviously affected **Face-to-Face fundraising (F2F)**. Since the introduction of its contemporary form in the 1990s (by the way in Austria, where the askyourdata-team is based), F2F has become an **enormously important channel for many charities**, particularly for the acquisition of regular supporters.

In the majority of countries affected by COVID-19, people were not completely forced to stay inside but allowed to move for certain purposes (work, groceries, walking etc.). One can get an idea of the impact on people's mobility using the **currently publicly available mobile data from Google**. You can go ahead and download a flat file to play with on this website. We obtained the global dataset and put together the following **dashboard**, which we invite you to have a closer look at. Just click the two little arrows in the bottom right corner of the dashboard or follow this link.

If you are looking for an **insightful situation report** on the **state of Face-to-Face fundraising in times of Corona** from a global perspective, we can recommend the **recording of a recent panel discussion hosted by** *The Resource Alliance*. In short, F2F teams all across the world have already proved their adaptiveness in many ways...

**What Now?**

I cannot tell how many times I have recently come across quotes talking about the opportunities that lie in crises. In many cases, they were mere platitudes. At the same time, I deeply believe that the world will gradually get closer to how it was before Corona. This will be reflected by people sitting in cafes after some relaxed high-street shopping, enjoying the sun ... everything completely mask-free.

**Will Face-to-Face fundraising be exactly the same then?**

Let us try to start dealing with this question with an analogy. COVID seems to have changed almost everything in our lives - but **the world keeps turning for the good and the bad**. This means, for instance, that Climate Change will not pause just because we are busy with another crisis. The same applies, in a more positive way, to the ongoing digital revolution as well as the expansion of analytics and data science across all types of industries. From my point of view, **F2F fundraising has been keeping pace with technological developments quite well in the recent past**. The chances of coming across F2F agents using tablets, simple and customer-oriented processes, instant messaging services etc. are quite high in many countries. Our hypothesis is, however, that there is scope for even further innovations ...

**The Power of Where: Using Location Intelligence in F2F?**

This nice newspaper article illustrates examples of how companies use geolocation data to target their (potential) customers. One of our favourite blogs, *Towards Data Science*, has summarised the Power of Where and goes as far as to postulate that location analytics will change the world. Location analytics also has the potential to make contributions during and after this pandemic, as outlined in this recent article by the platform Carto.

Seen from a practical perspective, what might use cases of Location Intelligence be in F2F fundraising? Many **mobile network providers** across the globe have started offering services in the context of **Mobile Location Analytics**, as US provider Verizon calls it. These services are typically not as prominently advertised as other products and tools - but they are there. What might "Mobile Location Analytics" mean? Well, in a retail context, interesting "research questions" might be:

- How do mobile users get to brick-and-mortar stores?
- Where do they come from and where do they go subsequently?
- Which locations do they frequently use?
- ...

Seizing this idea, **would it not be interesting to know who is moving when across the (high) street or the shopping center where the next large-scale F2F campaign will take place?** Of course, nonprofits pursuing the use of such services have to have **awareness of data protection and privacy** (although this is what the networks have to take care of) and have their donor communication prepared.

**Admittedly, we are raising a somewhat ambiguous and maybe even controversial approach as potential add-on to professionalized Face-to-Face fundraising. What is your opinion?**

## In the modern economy, coming up with forecasts as best possible predictions of future income has become an imperative across industries. The charitable nonprofit sector has also seen an increasing adoption of forecasting methods in recent years. This is why we already dealt with this topic on this blog some time ago. The endeavour of forecasting is challenging enough in times of economic stability but seems almost impossible after the advent of a "black swan" like the Corona virus. Of course, the future is just as uncertain as Ilya Prigogine said. At the same time, there is a family of statistical models which not only provides "well-informed" income predictions but particularly helps in finding out to what extent current data deviates from the "expected normal". Let us therefore take a closer look at **fundraising income forecasting using time series.**

**The basics**

In their book Financial Management for Nonprofit Organizations from 2018 (by the way a recommendable read), Zietlow et al. differentiate between different forecasting types:

A **causal model** is one in which an analyst or data scientist has defined one (simple regression) or several (multiple regression) cause factors for the dependent variable she or he is trying to predict. In the case of income prediction on an aggregated level, drivers (i.e. independent variables) might be found both on the level of donors and among exogenous factors.

**Time series models** work differently and tend to be more complex. They essentially use historical data to come up with predictions of future outcomes. Time series are widely used for so-called non-stationary data. A stationary time series is one whose properties are independent of the point on the timeline at which its data is observed. From a statistical standpoint, a time series is stationary if its mean, variance, and autocovariance are time-invariant. The requirement of stationarity (i.e. stability) makes intuitive sense, as time series models use previous lags to model developments, and modelling stable series with consistent properties implies lower overall uncertainty.

**Time Series in a Nutshell**

Time series are often comprised of the following components, although not all time series will necessarily have all or any of them.

- **Trend:** The time series contains a trend when there is a long-term increase or decrease in the data. This trend does not have to be linear.
- **Seasonal:** A *seasonal* pattern exists when a time series is affected by seasonal factors such as month(s) of a year or certain weekdays.
- **Cycle:** A cycle is present when the data shows rises and falls that do not occur at a fixed frequency. The respective fluctuations are often related to economic conditions.
- **Noise:** Remaining random variation in the data.
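To give a feeling for how these components can be separated, here is a minimal sketch of a classical additive decomposition in plain Python. It is an illustration under simplifying assumptions (an odd seasonal period, a made-up series with a linear trend and a spike every seventh step), not the procedure used for the charts in this post; the *statsmodels* library offers a ready-made `seasonal_decompose` function for real data.

```python
def centered_moving_average(series, window):
    """Centered moving average for an odd window; edges are left as None."""
    half = window // 2
    out = [None] * len(series)
    for i in range(half, len(series) - half):
        out[i] = sum(series[i - half:i + half + 1]) / window
    return out

def seasonal_means(detrended, period):
    """Average detrended value for each position within the period."""
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(detrended):
        if value is not None:
            buckets[i % period].append(value)
    return [sum(b) / len(b) for b in buckets]

# Hypothetical series: upward trend plus a spike every 7th step.
period = 7
series = [10 + 0.5 * t + (5 if t % period == 6 else 0) for t in range(56)]

# Trend: smooth over one full period so the seasonal pattern averages out.
trend = centered_moving_average(series, window=period)
# Seasonal: what remains after removing the trend, averaged per position.
detrended = [s - m if m is not None else None for s, m in zip(series, trend)]
seasonal = seasonal_means(detrended, period)
# Noise: whatever is left after removing trend and seasonal component.
noise = [d - seasonal[i % period] if d is not None else None
         for i, d in enumerate(detrended)]

print([round(v, 2) for v in seasonal])
```

Because the demo series contains no random variation, the noise component is (numerically) zero here; with real donation data it would hold everything that trend and seasonality cannot explain.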

The model we will take a closer look at for the prediction of fundraising income is ARIMA, which stands for Auto-Regressive Integrated Moving Average. ARIMA models can be seen as the most generic approach to modelling time series. Whether the ARIMA algorithm can be applied to the historic income data right away can be evaluated by statistical methods such as the Dickey-Fuller test. The null hypothesis of the Dickey-Fuller test assumes that the series is non-stationary. Even if this hypothesis cannot be rejected, which means that the data under scrutiny is non-stationary, there are ways to make time series usable. In this case, differencing and log-transformations of the data can be applied in a preparatory step.

**Applied Example**

The raw data used for the following example is really straightforward. Historic fundraising income from 2012 to 2019 is extracted in a simple structure (Donor ID, payment date, amount). In a next step, datewise aggregation is applied. Having distinct dates in the dataset allows using both a weekly and a monthly perspective on the data. The data is then decomposed into its components. The resulting charts look as follows:

The first chart on top shows the observed data (*No. donations*). The second chart from above shows an overall upward trend in the data.

For research purposes, we decided to use the data from 2012 to 2018 as "training set" and use ARIMA to generate a prediction for the already closed year 2019. We did that both on the level of accumulated donation counts and donation sum per week. The week-based prediction for the donations looks like this, with the **blue line being the prediction and the red line being the actual data:**

**The chart above shows that the blue predicted line generally runs close to the red line representing the actual data.** The actual data also oscillates within the 80% and 95% prediction intervals of the forecast. This is what the actual and predicted data look like for accumulated income over the weeks of 2019.

**Conclusion**

**To what extent can a time series approach inform fundraising planning and decision making in times of a highly dynamic environment shaped by the Corona pandemic?** Well, it is still quite unclear how the global economy and different countries will develop in the near future. It also remains to be seen how the Corona crisis will affect fundraising markets in the mid and long run.

**In essence, time series can be an interesting approach to come up with a sophisticated analysis on the extent of the deviation from normal income level**, most probably caused by Corona in a direct manner (e.g. face to face fundraising currently stopped) or indirect effects (e.g. rising unemployment) ...

**One must have a tough mind - and a soft heart.**