Clustering means grouping similar things together. In this month’s we take a closer look at how the so called kmeans algorithm as a clustering method might help to develop an even deeper understanding of established donor segments. Many fundraising organizations segment their donor bases to enable targetgroup oriented communication and appeals. Working with segments also allows the development of more differentiated analyses on group migrations and income development. The example organization whose data we are playing around with has an upandrunning segmentation model that evaluates the payment behaviour of donors in terms of Recency (when was the last payment), Frequency (how often payments are made) and Monetary Value (sum of donations). This RFM method is commonly used in databaserelated direct marketing, both in the profit and nonprofit sector. We take a closer look at the best donor group in the following. The segmentation model already tells us that the group under scrutiny consists of “good donors” in terms of payment behaviour. In this regard, the segment as such can be seen as a homogenous group. Can a clustering algorithm like kmeans generate additional insights regarding the “inner structure” of this group? Kmeans is an unsupervised machine learning algorithm. Unsupervised learning data comes without predefined labels, i.e. any kind of classifications or categorizations are not included. The goal of unsupervised learning algorithms is to discover hidden structures in data. The basic idea of kmeans is to assign a number of records (n) into a certain number of similar clusters (k). The number of clusters k is either predefined or can be jointly defined by the respective analyst / fundraising manager. The following walktrough was inspired by Kimberly Coffey’s blogpost on kmeans Clustering for Customer Segmentation. It is a highly recommended read and can be found here. Data extract and preprocessing Our example organization has an established RFMbased segmentation model that yields 4 core groups. We defined the “best” of those groups to be subject for a kmeans clustering attempt. The dataset we extract is straightforward in the first place as it contains the unique identifier, the Recency measured as number of days between December 31st, 2017 and the day of the last donation per person. Frequency reflects the number of payments for the respective record in 2017 whereas Monetary Value shows the donation sum. Kmeans clustering requires continuous variables and it works best with (relatively) normallydistributed, standardized input variables. We therefore apply a logarithm (log) to the variables and standardize them to avoid positive skew. Standardizing variables means rescaling them to have a mean of zero and a standard deviation of one, i.e. aligning them to the standard normal distribution. The following code snippet illustrates the loading and transformation of the data. Code Snippet 1: Load packages and data, then transfrom data
This is how our dataset looks like the aforementioned transformation. Columns 5 to 7 contain the logs of Recency, Frequency and Monetary Value, whereas the standardized (z)values are in columns 8 to 10. A quick exploratory view To intially dive into the data, we plot the logtransformed Monetary Value as well as logtransformed Frequency of donations and use the logtransformed Recency for colouring by using the code in Code Snippet 2. Code Snippet 2: Exploratory plot of RFM variables
What is striking is the high general density of observations on the one hand (which is due to the large amount of data) and the different shades of blue that reflect certain heterogeneity in terms of Recency. Running the Kmeans Algorithm We now turn to running the kmeans algorithm. The following code contains a loop that runs for a number of j clusters (in this example 10). It writes the cluster membership on donor level back into the dataset, creates twodimensional plots (see example for 3 and 7 clusters below) and collects the model information. Code Snippet 3: Kmeans clustering
These are two output examples of the code using 3 and 7 clusters. So how do we choose the “optimal” number of clusters now? The graphs below both aim for the detection of the number of clusters beyond which adding a cluster adds only little additional explanatory power. In other words we look for bends in the socalled elbow chart. In our example it looks as if this would be at 4. The same decision could have also been made or at least influenced by a business decision regarding the "feasible" number of clusters. Adding additional clusters always adds explanatory power, however, in practice 4 groups are easier to handle (e.g. in the context of a direct marketing test) than 10 or more clusters. Results and interpretation Let us now take a closer look at the results. Clik on the picture on the left to get to an interactive 3dgraph of the 4cluster solution for which the Rcode can be found below. The 4cluster solution yields 4 ellipsoids aiming to reflect the areas with high observation densities for the clusters. These ellispoids should contribute to the ease of reading the graph, the actual observations are still represented by differently coloured dots just like in the 2dimensional plot we used for exploration. The three "upper clusters" in the picture share a comparable level of Monetary Value and Recency. The dark blue ellispoid stand out of the three as it reflects higher Frequeny. The lower ellipsoid reflects observations that rank relatively low on all of the three RFM variables (remember, the higher the recency, the "worse"  knowing that we are working with a dataset of good donors). The video below contains a fixedaxis rotation. Code Snippet 4: 3d graph
Conclusion
So, what can we conclude from our clustering attempt? Kmeans is a widelyused and straightforward algorithm that can be applied relatively easily in practice. It is, however, worthwhile to dive into the underlying concepts of the algorithm and consider the related diagnostics (variance explianed, "withins"). Due to the derived possibilites in terms of data visualization, results can be directly communicated to fundraising decision makers. These decision makers should be involved in the process at an early stage. Although there are "objective" measures for the numbers of clusters, applicationoriented considerations (e.g. for further analyses of test designs) should not be left out. We hope you liked this month's post and wish you a nice beginning of autumn. Do not hesitate to share, comment, recommend etc. Read / hear / see you soon!
0 Kommentare
Hinterlasse eine Antwort. 
Categories
Alle
Archive
April 2019
