Big data has become a big and, as far as I can tell, somewhat controversial term. Google returns 325,000,000 search results for "big data", with some 300,000 search requests per month according to Keywords Everywhere. This is far more than one gets when searching for Internet of Things (a term with high buzzword potential, 222,000,000 results) or data science (75,600,000 hits).
Some years ago, Google contributed to the big data hype by launching Google Flu Trends (GFT). The aim of GFT was simple: by targeting specific terms within the millions of Google search requests, aggregating them and plugging them into a linear model, it should be possible to monitor and predict the spread of the flu in various countries. The underlying idea was that modern means of gathering and storing data make statistical sampling, i.e. obtaining a representative sample to work with, obsolete because it is possible to observe the population as a whole. Two articles I can recommend in this regard, one from the Financial Times and another from Wired, discuss the fallacies that eventually stopped Google from continuing the project. The gist of the methodological criticism was that, apart from weaknesses in the model Google had applied, the approach was essentially "theory-free", i.e. it neglected any assumptions about causality and simply focused on recognizing patterns in the incredible amounts of data available. A large number of "false positives" might be a consequence of that, which is particularly interesting in a marketing and fundraising context. I am quite sure you have received (digital) advertising that made you think: OK, it is not so far-fetched to send me that, but why me of all people (wedding products, babywear, health problems from sleeping disorders to incontinence)?
I think that small and medium-sized companies in particular, in industries where the amounts of data flowing through systems are not as large as for retailers or digital businesses, should definitely deal with big data, but in a pragmatic and down-to-earth manner. Their focus should be on how to use the data that is already available or obtainable at low cost and with minimal side effects. I find Andrew Joss's article on how to gradually build a 360-degree view of the customer highly interesting in this regard. One of the main conclusions I drew from it is that it might be worthwhile to take a close look at which (unstructured) data is actually already available and to develop ways to analyze it.
A handful of terms starting with the letter V are often referred to in the big data debate. They are far from being a "recipe" for succeeding with big data but might help you reflect on and discuss big data issues.
The main characteristic that makes data big is its sheer volume. The amount of data created daily is incredibly high and continuously growing. There are estimates that 2.5 quintillion bytes are created every day, which is said to be enough to fill 10 million Blu-ray discs. These volumes have been growing, and continue to grow, in small and medium-sized companies across industries as well.
Data comes in different types, which is what the second V, variety, refers to. A basic way to classify data is to differentiate between structured data, i.e. highly organized data (names and addresses, payment transactions or quantitative survey results like Net Promoter Scores), and its opposite, unstructured data. Unstructured data is mostly qualitative, such as interactional data from social media (likes, shares, comments, tweets), stored customer contacts or survey feedback given as free text. Structured and unstructured data differ fundamentally when it comes to data processing and analysis.
Veracity, our third V-word, reflects the overall "trustworthiness" of data. Aspects to be considered in the light of the respective data source are authenticity, completeness, reproducibility and, last but not least, the reliability of underlying models and assumptions, particularly when it comes to things like scores or indirectly measured socio-demographics like income.
The penultimate V, velocity, represents what it takes to process big data with IT systems. It is where processing and response times as well as overall system performance come into play. Aiming to work with larger amounts of data and/or more complex data definitely sets a minimum baseline for the underlying IT infrastructure.
The final V, value, asks the "so what?" question. The search for patterns and correlations is an essential and legitimate technique, but it should not be practiced for its own sake, lest one end up with spurious correlations. It is questionable whether the mere abundance of data makes up for ignoring theory or for flawed overall reasoning.
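How easily spurious correlations arise can be illustrated with a small simulation. The following is only a sketch under my own assumptions, not anything taken from Google Flu Trends: the function names, the sample size of 20 observations and the numbers of candidate variables are made up for illustration. It generates pure random noise and reports the strongest correlation found between a target series and a pool of unrelated candidate series.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_candidates, n_observations, seed=0):
    """Largest absolute correlation between a random target series and
    n_candidates random series that are unrelated to it by construction."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# Hypothetical sizes: the more variables we screen, the stronger
# the best "pattern" looks, although every series is pure noise.
few = strongest_chance_correlation(n_candidates=5, n_observations=20)
many = strongest_chance_correlation(n_candidates=5000, n_observations=20)
print(f"best |r| among 5 candidates:    {few:.2f}")
print(f"best |r| among 5000 candidates: {many:.2f}")
```

None of the series has anything to do with the target, yet screening thousands of them reliably turns up a seemingly strong correlation. This is the situation a "theory-free" search for patterns can run into.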
Where does this all take us?
Technology will lead the way. Continuous progress in the area of data storage and processing capacities, paired with the ubiquity of devices that collect data at points of sale, in homes and even on bodies, will turn big data into an even bigger topic. However, big data success stories might well begin with small first steps like asking yourself: "What does big data mean for me and for us?"