Exploratory Data Analysis
It doesn't take much to trigger me into a rant about the weaknesses of reports on data and "dashboards" purporting to be "analytics" or "business intelligence". Lots of pie charts and line graphs with added bling are as the proverbial red rag to a bull.
Until recently my response was to demand more rigorous statistics: hypothesis testing, confidence limits, tests for reverse causality (while recognising that causality is a slippery concept in complex systems). Having recently spent some time thinking about using data analysis to gain actionable insights, particularly in the setting of an educational institution, it has become clear to me that this response is too shallow. It embeds an assumption of a linear process: ask a question, operationalise it in terms of data and statistics, and crunch some numbers. As my previous post indicates, I don't suppose all questions are approachable in this way. Actually, thinking back to the ways I've done a little text and data mining in the past, it wasn't quite like this either.
The label "exploratory data analysis" captures the antithesis to the linear process. It was popularised in statistical circles by John W Tukey in the early 1960s, and he later used it as the title of a highly influential book. Tukey was trying to challenge a statistical community that was very focused on hypothesis testing and other forms of "confirmatory data analysis". He argued that statisticians should do both, approaching data with flexibility and an open frame of mind, and he saw a well-stocked toolkit of graphical methods as essential for exploration (Tukey was responsible for inventing a number of plot types that are now widely used).
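To give a flavour of the kind of graphical tool Tukey had in mind, here is a minimal sketch of a stem-and-leaf display, one of the plot types usually credited to him. It is only an illustration under stated assumptions: the function name `stem_and_leaf` and the marks are made up for this post, and the sketch only handles non-negative whole numbers, splitting each value into a tens "stem" and a units "leaf".

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a simple stem-and-leaf display.

    Hypothetical helper for illustration only; it assumes a non-empty
    collection of non-negative integers.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens go in the stem, units in the leaf
        leaves[stem].append(leaf)

    lines = []
    # Walk every stem in the range so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines


# Made-up marks, purely for illustration.
marks = [42, 47, 51, 55, 55, 58, 63, 64, 68, 89]
print("\n".join(stem_and_leaf(marks)))
```

The appeal for exploration is that the display keeps every individual value while still showing the overall shape: the cluster in the fifties and sixties, the empty seventies and the lone high mark are visible at a glance, before any question has been framed as a hypothesis.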
Tukey read a paper entitled "The Technical Tools of Statistics" at the 125th Anniversary Meeting of the American Statistical Association in 1964. It anticipated the development of computational tools (e.g. R and RapidMiner), is well worth a read, and has timeless gems like:
"Some of my friends felt that I should be very explicit in warning you of how much time and money can be wasted on computing, how much clarity and insight can be lost in great stacks of computer output. In fact, I ask you to remember only two points:
- The tool that is so dull that you cannot cut yourself on it is not likely to be sharp enough to be either useful or helpful.
- Most uses of the classical tools of statistics have been, are, and will be, made by those who know not what they do."
There is a correspondence between the open-minded and flexible approach to exploratory data analysis that Tukey advocated and the Grounded Theory (GT) Method of the social sciences. To me, as a non-social scientist, GT seems to be trying a bit too hard to be a Methodology (academic disputes and all), but its premise is appealing: use both inductive and deductive reasoning, and go into a research question free of the prejudice of a hypothesis that you intend to test (prove? how often is data analysed to find a justification for a prejudice?).
Although GT is really focussed on qualitative research, some of the practical methods that the GT originators and practitioners have proposed might be applicable to data captured in IT systems and useful to practitioners of analytics. I quite like the dictum of "no talk" (see the Wikipedia entry for an explanation).
My take-home, then, is something like this: if we are serious about analytics, we need to be thinking about both exploratory data analysis and confirmatory data analysis, and the label "analytics" is certainly inappropriate if neither is occurring. For exploratory data analysis we need visualisation tools, an open mind and an inquisitive nature.