The question of whether or not something works is a basic one to ask when investing time and money in changing practice and in using new technologies. For learning analytics, the moral dimension adds a certain imperative, although there is much that we do by tradition in teaching and learning in spite of questions about efficacy.
I believe the move to large-scale adoption of learning analytics, with the attendant rise in institution-level decisions, should motivate us to spend some time thinking about how concepts such as validity and reliability apply in this practical setting. The motivation is two-fold: large-scale adoption has “upped the stakes”, and non-experts are now involved in decision-making. This article is a brief look at some of the issues with where we are now, and some of the potential pitfalls going forward.
There is, of course, a great deal of literature on the topics of validity (and reliability), and various disciplines have their own conceptualisations, sometimes with multiple kinds of validity. The Wikipedia disambiguation page for “validity” illustrates the variety, and the disambiguation page for “validation” adds further to it.
For the purpose of this article I would like to avoid choosing one of these technical definitions, because it is important to preserve some variety; the Oxford English Dictionary definition will be assumed: “the quality of being logically or factually sound; soundness or cogency”. Before looking at some issues, it might be helpful to first clarify the distinction between “reliability” and “validity” in statistical parlance (diagram, below).
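The diagram’s distinction can also be put numerically. The following sketch simulates two hypothetical “measuring instruments” against a known true value (all figures here are invented for illustration): one is reliable but invalid (consistent readings, systematically off target), the other valid but unreliable (right on average, widely scattered).

```python
import random
import statistics

# Hypothetical illustration only: repeated "measurements" of a true value.
rng = random.Random(1)
true_value = 100

# Reliable but invalid: tight cluster, systematically 10 units too high.
reliable_but_biased = [rng.gauss(110, 1) for _ in range(1000)]

# Valid but unreliable: centred on the truth, but widely scattered.
unbiased_but_noisy = [rng.gauss(100, 15) for _ in range(1000)]

for name, readings in [("reliable-but-invalid", reliable_but_biased),
                       ("valid-but-unreliable", unbiased_but_noisy)]:
    bias = statistics.mean(readings) - true_value  # validity ~ low bias
    spread = statistics.stdev(readings)            # reliability ~ low spread
    print(name, "bias:", round(bias, 1), "spread:", round(spread, 1))
```

The point of the sketch is that the two properties fail independently: a measure can be highly repeatable and still systematically wrong, and vice versa.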
Technical terminology may mislead
The distinction between reliability and validity in statistics leads us straight to the issue that terms may be used with very specific technical meanings that are not appreciated by a community of non-experts, who might be making decisions about what kind of learning analytics approach to adopt. This is particularly likely where terms with everyday meanings are used, but even when technical terms have no everyday counterparts, non-expert users will often employ them without recognising their misunderstanding.
Getting validity (and reliability) universally on the agenda
Taking a fairly hard-edged view of “validation”, as applied to predictive models, a good start would be to see it universally adopted, following established best practice in statistics and data mining. The educational data mining research community is very hot on this topic, but the wider community of contributors to learning analytics scholarship is not always so focused. More than this, it should be on the agenda of the non-researcher to ask questions about the results and the method, and to understand whether “85% correct classification judged by stratified 10-fold cross-validation” is an appropriate answer or not.
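To make that quoted phrase less opaque for the non-researcher: “stratified k-fold” simply means the data is split into k held-out test sets, each preserving the overall class balance. In practice one would use an established library (e.g. scikit-learn’s StratifiedKFold), but the stratification step itself can be sketched in a few lines of plain Python:

```python
import random
from collections import Counter, defaultdict

def stratified_k_folds(labels, k=10, seed=0):
    """Split example indices into k folds that each preserve the overall
    class proportions -- a minimal sketch of the stratification step in
    stratified k-fold cross-validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)              # randomise order within each class
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)      # deal indices out round-robin
    return folds

# Toy data: 80 completers ("C") and 20 non-completers ("N").
labels = ["C"] * 80 + ["N"] * 20
folds = stratified_k_folds(labels, k=10)

# Every held-out fold mirrors the 80/20 class balance.
for fold in folds:
    assert Counter(labels[i] for i in fold) == Counter({"C": 8, "N": 2})
```

The model is then trained ten times, each time testing on one fold and training on the other nine, and the reported accuracy is the average over the ten test folds.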
Predictive accuracy is not enough
When predictive models are being described, it is common to hear statements such as “our model predicted student completion with 75% accuracy.” Even assuming that this accuracy was obtained using best-practice methods, it glosses over two kinds of fact that we should seek in any prediction (or classification, generally), but which are too often neglected:
- How does that figure compare to a random selection? If 80% of students completed, then a guesser matching that base rate would be right 0.8 × 0.8 + 0.2 × 0.2 = 68% of the time, so the prediction is little better than picking coloured balls from a bag. The “kappa statistic” gives a measure of performance that takes account of improvement over chance, but it doesn’t have such an intuitive feel.
- Of the incorrect predictions, how many were false positives and how many were false negatives? How much we value making each kind of mistake will depend on social values and what we do with the prediction. What is a sensible burden of proof when death is the penalty?
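Both points can be made concrete with the figures from the text (80% completion base rate, 75% reported accuracy); the confusion-matrix counts below are invented for the sketch, but are consistent with those figures:

```python
def chance_accuracy(p_true_positive, p_predict_positive):
    """Expected accuracy of a guesser that predicts the positive class
    with a fixed probability, independent of the true label."""
    return (p_true_positive * p_predict_positive
            + (1 - p_true_positive) * (1 - p_predict_positive))

def cohen_kappa(observed, chance):
    """Chance-corrected agreement: 0 = no better than chance, 1 = perfect."""
    return (observed - chance) / (1 - chance)

# 80% of students complete; a guesser matching that base rate:
chance = chance_accuracy(0.80, 0.80)
print(round(chance, 2))                      # 0.68 -- the "coloured balls" baseline

# A model reporting 75% accuracy therefore scores only a modest kappa:
print(round(cohen_kappa(0.75, chance), 3))   # 0.219

# Invented confusion counts for 200 students, consistent with those figures.
# Positive class = "completes".
tp, fn = 130, 30   # completers predicted correctly / wrongly flagged as at risk
tn, fp = 20, 20    # non-completers caught / non-completers missed entirely
assert (tp + tn) / 200 == 0.75  # same headline accuracy, but two distinct
                                # kinds of mistake (30 vs 20) whose costs
                                # are rarely equal in practice
```

A different model could report the same 75% accuracy with the 50 errors distributed quite differently, which is exactly why the headline figure alone is not enough.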
Widening the conception of validity beyond the technical
One of the issues faced in learning analytics is that the paradigm and language of statistics and data mining could dominate the conceptualisation of validity. The focus on experiment and objective research that is present in most of the technical uses of “validity” should have a counterpart in the situation of learning analytics in practice.
This counterpart has an epistemological flavour, in part, and requires us to ask whether a community of practice would recognise something as being a fact. It includes the idea that professional practice utilising learning analytics would have to recognise an analytics-based statement as being relevant. An extreme example of the significance of social context to what is considered valid (fact) is the difference in beliefs between religious and non-religious communities.
For learning analytics, it is entirely possible to have some kind of prediction that is highly statistically-significant and scores highly in all objective measures of performance but is still irrelevant to practice, or which produces predictions that it would be unethical to use (or share with the subjects), etc.
Did it really work?
So, let’s say all of the above challenges have been met. Did the adoption of a given learning analytics process or tool make a desirable difference?
This is a tough one. The difficulty in answering such questions in an educational setting is considerable, and has led to the recent development of new research methodologies such as Design-based Research, which is gaining adoption in educational technology circles (see, for example, Anderson and Shattuck in Educational Researcher vol. 41, no. 1).
It is not realistic to expect robust evidence in all cases, but in the absence of robust evidence we also need to be aware of proxies such as “based on the system used by [celebrated university X]”.
We also need to be aware of the pitfall of the quest for simplicity in determining whether something worked; an excessive focus on objective benefits neglects much of value. Benefits may be intangible, indirect, found other than where expected, or out of line with political or business rhetoric. As the well-worn aphorism has it, “not everything that counts can be counted, and not everything that can be counted counts.” It does not follow, for example, that improved attainment is either a necessary or sufficient guide to whether a learning analytics system is a sound (valid) proposition.
Does it generalise? (external validity)
As we try to move from locally-contextualised research and pilots towards broadly adoptable patterns, it becomes essential to know the extent to which an idea will translate. In particular, we should be interested to know which of the attributes of the original context are significant, in order to estimate its transferability to other contexts.
This thought opens up a number of possibilities:
- It will sometimes be useful to make separate statements about validity or fitness for purpose for a method and for the statistical models it might produce. e.g. is the predictive model transferable, or the method by which it is discovered?
- It may be that learning analytics initiatives that are founded on some theorisation about cause and effect, and which somehow test that theorisation, are more easily judged in other contexts.
- As the saying goes, “one swallow does not a summer make” (Aristotle, but still in use!), so we should gather evidence (assess validity and share the assessment) as an initially-successful initiative migrates to other establishments and as time passes.
Transparency is desirable but challenging
The above points have been leading us in the direction of the need to share data about the effect of learning analytics initiatives. The reality of answering questions about what is effective is non-trivial, and the conclusions are likely to be hedged with multiple provisos, open to doubt, requiring revision, etc.
To some extent, the practices of good scholarship can address this issue. How easy is this for vendors of solutions (by this I mean other than general purpose analytics tools)? It certainly implies a humility not often attributed to the sales person.
Even within the current conventions of scholarship we face the difficulty that the data used in a study of effectiveness is rarely available for others to analyse, possibly asking different questions, making different assumptions, or for meta-analysis. This is the realm of reproducible research (see, for example, reproducibleresearch.net) and subject to numerous challenges at all levels, from ethics and business sensitivities down to practicalities of knowing what someone else’s data really means and the additional effort required to make data available to others. The practice of reproducible research is generally still a challenge but these issues take on an extra dimension when we are considering “live” learning analytics initiatives in use at scale, in educational establishments competing for funds and reputation.
To address this issue will require some careful thought to imagine solutions that side-step the immovable challenges.
Conclusion… so what?
In conclusion, I suggest that we (a wide community including research, innovation, and adoption) should engage in a discourse, in the context of learning analytics, around:
- What do we mean by validity?
- How can we practically assess validity, particularly in ways that are comparable?
- How should we communicate these assessments to be meaningful for, and perceived as relevant by, non-experts, and how should we develop a wider literacy to this end?
This post is a personal view, incomplete and lacking academic rigour, but my bottom line is that learning analytics undertaken without validity being accounted for would be ethically questionable, and I think we are not yet where we need to get to… what do you think?
Target image is “Reliability and validity” by Nevit Dilmen. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:Reliability_and_validity.svg