Data assumptions can be deluding
Here’s a slightly critical add-on to my previous post about Learning Analytics and EDM. This year’s Learning and Knowledge Analytics course (LAK12), once again, brings up some highly valuable perspectives and opportunities for developing new insights, models, and harvesting possibilities for learning in general. However, this should not stop us from being aware of delusions and assumptions that are somehow orphaned in the ongoing discussions. In particular, I want to mention three potential pitfalls that are perhaps too much taken for granted:
- data clenliness
- meaningful traces
When developers talk about their products, everything looks shiny and wonderful. All the examples shown work smoothly and give meaningful results. This makes me pause, for while the ideation of new analytics technology is a wonderful thing and anticipated with much enthusiasm, the data isn’t always as clean as it is presented to be in the theoretical context. Most, if not all data processes have to undergo a clensing process to get rid of datasets that are “contaminated”. A good example is the typical “teacher test” content found in virtually every VLE database. It’s not always clearly indicated as “test”either, so in many cases extensive manual processes of eliminating nonsensical data have to be conducted. It should, therefore, be standard practice to report on how much data was actually “thrown away” and on what basis. Not that this would discredit in any way the usefulness of the remaining dataset, but indicate the amount of automated or manual selection that has gone into it.
This necessarily leads to questioning the weighting of data. By which mechanism are some datasets selected as being meaningful and at which priority level over others. Very often, the rational behind the selection of variables is not exposed neither is the priority relationship to other variables in the same dataset. Still it must be transparent whether e.g. the timing, the duration, or the location of an event is given more weight when predicting a required pedagogic intervention (or not). After all, a young person’s future may depend on it.
From the above two limitations results a third, that concerns the question of what is a meaningful trace a user leaves on the system. We know users leave data behind when conducting an electronic activity. These can be sequenced by time, but it is by far not clear where the useful cut-off points of a sequence or ‘trace’ are. Say you had a string of log data A-B-C-D-E-F-G-H. Does it make more sense to assume BCD constituting meaning or would CDEF perhaps be better – and why would it be better?
I realise that these questions could be interpreted as destructive criticism, but we have one other possibility, which is to just take the results conjured up in a black box at face value and see if they look plausible no matter how they were derived. This we could call the Google approach.