Data assumptions can be deluding

Here’s a slightly critical add-on to my previous post about Learning Analytics and EDM. This year’s Learning and Knowledge Analytics course (LAK12), once again, brings up some highly valuable perspectives and opportunities for developing new insights, models, and harvesting possibilities for learning in general. However, this should not stop us from being aware of delusions and assumptions that are somehow orphaned in the ongoing discussions. In particular, I want to mention three potential pitfalls that are perhaps too much taken for granted:

  • data cleanliness
  • weighting
  • meaningful traces

When developers talk about their products, everything looks shiny and wonderful. All the examples shown work smoothly and give meaningful results. This makes me pause, for while the ideation of new analytics technology is a wonderful thing and anticipated with much enthusiasm, the data isn’t always as clean as it is presented to be in the theoretical context. Most, if not all, data processes have to undergo a cleansing process to get rid of datasets that are “contaminated”. A good example is the typical “teacher test” content found in virtually every VLE database. It’s not always clearly indicated as “test” either, so in many cases extensive manual processes of eliminating nonsensical data have to be conducted. It should, therefore, be standard practice to report on how much data was actually “thrown away” and on what basis. Not that this would in any way discredit the usefulness of the remaining dataset, but it would indicate the amount of automated or manual selection that has gone into it.
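
To illustrate what such reporting could look like, here is a minimal sketch in Python. It is not taken from any particular product: the record fields (`role`, `title`), the two exclusion rules, and the names `CleaningReport` and `clean_with_report` are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CleaningReport:
    total: int
    kept: int
    dropped_by_reason: dict = field(default_factory=dict)

    @property
    def dropped(self):
        return self.total - self.kept

    @property
    def dropped_share(self):
        return self.dropped / self.total if self.total else 0.0


def clean_with_report(records, rules):
    """Drop records matched by any named rule and count the drops per rule."""
    kept = []
    dropped_by_reason = {name: 0 for name, _ in rules}
    for record in records:
        for name, is_contaminated in rules:
            if is_contaminated(record):
                dropped_by_reason[name] += 1
                break  # count each record under the first rule it matches
        else:
            kept.append(record)
    return kept, CleaningReport(len(records), len(kept), dropped_by_reason)


# Hypothetical VLE records and exclusion rules, for illustration only.
records = [
    {"role": "student", "title": "Essay draft"},
    {"role": "teacher", "title": "test test"},
    {"role": "student", "title": ""},
    {"role": "student", "title": "Quiz 1 attempt"},
]
rules = [
    ("teacher test content",
     lambda r: r["role"] == "teacher" and "test" in r["title"].lower()),
    ("empty title", lambda r: not r["title"].strip()),
]

kept, report = clean_with_report(records, rules)
print(f"dropped {report.dropped} of {report.total}: {report.dropped_by_reason}")
```

The point of the sketch is only that the exclusion rules are named and the counts per rule travel with the cleaned dataset, so a reader can see on what basis data was removed.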

This necessarily leads to questioning the weighting of data. By which mechanism are some datasets selected as being meaningful, and at which priority level over others? Very often, the rationale behind the selection of variables is not exposed, nor is the priority relationship to other variables in the same dataset. Still, it must be transparent whether e.g. the timing, the duration, or the location of an event is given more weight when predicting a required pedagogic intervention (or not). After all, a young person’s future may depend on it.
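
As a rough sketch of what transparent weighting could mean in practice, the following Python snippet keeps the weights in plain sight and returns each variable’s contribution alongside the total. The weights, the feature values, and the function name `intervention_score` are hypothetical and chosen only for illustration; a real predictive model would likely be more complex.

```python
# Hypothetical weights; real values would have to be published by whoever
# builds the model.
WEIGHTS = {"timing": 0.5, "duration": 0.3, "location": 0.2}


def intervention_score(features, weights=WEIGHTS):
    """Return the weighted score plus each variable's contribution to it."""
    contributions = {name: weights[name] * features[name] for name in weights}
    return sum(contributions.values()), contributions


# Feature values are assumed to be normalised to the range 0..1.
score, contributions = intervention_score(
    {"timing": 0.8, "duration": 0.2, "location": 0.5}
)
for name, value in contributions.items():
    print(f"{name}: {value:.2f}")
print(f"total: {score:.2f}")
```

Exposing the per-variable contributions is one simple way to let a teacher or student see why an intervention was suggested.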

The two limitations above lead to a third, which concerns the question of what constitutes a meaningful trace that a user leaves on the system. We know users leave data behind when conducting an electronic activity. These data can be sequenced by time, but it is far from clear where the useful cut-off points of a sequence or ‘trace’ are. Say you had a string of log data A-B-C-D-E-F-G-H. Does it make more sense to assume that BCD constitutes meaning, or would CDEF perhaps be better – and why would it be better?
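
A small Python sketch shows how quickly the candidates multiply. It simply enumerates every contiguous sub-sequence of the log; the function name `candidate_traces` and the length bounds are my own assumptions, and nothing here says which candidate is the meaningful one.

```python
def candidate_traces(events, min_len=2, max_len=None):
    """Return every contiguous sub-sequence of events within the length bounds."""
    if max_len is None:
        max_len = len(events)
    traces = []
    for start in range(len(events)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(events):
                break  # longer windows would run past the end of the log
            traces.append(tuple(events[start:end]))
    return traces


log = list("ABCDEFGH")
short_traces = candidate_traces(log, min_len=3, max_len=4)
all_traces = candidate_traces(log, min_len=1)
print(len(short_traces), "candidates of length 3 or 4")
print(len(all_traces), "candidates of any length")
```

Even for eight events, BCD and CDEF are just two among dozens of possible cut-offs, so the choice needs a stated justification.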

I realise that these questions could be interpreted as destructive criticism, but we have one other possibility, which is to just take the results conjured up in a black box at face value and see if they look plausible no matter how they were derived. This we could call the Google approach.


The twinning of EDM and Learning Analytics

After listening to Ryan Baker’s presentation on Educational Data Mining (EDM), I am more convinced than ever that EDM and Learning Analytics are actually the same side of the same coin. Despite attempts being made to explain them into different zones of influence or different missions, I fail to see such differences, and from reading other LAK12 participants’ reflections, I am not alone in this. Baker’s view that Learning Analytics are somewhat more “holistic” can be refuted with a simple “depends”. What is more, historically, EDM and LA don’t even originate from different scientific communities, as is the case with metadata communities versus librarians, or with electric versus magnetic force physics – now of course known as electromagnetism.

Both approaches (if there are indeed two) are based on examining datasets to find ‘invisible’ patterns that can be translated into information useful to improve the success and efficiency of the learning processes. A good example Baker mentioned was the detection of students that digress, misunderstand, game the system, or disengage. It’s all in the data.

I would also like to believe that predicting the future leads to changing the future; at the very least, it could give users the air of being in control of their destination. As a promotional message this has quite some power. But even in support of reflection the same can be postulated: knowing past performance can help your future performance! So, once again, there is a strong overlap between the predictive and the reflective application of data analytics.

For me, all of this can only lead one way: instead of spending effort and energy on differentiating the two domains, which would only lead to reduced communities at both ends and friction in between, we need to think big and marry them into one large community and domain: Let’s twin EDM and LA!

Crowdsourcing while learning

This is an interesting approach. Duolingo promises to translate the Web while users learn a language. I liked their nice intro video, which explains how it works: Duolingo uses your native speaker skills to have you translate foreign sentences into your own language. This is a sensible approach which is also used in translation sciences and interpreting – always translate into your native tongue.

How do you translate from a language you don’t understand? Duolingo adjusts to your competence level and provides help on the fly, such as translation suggestions. I suppose this approach works reasonably well with languages that have an association to yours, e.g. English and Spanish (through their shared Latin vocabulary: library – librería). But I’d be interested to see how this is done with Chinese or Finnish.

Google does similar stuff with its translation service, but what’s innovative here is that Duolingo promises learning in return for your translations. There are two open questions for me. Firstly, what does translating the Web mean, i.e. how are the translations fed back to the Web, and will they be free and open? Secondly, what is the learning model behind it, since merely translating sentences only gets you so far in language learning? In language learning you want to be able to produce the other idiom, not only understand it passively. How are repetition and grammatical structure analysis incorporated in the tool?

It’s in private beta, but I am curious about the didactical model once I get access to it.

End for for-profit HE in the UK

A good post by Sean Mehan commenting on the UK government dropping a bill that would have allowed private for-profit companies to enter the HE market. Referring to a news item in the Telegraph, Sean writes:

The legislation would have allowed state loans to go into profits for for-profits, even allowing foreign companies (yes, companies, not institutions, for that is what they are), into the mix. So, UK taxpayer money goes to profits in a foreign country, while the national infrastructure is forced to compete or rot.

Quite right! Add to this monetary concern the socio-intellectual one, that such a privatisation move would severely damage the mission of HE to serve wider society, not a handful of shareholders.



Metacognition and Learning Analytics

Following the first live session in the latest MOOC on Learning and Knowledge Analytics (LAK12), I did some reflection on the direction that Learning Analytics has taken over the past year or two. As far as I can see, Learning Analytics follows to a large extent the line of web analytics, but with the intention to improve learning by gaining insights into hitherto invisible connections between user characteristics and actions.

However, web analytics has, I believe, a very different objective when analysing people’s navigation patterns and tracking their activities online. This objective is to better influence user behaviour in order to direct them (unknowingly and personalised) to the pages and activities that matter – to the company, not the user. Almost in parallel, the expressed attitude and the examples brought forward in favour of Learning Analytics put the main focus on understanding and influencing learner behaviour, and only to an extremely limited extent, if at all, on their cognitive development.

An often mentioned example is that of a jogger who trains up for a marathon run, and through collection of performance data becomes more motivated, is able to see progress, compares this to other runners, etc. Similarly, tools that track the usage of software applications on your computer provide feedback that is useful if you think you should change the amount of time you spend on e-mails. Equally, tracking your own smoking or eating habits will hopefully lead to achieving a personal goal. These are all valid examples of where and how feedback loops can improve a person’s accustomed performance.

It is vitally important, though, that if Learning Analytics is supposed to make a beneficial impact on (self-directed) learning, it does not stop at manipulating learners in such a way that they are merely conditioned into different behaviours! It is not enough to check behaviour patterns of learners, even though some such feedback might be helpful at times. We need more LA applications that support metacognition and cognitive development. Even memory joggers are quite useful at this. Among the oldest I am familiar with, and which I have used to great benefit, are vocabulary trainers. In using those, I could see that in the first run, I was able to answer maybe 46% of a given wordlist, increasing to 65% in the next run. Over only a few runs I was able to answer 96% of all questions. Not only was this summative feedback in % a motivator and excellent for my own benchmarking; I also was able to detect decline in memorised vocabulary and identify which words I was most likely to forget, once I stopped actively revising (say three weeks later).
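
The kind of feedback I got from the vocabulary trainer can be sketched in a few lines of Python. This is not how any particular trainer works; the word list, the run data, and the function names `run_scores` and `forgotten_words` are invented for illustration.

```python
def run_scores(runs):
    """Percentage of correctly answered words for each run, in whole percent."""
    return [round(100 * sum(run.values()) / len(run)) for run in runs]


def forgotten_words(earlier_run, later_run):
    """Words answered correctly in the earlier run but missed in the later one."""
    return sorted(
        word for word, correct in earlier_run.items()
        if correct and not later_run.get(word, False)
    )


# Hypothetical results: word -> answered correctly (True) or not (False).
run_1 = {"casa": True, "perro": False, "libro": False, "mesa": False}
run_2 = {"casa": True, "perro": True, "libro": True, "mesa": False}
run_3 = {"casa": True, "perro": True, "libro": True, "mesa": True}
three_weeks_later = {"casa": True, "perro": False, "libro": True, "mesa": False}

scores = run_scores([run_1, run_2, run_3, three_weeks_later])
at_risk = forgotten_words(run_3, three_weeks_later)
print("scores per run:", scores)
print("words to revise:", at_risk)
```

The first function gives the summative percentage per run; the second gives the more interesting, metacognitive part: which words have slipped since I stopped revising.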

Since I am most interested in cognitive development and less in learning behaviour patterns, I would like to see more Learning Analytics tools that allow this to happen.

A near complete history of EC funded research

More transparency about where EU-funded research is going has always been desirable, as has knowing who is active in it and to what extent. ResearchRanking (beta) is an interesting attempt at rating European research institutions by participation in EU-funded projects. Total funding has been relatively steady over the past two decades:

[Figure: EU funding stats]

The site allows search by institution to see how successful they have been in getting funding, whether as participants or as coordinators. When looking at my own institution, the data is still incomplete, only covering Framework Programmes, but since it is beta, I expect more to come. Still, it’s a good start and judging from how our project activities from the past are identified, it looks representative for the work we are doing.

Interesting to inspect are the ranking tables, where the usual suspects CNRS (France) and Fraunhofer (Germany) are leading the 2010 table.

In summary, the site provides opportunities for interesting browsing and it’s worth spending a few moments on it. Finally, it seems, the idea of Open Data has reached the European Commission and we can expect more insights into the workings of the ivory tower in Brussels.


Good teaching comes from the teacher!

Over the holidays, I watched the 12-part video lecture series by Neil deGrasse Tyson called “My Favorite Universe”. Not only is this a fascinating topic anyhow, but the astrophysicist and director of the Hayden Planetarium brings it to life. As one commenter put it:

They need to clone Tyson and put him in every class room across the planet. I bet the world would be a better place for it…

What thrilled me was the enthusiasm that Tyson radiated, the love for his subject discipline and the love for telling people about it. Now here’s a good teacher if ever I saw one. What’s even more striking is that he used no technology in his presentations, apart from a few still images illustrating parts of the cosmos. Now this made me wonder, because all the emphasis on being a good teacher that I know and hear about lies on the competent use of technology! Institutions invest zillions of currency into putting projectors and smartboards into every classroom, as well as into staff development programmes training people how to use PowerPoint or upload a file into the VLE. Who, nowadays, would dare go to a conference without a USB stick with the obligatory presentation on it?

Technology surely has its place, especially for reaching out. I would never have been able to watch this great series if it were not for YouTube, cloud computing, ubiquitous Internet access, and the good man filming and sharing his lectures. But let’s face it, good teaching does not come from technology, it comes from the teacher, presenter, or expert, and we need to invest in it!

Are we at the limits of science?

Scientific productivity is paramount in an academic economy that tries to hold on to the best, and where the best are trying to hold on. But as this interesting article discusses, we might be applying the wrong measures to it.

With the application of industrial criteria to science, we may have reached the limits of research in more than one way. In every economic domain, upscaling eventually reaches a ceiling where no further growth is possible without cheats, compromises, or loss in quality. Apart from outright fraudulent manipulation of the scientific publishing mechanisms and the lack of objective possibilities to replicate claimed research results, both of which are mentioned in the article, there are other issues that begin to shape up into a kind of scientific sound barrier ahead of us that increasingly separates us from the search for new knowledge to the benefit of humanity:

  • Productivity pressures that quantify research output in citations and impact measures lead, at some scaling point, to “work-arounds” as scholars can no longer meet or maintain expectations. Just like with “backlink” trading in web SEO, similar new spin-off models for increasing impact factors spring into life. At the same time, industrial expectations anticipating “a paper submission per week” have to compromise on quality.

  • The run for money, i.e. research funding, becomes more important than the quest for knowledge. As long as there is only a handful of institutions searching for funding, this may be (a) easy and (b) successful. As soon as it is mandated and scaled up as a public objective to relieve budgets, a third money stream economy becomes increasingly hard and requires higher pre-investment. This can already be seen in EU funding rounds, where the number of applications has dramatically increased, leading to a much reduced chance of success for competitors. It already leaves many smaller institutions out in the cold. The same is true of industry sponsorship: making every institution go knocking at company doors for donations or private funding is like having not one needy hand stretched out, but hundreds!

  • Similar resource limits are encountered when looking at empirical research. It has become a real challenge to find participants for pilots, surveys, evaluations, etc. People are over-surveyed and over-evaluated. Having one survey a month was still o.k., but with pilots, tests, and questionnaires becoming a daily diet, this approach turns itself on its head. Scientifically, it leads to the risk of low participation or low quality returns with less scientific relevance. Alternatively, as is often the case, students are forcefully pushed into the role of a lab rat, but with the number of tests and pilots their entire education is in danger of becoming an experiment.

  • Peer review, originally conceived as a measure for scientific quality, also suffers from the scaling issue. Doing a peer review for one reputable journal or conference now and then was superbly rewarding and an honour to be involved in. But with the growth in publication outlets, the requests for reviewers’ unpaid time have also grown beyond proportion. This again leads to poor engagement with the task.

The paradox with all this is that the more organisations try to quantify and control these issues, the more they are failing. Scientific half-life is shortened not only by the speed by which new knowledge is created, but also by the amount of invalidity contained in it. Do we have a bubble that is about to burst?

Learning Analytics meets sports

Starting this season, physical performance data of German Premier League footballers is being collected and published.

Video sensors are tracking players from both teams and allow a detailed analysis of their physical movements. 35 times per second (the video frame-rate) the coordinates of each player are stored. Among the things that can be analysed are spatial and movement profiles, sprints, speed, and heat maps. It seems that coaches and fans no longer only want to trust their own judgment, but prefer to see it in figures and stats. Football already employed quite detailed number crunching at team level, such as fouls committed, ball possession, time in opponent’s half, shots on target, etc. This takes analytics to a personal level, where, I guess, the hope is to help players learn from their behaviour on the pitch.
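
To give a feeling for how figures like distance run or speed could be derived from such coordinates, here is a minimal Python sketch. I do not know how the tracking company actually computes its statistics; the units (metres), the sample positions, and the function names are assumptions, and only the sampling rate is taken from the description above.

```python
import math

FRAME_RATE = 35  # samples per second, as described above


def distance_run(positions):
    """Total path length over consecutive (x, y) samples, in metres."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def speeds(positions, frame_rate=FRAME_RATE):
    """Speed between consecutive samples, in metres per second."""
    return [math.dist(a, b) * frame_rate for a, b in zip(positions, positions[1:])]


# Hypothetical samples for one player: moving, standing still, moving again.
positions = [(0.0, 0.0), (0.12, 0.16), (0.12, 0.16), (0.24, 0.32)]

total = distance_run(positions)
per_frame = speeds(positions)
print(f"distance: {total:.2f} m, top speed: {max(per_frame):.1f} m/s")
```

Note that numbers like these describe movement only; they carry none of the context needed to judge why a player ran or stood still.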

[Figure: tracker graph]

Criticism has come from some clubs that the tracking company does not restrict this information to club managers and coaches but also sells it to the media and to the general public. It is bemoaned that such analyses may give the wrong impression of being able to single out some players and make them scientifically responsible for a lost game. It is clear, though, that the analysis cannot come to contextual or qualitative conclusions of why a player performed the way he did. The danger that is pointed out is that figures, like mileage run, may be taken up by the public without reflection. Publishing performance data of players may even – it is feared – impact on the market value of some players or clubs.

What this will do to the sport is yet unknown, but we’ll just hope it still is fun to watch!

Notes to Learning Analytics

The recent two-day seminar on Learning Analytics, organised by the Dutch SURF academy, brought some interested parties from different education institutions and vendors together. While stimulating in its presentation, the seminar mainly presented technical showcases. What got somehow left behind were relevant pedagogic showcases and a feeling of how receptive the teaching practitioner community is to this kind of innovation. Are we running again into the old pattern of being technology driven?

Some interesting showpieces included tools to elicit what I would call educational business analytics (as opposed to learning analytics). To some extent these were not really new, as business reporting systems on student grades, drop-out figures, and the like have existed for many years, although they are mainly available to university registrars. It is not yet clear what these figures do to teaching and learning when presented to teaching staff instead of administrators, but this would be a novel approach.

Here are some notes that came to my mind while listening to the presentations:

  • LA tools are a bit like a cooking thermometer or oven thermostat. They don’t give you an indication of what meal a person is preparing or whether it will taste good or not, but they may be a vital (on-demand) instrument to determine the right action at the right time to get it done.

  • How do we avoid teachers being turned into controllers, sitting like Homer Simpson in front of a dashboard and control panel looking at visualisations of their students’ datasets? Does an increase in such activities reduce their contact time with students?
  • One common assumption I noted is the belief that all students are ambitious and only aim for top grades and the best learning experience. Being a father and having seen a few student generations, I contest this assumption. Many, if not most students, just want to pass. Studying isn’t in fact what they perceive as the prime focus of their lives. Tactical calculations that students are used to doing (how often can I still be absent; what’s the minimum mark I need for passing, etc.) may be ‘prehistoric’ forms of Learning Analytics that have existed for as long as assessments have been around!
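
The ‘prehistoric’ calculation from the last note is easy to write down. Here is a small Python sketch, assuming a weighted-average grading scheme and a pass mark of 50; both are assumptions for the sake of the example, as is the function name `minimum_mark_needed`.

```python
def minimum_mark_needed(completed, remaining_weight, pass_mark=50.0):
    """Mark needed on the remaining assessment to reach the pass mark.

    completed is a list of (mark, weight) pairs; the weights are fractions
    that, together with remaining_weight, are assumed to add up to 1.
    A result above the maximum possible mark means passing is out of reach.
    """
    earned = sum(mark * weight for mark, weight in completed)
    needed = (pass_mark - earned) / remaining_weight
    return max(0.0, needed)


# Hypothetical student: two assignments done, the final exam still to come.
needed = minimum_mark_needed([(40, 0.25), (60, 0.25)], remaining_weight=0.5)
print(f"minimum exam mark to pass: {needed:.0f}")
```

Students have been running this kind of analytics in their heads for generations, with no dashboard required.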