Text and Data Mining workshop, London 21 Oct 2011

There were two themes running through this workshop organised by the Strategic Content Alliance: technical potential and legal barriers. An important piece of background is the Hargreaves report.

The potential of text and data mining is probably well understood in technical circles, and was well articulated by John McNaught of NaCTeM. Briefly, the potential lies in the extraction of new knowledge from old through the ability to surface implicit knowledge and show semantic relationships. This is something that could not be done by humans, not even crowds, because of the volume of information involved. Full text access is crucial: John cited a finding that only 7% of the subject information extracted from research papers was mentioned in the abstract. There was a strong emphasis, for example from Jeff Lynn of the Coalition for a Digital Economy and Philip Ditchfield of GSK, on the need for business and enterprise to be able to realise this potential if it were to remain competitive.

While these speakers touched on the legal barriers, it was Naomi Korn who gave them a full airing. They start in the process of publishing (or before), when publishers acquire copyright, or a licence to publish with enough restrictions to be equivalent. The problem is that the first step of text mining is to make a copy of the work in a suitable format. Even for works licensed under the most liberal open access licence academic authors are likely to use, CC-BY, this requires attribution. Naomi spoke of attribution stacking, a problem John had mentioned, which arises when a result is found by mining thousands of papers: do you have to attribute all of them? This sort of problem occurs at every step of the text mining process. In UK law there are no copyright exceptions that can apply: text mining is not covered by fair dealing (though it is covered by fair use in the US and by similar exceptions in Norwegian and Japanese law, but nowhere else); the exceptions for transient copies (such as in a computer’s memory when reading online) only apply if that copy has no intrinsic value.

The Hargreaves report seeks to redress this situation. Copyright and other IP law is meant to promote innovation, not stifle it, and copyright is meant to cover creative expressions, not the sort of raw factual information that data mining processes. Ben White of the British Library suggested an extension of fair dealing to permit data mining of legally obtained publications. The important thing is that, as parliament acts on the Hargreaves review, people who understand text mining and care about legal issues make sure that any legislation is sufficient to allow innovation; otherwise innovators will have to move to jurisdictions like the US, Japan and Norway where the legal barriers are lower (I’ll call them ‘text havens’).

Thanks to JISC and the SCA for organising this event; there’s obviously plenty more for them to do.

Web2 vs iTunesU

There was an interesting discussion last week on the JISC-Repositories email list that kicked off after Les Carr asked

Does anyone have any experience with iTunes U? Our University is thinking of starting a presence on Apple’s iTunes U (the section of the iTunes store that distributes podcasts and video podcasts from higher education institutions). It looks very professional (see for example the OU’s presence at http://projects.kmi.open.ac.uk/itunesu/ ) and there are over 300 institutions who are represented there.

HOWEVER, I can’t shake the feeling that this is a very bad idea, even for lovers of Apple products. My main misgiving is that the content isn’t accessible apart from through the iTunes browser, and hence it is not Googleable and hence it is pretty-much invisible. Why would anyone want to do that? Isn’t it a much better idea to put material on YouTube and use the whole web/web2 infrastructure?

I’d like to summarize the discussion here so that the important points raised get a wider airing; however, it is a feature of high quality discussions like this one that people learn and change their minds as a result, so please don’t assume that people quoted below still hold the opinions attributed to them. (For example, invisibility on Google turned out to be far from the case for some resources.) If you would like to see the whole discussion, look in the JISCMAIL archive.

The first answer from a few posters was that it is not an either/or decision.

Patricia Killiard:

Cambridge has an iTunesU site. [...] the material is normally deposited first with the university Streaming Media Service. It can then be made accessible through a variety of platforms, including YouTube, the university web pages and departmental/faculty sites, and the Streaming Media Service’s own site, as well as iTunesU.

Mike Fraser:

Oxford does both using the same datafeed: an iTunesU presence (which is very popular in terms of downloads and as a success story within the institution); and a local, openly available site serving up the same

Jenny Delasalle and David Davies of Warwick and Brian Kelly of UKOLN also highlighted how iTunesU complemented rather than competed with other hosting options, and was discoverable on Google.

Andy Powell, however, pointed out that it was so “Googleable” that a video from Warwick University on iTunesU came higher in the search results for University of Warwick No Paradise without Banks than the same video on Warwick’s own site. (The first result I get is from Warwick, about the event, but doesn’t seem to give access to the video–at least not so easily that I can find it; the second result I get is the copy from iTunes U, on deimos.apple.com. Incidentally, I get nothing for the same search term on Google Videos.) He pointed out that this is “(implicitly) encouraging use of the iTunes U version (and therefore use of iTunes) rather than the lighter-weight ‘web’ version.” and he made the point that:

Andy also raised other “softer issues” about which version students will be referred to, which might reinforce one version rather than another as the copy of choice even if it wasn’t the best one for them.

Ideally it would be possible to refer people to a canonical version or a list of available versions (Graham Triggs mentioned Google’s canonical URLs, perhaps if Google relax the rules on how they’re applied), but I’m not convinced that’s likely to happen. So there’s a compromise: a variety of platforms for a variety of needs versus possibly diluting the web presence for any given resource.

And a response from David Davies:

iTunesU is simply an RSS aggregator with a fancy presentation layer.
iTunesU content is discoverable by Google – should you want to, but as we’ve seen there are easier ways of discovering the same content, it doesn’t generate new URLs for the underlying content, is based upon a principle of reusable content, Apple doesn’t claim exclusivity for published content so is not being evil, and it fits within the accepted definition of web architecture. Perhaps we should simply accept that some people just don’t like it. Maybe because they don’t understand what it is or why an institution would want to use it, or they just have a gut feeling there’s something funny about it. And that’s just fine.

mmm, I don’t know about all these web architecture principles, I just know that I can’t access the only copy I find on Google. But then I admit I do have something of a gut feeling against iTunesU; maybe that’s fine, maybe it’s not; and maybe it’s just something about the example Andy chose: searching Google for University of Warwick slow poetry video gives access to copies at YouTube and Warwick, but no copy on iTunes.

I’m left with the feeling that I need to understand more about how using these services affects the discoverability of resources using Google–which is one of the things I would like to address during the session I’m organising for the CETIS conference in November.

George Orwell is blogging

George Orwell is blogging, so is Samuel Pepys. And quite aside from the content (I’m an Orwell fan; the merits of this content were discussed when the blog was launched here, and here), I think this is a brilliant way of putting diaries online as open content[1]. Delivery, at least, relies on software anyone can use for free[2]; you and I can get the text in a machine readable format, HTML and RSS; each entry gets a URI; the entries can be tagged and commented on; locations can be mapped on Google (Orwell, Pepys), and other concepts mentioned can be linked to encyclopaedia entries; the blog owners could, at least in principle, export the whole lot in XML and stick it in a database to process, and anyone can process entries with text mining software or by setting up a Google custom search engine or . . . .
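
To give a feel for what "machine readable" buys you, here is a minimal sketch in Python of pulling entries out of an RSS feed and counting words across them. It is only an illustration under stated assumptions: the feed fragment, the example.org links and the diary text are all invented, and the helper names parse_entries and word_counts are my own, not anything either blog provides.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical RSS 2.0 fragment standing in for a diary blog's feed.
# The titles, links and text are invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example diary</title>
    <item>
      <title>1 January</title>
      <link>http://example.org/diary/01-01</link>
      <description>Cold morning. Walked to the market and bought eggs.</description>
    </item>
    <item>
      <title>2 January</title>
      <link>http://example.org/diary/01-02</link>
      <description>Rain all day. Eggs for breakfast, then wrote letters.</description>
    </item>
  </channel>
</rss>"""


def parse_entries(feed_xml):
    """Return one dict per <item>, holding its title, link (URI) and text."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return entries


def word_counts(entries):
    """Count lower-cased alphabetic words across all entry texts."""
    counts = Counter()
    for entry in entries:
        counts.update(re.findall(r"[a-z]+", entry["text"].lower()))
    return counts


entries = parse_entries(SAMPLE_FEED)
counts = word_counts(entries)
print(entries[0]["link"])
print(counts.most_common(3))
```

Counting words is about the crudest form of text mining there is, but the point is that each entry arrives with its own URI and text already separated out, so anything more sophisticated can start from the same place.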

The two examples above are slightly short of perfect. I like to see the dates for the blog entries matching the dates for the diary entries (the Pepys diaries do this, Orwell managed it at first, but then slipped). And I think it would make more sense if the monthly archives were arranged to be read top-to-bottom in chronological order. Also I wonder if hosting on wordpress.com is the best idea. It has its attractions, but the tags in the Orwell blog link to posts from other blogs which are well out of scope while the Pepys diary has some very interesting customizations; also if the Orwell blog owners do ever find a way to go back to posting against the diary entry date I imagine they would have problems setting up redirects so that links to the current posts still work.

1. I guess I should be clear: I’m not saying that these diaries are open content. The Pepys text is from Project Gutenberg, I don’t know the licensing arrangements for other aspects of the blog; Orwell’s text is still in copyright in many countries (including the UK and the US), I don’t know the licensing arrangements for the blog.
2. The Orwell diary is on WordPress.com; Pepys uses a customized installation of Movable Type.

Why share?

In a comment to a previous post of mine, Gayle reminded me of the point made by the ACETS project:

“Re-use is not in itself a good or bad thing and it should not be encouraged or discouraged as a matter of dogma. Rather it should be nurtured and supported where it can provide benefits and not where it will not.”

So the question we should ask is: when will re-use provide benefits? Here are some links to recent and ongoing work relating to the benefits of sharing, reuse and open content.
I’ve just been talking to colleagues about sharing learning resources and I suggested that we could try to describe what attributes make a resource more easily shared. I’ve been using the set listed below in discussions relating to several projects I’ve been involved with over the last two or three years, but I don’t think I’ve ever put them down clearly on their own rather than embedded in some presentation on a specific project. Mostly they were first suggested by Charles Duncan at the 2005 Eduserv Symposium, but the first two are my own additions (Charles probably thought them too obvious to mention).

So here for clarity and ease of reference (but certainly not novelty) are six attributes of a resource which I suggest will make it more likely to be shared:
