Metadata requirements from analysis of search logs

Many sites hosting collections of educational materials keep logs of the search terms used by visitors to the site who search for resources. Since it came up during the CETIS What Metadata (CETISWMD) event I have been thinking about what we could learn about metadata requirements from the analysis of these search logs. I’ve been helped by having some real search logs from Xpert to poke at with some Perl scripts (thanks Pat).

Essentially the idea is to classify the search terms used with reference to the characteristics of a resource that may be described in metadata. For example, terms such as “biology”, “English civil war” and “quantum mechanics” can readily be identified as relating to the subject of a resource; “beginners”, “101” and “college-level” relate to educational level; “power point”, “online tutorial” and “lecture” relate in some way to the type of the resource. We believe that knowing such information would assist a collection manager in building their collection (by showing what resources were in demand) and in describing their resources in a way that helps users find them. It would also be useful to those who build standards for the description of learning resources to know which characteristics of a resource are worth describing in order to facilitate resource discovery. (I had an early run at doing this when OCWSearch published a list of top searches.)
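As a rough illustration of the classification idea, here is a minimal Python sketch. It is not the Perl used on the Xpert logs: the lookup table `KNOWN_PHRASES`, the label names and the functions `classify` and `tally` are all assumptions made up for this example. It matches the longest known phrase first, so that “English civil war” counts as one concept, and it lets a single search touch more than one characteristic.

```python
from collections import Counter

# Hypothetical hand-built lookup: known phrase -> resource characteristic.
# The entries are the examples from the text; a real table would be far larger.
KNOWN_PHRASES = {
    "biology": "subject",
    "english civil war": "subject",
    "quantum mechanics": "subject",
    "beginners": "educational level",
    "101": "educational level",
    "college-level": "educational level",
    "power point": "resource type",
    "online tutorial": "resource type",
    "lecture": "resource type",
}


def classify(search_term):
    """Return the set of characteristics that a search term relates to."""
    words = search_term.lower().split()
    found = set()
    i = 0
    while i < len(words):
        # Try the longest phrase starting at word i before shorter ones.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in KNOWN_PHRASES:
                found.add(KNOWN_PHRASES[phrase])
                i = j  # continue after the matched phrase
                break
        else:
            i += 1  # no known phrase starts here, skip this word
    return found


def tally(search_terms):
    """Count how often each characteristic appears across many searches."""
    counts = Counter()
    for term in search_terms:
        labels = classify(term)
        if labels:
            counts.update(labels)
        else:
            counts["unclassified"] += 1
    return counts


print(sorted(classify("quantum mechanics for beginners")))
# ['educational level', 'subject']
```

The tally is what a collection manager or standards author would actually look at: the relative demand for each characteristic, plus the size of the “unclassified” pile that still needs a human.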

Looking at the Xpert data has helped me identify some complications that will need to be dealt with. Some of the examples above show how a search phrase with more than one word can relate to a single concept, but in other cases, e.g. “biology 101” and “quantum mechanics for beginners”, the search term relates to more than one characteristic of the resource. Some search terms may be ambiguous: “French” may relate to the subject of the resource or the language (or both); “Charles Darwin” may relate to the subject or the author of a resource. Some terms are initially opaque but on investigation turn out to be quite rich, for example 15.822 is the course code for an MIT OCW course, and so implies a publisher/source, a subject and an educational level. Also, in real data I see the same search term being used repeatedly in a short period of time. I guess this is an artifact of how someone paging through results is logged as a series of searches: should these be counted as a single search or multiple searches?
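One way of handling the repeated-search question can be sketched in a few lines of Python. This is only one of the possible policies, and the details are assumptions: the log is taken to be a list of `(timestamp, term)` pairs already sorted by time, the five-minute window is arbitrary, and the function name `collapse_repeats` is invented for the example. A real log with session or visitor identifiers would allow a better rule.

```python
from datetime import datetime, timedelta


def collapse_repeats(log, window=timedelta(minutes=5)):
    """Keep one entry per burst of identical searches.

    `log` is a list of (timestamp, term) pairs sorted by timestamp.
    A search is dropped if the same term (ignoring case and surrounding
    spaces) was last seen no more than `window` earlier.
    """
    last_seen = {}
    kept = []
    for when, term in log:
        key = term.strip().lower()
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            kept.append((when, term))
        # Update on every occurrence, so a long run of paging
        # through results still counts as a single search.
        last_seen[key] = when
    return kept


log = [
    (datetime(2010, 5, 1, 10, 0), "mitosis"),
    (datetime(2010, 5, 1, 10, 1), "mitosis"),
    (datetime(2010, 5, 1, 10, 2), "Mitosis"),
    (datetime(2010, 5, 1, 10, 3), "biology"),
    (datetime(2010, 5, 1, 11, 0), "mitosis"),
]
print(len(collapse_repeats(log)))  # 3
```

Setting `window` to `timedelta(0)` drops only entries with an identical timestamp, which is close to counting every logged search separately, so the same function covers both ends of the choice.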

I think these are all tractable problems, though different people may want to deal with them in different ways. So I can imagine an application that would help someone do this analysis. In my mind it would import a search log and allow the user to go through it search by search, classifying each search term with respect to the characteristic of the resource to which it relates. Tedious work, perhaps, but it wouldn’t take too long to classify enough search terms to get an adequate statistical snapshot (you might want to randomise the order in which the terms are classified in order to help ensure the snapshot isn’t looking at a particularly unrepresentative period of the logs). The interface should help speed things up by allowing the user to classify by pressing a single key for most searches. There could be some computational support: the system would learn how to handle certain terms, and this learning would be shared between users. A user should not have to tell the system that “Biology” is a subject once they or any other user has done so. It may also be useful to distinguish between broad top-level subjects (like biology) and more specific terms like “mitosis”, or alternatively to know that specific terms like “mitosis” relate to the broader term “biology”: in other words the option to link to a thesaurus might be useful.
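The core of such an application might look something like the following Python sketch. Everything here is hypothetical: `sample_terms`, `SharedClassifications` and `classify_sample` are names invented for the example, the shared store is just an in-memory dictionary standing in for whatever would really be shared between users, and `ask` is a placeholder for the single-keypress interface.

```python
import random


def sample_terms(terms, n, seed=None):
    """Pick up to n search terms in random order, to avoid classifying
    only one (possibly unrepresentative) stretch of the log."""
    rng = random.Random(seed)
    return rng.sample(terms, min(n, len(terms)))


class SharedClassifications:
    """Remembers what any user has already said about a term."""

    def __init__(self):
        self._labels = {}

    def record(self, term, labels):
        self._labels[term.strip().lower()] = set(labels)

    def lookup(self, term):
        # Returns None for a term nobody has classified yet.
        return self._labels.get(term.strip().lower())


def classify_sample(terms, store, ask):
    """Classify each term, asking the user only about unknown ones.

    `ask` is any function that takes a term and returns its labels;
    in a real tool it would be the keyboard-driven interface.
    """
    results = {}
    for term in terms:
        labels = store.lookup(term)
        if labels is None:
            labels = set(ask(term))
            store.record(term, labels)
        results[term] = labels
    return results


store = SharedClassifications()
store.record("Biology", ["subject"])  # taught by an earlier user
asked = []


def ask(term):
    asked.append(term)
    return ["subject"]


classify_sample(["biology", "mitosis", "Mitosis"], store, ask)
print(asked)  # ['mitosis']
```

In the usage example only “mitosis” reaches the user: “biology” was already known, and the second “Mitosis” is answered from what was learned a moment earlier. A thesaurus link could be added as a second lookup from narrower to broader terms, but that is left out of the sketch.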

This still seems achievable and useful to me.

The modern art of metadata

At a meeting relating to the Resource Discovery Task Force the other day, Caroline Williams compared the variety of resource description metadata standards and profiles in Libraries, Archives, Museums and beyond to a Jackson Pollock picture.
She wouldn’t call it a mess (nor would I), but I think we could agree that it’s a bit chaotic.

That got me wondering what the alternatives might be…

Yves Klein?
Monochrome bleu sans titre [Untitled blue monochrome], 1960
Everything the same shade of blue. No, not achievable even if that is what you want.

How about Piet Mondrian?
Nice and neat, but compartmentalized, and quite a few empty compartments.

Or Picasso,
trying to fit several perspectives into one picture with the result that… well, two noses.

Actually Caroline had the answer when I asked her what she would like to see,

Bridget Riley
June, 1992-2002
Diversity, but working together.

Some downside to OER?

As part of my non-CETIS work I occasionally go out to evaluate teaching practice in Engineering for the HE Academy Engineering Subject Centre. This involves me going to some University and talking to an Engineering lecturer and some students about an approach they are using for teaching and learning. I especially enjoy this because it brings me close to the point of education and helps keep me in touch with what is really happening in universities around the UK. During a recent evaluation the following observations came up which are incidental to what I was actually evaluating but relevant, I think, to UKOER. They concern a couple of points raised, one by the lecturer and one by the students, that reflect genuine problems people might have with OER release.

Part of the lecturer’s approach involves a sequence of giving students some problems to solve each week, and then providing online and face-to-face support for these problems. The online support is good stuff; it’s video screen captures of worked model solutions with pretty good production values. Something like the Khan Academy but less rough. It would be great if this were released as OER; however, doing so would compromise the pedagogic strategy that the tutor has adopted. I don’t want to go into the specifics of why this lecturer has adopted this strategy, but in general it may be important that the students try the problems before they look at the support, and this is an interesting example of how OER release isn’t pedagogically neutral.

The point raised by the students concerned the reuse of OERs rather than their release. They really liked what their lecturer had done, and part of what they liked about it was that it was personal. This was important to them not just because it meant that there was an exact fit between the resources and their course but because they took it as showing that the lecturer had taken a deal of time and made a real effort in preparing their course. They were right in that, but they also went on to say that if the lecturer had taken resources from elsewhere, resources that they themselves could have found, they would have drawn the opposite inference. We may think that the students would be wrong in this, or that their expectations are unrealistic, but what’s important is that they felt that they would have been demotivated had the lecturer reused resources created elsewhere. I think this fits into a wider picture of how reuse of materials affects the relationship between teacher and student.

I’m not claiming that either of these observations is a conclusive or in any way compelling argument against the release of OER, and I will object to anyone claiming that a couple of data points are ever conclusive or compelling, but I did find them interesting.

Important advice on licensing

It frequently comes to the attention of the CETIS-pedantry department that certain among the people with whom we interact, while they have much to say and write that is worth heeding, do not know when to use “licence” and when to use “license”. Those of you who prefer to use US English can stop reading now, unless you’re intrigued by the convolutions of the UK variant of the language: this won’t ever be an issue for you.

It’s quite simple: licence is a noun, it’s the thing; license is a verb, it’s what you do. But how to remember that? Well, hopefully you’ll see that advice is a noun but advise is a verb; similarly device (noun), devise (verb); practice (noun), practise (verb). Words ending –ise are normally verbs[*]. So license/licence sticks to the pattern of c for noun, s for verb.

Hope this helps.

[* OK, you may prefer --ize, which isn't just for US usage in some cases--but that's a different rant story]