Archive for the ‘semantic technologies’ Category

Meshing up a JISC e-learning project timeline, or: It’s Linked Data on the Web, stupid

Friday, March 12th, 2010

Inspired by the VirtualDutch timeline, I wondered how easy it would be to create something similar with all JISC e-learning projects that I could get linked data for. It worked, and I learned some home truths about scalability and the web architecture in the process.

As Lorna pointed out, UCL’s VirtualDutch timeline is a wonderful example of using time to explore a dataset. Since I’d already done ‘place’ in my previous meshup, I thought I’d see how close I could get with £0.- and a few nights’ tinkering.

The plan was to make a timeline of all JISC e-learning projects, and the developmental and affinity relations between them that are recorded in CETIS’ PROD database of JISC projects. More or less like Scott Wilson and I did by hand in a report about the toolkits and demonstrators programme. This didn’t quite work; SPARQLing up the data is trivial, but those kinds of relations don’t fit well into the widely used Simile timeline from both a technical and usability point of view. I’ll try again with some other diagram type.

What I do have is one very simple, and one not so simple timeline for e-learning projects that get their data from the intensely useful rkb explorer Linked Data knowledge base of JISC projects and my own private PROD stash:

Recipe for the simple one

  • Go to the SPARQL proxy web service
  • Tell it to ask this question
  • Tell the proxy to ask that question of the RKB by pointing it at the RKB SPARQL endpoint (http://jisc.rkbexplorer.com/sparql/), and order the results as CSV
  • Copy the URL of the CSV results page that the proxy gives you
  • Stick that URL in a =ImportData("{yourURL}") function inside a fresh Google Spreadsheet
  • Insert timeline gadget into Google spreadsheet, and give it the right spreadsheet range
  • Hey presto, one timeline coming up:

Screenshot of the simple mashup- click to go to the live meshup
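
For anyone who would rather script the first few steps of that recipe than click through them, here is a minimal Python sketch. It assumes the endpoint accepts the standard SPARQL protocol `query` parameter on a GET request; the query shown is a placeholder, not the one used for the timeline, and the helper names are made up for illustration. How you ask for CSV depends on the proxy or endpoint, so that part is left out.

```python
from urllib.parse import urlencode

# Endpoint URL taken from the recipe above.
RKB_ENDPOINT = "http://jisc.rkbexplorer.com/sparql/"

# Placeholder query for illustration only; NOT the query behind the timeline.
EXAMPLE_QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def build_query_url(endpoint, query):
    """Return a GET URL carrying the query in the SPARQL protocol 'query' parameter."""
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode({"query": query})


def import_data_formula(url):
    """Wrap a results URL in the spreadsheet formula from the recipe."""
    return '=ImportData("{}")'.format(url)


if __name__ == "__main__":
    results_url = build_query_url(RKB_ENDPOINT, EXAMPLE_QUERY)
    print(results_url)
    print(import_data_formula(results_url))
```

The printed formula is what goes into the first cell of the fresh spreadsheet.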

Recipe for the not so simple one

For this one, I wanted to stick a bit more in the project ‘bubbles’, and I also wanted the links in the bubbles to point to the PROD pages rather than the RKB resource pages. Trouble is, RKB doesn’t know about data in PROD and, for some very sound reasons I’ll come to in a minute, won’t allow the pulling in of external datasets via SPARQL’s ‘FROM’ operator either. All the other SPARQL endpoints on the web that I know of that do allow FROM couldn’t handle my query: they either hung or conked out. So I did this instead:

  • Download and install your own SPARQL endpoint (I like Leigh Dodds’ simple but powerful Twinkle)
  • Feed it this query
  • Copy and paste the results into a spreadsheet and fiddle with concatenation
  • Hoist spreadsheet into Google docs
  • Insert timeline gadget into Google spreadsheet, and give it the right spreadsheet range
  • Hey presto, a more complex timeline:

Click to go to the live meshup
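
The ‘fiddle with concatenation’ step can also be done outside the spreadsheet. The sketch below is one possible way, not what was actually done for this timeline: it assumes the query results were saved as CSV with the hypothetical columns id, title, start, end and description, and that a PROD project page lives at some base URL plus the project id. The output column order (title, start, end, description) is an assumption too, so check it against what the timeline gadget expects.

```python
import csv
import io
from html import escape


def timeline_rows(csv_text, page_base):
    """Turn CSV query results into rows with an HTML 'bubble' description.

    The column names and page_base are illustrative assumptions, not PROD's
    real schema or URL pattern.
    """
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        # Build the link that the bubble should point at.
        link = '<a href="{}{}">{}</a>'.format(
            escape(page_base, quote=True),
            escape(rec["id"], quote=True),
            escape(rec["title"]),
        )
        # Concatenate link and description into one cell.
        bubble = link + "<br/>" + escape(rec["description"])
        rows.append([rec["title"], rec["start"], rec["end"], bubble])
    return rows


if __name__ == "__main__":
    # Made-up sample row, purely to show the shape of the output.
    sample = "id,title,start,end,description\n42,Example project,2008-01-01,2009-01-01,Demo\n"
    for row in timeline_rows(sample, "http://example.org/prod/"):
        print(row)
```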

It’s Linked Data on the Web, stupid

When I first started meshing up with real data sets on the web (as opposed to poking my own triple store), I had this inarticulate idea that SPARQL endpoints were these magic oracles that we could ask anything about anything. And then you notice that there is no federated search facility built into the language. None. And that the most obvious way of querying across more than one dataset - pulling in datasets from outside via SPARQL’s FROM - is not allowed by many SPARQL endpoints. And that if they do allow FROM, they frequently cr*p out.

The simple reason behind all this is that federated search doesn’t scale. The web does. Linked Data is also known as the Web of Data for that reason: it has the same architecture. SPARQL queries are computationally expensive at the best of times, and federated SPARQL queries would be exponentially more so. It’s easy to come up with a SPARQL query that either takes a long time, floors a massive server (or cloud) or simply fails.

That’s a problem if you think every server needs to handle lots of concurrent queries all the time, especially if it depends on other servers on a (creaky) network to satisfy those queries. By contrast, chomping on the occasional single query is trivial for a modern PC, just like parsing and rendering big and complex html pages is perfectly possible on a poky phone these days. By the same token, serving a few big gobs of (RDF XML) text that sits at the end of a sensible URL is an art that servers have perfected over the past 15 years.

The consequence is that exposing a data set as Linked Data is not so much a matter of installing a SPARQL endpoint, but of serving sensibly factored datasets in RDF with cool URLs, as outlined in Designing URI Sets for the UK Public Sector (pdf). That way, servers can easily satisfy all comers without breaking a sweat. That’s important if you want every data provider to play in Linked Data space. At the same time, consumers can ask what they want, without constraint. If they ask queries that are too complex or require too much data, then they can either beef up their own machines, or live with long delays and the occasional dead SPARQL engine.
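
From the consumer’s side, that pattern boils down to fetching a plain RDF document from a sensible URL and doing any querying locally. A minimal sketch using only Python’s standard library follows; the URL in the usage note is a placeholder, and whether a given server honours the Accept header is an assumption to check per data provider.

```python
from urllib.request import Request, urlopen

RDF_XML = "application/rdf+xml"


def rdf_request(url):
    """Build a GET request that asks the server for RDF/XML via content negotiation."""
    return Request(url, headers={"Accept": RDF_XML})


def fetch_rdf(url, timeout=30):
    """Download the RDF document behind a URL and return its raw bytes.

    Parsing and querying are left to whatever local triple store or
    SPARQL engine you prefer.
    """
    with urlopen(rdf_request(url), timeout=timeout) as response:
        return response.read()
```

Calling fetch_rdf("http://example.org/id/project/1") (a made-up URL) costs the server one cheap file transfer; the expensive part, the query, then runs on your own machine.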

Linked Data meshup on a string

Thursday, February 25th, 2010

I wanted to demo my meshup of a triplised version of CETIS’ PROD database with the impressive Linked Data Research Funding Explorer at the Linked Data meetup yesterday. I couldn’t find a good slot and still make my train home, so here’s a broad outline:

The data

The Department for Business Innovation and Skills (BIS) asked Talis if they could use the Linked Data Principles and practice demonstrated in their work with data.gov.uk to produce an application that would visualise some grant data. What popped out was a nice app with visuals by Iconomical, based on a couple of newly available data sets that sit on Talis’ own store for now.

The data concerns research investment in three disciplines, which are illustrated per project, by grant level and number of patents, as they changed over time and plotted on a map.

CETIS have PROD; a database of JISC projects, with a varying amount of information about the technologies they use, the programmes they were part of, and any cross links between them.

The goal

Simple: it just ought to be possible to plot the JISC projects alongside the advanced tech of the Research Funding Explorer. If not, then at least the data in PROD should be augmentable with the data that drives the Research Funding Explorer.

Tools

Anything I could get my hands on, chiefly:

The recipe

For one, though PROD pushes out Description Of A Project (DOAP, an RDF vocabulary) files per project, it doesn’t quite make all of its contents available as linked data right now. The D2R toolkit was used to map (part of) the contents to known vocabs, and then make the contents of a copy of PROD available through a SPARQL interface. Bang, we’re on the linked data web. That was easy.

Since I don’t have access to the slick visualisation of the Research Funding Explorer, I’d have to settle for augmenting PROD’s data. This is useful for two reasons: 1) PROD has rather, erm, variable institutional names. Synching these with canonical names from a set that will go into data.gov.uk is very handy. 2) PROD doesn’t know much about geography, but Talis’ data set does.

To make this work, I made a SPARQL query that grabs basic project data from PROD, and institutional names and locations from the Talis data set, and visualises the results.
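
The join itself was done in SPARQL, but the idea can be shown in a few lines of plain Python: tidy up the variable institution names, then look each project’s institution up in the canonical set to pick up its proper name and location. The field names and the normalisation rules below are illustrative assumptions, not the actual PROD or Talis schema.

```python
def normalise(name):
    """Reduce an institution name to a crude matching key (illustrative rules only)."""
    words = (w.strip(".,") for w in name.lower().replace("&", " and ").split())
    return " ".join(w for w in words if w and w != "the")


def augment(projects, institutions):
    """Add canonical name and coordinates to each project where a match exists."""
    by_key = {normalise(inst["name"]): inst for inst in institutions}
    augmented = []
    for project in projects:
        merged = dict(project)
        match = by_key.get(normalise(project["institution"]))
        if match is not None:
            merged["canonical_name"] = match["name"]
            merged["lat"] = match["lat"]
            merged["long"] = match["long"]
        augmented.append(merged)
    return augmented


if __name__ == "__main__":
    # Placeholder records, made up to show the shape of the data.
    projects = [{"title": "Example project", "institution": "The  University of Example."}]
    institutions = [{"name": "University of Example", "lat": 0.0, "long": 0.0}]
    print(augment(projects, institutions))
```

Projects without a match pass through unchanged, which makes it easy to see which names still need cleaning up at source.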

Results

A partial map of England, Wales and southern Scotland with markers indicating where projects took place
An excerpt of PROD project data, augmented with proper institutional names and geographic positions from Talis’ Research Funding Explorer, visualised in OpenLink RDF browser.

A star shaped overview of various attributes of a project, with the name property highlighted
Zooming in on a project, this time to show the attributes of a single project. Still in OpenLink RDF browser.

A two column list of one project's attributes and their values
A project in D2R’s web interface; not shiny, but very useful.

From blagging a copy of the SQL tables from the live PROD database to the screen shots above took about two days. Opening up the live server straight to the web would have cut that time by more than half. If I’d waited for the Research Funding Explorer data to be published at data.gov.uk, it’d have been a matter of about 45 minutes.

Lessons learned

Opening up any old database as linked data is incredibly easy.

Cross-searching multiple independent linked data stores can be surprisingly difficult. This is why a single SPARQL endpoint across them all, such as the one presented by uberblic’s Georgi Kobilarov yesterday, is interesting. There are many other good ways to tackle the problem too, but whichever approach you use, making your linked data available as simple big graphs per major class of thing (entity) in your dataset helps a lot. I was stymied somewhat by the fact that I wanted to make use of data that either wasn’t published properly yet (Talis’ research grant set), or wasn’t published at all (our own PROD triples).

A bit of judicious SPARQLing can alleviate a lot of inconsistent data problems. This is salient to a recent discussion on Twitter around Brian Kelly’s Linked Data challenge. One conclusion was that it was difficult, because the data was ‘bad’. IMHO, this is the web, so data isn’t really bad, just permanently inconsistent and incomplete. If you’re willing to put in some effort when querying, a lot can be rectified. We, however, clearly need to clean up PROD’s data to make it easier on everyone.

SPARQL-panning for gold in multiple datastores (or even feeds or webpages) is way too much fun to seem like work. To me, anyway.

What’s next

What needs to happen is to make all the contents of PROD and related JISC project information available as proper linked data. I can see three stages for this:

  1. We clean up the PROD data a little more at source, and load it into the Data Incubator to polish and debate the database to triple mapping. Other meshups would also be much easier at that point.
  2. We properly publish PROD as linked data either on a cloud platform such as Talis’, or else directly from our own server via D2R or OpenLink Virtuoso. Simal would be another great possibility for an outright replacement of PROD, if it’s far enough along at that point.
  3. JISC publishes the public part of its project information as Linked Data, and PROD just augments (rather than replicates) it.

Poking data.gov.uk for a day

Saturday, January 23rd, 2010

data.gov.uk - a portal for UK governmental open data - is clearly a fantastic idea, and has the makings of a real treasure trove. But estimating how far along it is in practice means picking up a stick and poking the beast.

Which data

What it’s all about. To my unscientific eyeballing, the existence of most every dataset of interest is flagged. You can find them too, with some patience and use of the tags. Clicking through to the source quickly shows that most aren’t available to us yet in any shape or form, though. That’s fine, I’m sure something like that just takes time.

What would be good, though, is some indication in the navigation or search of which ones are available. Where available means: queryable through the SPARQL interface, or downloadable as an RDF dump.

Getting at the data

The main means through which you get at the data is via the SPARQL query service provided. That works, but has some quirks:

From experience, the convention for such services would be that you’d provide one at http://mysite.org/sparql, and that you’d get a human SPARQL query form if you go there with a browser, and you get an answer if you fire a SPARQL protocol query at it from the comfort of your own SPARQL client.

There is a human interface at https://www.data.gov.uk/sparql, but machine queries need to go someplace else: http://services.data.gov.uk/{optional dataset}/sparql. Confusingly, the latter has a basic but very nice human interface too…

Of more concern is that both spit out either JSON (in theory) or XML, but not RDF. Including XML and JSON is very sensible indeed, because the greatest number of people can suck those formats up and stick them in the mashups they know and love. But for the promises of linked data to work, there is an absolute need for some kind of RDF output.
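
To show why the JSON option is so convenient for mashups, here is a minimal sketch that flattens a SPARQL JSON result document into ordinary Python dictionaries. It assumes the service returns the W3C SPARQL Query Results JSON layout (a head with vars, and results with bindings); the sample document in the usage note is made up.

```python
import json


def bindings_to_rows(text):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts.

    Variables that are unbound in a given solution are simply left out of its dict.
    """
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


if __name__ == "__main__":
    # Made-up sample in the standard results layout.
    sample = json.dumps({
        "head": {"vars": ["s", "label"]},
        "results": {"bindings": [
            {"s": {"type": "uri", "value": "http://example.org/a"},
             "label": {"type": "literal", "value": "Example"}},
        ]},
    })
    print(bindings_to_rows(sample))
```

Note what gets lost along the way: the rows are just strings in a table, not triples, which is exactly why some kind of RDF output matters.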

Exploring the data

In order to formulate the whizzy linked data queries that this stuff is all about, you have to get a feel for what an open data set and its vocabulary look like. That is a bit lacking as well at the moment: you can ask the SPARQL endpoint for types, and then keep poking, but making the whole thing browsable on something like the OpenLink RDF browser would be even better. I stumbled on some ways of exploring vocabularies, but that didn’t seem to allow navigation between concepts just yet.

There is a forum and wiki where people can cooperate on how to work on these issues, but I couldn’t see how to join them, hence the post here.

So?

If you’re getting rather disappointed by now: don’t. As far as I can see, the underlying platform is easily capable of addressing each of the points I stumbled across. More importantly, all the pieces are there for something truly compelling: freely mixable open data. It’s just that not all the pieces have been put together yet. The roof has been pitched, but the rest still needs doing.