The Challenge of ebooks

Yesterday I was in London, along with a group of people with a wide range of experience in digital resource management, OERs and publishing, for a workshop that was part of the Challenge of eBooks project. Here’s a quick summary and some reflections.

To kick off, Ken Chad defined eBooks for the purpose of the workshop, and I guess for the report to be delivered by the project, as anything delivered digitally that is longer than a journal article. I’ll come back to what I think are the problems with that later, but we didn’t waste time discussing it. It did mean that the discussion included things like scanned copies of texts, such as those that can be made under the CLA licence, and the difficulties of managing and distributing them.

For the earliest printed books, or incunabula, such as the Gutenberg Bible, printers sought to mimic the handwritten manuscripts with which 15th-century scholars were familiar; in much the same way, publishers now seek to replicate printed books as ebooks.

The main part of the workshop was organised around a “jobs to be done” framework. The idea of this is to focus on what people are trying to do: “people don’t want a 5mm drill bit, they want a 5mm hole”. I found that useful in distinguishing ebooks in the domain of HE from the vast majority of those sold. In the latter case the job to be done is simply reading the book: the customer wants a copy because they want to read that book, or a book by that author, or a book of that genre, but there isn’t necessarily any motive beyond wanting the experience of reading it. In HE the job to be done (ultimately) is for the student or researcher to learn something, though other players may have a job to do that leads to this, for example providing a student with resources that will help them learn something. I have views on how the computing power in the delivery platform can be used for more than just making the delivery of text more convenient: how it can be used to make the content interactive, to deliver multimedia content, or to aid discussion or simply connect different readers of the same text (I was pleased that someone mentioned the way a Kindle will show which passages have been bookmarked or commented on by other readers).

The issues raised in discussion included rights clearance, the (to some extent technical, but mostly legal) difficulties of creating course packs containing excerpts of selected texts, the diversity of platforms and formats, disability access, and relationships with publishers.

It was really interesting that accessibility featured so strongly. Someone suggested that this was because the mismatch between an ebook and the device on which it is displayed creates an impairment so frequently that accessibility issues are plain for all to see.

A lot of the issues seem to go back to publishers struggling with a new challenge, not knowing how they can meet it and keep their business model intact. It was great to have Suzanne Hardy of the PublishOER project there, with her experience of how publishers will respond to an opportunity (such as getting more information about their users through tracking) but need help in knowing what the opportunities are when all they can see is the threat of losing control of their content. Whether publishers can make the necessary changes to currently print-oriented business processes to realise these benefits was questioned. There are also challenges for libraries in HE, which are used to being able to buy one copy of a book for an institution, whereas publishers now want to sell access to individuals: partly, I guess, so that they can make that link between a user and the content they provide, but also because one digital copy can go a lot further than a single physical copy.

Interestingly, the innovation in ebooks is coming not from conventional publishers but from players such as Amazon and Apple, and from publishers such as O’Reilly and Pearson. (Note that Pearson have a stake in education that includes an assessment business, online courses and colleges, and so go beyond being a conventional publisher.) Also, the drive behind these innovations comes from new technology making new business models possible, not from the evolution of current business, nor, arguably, from user demand.

So, anyway, what is an ebook? I am not happy with a definition that includes websites of additional content created to accompany a book, or pages of a physical book that have been scanned. That doesn’t represent the sort of technical innovation that is creating new and interesting opportunities and the challenges that come with them. Yes, there are important (long-standing) issues around digital content in general, some of which overlap with ebooks, but I will be disappointed if the report from this project is full of issues that could have been written about ten years ago. That’s not because I think those issues are dead but because I think ebooks are something different that deserves attention. I’ll suggest two approaches to defining what that something is:

1. an ebook is what the ebook reading devices and apps read well. By and large that means content in mobi or ePub format. Ebook readers don’t handle scanned page images well. They don’t read most PDFs well (though this depends on the tool and the nature of the PDF: the aim of PDF is to maintain page layout, which is exactly what you don’t want on an ebook reader). Word-processed files are borderline, but most word-processed documents are page-oriented, which raises the same issue as with PDFs. In short, WYSIWYG and ebooks don’t match.

2. an ebook is aggregated content, packaged so that it can be moved from server to device, with more-or-less linear navigation. In the aggregation (which is often a zip file under another extension name) are assets (the text, images and other content that are viewed), plus metadata that describes the book as a whole (and maybe the assets individually), and information about how the assets should be navigated (structural metadata describing the organisation of the book). That’s essentially what mobi and ePub are. It’s also what IMS Content Packaging and its offspring, such as SCORM and Common Cartridge, are; and for that matter it’s what the MS Office and Open Office formats are.
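To make that concrete, here is roughly what you find if you rename an ePub file to .zip and unzip it (a simplified sketch; the directory and file names vary from book to book):

mimetype                   (just the string “application/epub+zip”)
META-INF/container.xml     (points to the .opf file below)
OEBPS/content.opf          (metadata describing the book, a manifest of the assets, and a “spine” giving their reading order)
OEBPS/toc.ncx              (the navigable table of contents)
OEBPS/chapter1.xhtml       (the content itself)
OEBPS/images/figure1.png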

I had a short discussion with Zak Mensah of JISC Digital Media about whether the content should be mostly text-based. I would like to see as much non-text material as is useful, but clearly there is a limit. It would be perverse to take a set of videos, sequence them one after another with a screen of text between each one, like a caption frame in a silent movie, and then call it a book. However, there is something more than text that would still make sense as a book: imagine replacing all the illustrations in a well-illustrated textbook with models, animations and videos. For example, a chemistry book with interactive models of chemical structures and graphs that change when you alter the parameters, or a Shakespeare text with videos of performances in parallel with the text, still makes sense as a book.

[Image: page from the Gutenberg Bible, taken from Wikipedia]

A short update on resource tracking

In our reflections on technical aspects of phase 2 of the UKOER programme, we said that we didn’t understand why projects weren’t worrying more about tracking the use and reuse of the OERs they released. The reason for this was that if you don’t know how much your resources are used, you will not be in a good position to sustain your project after JISC have stopped funding it. For example, how can you justify the effort and cost of clearing resources for release under a Creative Commons licence unless you can show that people want their own copies of the resources you release, rather than just viewing the copy on your own server? Here is a quick update on projects related to resource tracking.

TrackOER
Under the OER Rapid Innovation programme, JISC have funded the TrackOER project. It was known from the outset that the project would start slowly, but in the last couple of weeks it has got some momentum going. The nub of the problem they are looking at is that

when an OER is taken from its host or origin server, in order to be used and reused the origin institution and the community generally lose track of it.

Building on work by Scott Leslie, their prospective solution is the use of a web bug/beacon: an image (normally invisible, though TrackOER may use the Creative Commons licence badge) embedded in the resource but hosted by whoever is collecting the stats (let’s say the OER publisher). So long as the image is not removed, whenever the resource is loaded a request will be sent to the publisher’s server for that image, and that request can be logged. Additional information can be acquired by appending ?key1=value1&key2=value2… onto the src URL of the img element in the resource; anything after the ? is logged in the server logs but does not affect the image that is served. For example, you could encode an identifier for the OER like this:

<img src="http://example.com/tracker.png?oerID=1234">

TrackOER are investigating the use of Google Analytics and the open source alternative Piwik (both with and without JavaScript, maybe) for the actual tracking. One of their challenges is that both normally assume that the person doing the tracking knows where the resource is, i.e. it will be where they put it, whereas with OERs one of the things most worth knowing is whether anyone has made a copy of your resource somewhere else. However, if you use JavaScript you have access to this information and can write it to the tracking image URL. Another challenge with using Creative Commons licence images instead of an invisible tracking bug is that several images are used for tracking, not just one. TrackOER have modified Piwik to allow for the use of multiple alternative images.
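As a sketch of what that JavaScript approach might look like (this is not TrackOER’s actual script, just an illustration of the technique using the made-up tracker URL and identifier from the example above):

<script>
// Build the tracking image URL at load time, appending the address of the
// page this copy of the resource currently sits on, so the logging server
// can see where copies live. Assumes the script sits at the end of the body;
// the tracker URL and oerID are made up.
var img = new Image();
img.src = 'http://example.com/tracker.png'
  + '?oerID=1234'
  + '&location=' + encodeURIComponent(document.location.href);
document.body.appendChild(img);
</script>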

As an aside, TrackOER have also found a service called Stipple; they say:

using Stipple to track OER across the web in the same way as the TrackOER script is perfectly feasible. It might even be easy. You could get richer analytics as well as access to promotional tools.

OER tracking at Creative Commons
Creative Commons have posted three ideas for tracking OERs, two of which use a mechanism they call refback, and one of which provides an API to data they acquire as a result of people linking to their licences and using images of licence badges served from their hosts. In all cases it is a priority to avoid anything that smacks of DRM or of excessive and covert surveillance, which is understandable given that Creative Commons as an organisation is a third party between resource user and owner and cannot do anything that would risk losing the trust of either.

Refback tracking involves putting a link in the resource being tracked back to the site doing the tracking (the two variants are that this may be either the publisher or Creative Commons, i.e. independent and distributed, or hosted and centralised). If a curious user follows that link (and the assumption is that occasionally someone will), the tracking site will log the request for the page to which the link goes; included in the log information is the “referrer”, i.e. the URL of the page on which the user clicked the link. An application on the tracking site can then work through this referrer log and fetch the pages at any URLs it does not recognise, to ascertain (e.g. from the attribution metadata) whether they are copies of a resource that it is tracking.

The third approach involves Creative Commons logging the referrers for requests for copies of their licence badges, and then looking at the attribution metadata on the web pages in which the badges were embedded, to build up a graph of pages that represent re-use of one another. This information would be hosted on Creative Commons servers and be available to others via an API.

Where does schema.org fit in the (semantic) web?

Over the summer I’ve done a couple of presentations about what schema.org is and how it is implemented (there are links below). Quick reminder: schema.org is a set of microdata terms (itemtypes and properties) that the big search engines have agreed to support. I haven’t said much about why I think it is important, or about the related question of what it is for.
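As a reminder of what that looks like in practice, here is a minimal, made-up example describing a book with schema.org microdata:

<div itemscope itemtype="http://schema.org/Book">
  <span itemprop="name">The Name of the Rose</span> by
  <span itemprop="author">Umberto Eco</span>
</div>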

The schema.org FAQ answers the “what is it for?” question with:

…to improve the web by creating a structured data markup schema supported by major search engines. On-page markup helps search engines understand the information on web pages and provide richer search results. … Search engines want to make it easier for people to find relevant information on the web.

So, the use case for schema.org is firmly anchored around humans searching the web for information. That’s important to keep in mind, because when you get into the nitty-gritty of what schema.org does, i.e. identifying things and describing their characteristics and relationships to other things in the context of the web, you are bound to run into people who talk about the semantic web, especially because the RDFa semantic web initiative covers much of the same ground as schema.org. To help understand where schema.org fits into the semantic web more generally, it is useful to think about what various semantic web initiatives cover that schema.org doesn’t. Starting with what is closest to schema.org, this includes: resource description for purposes other than discovery; descriptions not on web pages; data feeds for machine-to-machine communication; interoperability for raw data in different formats (e.g. semantic bioinformatics); and ontologies in general, beyond the set of terms agreed by the schema.org partners, and their representation. RDFa brings some of this semantic web thinking to the markup of web pages, hence the overlap with schema.org. Thankfully, there is now an increasing overlap between the semantic web community and the schema.org community, so there is an evolving understanding of how they fit with each other. Firstly, the schema.org data model is such that:

“[The] use of Microdata maps easily into RDFa Lite. In fact, all of Schema.org can be used with the RDFa Lite syntax as is.”

Secondly there is a growing understanding of the complementary nature of schema.org and RDFa, described by Dan Brickley; in summary:

This new standard [RDFa1.1], in particular the RDFa Lite specification, brings together the simplicity of Microdata with improved support for using multiple schemas together… Our approach is “Microdata and more”.
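To make that mapping concrete, the book example above could be written in RDFa Lite like this (again a made-up sketch, with vocab, typeof and property doing the job of microdata’s itemtype and itemprop):

<div vocab="http://schema.org/" typeof="Book">
  <span property="name">The Name of the Rose</span> by
  <span property="author">Umberto Eco</span>
</div>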

So, if you want to go beyond what is in the schema.org vocabulary then RDFa is a good approach, and if you’re already committed to RDFa then hopefully you can use it in a way that Google and other search engines will support (if that is important to you). However, schema.org was the search engine providers’ first choice when it came to resource discovery, at least in the chronological sense. Whether it will remain their first preference is moot, but in the same blog post mentioned above they make a commitment to it that (to me at least) reads as stronger than what they say about RDFa:

We want to say clearly that we continue to support Microdata

It is also interesting to note that schema.org is the search engine companies’ own creation. It’s not that there is a shortage of other options for embedding metadata into web pages: HTML has always had meta tags for description, keywords, author and title; yet not only are these not much supported, the keywords tag especially can be considered harmful. Likewise, Dublin Core is at best ignored (see Invisible institutional repositories for an account of the effect of the use of Dublin Core in Google Scholar, but note that Google Scholar differs in its use of metadata from Google’s main search index).

So why create schema.org? The Google schema.org FAQ says this:

Having a single vocabulary and markup syntax that is supported by the major search engines means that webmasters don’t have to make tradeoffs based on which markup type is supported by which search engine. schema.org supports a wide collection of item types, although not all of these are yet used to create rich snippets. With schema.org, webmasters have a single place to go to learn about markup for a wide selection of item types, search engines get structured information that helps improve search result quality, and users end up with better search results and a better experience on the web.

(NB: this predates the statement quoted above about the “Microdata and more” approach.)

There are two other reasons I think are important: control and trust. While anyone can suggest extensions to, and comment on, the schema.org vocabulary through the W3C web schemas task force, the schema.org partners, i.e. Google, Microsoft (Bing), Yahoo and Yandex, pretty much have the final say on what gets into the spec. So the search engines have a level of control over what is in the schema.org vocabulary. In the case of microdata they have chosen to support only a subset of the full spec, and so have some control over the syntax used. (Aside: there’s an interesting parallel between schema.org and HTML5 in the way both were developed outwith the W3C by companies who had an interest in developing something that worked for them, and were then brought back to the W3C for community engagement and validation.)

Then there is trust, the icing on the old semantic web layer cake (or perhaps the cake is upside down, and the web needs to be based on trust?). Google, for example, will only accept metadata from a limited number of trusted providers, and then often only for limited use, for example publisher metadata for use in Google Scholar. For the world in general, Google won’t display content that is not visible to the user. The strength of the microdata and RDFa approach is that what is marked up for machine consumption can also be visible to the human reader; indeed, if the marked-up content is hidden, Google will likely ignore it.

So, is it used? By the big search engines, I mean. Information gleaned from schema.org markup is available in the XML that can be retrieved using a Google Custom Search Engine, which allows people to create their own search engines for niche applications, for example jobs for US military veterans. However, it is its use on the main search site, which we know is the first stop for people wanting to find information, that would bring significant benefits in terms of the ease and sophistication with which people can search. Well, Google and co. aren’t known for telling the world exactly how they do what they do, but we can point to a couple of developments to which schema.org markup surely contributes.

First, of course, is the embellishment of search engine result pages that the “rich snippets” approach allows: inclusion of information such as author or creator, ratings, price and so on, and filtering of results based on these properties. (Rich snippets is Google’s name for the result of marking up HTML with microdata, RDFa and the like; the rich snippets programme predates and has evolved into the schema.org initiative.)
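As an illustration, markup along these lines (a made-up product listing using the Product, AggregateRating and Offer types) is the kind of thing that feeds those embellishments:

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Anvil</span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    rated <span itemprop="ratingValue">4.2</span> out of 5
    by <span itemprop="reviewCount">38</span> reviewers
  </div>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    price: <span itemprop="price">12.99</span>
    <meta itemprop="priceCurrency" content="GBP">
  </div>
</div>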

Secondly, there is the Knowledge Graph which, while it is known to use Freebase and seems to get much of its data from DBpedia, has a “things not strings” approach that resonates very well with the schema.org ontology. So perhaps it is here that we will see the semantic web approach and schema.org begin to bring benefits to the majority of web users.

See also