semweb

The Cranky Librarian's Guide to OAI-ORE

0. This Cranky Librarian is tired of the promulgation of library standards that attempt to define things in ways that don't fit the way everybody else does things on the web. I would much rather see as few new concepts and protocols as possible introduced as part of new library standards development. Much like the reductionist philosophy of the Microformats project, the more we make what we want to build work in a way that fits naturally into what people already do on The Nets, the better chance we will have at gaining adoption and wide use for our new specs. I'm particularly Cranky when I have to find or to write my own software to handle arcane or obscure library concepts. I uncrankificate when I find that some part of a library specification actually just leans on some already widely known protocol or spec and the amount of original work necessary to put the new spec to use becomes simply a matter of developing a lightweight new application using some software toolkit I already know. Because of this, and because their goals are important, I'm excited about the OAI-ORE specification, whose authors purport to want to do things this uncrankish way. I attended their "here's what we've got so far" meeting this week, and these are my Cranky thoughts.

1. OAI-ORE sets out to enable us to identify and describe aggregations of web resources. Not just "sites", or "pages", but those funky sets of multiple things that the web architecture doesn't explicitly speak to: "resources made up of other resources". This is an excellent pair of objectives - identifying and describing aggregations - because sometimes these aggregations need to be cited as a whole, or moved as a whole, or versioned or chopped up or crawled or otherwise manipulated in some way that acknowledges that "yeah, these separate bits of things actually fit together as a molecular unit in this way with this name at some point and based on that I can do stuff with it more usefully." Wonderful.

2. OAI-ORE is being developed by a stellar crew of Smart People. Many of whom have PhDs. All of whom have developed well-known standards before, among other well-known work in their fields of expertise. Some of whom I've had the good fortune to meet, get to know, and work with over the years, and of whom I can privately ask a blunt question in sincere expectation of getting an honest answer. Terrific. Though some of the specs these same folks wrote before have been the kind that make me Cranky. And although they aspire to keeping the development of OAI-ORE open for community discussion, their public discussion archives are clearly not where the Real Work has been done on this spec to date. Frustrating. Herein my Cranky rant.

3. OAI-ORE defines what they mean by "aggregation", and its description in a "resource map", in an abstract data model. This is the first and most important OAI-ORE document you can read, since it lays out all the ideas. So we should read it very carefully to understand it well. This document repeats a whole chunk of the Introduction from the TOC in its own Introduction, which is confusing. For some reason they also try to explain the web architecture, the semantic web, rdf, and named graphs as part of the next section, "Architecture Foundations". These concepts are defined elsewhere. This attempt to restate their definitions here expands the scope of this document, not to mention that they've mixed core, proven architectural generalizations based on years of experience at scale (about the web architecture) with provisional generalizations about the semantic web as if the knowledge of the semantic web is driven by years of experience at scale (which nobody has). To me, this is where OAI-ORE starts to veer off the tracks and make me Cranky. I would rather see the Introduction merely say "we are building on the well-understood notions of Resource, URI, Representation, and Link as defined in the [web architecture docs], and the developing notions of named graphs and the semantic web as understood to date in [semweb docs]." This would be more honest, recognizing that they are attempting to leverage both things we all know and other things some of us think we know all at once, and together. And the abstract data model document would grow much shorter. Both of which would make me less Cranky.

4. Now it gets meaty. Sections 3 and 4 of the 0.2 version of the abstract data model are the meat, potatoes, and apple pie of OAI-ORE, where they define what they mean by an aggregation and a resource map. Slow down. Read these three times. Draw a picture or two. Read them again. This is the important part. It's only one-and-a-half screenfuls in my browser, which is nice. To me, this is precisely the place where OAI-ORE fully falls off-track. In my advanced state of Crankiness, I don't want to have to think this hard. Largely because I've learned over the years that as a librarian I will wade through this kind of thing patiently and repeatedly until I Think I Get It, but that that's a perverse instinct born of insecurity and geek power that most Normal People don't share. If I have to work hard to understand it, somebody else won't want to work so hard, and both of us will be left not understanding, and fewer people will use the spec. This is my standard for standards. Can I understand it readily? Will other people? If the answer to either question is "no", I've learned, it's best to Just Move On because People Don't Read Stuff Anymore. Blame market forces. Blame some despot leader of a failed state. Blame HBO's new series In Treatment featuring Gabriel Byrne, new episodes every weeknight! But people don't read stuff. So we need to give them less to read, and make what we give them readily understandable.

5. Here's how I think I'd do that. The goals of OAI-ORE are (a) identifying and (b) describing resource aggregations. To do that with the 0.2 specs you have to wrap your head around ReM, URI-R, URI-A - three new acronyms. They expand to four new concepts: aggregation, aggregated resource, resource map, and resource map document. These are thoughtfully marked in BOLD TEXT so you don't miss them. The entire rest of this document (section 5) goes into great detail about how these new concepts interrelate, and it gets confusing quickly. If you want to jump straight to the confusing part, see section 5.8, which also introduces URI-S, URI-P, and URI-O, which are acronyms for RDF triple roles, and a table of the cardinality of relationships between all of these concepts. I understand all of these concepts on a basic level but am still struggling to understand how they really fit together and why this abstract data model needs to be talking about anything other than aggregations of resources as described by resource maps and identified by URIs. And why concepts as core to the semweb such as How You Connect a Graph of Ss, Ps, and Os where some of those things are really ORE things need to be spelled out here in this abstract data model. I would prefer to see all of this stuff removed from this document.

6. Version 0.Cranky.3 of ORE's abstract data model document has two sections - the introduction and the definitions of an aggregation and resource map. That's all we're doing here, right? So let's just do that. "In ORE we describe aggregations with a resource map, which may be specified in several compatible ways, and which must be discoverable via an explicit URI. For humans, this provides a bookmarkable or citable reference; for machines, this provides access to a named graph for further automated processing." Describe the internal logic of the graph in another document, and then only people who want to further process this stuff automatically have to read that much further, but everybody gets the core concepts: aggregations are described in resource maps which are discoverable via URIs and rendered in several compatible ways. That's the abstract abstract data model, right? So let's stop that document right after we say that, and then point people at separate docs explaining the core resource map concepts (the vocabulary), the internal structure of graph-based processing of resource maps (section 5 of the current abstract data model), the recommended discovery techniques, and other serialization examples.

(By now I'm violating my suggested reader crankiness rules, so I'll be briefer. The above is my main point, the secondary main point follows in 8 and 9.)

7. The vocabulary document is easier to understand, and should be the second thing people read after the data model. My beefs with it are simple: (a) why consider a resource map document a separate entity? What we really need is compatible understandings of the conceptual makeup of resource maps across diverse software implementations working with equivalent resource maps expressed in different serializations. The semantic web is all about triples adding up to a graph, right? Let us work hard to implement test suites and sample data that allow developers to build diverse but equally reliable digital libraries that understand resource maps compatibly. The document is incidental to a resource map's discoverability and its internal model. (b) why define "aggregates" and "isAggregatedBy" when you can instead define "type resourcemap" and lean on "hasPart" and "isPartOf"? (c) I think Rob S. is onto something useful with his notion of a "resource in context", where a context can be defined with a date-of-access timestamp and that context can be used down (up?) the "scholarly value chain" by others to refer to what they read when they read it. But this concept isn't in the current specs.

8. If ORE is to be successful at web scale (why else follow principles of the web architecture?), which I would like it to be, then it has to be useful to normal people using ordinary browsers to save and share bookmarks and read text in HTML. Having worked a medical library reference desk for a few years I'm guessing the most important, most common, and simplest use case for ORE is to describe a "journal article" as an aggregation of "this html page, that pdf, these references, and those images", which is *what online journals already do*. This consistent expression of the molecular nature of an aggregation - in this case "all that journal article comprises" - can and should be equally manifest via microformat-style semantic html, rdfa-style embedded description, autodiscovery-style links to alternate expressions, and direct access to atom or rdf serializations. I'm ordering that list that way because I think that's the ordering most likely to be actually experienced by people, because that acknowledges what's already being actually experienced by people: first and foremost, through little blocks of html in the top corner of a journal article page. Let's give publishers a way to keep doing what they're already doing but also newly define those blocks consistently with an html pattern that allows processing environments to find those blocks and treat their contents like any other resource map. Let's do that first, because that's pretty much how everybody already does this. Like I wrote before, the "resource map document" is incidental - what matters is the resource map, and that should be made able to be discovered and understood unambiguously through a microformat pattern, rdfa, atom, or rdf.
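To make the journal-article case concrete, here is a minimal sketch of that aggregation as plain subject/predicate/object triples. Everything named here is an assumption for illustration: the example.org URIs are made up, the `ResourceMap` class URI is hypothetical, and the part relations borrow Dublin Core's `hasPart` and `isPartOf` rather than anything the ORE drafts define.

```python
# All URIs below are illustrative assumptions, not taken from the ORE specs.
ARTICLE = "http://example.org/article/123"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RESOURCE_MAP = "http://example.org/vocab#ResourceMap"  # hypothetical class
HAS_PART = "http://purl.org/dc/terms/hasPart"
IS_PART_OF = "http://purl.org/dc/terms/isPartOf"

# "this html page, that pdf, these references, and those images"
PARTS = [
    ARTICLE + "/fulltext.html",
    ARTICLE + "/fulltext.pdf",
    ARTICLE + "/references.html",
    ARTICLE + "/figure1.png",
]


def build_resource_map(map_uri, part_uris):
    """Return a list of (subject, predicate, object) URI triples."""
    triples = [(map_uri, RDF_TYPE, RESOURCE_MAP)]
    for part in part_uris:
        # State the relation in both directions so either end can be a starting point.
        triples.append((map_uri, HAS_PART, part))
        triples.append((part, IS_PART_OF, map_uri))
    return triples


def to_ntriples(triples):
    """Serialize URI-only triples as N-Triples lines."""
    return "\n".join("<%s> <%s> <%s> ." % triple for triple in triples)


if __name__ == "__main__":
    print(to_ntriples(build_resource_map(ARTICLE, PARTS)))
```

The same list of triples could just as well be rendered as a block of html, rdfa, or atom; the point of the sketch is only that the map itself is small.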

9. If that makes sense, then the discovery document becomes much more important. And a section like its 5 (on "Methods Not Recommended for ReM Discovery"), which basically says "this well-known web publishing pattern over here, which seems so logical and useful that popular modern web frameworks are baking it in to their cores (note: link broken at time of writing, but I think that's the right link), yeah, that's the one, well, that, and yknow, SIMPLE HTML LINKS, you can't do those here. Nope, they're forbidden", can just go away. Lovely. Much less Cranky, I am.
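For a sense of how little machinery simple-link discovery would need, here is a rough sketch using only the Python standard library. The `rel` value `resourcemap` is an assumption made up for this example, not something the discovery document specifies, and the sample page is invented.

```python
from html.parser import HTMLParser

# Assumed link relation; the real spec may name something else entirely.
ASSUMED_REL = "resourcemap"


class ResourceMapLinkFinder(HTMLParser):
    """Collect (href, type) pairs from <link> and <a> tags carrying the assumed rel."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values
        rels = (attr.get("rel") or "").lower().split()
        if ASSUMED_REL in rels and attr.get("href"):
            self.found.append((attr["href"], attr.get("type")))


def find_resource_maps(html_text):
    finder = ResourceMapLinkFinder()
    finder.feed(html_text)
    return finder.found


SAMPLE = """<html><head>
<link rel="resourcemap" type="application/atom+xml" href="/article/123/rem.atom">
</head><body>
<a rel="resourcemap" href="/article/123/rem.rdf">Resource map</a>
<a href="/about">About</a>
</body></html>"""

if __name__ == "__main__":
    for href, media_type in find_resource_maps(SAMPLE):
        print(href, media_type)
```

Both the autodiscovery-style `<link>` in the head and the ordinary `<a>` in the body are picked up; the unrelated "About" link is ignored.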

10. Forgive me for repeating myself, but here's what I think ORE should say: "ORE specifies how to describe aggregations with a resource map, which may be rendered in several compatible ways, and which must be discoverable via an explicit URI. For humans, this provides a bookmarkable or citable reference; for machines, this provides access to a named graph for further automated processing." Then it should lay out the core vocabulary, the discovery mechanisms, and sample renderings as microformattish html, rdfa, atom, and rdf/*, and provide a suite of web-accessible named test documents in all renderings to allow developers of ORE processing tools to know (a) what they have to do to make their code understand ORE resource maps compatibly and (b) how to communicate with other developers about their implementations and (c) when their code is done.

11. This rant does not Crank to 11.

Ongoing questions about linked data and the semantic web

I'm getting a bit further along in trying to understand how the pieces are supposed to fit together. I don't have immediate answers but I've found one particular use case that I think is a big win for linked data. If I can assemble a useful working implementation I'll write about it here.

In the meantime, I keep stumbling over the following sticking points. I'm searching for answers (even for the ones not framed as questions), and would gladly take any advice or suggestions.

  • I love the *idea* of linked data but I'm not sure I can buy into the current state of the art in how to best link data. In particular, it seems like "sameAs" claims should be jumping off points for human judgement, rather than being presumed to be automated declarations of equivalence. Let's automate bridging the human judgement pieces... that'd be interesting.
  • I have never understood FOAF. It seems like a fine way to serialize a cult-of-personality network (e.g. "see? i'm only two steps from timbl himself!!") Similarly I don't get the whole "social graph" buzz either. I'm not a marketer looking to harvest customer data. I'm not doing any affinity indexing just now. What other use is there for saying who my friends are, besides those two?
  • Does the linked data movement really depend upon RDF? It doesn't seem like it has to. Maybe it could grow faster if it didn't.
  • The info resource / non-info resource dichotomy doesn't fit my brain. (Wherein everything is always a representation, and sometimes I can only share description, but that description is as important as any other representation, because surrogation is really important too.) It's been pointed out to me that this is still controversial. I can understand why.
  • If blank nodes are bad (end of the section), how do I represent sets of literals that mean the same thing but are expressed in different languages? I need to do that right now and I can't figure out how without blank nodes.
  • I'm still mainly interested in Description (talking about things) and am completely uninterested in modeling knowledge (what things are and mean). I keep finding examples where arguments about best practices hinge on notions of essential truths ("is it a resource that can be dereferenced on the web? a dog is not, so it's a non-information resource") that simply never matter in the work I do (I'm a librarian and I want to improve ways to organize and provide access to stuff). I care about systems that help people come to their own judgements about what things are and what they mean, and in particular I care about systems that allow a wide range of people to come to a wide range of these judgements. If I have to start a system about things that aren't on the web by accepting truth-based categorizations about web availability, I'm shoehorning my system into an oddly-structured container from day zero. I think I don't want to do that. Granted, we've had to do that for centuries in libraries (see Books, Oversized, or the "basement full of gifts the President received while in office" in any presidential library/archives for good examples), but we make do for weird examples and still put "most stuff" in "standard, appropriate containers" (e.g. ordinary shelves and file boxes), rather than building whole systems around odd catch-all structures (basements full of stuff).
  • er, that last one was over-long, so I'll try it this way instead. I think I'm interested in Linked Description, not Linked Data.
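On the blank-node question in the list above, one possible reading can be sketched without any blank nodes: RDF literals can carry language tags, so several literals can hang off the same subject and predicate. This is only a sketch under assumptions. The subject URI and labels are made up, `rdfs:label` is just one plausible predicate, and it may not cover the case where the set of literals itself needs its own properties.

```python
# Hypothetical subject; rdfs:label is one plausible choice of predicate.
SUBJECT = "http://example.org/subject/cats"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# The same meaning expressed in different languages (illustrative data).
LABELS = [("cats", "en"), ("chats", "fr"), ("Katzen", "de")]


def label_triples(subject, labels):
    """One triple per language; the object is a (text, language tag) pair."""
    return [(subject, LABEL, (text, lang)) for text, lang in labels]


def to_ntriples(triples):
    """Serialize language-tagged literal triples as N-Triples lines."""
    lines = []
    for subject, predicate, (text, lang) in triples:
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('<%s> <%s> "%s"@%s .' % (subject, predicate, escaped, lang))
    return "\n".join(lines)


def label_for(triples, lang):
    """Return the label text for a language tag, or None if there is none."""
    for _subject, _predicate, (text, tag) in triples:
        if tag == lang:
            return text
    return None


if __name__ == "__main__":
    triples = label_triples(SUBJECT, LABELS)
    print(to_ntriples(triples))
    print(label_for(triples, "fr"))
```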

A colleague I trust and always learn from tells me that I should stop saying snarky-seeming things like the above in private channels and should instead say them publicly. It's weird - when some friends and correspondents hear me going on like I do in this post, they seem to presume that I'm saying I hate the semantic web or am bad-mouthing RDF. I don't intend to do either of those things. I'm seriously just trying to understand their place and whether they can help me and my colleagues in the work we're doing. After ten years of keeping an eye on them, and having real problems in front of us to solve right now, I need to know whether this path is going to help me right now. It seems to have a lot of potential, but I don't see many public examples of current implementations that solve real problems for people. I've read about a few private ones that do sound promising, but I can't see them for myself, so I have to dismiss them. There are cool linked data sites that offer up good data, but I haven't found many that really require the RDF/OWL/etc. flavor they're offering (i.e. they could just offer up their data in other formats and might be just as effective for data linking over time).

So I'm not trying to be dismissive or insulting, though I admit that sometimes I just am, and jerkily so. I'm trying to avoid that now by posting more thoughtfully here. I still need help understanding why my thinking might be wrong on the issues I listed above, but if the helpers take a defensive stand without backing up claims with working sites, I'm going to keep questioning the state-of-the-received wisdom. Please don't take it personally.

Somewhere between Anger and Bargaining

I've been doing some homework, trying to learn what RDF is all about, starting from suggested reading linked from the comments on my earlier post. So far I have a few basic questions and comments:

  • I get that things "started with" RDF/XML. But why do we need anything more than N3? It's not the simplest grammar in the world, but it's readable, compact, featureful, and it's what people seem to use when they actually talk about RDF. On top of that, it's just text. These seem like winning features.
  • I am befuddled by the sheer number of RDF examples that attempt to "model knowledge". What if I really just want to "describe resources"? Aren't those completely different activities? I think I want a framework for talking about stuff - not for representing stuff itself. Instinctively I'd guess this reflects the whole AI/KR "I Know What You Mean" heritage of hype, but we're not talking about the "Knowledge Representation Framework" here, right?
  • Somebody please hire Joe Celko (or a sufficiently advanced AI thereof) to write us a "SPARQL for Smarties".

And another thing, hard to form as a question. This nearly-machismistic talk about "how many triples? HOW MANY?" feels like a distraction. Every bit of recent experience I have building data-backed apps with a lot of data coming in the door has taught me to (a) keep the data just like you got it, (b) build a way to extract enough from it to run your app, and (c) optimize the extracted bits to meet user requirements. The separation of steps from (a) to (c) means you can always swap out (b) and (c) from the original data in (a), keeping the sourced data safe over time. Especially because over time (b) and (c) just get easier and easier (faster machines, more RAM, better web frameworks, etc.). If this is common understanding (is it? it's what I know, now, at least), then what's the point of having a live environment for 2,000,000,000 triples? Is that really the only useful way you can query that kind of pile of data in all its semantic glory?

Instinctively I don't want to believe that that's true. I get that some applications are really about accumulating more and more data, and that that can get pretty big pretty quickly. But the triple model seems inherently optimized for flexibility, and as apps/data get bigger and bigger, you want to optimize for efficiency along a few known paths, and I can only imagine you'd again want to (b) extract enough from the source data to run your app. Which, I presume, would mean something very different from 2,000,000,000 triples.

Does that make any sense? Maybe I'm not stating my concern clearly... or just don't yet get something fundamental about all this (very likely!).
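Here is a toy sketch of the (a)/(b)/(c) separation described above, with a handful of made-up triples standing in for the big pile. The prefixed names (`ex:`, `dc:`, `foaf:`) and the "titles by creator" access path are assumptions chosen only to illustrate the shape of the idea.

```python
from collections import defaultdict

# (a) Source data kept exactly as received (made-up triples for illustration).
RAW_TRIPLES = (
    ("ex:book1", "dc:title", "Moby-Dick"),
    ("ex:book1", "dc:creator", "ex:melville"),
    ("ex:book2", "dc:title", "Typee"),
    ("ex:book2", "dc:creator", "ex:melville"),
    ("ex:melville", "foaf:name", "Herman Melville"),
)


def extract(triples, wanted_predicates):
    """(b) Pull out only the statements this app needs."""
    return [t for t in triples if t[1] in wanted_predicates]


def build_index(extracted):
    """(c) Optimize for one known access path: titles by creator."""
    titles = {}
    creators = {}
    for subject, predicate, obj in extracted:
        if predicate == "dc:title":
            titles[subject] = obj
        elif predicate == "dc:creator":
            creators[subject] = obj
    index = defaultdict(list)
    for subject, creator in creators.items():
        if subject in titles:
            index[creator].append(titles[subject])
    return {creator: sorted(found) for creator, found in index.items()}


if __name__ == "__main__":
    needed = extract(RAW_TRIPLES, {"dc:title", "dc:creator"})
    print(build_index(needed))
```

Steps (b) and (c) can be thrown away and rebuilt at any time because (a) is never modified; the app queries the small index, not the full set of triples.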

Will I need to understand the Semantic Web in 2008?

Lately I've been thinking a lot about alternate metadata universes where things might look rather different from our libraries' one item => one record world. The thing is, every time I reach some intermediate conclusion about it, the only people I can find who are thinking the same ways seem to be Semantic Web People, or at least people whose blogs/projects I follow, an overlapping set with people from our own profession who care about these things and have already drunk the SemWeb punch to some degree. They tend to call things by names different from what my brain wants to assign them, but no matter, so long as the URIs don't change, I suppose.

On the other hand, if somebody asks, today, in January 2008, "where is the Semantic Web?", I, at least, as a neophyte, have no idea how to answer. Except maybe to suggest that the Web was around for years before People (capital 'P' as in world-scale counts of "people") did interesting things with it and it appeared on a machine near me, and maybe we're now in a similar intermediate phase, so stay tuned, eh.

Which leads me to wondering - is now the time for all good library hackers to come to grips with the state of the SemWeb art? Have we crossed some tipping point?

I know which software libraries to try out, I have some data to muck around with, and there's plenty of interesting work on linked data, so there's something to start with. And it feels like it's time to start. So maybe now's the time?

This site is Copyright (c) 2005-2008 by Daniel Chudnov. All rights reserved.

All opinions stated here are my own, and do not reflect those of my employer.