oaiore

In like a l10n

It's been a busy few weeks. A week in Portland for code4libcon 2008. A jaunt up to Baltimore for the OAI-ORE meeting. Most of a week in Chicago (er, well, O'Hare, mostly) for PyCon 2008. Ordinarily I'd write up blurbs about the two 'cons at least as substantial as the one I wrote for the ORE meeting, but I'm still pretty tired from it all and my absorption rate has slowed.

Some quick hits, then:

  • We managed to not screw up code4libcon yet again. Undeniably it was better in many ways than before: more people, a better gender balance, lots of new voices, more preconferences and an unpreconference that didn't suck, a fun time in a great city, lots of connections made, faces put to names, and by-now-old friends caught up with. We still have to deal with the "yeah well what about next year" decision-making problem, which is bordering on intractable.
  • March 12, 2008 marked the 15-year anniversary of the beginning of my year-long treatment for bone cancer. I've long meant to write more about that year, and it feels like now might be a good time to do that. Those of you who know me know most of the story, but I think it's probably worth telling in full. The short of it is that I shouldn't really be here now, quite simply, but here I am, and in the intervening years I've fallen in love and married, met many amazing people (scientists, musicians, clergypeople, scholars, lots of great neighbors and other friends, hackers, and more than my fair share of sexy librarians), travelled to several interesting corners of the world, buried too many people and fortunately seen many more people than that arrive into this crazy world and grow up quite a lot, hung out quite a lot at some of the world's greatest institutions of higher learning and knowledge preservation, started more than my fair share of weblogs, celebrated my teams' hockey, football, and basketball championships, seen my name and my ugly mug in print repeatedly including being indexed in MEDLINE, taught a roomful of 14-year-olds the basics of a language spoken halfway around the world, racked up quite a lot of debt but somehow paid most of it off already, and found a way to make a surprisingly good and rewarding living doing something I honestly love that somehow helps many people more than it hurts them.
  • PyCon 2008 was a great event. The Python community is struggling with its size (over 1K people this year!) and the event needs tweaking to stay true to its roots as a community- and volunteer-run event. But on the other hand, for an event with so many people, I got to meet pretty much everybody I'd hoped to, I saw more great talks than crappy ones, and I lucked into many neat connections with different folks I hope to be able to work with or who knew somebody I knew or with whom we could just find lots of fun things to talk about over a beer and a pizza or a laptop and a text editor (and sometimes all four!). And though it'll be difficult because the two events are so close to each other, I think that in the next year I want to devote some time to helping out with planning and running at least some small piece of PyCon and probably do accordingly less of that for code4libcon. My perfectionist streak has less chance to assert itself in a roomful of 1000 people who were there before me and don't know me than it does in a roomful of 200 people where I know most everybody, and they're both pretty important to me professionally and personally.
  • I'm reading a terrible book by David Baldacci. Yeah, that one. I would tell you not to bother, because it really is awful, but for some twisted reason I can't stop. Can anyone explain that?
  • At both PyCon and code4libcon I was able to give lightning talks to several hundred people where I showed off the project I've spent the most time on at my new job, the World Digital Library. After several months of down time on that project I'm hoping we might get started on it again soon, mainly because the response to it that I've heard has been overwhelmingly positive. How many people get paid to work on something that is essentially a way to help people all over the world learn more about each other, and has a genuine chance of succeeding?
  • Related to that, and now that I know more about the coming python-3.0 release, I think I need to figure out how to handle Unicode Normalization in py3k, and come up with a solution myself soon if the obvious options don't make immediate sense. [Update (3/20): I'm an idiot. Of course this is already easy.]
  • My new year's resolution for 2008 is to teach 1000 people how to program. I think I know how that needs to go now, and I'll get started soon. Stay tuned!
  • It's really time for me to finally lose that extra weight. It never matters more than during travel. I don't travel so much anymore, but when I do (and I might not again until next year), these extra pounds I'm trucking about just beat me up. Feh!
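
On the Unicode Normalization item above: a minimal sketch of why it is "already easy", assuming nothing beyond the standard library's unicodedata module. The sample strings are only an illustration, not anything taken from the World Digital Library.

```python
import unicodedata

# The same visible character written two ways:
# one precomposed code point vs. "e" plus a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# They look alike on screen but compare unequal until normalized.
print(composed == decomposed)

# NFC composes, NFD decomposes; pick one form and apply it consistently.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

Which form to store is a design choice; the important part is normalizing before comparing or indexing text.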

The Cranky Librarian's Guide to OAI-ORE

0. This Cranky Librarian is tired of the promulgation of library standards that attempt to define things in ways that don't fit the way everybody else does things on the web. I would much rather see as few as possible new concepts and protocols introduced as part of new library standards development. Much like the reductionist philosophy of the Microformats project, the more we make what we want to make work happen in a way that fits naturally into what people already do on The Nets, the better chance we will have at gaining adoption and wide use for our new specs. I'm particularly Cranky when I have to find or to write my own software to handle arcane or obscure library concepts. I uncrankificate when I find that some part of a library specification actually just leans on some already widely known protocol or spec and the amount of original work necessary to put the new spec to use becomes simply a matter of developing a lightweight new application using some software toolkit I already know. Because of this, and because their goals are important, I'm excited about the OAI-ORE specification, whose authors purport to want to do things this uncrankish way. I attended their "here's what we've got so far" meeting this week, and these are my Cranky thoughts.

1. OAI-ORE sets out to enable us to identify and describe aggregations of web resources. Not just "sites", or "pages", but those funky sets of multiple things that the web architecture doesn't explicitly speak to: "resources made up of other resources". This is an excellent pair of objectives - identifying and describing aggregations - because sometimes these aggregations need to be cited as a whole, or moved as a whole, or versioned or chopped up or crawled or otherwise manipulated in some way that acknowledges that "yeah, these separate bits of things actually fit together as a molecular unit in this way with this name at some point and based on that I can do stuff with it more usefully." Wonderful.

2. OAI-ORE is being developed by a stellar crew of Smart People. Many of whom have PhDs. All of whom have developed well-known standards before, among other well-known work in their fields of expertise. Some of whom I've had the good fortune to meet, get to know, and work with over the years, and of whom I can privately ask a blunt question in sincere expectation of getting an honest answer. Terrific. Though some of the specs these same folks wrote before have been the kind that make me Cranky. And although they aspire to keep the development of OAI-ORE open for community discussion, their public discussion archives are clearly not where the Real Work has been done on this spec to date. Frustrating. Herein my Cranky rant.

3. OAI-ORE defines what they mean by "aggregation", and its description in a "resource map", in an abstract data model. This is the first and most important OAI-ORE document you can read, since it lays out all the ideas. So we should read it very carefully to understand it well. This document repeats a whole chunk of the Introduction from the TOC in its own Introduction, which is confusing. For some reason they also try to explain the web architecture, the semantic web, rdf, and named graphs as part of the next section, "Architecture Foundations". These concepts are defined elsewhere. This attempt to restate their definitions here expands the scope of this document, not to mention that they've mixed core, proven architectural generalizations based on years of experience at scale (about the web architecture) with provisional generalizations about the semantic web as if the knowledge of the semantic web is driven by years of experience at scale (which nobody has). To me, this is where OAI-ORE starts to veer off the tracks and make me Cranky. I would rather see the Introduction merely say "we are building on the well-understood notions of Resource, URI, Representation, and Link as defined in the [web architecture docs], and the developing notions of named graphs and the semantic web as understood to date in [semweb docs]." This would be more honest, recognizing that they are attempting to leverage both things we all know and other things some of us think we know all at once, and together. And the abstract data model document would grow much shorter. Both of which would make me less Cranky.

4. Now it gets meaty. Sections 3 and 4 of the 0.2 version of the abstract data model are the meat, potatoes, and apple pie of OAI-ORE, where they define what they mean by an aggregation and a resource map. Slow down. Read these three times. Draw a picture or two. Read them again. This is the important part. It's only one-and-a-half screenfuls in my browser, which is nice. To me, this is precisely the place where OAI-ORE fully falls off-track. In my advanced state of Crankiness, I don't want to have to think this hard. Largely because I've learned over the years that as a librarian I will wade through this kind of thing patiently and repeatedly until I Think I Get It, but that that's a perverse instinct born of insecurity and geek power that most Normal People don't share. If I have to work hard to understand it, somebody else won't want to work so hard, and both of us will be left not understanding, and fewer people will use the spec. This is my standard for standards. Can I understand it readily? Will other people? If the answer to either question is "no", I've learned, it's best to Just Move On because People Don't Read Stuff Anymore. Blame market forces. Blame some despot leader of a failed state. Blame HBO's new series In Treatment featuring Gabriel Byrne, new episodes every weeknight! But people don't read stuff. So we need to give them less to read, and make what we give them readily understandable.

5. Here's how I think I'd do that. The goals of OAI-ORE are (a) identifying and (b) describing resource aggregations. To do that with the 0.2 specs you have to wrap your head around ReM, URI-R, URI-A - three new acronyms. They expand to four new concepts: aggregation, aggregated resource, resource map, and resource map document. These are thoughtfully marked in BOLD TEXT so you don't miss them. The entire rest of this document (section 5) goes into great detail about how these new concepts interrelate, and it gets confusing quickly. If you want to jump straight to the confusing part, see section 5.8, which also introduces URI-S, URI-P, and URI-O, which are acronyms for RDF triple roles, and a table of the cardinality of relationships between all of these concepts. I understand all of these concepts on a basic level but am still struggling to understand how they really fit together and why this abstract data model needs to be talking about anything other than aggregations of resources as described by resource maps and identified by URIs. And why concepts as core to the semweb such as How You Connect a Graph of Ss, Ps, and Os where some of those things are really ORE things need to be spelled out here in this abstract data model. I would prefer to see all of this stuff removed from this document.

6. Version 0.Cranky.3 of ORE's abstract data model document has two sections - the introduction and the definitions of an aggregation and resource map. That's all we're doing here, right? So let's just do that. "In ORE we describe aggregations with a resource map, which may be specified in several compatible ways, and which must be discoverable via an explicit URI. For humans, this provides a bookmarkable or citable reference; for machines, this provides access to a named graph for further automated processing." Describe the internal logic of the graph in another document, and then only people who want to further process this stuff automatically have to read that much further, but everybody gets the core concepts: aggregations are described in resource maps which are discoverable via URIs and rendered in several compatible ways. That's the abstract abstract data model, right? So let's stop that document right after we say that, and then point people at separate docs explaining the core resource map concepts (the vocabulary), the internal structure of graph-based processing of resource maps (section 5 of the current abstract data model), the recommended discovery techniques, and other serialization examples.

(By now I'm violating my suggested reader crankiness rules, so I'll be briefer. The above is my main point, the secondary main point follows in 8 and 9.)

7. The vocabulary document is easier to understand, and should be the second thing people read after the data model. My beefs with it are simple: (a) why consider a resource map document a separate entity? What we really need is compatible understandings of the conceptual makeup of resource maps across diverse software implementations working with equivalent resource maps expressed in different serializations. The semantic web is all about triples adding up to a graph, right? Let us work hard to implement test suites and sample data that allow developers to build diverse but equally reliable digital libraries that understand resource maps compatibly. The document is incidental to a resource map's discoverability and its internal model. (b) why define "aggregates" and "isAggregatedBy" when you can instead define "type resourcemap" and lean on "hasPart" and "isPartOf"? (c) I think Rob S. is onto something useful with his notion of a "resource in context", where a context can be defined with a date-of-access timestamp and that context can be used down (up?) the "scholarly value chain" by others to refer to what they read when they read it. But this concept isn't in the current specs.
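
To make beef (b) concrete, here is a rough sketch of one reading of it: type the thing as a resource map and express membership with the existing Dublin Core terms hasPart and isPartOf. The class URI, the function name, and the example.org addresses are all hypothetical; none of them comes from the ORE drafts.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
HAS_PART = "http://purl.org/dc/terms/hasPart"
IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
# Hypothetical class URI standing in for "type resourcemap"; not from any spec.
RESOURCE_MAP = "http://example.org/vocab#ResourceMap"


def resource_map_triples(map_uri, part_uris):
    """Return (subject, predicate, object) triples for one aggregation."""
    triples = [(map_uri, RDF_TYPE, RESOURCE_MAP)]
    for part in part_uris:
        # Each membership is stated in both directions.
        triples.append((map_uri, HAS_PART, part))
        triples.append((part, IS_PART_OF, map_uri))
    return triples


# Illustrative only: a "journal article" made of an HTML page and a PDF.
triples = resource_map_triples(
    "http://example.org/article/1",
    ["http://example.org/article/1.html", "http://example.org/article/1.pdf"],
)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

The only new term in this reading is the type; everything else is vocabulary that existing tools already understand.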

8. If ORE is to be successful at web scale (why else follow principles of the web architecture?), which I would like it to be, then it has to be useful to normal people using ordinary browsers to save and share bookmarks and read text in HTML. Having worked a medical library reference desk for a few years I'm guessing the most important, most common, and simplest use case for ORE is to describe a "journal article" as an aggregation of "this html page, that pdf, these references, and those images", which is *what online journals already do*. This consistent expression of the molecular nature of an aggregation - in this case "all that journal article comprises" - can and should be equally manifest via microformat-style semantic html, rdfa-style embedded description, autodiscovery-style links to alternate expressions, and direct access to atom or rdf serializations. I'm ordering that list that way because I think that's the ordering most likely to be actually experienced by people, because that acknowledges what's already being actually experienced by people: first and foremost, through little blocks of html in the top corner of a journal article page. Let's give publishers a way to keep doing what they're already doing but also newly define those blocks consistently with an html pattern that allows processing environments to find those blocks and treat their contents like any other resource map. Let's do that first, because that's pretty much how everybody already does this. Like I wrote before, the "resource map document" is incidental - what matters is the resource map, and that should be made able to be discovered and understood unambiguously through a microformat pattern, rdfa, atom, or rdf.

9. If that makes sense, then the discovery document becomes much more important. And a section like its 5 (on "Methods Not Recommended for ReM Discovery"), which basically says "this well-known web publishing pattern over here, which seems so logical and useful that popular modern web frameworks are baking it in to their cores (note: link broken at time of writing, but I think that's the right link), yeah, that's the one, well, that, and yknow, SIMPLE HTML LINKS, you can't do those here. Nope, they're forbidden", can just go away. Lovely. Much less Cranky, I am.
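
The discovery pattern argued for in 8 and 9, plain HTML links pointing at a resource map, might look something like this from a processing tool's side. This is a sketch only, using the standard library's html.parser. The rel value "resourcemap", the class and function names, and the sample page are all made up for illustration; the post does not name a rel value and this one is not taken from the ORE discovery document.

```python
from html.parser import HTMLParser

# Hypothetical rel value, assumed for this sketch only.
REM_REL = "resourcemap"


class ResourceMapLinkFinder(HTMLParser):
    """Collect (href, type) pairs from <link> and <a> tags marked with REM_REL."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rels = (attr.get("rel") or "").lower().split()
        if REM_REL in rels and attr.get("href"):
            self.found.append((attr["href"], attr.get("type")))


def find_resource_maps(html_text):
    finder = ResourceMapLinkFinder()
    finder.feed(html_text)
    return finder.found


page = """
<html><head>
  <title>Some journal article</title>
  <link rel="alternate" type="application/pdf" href="http://example.org/article/1.pdf">
  <link rel="resourcemap" type="application/atom+xml" href="http://example.org/article/1/rem.atom">
</head><body>
  <a rel="resourcemap" href="http://example.org/article/1/rem.rdf">resource map</a>
</body></html>
"""

print(find_resource_maps(page))
```

Nothing here needs a new protocol: a publisher adds one attribute to links it already emits, and a crawler needs a few lines of ordinary HTML parsing.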

10. Forgive me for repeating myself, but here's what I think ORE should say: "ORE specifies how to describe aggregations with a resource map, which may be rendered in several compatible ways, and which must be discoverable via an explicit URI. For humans, this provides a bookmarkable or citable reference; for machines, this provides access to a named graph for further automated processing." Then it should supply the core vocabulary, the discovery mechanisms, and sample renderings as microformattish html, rdfa, atom, and rdf/*, and provide a suite of web-accessible named test documents in all renderings to allow developers of ORE processing tools to know (a) what they have to do to make their code understand ORE resource maps compatibly and (b) how to communicate with other developers about their implementations and (c) when their code is done.
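
A tiny sketch of what "knowing when their code is done" could mean in practice: parse the same resource map from two renderings and check that both yield the same set of triples. The two formats below are toy stand-ins invented for this example (a JSON shape and a one-triple-per-line shape), not atom, rdfa, or rdf; the function names and URIs are hypothetical too.

```python
import json


def triples_from_json(text):
    """Toy JSON rendering: {"map": uri, "parts": [uri, ...]}."""
    doc = json.loads(text)
    return {(doc["map"], "hasPart", part) for part in doc["parts"]}


def triples_from_lines(text):
    """Toy line rendering: one 'subject predicate object' triple per line."""
    triples = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            s, p, o = line.split()
            triples.add((s, p, o))
    return triples


def compatible(*graphs):
    """True when every parsed rendering produced the same set of triples."""
    first = graphs[0]
    return all(g == first for g in graphs[1:])


JSON_DOC = (
    '{"map": "http://example.org/a/1", '
    '"parts": ["http://example.org/a/1.html", "http://example.org/a/1.pdf"]}'
)
LINES_DOC = """
http://example.org/a/1 hasPart http://example.org/a/1.pdf
http://example.org/a/1 hasPart http://example.org/a/1.html
"""

print(compatible(triples_from_json(JSON_DOC), triples_from_lines(LINES_DOC)))
```

Comparing sets rather than documents is the point: statement order and surface syntax differ between renderings, but the graph should not.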

11. This rant does not Crank to 11.