Open Data is not the point
There have been several people gushing over Casey Bisson's award pronouncement about giving away LC data, and even a whole podcast episode recorded over it. Several things are undeniable about all of this:
- Casey is a great guy and WPOpac is a great hack, so he is totally deserving;
- Giving away a copy of the LC data is a noble thing to offer;
- Many of the gushing people are people whom I respect and admire.
That said, I think the fawning "is this a new era?" talk is unfounded and in many ways insulting. It seems to mirror the kind of attitude we disdain when library visitors say things like "oh but it's all on the internet anyway" and we mock them, or when we tech-heads scorn our aging non-net-savvy superiors for "not getting it." It's as if the existence of a whole wing of our professional oeuvre is not merely going unmentioned but being actively denied.
There are several things just not being said here, and somebody has to say them.
First off, the LC bibliographic data is not exactly being held captive. Anybody can go buy a copy of this data right now, from LC or from third parties. The cost of this data is not in any way prohibitive for a medium- to large-scale institution that is already used to doing Big Deals in the six and seven figures. As I understand it many largeish library institutions *already* have access to the whole dataset and use it regularly for cataloging. I know this because I have a copy of some of it on one of my workstations, a copy I was allowed to use for research purposes. Granted, this came while I was working at Yale, but I assure you, Yale's not the only place where this might be true.
Secondly, LC bibliographic data is nowhere near exhaustive. It's not like this is "every book ever written." It is, like any other library collection, just one library collection. The authority data is, I'd presume, a very full representation of the whole of What's Been Done, but even that really only addresses a narrow breadth of types of works (mostly books, that is).
Third, it's not as if cataloging *hasn't* been massively distributed for many, many years. The tension between "random people can mark this up" and "expensively trained catalogers are necessary" isn't just a question of some AJAX and an XML exchange format. Surrogating works is a sloppy business, and the many common items, like those for which CIP data is produced early in the pipeline, may obscure what's really hard and expensive about all of this. There *are* people all over the world already sharing the load of producing reliably consistent metadata for both common and uncommon works, they've already been doing it for a long time, and many highly evolved techniques have been developed to ensure some level of quality.
Point the fourth: what good is a dump of a snapshot of data if it doesn't have a reliable backbone for updates, integration, and consistent access? If I'm Casey right now I'm doing some time and space and cost calculations and it's not adding up for providing those services. Here's why - if you think somebody's going to post this data and suddenly we'll have a mashupable resource like the Google Maps API you're mistaken. The Google Maps API (and others like it) is as awesome as it is because Google's sitting behind it absorbing the costs of keeping the data up to date. You want a dump of worldwide satellite data? You can go get it, right now. For free. You can even come up with clever hacks to do fun things with it and those might be incredibly useful. But if you want to change the market for satellite data, you're going to have to invest a lot more than what you've got in your office today, and there's already a marketplace full of competitors, and it's not like they're not paying attention.
So far this post has been rather negative - fine, I'm being negative. But when I listened to the Talis podcast on this topic and Ross or whoever it was said something about how important the authority data was, even though this hadn't been mentioned yet in the previous 40 minutes, it seemed as if Paul had piped in the sound of crickets. This drew practically no response, and that's just frustrating. It seemed as if the participants were either not willing or not able to explain how there isn't just some roomful of catalogers at LC churning out metadata and another roomful of guardians exacting wrenching payments for the catalogers' output. There are hundreds-to-thousands of people in institutions all over the world who spend time every day on original cataloging, and there are multi-institutional and even international workflows honed over years to share the results of those efforts. It seems to me that any conversation about how to improve the availability and usefulness of large swaths of the output of these processes needs to at least acknowledge that in many ways these workflows are the very soul of our profession. Knowledge about how they evolved and are continuing to change needs to be brought to bear, and it exists, probably close to where you are, so let's not ignore it.
That said, let's see if I can turn this around to end on a more positive note.
Yes, there is a wealth of important data sitting in the LC dataset. Yes, for all the massive expense and bizarre rules and ancient formats we still use, a lot of sharing happens. Yes, for all of that sharing, at such great cost, we get far less out of the data than we should - we share it less widely, we make it less accessible, we expose it to practically nobody in our own local communities.
I stand 100% behind any attempt to change this situation. Make our data work harder!
Yes, the economics and legal framework under which data is shared in our communities need a hard look, and perhaps a significant refactoring.
Yes, we could get a whole lot of people from outside of our community more interested in hacking on our data and helping us make libraries better and coming up with fabulous new services if we took steps to make it easier for them.
Yes, the few biggest players in all of this (LC and OCLC) seem to have a ton of leverage, and a significant amount of money flows in their direction. But, yes, the roles they serve in the broader functions of our community are critical, even though they need to be reconsidered. It's not as if there aren't already a lot of smart, experienced people who do indeed "get it" already reconsidering most everything they do.
Adding all of this up, I'm excited by the prospect that somebody like Casey would choose to want to set up another thousand Caseys with an easier opportunity to come up with another thousand great hacks. But I just can't see that as the whole story.
Ultimately, there are two big lurches I feel coming in our profession (and society, really). The first is the greedy librarian phase, where the mere fact that it has become cheaper and easier for all of us to have copies of WorldCat on our iPods means that we will. This is exciting. The second one, which probably has to follow that for reasons of sheer scale, is the era of the bibliographic backplane.
I call this weblog "One Big Library" because the only long-term professional future I see for myself in libraries is one in which most of what I do is directed toward making the whole net work like one big library. Somewhere in the mess of fauxonomies and arcane bibliographic control techniques is an API and an emergent dataset and service that starts to just glue all our distributed and divergent stuff together. I don't see it emerging from solely a shotgun approach, though. One of the trickiest parts of the jake project was figuring out the "how do we draw the line between global knowledgebase and local holdings?" question. It's no coincidence that that wasn't exactly a new question, and that the shifting dynamic it highlights is the same one underlying my rant here.
In the race to build out the human genome dataset there were simultaneous, competing efforts based alternatively on years -- generations, really -- of slow, steady progress and an aggressive, "shotgun" approach. Ultimately the two drove each other harder and their output complemented each other - though that's a vast oversimplification. What that says to us, though, is that it seems very likely that no effort to "open our data" is going to work unless we're awfully smart about keeping in touch with the people who've been making slow, steady progress for generations. You can find these people working on the VIAF project and serving in various ways under the PCC, just to name a few starting points.
I've been listening to the History of Information course podcast from UC Berkeley this fall, and one of the constant themes Drs. Nunberg and Duguid return to is the fallacy of technological determinism. I'm not sure I've absorbed all the implications of this notion yet but what it's taught me is to be extremely wary whenever somebody says of a single technical or cultural breakthrough that "this is going to change everything." If Open Data in the library community actually does help to change everything, it won't be because of one person freeing up some records. (And, again, I love what you're doing, Casey, and am really excited for you.) It won't even be because the economics of technology and bandwidth have created the opportunity. Nor will it be just because, while some of us wacky librarians have been talking about sharing all of our data for hundreds of years, a few particularly wacky librarians actually went and started doing it a few centuries ago, and more people have kept at it all along.
If what the gushing hints at comes to pass - and I tend to hear it as something similar to what I mean by the "bibliographic backplane," linking everything we do and own and share up seamlessly and having it all grow in response to our use of the system as participatory feedback and all of that web 2.0 magic - it will be because our society moved in a direction whereby we chose to build a bibliographic backplane, and because that move coincided, eventually, with all of these other factors. In the meantime I'm still gushing at the possibility of the forthcoming OCLC Identities project (though I forget exactly what its name was going to be) even though I have pretty much the same raw data sitting here on my hard drive.
If you've read all of this, well then, here's the good part: I've got this satellite imagery here that'll fundamentally change the way you look at maps and how all of us see the world, and I'll sell it to you, CHEAP...