Open Data is not the point

There have been several people gushing over Casey Bisson's award pronouncement about giving away LC data, and even a whole podcast episode recorded over it. Several things are undeniable about all of this:

  • Casey is a great guy and WPOpac is a great hack, so he is totally deserving;
  • Giving away a copy of the LC data is a noble thing to offer;
  • Many of the gushing people are people whom I respect and admire.

That said, I think the fawning "is this a new era?" talk is unfounded and in many ways insulting. It seems to mirror the kind of attitude we disdain when library visitors say things like "oh but it's all on the internet anyway" and we mock them, or when we tech-heads scorn our aging non-net-savvy superiors for "not getting it." It's as if the existence of a whole wing of our professional oeuvre is not merely going unmentioned but being actively denied.

There are several things just not being said here, and somebody has to say them.

First off, the LC bibliographic data is not exactly being held captive. Anybody can go buy a copy of this data today, right from LC or from third parties. The cost of this data is not in any way prohibitive for a medium- to large-scale institution that is already used to doing Big Deals in the six and seven figures. As I understand it, many largeish library institutions *already* have access to the whole dataset and use it regularly for cataloging. I know this because I have a copy of some of it on one of my workstations, a copy I was allowed to use for research purposes. Granted, this came while I was working at Yale, but I assure you, Yale's not the only place where this might be true.

Secondly, LC bibliographic data is nowhere near exhaustive. It's not like this is "every book ever written." It is, like any other library collection, just one library collection. The authority data is, I'd presume, a very full representation of the whole of What's Been Done, but even that really only addresses a narrow breadth of types of works (mostly books, that is).

Third, it's not as if cataloging *hasn't* been massively distributed for many, many years. The tension between "random people can mark this up" and "expensively trained catalogers are necessary" isn't just a question of some AJAX and an XML exchange format. Surrogating works is a sloppy business, and the many common items, like those for which CIP data is produced early in the pipeline, possibly obscure what's really hard and expensive about all of this. There *are* people all over the world already sharing the load of producing reliably consistent metadata for both common and uncommon works, they've already been doing it for a long time, and many highly evolved techniques have been developed to ensure some level of quality.

Point the fourth: what good is a dump of a snapshot of data if it doesn't have a reliable backbone for updates, integration, and consistent access? If I'm Casey right now I'm doing some time and space and cost calculations and it's not adding up for providing those services. Here's why - if you think somebody's going to post this data and suddenly we'll have a mashupable resource like the Google Maps API you're mistaken. The Google Maps API (and others like it) is as awesome as it is because Google's sitting behind it absorbing the costs of keeping the data up to date. You want a dump of worldwide satellite data? You can go get it, right now. For free. You can even come up with clever hacks to do fun things with it and those might be incredibly useful. But if you want to change the market for satellite data, you're going to have to invest a lot more than what you've got in your office today, and there's already a marketplace full of competitors, and it's not like they're not paying attention.

So far this post has been rather negative - fine, I'm being negative. But when I listened to the Talis podcast on this topic and Ross or whoever it was said something about how important the authority data was, even though this hadn't been mentioned yet in the previous 40 minutes, it seemed as if Paul had piped in the sound of crickets. This drew practically no response, and that's just frustrating. It seemed as if the participants were either not willing or not able to explain how there isn't just some roomful of catalogers at LC churning out metadata and another roomful of guardians exacting wrenching payments for the catalogers' output. There are hundreds-to-thousands of people in institutions all over the world who spend time every day on original cataloging, and there are multi-institutional and even international workflows honed over years to share the results of those efforts. It seems to me that any conversation about how to improve the availability and usefulness of large swaths of the output of these processes needs to at least acknowledge that in many ways these workflows are the very soul of our profession. Knowledge about how they evolved and are continuing to change needs to be brought to bear, and it exists, probably close to where you are, so let's not ignore it.

That said, let's see if I can turn this around to end on a more positive note.

Yes, there is a wealth of important data sitting in the LC dataset. Yes, for all the massive expense and bizarre rules and ancient formats we still use, a lot of sharing happens. Yes, for all of that sharing, at such great cost, we get far less out of the data than we should - we share it less widely, we make it less accessible, we expose it to practically nobody in our own local communities.

I stand 100% behind any attempt to change this situation. Make our data work harder!

Yes, the economics and legal framework under which data is shared in our communities need a hard look, and perhaps a significant refactoring.

Yes, we could get a whole lot of people from outside of our community more interested in hacking on our data and helping us make libraries better and coming up with fabulous new services if we took steps to make it easier for them.

Yes, the few biggest players in all of this (LC and OCLC) seem to have a ton of leverage, and a significant amount of money flows in their direction. But, yes, the roles they serve in the broader functions of our community are critical, even though they need to be reconsidered. It's not as if there aren't already a lot of smart, experienced people who do indeed "get it" already reconsidering most everything they do.

Adding all of this up, I'm excited by the prospect that somebody like Casey would want to set up another thousand Caseys with an easier opportunity to come up with another thousand great hacks. But I just can't see that as the whole story.

Ultimately, there are two big lurches I feel coming in our profession (and society, really). The first is the greedy librarian phase, where the mere fact that it has become cheaper and easier for all of us to have copies of Worldcat on our iPods means that we will. This is exciting. The second one, which probably has to follow that for reasons of sheer scale, is the era of the bibliographic backplane.

I call this weblog "One Big Library" because the only long-term professional future I see for myself in libraries is one in which most of what I do is directed toward making the whole net work like one big library. Somewhere in the mess of fauxonomies and arcane bibliographic control techniques is an API and an emergent dataset and service that starts to just glue all our distributed and divergent stuff together. I don't see it emerging from solely a shotgun approach, though. One of the trickiest parts of the jake project was figuring out the "how do we draw the line between global knowledgebase and local holdings?" question. It's no coincidence that that wasn't exactly a new question, and that the shifting dynamic it highlights is the same one underlying my rant here.

In the race to build out the human genome dataset there were simultaneous, competing efforts based alternatively on years -- generations, really -- of slow, steady progress and an aggressive, "shotgun" approach. Ultimately the two drove each other harder and their output complemented each other - though that's a vast oversimplification. What that says to us, though, is that it seems very likely that no effort to "open our data" is going to work unless we're awfully smart about keeping in touch with the people who've been making slow, steady progress for generations. You can find these people working on the VIAF project and serving in various ways under the PCC, just to name a few starting points.

I've been listening to the History of Information course podcast from UC Berkeley this fall, and one of the constant themes Drs. Nunberg and Duguid return to is the fallacy of technological determinism. I'm not sure I've absorbed all the implications of this notion yet, but what it's taught me is to be extremely wary whenever somebody says of a single technical or cultural breakthrough that "this is going to change everything." If Open Data in the library community actually does help to change everything, it won't be because of one person freeing up some records. (And, again, I love what you're doing, Casey, and am really excited for you.) It won't even be because the economics of technology and bandwidth have created the opportunity, or even just because, while some of us wacky librarians have actually been talking about sharing all of our data for hundreds of years, a few particularly wacky librarians actually went and started doing it a few centuries ago, and more people have kept at it all along.

If what the gushing hints at comes to pass - and I tend to hear it as something similar to what I mean by the "bibliographic backplane," linking everything we do and own and share up seamlessly and having it all grow in response to our use of the system as participatory feedback and all of that web 2.0 magic - it will be because our society moved in a direction whereby we chose to build a bibliographic backplane. And because that move coincided, eventually, with all of these other factors. In the meantime I'm still gushing at the possibility of the forthcoming OCLC Identities project (though I forget exactly what its name was going to be) even though I have pretty much the same raw data sitting here on my hard drive.

If you've read all of this, well then, here's the good part: I've got this satellite imagery here that'll fundamentally change the way you look at maps and how all of us see the world, and I'll sell it to you, CHEAP...

Comments

The point is..

I agree with most and partially disagree with some. I think the idea of open data can be important in certain circumstances, especially in the example of genome data that you give. However, I think I missed the point. Was it that connecting things is the point?

...something like this.

Hey Ryan - maybe it didn't come through my extended rant clearly. :) I think my main points were:

  • congrats Casey!
  • more data sharing is good
  • this isn't some landmark tipping point
  • people getting excited about this maybe being some kind of tipping point are framing the discussion myopically
  • the dynamics of catalog data sharing are rich, well-understood, and carry deep-seated wisdom, even if today's implementations and marketplace are far from optimally balanced (and they are)
  • the people who know this stuff aren't the ones getting excited, and there's a reason for that

So who knows this stuff then?

"the people who know this stuff aren't the ones getting excited, and there's a reason for that"

So tell me why, then, when MIT offered Barton to the world, Code4lib.org immediately created a bittorrent of it?

Why did Kevin Clarke take Georgia Tech's entire bib database when I said I had a dump of it?

Why does Code4lib2007 have a proposed breakout: "Pilfering LOC MARC for fun and profit ... AKA, using the power of Perl and onion routing to anonymously download all of LOC's MARC records for free"

Why bother with the Sandy Berman catalog?

It seems a lot of your colleagues are at least interested, if not excited, about it. Or don't they know their stuff?

Yes, I agree with you. Calling this "the most important thing since the MARC record" or whatever is completely overselling the situation. That doesn't, however, render it insignificant.

Who knows this stuff

You asked:

"So tell me why, then, when MIT offered Barton to the world, Code4lib.org immediately created a bittorrent of it? Why did Kevin Clarke take Georgia Tech's entire bib database when I said I had a dump of it?"

When I said "the people who know this stuff", I meant "the catalogers."

Kevin's a cataloger. Who made the bittorrent, Ed? Ed knows from catalog data. As do you. I suspect it's compelling to all of us (me included, which is why I have some of this stuff here too) because this is just work we all need to be working on these days. It's important to do.

Fwiw, I'm not sure how interesting Barton data is - MIT's is an interesting, medium-sized academic library, but it's largely an engineering school with the most depth in engineering-related collections (which is not to make light of its holdings in business, art and architecture, and humanities, among others - they're just smaller).

The Sandy Berman catalog is interesting because it's Sandy Berman data.

So, yes, it's important for us all to work with this data to understand it better and to become able to do more to make it work harder. I'm not disagreeing with any of that. I just don't agree that suddenly having a big lump of LC is necessarily world-changing on its own.

...a big lump of LC is not world-changing on its own

"I just don't agree that suddenly having a big lump of LC is necessarily world-changing on its own."

On that we agree. It raises the profile of the discussion, and of the issues, and leads (hopefully) to more and more collections being added to the pool.

Personally, I can see little use for just LC's catalogue (outside the very real and useful uses to which it is already put, of course), or just Barton's, or just any other single catalogue. I can see all sorts of potential in aggregating them and matching local holdings against them, though...

...and I guess my response

...and I guess my response is that that's what we already do. :)

As you stated eloquently on your panlibus post, though, the dynamics of how we can and should do that mechanically definitely have to change to effect a global bibliographic backplane. And for those already hooked up to one bibliographic mothership or the other, "changing the financial model doesn't immediately impact in their world." We agree on that too!

I (obviously) just thought it important to bring that more explicitly into the discussion.

:-)

:-)

Sanity

Dan, just two words. Thank you.

Excellent post, thanks very

Excellent post, thanks very much.

I think there is a dangerous divide that has occurred between the traditional cataloging community and the, for want of a better word, techies. It must be bridged for progress to occur. The problems we are facing in the library world (with this divide being just one of them) are primarily social problems (including economic problems), not technological problems.

Cataloging is a science and art with 100+ years of history in working on the same sorts of problems that the library tech vanguard is wanting to solve. There are many in the cataloging community who are indeed currently working on solving these same problems--who 'get it'. (There are also some who don't, certainly.) (I think this is who Dan was alluding to as "the people who know this stuff", and "Knowledge about how they evolved and are continuing to change needs to be brought to bear, and it exists, probably close to where you are, so let's not ignore it.") This stuff is not new.

That change is not happening quicker than we want is not for lack of some people trying. It's because it's hard, for various social reasons (exactly what those are is something we're all trying to figure out). So I see the temptation for the tech vanguard to say, okay, let's just ignore that whole realm, it's just too hard, let's just try to do our own thing ignoring them. I feel the temptation too. But if the problem is really a social problem (a hypothesis I have not supported in this comment, but which I hold and I suspect Dan shares), then the solution has to be a social solution. (Doesn't mean people can't be working on their own thing simultaneously; I like Dan's point about "simultaneous, competing efforts". I like Dan's whole post. I think he's spot-on.)

open data

Very thoughtful post Dan. I just listened to the podcast so I think I understand your response a bit more than I did yesterday when I read it first. But I'm still wondering--if open data isn't the point, what is?

I think you are right to point out that the current situation is far from bad, and that libraries have been doing a remarkable job of distributing and collaboratively creating bibliographic and authority data for many years now.

Perhaps some of the discussion around the topic has seemed a bit condescending to library professionals who have spent their lives creating and sharing this data via collaborative networks like OCLC. But really these individuals should feel proud that others are earnestly looking for ways to make their work more widely available to enable greater innovation in libraries. It's really a complete validation of their work.

Librarians should also find comfort in the fact that theirs is not the only industry that is finding the changes in the information landscape over the past 10 years to be unsettling.

open data better serves a global audience?

Wow, Dan, thoughtful post (as always), I am not sure I am understanding how the threads come together though.

I sense you are reacting to a perceived negativity towards cataloguing practices and traditions, which is indeed worth confronting if it's there, but I suspect that, even if it is, the single best way to show the value of all of this hard work is to position the results so that more eyes can peer into it.

Beyond this, there are organizations that could benefit immensely from a static set of CDs that hold the LC database with even a minimal interface. Most of the planet's libraries outside the West are subject to much greater network availability and latency issues, let alone the fiscal realities of not being located in the small group of humanity that gets allotted almost all of the world's resources. There is a reason why badly outdated copies of BiblioFile churned away for years beyond any update period in resource-poor regions.

I am also not sure how the Human Genome Project really fits in this context. Given the role of players like Bill Clinton and Craig Venter in the process, it makes me wonder about the dynamic of brash interventions, pronouncements, broad consensus, and slow and steady progress (James Shreeve's book, for example, is good on this). An initiative like Humanitarian Information for All, which distributes health information on CD-ROM, has had an enormous impact for relatively modest investments in infrastructure, and I could see the building blocks for something similar with the LC data and Casey's initiative. I recognize that this is hardly revolutionary from a technology standpoint.

What is revolutionary, I think even by wacky technology standards, is your notion of a bibliographic backplane, though I may not necessarily understand it very well. I really believe that humanity's ability to share information and narratives is a large part of the metric for any progress in the future, and I think your vision would enable this in ways that would make many of the current technology initiatives in libraries seem small in retrospect. But I do think that there is a disconnect between the mechanisms built by libraries for fostering sharing and the need to extend the reach of these structures globally. I also think the bidirectional possibilities need some attention, pushing data out to other parts of the world could be an opportunity to create a conversation with the recipients. I don't know how this fits in with FRBR or whatever, but I suspect the enthusiasm for creating the plumbing for this needs some good sized datasets with a bit of hype, deserved or otherwise, associated with them.

This is not to read anything sinister into the gap between where library data is and where it might need to be, or in any way to be disrespectful of the brilliant minds and dedicated people that made resource sharing such a vibrant reality in libraries, but I wonder about the planning and fidelity of the feedback loops, say for the AIDS organization in Zimbabwe with a small resource collection, or the small library in Beirut trying to balance a collection with multiple viewpoints. I know little about the makeup of the committees that steer RDA and the rest, I just hope that they reflect the realities of a small and inequitable planet. Open Data would seem to have a natural synergy with libraries worldwide.

So I dunno, I am not sure I understand how my threads come together either, but I see an inclusiveness to sharing the LC data in bulk that is worth celebrating.

Well said

Hey Art - you make several very compelling points, I won't try to argue them.

In particular, this:

"But I do think that there is a disconnect between the mechanisms built by libraries for fostering sharing and the need to extend the reach of these structures globally."

...is a great explication of what I was trying to get at. I see an active disconnection between the workflows that create, share, and keep data consistently useful as being a net negative. Your point about the viability of this data in disconnected locales is incredibly strong, but that does seem to be a worst case that we should only be satisfied with when it's out at the "leaf nodes", as it were.

I won't cheer for an effort to sever that connection so close to home, though. Which isn't to say that that's what I heard people calling for, but to hear the cheers for open data without an accompanying awareness of the diverse connections that led to good data to begin with, I sense impending disconnection, and I get uppity. :)

Awareness could be in the licensing?

Thanks for the thought provoking post, Dan. I'm not sure I see an impending disconnection as a result of the desire to open up all that data that libraries have been generating for a very long time. I do think awareness of the source of the data is important though (it is only polite to acknowledge those on whose shoulders you want to climb).

What would you think of doing this through licensing? Make the data available, but wrap its use in a copyleft-like license so that any use of it must reaffirm the spirit in which it was received. CC has a copyleft-ish license for data. Would using that quell some of your concerns? I believe the initial announcement said Casey plans to use a GPL-like license (but I would assume something for data rather than for code).

An Open Data Licence

A good point - and exactly what the TCL was designed for - www.talis.com/tdn/tcl.

We only set it up because we didn't think the current 'open' licenses catered for data very well... and we're in active discussion with the obvious homes for open licenses to have them take it on.

Or alternatively...

You could also read Art's quote as suggesting that the current mechanisms for fostering sharing are a product of their time, and possibly not fit for purpose moving forward. That doesn't mean we throw them away, or forget that they existed. It does mean, though, that we need to look at how they could and should evolve.

There is huge and undeniable value in the data, and in the community and practices that built and maintain those data. We need to respect that, applaud that, build upon that, and learn from that. We shouldn't simply preserve it at the expense of evolving it, though.

I think the current

I think the current workflows for creating, maintaining, distributing metadata are broken. I think maybe this is what Dan meant when he said "I see an active disconnection between the workflows that create, share, and keep data consistently useful as being a net negative."

Metadata is not being created and cared for in a way that lets us actually use it productively; the resources being put into library metadata are not being used appropriately.

This is a tough thing for catalogers to hear, because it can be an attack on their community. But I think it's true, and I don't think it does anyone any good for techies to say, as I recently read a prominent techie who's been doing some really good stuff write, "Hey, you catalogers are really the experts, we techies don't know anything about metadata, you just keep doing your thing, we won't interfere, and we'll try to use the data you generate the best we can." It's patronizing. And is part of the disconnect that I think Dan is talking about.

It's also patronizing for someone from outside the cataloging community to think he (cause it is always a he for whatever reason; the gender thing may be part of this whole thing) knows everything there is to know about 'proper' metadata life cycle, and if only the catalogers would just step out of the way, or do what the techies say, all would be well.

Both are different manifestations of the dangerous divide between cataloging community and technological community. (Exacerbated if you pretend 'cataloging' and 'metadata' are different things, in my opinion.)

Somehow, we techies need to approach the cataloging community in the spirit of respect and cooperation. Yes, you cataloging community, you have been working on these issues for a long time. We too are working on these issues. We are all in it together. What's going on right now is fundamentally broken from many sides (most catalogers are (justly) unhappy with the technology they get), but it's not easy to fix, let's work on fixing it together. In the 21st century, the cataloging community and the information retrieval technology community need to be different sides of the SAME community.

This is a social problem. We all know that technology alone cannot solve the problem of a disconnected and inefficient metadata life cycle.

If you'll allow me to be reductive...

So, open data is not in itself the holy grail. It is a means to an end. It is a necessary condition of finding the grail but is not sufficient. That about sum it up? :)

A Gender Thing?

I'm not sure where gender comes into this discussion because this argument has been almost exclusively argued from both sides by men (including whatever "gushing" was going on). I wrote about WPOPAC for ALA Techsource but focused exclusively on the disruptive nature of the product, and did not mention the LC dataset. I did that not because it's so easy to get one's hands on an LC dataset--you must have quite a budget where you're working, Dan ;> -- but because that was just not the most interesting part of what Casey had done, in my opinion. A snapshot of a database is not a replication of--or improvement on--a workflow, which is what I think I hear Dan saying. Once we agree on that, we can have a discussion about whether current cataloging practices (and the practices we are not generating) work on our profession's behalf. I worry that a lot of good people are doing good work that is nonetheless increasingly divorced from our needs.

That said, access to this dataset *for free* is interesting, given the proprietary nature of most of our datasets (de facto or de jure), and I hope people (or at least Casey, one people) can make good use of it.

Well said

"I worry that a lot of good people are doing good work that is nonetheless increasingly divorced from our needs."

I agree - I worry too that not enough of us are working to cross the chasms within our own profession.

I'm not sure if gender enters into it in any new way, but I'd bet that most readers of this weblog are male, if that matters.

Gender...

The gender comment came up because Jonathan had said, "cause it is always a he for whatever reason; the gender thing may be part of this whole thing ..."

At first I was thinking "why does he have to bring up gender," but in retrospect, his comment may be spot-on. It *is* interesting that (warning: gross oversimplification follows) women are the catalogers and men are the cultural critics of cataloging.

Yet the real issue, as JR also points out, is that none of us know why we do what we do. That's the fundamental brokenness of our activities, and it's not limited to one community or another.

On the other hand, I'm a bit weary of the extended talk about the utility of Casey's data or what he might be doing with it when the one person we haven't heard from in all this is Casey himself. I believe the guy who did WPOPAC has more than one trick up his sleeve.

open access

I cannot help but compare open data with the open access movement. The open access movement has been ongoing for a while and we sure can learn something from it. After more than 10 years of different experiments, we still see a lot of debates and all kinds of practices: arXiv, institutional repositories, personal homepages, author-paid open access journals. Stevan Harnad has many good articles and thoughts about this topic.

If we can learn anything from the open access movement, I guess it is that "open data" is also a complex issue that needs a lot of experiments, and you cannot just paint it simply as black or white.