linked data

Dreaming of a read/write Linked Data web

I just turned in my fourth (count 'em, 4!) "Libraries in Computers" column for "Computers in Libraries" magazine about Linked Data in 2009. The third is due to appear in print next month, and this one's due in October. It took me four go-rounds, though, before I could get to the key payoff, answering a nagging Collect Underpants kinda question. Surely, less understanding editors would have already sacked such an idiot who needs to write the same story four times before making any sense of it. Shhh! Don't tell them.

All this talk about Linking Data is great, because we can all follow links. But I think what will put Linked Data over the proverbial top is opening an explicit feedback loop into the Linked Data web. Like the web itself and its transition through the Age Of Web 2.0 Hype Budgets, a read-only existence is only a temporary state of affairs. When people start figuring out how to write into the web as much as they read from it — and gaining value from such interactions — is when things start to get really interesting. So let's just skip the Trough of Disillusionment, here, and jump right into the read/write Linked Data web.

I've written here before about my most recent keynote-length talk about the value I see in Linked Data. Flip through those slides far enough and you'll see a long series of seemingly disconnected screenshots excerpted from an imagined user session in a well-known digital library application. Following that is a similarly imagined user session in a well-known OPAC at that app's same host institution. Following that is a series of screenshots from a diverse set of additional web resources about the same intellectual content — the same subject and the same authority records, but different institutions (several academic libraries, a few major museums, and online encyclopedias). And, finally, there's a slide that shows a search result set from a well-known search engine that shows barely a few of those diverse sites.

The intended effect of those slides is summed up simply — nobody will find your stuff, and even if they do, they'll never find all the other cool similar stuff out there about the same thing, because we're not feeding it all into the web correctly. We're not linking these things well enough. We haven't pulled our own thickly entangled yet beautifully informative traditional web of authorized headings and cross-referenced name forms up into the modern Web and used those as a target for linking our stuff together yet.

But wait… did you see it?

There it is again!

A *target* for linking!

A read-only target!

Hence my failure.

In suggesting that what we all need to do is to start pointing our web resources that have LCSH subject headings into sites like id.loc.gov I failed to mention one key thing: when we start pointing our stuff at sites like id.loc.gov, those sites need to start pointing back at sites like ours.

If today a reasonable search on a prominent search engine cannot find the eight great sites with archival materials about Billy Strayhorn, how is adding links from each of them to the same record about Billy Strayhorn at id.loc.gov going to help tie the sites together? That was my goal, my current problem statement. Actually, I'd better spell that out, since I've left it implicit.

I work in a big library. We have lots of collections. Collections of collections of collections. Materials so spread out that we honestly can't answer the question "what all do you have about Billy Strayhorn?" without knowing we're probably missing something. On one hand, this is reasonable, because it's a really big library. On the other hand, it's not reasonable, because lots of people come to libraries asking for "all the stuff you have about so-and-so" and even if some librarians in some libraries could say "right here, yes, right here, this is everything we have about so-and-so", there'd still be what all the other libraries and other sites share about so-and-so on the web that that person's maybe going to want to connect to also, so even that librarian's answer is still insufficient.

There's the question, then: how do we give people a better answer than "here's a few things, and you can search for it to find other stuff, but I know you'll be missing some really important stuff, and lots of things I don't even know about, so, yeah, search Google, but you'll still miss a ton of cool stuff"?

Popping that off the stack, we're back at the unstated earlier question: can't we just future-crawl the future-web and find the future-graph of sites future-pointing at that same id.loc.gov record about Billy Strayhorn and work back out from future-there? Wouldn't that give us a big picture of all the stuff out there about Billy Strayhorn, and help each of us at each of our libraries give a better answer to that patron's question, whether or not we even have anything about Billy Strayhorn?

Well, yes, maybe, but only if we each have a really big copy of the future-web's future-graph. But most of us don't. Even if I run a crawl seeded by the eight sites with materials about Billy Strayhorn, I'm not too confident that I wouldn't have to repeat the same process again for materials by or about George Russell. Or Charlie Christian, or Charlie Rouse. And so on. What are the odds?

Not good.

So lately my mind's stuck on models for doing something somewhere between manually trawling server logs for patterns in HTTP_REFERER values and automatically populating a list of links from whatever referring pages link in. The former is a lot of work, the latter is a public request for embarrassing spam. Which is what the Trackback spec can often lead to.
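One possible middle ground is to tally referrers per page and leave the decision to publish any of them to a human. The following is a minimal sketch, assuming access logs in the common "combined" format; the hosts, paths, and the `inbound_links` helper are all made up for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Pulls the request path and the referrer field out of one line of a
# "combined" format access log. Other log formats would need another pattern.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)


def inbound_links(log_lines, own_host):
    """Map each requested path to the set of external pages that referred to it."""
    links = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        referrer = match.group("referrer")
        host = urlsplit(referrer).hostname
        # Skip empty referrers ("-") and clicks from within our own site.
        if not host or host == own_host:
            continue
        links[match.group("path")].add(referrer)
    return dict(links)


# Hypothetical log lines: one external referrer, one internal, one empty.
sample_log = [
    '192.0.2.1 - - [10/Oct/2009:13:55:36 -0400] "GET /authorities/foo HTTP/1.1" '
    '200 2326 "http://bartown.example/collections/foo" "Mozilla/5.0"',
    '192.0.2.2 - - [10/Oct/2009:13:56:01 -0400] "GET /authorities/foo HTTP/1.1" '
    '200 2326 "http://footown.example/search?q=foo" "Mozilla/5.0"',
    '192.0.2.3 - - [10/Oct/2009:13:57:12 -0400] "GET /authorities/bar HTTP/1.1" '
    '200 1187 "-" "Mozilla/5.0"',
]

candidates = inbound_links(sample_log, own_host="footown.example")
for path, referrers in sorted(candidates.items()):
    print(path, sorted(referrers))
```

The output is only a list of candidates. Reviewing it before publishing anything is what keeps this from turning into the spam problem described above.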

If I'm a leaf node in a Linked Data library collection world, I want somebody searching for Foo at Footown Library to find my interesting set of resources about Foo at Bartown University Library. If Footown and Bartown both point their Foo pages to id.loc.gov/foo-as-a-URI, then there will be two sets of folks able to see that we've both got stuff about Foo: the biggest crawlers out there, the ones likely to see both links; and the host of id.loc.gov/foo-as-a-URI. That's not a lot of options.
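What the host of the shared URI (or a big enough crawler) could do with those inbound links is invert them. This is a rough sketch, assuming the links have already been gathered as (page, concept URI) pairs; the page URLs and both function names are hypothetical.

```python
from collections import defaultdict


def build_backlinks(assertions):
    """Invert (page, concept URI) pairs into concept URI -> sorted list of pages."""
    index = defaultdict(set)
    for page, concept in assertions:
        index[concept].add(page)
    return {concept: sorted(pages) for concept, pages in index.items()}


def related_pages(index, page, concept):
    """Other pages that point at the same concept as the given page."""
    return [other for other in index.get(concept, []) if other != page]


# Hypothetical pages at two libraries that both point at the same concept URI.
FOO = "http://id.loc.gov/foo-as-a-URI"
assertions = [
    ("http://footown.example/catalog/foo", FOO),
    ("http://bartown.example/collections/foo", FOO),
    ("http://bartown.example/collections/other", "http://id.loc.gov/other-as-a-URI"),
]

index = build_backlinks(assertions)
print(related_pages(index, "http://footown.example/catalog/foo", FOO))
```

With an index like this published alongside the concept, a reader at Footown could be shown Bartown's page without either library knowing about the other in advance.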

The traditional record for Foo has long been used to make these kinds of connections in OPACs, but those connections tend to be tightly scoped at the boundaries of the local collection. Similar connections are made from the same record at other sites, but the connections an authority record enables among holdings at any one institution are not connected to the connections made at any other institution, or at least I don't think they are, at least not in the open web, in a crawlable-by-little-old-me sort of way.

I don't know how to do this, but I'm pretty sure we need to find a way.

TCDL 2009 talk: Better living through linking

Wednesday I spoke at the TCDL 2009 conference about why I think Linked Data is important for libraries. I've given talks about this twice before, once at the code4lib 2009 pre-conference on linked data, and a variation on that talk at the TCDL 2009 developers forum pre-conference Tuesday.

This was the first time I spoke about this in a room not entirely filled with hackers, though, so I couldn't just start talking about conneg and RDF models. It needed more context. As far as I can tell, the context that matters most is that we've been building a web for fifteen years, now, and we've continually changed how we build the web as we've changed how we use the web. So I spent most of the talk stressing how adhering to the four rules of Linked Data can help us make our libraries' stuff more relevant, more connected, and more likely to be found and used by improving how we link things together.

First, though, a comment about the contents of the slides - I work for the Library of Congress, but I wasn't representing the library at this talk, which I traveled to and gave off work hours. So that second slide is for real - the opinions are my own. You'll see a lot of LC examples, there, though, for two reasons. One is that I see these sites and think about them a lot, much like the rest of you, just more so because I'm there. When I can show an example from an LC site, it's likely something most people in a room have seen before and understand. The other reason is that LC has a long history of doing digital library stuff, so long that a lot of what's up there looks prehistoric in some ways, but at the same time, there are a lot of cool new things happening there, not all of which get a lot of attention, like LCCN Permalink. I don't work directly on any of the systems which have screenshots in these slides, so when you see images of those systems, you're not seeing my work. I know a few scattered details about the systems and am lucky to get to interact with many of the people who work on building them, but when I spoke about them at TCDL I had no intention of representing their work, and said so. My comments probably seemed more critical than promotional, but I meant them to illustrate situations we all find ourselves in at all our institutions, that we all know well about already, so it's not news to anybody that we all need to improve how we do things.

So, right, disclaimer doubly disclaimed. On with the slides:

I really enjoy events like TCDL - a single track, a healthy mix of public services, technical services, IT, managers, and administrators, and a tech focus but with a broad perspective necessary to talk tech in a roomful of diverse skills and interests. It really focuses my attention on the one or two issues that are at the core of the changes in technology coming at us. It seemed like people received the talk well, as I heard several comments from non-coders and coders alike about how it made sense that we should move in this direction.

Unfortunately I had to leave early, but I'd encourage you to look at the abstracts and learn about all the great work being done in the Lone Star state.

code4lib 2009 talk on caching and proxying linked data

Here are my slides from today's talk. It's called "what i want from linked data", and it spells out in 110 slides some more context-setting for what I wrote about previously here in "Caching and Proxying Linked Data".

I tried to screencast this myself with ScreenFlow as I delivered the talk, but something went wrong. It might have been because the laptop's screen resized itself when I plugged into the big screen. Oh well... good idea, lousy execution by me.

JodiS recorded audio, though, so if I can get a file from her I'll post it here, too.

Caching and Proxying Linked Data

Been thinking more about the demise of lcsh.info. Some quick thoughts, and then some longer ones.

Formats are a distraction. Or, at least, they're a distraction for this discussion. Don't let them distract you.

Don't get distracted by the weird way it was handled. Sometimes big organizations do things in weird ways. If you fret over that you'll miss the real opportunity here.

I'm a lot less comfortable talking in *any* public setting about LC and its internal workings than are some of my close colleagues. We argue about it sometimes, but mostly we accept that we all feel a bit differently from each other about it, and that's okay. I don't feel qualified or well-positioned to represent my employer, so I don't, but if I were to intend to do so, I would say so. The disclaimer below says that I'm not, right now. I really mean that, here, now, and always.

Again, the first question that I think is important to consider is not "what happened?" or "why did LC do this?" It's done. Get over it. Hopefully LC will rectify the situation soon. To me the first important question is "what did this break?"

Seriously, what did it break?

Not much, so far as I can tell. Some people were using it for reference, which is nice, but it's not the only way to do that. Some other people were using it as a data source, too, but it's not the only way to do that, either. So those functions "broke", but there are both drop-in and if-you-squint-right substitutes available for each of these functions. No big deal.

Some people were starting to use lcsh.info URIs as value identifiers and link endpoints. There wasn't really a way to do those things before, so we can say that these functions broke. This feels important, but why? It seems bad that we can no longer "follow our noses" to and through and back out of LCSH, but why? I buy into the mantra that "information is good for people." But I am generally in the business of building systems for data that needs to *last*, and when I'm trying to accomplish that, I need design input far more concrete than just "what would be good."

This, then, is the most important question:

Why is it important that these functions broke?

If we can answer that, then we can follow up with:

How should we build systems that will be resilient to outages in the linked data web?

That, right there, is what I've been obsessing over for the past few months, more so since The December Event. I have some scratch-test answers that seem to be useful in continuing discussions over coffee and the like with those same people I work with, so I'm going to start writing about them more here, because this isn't an LC thing, it's a you and everybody else thing.

Why is it important to use a consistent, controlled value to describe like items? It's important because it helps us answer real questions like "what else do we have about that" and "how is this item different from other things near it" and real service requests like "gimme everything you have about foo." This is not news.

Why is it important to do this with linked data strategies (follow-your-nose, cool-URIs, conneg/303, etc.)? It's important because it opens the same functionality up to the open web. If you link data properly, a crawl will let you do more to compare items across diverse sites. A post-crawl indexing function can be a lot smarter, and you can serve more people as consistently well as above but across more stuff. But, this isn't news, either.
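For readers who haven't met the conneg/303 pattern, here is a simplified sketch of the decision a server makes when asked for a concept URI: it answers 303 See Other and points at a document about the concept, chosen from the Accept header. The path layout and file suffixes are assumptions for illustration, and the Accept parsing ignores wildcards and other details a real server would handle.

```python
# Hypothetical mapping from media type to the document suffix that serves it.
REPRESENTATIONS = {
    "application/rdf+xml": ".rdf",
    "text/html": ".html",
}
DEFAULT_TYPE = "text/html"


def parse_accept(header):
    """Return acceptable media types from an Accept header, highest q-value first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # q=0 means "not acceptable", so leave those types out entirely.
        if quality > 0:
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def see_other(concept_path, accept_header):
    """Answer a request for a concept URI with a 303 redirect to a document about it."""
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return 303, concept_path + REPRESENTATIONS[media_type]
    # Nothing we can serve was asked for by name, so fall back to the HTML page.
    return 303, concept_path + REPRESENTATIONS[DEFAULT_TYPE]


print(see_other("/authorities/foo", "application/rdf+xml"))
print(see_other("/authorities/foo", "text/html,application/rdf+xml;q=0.9"))
```

The point of the redirect is that the concept's URI names the concept itself, while the `.rdf` and `.html` documents are descriptions of it that a crawler or a person can each follow their nose to.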

Here, then, is the kicker:

If you are counting on one or more linked data sources to provide service and content integration, and one of those sources goes away, what do you do?

The answer is the same as the answer to the question "what if a good mashup source goes away?" When a mashup gets popular, and it grows important to improve the service's performance and reliability, you have to cache the key sources' contents for yourself. If you use a particular reference source all the time, you want a copy at the ready, which is why we call them ready reference sources. Same deal.

So maybe we need to cache some linked data, local to applications that depend on that linked data. Most big ILS/OPAC instances have a local authority file, right? Most DNS servers have or live near caches, right? Same deal, again, for both.
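A minimal sketch of that kind of local cache, assuming the application fetches concept records through some function of its own: the `ConceptCache` class, its parameters, and the `flaky_fetch` stand-in are all hypothetical. The one behavior that matters here is that a stale copy is served when the source can't be reached.

```python
import time


class ConceptCache:
    """Keep local copies of remote concept records and outlive the source."""

    def __init__(self, fetch, max_age_seconds=86400, clock=time.time):
        self._fetch = fetch          # callable: uri -> record, may raise OSError
        self._max_age = max_age_seconds
        self._clock = clock
        self._store = {}             # uri -> (fetched_at, record)

    def get(self, uri):
        cached = self._store.get(uri)
        now = self._clock()
        # A fresh local copy answers the request without touching the network.
        if cached and now - cached[0] < self._max_age:
            return cached[1]
        try:
            record = self._fetch(uri)
        except OSError:
            # The source is unreachable: a stale copy beats no copy at all.
            if cached:
                return cached[1]
            raise
        self._store[uri] = (now, record)
        return record


# A stand-in for a real fetcher: it works once, then the source "goes away".
calls = []


def flaky_fetch(uri):
    calls.append(uri)
    if len(calls) > 1:
        raise OSError("source is gone")
    return {"uri": uri, "label": "Foo"}


now = [0.0]
cache = ConceptCache(flaky_fetch, max_age_seconds=10, clock=lambda: now[0])
first = cache.get("http://id.loc.gov/foo-as-a-URI")
now[0] = 100.0  # the cached copy is now stale, and the source is down
second = cache.get("http://id.loc.gov/foo-as-a-URI")
print(first == second)
```

This keeps an in-memory dictionary only. A real application would want the copies on disk, which is closer to what a local authority file already is.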

What we have when we have a lot of records that refer to the same concept value the same way is enough linguistic similarity to make it probable that a given user might leverage that similarity across relevant items. In a single system, you can do okay, but on the open web, you fall back to string matching, basically.

If we back all that up with locally cached authority file records, we have the ability to increase precision and recall in that single system, and to do that in a way that supports internal consistency over time, more resiliently, in the case that that authority source goes away. This can also increase the resilience of increased precision and recall across diverse sites on the open web that also use the same authority sources.

This still isn't really news, though. Hell, we've been doing all this in libraries for *generations* -- well, we're just getting around to the open web part, but that's a topic for a later post.

To me what we have to add right here is *proxying* of sources of meaning. It's lovely to add following-your-noseability along with added resilience of improved precision and recall inside and across sites, but how do we make following-your-nose itself resilient? If that main linked data source goes away, what do we do?

We make systems that cache remotely published concepts, wherein we can follow our noses to meaning, also *themselves* sources of meaning. Instead of just making a concept reference on an item view only a clickable link that performs a search, and in addition to storing what that concept *means* in your system to gain resilience of meaning, you add your own linked data node for that concept to the open web as part of your own application, and in turn back that up with a full record format sharing the concept's meaning, and you explain and link back to where you got it all from.

(Whew, deep breath. I'd better reread that last paragraph. Oy, it's a horrid run-on, but it says what I wanted to say, so I'm leaving it.)

This accomplishes several things at once. It contextualizes that concept's meaning within your application. It connects your use of the concept to the original concept source. It enables all the open web benefits I discussed above, because it will become even easier to post-process crawls to find connections. And it adds the most critical piece: resilience of meaning. If the original concept source disappears, those open web benefits *don't* go away, because the world may now be repopulated from your cache.
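One way such a local node might describe itself is sketched below, as N-Triples built by hand. The choice of SKOS and Dublin Core terms is an assumption rather than anything prescribed here, and the local URI, the `local_concept_triples` helper, and the label are invented for the example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCTERMS = "http://purl.org/dc/terms/"


def escape_literal(text):
    """Escape backslashes, quotes, and newlines for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def local_concept_triples(local_uri, source_uri, label, lang="en"):
    """Describe a locally published copy of a concept and link it back to its source."""
    return [
        f"<{local_uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        # The meaning we cached: here just the preferred label.
        f'<{local_uri}> <{SKOS}prefLabel> "{escape_literal(label)}"@{lang} .',
        # Say plainly that our node stands for the same concept as the original...
        f"<{local_uri}> <{SKOS}exactMatch> <{source_uri}> .",
        # ...and record where we got it from.
        f"<{local_uri}> <{DCTERMS}source> <{source_uri}> .",
    ]


triples = local_concept_triples(
    "http://bartown.example/concepts/foo",   # hypothetical local node
    "http://id.loc.gov/foo-as-a-URI",        # the original concept source
    "Foo",
)
print("\n".join(triples))
```

A crawler that finds these statements at several sites can tie the local nodes together through the shared source URI, and can still do so from the copies if the source itself stops answering.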

(#include hand-waving-about-licensing.h)

There's a reason so many people are excited about distributed revision control. It allows decentralization without disallowing centralization. It lets you adapt to your workflows as you need. It makes any given node more likely to be able to repopulate the world with revision histories if a mass die-off of repo sources occurs. Decentralization allows resilience to grow.

Shouldn't the linked data web work this way?

(Er, it should, and it should be backed by distributed revision control techniques, too, but that's another topic for a later post.)

I don't see all this as cut and dried. There are a lot of details I'm glossing over, I know. But if you're still reading, maybe you'll agree with me that it's good to have something interesting to work on next.

A Nicely Built Linked Data Web Never Resists Destruction

Yesterday was a bit of a downer in the office when Ed filled me in on what was going on with lcsh.info. And it was sadder still to log on tonight and see the site down and replaced for good, though it's nice to see that the in-its-place blog post's comments coming in are encouragingly positive.

I didn't really have much to do with this project, nor with its demise. I'd guess, though, that this will be one of those things that can feel like something of a failure today, in the immediate now(), but will be remembered as a clear success later on. Here's why I think that way:

  • It worked. It showed off what was possible, and was useful, in just the way it was supposed to be.
  • It was fun for humans to explore.
  • People noticed it. Lots of people.
  • People at LC were proud of it. This is from first-hand discussions, and from having heard it trumpeted as an important experiment by LC leaders in public statements.
  • People got it. Even the non-semwebbers in the room saw what it was about and could relate to that.
  • It filled a need. Many people had called for something like lcsh.info, and when it appeared, it seemed a plausible promise of answering their calls.
  • It's not just the geeks and semwebbers who were calling for it and things like it. The wogrofubico report explicitly recommended several steps much like the ones taken with lcsh.info.
  • It was, in part, the output of diverse staff at LC who, in part, have been active participants in a related w3c initiative.
  • It looked cool. Those graphs were nifty.
  • It was an important enough success that it was taken down. If it never gained notice, if it weren't useful, if it didn't promise something bigger, if it didn't make sense, if nobody cared, it would still be up. Yknow?

Ultimately, if linked data is going to work as infrastructure -- which, as something beyond linked data for linked data's sake, I hope it becomes -- there will be more fits and starts on the way to What Works. It seems cheap to remember that "this was always labeled an experiment", even though it was, and rightly so, but to get things right in the long run you have to try some stuff and learn from it along the way. Even simply having this minor crisis is going to drive people, hopefully, to taking issues of persistence and stewardship very seriously as they figure out how to move this exciting idea ahead.

--

(Title appropriated, with apologies, from William Kentridge.)