Caching and Proxying Linked Data
Been thinking more about the demise of lcsh.info. Some quick thoughts, and then some longer ones.
Formats are a distraction. Or, at least, they're a distraction for this discussion. Don't let them distract you.
Don't get distracted by the weird way it was handled. Sometimes big organizations do things in weird ways. If you fret over that you'll miss the real opportunity here.
I'm a lot less comfortable talking in *any* public setting about LC and its internal workings than are some of my close colleagues. We argue about it sometimes, but mostly we accept that we all feel a bit differently from each other about it, and that's okay. I don't feel qualified or well-positioned to represent my employer, so I don't, but if I were to intend to do so, I would say so. The disclaimer below says that I'm not, right now. I really mean that, here, now, and always.
Again, the first question I think that is important to consider is not "what happened?" or "why did LC do this?" It's done. Get over it. Hopefully LC will rectify the situation soon. To me the first important question is "what did this break?"
Seriously, what did it break?
Not much, so far as I can tell. Some people were using it for reference, which is nice, but it's not the only way to do that. Some other people were using it as a data source, too, but it's not the only way to do that, either. So those functions "broke", but there are both drop-in and if-you-squint-right substitutes available for each of these functions. No big deal.
Some people were starting to use lcsh.info URIs as value identifiers and link endpoints. There wasn't really a way to do those things before, so we can say that these functions broke. This feels important, but why? It seems bad that we can no longer "follow our noses" to and through and back out of LCSH, but why? I buy into the mantra that "information is good for people." But I am generally in the business of building systems for data that needs to *last*, and when I'm trying to accomplish that, I need design input far more concrete than just "what would be good."
This, then, is the most important question:
Why is it important that these functions broke?
If we can answer that, then we can follow up with:
How should we build systems that will be resilient to outages in the linked data web?
That, right there, is what I've been obsessing over for the past few months, more so since The December Event. I have some scratch-test answers that seem to be useful in continuing discussions over coffee and the like with those same people I work with, so I'm going to start writing about them more here, because this isn't an LC thing, it's a you and everybody else thing.
Why is it important to use a consistent, controlled value to describe like items? It's important because it helps us answer real questions like "what else do we have about that" and "how is this item different from other things near it" and real service requests like "gimme everything you have about foo." This is not news.
Why is it important to do this with linked data strategies (follow-your-nose, cool-URIs, conneg/303, etc.)? It's important because it opens the same functionality up to the open web. If you link data properly, a crawl will let you do more to compare items across diverse sites. A post-crawl indexing function can be a lot smarter, and you can serve more people as consistently well as above but across more stuff. But, this isn't news, either.
Here, then, is the kicker:
If you are counting on one or more linked data sources to provide service and content integration, and one of those sources goes away, what do you do?
The answer is the same as the answer to the question "what if a good mashup source goes away?" When a mashup gets popular, and it grows important to improve the service's performance and reliability, you have to cache the key sources' contents for yourself. If you use a particular reference source all the time, you want a copy at the ready, which is why we call them ready reference sources. Same deal.
So maybe we need to cache some linked data, local to applications that depend on that linked data. Most big ILS/OPAC instances have a local authority file, right? Most DNS servers have or live near caches, right? Same deal, again, for both.
What we have when we have a lot of records that refer to the same concept value the same way is enough linguistic similarity to make it probable that a given user might leverage that similarity across relevant items. In a single system, you can do okay, but on the open web, you fall back to string matching, basically.
If we back all that up with locally cached authority file records, we have the ability to increase precision and recall in that single system, and to do that in a way that supports internal consistency over time, more resiliently, in the case that that authority source goes away. This can also increase the resilience of increased precision and recall across diverse sites on the open web that also use the same authority sources.
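For the code-minded, here is a minimal sketch of that kind of local cache, in Python. It is an illustration under assumptions, not a description of any real system: the `ConceptCache` and `FlakySource` names, the `from_cache` flag, and the example.org URI are all made up, and a real cache would persist to disk and worry about expiry.

```python
import time
from typing import Callable, Dict, Optional


class ConceptCache:
    """Keep a local copy of every remote concept record we manage to fetch.

    Hypothetical helper: `fetch` is any callable that takes a URI and
    returns the record body, raising an exception if the source is down.
    """

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._store: Dict[str, dict] = {}  # in-memory only, for the sketch

    def get(self, uri: str) -> Optional[dict]:
        try:
            body = self._fetch(uri)
        except Exception:
            # The source is unreachable or gone: fall back to our own copy.
            cached = self._store.get(uri)
            if cached is None:
                return None
            return {**cached, "from_cache": True}
        # The source answered: refresh our copy and hand back the fresh record.
        entry = {"uri": uri, "body": body, "fetched_at": time.time()}
        self._store[uri] = entry
        return {**entry, "from_cache": False}


class FlakySource:
    """Stand-in for a remote authority source that can disappear."""

    def __init__(self, records: Dict[str, str]):
        self.records = records
        self.online = True

    def __call__(self, uri: str) -> str:
        if not self.online:
            raise ConnectionError("source is gone")
        return self.records[uri]


if __name__ == "__main__":
    uri = "http://example.org/concepts/foo"  # made-up URI
    source = FlakySource({uri: "Foo (concept record)"})
    cache = ConceptCache(source)
    print(cache.get(uri)["from_cache"])  # fetched from the live source
    source.online = False                # the source goes away
    print(cache.get(uri)["from_cache"])  # still answerable, from our copy
```

The one design choice worth noting is that the record says whether it came from the cache, so the application can tell its users when it is running on a saved copy.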
This still isn't really news, though. Hell, we've been doing all this in libraries for *generations*. Well, we're just getting around to the open web part, but that's a topic for a later post.
To me what we have to add right here is *proxying* of sources of meaning. It's lovely to add following-your-noseability along with added resilience of improved precision and recall inside and across sites, but how do we make following-your-nose itself resilient? If that main linked data source goes away, what do we do?
We make systems that cache remotely published concepts, wherein we can follow our noses to meaning, also *themselves* sources of meaning. Instead of just making a concept reference on an item view only a clickable link that performs a search, and in addition to storing what that concept *means* in your system to gain resilience of meaning, you add your own linked data node for that concept to the open web as part of your own application, and in turn back that up with a full record format sharing the concept's meaning, and you explain and link back to where you got it all from.
(Whew, deep breath. I'd better reread that last paragraph. Oy, it's a horrid run-on, but it says what I wanted to say, so I'm leaving it.)
This accomplishes several things at once. It contextualizes that concept's meaning within your application. It connects your use of the concept to the original concept source. It enables all the open web benefits I discussed above, because it will become even easier to post-process crawls to find connections. And it adds the most critical piece: resilience of meaning. If the original concept source disappears, those open web benefits *don't* go away, because the world may now be repopulated from your cache.
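To make the proxying idea a little more concrete, here is a rough sketch of what your own node for a cached concept might say, written out as N-Triples from Python. The vocabulary choices are my assumptions, not a recommendation from any source: SKOS `prefLabel` and `exactMatch` plus Dublin Core `source` are one plausible way to carry the meaning and the link back. The `proxy_node` function and the example.org URIs are made up for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
DCTERMS = "http://purl.org/dc/terms/"


def _literal(text: str) -> str:
    """Quote a plain string as an N-Triples literal (minimal escaping only)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def proxy_node(local_base: str, slug: str, origin_uri: str, label: str) -> list:
    """Describe a locally published concept that stands in for a remote one.

    Hypothetical helper: returns N-Triples lines saying what the concept is
    called here, and where we got it from.
    """
    local_uri = f"{local_base.rstrip('/')}/{slug}"
    return [
        # Our node is a concept in its own right...
        f"<{local_uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        # ...carrying the meaning we cached...
        f"<{local_uri}> <{SKOS}prefLabel> {_literal(label)} .",
        # ...and linking back to the original concept source.
        f"<{local_uri}> <{SKOS}exactMatch> <{origin_uri}> .",
        f"<{local_uri}> <{DCTERMS}source> <{origin_uri}> .",
    ]


if __name__ == "__main__":
    for line in proxy_node(
        "http://catalog.example.org/concepts/",  # made-up local base
        "foo",
        "http://example.org/authority/foo",      # made-up origin URI
        "Foo",
    ):
        print(line)
```

If the origin URI stops resolving, a crawler that already holds these statements can still tell that your node and everyone else's proxy of the same origin are talking about the same thing.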
(#include hand-waving-about-licensing.h)
There's a reason so many people are excited about distributed revision control. It allows decentralization without disallowing centralization. It lets you adapt to your workflows as you need. It makes any given node more likely to be able to repopulate the world with revision histories if a mass die-off of repo sources occurs. Decentralization allows resilience to grow.
Shouldn't the linked data web work this way?
(Er, it should, and it should be backed by distributed revision control techniques, too, but that's another topic for a later post.)
I don't see all this as cut and dried. There are a lot of details I'm glossing over, I know. But if you're still reading, maybe you'll agree with me that it's good to have something interesting to work on next.
Comments
Technology can be solved but policy may be more important
By Edward Corrado (not verified) on 11 Jan 2009 at about 13:12.
The potential of things going away is certainly a problem with any information source on the Web, not just linked data. It is an issue in print as well, which is why many consortia have "last copy" rules about making sure not to discard the last copy of a book. In the online journal world we have programs like LOCKSS.
When libraries build or implement new services, they need to take this into account and make sure there is a contingency plan, or at least make sure that they recognize there isn't one. This is one of my concerns with hosted e-book platforms. Companies aren't forever and can go out of business at any time, leaving us high and dry.
I'm not sure what the best way is to do this with linked data, but mirrors, such as are typical with major Open Source projects, are one way, as is your suggestion of a distributed revision control system. However, the technology can be solved if we put a little effort into it. What may make this impossible is licensing issues. This is one reason why things like the new OCLC WorldCat record sharing policy are so troubling. They make us put all of our eggs in one basket. With Open data, we are able to build truly redundant systems. If the data isn't open, nothing is going to help. If we build linked data services based on data that someone licenses, they can take it down at any time, and no level of mirroring or version control is going to help. In this way, it is much more of a policy and licensing issue than it is a technical one.