Somewhere between Anger and Bargaining

I've been doing some homework, trying to learn what RDF is all about, starting from suggested reading linked from the comments on my earlier post. So far I have a few basic questions and comments:

  • I get that things "started with" RDF/XML. But why do we need anything more than N3? It's not the simplest grammar in the world, but it's readable, compact, featureful, and it's what people seem to use when they actually talk about RDF. On top of that, it's just text. These seem like winning features.
  • I am befuddled by the sheer number of RDF examples that attempt to "model knowledge". What if I really just want to "describe resources"? Aren't those completely different activities? I think I want a framework for talking about stuff - not for representing stuff itself. Instinctively I'd guess this reflects the whole AI/KR "I Know What You Mean" heritage of hype, but we're not talking about the "Knowledge Representation Framework" here, right?
  • Somebody please hire Joe Celko (or a sufficiently advanced AI thereof) to write us a "SPARQL for Smarties".

And another thing, hard to form as a question. This nearly-machismistic talk about "how many triples? HOW MANY?" feels like a distraction. Every bit of recent experience I have building data-backed apps with a lot of data coming in the door has taught me to (a) keep the data just like you got it, (b) build a way to extract enough from it to run your app, and (c) optimize the extracted bits to meet user requirements. Keeping steps (a) through (c) separate means you can always throw away (b) and (c) and rebuild them from the original data in (a), keeping the sourced data safe over time. Especially because over time (b) and (c) just get easier and easier (faster machines, more RAM, better web frameworks, etc.). If this is common understanding (is it? it's what I know, now, at least), then what's the point of having a live environment for 2,000,000,000 triples? Is that really the only useful way you can query that kind of pile of data in all its semantic glory?

Instinctively I don't want to believe that that's true. I get that some applications are really about accumulating more and more data, and that that can get pretty big pretty quickly. But the triple model seems inherently optimized for flexibility, and as apps/data get bigger and bigger, you want to optimize for efficiency along a few known paths, and I can only imagine you'd again want to (b) extract enough from the source data to run your app. Which, I presume, would mean something very different from 2,000,000,000 triples.
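To make the (a)/(b)/(c) split concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the example.org URIs, the sample titles, and the function names `extract` and `build_index` are made up, and a real app would keep (a) on disk rather than in a list.

```python
from collections import defaultdict

# (a) Source data, kept exactly as received (hypothetical sample triples).
RAW_TRIPLES = [
    ("http://example.org/book/1", "dc:creator", "Dan Chudnov"),
    ("http://example.org/book/1", "dc:title", "A Title"),
    ("http://example.org/book/2", "dc:creator", "Dan Chudnov"),
    ("http://example.org/book/2", "dc:title", "Another Title"),
]


def extract(triples, predicate):
    """(b) Pull out only the statements the app needs."""
    return [(s, o) for s, p, o in triples if p == predicate]


def build_index(pairs):
    """(c) Optimize the extracted bits for one known path: object -> subjects."""
    index = defaultdict(list)
    for subject, obj in pairs:
        index[obj].append(subject)
    return dict(index)


# (b) and (c) can be thrown away and rebuilt from (a) at any time.
by_creator = build_index(extract(RAW_TRIPLES, "dc:creator"))
print(by_creator["Dan Chudnov"])
```

The point of the sketch is only that the app queries `by_creator`, a small structure shaped for one access path, while the full pile of triples sits untouched behind it.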

Does that make any sense? Maybe I'm not stating my concern clearly... or just don't yet get something fundamental about all this (very likely!).


partial answers

I agree that N3 or Turtle are much more readable than the XML serialization for RDF--that's why they were created. It comes down to whether you'd rather read:

@prefix dc: <http://purl.org/dc/elements/1.1/>.
<http://onebiglibrary.net> dc:creator "Dan Chudnov".

or this:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://onebiglibrary.net">
    <dc:creator>Dan Chudnov</dc:creator>
  </rdf:Description>
</rdf:RDF>

I imagine there might be someone who finds the XML easier to read... but the real benefit of having an XML serialization is being able to leverage existing XML parsers.
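As a rough illustration of leaning on an existing XML parser, the sketch below uses Python's standard `xml.etree.ElementTree` to pull the triple out of the RDF/XML example above. It is not a real RDF/XML parser: it assumes the flat `rdf:Description` form with literal property values only, and the function name `triples_from_rdfxml` is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF/XML example from above, as bytes so the encoding declaration is honored.
RDF_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://onebiglibrary.net">
    <dc:creator>Dan Chudnov</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""


def triples_from_rdfxml(data):
    """Return (subject, predicate, object) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(data)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree names elements "{namespace}local"; joining the two
            # parts gives the full predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            triples.append((subject, predicate, prop.text))
    return triples


for triple in triples_from_rdfxml(RDF_XML):
    print(triple)
```

For anything beyond a toy like this, a dedicated RDF library is the safer choice, since RDF/XML allows many equivalent ways to write the same statements.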

The key thing is that RDF is not a particular syntax for describing resources, but a model that allows you to make statements about resources using the triple: subject, predicate, object.

I'm curious where you are getting sidetracked on Knowledge Representation. One of the best intros to RDF I've run across is the RDF Primer which only mentions KR a couple times, mainly to describe RDF's pedigree:

RDF draws upon ideas from knowledge representation, artificial intelligence, and data management, including Conceptual Graphs, logic-based knowledge representation, frames, and relational databases.

Pretty much all the examples in the Primer are centered on describing resources identified with URIs: web pages, courses, products, books, vehicles. When you say resources do you mean specifically information resources on the web?

On the subject of SPARQL, it's early days yet ... it just became a W3C recommendation last week. As a point of reference, Joe Celko's SQL for Smarties came out in 1995, and SQL was standardized in 1986. I agree with you: more useful examples of working with SPARQL would be helpful. As with SQL, probably the easiest way to get to know SPARQL is to start doing some queries. Bob DuCharme had a good post a few months ago with examples of playing around with SPARQL and Wikipedia content in DBpedia.

The newness of SPARQL is also why you see a lot of talk about how many triples people have got. If the semantic web is to grow, support for storing and efficiently querying across large numbers of triples is important. So when you see people bragging about how many triples they've got, they are really demonstrating that they are able to store and query across them. Also, there's been a great deal of skepticism over the years about the Semantic Web, so demonstrating that people are publishing and linking together descriptions is also important for morale, I think.

Your point about tried-true methods of application development with data stores is really interesting. I don't know the answer. Part of the allure of triple-store technologies is that the data extraction and db schema creation step plays a less significant role, assuming that your data sources are available as RDF.

I think the more interesting part of putting RDF in your toolbox is that it allows you to make your datastore available to other people in an inter-operable way. As a thought experiment assume you have created a web application using your tried-true methods. How would you give your data store to someone else? RDF gives you a way to do this, by assigning public identifiers to resources you have modeled, and allowing you to make statements about them.

Anyhow, thanks for your thoughtful post. I must admit when I first skimmed it I got distracted by language like 'anger', 'bargaining', 'machismo', 'hype', etc. I should be used to this by now, reading stuff on the blogosphere. Kesa pointed out to me that thinking critically about something is a key part of any kind of learning exercise, so it sounds like you are well on your way :-)

Thanks for all these

Thanks for all these comments/responses/suggestions. You're totally right that I use a lot of loaded words - I guess my own will-to-hype is stronger than I'd like to admit.

I should have been documenting the exact places where I see examples that are more KR-ish than RD-ish. As I keep moving through this stuff I'll be sure to point out specifics.

You asked:

"As a thought experiment assume you have created a web application using your tried-true methods. How would you give your data store to someone else?"

In the past I've given my data store to other people by... well, by giving my data store to other people. :) For example, the jake project is somehow still alive because the data was always available. unalog's data has always been right out there in the open via rss, opensearch, and dumps of one's own data. In the canary project we lead people right back to source data in Pubmed and give them interfaces (links and COinS and unAPI) to pull up the data and fulltext (if they're lucky) for themselves. In the canary, I think these semi-standard and built-into-the-UI interfaces are better than plain data dumps like with jake, because with the jake data out there as csv files people would have to learn the schema and work with that to do anything with it... which is an argument for your point and minimizing the role of schema creation. But, on the other hand, that the data is still out there at all is an argument that we've been able to share data usefully for a while, and maybe jake was just a primordial "linked data" project, period (it had fixed URI/URL identifiers with alternate, data-rich xml representations, etc.).

I see what you mean about "SPARQL for Smarties", but I'd still buy that book as soon as it became available. :)

After chatting with eikeon some today I'm encouraged to spend more time in rdflib just working with things at the triple level. Maybe that will help me get up the curve a bit further.

Thanks again for taking the time!

Triple counts

Your posting got me to dust off an old MARC-to-triples program I have and count the output. It just puts every word in a triple, ignoring control fields and subfielding. On some of the samples I had lying around, that came to around 100 triples/record, or 10 billion triples for WorldCat. I'm not sure how triples for MARC should be structured, but it's easy to come up with lots of them.
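A sketch of what a word-per-triple conversion like that might look like, in Python. This is not the actual program: the record layout (a list of tag/value pairs), the sample record, the example.org URI, and the name `record_to_triples` are all assumptions for illustration.

```python
def record_to_triples(record_uri, fields):
    """Emit one (record, tag, word) triple per word, skipping control fields."""
    triples = []
    for tag, value in fields:
        if tag < "010":  # MARC control fields are tagged 001-009
            continue
        for word in value.split():
            triples.append((record_uri, tag, word))
    return triples


# Hypothetical, heavily trimmed record; subfield structure is ignored.
sample_fields = [
    ("001", "12345"),
    ("100", "Chudnov, Dan"),
    ("245", "One big library"),
]
triples = record_to_triples("http://example.org/record/12345", sample_fields)
print(len(triples))

# Back-of-the-envelope check on the figures above: 10 billion triples at
# about 100 triples/record implies on the order of 100 million records.
implied_records = 10_000_000_000 // 100
print(implied_records)
```

The implied record count is just the arithmetic behind the two numbers in the comment, not an independent figure for WorldCat.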

--Th


This site is Copyright (c) 2005-2008 by Daniel Chudnov. All rights reserved.

All opinions stated here are my own, and do not reflect those of my employer.