I've been doing some homework, trying to learn what RDF is all about, starting from suggested reading linked from the comments on my earlier post [1]. So far I have a few basic questions and comments:
And another thing, hard to form as a question. This nearly macho talk about "how many triples? HOW MANY? [3]" feels like a distraction. Every bit of recent experience I have building data-backed apps with a lot of data coming in the door has taught me to (a) keep the data just as you got it, (b) build a way to extract enough from it to run your app, and (c) optimize the extracted bits to meet user requirements. Keeping steps (a) through (c) separate means you can always rebuild or swap out (b) and (c) from the original data in (a), keeping the sourced data safe over time. That matters all the more because (b) and (c) just get easier and easier over time (faster machines, more RAM, better web frameworks, etc.). If this is common understanding (is it? it's what I know, now, at least), then what's the point of having a live environment for 2,000,000,000 triples? Is that really the only useful way you can query that kind of pile of data in all its semantic glory?
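To make the (a)/(b)/(c) split concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the sample records, the field names, and the helper names `extract` and `build_author_index` are all made up, and a real app would read (a) from files or a database rather than a list in memory.

```python
import json
from collections import defaultdict

# (a) Source records kept exactly as they arrived (hypothetical sample data).
RAW_RECORDS = [
    '{"id": "b1", "title": "Weaving the Web", "author": "Tim Berners-Lee", "year": 1999}',
    '{"id": "b2", "title": "Programming Pearls", "author": "Jon Bentley", "year": 1986}',
    '{"id": "b3", "title": "More Programming Pearls", "author": "Jon Bentley", "year": 1988}',
]


def extract(raw_records):
    """(b) Pull out only the fields the app needs; the raw lines are not modified."""
    for line in raw_records:
        record = json.loads(line)
        yield {"id": record["id"], "title": record["title"], "author": record["author"]}


def build_author_index(extracted):
    """(c) Optimize for one known access path: titles looked up by author."""
    index = defaultdict(list)
    for item in extracted:
        index[item["author"]].append(item["title"])
    return dict(index)


# (b) and (c) can be thrown away and rebuilt from (a) at any time.
author_index = build_author_index(extract(RAW_RECORDS))
print(author_index["Jon Bentley"])
```

The point of the sketch is only that `RAW_RECORDS` never changes: if the app later needs lookups by year instead, a different step (c) can be written against the same source data.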
Instinctively I don't want to believe that that's true. I get that some applications are really about accumulating more and more data, and that that can get pretty big pretty quickly. But the triple model seems inherently optimized for flexibility, and as apps/data get bigger and bigger, you want to optimize for efficiency along a few known paths, and I can only imagine you'd again want to (b) extract enough from the source data to run your app. Which, I presume, would mean something very different from 2,000,000,000 triples.
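Here is a rough sketch of the flexibility-versus-efficiency trade-off I have in mind, using plain Python tuples rather than any real triple store. The triples, the identifiers like `book:1`, and the function names `match` and `titles_by_author_name` are hypothetical; the `dc:` and `foaf:` prefixes are just borrowed as familiar-looking predicate names. It is not meant to show how any actual RDF store works.

```python
# Hypothetical triples as plain (subject, predicate, object) tuples.
TRIPLES = [
    ("book:1", "dc:title", "Weaving the Web"),
    ("book:1", "dc:creator", "person:tbl"),
    ("person:tbl", "foaf:name", "Tim Berners-Lee"),
    ("book:2", "dc:title", "Programming Pearls"),
    ("book:2", "dc:creator", "person:jb"),
    ("person:jb", "foaf:name", "Jon Bentley"),
]


def match(triples, s=None, p=None, o=None):
    """Flexible path: scan every triple; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def titles_by_author_name(triples):
    """Efficient path: precompute one known lookup from the same triples."""
    names = {s: o for s, p, o in triples if p == "foaf:name"}
    titles = {s: o for s, p, o in triples if p == "dc:title"}
    lookup = {}
    for s, p, o in triples:
        if p == "dc:creator":
            lookup.setdefault(names[o], []).append(titles[s])
    return lookup


# Any question can be asked of the triples, at the cost of a full scan each time.
print(match(TRIPLES, p="dc:creator", o="person:jb"))

# The extracted lookup answers exactly one kind of question, but directly.
print(titles_by_author_name(TRIPLES)["Jon Bentley"])
```

`match` can answer any pattern but walks the whole pile every time, while `titles_by_author_name` answers one question with a single dictionary lookup once it has been built. That second structure is the kind of "extracted bits" I would expect an app to actually run on.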
Does that make any sense? Maybe I'm not stating my concern clearly... or just don't yet get something fundamental about all this (very likely!).
Links:
[1] http://onebiglibrary.net/story/will-i-need-to-understand-the-semantic-web-in-2008#comments
[2] http://www.w3.org/TeamSubmission/n3/
[3] http://esw.w3.org/topic/LargeTripleStores