Quick hack: re-rendering newspaper pages from OCR data
This is a page of want ads, based on an image of a 110-year-old newspaper page currently available here. Visit that link and you can see the "Text" view, which isn't formatted much like the original page; you can also grab a PDF or JP2, or zoom in to the JP2 online.
Or you can see my re-rendering of the same page, built from the OCR data and its text-positioning info.
I'm rendering this from a dumb reading of the OCR'd text coordinate data, implemented in Processing. The code is trivial, so it's not even worth posting. Even so, what's good about it is that you can spot the right number of columns and pick out words that stand out from the page image in the right places. What's bad about it is that I'm not re-rendering font sizes correctly, and there's all that weird business along the left side and the top. I'm also certainly not understanding all the info the ALTO data is offering me, but that's easily remedied.
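For the curious, here's roughly what a "dumb reading" of the coordinate data could look like. This is not my Processing code, just a minimal Python sketch under some assumptions: that each word is an ALTO String element carrying CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, and that the Page element carries the overall WIDTH. The sample XML, its placeholder namespace, and the extract_words helper are all made up for illustration. Using the box HEIGHT as a stand-in for font size is a guess at one way to attack the font-size problem, not something I've verified against real pages.

```python
import xml.etree.ElementTree as ET

# Tiny made-up ALTO-like sample; the namespace is a placeholder, not the real one.
SAMPLE_ALTO = """<alto xmlns="urn:example:alto">
  <Layout>
    <Page ID="P1" HEIGHT="1000" WIDTH="800">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="WANTED" HPOS="100" VPOS="50" WIDTH="200" HEIGHT="40"/>
            <String CONTENT="cook" HPOS="320" VPOS="60" WIDTH="80" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works whatever namespace is used."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml, canvas_width):
    """Return one dict per word, with coordinates scaled to fit canvas_width."""
    root = ET.fromstring(alto_xml)
    # Assumption: the first Page element describes the whole page.
    page = next(el for el in root.iter() if local_name(el.tag) == "Page")
    scale = canvas_width / float(page.get("WIDTH"))

    words = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        words.append({
            "text": el.get("CONTENT"),
            "x": float(el.get("HPOS")) * scale,
            "y": float(el.get("VPOS")) * scale,
            "w": float(el.get("WIDTH")) * scale,
            # Assumption: box height is only a rough proxy for font size.
            "h": float(el.get("HEIGHT")) * scale,
        })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO, canvas_width=400):
        print(word)
```

From there, drawing is just a loop that places each word's text at (x, y) in whatever graphics toolkit you like. Real files may use different measurement units and will be a lot messier than this sample.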
In other words, with raw data comes raw responsibility!
Stay tuned, lots and lots of this data coming along very soon... can't wait to see what the rest of you do with it.
