Quick hack: re-rendering newspaper pages from OCR data

This is a page of want ads, based on an image of a 110-year-old newspaper page currently available here. Visit that link and you can see the "Text" view, which isn't formatted much like the original page, and you can also grab a PDF or JP2, or zoom in to the JP2 online.

Or you can see my re-rendering of the same page, built from the OCR data and its text positioning info.


I'm rendering this based on a dumb reading of the OCR'd text coordinate data, implemented using Processing. The code is trivial and dumb, so it's not even worth posting. Even so, what's good about it is that you can spot the right number of columns and pick out some words that stand out from the page image in the right places. What's bad about it is that I'm not re-rendering font sizes correctly, and there's all that weird business along the left side and the top. On top of that, I'm certainly not understanding all the info the ALTO data is offering me, but that's easily remedied.

In other words, with raw data comes raw responsibility!
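For anyone who wants to try the same trick, here's a minimal sketch of the idea in Python rather than Processing. It is not the code behind the rendering above, just an illustration under some assumptions: that each word is an ALTO `String` element carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes, that the page dimensions are known, and that the box height is a passable stand-in for font size (a guess, not something the data promises). The sample XML, its namespace, and the function names are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Stripped-down, made-up sample; a real ALTO file has more structure,
# a real namespace, and required attributes that are omitted here.
SAMPLE_ALTO = """<alto xmlns="http://example.org/hypothetical-alto-namespace">
  <Layout>
    <Page WIDTH="1000" HEIGHT="2000">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="WANTED" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
            <String CONTENT="clerk" HPOS="190" VPOS="204" WIDTH="50" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per String element: its text and its box on the page."""
    root = ET.fromstring(alto_xml)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": float(elem.get("HPOS", 0)),
            "y": float(elem.get("VPOS", 0)),
            "w": float(elem.get("WIDTH", 0)),
            "h": float(elem.get("HEIGHT", 0)),
        })
    return words


def scale_words(words, page_w, page_h, canvas_w, canvas_h):
    """Map page coordinates onto a canvas of a different size."""
    sx = canvas_w / page_w
    sy = canvas_h / page_h
    placed = []
    for word in words:
        placed.append({
            "text": word["text"],
            "x": word["x"] * sx,
            "y": word["y"] * sy,
            "w": word["w"] * sx,
            # Assumption: scaled box height approximates the font size.
            "font_size": word["h"] * sy,
        })
    return placed


if __name__ == "__main__":
    words = extract_words(SAMPLE_ALTO)
    for placed in scale_words(words, 1000, 2000, 500, 1000):
        print(placed)
```

From there, drawing is just a loop that puts each word's text at its scaled `x`, `y` in whatever graphics toolkit you like. Matching on the local tag name instead of a fixed namespace is a deliberate shortcut, since different ALTO versions use different namespaces.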

Stay tuned, lots and lots of this data coming along very soon... can't wait to see what the rest of you do with it.

