Category Archives: Digitization

I’m Looking for a Microfilm Digitization Quote

I’m looking for a microfilm digitization quote. If you or someone you know provides microfilm digitization, please have them send me a quote.

I’ve got 113 reels of microfilm I’d like to digitize and I’m looking for a ballpark estimate for the project.

Here’s the info I know, please let me know if you need anything else:

  • There are an average of 600 images per reel (about 67,800 images in total)
  • I’d like to scan at 300 dpi, as 8-bit grayscale lossless images (TIFF or PNG)
  • I have the copyright on the images on these reels
  • The reels are lightly used duplicates of the master reels. The master reels are unfortunately unavailable
  • The images are all scanned newspapers
  • I don’t need any OCR done
  • The only metadata I need is which reel each image came from, e.g. one directory per reel with incremental file names would be just fine.
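
To make the last bullet concrete, the layout could look something like the sketch below. The `reel_NNN` directory names, the zero padding and the `.tif` extension are only assumptions for illustration; any scheme that records the reel number would do.

```php
<?php
// Hypothetical naming scheme: one directory per reel, incremental file names.
// The prefix, zero padding and extension are illustrative assumptions.
function scanPath(int $reel, int $image, string $ext = 'tif'): string
{
    return sprintf('reel_%03d/%04d.%s', $reel, $image, $ext);
}

echo scanPath(1, 1), "\n";     // reel_001/0001.tif
echo scanPath(113, 600), "\n"; // reel_113/0600.tif
```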

Project Background: I would like to put 128 years of Iron County Miner newspaper archives online. They would be freely available (no subscription or account required) and there’s no plan to make money from them. Since there’s no revenue expected I’m looking for ways to reduce costs while still putting something out there to benefit genealogists and historians.

The master rolls are held by the Wisconsin Historical Society, which wants nearly $10,000 ($0.145/image) for the project, or $80 per reel to send us fresh copies of the reels. From their perspective, I think that’s probably fair; they aren’t in the digitization business and they probably aren’t set up to do this sort of project in a streamlined manner. They also can’t amortize their digitization equipment costs across as many clients as a commercial company can.
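
For a rough sense of scale, the figures quoted in this post multiply out as follows. This is only an arithmetic check; 600 images per reel is an average, so the totals are estimates.

```php
<?php
// All inputs are figures quoted in the post; 600 images per reel is an average.
$reels         = 113;
$imagesPerReel = 600;
$pricePerImage = 0.145; // dollars per image for scanning
$pricePerReel  = 80;    // dollars per fresh duplicate reel

$totalImages = $reels * $imagesPerReel;       // 67,800 images
$scanCost    = $totalImages * $pricePerImage; // about $9,831
$copyCost    = $reels * $pricePerReel;        // $9,040

printf("Images: %s\n", number_format($totalImages));
printf("Scanning at \$%.3f/image: \$%s\n", $pricePerImage, number_format($scanCost, 2));
printf("Fresh reel copies at \$%d/reel: \$%s\n", $pricePerReel, number_format($copyCost, 2));
```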

For me though, $10,000 means that I can’t pursue this project right now.

Most digitization companies I have contacted have been reluctant to provide even a ballpark quote without seeing test reels, and I understand that that is a factor. Right now though, I just need a gauge to determine if this project is viable. If $10,000 is the real cost for this sort of project it will have to wait till I’m rich, but if I can get a cheaper quote I hope to make it happen this summer.

Pre-Announcing NewspaperCMS

I have been working on a CMS (Content Management System) called NewspaperCMS to host the scanned images and make them easily navigable. It is licensed under the GPLv2, so anybody needing to host newspaper archives can use it.

Here’s its page on Google Code: http://code.google.com/p/newspapercms/

I would classify it as in late Alpha or early Beta stages right now. I’ll do an official post on it as it matures and as I get a publicly accessible test site set up. As a teaser, features include:

  • Browse collection by microfilm, newspaper or date
    • Drill down within those categories by newspaper, issue, year or month
  • Access-driven generation of midsized images. No need to generate 60,000 midsized images ahead of time.
  • Valid HTML5/CSS3
  • HTML5/Canvas based client-side image viewer. The user can zoom, rotate, invert, sharpen and change the contrast of the image (uses the Pixastic JavaScript libraries, http://www.pixastic.com/)
    • Falls back to a static image if they don’t have Canvas or JavaScript support
  • Built in search engine
  • Support for the tesseract OCR engine
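
The access-driven generation of midsized images in the list above boils down to a cache check on each request. Here is a minimal sketch of that idea, not NewspaperCMS's actual code: `midsizedImage()`, the cache file naming and the injected `$resize` callback are all hypothetical.

```php
<?php
// Returns the path of a cached midsized copy of $source, creating it on first
// access. $resize is a hypothetical callback: function ($from, $to) that
// writes the scaled image to $to using whatever image library is available.
function midsizedImage(string $source, string $cacheDir, callable $resize): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }

    // Assumed naming scheme: hash of the source path plus its extension.
    $extension = pathinfo($source, PATHINFO_EXTENSION);
    $cached = $cacheDir . '/' . md5($source) . '_mid.' . $extension;

    // Generate only when the copy is missing or older than the original.
    if (!is_file($cached) || filemtime($cached) < filemtime($source)) {
        $resize($source, $cached);
    }

    return $cached;
}
```

The first request for a page pays the resizing cost; every later request is just a file lookup, so nothing has to be generated ahead of time.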

As I said, it’s still in development, but if you need something like it, you can play with it now. It’s at the point where more development doesn’t make sense until I know I can get the microfilms scanned.

Posted in Computers, Digitization, Programming, Projects, Something Interesting

PHP Protocol Buffer to MySQL (and back!) bridge

Protocol Buffers are a binary data serialization format from Google. Google officially supports C++, Java and Python, and third-party libraries support other languages. I previously mentioned several that support PHP, including the one that we’re using at work, protoc-gen-php.

One challenge that we faced was storing our data. Should we store our data and convert to Protocol Buffers every time we sent it, or should we just work in Protocol Buffers and store it to the database directly?

We decided to store it in the database in a format compatible with the Protocol Buffer classes so we could easily access it as a Protocol Buffer object again later.

The following classes and scripts were written to help make that bridge between the Protocol Buffer classes generated by protoc-gen-php and MySQL.

They perform two main functions:

  1. Generate MySQL table create statements to build tables to hold the protocol buffer data
  2. Make classes which extend the protoc-gen-php classes with extra functions for database storage and retrieval (and a few bonuses)

Generating The MySQL

We’ll start by generating some tables for our database. You’ll need php-cli installed, and you’ll need protoParser.php and protoMySQL.php in your PHP include path (or current directory) and makeMysql.php in your $PATH (or current directory).

Edit protoMySQL.php’s preferences (lines 17-36) to suit your configuration and needs.

Now something as simple as:

php ./makeMysql.php *.proto

should generate the MySQL table create statements you will need.
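
As a rough illustration of the idea (not the actual output of makeMysql.php, which depends on the preferences you set), turning a message definition into a table could look like the sketch below. The type mapping, the `id` column and the `createTableSql()` helper are assumptions made for this example.

```php
<?php
// Illustrative only: maps a few proto scalar types to MySQL column types.
// This mapping is an assumption, not the one used by protoMySQL.php.
function createTableSql(string $message, array $fields): string
{
    $typeMap = [
        'int32'  => 'INT',
        'int64'  => 'BIGINT',
        'uint32' => 'INT UNSIGNED',
        'bool'   => 'TINYINT(1)',
        'double' => 'DOUBLE',
        'string' => 'TEXT',
        'bytes'  => 'BLOB',
    ];

    $columns = ['`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY'];
    foreach ($fields as $name => $protoType) {
        if (!isset($typeMap[$protoType])) {
            throw new InvalidArgumentException("Unsupported proto type: $protoType");
        }
        $columns[] = sprintf('`%s` %s NULL', $name, $typeMap[$protoType]);
    }

    return sprintf("CREATE TABLE `%s` (\n  %s\n);", $message, implode(",\n  ", $columns));
}

// A hypothetical message with three scalar fields.
echo createTableSql('Item', ['title' => 'string', 'quantity' => 'int32', 'done' => 'bool']), "\n";
```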

Generate The DB Classes

With protoc-gen-php, each .proto object gets a corresponding class, e.g. list.proto.php is created from list.proto.

makeClasses.php creates listDB.php which extends the classes generated by protoc-gen-php. Each proto object gets its own protoDB class and file.

php ./makeClasses.php *.proto

Those classes should then be used instead of their original non-database supporting proto classes.

DB Class Functions

__construct($id_or_object = NULL, $limit = PHP_INT_MAX)

$id_or_object
If it’s an object, we assume it’s a non-database variant, and use its members to populate this object.
If it’s numeric, we fetch the object from the database
Otherwise, we pass it up to the parent object.

$limit
Used in the parent constructor

get($id = NULL, $args = NULL)
$id
The database ID of the object to fetch.

$args
Enough MySQL arguments to uniquely identify the proto to retrieve.
If an array, each field is added to the query.
If a string, the string is appended to the query as is, after the WHERE clause.

Returns the object if found, or NULL if not found (or if more than one match is found)
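
To illustrate the two forms $args can take, here is a minimal sketch of how such a query could be assembled. `buildSelect()` is a hypothetical helper, not the code in the generated classes, and `addslashes` is only a stand-in for real MySQL escaping so the example runs without a database connection.

```php
<?php
// Sketch: build a SELECT from an optional id plus array or string arguments.
// $escape stands in for the database escaping function.
function buildSelect(string $table, $id, $args, callable $escape): string
{
    $conditions = [];
    if ($id !== null) {
        $conditions[] = '`id` = ' . (int) $id;
    }
    if (is_array($args)) {
        // Array form: each field becomes its own condition.
        foreach ($args as $field => $value) {
            $conditions[] = sprintf("`%s` = '%s'", $field, $escape((string) $value));
        }
    } elseif (is_string($args) && $args !== '') {
        // String form: used as is, so the caller must supply safe SQL.
        $conditions[] = $args;
    }
    $where = $conditions ? ' WHERE ' . implode(' AND ', $conditions) : '';
    return "SELECT * FROM `$table`$where";
}

echo buildSelect('list', 5, null, 'addslashes'), "\n";
echo buildSelect('list', null, ['owner' => "O'Brien"], 'addslashes'), "\n";
echo buildSelect('list', null, '`created` > NOW() - INTERVAL 1 DAY', 'addslashes'), "\n";
```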

unique()
Calls array_unique on any repeating elements.
Note: Arrays of objects are compared using their __toString methods

Returns nothing
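
The note about __toString matters because array_unique() compares elements as strings by default. A small sketch, using a hypothetical `Tag` class in place of a real repeated proto sub-object:

```php
<?php
// Hypothetical stand-in for a repeated proto sub-object.
final class Tag
{
    private $name;

    public function __construct(string $name)
    {
        $this->name = $name;
    }

    public function __toString(): string
    {
        return $this->name;
    }
}

$tags = [new Tag('news'), new Tag('sports'), new Tag('news')];

// array_unique() compares elements as strings by default, so objects are
// compared through __toString(); the first occurrence of each value is kept.
$unique = array_unique($tags);

echo implode(', ', $unique), "\n"; // news, sports
```

Two distinct objects that print the same string count as duplicates, so __toString needs to include every field that should distinguish them.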

load($object)

If you have an equivalent object (eg. a non-database or database version of the object) you can load it into the current object with this method.

Returns nothing

nullOrVal($val)
Determine if we should append NULL or an escaped string. Used in MySQL queries to ensure safe values.

Returns “NULL” or a string escaped with mysql_real_escape_string
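
A minimal sketch of the same decision, so the example runs without a database connection: `nullOrValSketch()` is a hypothetical helper, `addslashes` stands in for the MySQL escaping function, and wrapping the escaped value in quotes is my assumption about how the result is used.

```php
<?php
// Sketch of the NULL-or-escaped-string decision. $escape stands in for the
// database escaping function; adding the surrounding quotes is an assumption.
function nullOrValSketch($val, callable $escape): string
{
    if ($val === null) {
        return 'NULL';
    }
    return "'" . $escape((string) $val) . "'";
}

$values = [
    nullOrValSketch("O'Brien", 'addslashes'),
    nullOrValSketch(null, 'addslashes'),
];

// Prints: INSERT INTO `people` (`name`, `nickname`) VALUES ('O\'Brien', NULL)
echo 'INSERT INTO `people` (`name`, `nickname`) VALUES (' . implode(', ', $values) . ')', "\n";
```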

delete()
Shallow delete from the database. Since sub-objects could be shared or referenced by other proto objects, this only deletes this object’s own entry in the database.

Returns nothing

put()
INSERT or UPDATE this object in the database.

Returns the insert ID of the object
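
One way the INSERT-or-UPDATE choice could be made is sketched below. It assumes that an object without a database ID is new and one with an ID already exists; `putSql()` is a hypothetical helper, and the generated classes may decide differently.

```php
<?php
// Assumption: a row without an id is new (INSERT); a row with an id already
// exists (UPDATE). Values are expected to be SQL-ready (already escaped/quoted).
function putSql(string $table, array $row, ?int $id = null): string
{
    if ($id === null) {
        $columns = '`' . implode('`, `', array_keys($row)) . '`';
        return sprintf('INSERT INTO `%s` (%s) VALUES (%s)', $table, $columns, implode(', ', $row));
    }

    $assignments = [];
    foreach ($row as $column => $value) {
        $assignments[] = "`$column` = $value";
    }
    return sprintf('UPDATE `%s` SET %s WHERE `id` = %d', $table, implode(', ', $assignments), $id);
}

echo putSql('list', ['title' => "'Groceries'", 'owner' => 'NULL']), "\n";
echo putSql('list', ['title' => "'Groceries'"], 7), "\n";
```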

toJSON($asArray = FALSE)
Returns a JSON representation of the current object, or an array appropriate for use in json_encode.

fromJSON($json)
Load the variables in the object from a JSON string
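
The round trip between the two JSON methods can be sketched with PHP's built-in json_encode() and json_decode(). `ItemSketch` and its two fields are hypothetical; the generated DB classes have their own fields and may handle nested objects differently.

```php
<?php
// Hypothetical stand-in class; the generated DB classes have their own fields.
class ItemSketch
{
    public $title;
    public $quantity;

    // Returns a JSON string, or a plain array suitable for json_encode().
    public function toJSON($asArray = false)
    {
        $data = get_object_vars($this);
        return $asArray ? $data : json_encode($data);
    }

    // Loads known properties from a JSON string, ignoring unknown keys.
    public function fromJSON($json)
    {
        $data = json_decode($json, true);
        if (!is_array($data)) {
            throw new InvalidArgumentException('Invalid JSON');
        }
        foreach ($data as $key => $value) {
            if (property_exists($this, $key)) {
                $this->$key = $value;
            }
        }
    }
}

$item = new ItemSketch();
$item->title = 'Milk';
$item->quantity = 2;

$copy = new ItemSketch();
$copy->fromJSON($item->toJSON());
echo $copy->toJSON(), "\n"; // {"title":"Milk","quantity":2}
```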

purge()
Like delete(), but also deletes the referenced proto objects from the database.
Returns number of sub-objects deleted.
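
The difference between the shallow delete() and the deep purge() can be shown with an in-memory stand-in for the database. Everything in this sketch (the `$store` layout, `deleteSketch()` and `purgeSketch()`) is hypothetical and only illustrates the behaviour described above.

```php
<?php
// In-memory stand-in for the database: id => ['children' => [sub-object ids]].
$store = [
    1 => ['children' => [2, 3]],
    2 => ['children' => []],
    3 => ['children' => [4]],
    4 => ['children' => []],
];

// Shallow delete: removes only the object's own entry.
function deleteSketch(array &$store, int $id): void
{
    unset($store[$id]);
}

// Deep delete: removes the object and everything it references,
// returning the number of sub-objects deleted.
function purgeSketch(array &$store, int $id): int
{
    if (!isset($store[$id])) {
        return 0;
    }
    $children = $store[$id]['children'];
    unset($store[$id]);

    $deleted = 0;
    foreach ($children as $childId) {
        if (isset($store[$childId])) {
            $deleted += 1 + purgeSketch($store, $childId);
        }
    }
    return $deleted;
}

$shallow = $store;
deleteSketch($shallow, 1);
echo count($shallow), "\n"; // 3: sub-objects 2, 3 and 4 remain

$deep = $store;
echo purgeSketch($deep, 1), "\n"; // 3 sub-objects deleted
echo count($deep), "\n";          // 0
```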

License, Warranty and Support

My employers have been kind enough to let me release these scripts under an Open Source license. They are released under the GPL v.2 without any warranty.

We are actually switching away from MySQL on this particular project, and so these scripts are unlikely to receive any further updates.

I will provide such support as I have time for through the comments on this blog post.

Happy programming!

Posted in Computers, Digitization, Projects, Something Interesting

TakOCR

TakOCR : Easy OCR for Mac

Tako : Japanese for Octopus
OCRopus : Great Open Source OCR project

TakOCR is a project to fill a need I had. I needed a GUI to an OCR engine for my dad. He’s not really the compile-it-and-use-the-command-line type of guy. He is, however, a Mac-using guy, so here are the results for your enjoyment.

Latest downloads

TakOCR.pkg version 1 md5: a7a620e1bbef92c454764c42ce1b4b8e
All packages, sources, uninstaller, etc.

NOTICE:

TakOCR is no longer supported.  If the existing program works for you, great!  If it does not work, I hope you find something else that does.

If someone wants to give me a Mac with the latest version of OSX, I would be happy to update this software. :-)

Usage

Run the installer program, then just drop images onto the program. The OCRed output will be displayed in a window which will pop up.

You will need to quit TakOCR before dropping more images onto it.

What’s Included, Copyrights

TakOCR is really just a bundle of OCRopus, ImageMagick, Ghostscript and a little wrapper application to tie it all together. ImageMagick and Ghostscript let you OCR PDFs, TIFFs, JPEGs, and many more formats.

The wrapper script is just a little Ruby program made into a droplet application with the help of Platypus.

All of the software included is available under Open Source compatible licenses. You may download the sources at the link above and read the individual packages’ licenses if you wish. The included software is: ImageMagick, iulib, libjpeg, leptonlib, libpng, ocropus, OpenFST, tesseract, libtiff, zlib, ghostscript.

TakOCR itself and the script behind the scenes are both placed in the Public Domain.

Posted in Digitization, Programming, Projects