Tag Archives: OCR

Bookscanned is no more

Thank you for coming to look for Bookscanned. While I still firmly believe in the concept behind Bookscanned (scanning genealogy materials and making them available online with OCR software), I am not in a position to dedicate the time needed to make it work well.

I am confident that as OCR software becomes easier to use, storage gets cheaper, and genealogists become more computer savvy, something like Bookscanned will come along from someone who can dedicate the resources needed to do it right.

Thanks again, and good luck in your genealogy searches!

Posted in Genealogy

TakOCR

TakOCR : Easy OCR for Mac

Tako : Japanese for Octopus
OCRopus : Great Open Source OCR project

TakOCR is a project to fill a need I had. I needed a GUI to an OCR engine for my dad. He’s not really the compile-it-and-use-the-command-line type of guy. He is, however, a Mac-using guy, so here are the results for your enjoyment.

Latest downloads

TakOCR.pkg version 1 md5: a7a620e1bbef92c454764c42ce1b4b8e
All packages, sources, uninstaller, etc.
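
To check a download against the md5 listed above, something like the following should work. It is only a minimal sketch: the function names `file_md5` and `check_md5` are made up for this example, and it assumes that either `md5sum` (most Linux systems) or `md5` (Mac OS X) is installed.

```shell
# Print the md5 checksum of a file.
# file_md5 is a hypothetical helper, not part of TakOCR.
file_md5() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1   # md5sum prints "<sum>  <file>"
  else
    md5 -q "$1"                     # Mac OS X: -q prints only the sum
  fi
}

# Succeed only if the file's checksum equals the expected value.
check_md5() {
  [ "$(file_md5 "$1")" = "$2" ]
}
```

For example: `check_md5 TakOCR.pkg a7a620e1bbef92c454764c42ce1b4b8e && echo "checksum OK"`.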

NOTICE:

TakOCR is no longer supported.  If the existing program works for you, great!  If it does not work, I hope you find something else that does.

If someone wants to give me a Mac with the latest version of OS X, I would be happy to update this software. :-)

Usage

Run the installer program, then just drop images onto the program. The OCRed output will be displayed in a window which will pop up.

You will need to quit TakOCR before dropping more images onto it.

What’s Included, Copyrights

TakOCR is really just a bundle of OCRopus, ImageMagick, Ghostscript and a little wrapper application to tie it all together. ImageMagick and Ghostscript let you OCR PDFs, TIFFs, JPEGs, and many more formats.

The wrapper script is just a little Ruby program made into a droplet application with the help of Platypus.

All of the software included is available under Open Source compatible licenses. You may download the sources at the link above and read the individual packages’ licenses if you wish. The included software is: ImageMagick, iulib, libjpeg, leptonlib, libpng, ocropus, OpenFST, tesseract, libtiff, zlib, ghostscript.

TakOCR itself and the script behind the scenes are both placed in the Public Domain.

Posted in Digitization, Programming, Projects

OCRShow

OCRShow is a single PHP file you can put in a folder with scanned images and their OCRed text to quickly create a website that is indexable by search engines and easily navigable by humans.

Features

  • Display the digitized text and the source image on the same page
  • Search Engine Friendly URLs
  • Next/Previous page links
  • Easy to navigate for human visitors
  • Easily search the book
  • Just a single PHP file
  • Automatically writes .htaccess file if none exists
  • Easy installation

Download and Install OCRShow

Get It Here

To install OCRShow, you will need to create three types of files for each page you have scanned:

  1. An image file (e.g. page_0001.png)
  2. A smaller version of the image file (e.g. small_page_0001.png)
  3. A plain text copy of what the image says, saved under the image's file name with .html added (e.g. page_0001.png.html)

OCRShow will use the ‘small_’ prefix and the .html file extension. If you want something different, you will need to edit the code yourself.
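
Before uploading, it can help to confirm that every page has all three files. The following is a minimal sketch of such a check. `check_pages` is a made-up helper and not part of OCRShow; it assumes the full-size images are PNGs and that the naming scheme is exactly the one described above.

```shell
# List the companion files that are missing for each full-size page image.
# check_pages is a hypothetical helper, not part of OCRShow.
check_pages() (
  cd "$1" || exit 1
  for img in *.png; do
    [ -e "$img" ] || continue                  # no PNGs: the glob stayed literal
    case "$img" in small_*) continue ;; esac   # skip the small copies themselves
    [ -e "small_$img" ] || echo "missing small_$img"
    [ -e "$img.html" ]  || echo "missing $img.html"
  done
)
```

Running `check_pages /path/to/scans` prints one line per missing file and prints nothing when the set is complete.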

Upload the small_, image, and .html files to their own directory. Rename the downloaded PHP file to index.php and upload it to the same directory. The first time you visit that directory, a .htaccess file will be created. It is needed for the search engine friendly URLs.

  • OCRShow requires PHP, mod_rewrite and .htaccess to work.
  • Search functionality requires grep, sed and sort, as well as the ability to run shell commands from PHP with backticks. Most hosting providers allow this.
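
To give an idea of why those three tools are needed, here is a rough sketch of the kind of pipeline a search like this could run. This is an assumption for illustration only, not OCRShow's actual code, and `search_pages` is a made-up name.

```shell
# Print the page images whose OCR text contains the search term.
# search_pages is a hypothetical stand-in, not OCRShow's actual code.
search_pages() (
  cd "$1" || exit 1
  # grep: -i ignores case, -l prints only matching file names,
  #       -F treats the term as a literal string rather than a regex
  # sed:  strips the trailing .html to get back the image name
  # sort: puts the pages in order
  grep -ilF -- "$2" *.html 2>/dev/null | sed 's/\.html$//' | sort
)
```

For example, `search_pages /path/to/scans smith` would list every page image whose text mentions "smith" in any capitalization.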

Why Would I Need This?

In many cases there is only one copy of a genealogy book. Perhaps it is a personal history or a book of remembrance. Physically the book can only be in one place at one time. Thanks to the proliferation of scanners, you can easily change that book into a series of images. The images can easily be shared with people you know. What about people you don’t know? Say a common descendant who is also researching your great-great-great-great-grandpa? In order to let people like that find your scanned images, you need to get information about the scans into the search engines. Search engines can’t read images (yet!), so you need to provide text for the images. The easiest way to convert a typed document into text is with Optical Character Recognition software, such as Tesseract.

Search engines follow links when they are building their database, so you need to provide links between the different images and text. OCRShow is a very easy way to create those links and a very easy way to share your scanned documents. You just upload the documents and OCRShow and you’re done!

I am in the process of scanning several hundred pages of genealogy books so that they will be available digitally to the rest of my family. I decided that it would be good if Google could find them too so that other people who may be researching my ancestors will be able to find the information and hopefully we can help each other out.

What’s Your Process?

I am using Ubuntu 8.10 Linux and an Epson Stylus CX3810 all-in-one printer/scanner. The scanning software xsane automatically increments the number at the end of the file name, so I start the book at filename_0001.tif and just keep clicking ‘Scan’. I scan in straight black and white at 300 dpi, which keeps the file size down and makes it easier for the OCR software to figure out where the letters are.

Once the whole book or document is scanned, I run a short shell script:

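		#!/bin/bash
		# OCR each scan; tesseract adds .txt, giving e.g. filename_0001.tif.txt
		for i in *.tif; do tesseract "$i" "$i"; done
		# Convert each TIFF to PNG, giving e.g. filename_0001.tif.png
		for i in *.tif; do convert "$i" "$i.png"; done
		# Make a 1000-pixel-high copy of each PNG with the small_ prefix
		for i in *.png; do convert -resize x1000 "$i" "small_$i"; done
		# Name the OCR text after its PNG, e.g. filename_0001.png.html
		rename 's/\.tif\.txt$/.png.html/' *.txt
		# Drop the leftover .tif from the PNG names
		rename 's/\.tif\.png$/.png/' *.png
		# The PNGs now replace the original TIFFs
		rm *.tif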

Tesseract is an OCR engine developed over 10 years ago by HP, then donated to the open source world. It converts a tif file into a text file. It is not formatting aware, so columns, pedigree charts, etc. all disappear. In my case this is OK, because I am really just creating text so Google can find the images. I expect that human users will read the text on the images. The convert commands change the tifs into pngs, because most browsers can’t display tif files. I also create a smaller image so that the user can quickly view the page. Both png and tif are lossless, so we can delete the tif files at the end.

I know that software for this process exists for Windows and Macintosh systems too, but I am not familiar with what your options are. If you have recommendations, especially free recommendations, please let me know and I will post them here.
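
If deleting every tif in one go feels risky, a more cautious cleanup is possible. The sketch below removes a tif only when a non-empty png with the same base name already exists. `remove_converted_tifs` is a made-up name for this example, and it assumes the pngs have already been renamed from .tif.png to .png.

```shell
# Delete each TIFF only if a non-empty PNG with the same base name exists.
# remove_converted_tifs is a hypothetical helper for this example.
remove_converted_tifs() (
  cd "$1" || exit 1
  for tif in *.tif; do
    [ -e "$tif" ] || continue          # no TIFFs: the glob stayed literal
    png="${tif%.tif}.png"              # filename_0001.tif -> filename_0001.png
    if [ -s "$png" ]; then
      rm -- "$tif"
    else
      echo "keeping $tif: $png is missing or empty" >&2
    fi
  done
)
```

Calling `remove_converted_tifs .` in the scan directory would take the place of the final `rm` line in the script.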

.htaccess

If the .htaccess file cannot be created by the script, create a file named .htaccess in the same directory as the index.php file with these contents:

RewriteEngine on
RewriteRule !\.(gif|jpg|png|tif|css|php)$ index.php
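
The leading `!` negates the pattern, so the rule sends every request to index.php unless the path ends in one of the listed extensions. The sketch below applies the same regular expression with grep to show which paths would be rewritten. `goes_to_index` is a made-up helper, and the example paths are invented rather than taken from OCRShow.

```shell
# Succeed if the request path would be handed to index.php by the rule,
# i.e. if it does NOT end in one of the listed extensions.
# goes_to_index is a hypothetical helper used only to illustrate the pattern.
goes_to_index() {
  ! printf '%s\n' "$1" | grep -Eq '\.(gif|jpg|png|tif|css|php)$'
}
```

With this, `goes_to_index page_0001` succeeds (the request is rewritten to index.php), while `goes_to_index page_0001.png` fails (the image is served directly).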

Posted in Digitization, Genealogy, Programming, Projects