OCRShow is a single PHP file that you put in a folder with scanned images and their OCRed text to quickly create a website that search engines can index and humans can easily navigate.
- Display the digitized text and the source image on the same page
- Search Engine Friendly URLs
- Next/Previous page links
- Easy to navigate for human visitors
- Easily search the book
- Just a single PHP file
- Automatically writes .htaccess file if none exists
- Easy installation
Download and Install OCRShow
To install OCRShow, you will need to create three types of files for each page you have scanned:
- An image file (e.g. page_0001.png)
- A smaller version of the image file (e.g. small_page_0001.png)
- A plain-text copy of what the image says, named after the image with .html appended (e.g. page_0001.png.html)
OCRShow will use the ‘small_’ prefix and the .html file extension. If you want something different, you will need to edit the code yourself.
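Before uploading, it can help to confirm that every page has all three files. The following is a minimal sketch, not part of OCRShow: the function name `check_pages` is hypothetical, and it assumes the page images are .png files that follow the naming scheme described above.

```shell
#!/bin/bash
# check_pages DIR: report any page image that lacks its small_ copy
# or its .html text file. Returns 1 if anything is missing, else 0.
# Assumes the scheme NAME.png, small_NAME.png, NAME.png.html.
check_pages() {
    dir="${1:-.}"
    missing=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue                  # directory has no .png files
        base="$(basename "$img")"
        case "$base" in small_*) continue ;; esac  # skip the small copies themselves
        if [ ! -e "$dir/small_$base" ]; then
            echo "missing small image: small_$base"
            missing=1
        fi
        if [ ! -e "$dir/$base.html" ]; then
            echo "missing text file: $base.html"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_pages /path/to/scans` prints one line per missing file and prints nothing when the directory is complete.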
Upload the small_, image, and .html files to their own directory. Rename the downloaded PHP file to index.php and upload it to the same directory. The first time you visit that directory, OCRShow creates the .htaccess file that is needed for search engine friendly URLs.
- OCRShow requires PHP, mod_rewrite and .htaccess to work.
- Search functionality requires grep, sed and sort, as well as the ability to run commands with backticks. Most hosting providers will have this.
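If you have shell access to your host, a quick way to see whether the search tools are installed is sketched below. The helper name `missing_commands` is hypothetical. It only checks the shell's PATH; whether PHP is allowed to run commands with backticks is a separate host setting that this does not test.

```shell
#!/bin/bash
# missing_commands CMD...: print each named command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# The three tools the search feature relies on; no output means all are present.
missing_commands grep sed sort
```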
Why Would I Need This?
In many cases there is only one copy of a genealogy book. Perhaps it is a personal history or a book of remembrance. Physically, the book can only be in one place at one time. Thanks to the proliferation of scanners, you can easily turn that book into a series of images. The images can easily be shared with people you know. What about people you don’t know? Say, a common descendant who is also researching your great-great-great-great-grandpa? To let people like that find your scanned images, you need to get information about the scans into the search engines. Search engines can’t read images (yet!), so you need to provide text for the images. The easiest way to convert a typed document into text is with Optical Character Recognition software, such as Tesseract.
Search engines follow links when they are building their database, so you need to provide links between the different images and text. OCRShow is a very easy way to create those links and a very easy way to share your scanned documents. You just upload the documents and OCRShow and you’re done!
I am in the process of scanning several hundred pages of genealogy books so that they will be available digitally to the rest of my family. I decided that it would be good if Google could find them too so that other people who may be researching my ancestors will be able to find the information and hopefully we can help each other out.
What’s Your Process?
I am using Ubuntu 8.10 Linux and an Epson Stylus CX3810 all-in-one printer/scanner. The scanning software, xsane, automatically increments the number at the end of the file name, so I start the book at filename_0001.tif and just keep clicking ‘Scan’. I scan in straight black and white at 300 dpi, which keeps the file size down and makes it easier for the OCR software to figure out where the letters are.
Once the whole book or document is scanned, I run a short shell script:
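    #!/bin/bash
    # OCR each scan; tesseract writes page.tif.txt
    for i in *.tif; do tesseract "$i" "$i"; done
    # Convert each scan to PNG (page.tif.png)
    for i in *.tif; do convert "$i" "$i.png"; done
    # Make a 1000-pixel-high copy of each PNG with the small_ prefix
    for i in *.png; do convert -resize x1000 "$i" "small_$i"; done
    # Rename page.tif.txt to page.png.html so the text file matches the image name
    rename 's/\.tif\.txt$/.png.html/' *.txt
    # Drop the leftover .tif from the PNG names
    rename 's/\.tif\.png$/.png/' *.png
    rm *.tif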
Tesseract is an OCR engine developed over 10 years ago by HP, then donated to the open source world. It converts a tif file into a text file. It is not formatting aware, so columns, pedigree charts, etc. all disappear. In my case this is OK, because I am really just creating text so Google can find the images. I expect that human users will read the text on the images. The convert commands change the tifs into pngs, because most browsers can’t display tif files. I also create a smaller image so that the user can quickly view the page. Both png and tif are lossless, so we can delete the tif files at the end.
I know that software for this process exists for Windows and Macintosh systems too, but I am not familiar with what your options are. If you have recommendations, especially free recommendations, please let me know and I will post them here.
If the .htaccess file cannot be created by the script, create a file named .htaccess in the same directory as the index.php file with these contents:
    RewriteEngine on
    RewriteRule !\.(gif|jpg|png|tif|css|php)$ index.php
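If you have shell access, the same file can be written from the command line. This is a minimal sketch: the function name `write_htaccess` is hypothetical, and it leaves an existing .htaccess untouched rather than overwriting it.

```shell
#!/bin/bash
# write_htaccess DIR: create DIR/.htaccess with the two rewrite directives,
# unless a .htaccess file is already there.
write_htaccess() {
    target="${1:-.}/.htaccess"
    if [ -e "$target" ]; then
        echo "exists: $target"
        return 0
    fi
    # Quoted 'EOF' keeps the backslash and dollar sign literal.
    cat > "$target" <<'EOF'
RewriteEngine on
RewriteRule !\.(gif|jpg|png|tif|css|php)$ index.php
EOF
    echo "created: $target"
}
```

Run `write_htaccess /path/to/scans` in the directory that holds index.php, or with no argument to use the current directory.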