Digitization: OCR Process for Town Reports
Purpose: To prepare searchable text for Maine town reports being scanned for online access as part of the Kirtas Digitization Project.
Opening ABBYY and Setup
- Open the Abbyy FineReader OCR application at one of the designated Kirtas workstations located in the lab, Technical Services, or Special Collections.
- From the toolbar, select “Details Batch View.”
- Note: This view and other options from the setup screen are saved from session to session; if these options have already been selected once, you do not need to select them every time you open Abbyy.
- The setup screen should have the following items checked. To access setup, right-click the gray bar under “Batch.”
- Page number
- Read
- Uncertain characters
- Spelling checked
- Save
- Error warning
- Source image (full path)
General Steps for OCR Processing
Step 1. Open the menu under the “Scan and read” icon and select “Open and read.”
Step 2. Navigate to the folder where the files to be OCRed are located. Highlight all files to be opened. (For the Town Reports project, this will be all of the TIFF files in the “output” folder for a single volume. For a thesis, this will be one PDF.) Select “Open.”
NOTE: Depending on the type of file and number of pages to be opened, this step can take anywhere from 5 minutes to 1 hour to complete. It may be advisable to work on another task while waiting for the files to be opened and read.
Step 3. After all pages have been opened and read, check that the uncertainty level for all pages is below 10%. The easiest way to do this is to click on the “Uncertainty level” heading at the top of the “Batch” section. Clicking on this heading will sort all of the pages from greatest to least character uncertainty. If the first page has an uncertainty level under 10%, then all pages in the batch have an uncertainty level under 10%, and you are finished with this step. If the top page is at 10% or above, then this page and all other pages at that level must be edited and spell checked until their uncertainty level is under 10%. For more detail on this step, see the “Common Problems” section, below.
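The sorting shortcut in Step 3 works because, after a greatest-to-least sort, the first page carries the highest uncertainty in the batch. The following is a minimal Python sketch of that logic only. It assumes the uncertainty levels have been copied by hand from the “Batch” section into a dictionary; it does not script Abbyy, and the function name, variable names, and sample values are hypothetical.

```python
THRESHOLD = 10.0  # percent; pages at or above this level need editing (Step 3)


def pages_needing_review(uncertainty_by_page, threshold=THRESHOLD):
    """Return (page, level) pairs at or above the threshold, worst first."""
    # Sort from greatest to least uncertainty, as clicking the
    # "Uncertainty level" heading does in the Batch section.
    ranked = sorted(
        uncertainty_by_page.items(), key=lambda item: item[1], reverse=True
    )
    # If the top page is under the threshold, every page is, so stop here.
    if not ranked or ranked[0][1] < threshold:
        return []
    return [(page, level) for page, level in ranked if level >= threshold]


# Hypothetical values for illustration only: page number -> uncertainty (%)
sample = {1: 14.0, 2: 3.5, 3: 10.0, 4: 9.9}
print(pages_needing_review(sample))
```

With the sample values, pages 1 and 3 are reported, because a page at exactly 10% still counts as needing editing.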
Step 4. After all pages have an uncertainty level below 10%, return the pages to numerical order by clicking on “Page number.” (Note: If the pages are not returned to numerical order before the file is saved, the pages will stay out of order in the resulting PDF.) Run a complete spell check on the title page(s) by selecting the page, then selecting the “Check spelling” icon and checking all words on that page. When you have finished the spell check, a check mark should appear next to that page number in the “Batch” section of the screen. If there is an index or table of contents, run a complete spell check on those pages as well.
Step 5. Save the pages as a PDF by highlighting all the pages in one volume, then selecting “Save Wizard” from the drop-down menu on the “Save” icon. Select “Save Pages,” then select “OK.” Choose the location where the file will be saved. (Town Reports documents should be saved to the desktop.) Select “PDF document” from the drop-down list and name the file according to the town and year, with an underscore between them (e.g., Augusta_1884). If the town name consists of two or more words (e.g., “Deer Isle” or “Fort Kent”), remove the space between the words (e.g., “Deer Isle” becomes “DeerIsle”).
- Note: The following settings should be selected under the “Save Wizard,” “Formats Settings,” “PDF” tab. These settings are saved from session to session and do not need to be reset every time a file is saved.
- Keep original image size (checked)
- Save Mode: Text under the page image
- Enable tagged PDF (checked)
- Quality: High (for printing)
- Format: Automatic
- Font: Use standard fonts.
Note: PDFs over 15 MB must be segmented into smaller files either by using the PDF segmenting tool or by exporting smaller ranges of pages from Abbyy. However, the Town Reports PDFs have typically been well under this size, since each year is saved individually. Town reports scanned in grayscale and under 200 pages may be assumed to be small enough not to require segmentation.
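As a quick way to check a saved PDF against the 15 MB limit, the sketch below compares a file size to the threshold. It is an illustration, not part of the Abbyy workflow: the function names are hypothetical, and it assumes 1 MB means 1,048,576 bytes, which may differ slightly from the size a file browser displays.

```python
import os

# Assumption: 1 MB = 1,048,576 bytes. Adjust if the limit is defined differently.
MAX_PDF_BYTES = 15 * 1024 * 1024


def needs_segmenting(size_in_bytes, limit=MAX_PDF_BYTES):
    """Return True if a PDF of this size is over the limit and must be split."""
    return size_in_bytes > limit


def pdf_needs_segmenting(path):
    """Check a saved PDF on disk against the size limit."""
    return needs_segmenting(os.path.getsize(path))


# Hypothetical size for illustration: a 20 MB file is over the limit.
print(needs_segmenting(20 * 1024 * 1024))
```

To check a real file, pass its full path to `pdf_needs_segmenting`.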
Step 5a. Town Reports documents also need to have a separate .txt file exported from Abbyy. To create this file, highlight the pages to be saved, select “Save Wizard,” then “Save Pages,” then “OK.” Save the file to the desktop using the same naming convention, but select “Text document” from the drop-down menu.
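The naming convention used in Steps 5 and 5a can be expressed as a small Python sketch, which may help when checking a batch of file names for consistency. The function names are hypothetical, and the lowercase “.pdf” and “.txt” extensions are an assumption about what the exported files end with.

```python
def report_basename(town, year):
    """Build the base file name from the town name and year."""
    # Remove the spaces so multi-word town names run together,
    # then join town and year with an underscore.
    return "".join(town.split()) + "_" + str(year)


def output_filenames(town, year):
    """Return the expected PDF and text file names for one volume."""
    base = report_basename(town, year)
    # Assumption: exported files carry lowercase .pdf and .txt extensions.
    return base + ".pdf", base + ".txt"


print(report_basename("Augusta", 1884))
print(output_filenames("Deer Isle", 1884))
```

The first call reproduces the Augusta_1884 example from Step 5; the second shows a two-word town name with the space removed.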
Common Problems that will need to be addressed using Step 3: Spellcheck
- Wrongly rotated image – occasionally Abbyy will interpret the file in such a way as to rotate the page image. [Insert image samples of incorrectly vs. correctly rotated pages]
- Unrotate the image. This will automatically clear all markup boxes.
- Draw appropriate boxes manually [Specs on using draw tool?]
- Re-read the document
- Gothic text – Abbyy is unable to interpret the Gothic font often found on cover pages and used for major headings such as titles.
- Use Check Spelling mode to edit manually
- Clamp mark obscuring text – text of the document is blocked by the clamp and unreadable by the software
- Use Check Spelling mode to manually correct the text as best you can make it out from the image
- Offset columns of text lines. [insert image example here?]
- Use Check Spelling mode to draw new boxes
Tips for Working with Spell Check Mode
Confirming word by word vs. confirming whole phrases: confirming whole phrases is a more efficient use of time, but keep an eye out for stray characters at the beginnings and ends of words. Stray marks, quotes, etc. are sometimes interpreted as letters. Once they are removed, the software can properly interpret the word.
Contact: um.library.technical.services@maine.edu