Cataloging: OCR for Theses
Purpose: To prepare searchable text for University of Maine theses and dissertations scanned for online access as part of the Electronic Theses & Dissertations database. After the thesis has been OCRed, the next step is to add it to Digital Commons.
- Open ABBYY. Go to “Open PDF File/Images” and navigate to the location of the PDF file.
- In the Open File dialog box, make sure that the “Enable image processing” box is checked.
- Click on the PDF to open and begin reading it in ABBYY. Wait for ABBYY to finish opening and reading all of the pages.
- Run a spell check on the title page and abstract pages until those pages have a 0% character uncertainty level. Fix “special characters” (e.g., Greek letters, mathematical symbols, accented letters) in the title page and abstract, if possible. (Note that if the entire thesis is in a foreign language, such as French, the spell check language should be changed to that language.)
- Read the OCRed text of the title page to check for spelling errors that ABBYY may have missed. (This is rare, but it’s important that the title page be free of typos.) If the title or abstract has any “special characters,” read through the OCRed text of the abstract to double-check that ABBYY interpreted the special characters properly.
- For the rest of the pages in the thesis, reduce the character uncertainty level to 5% or lower by taking the following steps:
  - If there is a text or table block drawn around content that is non-OCRable (e.g., maps, graphs, images), delete the entire block. (This also holds true for text blocks that consist entirely of mathematical equations or computer code, since it is often impossible or too time-consuming to enter the correct symbols.)
  - If there is a text block that includes both readable text and non-OCRable content, redraw the block to exclude the non-OCRable section and reread the block. (With the block selected, Ctrl-Shift-B rereads that single block. Alternatively, Ctrl-R rereads the entire page.)
  - If all the text blocks are drawn around readable text, run a spell check and make needed changes until the character uncertainty level is at 5% or lower.
- When you have completed the spell check, save the pages as a PDF file with the same name as the original plus a suffix such as “-OCR”, so that the original file is kept as a short-term backup. The standard naming convention is last name, first initial, thesis date. So, a thesis written by John Smith in 2012 would be named SmithJ2012.pdf. (In case of a naming conflict, e.g., another thesis written by Jane Smith in 2012, the second thesis would be named SmithJ2012a.pdf.)
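
The naming convention in the last step can be sketched as a small Python helper. This is only an illustration and not part of the ABBYY workflow: the function name `thesis_filename` and its parameters are hypothetical, placing the “-OCR” suffix before the file extension is an assumption, and continuing past “a” to “b”, “c”, and so on for further conflicts is also an assumption, since the procedure only describes the first conflict.

```python
import string


def thesis_filename(last_name, first_name, year, existing=(), suffix=""):
    """Build a file name such as SmithJ2012.pdf (hypothetical helper).

    existing: file names already in use, checked for naming conflicts.
    suffix:   optional marker such as "-OCR", placed before ".pdf"
              (assumed placement; the procedure does not specify it).
    """
    # Last name, first initial, thesis date.
    base = f"{last_name}{first_name[0].upper()}{year}"
    taken = {name.lower() for name in existing}

    # Try the plain name first, then append a, b, c, ... on conflict.
    # Only the "a" case is documented; later letters are an assumption.
    candidates = [base] + [base + letter for letter in string.ascii_lowercase]
    for candidate in candidates:
        if f"{candidate}.pdf".lower() not in taken:
            return f"{candidate}{suffix}.pdf"
    raise ValueError(f"No free file name left for {base}")


print(thesis_filename("Smith", "John", 2012))
print(thesis_filename("Smith", "Jane", 2012, existing=["SmithJ2012.pdf"]))
print(thesis_filename("Smith", "John", 2012, suffix="-OCR"))
```

The first call follows the John Smith example, the second the Jane Smith conflict case, and the third shows one possible way to name the OCRed copy while the original is kept as a backup.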