Creating OCR Files
- For information on OCR versions of transcripts, see Transcripts.
OCR stands for Optical Character Recognition, the process by which typewritten or printed text in an image is electronically converted into machine-editable text. We perform OCR on all images containing typewritten or printed text, as a step in our Quality Control process. Scanning technicians indicate which objects are good candidates for OCR on the Tracking Files sheet for each collection.
The following steps are for creating OCR during the digitization process. For information and a script for creating OCR on the server (Linux), see For_Creating_Derivatives.
OCR files are saved in a folder called Transcripts within the collection directory. For large collections, OCR folders corresponding to the Scans folders should be created within the Transcripts folder (for example, OCR text files from Scans_3 would be placed in a folder called OCR_3 within the Transcripts folder). OCR text files are named in the following format, for example: u0003_0001577_0000233.ocr.txt
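The naming and folder rules above can be sketched as a small Python helper. This is only an illustration: the function name ocr_output_path, the .tif scan extension, and the collection directory name used in the usage note are assumptions, not part of the documented workflow.

```python
from pathlib import Path
from typing import Optional


def ocr_output_path(collection_dir: str, scan_name: str,
                    scans_folder: Optional[str] = None) -> Path:
    """Return the expected location of the OCR text file for one scan.

    Hypothetical helper: it only encodes the convention described above.
    """
    target = Path(collection_dir) / "Transcripts"
    if scans_folder is not None:
        # Large collections: Scans_3 maps to Transcripts/OCR_3.
        prefix, _, number = scans_folder.partition("_")
        if prefix != "Scans" or not number:
            raise ValueError(f"expected a folder like Scans_3, got {scans_folder!r}")
        target = target / f"OCR_{number}"
    # Drop the image extension and append ".ocr.txt" to the original name.
    return target / f"{Path(scan_name).stem}.ocr.txt"
```

For instance, calling ocr_output_path("u0003_0001577", "u0003_0001577_0000233.tif", "Scans_3") would give the path u0003_0001577/Transcripts/OCR_3/u0003_0001577_0000233.ocr.txt, assuming the collection directory is named that way.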
- Open Adobe Acrobat 9 Pro.
- Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text in Multiple Files Using OCR.
- In the box that opens, click the Add Files button and navigate to the files you want. When the files are selected, click OK.
- The Output Options box will now open. Choose these settings:
  - Under Target Folder, choose Specific Folder and navigate to the folder you have prepared for output.
  - Under Filenaming, choose Add to Original Filename. Under Insert After, type in ".ocr". Uncheck Overwrite Existing Files.
  - Under Output Format, choose Export File(s) to Alternate Format and select Text (Plain) from the drop-down menu.
- Click OK. Large files may take a long time to process.
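Once the batch has finished, it can be useful to confirm that every file sent through Acrobat produced a text file in the output folder. The following Python sketch is one possible check, not part of the documented procedure: the function name find_missing_ocr is hypothetical, and treating an empty text file as a problem is an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_ocr(scan_names: Iterable[str], output_dir: str) -> List[str]:
    """List the scans whose .ocr.txt file is absent or empty in output_dir.

    scan_names are the file names that were added in the Add Files step.
    """
    out = Path(output_dir)
    problems = []
    for name in scan_names:
        # Expected name: original file name without its extension, plus ".ocr.txt".
        expected = out / f"{Path(name).stem}.ocr.txt"
        if not expected.is_file() or expected.stat().st_size == 0:
            problems.append(name)
    return problems
```

Any names returned would be candidates for running through the steps above again.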