Creating OCR Files

From UA Libraries Digital Services Planning and Documentation

  • For information and a script for creating OCR on the server during upload, see the pages on the scripts used (For_Creating_Derivatives) and on how to document which files need OCR (OCR List).

Latest revision as of 08:11, 29 October 2012

  • For information on OCR versions of transcripts, see Transcripts.

The following is for creating OCR during the digitization process. This process is no longer used, which is why the text below has been rendered in gray.

What is it?

OCR stands for Optical Character Recognition. This is the process by which typewritten or printed text is electronically translated into machine-editable text. We perform OCR on all images containing typewritten or printed text. OCR is a step in our digitization process.

Where do these files go? [this information is out of date]

OCR files are saved in a folder called Transcripts inside the collection directory. For large collections, create an OCR folder within the Transcripts folder for each Scans folder (for example, OCR text files from Scans_3 would be placed in a folder called OCR_3 within the Transcripts folder). OCR text files are named in the following format, for example: u0003_0001577_0000233.ocr.txt
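
The folder and naming convention above can be sketched as a small path helper. This is only an illustration: the function name `ocr_path`, the `.tif` scan extension, and the assumption that scans sit directly inside a `Scans` or `Scans_N` folder are hypothetical and not taken from this page.

```python
from pathlib import Path


def ocr_path(collection_dir, scan_path):
    """Return where the OCR text for one scan would be saved.

    Hypothetical helper: assumes the scan sits directly inside a
    'Scans' or 'Scans_N' folder of the collection directory.
    """
    scan = Path(scan_path)
    out_dir = Path(collection_dir) / "Transcripts"
    parent = scan.parent.name
    # Large collections: Scans_3 maps to Transcripts/OCR_3.
    if parent.startswith("Scans_"):
        out_dir = out_dir / ("OCR_" + parent[len("Scans_"):])
    # Replace the scan's extension with '.ocr.txt'.
    return out_dir / (scan.stem + ".ocr.txt")


example = ocr_path(
    "u0003_0001577",
    "u0003_0001577/Scans_3/u0003_0001577_0000233.tif",
)
print(example.as_posix())
# u0003_0001577/Transcripts/OCR_3/u0003_0001577_0000233.ocr.txt
```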

OCR Process (Windows)

  1. Open Adobe Acrobat 9 Pro.
  2. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text in Multiple Files Using OCR.
  3. In the dialog box that opens, click the Add Files button and navigate to the files you want. When the files are selected, click OK.
  4. The Output Options box will now open. Choose these settings:
    • Under Target Folder, choose Specific Folder and navigate to the folder you have prepared for output.
    • Under Filenaming, choose Add to Original Filename. Under Insert After, type in ".ocr". Uncheck Overwrite Existing Files.
    • Under Output Format, choose Export File(s) to Alternate Format and select Text (Plain) from the drop-down menu.
  5. Click OK. Large files may take quite a bit of time.
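
To check ahead of time what the output settings above should produce, the naming rule (insert ".ocr" after the original filename, export as plain text, do not overwrite) can be sketched as follows. This does not call Acrobat in any way. `plan_outputs` is a hypothetical helper, and skipping existing files is this sketch's own choice, since the page does not say how Acrobat itself handles a name collision.

```python
from pathlib import Path


def plan_outputs(input_files, target_folder, insert_after=".ocr"):
    """List the text files a batch run is expected to create.

    Hypothetical helper: returns (planned, skipped), where skipped holds
    outputs that already exist and would not be overwritten.
    """
    target = Path(target_folder)
    planned, skipped = [], []
    for name in input_files:
        src = Path(name)
        # Original filename + ".ocr" + plain-text extension.
        out = target / f"{src.stem}{insert_after}.txt"
        if out.exists():
            skipped.append(out)
        else:
            planned.append((src, out))
    return planned, skipped
```

Comparing the planned list with the contents of the target folder after the batch finishes is one way to spot files that Acrobat did not process.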

OCR Process (Mac)

  1. Open Adobe Acrobat 8 Pro.
  2. Drag file into Adobe Acrobat 8 Pro.
  3. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text Using OCR.
  4. On a Mac, the file is not saved automatically. Once OCR is done, go to File>Save As>Text (plain) and save to your local transcripts folder.
  Unlike Acrobat 9 on Windows, which is capable of batch OCR, Acrobat 8 on Mac appears to support OCR for only one item at a time.
