Creating OCR Files

From UA Libraries Digital Services Planning and Documentation
Latest revision as of 09:11, 29 October 2012

  • For information on, and the script for, creating OCR on the server during upload, see For_Creating_Derivatives (the scripts used) and OCR List (how to document which files need OCR).
  • For information on OCR versions of transcripts, see Transcripts.

The following is for creating OCR during the digitization process. This process is no longer used, which is why the text below has been rendered in gray.

What is it?

OCR stands for Optical Character Recognition. This is the process by which typewritten or printed text is electronically translated into machine-editable text. We perform OCR on all images containing typewritten or printed text. OCR is a step in our digitization process.

Where do these files go? [this information is out of date]

OCR files are saved in a folder called Transcripts inside the collection directory. For large collections, matching Scans and OCR folders should be created, with the OCR folders placed within the Transcripts folder (for example, OCR text files from Scans_3 would be placed in a folder called OCR_3 within the Transcripts folder). OCR text files are named in the following format, for example: u0003_0001577_0000233.ocr.txt
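
The folder and naming convention above can be sketched in a few lines of Python. This is only an illustration, not a script used by Digital Services: the function name ocr_path_for, the collection folder name my_collection, and the .tif extension are assumptions made for the example, and a Scans folder without a number is assumed to map directly to the Transcripts folder.

```python
from pathlib import PurePosixPath


def ocr_path_for(scan_path: str) -> PurePosixPath:
    """Map a scan file to the OCR text path described above.

    Assumes the scan sits in <collection>/Scans or <collection>/Scans_<n>.
    """
    scan = PurePosixPath(scan_path)
    scans_dir = scan.parent
    collection = scans_dir.parent
    transcripts = collection / "Transcripts"

    # Scans_3 -> OCR_3; a plain "Scans" folder maps straight to Transcripts
    # (an assumption, the page only describes the numbered case).
    prefix, _, suffix = scans_dir.name.partition("_")
    if prefix != "Scans":
        raise ValueError(f"not a Scans folder: {scans_dir.name}")
    target_dir = transcripts / f"OCR_{suffix}" if suffix else transcripts

    # u0003_0001577_0000233.tif -> u0003_0001577_0000233.ocr.txt
    return target_dir / f"{scan.stem}.ocr.txt"


print(ocr_path_for("my_collection/Scans_3/u0003_0001577_0000233.tif"))
```

Since this placement information is marked as out of date, the sketch documents the old convention only.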

OCR Process (Windows)

  1. Open Adobe Acrobat 9 Pro.
  2. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text in Multiple Files Using OCR.
  3. In the box that opens, click the Add Files button and navigate to the files you want. When the files are selected, click OK.
  4. The Output Options box will now open. Choose these settings:
    • Under Target Folder, choose Specific Folder and navigate to the folder you have prepared for output.
    • Under Filenaming, choose Add to Original Filename. Under Insert After, type in ".ocr". Uncheck Overwrite Existing Files.
    • Under Output Format, choose Export File(s) to Alternate Format and select Text (Plain) from the drop-down menu.
  5. Click OK. Large files may take quite a bit of time.
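
Because a batch run can take a long time, it may help to confirm afterwards that every input file produced a text file. The following is a minimal sketch, not part of the documented workflow; it assumes the output name is the original file name without its extension, followed by ".ocr.txt", as in the naming example on this page. The function names and the sample file names are hypothetical.

```python
from pathlib import Path


def missing_ocr(scan_names, ocr_names):
    """Return scan file names that have no matching <stem>.ocr.txt."""
    done = set(ocr_names)
    return sorted(
        name for name in scan_names
        if f"{Path(name).stem}.ocr.txt" not in done
    )


def missing_ocr_in_dirs(scans_dir, output_dir):
    """Same check, reading file names from two folders on disk."""
    scans = [p.name for p in Path(scans_dir).iterdir() if p.is_file()]
    outputs = [p.name for p in Path(output_dir).iterdir() if p.is_file()]
    return missing_ocr(scans, outputs)


# Example with in-memory file names (hypothetical):
print(missing_ocr(
    ["u0003_0001577_0000233.tif", "u0003_0001577_0000234.tif"],
    ["u0003_0001577_0000233.ocr.txt"],
))
```

Any file names the check reports can be added again in step 3 and re-run.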

OCR Process (Mac)

  1. Open Adobe Acrobat 8 Pro.
  2. Drag the file into Adobe Acrobat 8 Pro.
  3. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text Using OCR.
  4. On the Mac, the file is not saved automatically, so once OCR is done, go to File > Save As > Text (plain). Save to your local transcripts folder.
  Unlike Windows/Acrobat 9, which is capable of batch OCR, Mac/Acrobat 8 appears to support OCR for only one item at a time.
