Creating OCR Files

From UA Libraries Digital Services Planning and Documentation
(Difference between revisions)
 
(11 intermediate revisions by 3 users not shown)
Line 1: Line 1:
*for information on OCR versions of transcripts, see [[Transcripts]].
+
*For information and script for creating OCR on the server during upload, see the pages on '''[[For_Creating_Derivatives | the scripts used]]''' and on '''[[OCR List | how to document which files need OCR]]''' [This process currently is not used.]
  
 +
*For information on OCR versions of transcripts, see '''[[Transcripts]].'''
  
'''OCR''' stands for Optical Character Recognition. This is the process by which typewritten or printed text is electronically translated into machine-editable text. We perform OCR on all images containing typewritten or printed text. OCR is a step in our digitization process.
+
-----
  
''The following is for creating OCR during the digitization process.  For information and script for creating OCR on the server (Linux), see [[For_Creating_Derivatives]].''
+
'''What is it?'''  
  
OCR files are saved within the collection directory within a folder called Transcripts. For large collections, coordinating Scans and OCR folders should be created within the Transcripts folder (for example, OCR text files from Scans_3 would be placed in a folder called OCR_3 within the Transcripts folder). OCR text files are saved in the following format: ex. u0003_0001577_0000233.ocr.txt
+
OCR stands for Optical Character Recognition. This is the process by which typewritten or printed text is electronically translated into machine-editable text. We perform OCR on all images containing typewritten or printed text. OCR is a step in our digitization process.
 +
 
 +
'''Where do these files go?'''
 +
 
 +
OCR files are saved in the collection directory, in a folder called Transcripts. (No sub-folders should be here; all ocr.txt files should be on the same level, including page files.) OCR text files are named in the following format, for example: u0003_0001577_0000233.ocr.txt.
 +
 
 +
'''OCR Process (Windows) 2017'''
 +
 
 +
''Note: This process can take quite a long time. Though it can run in the background on your computer while you do other work, it is recommended that, if available, you use a computer that is not in use to complete this process.''
 +
 
 +
# Open Adobe Acrobat XI Pro.
 +
# Click on Tools on the navigation bar, then click on Action Wizard from the right pop-out menu (or go to View/Tools/Action Wizard from the top drop down).
 +
# Choose OCR Batch Test.
 +
# Click on Add Folder next to the folder-plus icon (use the drop arrow to the right, then Add Files to process only one file).
 +
# Navigate to the directory where you have prepared images to be processed, then click OK.
 +
# Wait until you see the names of the files you are processing appear in the Files to be processed box above the Add Folder button. If you are processing a large number of files (100+), this may take a few minutes.
 +
# Click on the green Start button. 
 +
#* Again, with a large number of files, this may take quite a bit of time. Check back in a few hours.
 +
#* Images will appear in the viewer as they are processed. The red Stop button will be present while the program is running.
 +
#* When all files have been processed, the Start/Stop button turns into a non-clickable button that says Completed and green checkmarks appear next to the file names in the Files to be processed box.
 +
# Processed files will be in the same folder as your images.
 +
#* The script transcriptsmover.pl will put the ocr.txt files into the correct folder.
 +
#* Go to S:\Digital Projects\Administrative\scripts\transcripts\transcriptsmover.pl
 +
#* Double-click on the script and follow the prompts to choose your directory and move files.
 +
#* Your ocr.txt files should now be in a folder called Transcripts in your collection level directory.
 +
 
 +
 
 +
-----
 +
 
 +
''The following is for creating OCR '''during the digitization process.''''' This process is no longer used, which is why the text below has been rendered in gray.
 +
 
 +
<font color="grey">
  
 
'''OCR Process (Windows)'''
Line 26: Line 58:
 
# On the Mac, the file is not saved automatically, so once OCR is done, go to File>Save As>Text (plain). Save it to your local transcripts folder.
 
   Unlike Windows/Acrobat-9, which is capable of batch OCR, Mac/Acrobat-8 appears to support OCR for only one item at a time.
 +
</font>

Latest revision as of 11:19, 15 February 2017

  • For information on OCR versions of transcripts, see Transcripts.

What is it?

OCR stands for Optical Character Recognition. This is the process by which typewritten or printed text is electronically translated into machine-editable text. We perform OCR on all images containing typewritten or printed text. OCR is a step in our digitization process.

Where do these files go?

OCR files are saved in the collection directory, in a folder called Transcripts. (No sub-folders should be here; all ocr.txt files should be on the same level, including page files.) OCR text files are named in the following format, for example: u0003_0001577_0000233.ocr.txt.
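
The flat-folder rule can be checked with a short script. The following is only a minimal Python sketch, not a script we maintain: the function name check_transcripts_folder is hypothetical, and it assumes nothing beyond what is stated above (a folder called Transcripts, no sub-folders, file names ending in .ocr.txt).

```python
from pathlib import Path

OCR_SUFFIX = ".ocr.txt"  # file-name ending described on this page


def check_transcripts_folder(transcripts_dir):
    """Report sub-folders and nested OCR files in a Transcripts folder.

    Hypothetical helper: returns (subfolders, nested_ocr_files). Both lists
    are empty when the folder is flat, as this page says it should be.
    """
    root = Path(transcripts_dir)
    # Any directory directly inside Transcripts breaks the "no sub-folders" rule.
    subfolders = sorted(p.name for p in root.iterdir() if p.is_dir())
    # OCR files found below the top level should be moved up to Transcripts itself.
    nested = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*" + OCR_SUFFIX)
        if p.is_file() and p.parent != root
    )
    return subfolders, nested
```

If both returned lists are empty, the folder matches the layout described here; anything listed needs to be moved up a level or removed by hand.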

OCR Process (Windows) 2017

Note: This process can take quite a long time. Though it can run in the background on your computer while you do other work, it is recommended that, if available, you use a computer that is not in use to complete this process.

  1. Open Adobe Acrobat XI Pro.
  2. Click on Tools on the navigation bar, then click on Action Wizard from the right pop-out menu (or go to View/Tools/Action Wizard from the top drop down).
  3. Choose OCR Batch Test.
  4. Click on Add Folder next to the folder-plus icon (use the drop arrow to the right, then Add Files to process only one file).
  5. Navigate to the directory where you have prepared images to be processed, then click OK.
  6. Wait until you see the names of the files you are processing appear in the Files to be processed box above the Add Folder button. If you are processing a large number of files (100+), this may take a few minutes.
  7. Click on the green Start button.
    • Again, with a large number of files, this may take quite a bit of time. Check back in a few hours.
    • Images will appear in the viewer as they are processed. The red Stop button will be present while the program is running.
    • When all files have been processed, the Start/Stop button turns into a non-clickable button that says Completed and green checkmarks appear next to the file names in the Files to be processed box.
  8. Processed files will be in the same folder as your images.
    • The script transcriptsmover.pl will put the ocr.txt files into the correct folder.
    • Go to S:\Digital Projects\Administrative\scripts\transcripts\transcriptsmover.pl
    • Double-click on the script and follow the prompts to choose your directory and move files.
    • Your ocr.txt files should now be in a folder called Transcripts in your collection level directory.
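
For anyone without access to the shared drive, the move in step 8 can be approximated as follows. This is a minimal Python sketch and not the transcriptsmover.pl script itself, whose internals are not documented on this page. The function name move_ocr_files and the choice to skip files that already exist in Transcripts are assumptions.

```python
import shutil
from pathlib import Path


def move_ocr_files(images_dir, collection_dir):
    """Move every *.ocr.txt file from images_dir into <collection_dir>/Transcripts.

    Hypothetical stand-in for the move step; returns the names of files moved.
    """
    target = Path(collection_dir) / "Transcripts"
    target.mkdir(exist_ok=True)  # create Transcripts if it is not there yet
    moved = []
    for src in sorted(Path(images_dir).glob("*.ocr.txt")):
        dest = target / src.name
        if dest.exists():
            # Assumption: never overwrite an existing file; leave it for review.
            continue
        shutil.move(str(src), str(dest))
        moved.append(dest.name)
    return moved
```

Only files ending in .ocr.txt are moved, so the images stay where they were, and all moved files land directly in Transcripts with no sub-folders.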



The following is for creating OCR during the digitization process. This process is no longer used, which is why the text below has been rendered in gray.

OCR Process (Windows)

  1. Open Adobe Acrobat 9 Pro.
  2. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text in Multiple Files Using OCR.
  3. On the box that opens, click the Add Files button and navigate to the file you want. When files are selected, click OK.
  4. The Output Options box will now open. Choose these settings:
    • Under Target Folder, choose Specific Folder and navigate to the folder you have prepared for output.
    • Under Filenaming, choose Add to Original Filename. Under Insert After, type in ".ocr". Uncheck Overwrite Existing Files.
    • Under Output Format, choose Export File(s) to Alternate Format and select Text (Plain) from the drop down menu.
  5. Click OK. Large files may take quite a bit of time.
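
The file naming settings in step 4 produce the .ocr.txt names shown earlier on this page. As a minimal Python sketch of that naming rule (the function name ocr_output_name and the .tif extension in the example are assumptions, not taken from this page):

```python
from pathlib import Path


def ocr_output_name(image_name):
    """Return the OCR text file name for an image file name.

    Mirrors the settings above: ".ocr" is inserted after the original
    file name, and the plain-text export supplies the ".txt" extension.
    """
    stem = Path(image_name).stem  # file name without its final extension
    return f"{stem}.ocr.txt"
```

For example, ocr_output_name("u0003_0001577_0000233.tif") gives "u0003_0001577_0000233.ocr.txt".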

OCR Process (Mac)

  1. Open Adobe Acrobat 8 Pro.
  2. Drag file into Adobe Acrobat 8 Pro.
  3. Choose the Document tab across the top. From the drop-down, hover over OCR Text Recognition, and then choose Recognize Text Using OCR.
  4. On the Mac, the file is not saved automatically, so once OCR is done, go to File>Save As>Text (plain). Save it to your local transcripts folder.
  Unlike Windows/Acrobat-9, which is capable of batch OCR, Mac/Acrobat-8 appears to support OCR for only one item at a time.
