OCR List

From UA Libraries Digital Services Planning and Documentation

Revision as of 09:48, 26 October 2012

When do you need an OCR List?

You need an OCR List when any of the content in the current upload is typewritten (as opposed to handwritten).

When you can do without creating an OCR List:

  • If NONE of the items are typewritten, you can tell the makeJpegs script that there are no items to OCR. (When it does not find the ocrList.txt file, it will ask whether you want to OCR anything. Answer no.)
  • If ALL of the items are typewritten, you can ask the makeJpegs script to OCR everything. (When it does not find the ocrList.txt file, it will ask whether you want to OCR anything. Answer yes, then choose the option to OCR the entire content of the upload.)

How to create an OCR List

  1. Open the collection's TrackingFiles log file.
  2. Copy the rows for the files you are uploading to a new spreadsheet.
  3. In the new spreadsheet, delete all columns EXCEPT: Filename and OCR?
    • This results in a two-column list with only one filename per line, which is what the script is looking for.
    • OCR? should be filled in with 1 or 0 for each item (1 = yes to OCR, meaning more than half of the item is typewritten; 0 = no to OCR).
    • If the OCR? column isn't filled in, take the time to look over the items in Bridge and fill it in.
  4. Save the new spreadsheet as a tab-delimited file called [collection number].ocrList.txt
    • Example: u0003_0001577.ocrList.txt
  5. Put the file in the collection's Admin folder (in the Digital_Coll_Completed directory, of course!)
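
If you would rather generate or check the list with a script than by hand in a spreadsheet, the steps above can be sketched in Python roughly as follows. This is only an illustration, not part of the makeJpegs script: the function names and the item filenames in the usage note are made up, and it assumes the file has no header row (the instructions above only say "one filename per line").

```python
import csv
import io


def build_ocr_list(rows):
    """Return tab-delimited text: one 'filename<TAB>flag' line per item.

    rows is an iterable of (filename, flag) pairs taken from the
    Filename and OCR? columns. Assumption: no header row is written.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for filename, flag in rows:
        flag = str(flag).strip()
        # OCR? must be filled in with 1 (OCR it) or 0 (do not OCR it).
        if flag not in ("0", "1"):
            raise ValueError(
                f"OCR? must be 1 or 0 for {filename!r}, got {flag!r}"
            )
        writer.writerow([filename.strip(), flag])
    return out.getvalue()


def ocr_list_name(collection_number):
    """Build the file name, e.g. u0003_0001577.ocrList.txt."""
    return f"{collection_number}.ocrList.txt"
```

For example, `build_ocr_list([("item_a", 1), ("item_b", 0)])` gives two lines, each a filename and a flag separated by a tab; the text would then be saved under the name returned by `ocr_list_name("u0003_0001577")` in the collection's Admin folder. A blank OCR? cell raises an error, which is the cue to look the item over in Bridge.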

What the makeJpegs script does with an OCR List

If any items/files in that list are already online, the script will look for the TIFFs in the archive and for existing text files in Acumen. If it finds the former and not the latter, it will OCR the TIFF and place the resulting OCR in Acumen. For content in the ocrList that is currently being uploaded, the OCR will be placed in the UploadArea/ocr directory. For content not found, the output file will list the unlocated TIFFs.
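
The handling described above can be pictured as a small decision function. This is a sketch for understanding only, not the makeJpegs code: the function name, the returned labels, and the order in which the checks are made are assumptions, and the three sets stand in for whatever lookups the real script performs.

```python
def plan_ocr(item, uploading, in_archive, has_text):
    """Say what would happen to one item listed for OCR.

    uploading  -- items in the current upload
    in_archive -- items whose TIFFs were found in the archive
    has_text   -- items that already have a text file in Acumen
    (All three are hypothetical stand-ins for the script's lookups.)
    """
    if item in uploading:
        # Content being uploaded now: OCR goes to UploadArea/ocr.
        return "ocr_to_upload_area"
    if item in in_archive:
        if item in has_text:
            # Already online with text in Acumen: nothing to do.
            return "skip"
        # TIFF found but no text yet: OCR it and place it in Acumen.
        return "ocr_to_acumen"
    # TIFF not located: it is listed in the output file.
    return "report_missing"
```

Reading it top to bottom: items in the current upload are handled first, items already online are OCRed only when the TIFF exists and no text file does, and anything else ends up in the list of unlocated TIFFs.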
