OCR List

From UA Libraries Digital Services Planning and Documentation

Latest revision as of 08:54, 6 June 2013

NOTE: These instructions apply only to legacy collections which use the TrackingFiles model of log data, rather than integrating log data with the metadata spreadsheet until upload. For most collections, see Tracking Data for marking items as needing OCR.


When do you need an OCR List?

  • When any of the content in the current upload is typewritten (versus handwritten).

When can you do without creating an OCR List?

  • If NONE of the items are typewritten, you can tell the makeJpegs script that there are no items to OCR. (Upon not finding the ocrList.txt file, it will ask you if you want to OCR anything. You will tell it no.)
  • If ALL of the items are typewritten, you can ask the makeJpegs script to OCR everything. (Upon not finding the ocrList.txt file, it will ask you if you want to OCR anything. You will tell it yes, then choose the option to OCR the entire content of the upload.)
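The two sections above amount to a three-way decision based on the OCR? values for the upload. A minimal sketch, assuming the flags have already been read from the spreadsheet as 1/0 values; the function name and the returned labels are hypothetical and are not part of the makeJpegs script:

```python
def ocr_list_decision(flags):
    """Decide whether an OCR List is needed for one upload.

    `flags` holds the OCR? value (1 or 0) for each item in the upload.
    The returned labels are illustrative names only.
    """
    flags = [int(flag) for flag in flags]
    if not any(flags):
        # Nothing is typewritten: skip the list and answer "no" at the prompt.
        return "no_list_ocr_nothing"
    if all(flags):
        # Everything is typewritten: skip the list and ask to OCR everything.
        return "no_list_ocr_everything"
    # Mixed content: the script needs a list to know which items to OCR.
    return "create_ocr_list"
```

A list is therefore only required when the upload mixes typewritten and handwritten items.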

How to create an OCR List

  1. Open the collection's metadata spreadsheet (before we've exported our tracking data)
  2. COPY (DO NOT CUT) the columns Filename and OCR?, and PASTE to a new sheet
    • The OCR? column should be filled in with 1 or 0 for each item (1 = yes to OCR, meaning more than half of the item is typewritten; 0 = no to OCR)
    • If it isn't, take the time to look over the items and fill in that column
  3. Delete the header row
  4. This will result in a two-column list, with no header row and only one filename per line -- which is what the script requires
  5. Save the new sheet as a tab-delimited file
    • Format: [collnum].ocrList.txt
    • Example: u0003_0001823.ocrList.txt
  6. Put it in the collection's Admin folder (in the Digital_Coll_Completed directory, of course!)
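For anyone who would rather script the export than do it by hand, the steps above can be sketched as follows. This is only an illustration, assuming the Filename and OCR? values have already been pulled out of the spreadsheet as pairs; the function name `write_ocr_list` and its parameters are hypothetical, and the sketch is not part of the makeJpegs script.

```python
import csv
from pathlib import Path


def write_ocr_list(rows, collnum, admin_dir):
    """Write (filename, flag) pairs as a headerless, tab-delimited OCR list.

    rows      -- iterable of (filename, OCR? value) pairs; the value must be 1 or 0
    collnum   -- collection number, e.g. "u0003_0001823"
    admin_dir -- the collection's Admin folder (assumed to exist already)
    """
    cleaned = []
    for filename, flag in rows:
        flag = str(flag).strip()
        if flag not in ("0", "1"):
            raise ValueError(f"OCR? value for {filename!r} must be 0 or 1, got {flag!r}")
        cleaned.append((str(filename).strip(), flag))

    # Naming convention from step 5: [collnum].ocrList.txt
    out_path = Path(admin_dir) / f"{collnum}.ocrList.txt"
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # No header row: one filename and one flag per line.
        writer.writerows(cleaned)
    return out_path
```

All rows are validated before anything is written, so a blank or mistyped OCR? value stops the export instead of leaving a partial list in the Admin folder.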

OCR List for batches

Follow the steps above, except...

2. Limit the data to just the batch; here are a couple of different ways to do that

  • in the original spreadsheet, select and copy only the parts of the columns that pertain to the batch
  • in the new sheet, delete the parts of the columns that aren't part of the batch

5. Add the batch number to the filename

  • Format: [collnum].[batchnum].ocrList.txt
  • Example: u0003_0001577.32.ocrList.txt
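The two naming formats (whole collection and single batch) can be captured in one small helper. A minimal sketch; the function name `ocr_list_filename` is hypothetical and is shown only to make the convention explicit:

```python
def ocr_list_filename(collnum, batchnum=None):
    """Build the OCR List filename for a collection, or for one batch of it."""
    if batchnum is None:
        # Whole collection: [collnum].ocrList.txt
        return f"{collnum}.ocrList.txt"
    # Single batch: [collnum].[batchnum].ocrList.txt
    return f"{collnum}.{batchnum}.ocrList.txt"
```

Calling `ocr_list_filename("u0003_0001577", 32)` reproduces the batch example above, and omitting the batch number gives the whole-collection form.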

What the makeJpegs script does with an OCR List

If any items/files in that list are already online, the script will look for the TIFFs in the archive and for existing text files in Acumen; if it finds the former and not the latter, it will OCR the TIFF and place the OCR in Acumen. For the content in the ocrList which is currently being uploaded, the OCR will be placed in the UploadArea/ocr directory. For content not found, there will be a list of the unlocated TIFFs in the output file.
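The behavior described above can be read as a small decision table. The sketch below is not the makeJpegs code; the function name, its boolean parameters, and the returned labels are all hypothetical. The order of the checks is an assumption, and so is the "skip" outcome for items that already have text in Acumen, which the description implies but does not state.

```python
def plan_ocr(in_current_upload, tiff_in_archive, text_in_acumen):
    """Return an illustrative label for what happens to one listed item."""
    if in_current_upload:
        # Content being uploaded now: OCR goes to the UploadArea/ocr directory.
        return "ocr_to_upload_area"
    if not tiff_in_archive:
        # No TIFF could be located: the item is reported in the output file.
        return "report_unlocated"
    if text_in_acumen:
        # Assumption: existing text is left alone rather than regenerated.
        return "skip"
    # TIFF found in the archive and no text yet: OCR it and place it in Acumen.
    return "ocr_to_acumen"
```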
