Preparing Collections on the S Drive for Online Delivery and Storage

From UA Libraries Digital Services Planning and Documentation
(Difference between revisions)
(Metadata)
(Admin)
(47 intermediate revisions by 4 users not shown)
Line 1: Line 1:
==Things to do before marking a folder as "Ready" or "Store"==
+
This page assumes the content for upload has already been through the [[Quality_Control | Quality Control process]].
  
The collection number u0003_0000001 will be used as an example.
+
==Preparation Procedure==
  
===Check Subfolders of Collection Level Folder===
+
Choose one of the following checklists:
The Collection Level folder contains [[Share_Drive_Protocols#Contents|subfolders]] and their content must adhere to certain specifications prior to the collection being considered "Ready" (to go online via the Metadata Unit) or "Store" (to go directly into storage).
+
  
====Admin====
+
'''One-Shot Collections''' are small enough to be completed in one batch, so they require no batch numbering or batching process.  
* This folder must exist.
+
  
*Must contain:
+
'''Ongoing Collections''' are those which will have multiple batches for upload, so they require "batching."
** [[Collection_Information|u0003_0000001.xml]]
+
  Make sure to refer to the [[Collection_Information|Collection_Information]] page regarding acceptable data values.
+
* [[TrackingFiles#Scope|u0003_0000001.log.txt]]
+
  Include a text version of the log file with every batch; previous versions of the log will be overwritten by the newest version.
+
  Specifically, since this log file's source file is an Excel workbook, it is the "log" sheet within the Excel file that needs to be saved as u0003_0000001.log.txt.
+
  This is the sheet with all the scanning data (technician, dates, # of scans, etc.).
+
* May also contain:
+
** [[Thumbs_icons|Thumbs Icon]] - .png extension.
+
** [[Most_Content#Getting_content_from_the_share_drive_to_live_in_Acumen|OCR list]] - .ocrList.txt extension.
+
<!--** Finding Aid - .xml extension-->
+
** and other relevant documents saved as plain .txt (ANSI or UTF-8 without BOM preferred). If possible please incorporate any additional data into the log.txt file.
+
  For example, audio collections often have significant item-level notes that we want to retain. These plain text files can be saved with a ".notes.txt" extension - i.e. "u0008_0000001_0000001.notes.txt".
+
  
* If multiple Digital Collections spawn from the same Analog Collection, there can be more than one Collection Information XML file as follows:
+
===One-Shot Collections===
** u0003_0000001.1.xml
+
** u0003_0000001.2.xml
+
  
====Metadata====
+
# Move collection folder from Digital_Coll_in_progress to Digital_Coll_Complete
* This folder must exist.
+
# Remove extra text in folder name so that it is labeled with just the collection number
 +
## Example: u0003_0000633_HarperTimetables --> u0003_0000633
 +
# Finalize collection documentation
 +
## '''Tracking Data'''
 +
### OLD/SEPARATE FILE: save TrackingFiles spreadsheet (see S:\Digital Projects\Organization\TrackingFiles) as tab delimited .txt file in Admin folder (see [[TrackingFiles]] for more)
 +
### NEW/INTEGRATED: COPY Filename column from the metadata spreadsheet to a new sheet, MOVE Tracking Data columns to new sheet, SAVE as tab delimited .txt file in Admin folder (see [[Tracking Data]] for more)
 +
## '''OCR List''': see [[OCR List | these instructions]]
 +
## '''Metadata''':  save spreadsheet (minus the tracking data you should’ve already removed) as tab delimited .txt file in Metadata folder
 +
# [[Making MODS | Create MODS]]
 +
# Carefully check contents of the folders (see list below)
 +
## Make sure to remove any unnecessary files created during the capture process (for example, test scans or supplementary metadata or text file notes about progress)
 +
# Once everything is okay, you’re ready to [[Most Content | Upload Content]]!
  
<!--
+
===Ongoing Collections===
* If going to Metadata Unit, must contain:
+
** [[Descriptive_metadata|u0003_0000001.xlsx]]
+
  The Metadata Unit will convert this to a tab-delimited.txt file after adding subject headings, etc. and uploading the content. Will get converted to UTF-8 later.
+
  
  If metadata for a set of scans being transported to Storage needs to be segmented out, instructions for doing so can be viewed [[Metadata_Movement#Metadata_Transfer_and_Remediation_for_Collections_Requiring_Multiple_Uploads|here]].
+
# Set up collection folder in Digital_Coll_Complete, and create inside it
  The work of parsing out the metadata can be assisted by script; see: [[Parsing_Metadata]].
+
## Admin
-->
+
## Metadata
 +
## Transcripts (if necessary)
 +
# Move Scans folder from ongoing collection folder in Digital_Coll_in_progress to this new collection folder in Digital_Coll_Complete
 +
# Follow the procedures for [[Batches | Creating Batch Documentation]]
 +
# [[Making MODS | Create MODS]]  
 +
# Double-check contents of the folders (see list below)
 +
# Once everything is okay, you’re ready to [[Most Content | Upload Content]]!
  
If going directly to Storage, must contain:
 
* u0003_0000001.txt
 
  This is tab-delimited text export of the original spreadsheet.
 
  The prior .xlsx spreadsheet should be moved to S:\Digital Projects\Administrative\collectionInfo\Storage_Excel. Will get converted to UTF-8 later.
 
 
 
''If this is a large or ongoing collection, the tab-delimited text export should contain ONLY the metadata for the items currently being transported to storage.  The text file itself should have a period and then a number to indicate which portion of the complete metadata this is.  The first tab-delimited export would be named, for example, u0003_0000001.1.txt, and would contain the first 500 entries, for example. The second tab-delimited export, for items 501-1000, would be named u0003_0000001.2.txt, and so forth.  Thus, only by collecting all these tab-delimited exports do we have a complete set of descriptive metadata for the collection items.''
 
  
''If this is an ongoing collection, the Excel version of the metadata for the items currently being transported to storage must go in the Metadata Unit's remediation queue.''
+
==Checking Folders==
 +
The Collection Level folder contains [[Share_Drive_Protocols#Contents|subfolders]], and their content must meet certain specifications before the collection is considered ready to "ship" for online access and long-term storage.
  
* May also contain (whether going to Metadata Unit or Storage):
+
The collection number u0003_0000001 will be used as an example.
** [http://www.loc.gov/ead/ u0003_0000001.ead.xml]
+
** [http://www.loc.gov/standards/mods/ u0003_0000001.mods.xml]
+
** [http://www.loc.gov/standards/mets/ u0003_0000001.mets.xml]
+
  All must use ANSI or UTF-8 without BOM encoding.
+
  
  Additionally, if you are comfortable with XML, please open the EAD file and look for this line (it should be the 4th one down):
+
===Admin===
  <eadid countrycode="US" mainagencycode="US-US-ALM"></eadid>
+
* This folder must exist.
  If the collection number is not there (what we name the file:  u0003_0000580) then please enter it, so that line looks like this:
+
 
  <eadid countrycode="US" mainagencycode="US-US-ALM">u0003_0000580</eadid>
+
*Must contain:
  This way the file self-references, and can be found by this number during searches, if we index it properly. Also, if something gets misnamed somewhere, this will help to sort out the problem.
+
** '''[[Collection_Information|u0003_0000001.xml]]'''
  - J Deridder, 082409
+
***If multiple Digital Collections spawn from the same Analog Collection, there can be more than one Collection Information XML file as follows: u0003_0000001.1.xml, u0003_0000001.2.xml, etc.
 +
***Make sure to refer to the [[Collection_Information|Collection_Information]] page regarding acceptable data values.
 +
** '''[[Tracking Data|u0003_0000001.log.txt]]'''
 +
***Include a text version of the log file with every batch; previous versions of the log will be overwritten by the newest version.
  
====Scans====
+
* May also contain:
 +
** [[Thumbs_icons|Thumbs Icon]] - .png extension.
 +
** [[OCR List|OCR list]] - .ocrList.txt extension.
 +
*** '''u0001_0000002.ocrList.txt'''
 +
<!--** Finding Aid - .xml extension-->
 +
** [[Skipped Items | Skipped items list]] - .skipped.txt extension. For batched collections: this should be present ONLY during the last upload, as it will contain information about skipped items across the entire collection.
 +
*** '''u0003_0000193.skipped.txt'''
 +
**Other relevant documents saved as plain .txt (ANSI or UTF-8 without BOM preferred). If possible please incorporate any additional data into the log.txt file. For example, audio collections often have significant item-level notes that we want to retain. These plain text files can be saved with a ".notes.txt" extension
 +
*** '''u0008_0000001_0000001.notes.txt'''
 +
 
 +
===Metadata===
 +
* This folder must exist.
 +
* Must contain:
 +
**'''u0003_0000001.m01.txt'''  or '''u0002_0000001.m03.txt'''  or '''u0008_0000001.m02.txt'''
 +
***Note the type of spreadsheet is echoed in the segment before the ".txt" --  if this is a batch file, the batch number precedes the m0x value: 
 +
**** example: '''u0002_0000001.1.m01.txt'''.
 +
**** check this file for: Diacritics, Quotes, UTF-8 encoding
 +
***see [https://intranet.lib.ua.edu/cataloging/metadata/SpreadsheetRegistry] for more information.
 +
**'''A MODS folder'''
 +
*** This folder will contain all the MODS files created via Archivist Utility (see: [[Making MODS]]).
 +
 
 +
===Scans===
 
* This folder must exist.  
 
* This folder must exist.  
   Note: we may break Scans folders into chunks for manageability, for more information [[Scans_folder|click here]].
+
    
 
+
 
*Must only contain:
 
*Must only contain:
** scans (tiffs/wavs) of non-compound objects and compound objects (inside respective subfolders)
+
**'''Scans (tiffs/wavs)''' of non-compound objects and compound objects (inside respective subfolders). All other file types will not be retained. Temporary files and thumbs.db files do not have to be deleted since they will be removed upon transfer to Storage.
  All other file types will not be retained. Temporary files and thumbs.db files do not have to be deleted since they will be removed upon transfer to Storage.
+
  
 
====Transcripts====
 
====Transcripts====
* This folder must exist only if [[transcripts]] exist.
+
* This folder can exist only if [[transcripts]] exist.
 
*Must only contain one or more of the following types of files:
 
*Must only contain one or more of the following types of files:
 
** u0003_0000001_0000001.tif, u0003_0000002_0001.tif, etc. - corresponding to non-compound and compound objects (inside respective subfolders).
 
** u0003_0000001_0000001.tif, u0003_0000002_0001.tif, etc. - corresponding to non-compound and compound objects (inside respective subfolders).
 
** u0003_0000001_0000001.txt, u0003_0000002_0001.txt, etc. - plain .txt files corresponding to non-compound and compound objects (inside respective subfolders).
 
** u0003_0000001_0000001.txt, u0003_0000002_0001.txt, etc. - plain .txt files corresponding to non-compound and compound objects (inside respective subfolders).
** u0003_0000001_0000001.ocr.txt, u0003_0000002_0001.ocr.txt, etc. - plain .txt files of OCRed tiffs corresponding to non-compound and compound objects (inside respective subfolders).
+
** u0003_0000001_0000001.ocr.txt, u0003_0000002_0001.ocr.txt, etc. - plain .txt files of OCRed tiffs corresponding to non-compound and compound objects (inside respective subfolders). If cleaned up .txt files exist, remove the corresponding .ocr.txt file.
  If cleaned up .txt files exist, remove the corresponding .ocr.txt file.
+
  
===Perform Quality Control Tasks===
+
===needsRemediation===
see [[Quality_Control]]
+
*This folder always exists at the following location.
 
+
** S:\Digital Projects\Administrative\Pipeline\collectionInfo\forMDlib\needsRemediation
===Spot check all .xml, .txt, and .xlsx files===
+
*Must contain:
* Check all such files for proper filenames and extensions.
+
**Excel formatted version of the batch metadata.
* Open all such files and look for anomalies and inconsistencies, misspellings, and missing data, etc.
+
***example: '''u0002_0000001.1.m01.xlsx'''.
** Examples:
+
*** make sure fields such as Funder(s), Funding Information, Repository Collection, Digital Collection, Digital Publisher, etc. are filled out.
+
*** ideally, no additional fields such as "Notes" are in the Metadata file. "Notes" as such should be deleted or moved to the appropriate row in the log.txt file.
+
*** make sure the Format column in the Metadata file has not been altered to the Time format as it sometimes is:
+
  If a tab-delimited metadata file is opened directly in Excel (especially by right-clicking the file and choosing to open it in Excel), Format column values such as 3 p., 4 p., etc. will be interpreted as
+
  3:00 PM, 4:00 PM, etc.
+
  If then resaved as .txt, times will have been saved instead of page #s.
+
  The way around this is to start Excel first and choose Open.
+
  Then select your text file and, when Excel's import wizard asks how to import it, set the Format column to "Text".
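
A damaged file can also be caught after the fact. The following is a minimal Python sketch, not part of the documented workflow: it assumes the metadata export has a header row with a column literally named "Format", and the function name and time pattern are illustrative only.

```python
import csv
import re

# Matches values such as "3:00 PM" or "15:00:00", which suggest that Excel
# converted a page count like "3 p." into a time.
TIME_RE = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?\s*(AM|PM)?$", re.IGNORECASE)


def time_like_formats(lines, column="Format"):
    """Return (row_number, value) pairs whose Format value looks like a time.

    `lines` is any iterable of text lines from a tab-delimited export,
    for example an open file object. Row 1 is assumed to be the header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    suspicious = []
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if TIME_RE.match(value):
            suspicious.append((row_number, value))
    return suspicious
```

To use it on a real export, open the file with `open(path, encoding="utf-8", newline="")` and pass the file object as `lines`. An empty result means no time-like values were found in that column.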
+
 
+
===Check all Folder names===
+
*Make sure folders are named correctly and that there are no superfluous word concatenations to object level folders, etc.
+
 
+
===Match Data across documents and folders===
+
 
+
This table attempts to show how data in one of our documents/folders should match with data in another document/folder.
+
 
+
Corresponding data must properly match/equate prior to marking "Scans" folders - and especially collection level folders - as "Ready" or "Store".
+
 
+
 
+
{| {{table }} border=1
+
| align="center" style="background:#f0f0f0;"|'''TrackingFileNames'''
+
| align="center" style="background:#f0f0f0;"|'''Admin XML'''
+
| align="center" style="background:#f0f0f0;"|'''Metadata'''
+
| align="center" style="background:#f0f0f0;"|'''Archivist Queue ("Selection.xlsx")'''
+
| align="center" style="background:#f0f0f0;"|'''TrackingFiles'''
+
| align="center" style="background:#f0f0f0;"|'''Finding Aid'''
+
| align="center" style="background:#f0f0f0;"|'''Scans folder'''
+
|-
+
| Collection Number||||[first 14 characters of Filename]||||Collection Number||||
+
|-
+
| Container||||Container Number||Container||||||
+
|-
+
| physical location||Manuscript_Number <font color=red>[Is this true?]</font>||||Manuscript Number; Physical Location||||MSS #||
+
|-
+
| primary analog format||||Genre [since there can be multiple genres in the metadata, this will correspond, on average, to the TrackingFileNames Primary Analog Format]||Genre/type||||||
+
|-
+
| source collection||Analog_Collection_Name||Repository Collection||Name of Analog Collection||||Title||
+
|-
+
| Description||Digital_Collection_Description||||Blurb||||Abstract; or Scope and Contents [limited to digitized portion]||
+
|-
+
| ||Digital_Collection_Name||Digital Collection||Project Name||||||
+
|-
+
| ||Alphabetized_By||||Alphabetize||||||
+
|-
+
| ||Type_Of_Content||Type(s) [since there can be multiple Types in the Metadata, this will correspond, on average, to the Admin XML Type_of_Content]||Genre/type||||||
+
|-
+
| ||Finding_Aid_Link||||Link to finding aid ||||||
+
|-
+
| ||||||||Total Scans (Scans + Transcript Scans)||||[use PERL script or folder search to retrieve number of Total Scans (tifs or wavs)]
+
|-
+
| ||||[count total objects in Metadata sheet]||||Total Objects||||[use PERL script or folder search to retrieve number of objects (singletons + compound object subfolders)]
+
|-
+
|}
+
 
+
  From a digital preservation/delivery perspective, it's not as important to match information to the Archivist Queue spreadsheet, though it would be ideal if possible.
+
  Also, it isn't always feasible to match the actual number of scans against the count recorded in the TrackingFiles, although that is also ideal.
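
The table mentions using a script or folder search to retrieve the Total Scans and Total Objects counts. The Python sketch below shows one way such a count could be done. It is an illustration rather than the script referred to above, and it assumes that singleton scans sit directly inside the Scans folder while each compound object has its own immediate subfolder. The function name is hypothetical.

```python
from pathlib import Path

# File extensions treated as scans (tiffs/wavs); matching is case-insensitive.
SCAN_SUFFIXES = {".tif", ".tiff", ".wav"}


def count_scans_and_objects(scans_dir):
    """Return (total_scans, total_objects) for a Scans folder.

    Assumption: singletons are scan files directly in `scans_dir`, and every
    immediate subfolder is one compound object.
    """
    root = Path(scans_dir)
    total_scans = sum(
        1 for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SCAN_SUFFIXES
    )
    singletons = sum(
        1 for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SCAN_SUFFIXES
    )
    compound_objects = sum(1 for p in root.iterdir() if p.is_dir())
    return total_scans, singletons + compound_objects
```

If a Scans folder has been split into chunks, the layout assumption no longer holds and the function would need to be run on each chunk separately.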
+

Revision as of 17:38, 23 April 2013

This page assumes the content for upload has already been through the Quality Control process.

Preparation Procedure

Choose one of the following checklists:

One-Shot Collections are small enough to be completed in one batch, so they require no batch numbering or batching process.

Ongoing Collections are those which will have multiple batches for upload, so they require "batching."

One-Shot Collections

  1. Move collection folder from Digital_Coll_in_progress to Digital_Coll_Complete
  2. Remove extra text in folder name so that it is labeled with just the collection number
    1. Example: u0003_0000633_HarperTimetables --> u0003_0000633
  3. Finalize collection documentation
    1. Tracking Data
      1. OLD/SEPARATE FILE: save TrackingFiles spreadsheet (see S:\Digital Projects\Organization\TrackingFiles) as tab delimited .txt file in Admin folder (see TrackingFiles for more)
      2. NEW/INTEGRATED: COPY Filename column from the metadata spreadsheet to a new sheet, MOVE Tracking Data columns to new sheet, SAVE as tab delimited .txt file in Admin folder (see Tracking Data for more)
    2. OCR List: see these instructions
    3. Metadata: save spreadsheet (minus the tracking data you should’ve already removed) as tab delimited .txt file in Metadata folder
  4. Create MODS
  5. Carefully check contents of the folders (see list below)
    1. Make sure to remove any unnecessary files created during the capture process (for example, test scans or supplementary metadata or text file notes about progress)
  6. Once everything is okay, you’re ready to Upload Content!
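
Step 2 above (trimming the folder name down to the collection number) can be sketched in Python as follows. This is an illustration only: the pattern assumes collection numbers always look like the examples on this page (one lowercase letter, four digits, an underscore, seven digits), and the function name is hypothetical.

```python
import re

# Assumed shape of a collection number, based on examples such as u0003_0000633.
COLLECTION_RE = re.compile(r"^([a-z]\d{4}_\d{7})")


def collection_number(folder_name):
    """Return the collection number at the start of a folder name."""
    match = COLLECTION_RE.match(folder_name)
    if match is None:
        raise ValueError(f"no collection number at start of {folder_name!r}")
    return match.group(1)


print(collection_number("u0003_0000633_HarperTimetables"))  # u0003_0000633
```

The returned value is the name the folder should carry once it is in Digital_Coll_Complete. The actual rename is left to the technician so that nothing on the share drive is changed by accident.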

Ongoing Collections

  1. Set up collection folder in Digital_Coll_Complete, and create inside it
    1. Admin
    2. Metadata
    3. Transcripts (if necessary)
  2. Move Scans folder from ongoing collection folder in Digital_Coll_in_progress to this new collection folder in Digital_Coll_Complete
  3. Follow the procedures for Creating Batch Documentation
  4. Create MODS
  5. Double-check contents of the folders (see list below)
  6. Once everything is okay, you’re ready to Upload Content!
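
Step 1 of the Ongoing Collections checklist can be sketched as a short Python helper. This is a hedged example, not an official tool: the function name and the `complete_root` argument (the path to Digital_Coll_Complete on your machine) are assumptions made for illustration.

```python
from pathlib import Path


def set_up_collection_folder(complete_root, collection, with_transcripts=False):
    """Create the collection folder with Admin and Metadata subfolders.

    A Transcripts subfolder is created only when `with_transcripts` is True,
    since that folder should exist only if transcripts exist.
    """
    base = Path(complete_root) / collection
    names = ["Admin", "Metadata"]
    if with_transcripts:
        names.append("Transcripts")
    for name in names:
        # exist_ok=True makes the helper safe to run again for later batches.
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

The Scans folder is deliberately not created here, because step 2 moves the existing Scans folder over from Digital_Coll_in_progress.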


Checking Folders

The Collection Level folder contains subfolders, and their content must meet certain specifications before the collection is considered ready to "ship" for online access and long-term storage.

The collection number u0003_0000001 will be used as an example.

Admin

  • This folder must exist.
  • Must contain:
    • u0003_0000001.xml
      • If multiple Digital Collections spawn from the same Analog Collection, there can be more than one Collection Information XML file as follows: u0003_0000001.1.xml, u0003_0000001.2.xml, etc.
      • Make sure to refer to the Collection_Information page regarding acceptable data values.
    • u0003_0000001.log.txt
      • Include a text version of the log file with every batch; previous versions of the log will be overwritten by the newest version.
  • May also contain:
    • Thumbs Icon - .png extension.
    • OCR list - .ocrList.txt extension.
      • u0001_0000002.ocrList.txt
    • Skipped items list - .skipped.txt extension. For batched collections: this should be present ONLY during the last upload, as it will contain information about skipped items across the entire collection.
      • u0003_0000193.skipped.txt
    • Other relevant documents saved as plain .txt (ANSI or UTF-8 without BOM preferred). If possible, please incorporate any additional data into the log.txt file. For example, audio collections often have significant item-level notes that we want to retain. These plain text files can be saved with a ".notes.txt" extension:
      • u0008_0000001_0000001.notes.txt
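
The "must contain" rules for the Admin folder can be checked with a small script. The sketch below is one possible approach and makes no claim to match any existing in-house tool. The function name is hypothetical, and it only checks the two required files described above.

```python
import re
from pathlib import Path


def check_admin(admin_dir, collection):
    """Return a list of problems found in an Admin folder (empty if none)."""
    admin = Path(admin_dir)
    if not admin.is_dir():
        return [f"missing folder: {admin}"]

    names = {p.name for p in admin.iterdir() if p.is_file()}
    problems = []

    # Accept u0003_0000001.xml as well as numbered files such as
    # u0003_0000001.1.xml, used when several Digital Collections come
    # from the same Analog Collection.
    xml_re = re.compile(re.escape(collection) + r"(\.\d+)?\.xml")
    if not any(xml_re.fullmatch(name) for name in names):
        problems.append(f"missing Collection Information file: {collection}.xml")

    if f"{collection}.log.txt" not in names:
        problems.append(f"missing log file: {collection}.log.txt")

    return problems
```

For example, `check_admin(r"...\u0003_0000001\Admin", "u0003_0000001")` returns an empty list when both required files are present. Optional files (icon, OCR list, skipped items list, notes) are ignored.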

Metadata

  • This folder must exist.
  • Must contain:
    • u0003_0000001.m01.txt or u0002_0000001.m03.txt or u0008_0000001.m02.txt
      • Note the type of spreadsheet is echoed in the segment before the ".txt" -- if this is a batch file, the batch number precedes the m0x value:
        • example: u0002_0000001.1.m01.txt.
        • check this file for: Diacritics, Quotes, UTF-8 encoding
      • see https://intranet.lib.ua.edu/cataloging/metadata/SpreadsheetRegistry for more information.
    • A MODS folder
      • This folder will contain all the MODS files created via Archivist Utility (see: Making MODS).
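
The naming rule and the UTF-8 check for the metadata file can be illustrated in Python. This is a sketch under stated assumptions: the regular expression is inferred from the example filenames above (an optional batch number, then an "m" plus two digits), and both function names are hypothetical. It does not check diacritics or quotes, which still need a visual check.

```python
import re
from pathlib import Path

# Inferred from examples such as u0003_0000001.m01.txt and
# u0002_0000001.1.m01.txt (batch number before the m0x segment).
META_RE = re.compile(
    r"^[a-z]\d{4}_\d{7}(?:\.(?P<batch>\d+))?\.(?P<sheet>m\d{2})\.txt$"
)


def parse_metadata_name(name):
    """Return the batch number and spreadsheet type, or None if malformed."""
    match = META_RE.match(name)
    if match is None:
        return None
    return {"batch": match.group("batch"), "sheet": match.group("sheet")}


def decodes_as_utf8(path):
    """Return True if the file's bytes are valid UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A result of `None` from `parse_metadata_name` means the filename does not follow the pattern. A batch value of `None` means the file belongs to a collection that was not batched.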

Scans

  • This folder must exist.
  • Must only contain:
    • Scans (tiffs/wavs) of non-compound objects and compound objects (inside respective subfolders). All other file types will not be retained. Temporary files and thumbs.db files do not have to be deleted since they will be removed upon transfer to Storage.

Transcripts

  • This folder can exist only if transcripts exist.
  • Must only contain one or more of the following types of files:
    • u0003_0000001_0000001.tif, u0003_0000002_0001.tif, etc. - corresponding to non-compound and compound objects (inside respective subfolders).
    • u0003_0000001_0000001.txt, u0003_0000002_0001.txt, etc. - plain .txt files corresponding to non-compound and compound objects (inside respective subfolders).
    • u0003_0000001_0000001.ocr.txt, u0003_0000002_0001.ocr.txt, etc. - plain .txt files of OCRed tiffs corresponding to non-compound and compound objects (inside respective subfolders). If cleaned up .txt files exist, remove the corresponding .ocr.txt file.
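
The last rule above (remove an .ocr.txt file when a cleaned-up .txt exists) can be checked with a short script. The Python sketch below only lists the candidates and deletes nothing. The function name is hypothetical, and it assumes the cleaned-up file sits in the same folder as its .ocr.txt counterpart.

```python
from pathlib import Path

OCR_SUFFIX = ".ocr.txt"


def redundant_ocr_files(transcripts_dir):
    """Return .ocr.txt files that have a cleaned-up .txt file beside them."""
    redundant = []
    for ocr in sorted(Path(transcripts_dir).rglob("*" + OCR_SUFFIX)):
        # u0003_0000001_0000001.ocr.txt -> u0003_0000001_0000001.txt
        cleaned = ocr.with_name(ocr.name[: -len(OCR_SUFFIX)] + ".txt")
        if cleaned.is_file():
            redundant.append(ocr)
    return redundant
```

Reviewing the returned list before removing anything keeps the decision with the person preparing the collection.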

needsRemediation

  • This folder always exists at the following location:
    • S:\Digital Projects\Administrative\Pipeline\collectionInfo\forMDlib\needsRemediation
  • Must contain:
    • Excel formatted version of the batch metadata.
      • example: u0002_0000001.1.m01.xlsx.