Preparing Collections on the S Drive for Online Delivery and Storage

From UA Libraries Digital Services Planning and Documentation

Latest revision as of 16:27, 11 July 2017

This page assumes the content for upload has already been through the Quality Control process.

Slide1.PNG

Slide2.PNG


Preparation Procedure

Choose one of the following checklists:

One-Shot Collections are small enough to be completed in one batch, so they require no batch numbering or batching process.

Ongoing Collections are those which will have multiple batches for upload, so they require "batching."

One-Shot Collections

  1. Move collection folder from Digital_Coll_in_progress to Digital_Coll_Complete
  2. Remove extra text in folder name so that it is labeled with just the collection number
    1. Example: u0003_0000633_HarperTimetables --> u0003_0000633
  3. Finalize collection documentation
    • Tracking Data: COPY the Filename column from the metadata spreadsheet to a new sheet, MOVE the Tracking Data columns to the new sheet, and SAVE it as a tab-delimited .txt file in the Admin folder (see Tracking Data for more)
    • Metadata: save the spreadsheet (minus the tracking data you should have already removed) as a tab-delimited .txt file in the Metadata folder
  4. Carefully check contents of the folders (see list below)
    • Make sure to remove any unnecessary files created during the capture process (for example, test scans or supplementary metadata or text file notes about progress)
  5. Once everything is okay, you’re ready to Upload Content!
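
Step 2 above (trimming the folder name down to the collection number) can be sketched in Python. This is only an illustration: the pattern "u", four digits, underscore, seven digits is inferred from the examples on this page, and the function name is hypothetical.

```python
import re

# Assumed ID shape, inferred from examples such as u0003_0000633.
COLLECTION_ID = re.compile(r"^(u\d{4}_\d{7})(?:_.*)?$")


def collection_folder_name(name: str) -> str:
    """Return only the collection number from a working folder name."""
    match = COLLECTION_ID.match(name)
    if match is None:
        raise ValueError(f"not a recognizable collection folder: {name!r}")
    return match.group(1)


print(collection_folder_name("u0003_0000633_HarperTimetables"))
```

The function only computes the new name; the actual move and rename on the S drive is left to the person doing the work.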

Ongoing Collections

  1. Set up the collection folder in Digital_Coll_Complete, and create the following folders inside it:
    • Admin
    • Metadata
    • Transcripts (if necessary)
  2. Move Scans folder from ongoing collection folder in Digital_Coll_in_progress to this new collection folder in Digital_Coll_Complete
  3. Follow the procedures for Creating Batch Documentation
  4. Double-check contents of the folders (see list below)
  5. Once everything is okay, you’re ready to Upload Content!
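
Step 1 of the ongoing-collection checklist could be scripted roughly as follows. This is a minimal sketch, not an existing tool: the function name and its parameters are hypothetical, and the root path is whatever location Digital_Coll_Complete has on your machine.

```python
from pathlib import Path


def create_collection_skeleton(complete_root, collection_id, with_transcripts=False):
    """Create Admin and Metadata (and optionally Transcripts) for a collection."""
    collection = Path(complete_root) / collection_id
    names = ["Admin", "Metadata"]
    if with_transcripts:
        names.append("Transcripts")
    for name in names:
        # exist_ok lets the function be re-run safely for later batches.
        (collection / name).mkdir(parents=True, exist_ok=True)
    return collection
```

The Scans folder is deliberately not created here, since step 2 moves the existing one over from Digital_Coll_in_progress.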

Checking Folders

The Collection Level folder contains subfolders and their content must adhere to certain specifications prior to the collection being considered ready to "ship" for online access and long term storage.


The following folders must exist and be capitalized as shown:

  • Admin
  • Metadata
  • Scans
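
A quick way to confirm the three required folders, including their capitalization, is sketched below. The function name is hypothetical. Names are compared against the actual directory listing so that "metadata" is not accepted in place of "Metadata", even on a case-insensitive file system.

```python
import os

REQUIRED_FOLDERS = ("Admin", "Metadata", "Scans")


def missing_required_folders(collection_path):
    """Return required folder names that are absent or wrongly capitalized."""
    present = {entry.name for entry in os.scandir(collection_path) if entry.is_dir()}
    return [name for name in REQUIRED_FOLDERS if name not in present]
```

An empty list means all three folders are present with the expected spelling.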

Admin

  • MUST contain:
    • Collection information XML file
      • u0003_0000001.xml
      • If multiple Digital Collections spawn from the same Analog Collection, there can be more than one Collection Information XML file as follows: u0003_0000001.1.xml, u0003_0000001.2.xml, etc.
      • Make sure to refer to the Collection_Information page regarding acceptable data values.
    • Log file
      • u0003_0000001.log.txt
      • Include a text version of the log file with every batch. The first column must contain IDs and the second column must contain pages; otherwise the script will report errors.
  • MAY also contain:
    • Thumbs Icon - .png extension.
      • u0001_2007001.icon.png
    • Skipped items list - .skipped.txt extension. For batched collections: this should be present ONLY during the last upload, as it will contain information about skipped items across the entire collection.
      • u0003_0000193.skipped.txt NOTE: the archiving script doesn't yet know what to do with this optional file. Move by hand to the archive during that process.
    • Match file - .match.txt extension. Tell Jody it's there for pickup, or else COPY it (as root) to /srv/JodysScriptArea/eads/MATCH
      • u0001_2007010.match.txt (This file provides a match between photo IDs and assigned IDs so content can be linked in the right place in the EAD, and found by users)
    • Other relevant documents saved as plain .txt (ANSI or UTF-8 without BOM preferred). If possible, please incorporate any additional data into the log.txt file. For example, audio collections often have significant item-level notes that we want to retain. Additional notes can be saved as plain text files with a ".notes.txt" extension.
      • u0008_0000001_0000001.notes.txt
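
The two MUST-contain rules for the Admin folder can be checked with a short sketch like the one below. It is illustrative only: the function name is hypothetical, and it looks at filenames alone, not at the contents of the XML or log file.

```python
import re


def missing_admin_files(filenames, collection_id):
    """Return the required Admin files that are not in the given listing."""
    names = set(filenames)
    cid = re.escape(collection_id)
    # Accepts u0003_0000001.xml as well as u0003_0000001.1.xml, .2.xml, etc.
    xml_pattern = re.compile(rf"^{cid}(\.\d+)?\.xml$")
    missing = []
    if not any(xml_pattern.match(name) for name in names):
        missing.append(f"{collection_id}.xml")
    if f"{collection_id}.log.txt" not in names:
        missing.append(f"{collection_id}.log.txt")
    return missing
```

Passing it the result of `os.listdir()` on the Admin folder gives a list of what still needs to be added before upload.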

Metadata

  • MUST contain:
    • Excel metadata spreadsheet
      • u0003_0000001.m01.xlsx or u0002_0000001.m03.xlsx or u0008_0000001.m02.xlsx
      • Note that the type of spreadsheet is echoed in the segment before the ".xlsx" extension (m01, m02, or m03). If this is a batch file, the batch number precedes the m0x value, for example: u0002_0000001.1.m01.xlsx.
    • FITS folder
      • With FITS files created by script (NOTE: Audio fits2aes puts the FITS and AES files on the server)
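
The spreadsheet naming rule above can be expressed as a pattern. The sketch below assumes the ID and type shapes seen in the examples on this page (collection number, optional batch number, then m followed by two digits); the function name is hypothetical.

```python
import re

# collection number, optional ".<batch>", then ".<m0x>.xlsx"
METADATA_NAME = re.compile(
    r"^(?P<collection>u\d{4}_\d{7})(?:\.(?P<batch>\d+))?\.(?P<type>m\d{2})\.xlsx$"
)


def parse_metadata_filename(name):
    """Split a metadata spreadsheet name into its parts, or return None."""
    match = METADATA_NAME.match(name)
    if match is None:
        return None
    batch = match.group("batch")
    return {
        "collection": match.group("collection"),
        "batch": int(batch) if batch else None,
        "type": match.group("type"),
    }


print(parse_metadata_filename("u0002_0000001.1.m01.xlsx"))
```

A name with the batch number after the m0x value, such as u0002_0000001.m01.1.xlsx, does not match and returns None.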

Scans

  • MUST contain ONLY:
    • Scans (tiffs/wavs) of non-compound objects and compound objects (inside respective subfolders). All other file types will not be retained. Temporary files and thumbs.db files do not have to be deleted since they will be removed upon transfer to Storage.
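
To see in advance which files in Scans would not be retained, something like the following sketch can help. The extension list (.tif, .tiff, .wav) is an assumption based on "tiffs/wavs" above, and the function name is hypothetical.

```python
import os

# Assumed extensions for "tiffs/wavs"; adjust if your scans use others.
RETAINED_EXTENSIONS = {".tif", ".tiff", ".wav"}


def files_not_retained(scans_path):
    """List files under Scans (relative paths) that are not tiffs or wavs."""
    flagged = []
    for folder, _subfolders, files in os.walk(scans_path):
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            if extension not in RETAINED_EXTENSIONS:
                full_path = os.path.join(folder, name)
                flagged.append(os.path.relpath(full_path, scans_path))
    return sorted(flagged)
```

Temporary files and thumbs.db will appear in the result too. They can be left in place, but anything else listed is worth a look in case it belongs in another folder.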

Transcripts

  • This folder CAN exist ONLY IF transcripts exist.
  • Must only contain one or more of the following types of files:
    • u0003_0000001_0000001.txt, u0003_0000002_0001.txt, etc. - plain text transcripts
    • u0003_0000001_0000001.ocr.txt, u0003_0000002_0001.ocr.txt, etc. - plain text OCR from images