Preparing Collections on the S Drive for Online Delivery and Storage

From UA Libraries Digital Services Planning and Documentation

Revision as of 08:27, 25 October 2012


Things to do before beginning an upload

The collection number u0003_0000001 will be used as an example.

Check Subfolders of Collection Level Folder

The Collection Level folder contains subfolders. Their contents must meet the specifications below before the collection is considered ready to "ship" for online access and long-term storage.

Admin

  • This folder must exist.
  • May also contain:
    • Thumbs Icon - .png extension.
    • OCR list - .ocrList.txt extension.
    • Other relevant documents saved as plain .txt (ANSI or UTF-8 without BOM preferred). If possible, please incorporate any additional data into the log.txt file. For example, audio collections often have significant item-level notes that we want to retain. These plain text files can be saved with a ".notes.txt" extension, e.g. "u0008_0000001_0000001.notes.txt".
  • If multiple Digital Collections spawn from the same Analog Collection, there can be more than one Collection Information XML file as follows:
    • u0003_0000001.1.xml
    • u0003_0000001.2.xml


Metadata

  • This folder must exist.
  • Must contain:
    • u0003_0000001.m01.txt or u0002_0000001.m03.txt or u0008_0000001.m02.txt
(Note that the type of spreadsheet is echoed in the segment before ".txt". If this is a batch file, the batch number follows the m0x value: u0002_0000001.m01.1.txt.)

(See https://intranet.lib.ua.edu/cataloging/metadata/SpreadsheetRegistry for more information.)


      • This is a tab-delimited text export of the original spreadsheet. (If this file was exported from Excel as a tab-delimited .txt file, you must open it in Notepad++ and do a Search and Replace to remove the quotation marks that Excel inserted at each tab.)
      • The source .xlsx spreadsheet should be moved to S:\Digital Projects\Administrative\Pipeline\collectionInfo\forMDlib\needsRemediation.
If this is a large or ongoing collection, the tab-delimited text export should contain ONLY the metadata for the items currently being transported to storage. The file name should include a period and then a number to indicate which portion, or "batch", of the complete metadata it holds. The first tab-delimited export would be named, for example, u0003_0000001.m01.1.txt and might contain the first 500 entries. The second export, for items 501-1000, would be named u0003_0000001.m01.2.txt, and so forth. Only by collecting all these tab-delimited exports do we have a complete set of descriptive metadata for the collection items. For more on how to parse these "batches" out from the complete set of descriptive metadata, see Parsing Metadata.
  • A MODS folder
    • This folder will contain all the MODS files created via Archivist Utility (see: Making MODS).
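
The naming rule and the quotation-mark cleanup described above can be sketched in Python. This is only an illustration, not one of the existing scripts: the function names are hypothetical, and the pattern assumes a collection number of the form u0003_0000001 followed by an m0x segment and an optional batch number. Note that `strip_excel_quotes` parses the export with the standard `csv` module instead of doing a blunt search and replace, so quotation marks that are genuinely part of a field are kept.

```python
import csv
import io
import re

# Assumed pattern: collection number, spreadsheet type (m01, m02, ...),
# an optional batch number, then ".txt".
METADATA_NAME = re.compile(r"^(u\d{4}_\d{7})\.(m\d{2})(?:\.(\d+))?\.txt$")


def parse_metadata_name(filename):
    """Return (collection, sheet_type, batch), or None if the name does not fit."""
    match = METADATA_NAME.match(filename)
    if match is None:
        return None
    collection, sheet_type, batch = match.groups()
    return collection, sheet_type, int(batch) if batch else None


def strip_excel_quotes(text):
    """Undo the quoting that Excel adds to fields in a tab-delimited export."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    # Re-join each parsed row with plain tabs; wrapping quotes are dropped.
    return "".join("\t".join(row) + "\n" for row in reader)
```

For example, `parse_metadata_name("u0002_0000001.m01.1.txt")` yields the collection number, the spreadsheet type "m01", and batch 1, while an old-style name such as "u0003_0000001.txt" is rejected.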

Scans

  • This folder must exist.
  • Must only contain:
    • Scans (tiffs/wavs) of non-compound objects and compound objects (inside respective subfolders). All other file types will not be retained. Temporary files and thumbs.db files do not have to be deleted since they will be removed upon transfer to Storage.
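
A quick way to see what would not be retained is to list every file under Scans that is not a tiff or wav. The sketch below is a hypothetical helper, not one of the existing QC scripts. The accepted extensions (.tif, .tiff, .wav) are an assumption based on the rule above, and only thumbs.db is skipped by name, since this page does not say how other temporary files are identified.

```python
from pathlib import Path

# Assumed extensions for "tiffs/wavs"; adjust if the collection differs.
ALLOWED_SUFFIXES = {".tif", ".tiff", ".wav"}
# thumbs.db is removed on transfer to Storage, so it is not reported.
IGNORED_NAMES = {"thumbs.db"}


def unexpected_files(scans_dir):
    """List files under scans_dir (relative paths) that are not scans."""
    root = Path(scans_dir)
    found = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if path.name.lower() in IGNORED_NAMES:
            continue
        if path.suffix.lower() not in ALLOWED_SUFFIXES:
            found.append(path.relative_to(root).as_posix())
    return sorted(found)
```

An empty list means the folder holds only scans (plus any thumbs.db files).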


Transcripts

  • This folder is required only if transcripts exist.
  • Must only contain one or more of the following types of files:
    • u0003_0000001_0000001.tif, u0003_0000002_0001.tif, etc. - corresponding to non-compound and compound objects (inside respective subfolders).
    • u0003_0000001_0000001.txt, u0003_0000002_0001.txt, etc. - plain .txt files corresponding to non-compound and compound objects (inside respective subfolders).
    • u0003_0000001_0000001.ocr.txt, u0003_0000002_0001.ocr.txt, etc. - plain .txt files of OCRed tiffs corresponding to non-compound and compound objects (inside respective subfolders). If cleaned up .txt files exist, remove the corresponding .ocr.txt file.
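
The last rule (remove the .ocr.txt file when a cleaned-up .txt exists) is easy to check mechanically. The following is a minimal sketch with a hypothetical function name. It only reports candidates and deletes nothing, and it assumes the cleaned-up file sits in the same folder as its .ocr.txt counterpart.

```python
from pathlib import Path

OCR_SUFFIX = ".ocr.txt"


def redundant_ocr_files(transcripts_dir):
    """List .ocr.txt files (relative paths) that have a cleaned-up .txt sibling."""
    root = Path(transcripts_dir)
    redundant = []
    for ocr in root.rglob("*" + OCR_SUFFIX):
        # u0003_0000001_0000001.ocr.txt -> u0003_0000001_0000001.txt
        cleaned = ocr.with_name(ocr.name[: -len(OCR_SUFFIX)] + ".txt")
        if cleaned.is_file():
            redundant.append(ocr.relative_to(root).as_posix())
    return sorted(redundant)
```

Each path returned is an OCR file that can be removed because a cleaned transcript already covers the same scan.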


Perform Quality Control Tasks

see Quality_Control

Note: quality control tasks should already have been performed by scanning technicians during the QC process, but it's a good idea to run the QC scripts again. It takes very little time and helps to catch any mistakes that might have gotten through.


Spot check all .xml, .txt, and .xlsx files

  • Check all such files for proper filenames and extensions.
  • Open all such files and look for anomalies and inconsistencies, misspellings, and missing data, etc.
    • Ideally, no additional fields such as "Notes" are in the Metadata file. Any such "Notes" should be deleted or moved to the appropriate row in the log.txt file.
    • Make sure the Format column in the Metadata file has not been converted to a time format. If a tab-delimited metadata file is opened directly in Excel (especially by right-clicking the file and choosing to open it in Excel), Format values such as "3 p." or "4 p." will be interpreted as 3:00 PM, 4:00 PM, etc. If the file is then resaved as .txt, times will be saved instead of page counts. To avoid this, start Excel first and choose Open, then select your text file; when Excel prompts for import settings, set the Format column to "Text".
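
The Format-column problem can also be caught by scanning the tab-delimited export for values that look like clock times. This is a rough sketch under a few assumptions: the function name is hypothetical, the header row names the column "Format", and the regular expression is only a guess at the shapes Excel produces (such as "3:00 PM" or "15:00").

```python
import csv
import io
import re

# Values such as "3:00 PM", "4:00:00 PM" or "15:00" are treated as times.
TIME_LIKE = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?(\s*[AP]M)?$", re.IGNORECASE)


def find_time_values(text, column="Format"):
    """Return (line_number, value) pairs where the column holds a time."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    hits = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if TIME_LIKE.match(value):
            # line_num counts physical lines read so far, header included.
            hits.append((reader.line_num, value))
    return hits
```

Any hit means the spreadsheet should be reopened with the Format column imported as "Text" and the export remade.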

If errors are found after the text exports and MODS files have been made, the Excel file must be corrected and the text and MODS files remade.


Check all Folder names

  • Make sure folders are named correctly and that no superfluous words have been appended to object-level folder names, etc.


Moving the Files to the Server

The folder should now be ready for the scripts that place the files on the storage server.

see: Most_Content
