Organization of completed content for long-term storage

From UA Libraries Digital Services Planning and Documentation

Revision as of 13:26, 29 October 2008

As we expand our holdings to multiple collections and to content from sources beyond Hoole, we need to organize files and folders systematically. The following solution follows a simple rule: replace each underscore in a file name with a forward slash, and drop the file extension, to determine the appropriate directory location for the file.

Based on the File Naming Schemes (http://intranet.lib.ua.edu/wiki/digcoll/index.php/File_naming_schemes) we selected, I propose the following organization for materials that we have already digitized:

Within a specified directory on a slow server (to reduce risk of damage and corruption by access):

1) By holding institution and subsidiary group

The first letter in a filename indicates its origin. All digital content starting with "u" originated from holdings within the University of Alabama, whereas other letters indicate origins elsewhere. Following the initial letter is a series of four digits indicating a grouping and an institution. Using these five characters (the letter plus the four digits) to create the first-level directory ensures that all content is segregated by holding organization; for example, all Hoole Rare books content will be in folder "u0004", as that is the number assigned to it. A file labeled u0004_0002061_0000345_0003.tif will therefore be found in a subdirectory under /u0004/ in the file system.
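As a minimal sketch of this first level, the snippet below pulls the holding-institution prefix out of a file name. The function name `holder_directory` and the regular expression (one lowercase letter followed by exactly four digits) are illustrative assumptions inferred from the example above, not something the naming scheme itself prescribes.

```python
import re

# Assumption: the prefix is one lowercase letter plus exactly four digits,
# ending at the first underscore or at the end of the name.
_HOLDER_RE = re.compile(r"^([a-z][0-9]{4})(?:_|$)")


def holder_directory(filename):
    """Return the first-level directory name for a file name."""
    match = _HOLDER_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognized file name: {filename!r}")
    return match.group(1)


print(holder_directory("u0004_0002061_0000345_0003.tif"))  # u0004
```

Rejecting names that do not match keeps stray files from silently creating new top-level folders.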

2) By collection

In the second level (within the folders on the first level), folders will be named according to the second set of numbers in the filename: after the first underscore and preceding the second underscore. This is the number of the collection for the institution/grouping. In the file name u0004_0002061_0000345_0003.tif, "0002061" indicates collection number 2061, so the folder "0002061" will exist for this collection: /u0004/0002061/

  • note that metadata and images for collections that do not have items digitized will be stored here, and there will not be a 3rd or 4th level.

3) By item

In the third level (within the folders on the second level), folders will be named according to the third set of numbers in the filename: after the second underscore and preceding the third underscore. This is the number assigned to the item within the specified collection. In the file name u0004_0002061_0000345_0003.tif, "0000345" indicates item number 345 in this collection. So the folder /u0004/0002061/0000345 will contain all files relating to this item.

  • note that items that do not have pages will be stored here, and there will not be a 4th level.

4) By sequence for delivery

In the fourth level (within the folders on the third level), folders will be named according to the fourth set of numbers in the filename: after the third underscore, and preceding either the fourth underscore or the period and filename extension. This is the number assigned to the sequence of delivery for the files within the item. In the file name u0004_0002061_0000345_0003.tif, "0003" indicates the third image in a sequence, so the directory /u0004/0002061/0000345/0003/ will contain this TIFF and all information associated with it.

  • note that if there are subpages, such as in a scrapbook, there will be an additional level beneath this one, using the same reasoning.
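The four levels above amount to splitting the file name on underscores. The following is a minimal sketch of that mapping; the function name `storage_directory` and the `root` parameter are hypothetical, and the code assumes the extension is everything after the first period.

```python
from pathlib import PurePosixPath


def storage_directory(filename, root="/"):
    """Map a file name to its storage directory by the underscore rule."""
    # Drop the extension: everything after the first period, if any.
    stem = filename.split(".", 1)[0]
    # Each underscore-separated part becomes one directory level.
    parts = stem.split("_")
    if not all(parts):
        raise ValueError(f"empty component in file name: {filename!r}")
    return PurePosixPath(root, *parts)


print(storage_directory("u0004_0002061_0000345_0003.tif"))
# /u0004/0002061/0000345/0003
```

Because the function does not fix the number of parts, the same call covers a collection-level name such as u0004_0002061 and a name with an extra subpage component.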

OTHER SUBFOLDERS IN EACH OF THE ABOVE

Each of the levels above may contain any or all of the following folders:

A) documentation

  • Note: All documentation needs to be in Unicode or ASCII, as either XML or plain text.

B) text

  • Within the text folder, subfolders should be named "ocr" for OCR text and "transcribed" for transcribed text, in either case in ASCII or Unicode. All text should be stored either as XML or as plain text (.txt files).

This will enable us to distinguish text that may be of poor quality (uncorrected OCR) from text of better quality. If the text is OCR text that has been remediated, it should be stored as "transcribed".
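The rule for choosing between the two subfolders can be sketched as below. The function name `text_subfolder` and the `remediated` flag are hypothetical names introduced only for illustration.

```python
def text_subfolder(source, remediated=False):
    """Return the subfolder name ("ocr" or "transcribed") for a text file."""
    if source not in ("ocr", "transcribed"):
        raise ValueError(f"unknown text source: {source!r}")
    # Only uncorrected OCR output stays in "ocr"; remediated OCR text
    # is stored alongside transcriptions.
    if source == "ocr" and not remediated:
        return "ocr"
    return "transcribed"


print(text_subfolder("ocr"))                   # ocr
print(text_subfolder("ocr", remediated=True))  # transcribed
```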

C) metadata

  • If a METS file is available to organize the metadata and tag it with appropriate namespaces, that is ideal. This METS file should have xlinks to the archival-quality bitstreams. If a METS file is not available, subfolders within the metadata folder should be named according to the type of metadata they contain: type, followed by an underscore, followed by the version. The following shorthand is to be used:
    • qdc for qualified Dublin Core
    • udc for unqualified Dublin Core
    • mods for MODS (Metadata Object Description Schema)
    • mets for METS (Metadata Encoding and Transmission Standard)
    • tei for TEI (Text Encoding Initiative)
    • ead for EAD (Encoded Archival Description)

Thus, an example folder udc_1.1 can be expected to contain an unqualified Dublin Core record meeting the specifications of version 1.1. Likewise, a folder named mods_3.2 would contain a MODS version 3.2 metadata record.
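A short sketch of building and reading these folder names follows. The names `METADATA_TYPES`, `metadata_folder`, and `parse_metadata_folder` are hypothetical, and the version is treated as an opaque string because the examples include values such as "1.1" and "2002".

```python
# Shorthand codes listed above.
METADATA_TYPES = frozenset({"qdc", "udc", "mods", "mets", "tei", "ead"})


def metadata_folder(kind, version):
    """Build a folder name: type, underscore, version."""
    if kind not in METADATA_TYPES:
        raise ValueError(f"unknown metadata type: {kind!r}")
    return f"{kind}_{version}"


def parse_metadata_folder(name):
    """Split a folder name back into (type, version)."""
    # Split on the first underscore only, in case a version contains one.
    kind, sep, version = name.partition("_")
    if not sep or kind not in METADATA_TYPES or not version:
        raise ValueError(f"not a metadata folder name: {name!r}")
    return kind, version


print(metadata_folder("udc", "1.1"))      # udc_1.1
print(parse_metadata_folder("mods_3.2"))  # ('mods', '3.2')
```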

Note that the metadata record is stored at the appropriate level: an EAD would be stored at the collection level, as it is a collection level record. If the collection contains only one item, then it should be labeled item 1, and the metadata for the item would be in the item directory, to avoid confusion.

Thus, /u0004/0002061/metadata/ may contain mods_3.2, ead_2002, and udc_1.1 directories, each containing collection-level metadata about collection 2061. This describes the collection.

If collection 2061 currently includes only one item, the filename for that item should be u0004_0002061_0000001; the directory /u0004/0002061/0000001/metadata would contain the metadata about that item. If there is page-level metadata for that item, then the metadata for the first page would be stored in /u0004/0002061/0000001/0001/metadata, the metadata for the second page would be stored in /u0004/0002061/0000001/0002/metadata, and so forth.
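To illustrate how the metadata directory depends on the level being described, here is a minimal sketch. The function name `metadata_directory` is hypothetical, and the zero-padding widths (7 digits for collection and item, 4 for page) are assumptions inferred from the example paths rather than stated rules.

```python
from pathlib import PurePosixPath


def metadata_directory(holder, collection, item=None, page=None):
    """Return the metadata directory for a collection, item, or page."""
    if page is not None and item is None:
        raise ValueError("a page number requires an item number")
    # Assumed padding: 7 digits for collection and item, 4 for page.
    parts = [holder, f"{collection:07d}"]
    if item is not None:
        parts.append(f"{item:07d}")
    if page is not None:
        parts.append(f"{page:04d}")
    return PurePosixPath("/", *parts, "metadata")


print(metadata_directory("u0004", 2061))        # /u0004/0002061/metadata
print(metadata_directory("u0004", 2061, 1))     # /u0004/0002061/0000001/metadata
print(metadata_directory("u0004", 2061, 1, 2))  # /u0004/0002061/0000001/0002/metadata
```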

  • For metadata records of local profiles (for example, where needed fields are taken from different metadata schemas), a schema, DTD, or text/XML data dictionary is expected within a subsidiary "documentation" folder. The folder containing the metadata record and the "documentation" folder should be named according to the following system: "profile", followed by an underscore, followed by the 8-digit date (year, month, day), followed by an underscore, followed by the initials of the responsible party. For example: "profile_20080825_jld" would indicate the profile Jody Lynn DeRidder created on August 25, 2008. Note that the date is the date of the profile, not the date this record was created or stored.

Thus, within the /u0004/0002061/0000345/metadata/profile_20080825_jld/ folder, you would find an xml metadata record meeting a profile specified by text or xml information in /u0004/0002061/0000345/metadata/profile_20080825_jld/documentation/ -- and the metadata record would be for the item u0004_0002061_0000345.
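The profile folder name can be generated from a date and a set of initials, as in this sketch. The function name `profile_folder` is hypothetical, and lower-casing the initials is an assumption based on the single example given.

```python
from datetime import date


def profile_folder(profile_date, initials):
    """Build a folder name: "profile", 8-digit date, initials."""
    if not initials.isalpha():
        raise ValueError(f"initials must be letters only: {initials!r}")
    # The date is the date of the profile itself, formatted as YYYYMMDD.
    return f"profile_{profile_date:%Y%m%d}_{initials.lower()}"


print(profile_folder(date(2008, 8, 25), "jld"))  # profile_20080825_jld
```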

Notice that additional forms of metadata may be added (structural, administrative, and technical) within each metadata folder without confusion as to what the metadata is or what it is about. Hopefully, these can then be incorporated into METS files at some point, to simplify all this.
