Organization of completed content for long-term storage
As we expand our holdings to multiple collections, and content from different sources beyond Hoole, we need to organize files and folders in a systematic manner.
Based on the [Naming Schemes] we selected,
I propose the following organization for materials that we have already digitized:
Within a specified directory on a slow server (to reduce risk of damage and corruption by access):
1) By holding institution
The first level of organization is by the letter preceding all numbers in the filename. Hence, all digital content originating from holdings here at the university will be stored within the "u" directory. All digital content originating from holdings elsewhere will be stored under a directory named for the letter selected for that institution (as yet undetermined).
2) By group
In the second level of the directory structure (within the folders on the first level), folders will be named for the first set of numbers following the letter and preceding the first underscore, with leading zeros dropped. Since this number indicates a grouping and an institution, this ensures that all content is segregated by the holding organization; that is, all Hoole Rare Books content will be in folder "2" if 0002 is the number assigned to it. For example, a file labeled u0002_0002061_0000345_0003.tif will be found in a subdirectory under /u/2/ in the file system.
3) By collection
In the third level (within the folders on the 2nd level), folders will be named according to the 2nd set of numbers in the filename: after the first underscore and preceding the 2nd underscore. This is the number of the collection for the institution/grouping. In the file name u0002_0002061_0000345_0003.tif, "0002061" indicates collection number 2061, so the folder "2061" will exist for this collection: /u/2/2061/
- note that metadata and images for collections that do not have items digitized will be stored here, and there will not be a 4th level.
4) By item
In the 4th level (within the folders on the 3rd level), folders will be named according to the 3rd set of numbers in the filename: after the 2nd underscore and preceding the 3rd underscore. This is the number assigned to the item within the specified collection. In the file name u0002_0002061_0000345_0003.tif, "0000345" indicates item number 345 in this collection, so the folder /u/2/2061/345 will contain all files relating to this item.
- note that items that do not have pages will be stored here, and there will not be a 5th level.
5) By sequence for delivery
In the 5th level (within the folders on the 4th level), folders will be named according to the 4th set of numbers in the filename: after the 3rd underscore and preceding the 4th underscore or the period and filename extension. This is the number assigned to the sequence of delivery for the files within the item. In the file name u0002_0002061_0000345_0003.tif, "0003" indicates the 3rd image in a sequence, so the directory /u/2/2061/345/3/ will contain this TIFF and all information associated with it.
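The five levels above can be sketched as a small filename-to-directory mapping. This is only an illustration, not an agreed implementation: the function name storage_dir, the regular expression, and the root argument are assumptions, and it assumes folder names drop leading zeros as in the /u/2/ example. It does not handle anything that follows a 4th underscore.

```python
import re
from pathlib import PurePosixPath

# Assumed filename shape: one letter, the group number, then up to three
# underscore-separated numbers (collection, item, sequence) and an
# optional extension.
FILENAME_RE = re.compile(
    r"^(?P<letter>[a-z])(?P<group>\d+)"
    r"(?:_(?P<collection>\d+))?"
    r"(?:_(?P<item>\d+))?"
    r"(?:_(?P<sequence>\d+))?"
    r"(?:\.\w+)?$"
)


def storage_dir(filename: str, root: str = "/") -> PurePosixPath:
    """Return the directory for a filename, one folder per level present."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    path = PurePosixPath(root) / match["letter"]
    for level in ("group", "collection", "item", "sequence"):
        if match[level] is None:
            break  # collection-only or item-only names stop at their level
        path /= str(int(match[level]))  # int() drops the leading zeros
    return path


print(storage_dir("u0002_0002061_0000345_0003.tif"))  # /u/2/2061/345/3
print(storage_dir("u0002_0002061"))                   # /u/2/2061
```

A name with fewer parts resolves to the shallower directory, which matches the notes above about collections without digitized items and items without pages.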
OTHER SUBFOLDERS IN EACH OF THE ABOVE
Each of the levels above may contain any or all of the following folders (the image, text, and metadata folders are described below):
- Note: All documentation needs to be XML or plain text, encoded in Unicode or ASCII.
- Within the image folder, folders should be named by type followed by underscore, then version (replace periods with hyphens), followed by underscore, followed by DPI. For example, tiff_6-0_600dpi would contain a version 6.0 TIFF at 600 dpi.
Thus, the actual storage directory of u0002_0002061_0000345_0003.tif would be /u/2/2061/345/3/image/tiff_6-0_600dpi/
Rationale: this enables us to locate by script images of a certain type which may need to be reformatted before becoming obsolete.
- Should we begin storing a different preservation image type, version, or dpi, this will clarify which images are which.
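As a rough sketch of the image folder convention and of the "locate by script" rationale, the two helpers below build a folder name and search a storage tree for folders of a given type and version. The names image_folder_name and find_image_folders are hypothetical, and the search assumes image folders always sit directly under a folder named "image".

```python
from pathlib import Path


def image_folder_name(image_type: str, version: str, dpi: int) -> str:
    """Build type_version_dpi, with periods in the version replaced by hyphens."""
    return f"{image_type.lower()}_{version.replace('.', '-')}_{dpi}dpi"


def find_image_folders(root: str, image_type: str, version: str) -> list:
    """List every image folder of the given type and version below root."""
    prefix = f"{image_type.lower()}_{version.replace('.', '-')}"
    pattern = f"**/image/{prefix}_*dpi"
    return sorted(p for p in Path(root).glob(pattern) if p.is_dir())


print(image_folder_name("tiff", "6.0", 600))  # tiff_6-0_600dpi
```

For example, find_image_folders("/", "tiff", "6.0") would return every version 6.0 TIFF folder at any DPI, which is the set a reformatting script would need to visit.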
- Within the text folder, subfolders should be named "ocr" for OCR text in ASCII or Unicode, and "transcribed" for transcribed text in ASCII or Unicode. All text should be stored either as XML or plain text (.txt files).
This will enable us to distinguish text that may be of poor quality (raw OCR) from text of better quality. If the text is OCR text which has been remediated, it should be stored as "transcribed".
- Within the metadata folder, subfolders should be named according to the type of metadata they contain: type followed by underscore, followed by version (replace periods with hyphens). The following shorthand is to be used:
- qdc for qualified Dublin Core
- udc for unqualified Dublin Core
- mods for MODS (Metadata Object Description Schema)
- mets for METS (Metadata Encoding and Transmission Standard)
- tei for TEI (Text Encoding Initiative)
- ead for EAD (Encoded Archival Description)
Thus, an example folder udc_1-1 can be expected to contain an unqualified Dublin Core record meeting the specifications of version 1.1. Likewise, a folder named mods_3-2 would contain a MODS version 3.2 metadata record.
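The metadata folder naming can be sketched the same way. The lookup table below restates the shorthand list above; the function name metadata_folder_name and the choice to accept either a full name or a shorthand are assumptions made for illustration.

```python
# Shorthand from the list above, keyed by a lowercase name for each schema.
METADATA_SHORTHAND = {
    "qualified dublin core": "qdc",
    "unqualified dublin core": "udc",
    "mods": "mods",
    "mets": "mets",
    "tei": "tei",
    "ead": "ead",
}


def metadata_folder_name(metadata_type: str, version: str) -> str:
    """Build shorthand_version, with periods in the version replaced by hyphens."""
    key = metadata_type.strip().lower()
    if key in METADATA_SHORTHAND:
        short = METADATA_SHORTHAND[key]
    elif key in METADATA_SHORTHAND.values():
        short = key  # caller already passed a shorthand such as "udc"
    else:
        raise ValueError(f"no shorthand defined for {metadata_type!r}")
    return f"{short}_{version.replace('.', '-')}"


print(metadata_folder_name("unqualified Dublin Core", "1.1"))  # udc_1-1
print(metadata_folder_name("MODS", "3.2"))                     # mods_3-2
```

Rejecting unknown types, rather than inventing a folder name for them, keeps folder names limited to the agreed shorthand.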
Note that the metadata record is stored at the appropriate level: an EAD would be stored at the collection level, as it is a collection level record. If the collection contains only one item, then it should be labeled item 1, and the metadata for the item would be in the item directory, to avoid confusion.
Thus, /u/2/2061/metadata/ may contain mods_3-2, ead_2002, and udc_1-1 directories, each containing collection-level metadata describing collection 2061.
If collection 2061 currently includes only one item, the filename for that item should be u0002_0002061_0000001; the directory /u/2/2061/1/metadata would contain the metadata about that item. If there is page-level metadata for that item, then the metadata for the first page would be stored in /u/2/2061/1/1/metadata, the metadata for the 2nd page would be stored in /u/2/2061/1/2/metadata, and so forth.
- For metadata records of local profiles (for example, where needed fields are taken from different metadata schemas), a schema, DTD, or text/XML data dictionary is expected within a subsidiary "documentation" folder. The folder containing the metadata record and the "documentation" folder should be named according to the following system: "profile" followed by underscore, followed by the 8-digit date (year, month, day sequence), followed by underscore, followed by the initials of the responsible party. For example: "profile_20080825_jld" would indicate the profile Jody Lynn DeRidder created on August 25, 2008. Note that the date is the date of the profile, not the date this record was created or stored.
Thus, within the /u/2/2061/345/metadata/profile_20080825_jld/ folder, you would find an XML metadata record meeting a profile specified by text or XML information in /u/2/2061/345/metadata/profile_20080825_jld/documentation/. The metadata record would be for the item u0002_0002061_0000345.
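A minimal sketch of the profile folder naming follows. The function name profile_folder_name is hypothetical, and lowercasing the initials is an assumption based on the "jld" example.

```python
from datetime import date
from pathlib import PurePosixPath


def profile_folder_name(profile_date: date, initials: str) -> str:
    """Build profile_YYYYMMDD_initials from the date of the profile itself."""
    return f"profile_{profile_date:%Y%m%d}_{initials.lower()}"


# The item-level metadata folder from the example above.
metadata_dir = PurePosixPath("/u/2/2061/345/metadata")
profile_dir = metadata_dir / profile_folder_name(date(2008, 8, 25), "JLD")

print(profile_dir)                    # /u/2/2061/345/metadata/profile_20080825_jld
print(profile_dir / "documentation")  # where the schema or data dictionary lives
```

The date passed in should be the date of the profile, not the date the record was created or stored.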
Notice that additional forms of metadata may be added (structural, administrative, and technical) within each metadata folder without confusion as to what the metadata is or what it is about.