Organization of completed content for long-term storage
As we expand our holdings to multiple collections, and to content from sources beyond Hoole, we need to organize files and folders in a systematic manner. The following solution follows a simple rule: replace each underscore in a file name with a forward slash, and drop the extension, to determine the appropriate directory location for the file.
Based on the File Naming Schemes we selected, content is organized as follows within a specified directory on a slow server (to reduce the risk of damage and corruption through access):
1) By holding institution and subsidiary group
The first letter in a filename indicates the origin of the content. All digital content starting with "u" originated from holdings within the University of Alabama, whereas other letters indicate origins elsewhere. Following the initial letter is a series of 4 digits indicating a grouping and an institution. Using these 5 characters to name the folders on the 2nd level (directly beneath the specified storage directory) ensures that all content is segregated by the holding organization; that is, all Hoole Rare Books content will be in folder "u0004", as that is the identifier assigned to it. For example, a file labeled u0004_0002061_0000345_0003.tif will be found in a subdirectory under /u0004/ in the file system.
2) By collection
In the 3rd level (within the folders on the 2nd level), folders will be named according to the 2nd set of numbers in the filename: after the first underscore and preceding the 2nd underscore. This is the number of the collection for the institution/grouping. In the file name u0004_0002061_0000345_0003.tif, "0002061" indicates collection number 2061, so the folder "0002061" will exist for this collection: /u0004/0002061/
- note that metadata and images for collections that do not have items digitized will be stored here, and there will be no item-level or sequence-level folders beneath it.
3) By item
In the 4th level (within the folders on the 3rd level), folders will be named according to the 3rd set of numbers in the filename: after the 2nd underscore, and preceding the 3rd underscore. This is the number assigned to the item within the specified collection. In the file name u0004_0002061_0000345_0003.tif, "0000345" is the number for this item (item 345) in this collection. So the folder /u0004/0002061/0000345/ will contain all files relating to this item.
- note that items that do not have pages will be stored here, and there will be no sequence-level folders (or anything below them) beneath it.
4) By sequence for delivery
In the 5th level (within the folders on the 4th level), folders will be named according to the 4th set of numbers in the filename: after the 3rd underscore, and preceding either the 4th underscore or the period and filename extension. This is the number assigned to the sequence of delivery for the files within the item. In the file name u0004_0002061_0000345_0003.tif, "0003" indicates the 3rd image in a sequence, so the directory /u0004/0002061/0000345/0003/ will contain this TIFF file and all information associated with it.
- note that if there are subpages, such as in a scrapbook, there will be an additional level beneath this one, using the same reasoning.
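As a minimal sketch of the rule described above (each underscore becomes a directory separator and the extension is dropped), the following Python derives the storage directory from a filename. The function name storage_dir and the root argument are hypothetical, used only for illustration; root stands in for the specified directory on the server, whose actual path is not given here.

```python
from pathlib import PurePosixPath


def storage_dir(filename: str, root: str = "/") -> PurePosixPath:
    """Return the directory implied by a filename under the naming scheme.

    `root` is a placeholder for the storage directory (an assumption).
    """
    # Drop everything from the first period on, e.g. ".tif" or ".ocr.txt".
    stem = filename.split(".", 1)[0]
    # Each underscore-separated part becomes one directory level:
    # institution, collection, item, sequence, and any subpage level.
    return PurePosixPath(root, *stem.split("_"))


print(storage_dir("u0004_0002061_0000345_0003.tif"))
# /u0004/0002061/0000345/0003
```

A collection-level or item-level identifier simply yields a shorter path, and a subpage number yields one additional level, consistent with the notes above.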
OTHER SUBFOLDERS IN EACH OF THE ABOVE
Each of the levels above may contain any or all of the following folders:
- Note: All documentation needs to be in XML or plain text, encoded as Unicode or ASCII.
This directory holds administrative information.
- Within the Transcripts folder, files should be named for the item transcribed and should be in ASCII text or UTF-8 text. If the file is OCR, include .ocr before the .txt extension. For example, u0003_0000580_0000002.ocr.txt would be the OCR for item 2 in the Hoole Manuscripts (u0003) MSS 580 collection (0000580). Transcripts should be stored at the level to which they belong, just like metadata -- that is, if this is the transcript for item 2, it should be in item 2's directory, in a subdirectory named Transcripts.
We have since decided that OCR text is not archival content, since it can be generated by script, and since OCR software keeps improving. If we correct the OCR by hand (changing the extension from ".ocr.txt" to simply ".txt" and retaining it in the Transcripts directory at the applicable level), then the human-corrected transcript is worth archiving and will be kept. The uncorrected OCR, however, will be kept only in the delivery system.
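The distinction between raw OCR and human-corrected transcripts can be checked from the extension alone. The following is a rough sketch of such a check; the function name is_archival_transcript is hypothetical and is not part of any existing script.

```python
def is_archival_transcript(filename: str) -> bool:
    """True for human-corrected transcripts (".txt"), False for raw OCR (".ocr.txt")."""
    name = filename.lower()
    # Raw OCR carries ".ocr" before the ".txt" extension and is not archived.
    return name.endswith(".txt") and not name.endswith(".ocr.txt")


print(is_archival_transcript("u0003_0000580_0000002.ocr.txt"))  # False
print(is_archival_transcript("u0003_0000580_0000002.txt"))      # True
```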
- Electronic Theses and Dissertations, and Undergraduate Research Papers, are often submitted with supplemental files. These aren't component files, although they are named as if they are "pages" within the item. Since, however, they are supplemental files for the item itself, the Supplemental directory normally resides at the item level, and these "page" numbered files are deposited within that directory. This provides the delivery system with additional information to inform rendering, and it also provides clarity for those reconstructing our content as to what files are supplemental to what level of content.
- If a METS file is available to organize the metadata and tag it with appropriate namespaces, that is ideal. This METS file should have xlinks to the archival quality bitstreams. Within the metadata folder, filenames should be adapted according to the type of metadata they contain: type followed by underscore, followed by version. The following shorthand is to be used:
- qdc for qualified Dublin Core
- udc for unqualified Dublin Core
- mods for MODS (Metadata Object Description Schema)
- mets for METS (Metadata Encoding and Transmission Standard)
- tei for TEI (Text Encoding Initiative)
- ead for EAD (Encoded Archival Description)
Versions of metadata will be tracked in the database (saved regularly as a flat file in the top set of directories).
Note that the metadata record is stored at the appropriate level: an EAD would be stored at the collection level, as it is a collection level record.
Thus, /u0004/0002061/Metadata/ may contain u0004_0002061.mods.xml and u0004_0002061.ead.xml and u0004_0002061.dc.xml, each containing collection-level metadata about collection 2061.
If collection 2061 currently includes only one item, the filename for that item should be u0004_0002061_0000001; the directory /u0004/0002061/0000001/Metadata would contain the metadata about that item. If there is page-level metadata for that item, then the metadata for the first page would be stored in /u0004/0002061/0000001/0001/Metadata, the metadata for the 2nd page would be stored in /u0004/0002061/0000001/0002/Metadata, and so forth (each would be named appropriately).
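To illustrate how a metadata record's location follows from the identifier of the thing it describes, here is a short sketch. It assumes the identifier.type.xml pattern seen in the collection-level example above; the function name metadata_path and the use of "/" as the root are illustrative assumptions only.

```python
from pathlib import PurePosixPath


def metadata_path(identifier: str, schema: str) -> PurePosixPath:
    """Return the assumed location of a metadata record.

    `identifier` is a collection, item, or page identifier such as
    "u0004_0002061"; `schema` is a shorthand such as "ead" or "mods".
    """
    # The record lives in a Metadata folder at the level it describes.
    directory = PurePosixPath("/", *identifier.split("_"), "Metadata")
    return directory / f"{identifier}.{schema}.xml"


print(metadata_path("u0004_0002061", "ead"))
# /u0004/0002061/Metadata/u0004_0002061.ead.xml
print(metadata_path("u0004_0002061_0000001_0002", "mods"))
# /u0004/0002061/0000001/0002/Metadata/u0004_0002061_0000001_0002.mods.xml
```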
- For metadata records of local profiles (for example, where needed fields are taken from different metadata schemas), a schema, DTD, or text/XML data dictionary is expected within a subsidiary "Documentation" folder. The folder containing the metadata record and the "Documentation" folder should be named according to the following system: "profile" followed by underscore, followed by the 8-digit date (year month day sequence), followed by underscore, followed by the initials of the responsible party. For example: "profile_20080825_jld" would indicate the profile Jody Lynn DeRidder created on August 25, 2008. Note that the date is the date of the profile, not the date this record was created or stored.
Thus, within the /u0004/0002061/0000345/Metadata/profile_20080825_jld/ folder, you would find an XML metadata record meeting a profile specified by text or XML information in /u0004/0002061/0000345/Metadata/profile_20080825_jld/Documentation/ -- and the metadata record would be for the item u0004_0002061_0000345.
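The profile folder name can be assembled mechanically from the profile date and the initials. A minimal sketch follows; the function name profile_folder is hypothetical, and lowercasing the initials is an assumption based on the "jld" example.

```python
from datetime import date


def profile_folder(profile_date: date, initials: str) -> str:
    """Build "profile_<YYYYMMDD>_<initials>" for a local metadata profile."""
    # Use the date of the profile itself, not the date the record was stored.
    return f"profile_{profile_date:%Y%m%d}_{initials.lower()}"


print(profile_folder(date(2008, 8, 25), "JLD"))
# profile_20080825_jld
```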