File naming schemes
We set out to devise a file naming scheme that would scale to manage digital objects of various types, from our own holdings as well as from elsewhere on campus and from other institutions. We needed a way to group files that is reasonably intuitive, supportable, and extensible, and that supports the needs of management and delivery systems.
While our current digitized items come primarily from content housed at Hoole Special Collections, we recognize that Digital Services may soon be digitizing and/or managing content from across campus and from a variety of other institutions.
While the first few uses of character strings are very helpful, after a time we are forced to choose nonsensical character strings to delineate an object. (How many George Smiths are there?) And we realize that not every institution, organization, or digital item creator is going to organize things to the level we do, or in the same ways.
So we hope that whatever system we encounter can be mapped to this one; database tables and spreadsheets must exist to describe or point to the actual analog item, its location, its collection, its holder, and more. The filename cannot be expected to hold sufficient information to please everyone or every use. Its primary purpose is to serve as a unique identifying value for the file, and to facilitate file management and online delivery.
Additional considerations are the [] of the filename, characters which cause problems for software (such as \/:*?"<>|); and consistency (both for parsing and handling, and also for visually spotting errors). In addition, the "id" attribute values that are valid for xml need to be considered (begin with a letter, and more: []); and we need sufficient characters to provide for identification within the realm of the content provider.
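As a rough illustration of two of these considerations, the sketch below flags the problem characters listed above and checks that a name begins with a letter. The function name `check_name` is our own invention for this example, and the check covers only the one XML "id" rule stated here, not the full XML naming rules.

```python
import string

# Characters the text lists as causing problems for software.
PROBLEM_CHARS = set('\\/:*?"<>|')


def check_name(name: str) -> list[str]:
    """Return a list of problems found in a candidate filename (empty if none)."""
    problems = []
    bad = sorted(set(name) & PROBLEM_CHARS)
    if bad:
        problems.append("contains problem characters: " + " ".join(bad))
    # Only the "begins with a letter" rule is checked; other XML id rules are not.
    if not name or name[0] not in string.ascii_letters:
        problems.append("does not begin with a letter")
    return problems


print(check_name("u0003_0002061_0000345_0001.tif"))
print(check_name("0003:2061"))
```

A name that passes returns an empty list, which makes the function easy to use in a batch check before files are moved into storage.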
Therefore, we will begin each file name with a "u" for university holdings or a "p" for patron holdings, and reserve the other letters for other types of institutions.
- The first 4 digits will be used to identify the holding group or area at the university. As such, there will be several for Hoole. For example, u0001 will be for all image collections, and u0003 for all manuscript collections.
- The second set of 7 digits will identify the particular collection within the holding area. We chose 7 digits since this will hold the current collection numbers being used for image collections by Hoole (year processed followed by a 3-digit number denoting sequence of processing). For other collections whose numbering system will fit within this, we will use the existing numbering system for that holding area. For example, in Manuscripts, MS 2504 will become collection u0003_0002504.
- The third set of numbers will identify the item within the collection.
- The 4th set of numbers, if used, will identify the sequence for delivery on the page level.
- The 5th set of numbers, if used, will identify the sequence for delivery on the sub-page level (for example, closeups of photos on a scrapbook page).
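The segment layout in the list above can be sketched as a small builder function. This is only an illustration: the names `build_name` and `_segment` are hypothetical, and the 7-digit item width and 4-digit sequence width are taken from the sample filenames given later on this page rather than from the list itself. The sub-page level is left out because its width is not stated.

```python
import string


def _segment(value: int, width: int) -> str:
    """Zero-pad a non-negative integer, refusing values that overflow the width."""
    text = f"{value:0{width}d}"
    if value < 0 or len(text) > width:
        raise ValueError(f"{value} does not fit in {width} digits")
    return text


def build_name(prefix, holding, collection, item=None, sequence=None, ext=""):
    """Assemble a filename such as u0001_2009170_0000001_0001.tif."""
    if len(prefix) != 1 or prefix not in string.ascii_lowercase:
        raise ValueError("prefix must be a single lowercase letter such as 'u' or 'p'")
    if sequence is not None and item is None:
        raise ValueError("a sequence number requires an item number")
    # Holder letter and 4-digit holding group are joined without a separator.
    parts = [prefix + _segment(holding, 4), _segment(collection, 7)]
    if item is not None:
        parts.append(_segment(item, 7))      # assumed width, from later samples
    if sequence is not None:
        parts.append(_segment(sequence, 4))  # assumed width, from later samples
    name = "_".join(parts)
    return f"{name}.{ext}" if ext else name


print(build_name("u", 3, 2061, ext="xml"))
print(build_name("u", 1, 2009170, 1, 1, "tif"))
```

Rejecting values that overflow a segment, instead of silently widening it, keeps every filename the same shape, which is what makes later parsing and visual error-spotting practical.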
Here is a set of sample file numbers, the analysis of which will follow:
"u" indicates that this object resides at the University of Alabama or was created under our university auspices. Some other character (as yet undetermined) will delineate other holdings. We have selected "p" for patron holdings. Longer or more complex systems were discarded as being difficult and problematic after the first few assigned values.
After the "u" comes a 4-digit value to indicate a category which is spelled out in the database. At Hoole, it makes sense to separate Rare Books from Manuscripts from Archives from Photographs -- and more. Each of these is a "superset" of potential collections. In this example, we are going to suppose that 0003 was assigned to Hoole Special Collections Manuscripts.
The second set of numbers (in this case, "0002061") is the collection number assigned by the holding group indicated in the first set of numbers. Since this is from Manuscripts, this is MS 2061. Any number with 7 or fewer digits may be assigned, though we recommend auto-incrementing so it's clear which numbers are not yet in use, if indeed the current system does not map directly to this one. (If the file being labeled is the EAD (finding aid) for collection MS 2061, the file naming stops here: u0003_0002061.xml.)
The third set of numbers is the item number. In this case, item # 345 is the 345th item selected for digitization; it may NOT be the 345th item encountered in the analog collection. Identification depends on information in the database or spreadsheet.
The 4th set of numbers is the sequence number. In the set of numbers above, there are 3 pages for this document, numbered sequentially. These numbers may NOT match the page numbers as they appear on the page, but are necessary for sequential delivery.
Also in the set of numbers above is a file that ends in "0000.xml" -- this is the metadata record, of whatever type. If there are multiple kinds of metadata for this item (e.g., MODS, METS, DC, EAD, TEI), then they must be kept in separate directories to avoid overwriting, and those directories should be named appropriately. If page-level metadata is kept in a separate file, it would end with the sequence number of the page to which it applies; that is, metadata for sequence page 1 ends in 0001.xml instead of 0000.xml.
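The metadata naming rule can be sketched as follows. The function name `metadata_path` and the choice to name each directory after the lowercase metadata type (for example "mods") are assumptions made for this example only; the text says just that the directories should be named appropriately.

```python
from pathlib import PurePosixPath
from typing import Optional


def metadata_path(item_stem: str, kind: str, page: Optional[int] = None) -> PurePosixPath:
    """Return the path of a metadata record for an item or for one of its pages.

    item_stem is the filename up to and including the item number,
    e.g. "u0003_0002061_0000345". Item-level records end in 0000;
    page-level records end in the page's sequence number.
    """
    if page is not None and not 1 <= page <= 9999:
        raise ValueError("page sequence must be between 1 and 9999")
    sequence = 0 if page is None else page
    # One directory per metadata type, so records of different types
    # with the same filename cannot overwrite each other.
    return PurePosixPath(kind.lower()) / f"{item_stem}_{sequence:04d}.xml"


print(metadata_path("u0003_0002061_0000345", "MODS"))
print(metadata_path("u0003_0002061_0000345", "METS", page=1))
```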
Thus, the set of filenames above is for a 3-page document known as the 345th item selected for digitization from MS 2061 in the UA Hoole Special Collections Manuscripts division. There are 3 tiff images, one per page, numbered sequentially, and one metadata record for the item (which consists of 3 pages).
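Reading a filename back into its segments, as in the analysis above, might look like the sketch below. The pattern `NAME_RE` and the function `parse_name` are hypothetical, the segment widths (4, 7, 7, 4) are taken from the examples on this page, and the sub-page level is not handled. The filename used in the demonstration is one we constructed to match the description of page 2 of item 345 in MS 2061.

```python
import re
from typing import Optional

# Holder letter + 4-digit holding, 7-digit collection,
# optional 7-digit item, optional 4-digit sequence, optional extension.
NAME_RE = re.compile(
    r"(?P<prefix>[a-z])(?P<holding>\d{4})"
    r"_(?P<collection>\d{7})"
    r"(?:_(?P<item>\d{7})(?:_(?P<sequence>\d{4}))?)?"
    r"(?:\.(?P<ext>\w+))?"
)


def parse_name(name: str) -> Optional[dict]:
    """Split a filename into its segments, or return None if it does not fit."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric segments to integers; absent segments stay None.
    for key in ("holding", "collection", "item", "sequence"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


print(parse_name("u0003_0002061_0000345_0002.tif"))
print(parse_name("u0003_0002061.xml"))
```

A name that does not match the expected widths returns None, which gives a cheap way to catch malformed names before they reach the database.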
The filenames above represent a page from a scrapbook, with closeup shots of each of the 2 photos on the 3rd page. The scrapbook is the 5th item in collection MS 2324.
This one (above) is Rare books, which has no subcollections --
This (above) is a photograph, only one, so no need for pages --
- n0234_0000022_0000079_0000.xml OR n0234_0000022_0000079.xml
These above represent a 2-image document and metadata from some other institution.
- Remember that the first set of numbers is the holder ID;
- the second set of numbers is the collection ID;
- the third set of numbers is the item ID;
- the 4th set of numbers is the sequence for delivery if there are pages.
So, if the single-item collection number is u0001_2009170 (the 170th image collection processed in 2009),
then the first and only item in the collection would be u0001_2009170_0000001.
If the item has pages, such as a scrapbook, then the first page would be u0001_2009170_0000001_0001.tif.
In this manner, we retain consistency in how we use each segment of information in the filename; this consistency allows us to automate much of our file management.
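As one small example of the kind of automation this allows, the sketch below groups page-level files by item and puts them in delivery order. The function name `group_by_item` and the sample filenames are our own, made up to match the scheme; files without a sequence segment (such as a collection-level EAD) are skipped.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_item(filenames):
    """Map each item ID to its files, sorted by sequence number."""
    groups = defaultdict(list)
    for name in filenames:
        segments = PurePosixPath(name).stem.split("_")
        if len(segments) < 4:
            continue  # collection- or item-level file with no sequence segment
        item_id = "_".join(segments[:3])  # holder+holding, collection, item
        groups[item_id].append(name)
    # Zero-padded segments mean a plain string sort is also a numeric sort.
    return {item: sorted(names) for item, names in sorted(groups.items())}


files = [
    "u0003_0002061_0000345_0002.tif",
    "u0003_0002061_0000345_0000.xml",
    "u0003_0002061_0000345_0001.tif",
    "u0003_0002061.xml",
    "u0003_0002061_0000346_0001.tif",
]
for item, names in group_by_item(files).items():
    print(item, names)
```

Because every segment is zero-padded to a fixed width, no number parsing is needed to sort pages, which is one practical payoff of using each segment consistently.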
Click here for further discussion of Other Versions of an image, such as the OCR tiff file.