File naming schemes
We set out to devise a file naming scheme that scales to digital objects of various types, from our own holdings as well as from elsewhere on campus and from other institutions. We needed a way to group files that is reasonably intuitive, supportable, and extensible, and that supports the needs of management and delivery systems.
While our current digitized items come primarily from content housed at Hoole Special Collections, we recognize that Digital Services may soon be digitizing and/or managing content from across campus and from a variety of other institutions.
While we are certain that this is far from perfect, and while we are still not in total agreement, the current recommendation is the one that follows. Please think about applications and uses to see if there are likely situations where this won't work; and please think about what would be an improvement. This is not yet set in stone. Comments and suggestions are welcomed, preferably within 2 weeks (by August 19, 2008). By then we will need to be training students to scan, so we need this in place.
While the first few uses of character strings are very helpful, after a time we are forced to choose nonsensical character strings to identify an object. (How many George Smiths are there?) We also realize that not every institution, organization, or digital item creator is going to organize things to the level we do, or in the same ways.
We therefore hope that whatever system we encounter can be mapped to this one. Database tables and spreadsheets must exist to describe or point to the actual analog item, its location, its collection, its holder, and more. The filename cannot be expected to hold sufficient information to please everyone or serve every use. Its primary purpose is to serve as a unique identifying value for the file and to facilitate file management and online delivery.
Additional considerations are the [] of the filename; characters which cause problems for software (such as \/:*?"<>|); and consistency (both for parsing and handling, and for visually spotting errors). In addition, the "id" attribute values that are valid for XML need to be considered (they must begin with a letter, and more: []), and we need sufficient characters to provide for identification within the realm of the content provider.
This last consideration set the length of the "collection" value to at least 7 digits, as our Photograph collections require that; and within those collections we already have numbers of up to 6 digits in each one.
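To make the character and XML "id" considerations concrete, here is a minimal Python sketch of a check that could be run over candidate filename stems. It is only an illustration: the names PROBLEM_CHARS and filename_problems are hypothetical, and the check covers only the problem characters listed above, the begins-with-a-letter rule, and whitespace (which XML names do not allow). It does not test the remaining XML rules.

```python
import re

# The characters listed above as causing problems for software.
PROBLEM_CHARS = set('\\/:*?"<>|')


def filename_problems(stem: str) -> list[str]:
    """Return human-readable problems found in a filename stem (no extension).

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these three checks.
    """
    problems = []
    bad = sorted(PROBLEM_CHARS & set(stem))
    if bad:
        problems.append("contains problem characters: " + " ".join(bad))
    # The scheme relies on the stem being usable as an XML "id" value,
    # so it has to begin with a letter.
    if not re.match(r"[A-Za-z]", stem):
        problems.append("does not begin with a letter")
    # XML names cannot contain whitespace.
    if re.search(r"\s", stem):
        problems.append("contains whitespace")
    return problems


if __name__ == "__main__":
    for candidate in ["u0003_0002061", "0003_0002061", "u0003:0002061"]:
        print(candidate, filename_problems(candidate))
```

A check like this is most useful at the point where students assign names, since errors caught then do not propagate into the database or delivery system.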
Here is a set of sample filenames, the analysis of which follows:
"u" indicates that this object resides at the University of Alabama or was created under our university auspices. Some other character (as yet undetermined) will designate other holders. (I recommend "n" below, as the visual opposite of "u"; it can also be construed to mean "not ours"!) Longer or more complex systems were discarded as being difficult and problematic after the first few assigned values.
After the "u" comes a 4-digit value to indicate a category which is spelled out in the database. At Hoole, it makes sense to separate Rare Books from Manuscripts from Archives from Photographs -- and more. Each of these is a "superset" of potential collections. In this example, we are going to suppose that 0003 was assigned to Hoole Special Collections Manuscripts.
The second set of numbers (in this case, "0002061") is the collection number assigned by the folks indicated in the first set of numbers. Since this is from Manuscripts, this is MS 2061. Any number with 7 or fewer digits may be assigned, though if the current system does not map directly to this one, we recommend auto-incrementing so it's clear which numbers are not yet in use. (If the file being labeled is the EAD (finding aid) for collection MS 2061, the file naming stops here: u0003_0002061.xml.)
The third set of numbers is the item number. In this case, item # 345 is the 345th item selected for digitization; it may NOT be the 345th item encountered in the analog collection. Identification depends on information in the database or spreadsheet.
The 4th set of numbers is the sequence number. In the set of numbers above, there are 3 pages for this document, numbered sequentially. These numbers may NOT match the page numbers that appear on the pages themselves, but are necessary for sequential delivery.
Also in the set of numbers above is a file that ends in "0000.xml" -- this is the metadata record, of whatever type. If there are multiple kinds of metadata for this item (e.g., MODS, METS, DC, EAD, TEI), then they must be kept in separate directories to avoid overwriting one another. These directories should be named appropriately. If page-level metadata is kept in a separate file, its name ends with the sequence number of the page to which it applies; that is, metadata for sequence page 1 ends in 0001.xml instead of 0000.xml.
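The metadata naming rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed tool: metadata_path is a hypothetical helper, the directory names ("mods", "dc") are placeholders for whatever "appropriate" names are chosen, and the sample image name uses an assumed 7-digit item number.

```python
from pathlib import PurePosixPath


def metadata_path(image_name: str, metadata_type: str,
                  page_level: bool = False) -> PurePosixPath:
    """Derive the metadata file path that goes with an image file.

    Hypothetical helper. Item-level records replace the sequence number
    with zeros (0000.xml); page-level records keep the page's sequence
    number. Each metadata type gets its own directory so that records of
    different types cannot overwrite one another.
    """
    stem = PurePosixPath(image_name).stem
    base, _, sequence = stem.rpartition("_")
    if not page_level:
        sequence = "0" * len(sequence)
    return PurePosixPath(metadata_type) / f"{base}_{sequence}.xml"


if __name__ == "__main__":
    page = "u0003_0002061_0000345_0002.tif"  # assumed sample name
    print(metadata_path(page, "mods"))                 # item-level record
    print(metadata_path(page, "dc", page_level=True))  # page-level record
```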
Thus, the set of filenames above is for a 3-page document known as the 345th item selected for digitization from MS 2061 in the UA Hoole Special Collections Manuscripts division. There are 3 tiff images, one per page, numbered sequentially, and one metadata record for the item (which consists of 3 pages).
The filenames above are for the 3rd page of a scrapbook, plus close-up shots of each of the 2 photos on that page. The scrapbook is the 5th item in collection MS 2324.
This one (above) is from Rare Books, which has no subcollections.
This one (above) is a single photograph, so there is no need for pages.
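As a rough illustration of the breakdown just described, here is a minimal Python sketch that splits a filename into its parts. It assumes underscore separators as in u0003_0002061.xml, a single lowercase letter for the holder, a 4-digit category, a 7-digit collection, and a 4-digit sequence number as in 0000.xml. The width of the item number is not fixed by the text, so any run of digits is accepted. The names ParsedName, PATTERN, and parse_name are hypothetical, and the sample filename is built from the values in the walkthrough with an assumed 7-digit item number.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    holder: str              # "u" for University of Alabama
    category: str            # 4 digits, e.g. "0003" (spelled out in the database)
    collection: str          # 7 digits, e.g. "0002061" for MS 2061
    item: Optional[str]      # absent for collection-level files such as the EAD
    sequence: Optional[str]  # "0000" marks the item-level metadata record
    extension: str


# Assumed layout: holder + category, then collection, then optional item
# and optional sequence, separated by underscores.
PATTERN = re.compile(
    r"(?P<holder>[a-z])(?P<category>\d{4})"
    r"_(?P<collection>\d{7})"
    r"(?:_(?P<item>\d+)(?:_(?P<sequence>\d{4}))?)?"
    r"\.(?P<extension>[A-Za-z0-9]+)"
)


def parse_name(filename: str) -> ParsedName:
    """Split a filename into its components, or raise ValueError."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised filename: {filename!r}")
    return ParsedName(**match.groupdict())


if __name__ == "__main__":
    page = parse_name("u0003_0002061_0000345_0002.tif")  # assumed sample
    print(page)
    print("collection number:", int(page.collection))
    print(parse_name("u0003_0002061.xml"))  # EAD for the whole collection
```

Because the filename carries only identifiers, the parsed values are meant to be looked up in the database or spreadsheet; the sketch makes no attempt to interpret them itself.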
These above represent a 2-image document and metadata from some other institution.
See the separate discussion of Other Versions of an image, such as the OCR tiff file.
Content with multiple streams to be delivered simultaneously
Examples include:
- a) audio with music scores,
- b) side-by-side images,
- c) image tiling, and
- d) 3D delivery.
For reference in the following examples, the mathematical x, y, and z axes are used. Lowest x numbers are at the left, highest x numbers at the right; lowest y values are at the bottom, highest y values are at the top. Lowest z values are on the 2-dimensional plane; highest z values are the farthest away from that plane.
For (a): different formats, delivered simultaneously
Use the same file name with different extensions.
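A delivery system could find the streams that belong together simply by grouping on the file name without its extension. The following Python sketch shows the idea; group_streams is a hypothetical helper, and the sample names and extensions (mp3, tif) are assumptions chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_streams(filenames):
    """Group files that share a name and differ only in extension.

    Hypothetical helper: returns {stem: sorted list of extensions}.
    """
    groups = defaultdict(list)
    for name in filenames:
        path = PurePosixPath(name)
        groups[path.stem].append(path.suffix.lstrip("."))
    return {stem: sorted(extensions) for stem, extensions in groups.items()}


if __name__ == "__main__":
    files = [
        "u0003_0002061_0000345_0001.tif",  # assumed: image of the score
        "u0003_0002061_0000345_0001.mp3",  # assumed: matching audio
        "u0003_0002061_0000345_0002.tif",
    ]
    for stem, extensions in group_streams(files).items():
        print(stem, extensions)
```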
For (b): multiple images to deliver simultaneously, in row or column format
Add x or y values to the filename to indicate relative delivery location. For example:
The image tagged "1x" will be the one on the left; "3x" will be on the far right, and "2x" in the middle. If the delivery for the images is to be in a column, instead of a row, use y values instead.
The image tagged "1y" will be the one on the bottom; "3y" will be on the top, and "2y" in the middle.
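The row and column rules can be expressed as a small ordering function. The Python sketch below works on the position tags alone ("1x", "2y", and so on) and deliberately makes no assumption about where the tag sits inside the filename. The names TAG and display_order are hypothetical.

```python
import re

TAG = re.compile(r"(\d+)([xy])")


def display_order(tags):
    """Order position tags for display.

    x tags are returned left to right (lowest x first); y tags are
    returned top to bottom (highest y first, since low y is at the bottom).
    """
    parsed = []
    for tag in tags:
        match = TAG.fullmatch(tag)
        if match is None:
            raise ValueError(f"not a row or column tag: {tag!r}")
        parsed.append((int(match.group(1)), match.group(2), tag))
    axes = {axis for _, axis, _ in parsed}
    if len(axes) != 1:
        raise ValueError("expected tags that all use the same single axis")
    axis = axes.pop()
    parsed.sort(key=lambda entry: entry[0], reverse=(axis == "y"))
    return [tag for _, _, tag in parsed]


if __name__ == "__main__":
    print(display_order(["2x", "3x", "1x"]))  # left to right
    print(display_order(["1y", "3y", "2y"]))  # top to bottom
```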
For (c): tiled images
Use both x and y axis values in the filenames.
In this situation, there are 2 rows and 2 columns. Bottom left is the image tagged "1x1y"; top left is the image tagged "1x2y"; top right is the image tagged "2x2y"; and bottom right is the image tagged "2x1y".
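To show how a delivery system might turn tile tags into a layout, here is a minimal Python sketch. It takes only the tags (such as "1x2y") and returns rows from top to bottom, each ordered left to right. TILE and tile_grid are hypothetical names, and a missing tile is represented by None.

```python
import re

TILE = re.compile(r"(\d+)x(\d+)y")


def tile_grid(tags):
    """Arrange tile tags into rows, top row first, each row left to right."""
    cells = {}
    for tag in tags:
        match = TILE.fullmatch(tag)
        if match is None:
            raise ValueError(f"not a tile tag: {tag!r}")
        cells[(int(match.group(1)), int(match.group(2)))] = tag
    columns = sorted({x for x, _ in cells})
    # Highest y is at the top, so rows are listed in descending y order.
    rows = sorted({y for _, y in cells}, reverse=True)
    return [[cells.get((x, y)) for x in columns] for y in rows]


if __name__ == "__main__":
    for row in tile_grid(["1x1y", "1x2y", "2x2y", "2x1y"]):
        print(row)
```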
For (d): 3D delivery
Similarly, the z axis may be added for 3-dimensional image delivery. The letters should always be used in alphabetical order, with the number preceding the letter of the axis to which it applies.
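The ordering rule (number first, then axis letter, axes in alphabetical order) is easy to enforce when tags are generated rather than typed. A minimal Python sketch, with position_tag as a hypothetical helper name:

```python
def position_tag(**coords):
    """Build a position tag such as "1x2y3z" from axis values.

    Axes are always emitted in alphabetical order, with each number
    preceding the letter of the axis to which it applies.
    """
    allowed = ("x", "y", "z")
    unknown = set(coords) - set(allowed)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    if not coords:
        raise ValueError("at least one axis value is required")
    for axis, value in coords.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{axis} must be a positive integer")
    return "".join(f"{coords[axis]}{axis}" for axis in allowed if axis in coords)


if __name__ == "__main__":
    print(position_tag(z=3, x=1, y=2))
    print(position_tag(y=2, x=1))
```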
Please comment and make suggestions. If any of this is unclear, please recommend improvements.
thank you!! Jlderidder