File naming schemes

From UA Libraries Digital Services Planning and Documentation
Revision as of 13:46, 11 September 2008 by Jlderidder (talk | contribs)


We tried to come up with a file naming scheme that would scale, to manage digital objects of various types, from our holdings as well as elsewhere on campus and in other institutions. We needed some way to group the files that is reasonably intuitive, supportable, extensible, and which supports the needs of management and delivery systems.

While our current digitized items are primarily from content housed at Hoole Special Collections, we recognize that Digital Services may soon be digitizing and/or managing content from across campus and from a variety of other institutions.

While we are certain that this is far from perfect, and while we are still not in total agreement, the current recommendation is the one that follows. Please think about applications and uses to see if there are likely situations where this won't work; and please think about what would be an improvement. This is not yet set in stone. Comments and suggestions are welcomed, preferably within 2 weeks (by August 19, 2008). By then we will need to be training students to scan, so we need this in place.

While the first few uses of character strings are very helpful, after a time we are forced to choose nonsensical character strings to delineate an object. (How many George Smiths are there?) We also realize that not every institution, organization, or digital item creator is going to organize things to the level we do, or in the same ways.

So we are hoping that whatever system we come up against can be mapped against this one. Database tables and spreadsheets must exist to describe or point to the actual analog item, its location, its collection, its holder, and more. The filename cannot be expected to hold sufficient information to please everyone or every use. Its primary purpose is to serve as a unique identifying value for the file, and to facilitate file management and online delivery.

Additional considerations are the [[1]] of the filename; characters which cause problems for software (such as \/:*?"<>|); and consistency (both for parsing and handling, and for visually spotting errors). In addition, the "id" attribute values that are valid for XML need to be considered (they must begin with a letter, and more: [[2]]), and we need sufficient characters to provide for identification within the realm of the content provider.

This last consideration sets the "collection" value to at least 7 digits, as our Photograph collections require that many; within each of those collections we already have numbers of up to 6 digits.
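As a rough illustration of the character and XML considerations above, here is a minimal Python sketch that flags candidate filenames. The function name `filename_problems` is hypothetical, and the check for a leading ASCII letter covers only the first of the XML "id" rules, not the full set.

```python
import re

# Characters the page lists as troublesome for software: \ / : * ? " < > |
PROBLEM_CHARS = set('\\/:*?"<>|')


def filename_problems(name):
    """Return a list of human-readable problems found in a candidate filename."""
    problems = []
    bad = sorted(set(name) & PROBLEM_CHARS)
    if bad:
        problems.append("contains problem characters: " + " ".join(bad))
    # Simplified XML id check (an assumption of this sketch):
    # only verifies that the name starts with an ASCII letter.
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not begin with a letter")
    return problems
```

A name that passes returns an empty list, so the function could be run over a whole directory listing before files are handed to a delivery system.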

Here is a set of sample filenames, the analysis of which will follow:

  • u0001_0002061_0000345_0001.tif
  • u0001_0002061_0000345_0002.tif
  • u0001_0002061_0000345_0003.tif
  • u0001_0002061_0000345_0000.xml

"u" indicates that this object resides at the University of Alabama or was created under our university auspices. Some other character (as yet undetermined) will delineate other. I recommend "n" below, as the visual opposite of "u", and it can be construed to mean "not ours"!) Longer or more complex systems were discarded as being difficult and problematic after the first few assigned values.

After the "u" comes a 4-digit value to indicate a category which is spelled out in the database. At Hoole, it makes sense to separate Rare Books from Manuscripts from Archives from Photographs -- and more. Each of these is a "superset" of potential collections. In this example, we are going to suppose that 0001 was assigned to Hoole Special Collections Manuscripts.

The second set of numbers (in this case, "0002061") is the collection number assigned by the folks indicated in the first set of numbers. Since this is from Manuscripts, this is MS 2061. Any number with 7 or fewer digits may be assigned, though if the current system does not map directly to this one, we recommend auto-incrementing so it's clear which numbers are not yet in use.

The third set of numbers is the item number. In this case, item # 345 is the 345th item selected for digitization; it may NOT be the 345th item encountered in the analog collection. Identification depends on information in the database or spreadsheet.

The 4th set of numbers is the sequence number. In the set of filenames above, there are 3 pages for this document, numbered sequentially. These numbers may NOT match the page numbers that appear on the pages themselves, but they are necessary for sequential delivery.
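To make the four fields concrete, here is a minimal Python sketch that assembles a filename from its parts. The function name `build_filename` and its parameters are hypothetical. The 7-digit width of the item number is inferred from the samples rather than stated, and "n" is only the recommended character for other institutions, not a settled one.

```python
def build_filename(holder, category, collection, item, sequence, ext):
    """Compose a filename from its fields, zero-padding each number."""
    # "u" is ours; "n" is the recommended (not yet decided) marker for others.
    if holder not in ("u", "n"):
        raise ValueError("holder must be 'u' or 'n'")
    # (value, width) pairs: category 4, collection 7, item 7 (inferred), sequence 4.
    fields = [(category, 4), (collection, 7), (item, 7), (sequence, 4)]
    parts = []
    for value, width in fields:
        if value < 0 or len(str(value)) > width:
            raise ValueError(f"{value} does not fit in {width} digits")
        parts.append(f"{value:0{width}d}")
    return f"{holder}{parts[0]}_{parts[1]}_{parts[2]}_{parts[3]}.{ext}"
```

For instance, `build_filename("u", 1, 2061, 345, 1, "tif")` yields the first sample filename listed above.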

Also in the set of filenames above is a file that ends in "0000.xml" -- this is the metadata record, of whatever type. If there are multiple kinds of metadata for this item (e.g., MODS, METS, DC, EAD, TEI), then they must be kept in separate directories to avoid overwriting one another. Said directories should be named appropriately. If page-level metadata is kept in a separate file, it would end with the sequence number of the page to which it applies; that is, metadata for sequence page 1 ends in 0001.xml instead of 0000.xml.
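The metadata naming rule can be sketched in a few lines of Python. The function name `metadata_name` and its `page_level` flag are hypothetical, and the sketch assumes the four-field filename form described above.

```python
def metadata_name(image_name, page_level=False):
    """Derive the metadata filename that goes with an image filename."""
    stem, _, _ext = image_name.rpartition(".")
    fields = stem.split("_")
    if len(fields) != 4:
        raise ValueError("expected four underscore-separated fields")
    if not page_level:
        # The item-level record uses sequence number 0000.
        fields[3] = "0000"
    # A page-level record keeps the sequence number of its page.
    return "_".join(fields) + ".xml"
```

Because every metadata type for an item would get the same name under this rule, the returned name would still need to be placed in a directory named for its type (MODS, METS, and so on).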

EXAMPLES

Thus, the set of filenames above is for a 3-page document known as the 345th item selected for digitization from MS 2061 in the UA Hoole Special Collections Manuscripts division. There are 3 TIFF images, one per page, numbered sequentially, and one metadata record for the item (which consists of 3 pages).

  • u0002_0002324_0000005_0001.tif
  • u0002_0002324_0000005_0001_001.tif
  • u0002_0002324_0000005_0001_002.tif

These above are a page from a scrapbook, with close-up shots of each of the 2 photos on the page.

  • u0003_0000001_0009937_0002.tif

This one is from Rare Books, which has no subcollections --

  • u0004_2007102_0000342.tif
  • u0004_2007102_0000342.xml

This (above) is a photograph, only one, so no need for pages --

  • n0234_0000022_0000079_0000.xml
  • n0234_0000022_0000079_0001.tif
  • n0234_0000022_0000079_0002.tif

These above represent a 2-image document and metadata from some other institution.
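The variants shown in these examples (an omitted sequence number for a single photograph, and an extra 3-digit suffix for close-ups) can be handled by one pattern. The following Python sketch is an assumption built only from the samples on this page: the names `PATTERN`, `parse_filename`, `group_by_item`, and the group name `detail` are all hypothetical.

```python
import re
from collections import defaultdict

# holder letter + 4-digit category, 7-digit collection, 7-digit item,
# optional 4-digit sequence, optional 3-digit close-up suffix, extension.
PATTERN = re.compile(
    r"^(?P<holder>[a-z])(?P<category>\d{4})"
    r"_(?P<collection>\d{7})"
    r"_(?P<item>\d{7})"
    r"(?:_(?P<sequence>\d{4})(?:_(?P<detail>\d{3}))?)?"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_filename(name):
    """Return the fields of a filename as a dict, or None if it does not fit."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


def group_by_item(names):
    """Group filenames by (holder+category, collection, item), in sorted order."""
    groups = defaultdict(list)
    # Zero-padded fields mean plain string sorting yields sequence order.
    for name in sorted(names):
        fields = parse_filename(name)
        if fields is None:
            continue  # skip names that do not follow the scheme
        key = (fields["holder"] + fields["category"],
               fields["collection"], fields["item"])
        groups[key].append(name)
    return dict(groups)
```

Fields that are absent from a name, such as the sequence number of a single photograph, come back as `None`, which makes it easy to spot files that need a database lookup.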

Please comment and make suggestions. If any of this is unclear, please recommend improvements.

thank you!! Jlderidder