File naming schemes

From UA Libraries Digital Services Planning and Documentation
Revision as of 08:32, 11 September 2012

Discussion:

We tried to come up with a file naming scheme that would scale, to manage digital objects of various types, from our holdings as well as elsewhere on campus and in other institutions. We needed some way to group the files that is reasonably intuitive, supportable, extensible, and which supports the needs of management and delivery systems.

While our current digitized items are primarily from content housed at Hoole Special Collections, we recognize that Digital Services may soon be digitizing and/or managing content from across campus and from a variety of other institutions.

For more information, see TrackingFilenames


While the first few uses of character strings are very helpful, after a time we are forced to choose nonsensical character strings to designate an object. (How many George Smiths are there?) We also realize that not every institution, organization, or digital item creator is going to organize things to the level we do, or in the same ways.

So we are hoping that whatever system we encounter can be mapped to this one; database tables and spreadsheets must exist to describe or point to the actual analog item, its location, its collection, its holder, and more. The filename cannot be expected to hold sufficient information to please everyone or serve every use. Its primary purpose is to serve as a unique identifying value for the file, and to facilitate file management and online delivery.

Additional considerations are the limits on the filename (see http://www.controlledvocabulary.com/imagedatabases/filename_limits.html); characters which cause problems for software (such as \/:*?"<>|); and consistency (both for parsing and handling, and also for visually spotting errors). In addition, the "id" attribute values that are valid for XML need to be considered (they must begin with a letter, and more: http://www.w3.org/TR/REC-xml/#sec-attribute-types); and we need sufficient characters to provide for identification within the realm of the content provider.
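The considerations above can be turned into a quick automated check. The following is a minimal sketch in Python, not an established tool: the function name filename_problems and the 255-character limit are assumptions made for illustration, and the "begins with a letter" test is a simplification of the full XML rules.

```python
import re

# Characters named in the text as causing problems for software.
PROBLEM_CHARS = set('\\/:*?"<>|')


def filename_problems(filename, max_length=255):
    """Return a list of problems found; an empty list means none were found.

    max_length=255 is an assumed limit, not a value taken from this page.
    """
    problems = []
    stem = filename.rsplit(".", 1)[0]
    bad = sorted(set(filename) & PROBLEM_CHARS)
    if bad:
        problems.append("problem characters: " + "".join(bad))
    if len(filename) > max_length:
        problems.append("longer than %d characters" % max_length)
    # Simplified check: the stem should begin with a letter so that it can
    # double as an XML "id" value. The full XML Name rules are not modeled.
    if not re.match(r"[A-Za-z]", stem):
        problems.append("does not begin with a letter")
    return problems


print(filename_problems("u0003_0002061_0000345_0001.tif"))
print(filename_problems("0003_0002061.tif"))
```

A check like this could be run over a batch of scans before they are loaded, so that errors are caught by software as well as by eye.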

Therefore, we will begin each file name with a "u" for university holdings, and "p" for patron holdings, and reserve the other letters for other types of institutions.

  1. The first 4 digits will be used to identify the holding group or area at the university. As such, there will be several for Hoole. For example, u0001 will be for all image collections, and u0003 for all manuscript collections.
  2. The second set of 7 digits will identify the particular collection within the holding area. We chose 7 digits since this will hold the current collection numbers being used for image collections by Hoole (year processed followed by a 3-digit number denoting sequence of processing). For other collections whose numbering system will fit within this, we will use the existing numbering system for that holding area. For example, in Manuscripts, MS 2504 will become collection u0003_0002504.
  3. The third set of numbers will identify the item within the collection.
  4. The 4th set of numbers, if used, will identify the sequence for delivery on the page level.
  5. The 5th set of numbers, if used, will identify the sequence for delivery on the sub-page level (for example, closeups of photos on a scrapbook page).
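The segment rules in this list can be sketched as a small Python helper. This is an illustration only: build_filename is a hypothetical name, and the widths used for the third, fourth, and fifth segments (7, 4, and 3 digits) are assumptions taken from the sample filenames elsewhere on this page, since the list itself fixes only the first two widths.

```python
def build_filename(holder, area, collection, item=None, page=None,
                   subpage=None, ext="tif"):
    """Assemble a filename from its numeric segments.

    holder is the leading letter ("u" for university, "p" for patron).
    Widths of 7, 4 and 3 digits for item, page and sub-page are assumed
    from the sample filenames on this page.
    """
    parts = ["%s%04d" % (holder, area), "%07d" % collection]
    if item is not None:
        parts.append("%07d" % item)
        if page is not None:
            parts.append("%04d" % page)
            if subpage is not None:
                parts.append("%03d" % subpage)
    return "_".join(parts) + "." + ext


# Page 1 of item 345 in manuscript collection MS 2061.
print(build_filename("u", 3, 2061, item=345, page=1))
```

Later segments are only written when the earlier ones are present, which keeps every segment in a fixed position.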


Here is a set of sample filenames, the analysis of which will follow:

  • u0003_0002061_0000345_0001.tif
  • u0003_0002061_0000345_0002.tif
  • u0003_0002061_0000345_0003.tif
  • u0003_0002061_0000345_0000.xml


"u" indicates that this object resides at the University of Alabama or was created under our university auspices. Other characters will designate other holdings: we have selected "p" for patron holdings, and the characters for other types of institutions are as yet undetermined. Longer or more complex systems were discarded as being difficult and problematic after the first few assigned values.

After the "u" comes a 4-digit value to indicate a category which is spelled out in the database. At Hoole, it makes sense to separate Rare Books from Manuscripts from Archives from Photographs -- and more. Each of these is a "superset" of potential collections. In this example, we are going to suppose that 0003 was assigned to Hoole Special Collections Manuscripts.

The second set of numbers (in this case, "0002061") is the collection number assigned by the holding area indicated in the first set of numbers. Since this is from Manuscripts, this is MS 2061. Any number with 7 or fewer digits may be assigned, though if the current system does not map directly to this one, we recommend auto-incrementing so that it is clear which numbers are not yet in use. (If the file being labeled is the EAD (finding aid) for collection MS 2061, the file naming stops here: u0003_0002061.xml.)
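As a rough illustration of this mapping, the sketch below turns an existing manuscript number into the EAD filename for its collection. The function name ead_filename is hypothetical, and the default prefix "u0003" follows the supposition in this example that 0003 was assigned to Manuscripts.

```python
def ead_filename(ms_number, area_prefix="u0003"):
    """Map an existing collection number such as "MS 2061" to its EAD filename.

    area_prefix defaults to "u0003", the value supposed for Manuscripts
    in this example; other holding areas would pass their own prefix.
    """
    digits = "".join(ch for ch in ms_number if ch.isdigit())
    if not digits or len(digits) > 7:
        raise ValueError("collection number must have 1 to 7 digits: %r" % ms_number)
    # Left-pad with zeros so the collection segment is always 7 digits wide.
    return "%s_%s.xml" % (area_prefix, digits.zfill(7))


print(ead_filename("MS 2061"))
```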

The third set of numbers is the item number. In this case, item # 345 is the 345th item selected for digitization; it may NOT be the 345th item encountered in the analog collection. Identification depends on information in the database or spreadsheet.

The 4th set of numbers is the sequence number. In the set of filenames above, there are 3 pages for this document, numbered sequentially. These numbers may NOT match the page numbers that appear on the page, but are necessary for sequential delivery.

Also in the set of filenames above is a file that ends in "0000.xml": this is the metadata record, of whatever type. If there are multiple kinds of metadata for this item (e.g., MODS, METS, DC, EAD, TEI), then they must be kept in separate directories to avoid overwriting, and those directories should be named appropriately. If page-level metadata is kept in a separate file, it would end with the sequence number of the page to which it applies; that is, metadata for sequence page 1 ends in 0001.xml instead of 0000.xml.
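The metadata naming rule can be sketched as follows. This is only an illustration under stated assumptions: metadata_path is a hypothetical helper, the directory is simply named after the metadata type (the page says only that directories should be named appropriately), and the sketch handles only image filenames that carry a page sequence segment.

```python
import posixpath


def metadata_path(image_filename, metadata_type, page_level=False):
    """Return the metadata file path that corresponds to a page image.

    metadata_type (for example "MODS") is used as the directory name,
    which is an assumption made for this sketch.
    """
    stem = image_filename.rsplit(".", 1)[0]
    segments = stem.split("_")
    if len(segments) < 4:
        raise ValueError("expected a page sequence segment: %r" % image_filename)
    if page_level:
        # Page-level metadata ends with the page's own sequence number.
        segments = segments[:4]
    else:
        # Item-level metadata uses the reserved sequence number 0000.
        segments = segments[:3] + ["0000"]
    return posixpath.join(metadata_type, "_".join(segments) + ".xml")


print(metadata_path("u0003_0002061_0000345_0002.tif", "MODS"))
print(metadata_path("u0003_0002061_0000345_0002.tif", "MODS", page_level=True))
```

Keeping each metadata type in its own directory means that two records for the same item never compete for the same filename.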

Thus, the set of filenames above is for a 3-page document known as the 345th item selected for digitization from MS 2061 in the UA Hoole Special Collections Manuscripts division. There are 3 tiff images, one per page, numbered sequentially, and one metadata record for the item (which consists of 3 pages).
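Because every segment has a fixed position, the same analysis can be done by software. The sketch below is one possible way to do it in Python; the names PATTERN, Parsed, and parse_filename are made up for illustration, and the segment widths (and the lower-case extension) are assumed from the sample filenames on this page.

```python
import re
from collections import namedtuple

Parsed = namedtuple("Parsed", "holder area collection item page subpage ext")

# Later segments are optional, but each one requires the one before it.
PATTERN = re.compile(
    r"^(?P<holder>[a-z])(?P<area>\d{4})"
    r"_(?P<collection>\d{7})"
    r"(?:_(?P<item>\d{7})"
    r"(?:_(?P<page>\d{4})"
    r"(?:_(?P<subpage>\d{3}))?)?)?"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_filename(filename):
    """Split a filename into its segments, or return None if it does not fit."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()

    def to_int(value):
        return None if value is None else int(value)

    return Parsed(fields["holder"], to_int(fields["area"]),
                  to_int(fields["collection"]), to_int(fields["item"]),
                  to_int(fields["page"]), to_int(fields["subpage"]),
                  fields["ext"])


print(parse_filename("u0003_0002061_0000345_0001.tif"))
```

A filename that does not fit the pattern returns None, which gives a simple way to flag naming errors before files are loaded.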


OTHER EXAMPLES

  • u0003_0002324_0000005_0003.tif
  • u0003_0002324_0000005_0003_001.tif
  • u0003_0002324_0000005_0003_002.tif

These above are a page from a scrapbook (the 3rd page in sequence), with closeup shots of each of the 2 photos on that page. The scrapbook is the 5th item in collection MS 2324.
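Since all segments are zero-padded to a fixed width, a plain alphabetical sort already puts a page ahead of its closeups. The following sketch groups closeups under the page they belong to; closeups_by_page is a hypothetical helper, and it assumes filenames that follow the scheme exactly.

```python
def closeups_by_page(filenames):
    """Group sub-page closeups under the stem of the page they belong to."""
    pages = {}
    # Zero-padded segments mean a plain sort matches delivery order.
    for name in sorted(filenames):
        stem = name.rsplit(".", 1)[0]
        segments = stem.split("_")
        if len(segments) == 4:
            # A page-level image: make sure the page has an entry.
            pages.setdefault(stem, [])
        elif len(segments) == 5:
            # A sub-page closeup: file it under its parent page.
            pages.setdefault("_".join(segments[:4]), []).append(name)
    return pages


files = [
    "u0003_0002324_0000005_0003_002.tif",
    "u0003_0002324_0000005_0003.tif",
    "u0003_0002324_0000005_0003_001.tif",
]
print(closeups_by_page(files))
```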


  • u0002_0000001_0009937_0002.tif

This one (above) is from Rare Books, which has no subcollections.


  • u0001_2007102_0000342.tif
  • u0001_2007102_0000342.xml

This (above) is a single photograph, so there is no need for a page sequence number.


  • n0234_0000022_0000079_0000.xml OR n0234_0000022_0000079.xml
  • n0234_0000022_0000079_0001.tif
  • n0234_0000022_0000079_0002.tif

These above represent a 2-image document and metadata from some other institution.


For a single-item collection

Other Versions of an image, such as the OCR tiff file.

Item_Numbering_Variations

Multiple Batches In Collection

Content with multiple streams to be delivered simultaneously

Intellectual vs Logical Numbering issues

In this manner, we retain consistency in how we use each segment of information in the filename; this consistency allows us to automate much of our file management.

