Situations arise in which we may have multiple versions of a particular image or object, for a variety of reasons.
As we cannot foresee every situation, we will determine policies on a case-by-case basis.
Rule 1. Different format, same file.
If another format of the same file is being created, only the extension needs to change.
- A jpeg made from the tif file u0003_000608_0000234_0001.tif would be named u0003_000608_0000234_0001.jpg.
- A plain text derivative of the tif file u0003_000608_0000234_0001.tif (created by an OCR process or by hand-typing into a plain text editor) would be named u0003_000608_0000234_0001.txt.
- Someone reads the letter into a microphone, digitizing the text in audio format. In this case, no pagination is involved, so the filename of the ITEM ends in the extension of the audio format, for example: u0003_000608_0000234.mp3.
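As a rough sketch of Rule 1, the three examples above could be generated like this. The function name `derivative_name` and the `item_level` flag are made up for illustration, and the code assumes (from the examples only) that the last underscore-separated segment of a page-level filename is the page sequence.

```python
from pathlib import PurePosixPath


def derivative_name(master: str, new_ext: str, item_level: bool = False) -> str:
    """Name a different-format derivative of `master` (hypothetical helper).

    Assumption: the final underscore-separated segment of the stem is the
    page sequence, as in u0003_000608_0000234_0001.tif.
    """
    stem = PurePosixPath(master).stem      # drop the old extension
    if item_level:
        # No pagination (e.g. an audio reading): drop the page segment.
        stem = stem.rsplit("_", 1)[0]
    return f"{stem}.{new_ext.lstrip('.')}"


print(derivative_name("u0003_000608_0000234_0001.tif", "jpg"))
print(derivative_name("u0003_000608_0000234_0001.tif", "txt"))
print(derivative_name("u0003_000608_0000234_0001.tif", "mp3", item_level=True))
```

The first two calls change only the extension; the third also drops the page segment, matching the audio example.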
Rule 2. Same format, different version.
If, however, another tiff is being made of an alternate version of the same object, we need another method. When we have a typed transcription of a handwritten letter, and we scan the typed transcription in order to create OCR text, we need a naming scheme. This image exists only for OCR purposes, so we have decided that for the tif file u0003_000608_0000234_0001.tif, the tiff to be used for OCR purposes (the one made of the transcription) is to be named u0003_000608_0000234_ocr_0001.tif. (Is it too much to assume that the fidelity of the transcription is enough that we can make the intellectual leap that the OCR comes from the handwritten letter, without bulking up the filename with the file's genealogy? Can we not assume equivalency of "a" and "c" if they share "b", and lose the "ocr_" from any OCR text file?)
When this file is actually OCR'd, it will create a .txt file, which will then be named u0003_000608_0000234_ocr_0001.txt. The reasoning here is this: the typed transcription normally does NOT correspond with the pagination of the original scanned object. That is, page 1 of the transcription normally has all of page 1 and most of page 2 of the original object transcribed onto it. So page 2 of the transcription does not contain all and only the content of page 2 of the original object. If the pagination of the transcription is out of sync with the object, how is the OCR matched up with the object's pages so that it can be used to locate words? And if that is not possible with the use of transcriptions, and we only want to use the OCR'd text to get the user to the right letter
(Remember: if an OCR text file is made from the original scanned object, u0003_000608_0000234_0001.tif, then that text file would be named u0003_000608_0000234_0001.txt. This follows Rule 1 above, in that it is a different format of the original file.)
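To make the Rule 2 pattern concrete, here is a small sketch that builds the "ocr" names from an item identifier and a transcription page number. The function names are hypothetical, and the four-digit zero-padded page segment is an assumption read off the examples above, not a stated policy.

```python
def transcription_ocr_name(item_id: str, transcription_page: int, ext: str = "tif") -> str:
    """Name a scan (or its OCR text) of a typed transcription page.

    Assumptions: `item_id` looks like u0003_000608_0000234, and pages are
    written as four zero-padded digits. The page number is the page of the
    TRANSCRIPTION, which need not match the original object's pagination.
    """
    return f"{item_id}_ocr_{transcription_page:04d}.{ext.lstrip('.')}"


def is_transcription_ocr(filename: str) -> bool:
    """True if the segment just before the page sequence is 'ocr'."""
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("_")
    return len(parts) >= 2 and parts[-2] == "ocr"


print(transcription_ocr_name("u0003_000608_0000234", 1))          # scan of transcription
print(transcription_ocr_name("u0003_000608_0000234", 1, "txt"))   # its OCR output
print(is_transcription_ocr("u0003_000608_0000234_0001.txt"))      # OCR of the original scan
```

Keeping the marker as its own segment means a simple split on underscores can tell transcription-derived files apart from Rule 1 derivatives of the original scan.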
Hence the choice here is to add something standard into the filename that indicates the purpose of the new object, at the point where it makes the most sense. But does this scale? My first thought is, probably not.
I'm thinking: we create version 6.0 tiffs right now, at 600 dpi. Ten years from now, we may decide we need version 8.0 tiffs instead, though we want to keep the originals. So what do we name the new tiffs? Or do I need to worry about that now? (v.8.0 tiffs are the least of our future problems if you take into consideration v.24.0 tiffs that are hivemind-accessed by the singularity brain, so how will we all avoid neural feedback from the redundant latent filename share generated by our old v.6.0 tiffs???) In seriousness, could we take the v.6.0 tiffs out of circulation/our access archive?
I welcome your thoughts!! Jlderidder