Difference between revisions of "Transcripts"

From UA Libraries Digital Services Planning and Documentation

Revision as of 11:04, 15 February 2017



Two extensions are used for plain-text OCR files:

  1. ".ocr.txt" - the "raw" OCR results.
  2. ".txt" - if and when the raw .ocr.txt files have been remediated, the remediated versions use the standard ".txt" extension.
  If these remediated files exist, the ".ocr.txt" files should be deleted. The TIFF files of the transcriptions will also be permanently deleted if they are judged not to merit preservation in their own right.
  That is, if the image of a transcript has no perceived value beyond the text on the page, the image will be deleted.

NOTE: for transcript files where we have either corrected the OCR or typed in the transcription (and so saved it as .txt instead of .ocr.txt), we must verify the transcription against the original file. It’s quite possible that the person who typed the transcription made an error, and it is insufficient for us simply to save a correct transcript of someone else’s transcription; we need to be sure that the saved .txt file reflects what is actually in the original.

  If remediation of the raw OCR files is deemed too labor- or cost-intensive, both the TIFFs and the raw ".ocr.txt" files must be preserved.
  This way, as OCR technology improves, it may be possible to render higher quality OCR versions of the transcripts in the future.
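
As a minimal sketch of the extension rule above, the following Python function lists the raw ".ocr.txt" files in a folder that already have a remediated ".txt" counterpart and are therefore candidates for deletion. The function name is hypothetical and not part of any existing script; it only reports files and deletes nothing.

```python
from pathlib import Path


def superseded_raw_ocr(folder):
    """Return raw ".ocr.txt" files whose remediated ".txt" twin exists.

    Hypothetical helper for illustration; it only lists candidates.
    """
    superseded = []
    for raw in sorted(Path(folder).glob("*.ocr.txt")):
        # "u0008_9999999_0001.ocr.txt" -> "u0008_9999999_0001.txt"
        remediated = raw.with_name(raw.name[: -len(".ocr.txt")] + ".txt")
        if remediated.exists():
            superseded.append(raw)
    return superseded
```

Reviewing the returned list by hand before deleting anything keeps the verification step described in the NOTE above in human hands.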

Non-compound Objects

For non-compound objects/items, transcripts simply add a page-level extension to the item number.

For example, item u0008_9999999 might have 10 transcript pages, whose .tif files would be named u0008_9999999_0001.tif through u0008_9999999_0010.tif.
The same applies to the ".ocr.txt" and ".txt" files, that is, the raw and remediated OCR versions of the transcript TIFFs.
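
The page-level naming above can be sketched in Python as follows. The function name is hypothetical, and the four-digit, zero-padded page number is taken from the example file names in this section.

```python
def page_filenames(item_id, page_count, extension=".tif"):
    """Build page-level file names for a non-compound item.

    Hypothetical helper: appends a four-digit page number to the item number.
    """
    return [
        f"{item_id}_{page:04d}{extension}"
        for page in range(1, page_count + 1)
    ]


names = page_filenames("u0008_9999999", 10)
print(names[0])   # u0008_9999999_0001.tif
print(names[-1])  # u0008_9999999_0010.tif
```

Passing extension=".ocr.txt" or extension=".txt" gives the matching raw and remediated OCR file names.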

Compound Objects

Often TIFF/WAV files do not have a one-to-one match with transcript files, or vice versa.

Below are three scenarios and the naming rules devised on 6/19/09 to allow the file names themselves to denote a correlation between different file types that point to the same information (i.e. an audio interview and a transcript of that interview).

Situation 1: One to Many (One media file to many text files)

ex: 1 .wav file and 3 .txt transcript files





Situation 2: Many to One (Many media files to one text file)

ex: 3 .wav files and 1 .txt transcript file





Situation 3: One to One (One media file to one text file)

ex: 1 .wav file and 1 .txt transcript file



*For general information on filenaming conventions, see File_naming_schemes.

Folder Names

Folders called "Transcripts" will reside at the same level as the Scans folders.

  • All transcripts should be placed directly inside the Transcripts directory, all at the same level. Do not place transcripts in sub-folders as you would in a Scans folder.
  • The numerical identifier attached to the end of "Scans_" and "Transcripts_" folders will show the correspondence between given Scans and Transcripts folders.
  • In other words, for a given collection: the folder “Transcripts_21” contains any transcript files for the scans in “Scans_21”.
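
A minimal Python sketch of this folder correspondence, assuming folder names of exactly the form "Scans_<number>" and "Transcripts_<number>" (the function name is hypothetical):

```python
from pathlib import Path


def transcripts_folder_for(scans_folder):
    """Return the Transcripts folder that corresponds to a Scans folder.

    Hypothetical helper: "Scans_21" maps to a sibling named "Transcripts_21".
    """
    scans = Path(scans_folder)
    prefix, _, number = scans.name.partition("_")
    if prefix != "Scans" or not number.isdigit():
        raise ValueError(f"not a Scans_<number> folder: {scans.name}")
    # Same parent directory, same numerical identifier.
    return scans.with_name(f"Transcripts_{number}")


print(transcripts_folder_for("u0008/Scans_21").name)  # Transcripts_21
```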


  • What if you have 2 .wav files and 3 pages of transcripts? That is, what happens when a transcript page contains the transcription for part of each of the 2 .wav files?

In that case you might have a scenario like this: _0001.wav goes with _0001_001.txt and _0001_002.txt, while _0002.wav goes with _0002_001.txt and _0002_002.txt. Here, _0001_002.txt and _0002_001.txt are the SAME document, simply existing twice under different file names. This still allows people to know which transcripts correspond to which media items based simply on the file name.

We understand that this calls for more storage space to be used (given that an analog item exists as two distinct files), but the greater concern is the removal of confusion regarding relation of items to one another.
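
Because the correspondence lives in the file names, it can be recovered mechanically. The Python sketch below groups ".txt" transcripts under the media file whose name they extend. The function name and the sample item numbers are hypothetical, and ".ocr.txt" files are not handled.

```python
from collections import defaultdict
from pathlib import PurePath


def transcripts_by_media(media_names, transcript_names):
    """Group .txt transcripts under the media file they are named for.

    Hypothetical helper: "x_0001_002.txt" belongs to "x_0001.wav" because
    dropping the final "_NNN" page suffix gives the media file's base name.
    """
    media_stems = {PurePath(name).stem for name in media_names}
    matches = defaultdict(list)
    for transcript in sorted(transcript_names):
        media_stem, _, _page = PurePath(transcript).stem.rpartition("_")
        if media_stem in media_stems:
            matches[media_stem].append(transcript)
    return dict(matches)


groups = transcripts_by_media(
    ["u0008_0000001_0001.wav", "u0008_0000001_0002.wav"],
    [
        "u0008_0000001_0001_001.txt",
        "u0008_0000001_0001_002.txt",
        "u0008_0000001_0002_001.txt",
        "u0008_0000001_0002_002.txt",
    ],
)
for media, transcripts in groups.items():
    print(media, transcripts)
```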

Important: It was discussed on 070209 that in the case of .txt file transcripts (OCR), we can simply edit the .txt files so that there is a 1 to 1 match between the transcript and the media file (wav, tiff, etc.). That is to say that if the transcripts have no value in themselves (i.e. tiffs of historically important original transcripts WOULD have value in and of themselves) we can then use the OCR-ed .txt files and divide them up as needed to get a 1 to 1 match with the media file.

  • What about a scenario in which there are 2 tiffs (of original analog materials) and 1 transcript file (in tiff format) which contains the transcription for both original tiffs? How will someone know what part of the transcription tiff matches with the respective portion of the scans of the analog materials?

For this case we have a workflow and scripts:

  1. After you have run OCR on your files and moved them to the correct folder, use the script joinPages_GUI.pl.
    • Go to S:\Digital Projects\Administrative\scripts\transcripts and double-click on joinPages_GUI.pl
    • Choose the directory where your ocr.txt files are
    • The script will look for items with pages (example: u0003_0001583_0000122_0001.ocr.txt), read in each page-level file, and create a single txt file with all the pages in order and named for the item.
    • Once complete, delete the individual page files or move them to another directory to be deleted later.
  2. Open the new file, compare the transcript text to the original materials, and prepare the txt file to be split correctly.
    • At the point where the text for each page of the original materials begins, add the page number followed by a space.
    • You may leave a blank line between pages, but there should be no blank lines within the text for each page; delete any blank lines within the page text. (See example below.)
    • Save the edited file.
  3. Use the script makePages_GUI.pl to separate pages you've specified into individual txt files.
    • Go to S:\Digital Projects\Administrative\scripts\transcripts and double-click on makePages_GUI.pl
    • Choose the item-level file you edited.
    • The script will look for the page break numbers you put in during Step 2, grab the text, and create each page as a separate txt file named for the item + page.
    • Check your new files and make sure everything is correct; if you saved your previous page ocr.txt files, you can delete them now.
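
The join and split steps above can be sketched in Python as shown below. This is not the code of joinPages_GUI.pl or makePages_GUI.pl, whose internals are not documented here; it is an illustration under stated assumptions. Page-level files are assumed to end in a four-digit page number before ".ocr.txt", the joined file is assumed to be named "<item>.txt", and a page marker is assumed to be one or more digits followed by a space at the start of a line. A line of transcript text that happens to begin with a number and a space would be mistaken for a marker, so check the output.

```python
import re
from pathlib import Path

# Assumed page-level name: <item>_<4-digit page>.ocr.txt
PAGE_FILE = re.compile(r"(?P<item>.+)_(?P<page>\d{4})\.ocr\.txt")
# Assumed page marker: a number and a space at the start of a line.
PAGE_MARKER = re.compile(r"^(\d+) ", re.MULTILINE)


def join_pages(folder):
    """Join page-level .ocr.txt files into one <item>.txt file per item."""
    folder = Path(folder)
    pages_by_item = {}
    # Zero-padded page numbers mean sorted names are already in page order.
    for path in sorted(folder.glob("*.ocr.txt")):
        match = PAGE_FILE.fullmatch(path.name)
        if match:
            pages_by_item.setdefault(match["item"], []).append(path)
    written = []
    for item, pages in pages_by_item.items():
        texts = [page.read_text(encoding="utf-8").strip() for page in pages]
        out = folder / f"{item}.txt"
        out.write_text("\n\n".join(texts) + "\n", encoding="utf-8")
        written.append(out)
    return written


def split_pages(item_file):
    """Split an edited item-level file into <item>_<page>.txt files."""
    item_file = Path(item_file)
    text = item_file.read_text(encoding="utf-8")
    # re.split with one capture group yields:
    # [text before first marker, "1", page 1 text, "2", page 2 text, ...]
    pieces = PAGE_MARKER.split(text)
    written = []
    for number, body in zip(pieces[1::2], pieces[2::2]):
        out = item_file.with_name(f"{item_file.stem}_{int(number):04d}.txt")
        out.write_text(body.strip() + "\n", encoding="utf-8")
        written.append(out)
    return written
```

Neither function deletes anything, so the original page-level files remain in place until you have checked the new files.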


Audio Transcript Matching: Compound Objects

Audio interviews for compound objects can have multiple transcript files/scans associated with one audio file. It is important to name these transcript files in a way that delineates which transcripts match with particular .wav files.

Here is a link to a small tutorial that demonstrates a method for doing such a thing provided the transcript scans are good candidates for OCR.