Transcripts now come to us in two ways:
- Via the Acumen interface, in which case they are automatically added to the Acumen database. If OCR text is available from digitization, it is stored in the web directory and offered to the user to correct and improve into a transcript. If a transcript already exists in the database, however minimal, that is presented to the user instead. As part of the archiving process, transcripts are extracted from Acumen monthly and checked against the last version archived; any that differ are processed into the archive.
- During or prior to digitization, when transcripts are sometimes created or supplied.
The rest of this page covers the second option.
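For the Acumen route, the monthly "checked against the last version archived" step could be sketched as a simple content comparison. This is only an illustration: the function names and the use of SHA-256 are assumptions, not a description of the actual extraction script.

```python
import hashlib
from pathlib import Path


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a transcript's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_archiving(extracted: bytes, archived_path: Path) -> bool:
    """Return True if the extracted transcript should go into the archive.

    A transcript qualifies when no archived copy exists yet, or when its
    content differs from the last archived copy. (Hypothetical helper.)
    """
    if not archived_path.exists():
        return True
    return digest(extracted) != digest(archived_path.read_bytes())
```

Comparing digests rather than timestamps means a transcript that was opened and saved without changes is not archived again.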
Transcripts vs. OCR
At times we receive hard-copy transcriptions alongside the originals submitted for digitization. Our process for these is to digitize the transcriptions, place them in a "Transcriptions" directory in the collection work area (see Share Drive Protocols), and run OCR (Optical Character Recognition) software over them to extract text for indexing, which enhances searchability. The script for this process is called ocrDir; it is on the libcontent server, in the scripts area of the ds home directory. (OCR processing is a regular feature of the upload process, normally triggered by the entry of a "1" in the OCR column of a log file, indicating that the item is at least half printed text.)
After OCR extracts are made, they must be edited so that there is a single text file for each page of the original (not the transcript) document. This ensures that when users search on text contained in the file, they are directed to the correct original page. During this editing, staff or student workers often "clean up" the OCR text, correcting machine errors. At that point, these files become transcripts instead of OCR files. The difference is reflected in the file name: u0003_0000252_0000001_0001.ocr.txt is an OCR file for page 1 of the first item in the Cabaniss collection (u0003_0000252_0000001_0001.tif), and u0003_0000252_0000001_0001.txt is the transcript for that same page. (See File_naming_schemes.) Transcripts must be encoded as ASCII or UTF-8.
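The naming distinction and the encoding requirement can be checked mechanically. The following is a minimal sketch, assuming file names follow the shape of the example above (a letter-and-digits collection number followed by three numeric segments); the pattern and function names are illustrative, and File_naming_schemes remains the authority on the actual rules.

```python
import re

# Pattern inferred from the single example above; real naming rules
# may allow other shapes (see File_naming_schemes).
PAGE_TEXT = re.compile(r"^([a-z]\d+_\d+_\d+_\d+)(\.ocr)?\.txt$")


def classify(filename: str) -> tuple[str, str]:
    """Return (page identifier, "ocr" or "transcript") for a page text file."""
    match = PAGE_TEXT.match(filename)
    if match is None:
        raise ValueError(f"not a page-level text file: {filename}")
    kind = "ocr" if match.group(2) else "transcript"
    return match.group(1), kind


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8.

    ASCII is a subset of UTF-8, so this one check covers both encodings.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For instance, classify("u0003_0000252_0000001_0001.txt") identifies the file as a transcript for page u0003_0000252_0000001_0001, and is_valid_utf8 rejects a file saved in a legacy encoding such as Latin-1 if it contains accented characters.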
As mentioned before, OCR files are uploaded to the web server (from the Transcripts directory of the collection work area) and are distributed into the web directories where they belong. For example, u0003_0000252_0000001_0001.ocr.txt would be placed in the Transcripts subdirectory for the page: /srv/www/htdocs/content/u0003/0000252/0000001/0001/Transcripts/ The various makeJpegs scripts (one for mass-digitized content, one for audio, one for scrapbooks, and one for the rest of our content) all generate OCR as required by the log file entries, and upload transcripts both to the deposits directory and to the Acumen staging database. The code for this can be found in the derivatives library (/srv/scripts/lib/derivatives.pm).
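The mapping from file name to web directory in the example above can be sketched as follows. This is an illustration in Python, not the logic in derivatives.pm; the function name is hypothetical, and it assumes every underscore-separated segment of the file name becomes one directory level, as in the example.

```python
from pathlib import PurePosixPath

# Web root taken from the example path above.
WEB_ROOT = PurePosixPath("/srv/www/htdocs/content")


def transcripts_dir(filename: str) -> PurePosixPath:
    """Return the Transcripts directory for a page-level text file.

    Everything before the first "." is the page identifier; each
    underscore-separated segment becomes one directory level.
    """
    page_id = filename.split(".", 1)[0]
    return WEB_ROOT.joinpath(*page_id.split("_"), "Transcripts")
```

Because the extension is stripped before splitting, the OCR file and the transcript for the same page resolve to the same directory.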
Transcripts are now uploaded into the Acumen database; they are also placed in the deposits directory for archiving with the TIFF files (see Organization of completed content for long-term storage). OCR files do not go into the archive, as they can be regenerated from the existing TIFFs (and, we hope, future OCR software will be better than what we have now). Right now, we use Tesseract (see For Creating Derivatives).
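Regenerating an OCR file from an archived TIFF could look roughly like the sketch below. It relies only on the basic Tesseract command-line form (input image, then output base name, to which Tesseract appends ".txt"); the function names are hypothetical, and the options actually used by ocrDir or derivatives.pm are not shown here.

```python
import subprocess
from pathlib import Path


def tesseract_command(tiff: Path, out_dir: Path) -> list[str]:
    """Build the Tesseract call for one page image.

    Tesseract appends ".txt" to the output base, so a base ending in
    ".ocr" yields a file named "<page>.ocr.txt".
    """
    out_base = out_dir / (tiff.stem + ".ocr")
    return ["tesseract", str(tiff), str(out_base)]


def regenerate_ocr(tiff: Path, out_dir: Path) -> Path:
    """Run Tesseract on one TIFF and return the path of the OCR text file."""
    subprocess.run(tesseract_command(tiff, out_dir), check=True)
    return out_dir / (tiff.stem + ".ocr.txt")
```

Keeping the command construction separate from the call makes it easy to inspect what would be run before processing a whole collection.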
In the future:
We hope to obtain enough transcripts for our handwritten documents to generate TEI XML for them. This would be done by extracting metadata from the MODS to generate the header, incorporating one transcript per page, and linking in the delivery image for each page above or below its transcript.
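As a rough illustration of what that output might look like, the sketch below assembles a minimal TEI document from a title and a list of per-page transcripts. Everything here is an assumption made for illustration: the function name, the choice of a div and pb element per page, the use of the facs attribute to point at the delivery image, and the placeholder header text. Extracting the title from MODS is left out; it is simply passed in.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title: str, pages: list[tuple[str, str]]) -> str:
    """Return a minimal TEI document as a string.

    pages holds one (delivery image URL, transcript text) pair per
    page, in page order.
    """
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, **attrs):
        # Create a child element in the TEI namespace.
        return ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)

    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    # Header: in the planned workflow this would come from the MODS record.
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title").text = title
    el(el(file_desc, "publicationStmt"), "p").text = "Placeholder publication statement"
    el(el(file_desc, "sourceDesc"), "p").text = "Placeholder source description"

    # Body: one division per page, with the image link above the transcript.
    body = el(el(tei, "text"), "body")
    for number, (image_url, transcript) in enumerate(pages, start=1):
        div = el(body, "div", type="page", n=str(number))
        el(div, "pb", n=str(number), facs=image_url)
        el(div, "p").text = transcript

    return ET.tostring(tei, encoding="unicode")
```

Placing the pb element before the paragraph puts the image link above the transcript; moving it after the paragraph would give the "below" arrangement instead.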