From UA Libraries Digital Services Planning and Documentation

For translations or transcriptions to be in a form that we can use:

  1. We need a separate plain-text file (.txt extension, not .doc, .docx, or .rtf) for each page, with the same name that the corresponding digital page will have.
  2. Each file needs to be encoded in UTF-8 (not what Windows calls "Unicode", which is UTF-16). Enter diacritics as UTF-8 characters, using either the Windows Character Map or BabelMap. We recommend Notepad++ for creating each text file.
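
The two requirements above can be checked before upload. The following is a minimal Python sketch, not part of our established workflow: the function name check_transcript and the wording of its messages are hypothetical, and it only tests the file extension and whether the bytes decode as UTF-8.

```python
from pathlib import Path


def check_transcript(path: Path) -> list[str]:
    """Return a list of problems found in one transcript file (empty if none).

    Hypothetical helper: checks only the extension and the encoding.
    """
    problems = []

    # Requirement 1: plain text with a .txt extension (not .doc, .docx, .rtf).
    if path.suffix.lower() != ".txt":
        problems.append("extension is not .txt")

    data = path.read_bytes()

    # Requirement 2: UTF-8, not UTF-16. Files saved as Windows "Unicode"
    # normally begin with a UTF-16 byte order mark.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        problems.append("starts with a UTF-16 byte order mark")
    else:
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append(f"not valid UTF-8 (byte offset {err.start})")

    return problems
```

An empty list means the file passed both checks. A file that passes can still contain wrong characters, so this does not replace proofreading the diacritics.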

If the transcripts are in analog form, the process is this:

  1. Digitize the transcripts
  2. OCR the images. OCR captures only about 80% of plain text correctly, and it will garble words that contain diacritics.
  3. Review the OCR text, compare it with the original pages that were transcribed, and divide the content accordingly. Create a separate text file for each original handwritten page and give it the same name as that page, but ending with the extension ".ocr.txt" (see the file requirements above).
  4. If you correct the OCR errors in these text files, remove ".ocr" from the file names so that they end simply in ".txt".
  5. Unless the transcript images are of research value themselves, delete them (as well as the intermediate OCR content).
  6. Place the results in the Transcripts directory for upload.
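
The renaming in step 4 can be scripted once a file has been reviewed and corrected. This is a minimal Python sketch under assumptions: the function names corrected_name and mark_corrected are hypothetical, and the script has no way of knowing whether a file has actually been corrected, so run it only on files you have reviewed.

```python
from pathlib import Path

OCR_SUFFIX = ".ocr.txt"  # marks uncorrected OCR output, as in step 3


def corrected_name(path: Path) -> Path:
    """Map 'page.ocr.txt' to 'page.txt'; return other names unchanged."""
    if path.name.lower().endswith(OCR_SUFFIX):
        return path.with_name(path.name[: -len(OCR_SUFFIX)] + ".txt")
    return path


def mark_corrected(path: Path) -> Path:
    """Rename a reviewed OCR file, refusing to overwrite an existing file."""
    target = corrected_name(path)
    if target == path:
        return path  # already ends in plain .txt; nothing to do
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    path.rename(target)
    return target
```

For example, mark_corrected(Path("page_0001.ocr.txt")) renames the file to page_0001.txt, where page_0001 is a placeholder name. Refusing to overwrite is a deliberate choice, so that an earlier corrected transcript is never replaced by accident.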