From UA Libraries Digital Services Planning and Documentation
Revision as of 09:54, 3 October 2012 by Jlderidder
For translations or transcriptions to be in a form that we can use:
- We need a separate plain-text file (.txt extension, not .doc, .docx, or .rtf) for each page, with the same name the digital page will have.
- Each file needs to be encoded in UTF-8 (not Windows "Unicode", which is UTF-16). For the diacritics, use the UTF-8 encodings from either Windows Character Map or BabelMap. We recommend Notepad++ for creating each text file.
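The two requirements above can be checked before files are handed over. The following is a minimal Python sketch, not part of our established workflow: the function name `check_transcript` and the problem messages are hypothetical, and the UTF-16 test assumes the file starts with a byte order mark, which Windows editors normally write.

```python
import codecs
from pathlib import Path


def check_transcript(path):
    """Return a list of problems found with one transcript file (empty if none)."""
    path = Path(path)
    problems = []

    # Requirement 1: plain text with a .txt extension (not .doc, .docx, or .rtf).
    if path.suffix.lower() != ".txt":
        problems.append("extension is not .txt")

    data = path.read_bytes()

    # Requirement 2: UTF-8. A UTF-16 byte order mark suggests the file was
    # saved as "Unicode" in a Windows editor (assumption: a BOM is present).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        problems.append("looks like UTF-16, not UTF-8")
        return problems

    # Otherwise, try to decode the bytes strictly as UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 (byte offset {err.start})")

    return problems
```

Calling `check_transcript` on each file in a folder and printing any non-empty result gives a quick list of files that need to be re-saved. Note that a file passing this check may still contain wrong characters; it only confirms the encoding and extension.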
If the transcripts are in analog form, the process is this:
- Digitize the transcripts
- OCR the images. OCR captures only about 80% of plain text correctly, and will mangle words that contain diacritics.
- Review the OCR text, compare it with the original files that were transcribed, and divide the content accordingly. You will need to create a separate text file for each original handwritten page, and name it the same as the original handwritten page, except with the extension ".ocr.txt" (see the file naming requirement above).
- Once you have corrected the OCR errors in a text file, remove the ".ocr" from its file extension, so that it simply ends in ".txt".
- Unless the transcript images are of research value themselves, delete them (as well as the intermediate OCR content).
- Place the results in the Transcripts directory for upload.
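The renaming step in this process can be scripted. The sketch below is one possible approach in Python, not a tool we provide: the function names `mark_corrected` and `uncorrected` are hypothetical, and it assumes all text files for a batch sit in a single working folder.

```python
from pathlib import Path

OCR_SUFFIX = ".ocr.txt"


def mark_corrected(path):
    """Rename 'name.ocr.txt' to 'name.txt' and return the new path."""
    path = Path(path)
    if not path.name.endswith(OCR_SUFFIX):
        raise ValueError(f"{path.name} does not end with {OCR_SUFFIX}")

    # Strip the full ".ocr.txt" suffix, then add back plain ".txt".
    target = path.with_name(path.name[: -len(OCR_SUFFIX)] + ".txt")

    # Refuse to overwrite an existing corrected file.
    if target.exists():
        raise FileExistsError(f"{target.name} already exists")

    path.rename(target)
    return target


def uncorrected(directory):
    """List the names of files in a folder that are still marked as raw OCR."""
    return sorted(p.name for p in Path(directory).glob("*" + OCR_SUFFIX))
```

Run `mark_corrected` only on files that have actually been proofread. Before placing results in the Transcripts directory, `uncorrected` shows which files still carry the ".ocr.txt" extension.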