Transcription stats

From UA Libraries Digital Services Planning and Documentation
Revision as of 12:51, 11 April 2012 by Jlderidder (Talk | contribs)

All of the below is on

Update on success of this round of transcriptions

All the TransLive scripts create an output file in the ./output directory under /srv/scripts/transcripts/mediawiki/. These output files are dated, tab-delimited text files intended to be opened in Excel or other spreadsheet software.

The columns are:

  1. "CollID" for the collection identifier,
  2. "NumItemsTranscribed" for the number of items transcribed for that collection thus far in this round, and
  3. "NumTranscriptions" for the total number of transcriptions for the collection.
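If you would rather inspect one of these files from a script than from a spreadsheet, a minimal sketch along the following lines may help. It assumes the file starts with a header row carrying the three column names listed above; the sample collection identifier and counts are made up for illustration, and nothing here is part of the TransLive scripts themselves.

```python
import csv
import io


def read_translive_stats(handle):
    """Parse tab-delimited rows into dicts, converting the two counts to int.

    Assumes the first row is a header with the column names listed above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "CollID": row["CollID"],
            "NumItemsTranscribed": int(row["NumItemsTranscribed"]),
            "NumTranscriptions": int(row["NumTranscriptions"]),
        })
    return rows


# Made-up sample standing in for a real output file.
sample = (
    "CollID\tNumItemsTranscribed\tNumTranscriptions\n"
    "u0003_0000001\t2\t5\n"
)
stats = read_translive_stats(io.StringIO(sample))
for entry in stats:
    print(entry["CollID"], entry["NumItemsTranscribed"], entry["NumTranscriptions"])
```

To read a real file, pass an open file handle (opened with newline="") in place of the io.StringIO sample.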

A collection with multi-page items may have many more transcriptions than items (up to one per page). Only the latest transcription of each page is counted, as that is what is retrieved when you run these scripts. Note that if you selectively run TransLive on one collection after another, the output files will all be named with today's date and will therefore overwrite one another (at present), so you may want to retrieve each output file before processing the next collection.
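One way to keep a same-day output file from being overwritten is to copy it to a collection-specific name right after each run. The sketch below is only an illustration: the helper keep_output, the file name 2012-04-11.txt, the "saved" directory, and the collection identifier are all hypothetical, since the actual naming pattern of the output files is not stated here.

```python
import shutil
import tempfile
from pathlib import Path


def keep_output(output_file, coll_id, archive_dir):
    """Copy a dated output file to a name that also carries the collection ID.

    Hypothetical helper: the real output file name pattern is an assumption.
    """
    src = Path(output_file)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{src.stem}_{coll_id}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves the modification time
    return dest


# Demonstration in a temporary directory with a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "output" / "2012-04-11.txt"
    output_file.parent.mkdir()
    output_file.write_text("CollID\tNumItemsTranscribed\tNumTranscriptions\n")
    saved = keep_output(output_file, "u0003_0000001", Path(tmp) / "saved")
    saved_name = saved.name
    saved_exists = saved.exists()

print(saved_name)  # 2012-04-11_u0003_0000001.txt
```

Calling keep_output after each collection leaves one copy per collection, whatever the next run does to the original file.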

Total Transcripts in u0003 and u0002 thus far

In /srv/scripts/transcripts/ there are two scripts of interest: numPagesAndTrans and getTransCount.

Run numPagesAndTrans first. It traverses the directories of u0003 (Hoole Manuscripts) and u0002 (Hoole Rare Books) and counts the number of pages per item, and the number of transcripts (and OCR files) per item and per collection. This information is stored in the numItemPages table of the InfoTrack database.

getTransCount uses this information, so it should be run second. It reports the counts gathered by numPagesAndTrans and also pulls the title of each collection from the allColls table, writing the results to the ./output directory as a dated, tab-delimited text file intended to be opened in Excel or other spreadsheet software.
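The required ordering can be enforced with a small wrapper, sketched below. It assumes both scripts are executable, take no arguments, and are run from /srv/scripts/transcripts/; none of that is confirmed here, so treat the invocation as a placeholder. The runner parameter and the stand-in fake_runner are hypothetical and exist only so the ordering can be shown without touching the real scripts.

```python
import subprocess

SCRIPT_DIR = "/srv/scripts/transcripts"
# Order matters: getTransCount reads what numPagesAndTrans stored.
SCRIPTS = ["numPagesAndTrans", "getTransCount"]


def run_report(runner=subprocess.run, script_dir=SCRIPT_DIR):
    """Run the two scripts in order, stopping if the first one fails.

    Assumes the scripts need no arguments (an unverified assumption).
    """
    for name in SCRIPTS:
        # With subprocess.run, check=True raises CalledProcessError on a
        # nonzero exit status, so getTransCount never runs on stale counts.
        runner(["./" + name], cwd=script_dir, check=True)


# Dry run with a stand-in runner that only records what would be executed.
calls = []


def fake_runner(cmd, cwd, check):
    calls.append(cmd[0])


run_report(runner=fake_runner)
print(calls)  # ['./numPagesAndTrans', './getTransCount']
```

Calling run_report() with no arguments would invoke the real scripts through subprocess.run.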

The columns of information are:

  1. "Collnum" for the collection identifier;
  2. "Title" for the collection title;
  3. "NumItems" for the number of items in the collection;
  4. "NumItemsTranscribed and/Or OCRd" for the number of items transcribed or OCRd in that collection;
  5. "NumPages" for the total page count for that collection;
  6. "NumTranscriptions" and
  7. "NumOCRd" for a breakdown of the "NumItemsTranscribed and/Or OCRd" column.

Some items may have been OCR'd and then transcribed (perhaps the OCR has not been removed), but for the most part these columns will give you a sense of where the gaps are in terms of what should still be transcribed. Generally, if we can't OCR it, it should be transcribed.
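To spot the gaps without a spreadsheet, the report can be reduced to the collections that still have items with neither a transcription nor OCR. The sketch below assumes the file has a header row with exactly the column names listed above, and it takes "NumItems" minus "NumItemsTranscribed and/Or OCRd" as the number of items still lacking text; that reading of the columns is an assumption, and the sample rows are made up.

```python
import csv
import io

DONE_COL = "NumItemsTranscribed and/Or OCRd"


def find_gaps(handle):
    """Return (Collnum, Title, items remaining) tuples, largest gap first."""
    reader = csv.DictReader(handle, delimiter="\t")
    gaps = []
    for row in reader:
        remaining = int(row["NumItems"]) - int(row[DONE_COL])
        if remaining > 0:
            gaps.append((row["Collnum"], row["Title"], remaining))
    gaps.sort(key=lambda gap: gap[2], reverse=True)
    return gaps


# Made-up sample standing in for a real getTransCount output file.
header = ["Collnum", "Title", "NumItems", DONE_COL,
          "NumPages", "NumTranscriptions", "NumOCRd"]
sample_rows = [
    ["u0003_0000001", "Example Collection A", "10", "4", "30", "3", "1"],
    ["u0003_0000002", "Example Collection B", "5", "5", "12", "5", "0"],
]
sample = "\n".join("\t".join(fields) for fields in [header] + sample_rows) + "\n"

gaps = find_gaps(io.StringIO(sample))
for collnum, title, remaining in gaps:
    print(collnum, title, remaining)
```

Only the first sample collection is reported, because the second has every item accounted for.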