Harvesting Transcriptions

From UA Libraries Digital Services Planning and Documentation

There are two ways to harvest transcriptions. The first collects the content submitted so far during a rotation in the transcription software (a rotation may last a whole semester, so we may want to check on progress along the way). This first method does NOT record what has been collected in the tracking database, because the content is not yet final.

The second method DOES record what has been collected in the tracking database, in preparation for removing data from the transcription software. Both methods share the same preliminary steps:

On libcontent:


1) change directory to /srv/scripts/transcripts/mediawiki

2) move all content from ./transcripts into ./allTranscripts.

3) run: extractText (to pull out everything), or run: extractSelectedText {collection_id} {box_number} ... to only pull out content for a particular collection (and box, if desired).

Both of these scripts write the extracted text into the ./transcripts directory. The second script (extractSelectedText) queries the InfoTrack database to determine which file names to look for.

4) check results in ./transcripts
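As a rough illustration, steps 1, 2 and 4 above could be wrapped in two small shell helpers. This is only a sketch: the function names archive_previous_run and count_transcripts and the BASE variable are hypothetical and are not part of the existing scripts, and it assumes a find that supports -mindepth and -maxdepth (GNU or BSD). Step 3 is left out because this page does not say how extractText and extractSelectedText are launched.

```shell
#!/bin/sh
# Hypothetical helpers for preliminary steps 1, 2 and 4.
# BASE defaults to the working directory named in step 1.
BASE="${BASE:-/srv/scripts/transcripts/mediawiki}"

# Step 2: move everything from ./transcripts into ./allTranscripts.
# Runs in a subshell so the caller's working directory is unchanged.
# Note: mv replaces any file of the same name already in allTranscripts.
archive_previous_run() {
    (
        cd "$BASE" || exit 1
        mkdir -p allTranscripts
        # find copes with an empty ./transcripts, unlike "mv transcripts/* ..."
        find transcripts -mindepth 1 -maxdepth 1 -exec mv {} allTranscripts/ \;
    )
}

# Step 4: print how many files are currently in ./transcripts.
count_transcripts() {
    find "$BASE/transcripts" -type f | wc -l | tr -d '[:space:]'
}
```

A possible sequence would be to call archive_previous_run, run the extraction script from step 3 by hand, and then call count_transcripts to confirm that ./transcripts is no longer empty before going on to step 5.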

Harvesting content during a rotation

5) run: transLiveAcumenOnly

This copies each file in the ./transcripts directory into Acumen in the correct location (creating a Transcripts directory if needed). It also places a copy in the deposits directory (versioning it if necessary) and another in the Special Collections share drive area under Digital_Program_files/Transcriptions. It does NOT update the tracking database, since the content is still being left in the transcription software at this point.


Harvesting content at the end of a rotation

5) If you are pulling out only selected content (that is, you ran extractSelectedText in step 3 above), run transLiveSelected {collection_id} {box_number} ... Otherwise, run transLive, which processes ALL the content in the ./transcripts directory (this assumes you ran extractText in step 3 above).

This copies the files in the ./transcripts directory into Acumen in the correct location (creating a Transcripts directory if needed). It also places a copy in the deposits directory (versioning it if necessary) and another in the Special Collections share drive area under Digital_Program_files/Transcriptions. The share drive copy overwrites previous transcriptions; the deposits copy is versioned for archiving.


transLiveSelected ONLY processes the files listed in InfoTrack as belonging to the collection (and box, if specified) given on the command line. This script DOES update the tracking database, in preparation for deletion of content from the transcription software.

NOTE: ALL versions are copied to the Special Collections area, whereas only the first and the most recent are archived in LOCKSS. If the most recent is not the best, the archivists will need to request archiving of the preferred version.
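The pairing between the step 3 extraction script and the step 5 harvest script can be summarised in a short shell sketch. The helper names extract_command and harvest_command, and the phase labels "during" and "end", are hypothetical and chosen only for illustration. The helpers print the command to run rather than executing it, because this page does not say how the scripts are launched on libcontent.

```shell
#!/bin/sh
# Hypothetical helpers that print which script to run; they execute nothing.

# Step 3: no arguments means extract everything; otherwise pass the
# collection_id (and optional box_number) through to extractSelectedText.
extract_command() {
    if [ "$#" -eq 0 ]; then
        echo "extractText"
    else
        echo "extractSelectedText $*"
    fi
}

# Step 5: the first argument is the phase, "during" or "end" (illustrative labels).
# Any remaining arguments are the collection_id and optional box_number.
harvest_command() {
    phase="${1:-}"
    [ "$#" -gt 0 ] && shift
    case "$phase" in
        during)
            # Mid-rotation harvest: the tracking database is not updated.
            echo "transLiveAcumenOnly"
            ;;
        end)
            # End-of-rotation harvest: match the script used in step 3.
            if [ "$#" -eq 0 ]; then
                echo "transLive"
            else
                echo "transLiveSelected $*"
            fi
            ;;
        *)
            echo "unknown phase: $phase" >&2
            return 1
            ;;
    esac
}
```

For example, harvest_command end COLLECTION_ID BOX prints the transLiveSelected command with those two placeholder arguments, which matches a step 3 run of extractSelectedText with the same arguments.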
