Difference between revisions of "Harvesting Transcriptions"

From UA Libraries Digital Services Planning and Documentation

Revision as of 16:33, 4 April 2012

There are two ways to harvest transcriptions. The first collects content submitted so far during a rotation in the transcription software (content may sit there for a whole semester, after all, and we may want to know what progress has been made). This first method does NOT record what has been collected in the tracking database, because the content is not yet finalized.

The second method DOES record what has been collected in the tracking database, in preparation for removing data from the transcription software. Both methods share the same preliminary steps:

On libcontent1:


1) change directory to /srv/scripts/transcripts/mediawiki

2) move all content from ./transcripts into ./allTranscripts.

3) run: extractText (to pull out everything), or run: extractSelectedText {collection_id} {box_number} ... to only pull out content for a particular collection (and box, if desired).

Both these scripts dump the text into the ./transcripts directory. The second script accesses the InfoTrack database to determine which file names to seek.

4) check results in ./transcripts
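
The first two preliminary steps can be sketched as a small shell function. This is only an illustration, not the procedure's official script: the function name archive_previous and the BASE variable are hypothetical, and creating ./allTranscripts when it is missing is an assumption. Step 3 appears only as comments, because the source does not say whether the extract scripts are invoked by bare name or with a ./ prefix.

```shell
#!/bin/sh
# Working directory from step 1. Allowing BASE to be overridden is an
# assumption, made so the sketch can be tried outside libcontent1.
BASE="${BASE:-/srv/scripts/transcripts/mediawiki}"

# Hypothetical helper covering steps 1 and 2.
# The body runs in a subshell, so the caller's working directory is unchanged.
archive_previous() (
    # Step 1: change to the working directory.
    cd "$BASE" || exit 1

    # Assumption: create the archive directory if it does not exist yet.
    mkdir -p ./allTranscripts

    # Step 2: move everything out of ./transcripts into ./allTranscripts.
    # (The glob skips hidden files.)
    for f in ./transcripts/*; do
        [ -e "$f" ] || continue   # ./transcripts is already empty
        mv "$f" ./allTranscripts/
    done
)

# Step 3 then runs one of the following in $BASE (invocation form assumed):
#   extractText
#   extractSelectedText {collection_id} {box_number} ...
```

After calling archive_previous, ./transcripts should be empty, so whatever the extract script writes there in step 3 is easy to check in step 4.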

Harvesting content during a rotation

5) run: transLiveAcumenOnly

This copies each file in the ./transcripts directory into Acumen in the correct location (creating a Transcripts directory if needed), and also places a copy in the deposits directory (versioning it if necessary) and in the Special Collections share drive area under Digital_Program_files/Transcriptions. It does NOT update the tracking database, since the content is still being left in the transcription software at this point.


Harvesting content at the end of a rotation

5) If you are only pulling out selected content (that is, you ran extractSelectedText in step 3 above), run: transLiveSelected {collection_id} {box_number} ... Otherwise, run: transLive, which processes ALL the content in the ./transcripts directory; this assumes you ran extractText in step 3 above.

This copies the files in the ./transcripts directory into Acumen in the correct location (creating a Transcripts directory if needed), and also places a copy in the deposits directory (versioning it if necessary) and in the Special Collections share drive area under Digital_Program_files/Transcriptions. transLiveSelected ONLY processes the files listed in InfoTrack as belonging to the collection (and box, if specified) given on the command line. These scripts DO update the tracking database, in preparation for deletion of content from the transcription software.
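
Because the script chosen in step 5 has to match the one used in step 3, it can help to write the pairing down before running anything. The sketch below only prints which two scripts go together for a given situation and does not execute them. The function name plan_harvest and its during/end argument are hypothetical; the script names and their arguments are the ones given in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: print the step 3 and step 5 commands that belong together.
# Usage: plan_harvest during|end [collection_id [box_number ...]]
plan_harvest() {
    mode="$1"
    if [ $# -gt 0 ]; then
        shift
    fi

    # Step 3: selected extraction if a collection (and box) was given,
    # otherwise extract everything.
    if [ $# -gt 0 ]; then
        extract="extractSelectedText $*"
    else
        extract="extractText"
    fi

    # Step 5 depends on whether the rotation is still running.
    case "$mode" in
        during)
            # Does NOT update the tracking database.
            live="transLiveAcumenOnly"
            ;;
        end)
            # These DO update the tracking database.
            if [ $# -gt 0 ]; then
                live="transLiveSelected $*"
            else
                live="transLive"
            fi
            ;;
        *)
            echo "usage: plan_harvest during|end [collection_id [box_number ...]]" >&2
            return 2
            ;;
    esac

    printf '%s\n' "$extract" "$live"
}
```

For example, plan_harvest end COLL 3 (with COLL standing in for a real collection ID) prints extractSelectedText COLL 3 followed by transLiveSelected COLL 3, while plan_harvest end with no further arguments prints extractText followed by transLive.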