For Creating Derivatives

From UA Libraries Digital Services Planning and Documentation


Open source software we are using to support our digital library

  • Tesseract-ocr for OCR creation. The command used is of this form:
 tesseract [OLDFILE]  [NEWFILE]
  • ImageMagick for JPEG creation from TIFF files. The command used with ImageMagick is of this form (this for the 2048 size):
 convert [OLDFILE] -strip -density 96 -resample 96x96  -resize 2048x2048 -filter Cubic -quiet [NEWFILE]
  • SoX for extracting segments of WAV files according to the start and end times specified in the ADL files (see Audio Decision List). For example:
 sox waves/u0008_0000001_0000001.wav temp/u0008_0000001_0000001_0001.wav trim 0 30 

This copies the first 30 seconds of the first WAV file into the second WAV file, without changing the original file.
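Since the ADL files give start and end times while the trim effect in the command above takes a start offset followed by a length, the end time has to be turned into a duration first. The following is a minimal sketch of that arithmetic only; the function name trim_args is hypothetical, and it assumes the times have already been reduced to whole seconds, which may not match the format actually stored in the ADL files.

```shell
# Hypothetical helper: turn a start time and an end time (whole seconds,
# an assumption) into the "start length" pair used by sox's trim effect.
trim_args() {
    printf '%s %s\n' "$1" "$(( $2 - $1 ))"
}

# Illustrative usage (the output file name is made up for this example):
#   sox waves/u0008_0000001_0000001.wav temp/u0008_0000001_0000001_0002.wav trim $(trim_args 30 75)
```

With a start of 30 and an end of 75, the helper yields a 45-second clip beginning 30 seconds into the file.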

  • LAME for some MP3 creation from WAV files. The command used with LAME is of this form:
 lame [OLDFILE] [NEWFILE] -v --noreplaygain 

Example: lame temp/u0008_0000001_0000001_0001.wav -v --noreplaygain --nohist

This encodes the WAV file as an MP3 file, with ReplayGain analysis disabled (--noreplaygain), the VBR histogram display disabled (--nohist), and the quality set to mid-range (-v, the same as -V4).

NOTE: With no [NEWFILE] given, as in the example, LAME places the MP3 in the same directory as the WAV file, so the MP3s will need to be moved afterwards.
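One way to avoid the separate move step would be to pass LAME an explicit output path, as the [NEWFILE] slot in the command form allows. The sketch below only builds that path; the function name mp3_path_for and the mp3 output directory are assumptions made for illustration, not part of the scripts described here.

```shell
# Hypothetical helper: build the MP3 path for a WAV file inside a chosen
# output directory, keeping the base name and swapping the extension.
mp3_path_for() {
    printf '%s/%s.mp3\n' "$2" "$(basename "$1" .wav)"
}

# Illustrative usage (the "mp3" directory is an assumption):
#   wav=temp/u0008_0000001_0000001_0001.wav
#   lame "$wav" "$(mp3_path_for "$wav" mp3)" -v --noreplaygain --nohist
```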

 *These settings were decided upon after consulting the Hydrogen Audio site.

The scripts currently used for creating derivatives

  • Windows Perl script that generates ADL (Audio Decision List) files from tab-delimited spreadsheet output and existing WAV files. This prepares for extracting clips from the WAV files to generate MP3 files that correspond to particular intellectual items, such as a song. The current version supports performances that span multiple WAV files by using page/subpage entries: metadata lines that do NOT describe resulting MP3 derivatives (they may describe the reel, or the entire performance) have no associated WAV file listed. The script checks that end times are not less than begin times, and that end times do not extend beyond the total length of the WAV file.
  • Linux Perl script (audioToAcumen) that pulls content from the mounted Windows drive, uses ImageMagick to create JPEGs from transcripts, and places them in a directory on the Linux server. It also picks up MODS files (adding persistent URLs) and FITS files, and uses tesseract-ocr to make OCR of the transcript images if a corresponding text file doesn't already exist. All derivatives go onto the Linux server and are deleted from the Windows drive. The transcripts (if not OCR) are fed into the Acumen database. This script also uses the ADL files and the WAV files to generate the MP3 files for delivery, using the commands described above.
  • Linux Perl script (moveAudioContent) that pulls the WAV files, TIFF files, log files, and exported spreadsheet (and collection XML if available) from the mounted Windows drive, and places them in a deposits directory on the Linux server to await archiving.
  • Linux Perl script (moreTranscriptOCR) that offers to create OCR for all existing transcripts in a given collection already online, or for selected transcript numbers across any number of collections.
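The two time checks described for the ADL-generating script can be sketched as follows. This is not the Perl script itself, only an illustration of the rule; the function name check_clip and the messages are hypothetical, and all times are assumed to be whole seconds.

```shell
# Hypothetical check for one clip: arguments are begin time, end time and
# total WAV length, all in whole seconds (an assumption).
check_clip() {
    if [ "$2" -lt "$1" ]; then
        echo "end before begin"
        return 1
    fi
    if [ "$2" -gt "$3" ]; then
        echo "end beyond file length"
        return 1
    fi
    echo "ok"
}
```

For example, check_clip 10 40 60 reports ok, while check_clip 10 70 60 reports that the end time runs past the end of the file.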


The information below is deprecated; we have replaced this workflow with Moving_Content_to_Acumen_and_Archive, which enables us to get content online without putting it into the storage archive first. One benefit is that we no longer have to wait for LOCKSS partners to finish harvesting our content before proceeding. Another is that putting the tools into the hands of Digital Services staff reduces the programming needed on the server and avoids bottlenecks.

After content has been moved to the long term archive, we need online derivatives for web access in a directory structure that mirrors the archive.

For images and audio

The following script runs through the archive, looking for TIFF files, WAV files, and transcripts.

    1. Transcripts are simply copied to the web-accessible directory, placed under a Transcripts directory at the level to which they apply.
    2. ImageMagick is used to create 2 image derivatives:
      1. a thumbnail, where the longest side is 128 pixels (file ends in _128.jpg)
      2. a large image, where the longest side is 2048 pixels (file ends in _2048)
    3. LAME is used to create an MP3 from each WAV file

The command used with ImageMagick is of this form (this for the 2048 size):

 convert [OLDFILE] -strip -density 96 -resample 96x96  -resize 2048x2048 -filter Cubic -quiet [NEWFILE]
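To illustrate how the two image derivatives relate to one master file, here is a small sketch that prints (rather than runs) one convert command per size. It makes several assumptions: that the 128 thumbnail uses the same options with only the -resize value changed, that the large image's name also ends in .jpg, and that the function names and paths are hypothetical.

```shell
# Hypothetical helper: derive the JPEG name for a TIFF master at a given
# longest-side size, e.g. name.tif + 128 -> name_128.jpg.
derivative_name() {
    name=$(basename "$1")
    printf '%s_%s.jpg\n' "${name%.*}" "$2"
}

# Print the convert command for each derivative size without executing it.
# Arguments: TIFF path, output directory.
print_convert_commands() {
    for size in 128 2048; do
        printf 'convert %s -strip -density 96 -resample 96x96 -resize %sx%s -filter Cubic -quiet %s/%s\n' \
            "$1" "$size" "$size" "$2" "$(derivative_name "$1" "$size")"
    done
}
```

Printing the commands first makes it possible to review them for a collection before any files are written.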

The command used with LAME is of this form:

 lame [OLDFILE] [NEWFILE] -V4 --noreplaygain -S

Here's the Perl script: File:Copychange.txt

NOTE: While creating derivatives in this manner, we discovered that the software that comes with the Captureback overhead creates two TIFFs inside each TIFF file: a thumbnail and the full-size master image. By default, ImageMagick creates a JPEG from both TIFFs when the above command is run, appending "-0" to one filename and "-1" to the other. Examples of these can be seen here: [1]. The files ending in "-1" were created from the thumbnail, so they are blurry. We developed an additional script (called "repair") which hunts through directories for files thus named, deletes the ones ending in "-1.jpg", and renames the ones ending in "-0.jpg" to remove the "-0" addition. Here's the Perl script: File:Repair.txt
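The cleanup performed by the repair script can be sketched in shell as follows. This is an illustration of the described behavior, not the contents of File:Repair.txt; the function name repair_dir is hypothetical, and the sketch assumes no legitimate file names end in "-0.jpg" or "-1.jpg" and that file names contain no newlines.

```shell
# Hypothetical cleanup: under the given directory, delete JPEGs made from
# the embedded thumbnail and strip the "-0" suffix from the full-size ones.
repair_dir() {
    # Remove the blurry files ending in "-1.jpg".
    find "$1" -type f -name '*-1.jpg' -exec rm -f {} +
    # Rename "name-0.jpg" to "name.jpg".
    find "$1" -type f -name '*-0.jpg' | while IFS= read -r f; do
        mv "$f" "${f%-0.jpg}.jpg"
    done
}
```

Because the deletion is irreversible, it would be prudent to try this on a copy of a directory first.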

For OCR text:

We're more selective. We don't want to OCR pages that are mostly images -- our guideline is that there must be at least 50% textual content on a page before we will consider OCRing it. We're using the open source tesseract-ocr on the command line.

Given a set of collection names, the following Perl script goes through /srv/archive, locates TIFF files, and checks whether OCR files already exist online in /srv/www/htdocs/content. If they do not, it creates directories for them, uses tesseract-ocr to create the OCR derivatives, and places them there.

The command used with tesseract-ocr is of this form:

 tesseract [OLDFILE]  [NEWFILE]
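The "skip if already online" logic can be sketched as below. This is not File:OcrIt.txt; it assumes the web tree mirrors the archive path for path and that the OCR text sits beside the other derivatives, neither of which is confirmed here. The function names are hypothetical. The sketch relies on the fact that tesseract appends ".txt" to the output base it is given, and it prints the steps rather than running them.

```shell
ARCHIVE_ROOT=/srv/archive
WEB_ROOT=/srv/www/htdocs/content

# Hypothetical mapping from a TIFF in the archive to the tesseract output
# base in the web tree (same relative path, extension dropped).
ocr_base_for() {
    rel=${1#"$ARCHIVE_ROOT"/}
    printf '%s/%s\n' "$WEB_ROOT" "${rel%.*}"
}

# Print the steps for one TIFF, or nothing if its OCR text already exists.
plan_ocr() {
    base=$(ocr_base_for "$1")
    if [ ! -e "$base.txt" ]; then
        printf 'mkdir -p %s\n' "$(dirname "$base")"
        printf 'tesseract %s %s\n' "$1" "$base"
    fi
}
```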

Here's the script: File:OcrIt.txt
