For Creating Derivatives

From UA Libraries Digital Services Planning and Documentation
Revision as of 09:41, 19 November 2009 by Jlderidder (Talk | contribs)

After content has been moved to the long-term archive, we need online derivatives for web access, stored in a directory structure that mirrors the archive.

For images and audio

The following script runs through the archive, looking for TIFF and WAVE files and for transcripts.

    1. Transcripts are simply copied to the web-accessible directory, placed under a Transcripts directory at the level to which they apply.
    2. ImageMagick is used to create 3 image derivatives:
      1. a thumbnail, where the longest side is 128 pixels (file ends in _128.jpg)
      2. a mid-sized image (the default for delivery), where the longest side is 512 pixels (file ends in _512.jpg)
      3. a large image, where the longest side is 2048 pixels (file ends in _2048.jpg)
    3. LAME is used to create an mp3 from each wave file
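The naming and mirroring rules above can be sketched in a few lines of Python. This is only an illustration, not the linked Perl script. The two root paths are assumed to be the same ones named in the OCR section, the file extensions checked (.tif, .tiff, .wav) are assumptions, and so is the idea that all three image derivatives are JPEGs and that the mp3 keeps the stem of the wave file. Transcripts are left out because the page does not say how they are identified.

```python
from pathlib import PurePosixPath

# Assumed roots; this page names these paths only in the OCR section.
ARCHIVE_ROOT = PurePosixPath("/srv/archive")
WEB_ROOT = PurePosixPath("/srv/www/htdocs/content")

IMAGE_SIZES = (128, 512, 2048)  # longest side of each image derivative, in pixels


def derivative_paths(archive_file):
    """Return the web-side derivative paths for one archived master file."""
    source = PurePosixPath(archive_file)
    # Mirror the archive's directory structure under the web root.
    web_dir = WEB_ROOT / source.parent.relative_to(ARCHIVE_ROOT)
    suffix = source.suffix.lower()
    if suffix in (".tif", ".tiff"):
        # One JPEG per size, named <stem>_<size>.jpg
        return [web_dir / f"{source.stem}_{size}.jpg" for size in IMAGE_SIZES]
    if suffix == ".wav":
        # Assumption: the mp3 keeps the stem of the wave file.
        return [web_dir / f"{source.stem}.mp3"]
    return []  # anything else is not handled in this sketch
```

For a hypothetical master at /srv/archive/coll1/item1/page_0001.tif, this yields three paths under /srv/www/htdocs/content/coll1/item1/, one for each size.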

The command used with ImageMagick is of this form (this for the 2048 size):

 convert [OLDFILE] -strip -density 96 -resample 96x96  -resize 2048x2048 -filter Cubic -quiet [NEWFILE]
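As a rough sketch, the same command can be assembled for any of the three sizes. The function names here are hypothetical and the options are copied in the order shown above. A geometry such as 2048x2048 with no trailing flag makes ImageMagick fit the image inside that box while keeping its aspect ratio, which is what makes the longest side come out at the target size.

```python
import subprocess


def convert_command(old_file, new_file, size):
    """Build the ImageMagick argument list for one derivative size."""
    return [
        "convert", old_file,
        "-strip",                       # drop profiles and comments
        "-density", "96",
        "-resample", "96x96",
        "-resize", f"{size}x{size}",    # fit within size x size, aspect ratio kept
        "-filter", "Cubic",
        "-quiet",
        new_file,
    ]


def make_image_derivative(old_file, new_file, size):
    """Run convert; raises CalledProcessError if ImageMagick reports failure."""
    subprocess.run(convert_command(old_file, new_file, size), check=True)
```

ImageMagick reads its options left to right, and settings normally affect only the operators that come after them, so the position of -filter relative to -resize may matter. The sketch keeps the documented order rather than guessing at the intent.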

The command used with LAME is of this form:

 lame [OLDFILE] [NEWFILE] -V4 --noreplaygain -S
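A comparable sketch for the audio step, again with hypothetical function names. In LAME, -V4 selects variable bitrate encoding at quality level 4, --noreplaygain skips the ReplayGain analysis, and -S suppresses the progress display.

```python
import subprocess


def lame_command(old_file, new_file):
    """Build the LAME argument list in the form shown above."""
    return [
        "lame", old_file, new_file,
        "-V4",              # variable bitrate, quality level 4
        "--noreplaygain",   # skip ReplayGain analysis
        "-S",               # silent: no progress report
    ]


def make_mp3(old_file, new_file):
    """Run LAME; raises CalledProcessError if the encoder reports failure."""
    subprocess.run(lame_command(old_file, new_file), check=True)
```

Passing the arguments as a list rather than a single shell string means file names containing spaces need no extra quoting.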

Here's the Perl script: File:Copychange.txt

For OCR text

For OCR we're more selective. We don't want to OCR files that are mainly images: our guideline is that a page must have at least 50% textual content before we will consider OCRing it. We're using the open source tesseract-ocr on the command line.

Given a set of collection names, the following Perl script goes through /srv/archive and locates TIFF files. For each one it checks whether an OCR file already exists online in /srv/www/htdocs/content; if not, it creates the needed directories, runs tesseract-ocr to create the OCR derivative, and places it there.
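The flow just described might look roughly like this in Python. It is a sketch under several assumptions, not a port of the linked script: each collection is taken to be a directory directly under the archive root, TIFFs are recognized by a .tif or .tiff extension, and the OCR file is assumed to share the TIFF's name with a .txt extension. The function names are hypothetical. The one firm detail is that the tesseract command takes an image and an output base name and writes the text to that base name plus .txt.

```python
import subprocess
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}  # assumed extensions for archived masters


def pending_ocr(collections, archive_root="/srv/archive",
                web_root="/srv/www/htdocs/content"):
    """List (tiff, output_base) pairs that have no OCR text online yet."""
    archive_root, web_root = Path(archive_root), Path(web_root)
    jobs = []
    for collection in collections:
        for tiff in sorted((archive_root / collection).rglob("*")):
            if tiff.suffix.lower() not in TIFF_SUFFIXES:
                continue
            # Mirror the archive path under the web root, swapping the extension.
            text_file = web_root / tiff.relative_to(archive_root).with_suffix(".txt")
            if not text_file.exists():
                # tesseract appends ".txt" to the base name itself
                jobs.append((tiff, text_file.with_suffix("")))
    return jobs


def run_ocr(jobs):
    """Create missing directories, then OCR each pending TIFF."""
    for tiff, output_base in jobs:
        output_base.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["tesseract", str(tiff), str(output_base)], check=True)
```

Separating the listing step from the OCR step makes it easy to review what would be processed before any tesseract run starts. The 50% text guideline is an editorial decision about which collections to pass in, and the sketch does not attempt to automate it.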

Here's the script: File:OcrIt.txt
