For Creating Derivatives
After content has been moved to the long-term archive, we need online derivatives for web access, stored in a directory structure that mirrors the archive.
For images and audio
The following script runs through the archive, looking for TIFF and WAVE files and for transcripts.
- Transcripts are simply copied to the web-accessible directory, placed under a Transcripts directory at the level to which they apply.
- ImageMagick (http://www.imagemagick.org/) is used to create three image derivatives:
  - a thumbnail, where the longest side is 128 pixels (file ends in _128.jpg)
  - a mid-sized image (the default for delivery), where the longest side is 512 pixels (file ends in _512.jpg)
  - a large image, where the longest side is 2048 pixels (file ends in _2048.jpg)
- LAME (http://lame.sourceforge.net/) is used to create an MP3 from each WAVE file
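As a rough illustration of the naming scheme above, the following Python sketch maps one archive file to the derivative paths it would produce online. It is not the Perl script referenced on this page. The function name `derivative_paths`, the recognized file extensions, the example collection path, and the `.jpg` ending on the 2048 file are assumptions. The roots in the example are the ones this page gives for OCR and are assumed here to apply to images and audio as well. Transcripts are left out because the page does not say how they are identified.

```python
from pathlib import PurePosixPath

# Longest-side sizes for the three image derivatives described above.
IMAGE_SIZES = (128, 512, 2048)


def derivative_paths(source, archive_root, web_root):
    """Return the web-side derivative paths for one archive file.

    The directory below web_root mirrors the directory below archive_root.
    The extensions checked here are assumptions, not taken from the script.
    """
    source = PurePosixPath(source)
    relative = source.relative_to(archive_root)
    target_dir = PurePosixPath(web_root) / relative.parent
    suffix = source.suffix.lower()
    if suffix in (".tif", ".tiff"):
        # One JPEG per size, e.g. page1_128.jpg, page1_512.jpg, page1_2048.jpg
        return [target_dir / f"{source.stem}_{size}.jpg" for size in IMAGE_SIZES]
    if suffix in (".wav", ".wave"):
        # One MP3 per WAVE file
        return [target_dir / f"{source.stem}.mp3"]
    return []


if __name__ == "__main__":
    # Hypothetical collection path, for illustration only.
    for path in derivative_paths(
        "/srv/archive/coll1/item1/page1.tif",
        "/srv/archive",
        "/srv/www/htdocs/content",
    ):
        print(path)
```

Keeping the path logic separate from the conversion commands makes it easy to check the mirrored layout without touching any files.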
The command used with ImageMagick is of this form (this for the 2048 size):
convert [OLDFILE] -strip -density 96 -resample 96x96 -resize 2048x2048 -filter Cubic -quiet [NEWFILE]
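The same command can be parameterized for all three sizes. The Python sketch below only builds the argument list in the order shown above; the helper name `convert_command` is an assumption, and this is not how the Perl script is necessarily written. A `-resize` geometry such as 2048x2048 fits the image inside that box while preserving its aspect ratio, which is what makes the longest side come out at the stated size.

```python
import subprocess


def convert_command(old_file, new_file, size):
    """Build the ImageMagick argument list for one derivative size."""
    box = f"{size}x{size}"  # bounding box; aspect ratio is preserved
    return [
        "convert", str(old_file),
        "-strip",
        "-density", "96",
        "-resample", "96x96",
        "-resize", box,
        "-filter", "Cubic",
        "-quiet",
        str(new_file),
    ]


def make_image_derivative(old_file, new_file, size):
    """Run convert; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(convert_command(old_file, new_file, size), check=True)


if __name__ == "__main__":
    # Print the three commands without running them.
    for size in (128, 512, 2048):
        print(" ".join(convert_command("page1.tif", f"page1_{size}.jpg", size)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.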
The command used with LAME is of this form:
lame [OLDFILE] [NEWFILE] -V4 --noreplaygain -S
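For reference, `-V4` selects a variable bit rate quality level, `--noreplaygain` skips ReplayGain analysis, and `-S` suppresses progress output. A minimal Python sketch of a wrapper around this command follows; the names `lame_command` and `encode_mp3` and the `dry_run` parameter are hypothetical and are not taken from the Perl script.

```python
import subprocess


def lame_command(old_file, new_file):
    """Build the LAME argument list in the form shown above."""
    return ["lame", str(old_file), str(new_file), "-V4", "--noreplaygain", "-S"]


def encode_mp3(old_file, new_file, dry_run=False):
    """Encode one WAVE file to MP3 and return the command that was used.

    With dry_run=True the command is returned without being executed.
    """
    cmd = lame_command(old_file, new_file)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if lame exits non-zero
    return cmd


if __name__ == "__main__":
    print(" ".join(encode_mp3("interview.wav", "interview.mp3", dry_run=True)))
```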
Here's the Perl script: File:Copychange.txt
For OCR text
We're more selective here, because we don't want to OCR pages that are mostly pictures. Our guideline is that a page must have at least 50% textual content before we will consider OCRing it. We're using the open source tesseract-ocr (http://sourceforge.net/projects/tesseract-ocr/) on the command line.
Given a set of collection names, the following Perl script goes through /srv/archive and locates TIFF files. For each one, it checks whether an OCR file already exists online under /srv/www/htdocs/content. If not, it creates the directories needed, runs tesseract-ocr to create the OCR derivative, and places it there.
The command used with tesseract-ocr is of this form:
tesseract [OLDFILE] [NEWFILE]
Here's the script: File:OcrIt.txt
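The steps described above (walk the archive, skip files that already have OCR output online, create directories, run tesseract) can be sketched in Python as follows. This is an illustration under stated assumptions, not a translation of OcrIt.txt: the function names `plan_ocr` and `run_ocr`, the file extensions checked, and the choice to store each text file at the mirrored path with the TIFF's base name are all assumptions. It relies on the fact that tesseract treats its second argument as an output base and writes the text to that base plus `.txt`.

```python
import os
import subprocess
from pathlib import Path

ARCHIVE_ROOT = "/srv/archive"
WEB_ROOT = "/srv/www/htdocs/content"


def plan_ocr(collections, archive_root=ARCHIVE_ROOT, web_root=WEB_ROOT):
    """Yield (tiff_path, output_base) for TIFFs with no OCR text online yet."""
    for name in collections:
        for dirpath, _dirs, files in os.walk(Path(archive_root) / name):
            for filename in sorted(files):
                if not filename.lower().endswith((".tif", ".tiff")):
                    continue
                tiff = Path(dirpath) / filename
                # Mirror the archive path under the web root, minus the extension.
                base = Path(web_root) / tiff.relative_to(archive_root).with_suffix("")
                # tesseract appends ".txt" to the output base it is given.
                existing = base.parent / (base.name + ".txt")
                if not existing.exists():
                    yield tiff, base


def run_ocr(collections, archive_root=ARCHIVE_ROOT, web_root=WEB_ROOT):
    """Create missing directories and run tesseract for each pending TIFF."""
    for tiff, base in plan_ocr(collections, archive_root, web_root):
        base.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["tesseract", str(tiff), str(base)], check=True)


if __name__ == "__main__":
    # Hypothetical collection name; lists pending work without running OCR.
    for tiff, base in plan_ocr(["coll1"]):
        print(tiff, "->", str(base) + ".txt")
```

Separating the planning step from the execution step makes it possible to review which pages would be OCRed before any are processed, which suits a selective workflow like the one described here.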