For Creating Derivatives

From UA Libraries Digital Services Planning and Documentation
Latest revision as of 09:11, 6 August 2013


== Open source software we are using to support our digital library ==

* [http://sourceforge.net/projects/tesseract-ocr/ Tesseract-ocr] for OCR creation. The command used is of this form:

 tesseract [OLDFILE] [NEWFILE]

* [http://www.imagemagick.org/ ImageMagick] for JPEG creation from TIFF files. The command used with ImageMagick is of this form (this for the 2048 size):

 convert [OLDFILE] -strip -density 96 -resample 96x96 -resize 2048x2048 -filter Cubic -quiet [NEWFILE]

* [http://lame.sourceforge.net/ LAME] for some MP3 creation from WAV files. The command used with LAME is of this form:

 lame [OLDFILE] [NEWFILE] -V4 --noreplaygain -S

These settings were decided upon after consulting the [http://wiki.hydrogenaudio.org/index.php?title=Lame#Portable:_background_noise_and_low_bitrate_requirement.2C_small_sizes Hydrogen Audio] site.
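For scripting against these tools, the three command lines above can be sketched as small command builders. This is illustrative Python (our actual scripts are Perl), and the function names are ours, not part of any script:

```python
import shlex

# Illustrative builders for the three commands above; the wrapper
# functions and their names are assumptions, not code from our scripts.

def tesseract_cmd(oldfile, newfile):
    """OCR: tesseract [OLDFILE] [NEWFILE]."""
    return ["tesseract", oldfile, newfile]

def convert_cmd(oldfile, newfile, size=2048):
    """ImageMagick JPEG derivative (here for the 2048 size)."""
    return ["convert", oldfile, "-strip", "-density", "96",
            "-resample", "96x96", "-resize", f"{size}x{size}",
            "-filter", "Cubic", "-quiet", newfile]

def lame_cmd(oldfile, newfile):
    """MP3 creation with the Hydrogen Audio-recommended settings."""
    return ["lame", oldfile, newfile, "-V4", "--noreplaygain", "-S"]

print(shlex.join(convert_cmd("scan.tif", "scan_2048.jpg")))
```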

== The scripts currently used for creating derivatives ==

* Linux Perl script that pulls content from the mounted Windows drive, uses ImageMagick to create JPEGs from transcripts, and places them in a directory on the Linux server: [[Image:makeAudioJpegs.txt]]. This also picks up MODS files and MP3 files, and makes OCR of the transcript images (using tesseract-ocr) if a corresponding text file doesn't already exist; all derivatives go onto the Linux server and are deleted from the Windows drive.
* Linux Perl script that pulls content from the mounted Windows drive, uses ImageMagick to create JPEGs from manuscript images, and places them in a directory on the Linux server: [[Image:makeJpegs.txt]]. This also picks up MODS files and makes OCR of the transcript images (using tesseract-ocr) if a corresponding text file doesn't already exist; all derivatives go onto the Linux server and are deleted from the Windows drive. Options include the ability to OCR the entire image content or selected item numbers, and to apply these options to a collection already online.
* Linux Perl script that offers to create OCR for all existing transcripts of a given collection already online, or for selected transcript numbers across any number of collections: [[Image:moreTranscriptOCR.txt]]
* Linux Perl script to create OCR files for items listed in *ocrList.txt files located in the /srv/deposits/ocrMe directory, and place those OCR files in the correct web location: [[Image:ocrSelected.txt]]
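The "OCR only when no text file exists" rule that these scripts apply can be sketched as follows. This is a hypothetical helper in Python, not code from the scripts; the same-directory, `.txt`-extension layout is an assumption:

```python
from pathlib import Path

def needs_ocr(image_path):
    """Mirror the scripts' rule: OCR a transcript image only if no
    corresponding text file already exists alongside it. The helper
    name and the .txt naming convention are illustrative assumptions."""
    return not Path(image_path).with_suffix(".txt").exists()
```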

== OLDER METHODS ARE BELOW ==

''The information below is deprecated; we have replaced this workflow with [[Moving_Content_to_Acumen_and_Archive]], which enables us to get content online without putting it into the storage archive first. Benefits include that while LOCKSS partners are harvesting our content, we do not have to twiddle our thumbs till they're done; also, by putting the tools into the hands of Digital Services staff, we free up the programming needs on the server and avoid bottlenecks.''



After content has been moved to the long-term archive, we need online derivatives for web access in a directory structure that mirrors the archive.

== For images and audio ==

The following script runs through the archive, looking for TIFF and WAV files, and for transcripts.

# Transcripts are simply copied to the web-accessible directory, placed under a Transcripts directory at the level to which they apply.
# ImageMagick (http://www.imagemagick.org/) is used to create two image derivatives:
## a thumbnail, whose longest side is 128 pixels (file ends in _128.jpg)
## a large image, whose longest side is 2048 pixels (file ends in _2048.jpg)
# LAME (http://lame.sourceforge.net/) is used to create an MP3 from each WAV file.

The command used with ImageMagick is of this form (this for the 2048 size):

 convert [OLDFILE] -strip -density 96 -resample 96x96 -resize 2048x2048 -filter Cubic -quiet [NEWFILE]

The command used with LAME is of this form:

 lame [OLDFILE] [NEWFILE] -V4 --noreplaygain -S

Here's the Perl script: [[Image:Copychange.txt]]
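The way derivatives mirror the archive's directory structure can be sketched like this. This is an illustrative Python rendering, not the script's actual code; the helper name and the sample paths are assumptions (the /srv roots match those named elsewhere on this page):

```python
from pathlib import Path

def derivative_paths(master, archive_root, web_root):
    """Map an archived master file to its web derivative paths as described
    above: _128.jpg and _2048.jpg for tiffs, .mp3 for wavs, mirroring the
    archive's directory layout. Function and variable names are ours."""
    rel = Path(master).relative_to(archive_root)
    out_dir = Path(web_root) / rel.parent
    ext = rel.suffix.lower()
    if ext in (".tif", ".tiff"):
        return [out_dir / f"{rel.stem}_128.jpg", out_dir / f"{rel.stem}_2048.jpg"]
    if ext == ".wav":
        return [out_dir / f"{rel.stem}.mp3"]
    return []   # transcripts and other files are copied, not converted
```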


'''NOTE:''' ''During the process of creating derivatives in this manner, we discovered to our dismay that the software that comes with the Captureback overhead creates two tiffs inside each tiff file. One is a thumbnail, and one is the full-size master image. Unfortunately, ImageMagick by default creates a jpeg from both tiffs when the above command is run. It concatenates a "-0" to one of the filenames and a "-1" to the other. Examples of these can be seen here: [http://libcontent.lib.ua.edu/~jeremiah/images/]. The files ending in "-1" were created from the thumbnail, so they are blurry. We developed an additional script (called "repair") which hunts through directories, seeking out the files thus named, deleting the ones ending in "-1.jpg" and renaming the ones ending in "-0.jpg" to remove the "-0" addition. Here's the Perl script: [[Image:repair.txt]]''
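The repair pass just described can be sketched as follows; this is an illustrative Python rendering of the logic, not the repair.txt script itself:

```python
import os

def repair(directory):
    """Delete the blurry '-1.jpg' derivatives (made from the embedded
    thumbnail) and strip the '-0' suffix from the full-size '-0.jpg'
    derivatives, as the "repair" script described above does."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith("-1.jpg"):
            os.remove(path)   # jpeg made from the embedded thumbnail: discard
        elif name.endswith("-0.jpg"):
            fixed = name[:-len("-0.jpg")] + ".jpg"
            os.rename(path, os.path.join(directory, fixed))  # keep full-size jpeg
```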


== For OCR text ==

We're more selective here. We don't want to OCR files that are mostly image; our guideline is that there must be at least 50% textual content on a page before we will consider OCRing it. We're using the open source tesseract-ocr (http://sourceforge.net/projects/tesseract-ocr/) on the command line.

Given a set of collection names, the following Perl script goes through /srv/archive, locates tiff files, checks to see if OCR files already exist online in /srv/www/htdocs/content, and if not, creates directories for them, and uses tesseract-ocr to create OCR derivatives and places them there.

The command used with tesseract-ocr is of this form:

 tesseract [OLDFILE] [NEWFILE]

Here's the script: [[Image:OcrIt.txt]]
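The selection logic described above can be sketched as follows. The /srv/archive and /srv/www/htdocs/content roots come from the text; the function, its injectable `run` hook, and the file naming are illustrative assumptions, not the script's actual code:

```python
import subprocess
from pathlib import Path

def ocr_missing(collection, archive="/srv/archive",
                web="/srv/www/htdocs/content", run=subprocess.run):
    """For one collection: locate tiffs under the archive, skip any whose
    OCR text already exists online, create output directories as needed,
    and invoke tesseract for the rest. 'run' is injectable for testing."""
    for tif in sorted(Path(archive, collection).rglob("*.tif")):
        out_dir = Path(web) / tif.parent.relative_to(archive)
        out_base = out_dir / tif.stem          # tesseract appends .txt itself
        if out_base.with_suffix(".txt").exists():
            continue                           # OCR already online; skip
        out_dir.mkdir(parents=True, exist_ok=True)
        run(["tesseract", str(tif), str(out_base)], check=True)
```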
