Matching

From UA Libraries Digital Services Planning and Documentation
Revision as of 14:31, 26 July 2010 by Scpstu32 (Talk | contribs)


From time to time, a physical book needs to be scanned, but the metadata supplied by the archivists does not describe the book as an item. Instead, there is a metadata record for each intellectual item in the book.

It's easier to explain this by example:

  • Say we have a 3-page book (excluding covers).
  • This book is a letter book.
  • There is one letter per page.
  • The book is called An Example Book and its identifier is z9999_9999999_9999999.
  • We don't have metadata for An Example Book.
  • We *do* have metadata for each of the three letters (Letter 1 ... Letter 3); their identifiers are z9999_9999999_0000001, z9999_9999999_0000002, and z9999_9999999_0000003.
  • And we want to present this as a book *and* as the three letter items.

So what do we do? It's probably going to be easier to scan the book as a book, so let's do that. Then we need to create a match file in tab-delimited form. For our example above it would look like this:

BOOK     ITEM
z9999_9999999_9999999_0001   z9999_9999999_0000001
z9999_9999999_9999999_0002   z9999_9999999_0000002
z9999_9999999_9999999_0003   z9999_9999999_0000003
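
To make the file format concrete, here is a minimal Python sketch that reads match-file text into (book page, item) pairs. It is not the bookToItems script. The function name parse_match_text is made up for illustration, and skipping a BOOK/ITEM header row is an assumption based on the example above; a production match file may not have one.

```python
def parse_match_text(text):
    """Return (book_page_id, item_id) pairs from tab-delimited match-file text."""
    pairs = []
    # splitlines() treats the Windows newline (\r\n) as a single line break.
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, "
                f"got {len(fields)}"
            )
        book_page, item = fields[0].strip(), fields[1].strip()
        # Assumption: a BOOK/ITEM header row, if present, is not data.
        if (book_page, item) == ("BOOK", "ITEM"):
            continue
        pairs.append((book_page, item))
    return pairs


# The example match file from this page, with real tabs and Windows newlines.
EXAMPLE = "\r\n".join([
    "BOOK\tITEM",
    "z9999_9999999_9999999_0001\tz9999_9999999_0000001",
    "z9999_9999999_9999999_0002\tz9999_9999999_0000002",
    "z9999_9999999_9999999_0003\tz9999_9999999_0000003",
]) + "\r\n"

if __name__ == "__main__":
    for book_page, item in parse_match_text(EXAMPLE):
        print(book_page, "->", item)
```

A line that uses spaces instead of a tab raises a ValueError, which is a cheap way to catch a match file that was saved from an editor that converts tabs.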

Then we'll use a Perl script called bookToItems and ImageMagick to make new image files based on the match file information.

In other words, with our example above, we would end up with three files (z9999_9999999_0000001.tif ... z9999_9999999_0000003.tif) in addition to our already-existing files (z9999_9999999_9999999_0001.tif ... z9999_9999999_9999999_0003.tif).
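
The file naming for the simple one-letter-per-page case can be sketched in Python as below. This is only an illustration under stated assumptions: the function names plan_item_files and copy_item_files are hypothetical, and a plain file copy stands in for whatever bookToItems actually does with ImageMagick, which this page does not describe.

```python
import shutil
from pathlib import Path


def plan_item_files(pairs, extension=".tif"):
    """Map each book-page file name to the item file name derived from it."""
    return [(book_page + extension, item + extension) for book_page, item in pairs]


def copy_item_files(pairs, scans_dir, out_dir):
    """Copy each book-page TIFF to a new file named after its item identifier.

    Assumption: one page per item, so a straight copy is enough. The
    original book-page files are left in place.
    """
    scans_dir, out_dir = Path(scans_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source_name, target_name in plan_item_files(pairs):
        shutil.copyfile(scans_dir / source_name, out_dir / target_name)
        written.append(target_name)
    return written


# The three pairs from the example match file.
EXAMPLE_PAIRS = [
    ("z9999_9999999_9999999_0001", "z9999_9999999_0000001"),
    ("z9999_9999999_9999999_0002", "z9999_9999999_0000002"),
    ("z9999_9999999_9999999_0003", "z9999_9999999_0000003"),
]

if __name__ == "__main__":
    for source_name, target_name in plan_item_files(EXAMPLE_PAIRS):
        print(source_name, "->", target_name)
```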


Of course, a page in a book may contain part of one letter and part of another.

For a less simple example, see http://www.lib.ua.edu/wiki/digcoll/images/1/18/20100720_match.txt, a real match file that we used in production in July 2010.
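
One way to think about the shared-page case is to group the match pairs in both directions. The Python sketch below assumes that a page holding parts of two letters simply appears on two lines of the match file, once per letter; that is a guess consistent with the one-pair-per-line format, not something confirmed here. The sample data and function names are hypothetical.

```python
from collections import defaultdict


def pages_by_item(pairs):
    """Group book pages under the item they belong to, keeping file order."""
    grouped = defaultdict(list)
    for book_page, item in pairs:
        grouped[item].append(book_page)
    return dict(grouped)


def shared_pages(pairs):
    """Return only the book pages that are matched to more than one item."""
    items = defaultdict(list)
    for book_page, item in pairs:
        items[book_page].append(item)
    return {page: found for page, found in items.items() if len(found) > 1}


# Hypothetical data: Letter 1 ends on page 2, where Letter 2 begins.
SAMPLE_PAIRS = [
    ("z9999_9999999_9999999_0001", "z9999_9999999_0000001"),
    ("z9999_9999999_9999999_0002", "z9999_9999999_0000001"),
    ("z9999_9999999_9999999_0002", "z9999_9999999_0000002"),
    ("z9999_9999999_9999999_0003", "z9999_9999999_0000002"),
]

if __name__ == "__main__":
    print(pages_by_item(SAMPLE_PAIRS))
    print(shared_pages(SAMPLE_PAIRS))
```

Listing the shared pages before a run is a quick check that multi-letter pages were entered deliberately and are not copy-and-paste slips.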

As for what the script expects, please read this excerpt from a July 2010 email by Jody DeRidder:

 The script is in the scripts directory at the top level, and is named bookToItems.  It will only look in the in_progress folder at present.
 It writes to an output file in the output directory within the scripts directory.
 If it runs across a match file for which there is already a corresponding scans directory, it will ask you if you want to reprocess.
 It expects the match files to be of the form Date (yyyymmdd) underscore “match.txt”  -- so, for example, 20100720_match.txt would be a good one for today.
 It will name the output directory Scans_match_yyyymmdd  to match the date of the match.txt file.
 The Match file needs to have, on each line, old identifier followed by a tab followed by a new identifier.
 The script expects a Windows newline to end each line.
 The script expects the input tiffs to be in a Scans directory (or several Scans directories).
 And – it will process multiple match files if they are present.
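
The naming and newline expectations in the email can be checked before running the script. The Python sketch below is a hypothetical pre-flight check, not part of bookToItems: output_dir_for derives the Scans_match_yyyymmdd name from a match file name, and lines_missing_crlf reports lines that do not end in a Windows newline. Both function names are invented for this example.

```python
import re
from datetime import datetime

# yyyymmdd, an underscore, then "match.txt", as described in the email.
MATCH_NAME = re.compile(r"^(\d{8})_match\.txt$")


def output_dir_for(match_filename):
    """Return the output directory name that corresponds to a match file name."""
    found = MATCH_NAME.match(match_filename)
    if not found:
        raise ValueError(f"not of the form yyyymmdd_match.txt: {match_filename!r}")
    date_part = found.group(1)
    # strptime raises ValueError if the digits are not a real calendar date.
    datetime.strptime(date_part, "%Y%m%d")
    return "Scans_match_" + date_part


def lines_missing_crlf(raw_bytes):
    """Return the 1-based numbers of lines that do not end in \\r\\n."""
    bad = []
    for number, line in enumerate(raw_bytes.splitlines(keepends=True), start=1):
        if not line.endswith(b"\r\n"):
            bad.append(number)
    return bad


if __name__ == "__main__":
    print(output_dir_for("20100720_match.txt"))
    print(lines_missing_crlf(b"old_id\tnew_id\r\nold_id2\tnew_id2\n"))
```

The newline check works on raw bytes on purpose: opening the file in text mode would translate line endings and hide the problem.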