Matching

From UA Libraries Digital Services Planning and Documentation
Revision as of 15:13, 26 July 2010 by Scpstu32 (Talk | contribs)


This page is a work in progress as of 7-26-10


From time to time, a physical book needs to be scanned, but the metadata supplied by the archivists does not describe the book as an item. Instead, there is a separate metadata record for each intellectual item within the book.

It's easier to explain this by example:

  • Say we have a 3 page book (excluding covers).
  • This book is a letter book.
  • There is one letter per page.
  • The book is called An Example Book and its identifier is z9999_9999999_9999999.
  • We don't have metadata for An Example Book.
  • We *do* have metadata for each of the three letters (Letter 1 ... Letter 3); their identifiers are z9999_9999999_0000001, z9999_9999999_0000002, and z9999_9999999_0000003.
  • And we want to present this as a book *and* as the three letter items.

So what do we do? It's probably going to be easier to scan the book as a book, so let's do that. Then we need to create a match file in tab-delimited form. For our example above it would look like this:

BOOK                           ITEM
z9999_9999999_9999999_0001     z9999_9999999_0000001
z9999_9999999_9999999_0002     z9999_9999999_0000002
z9999_9999999_9999999_0003     z9999_9999999_0000003

Of course, a page in a book may contain part of one letter and part of another.

For a less simple example, see this file, which is a real match file that we used in production in July 2010.
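To make the match file format concrete, here is a minimal Python sketch that reads tab-delimited match text and groups the book page identifiers under each item identifier. It is not the production script. The function name group_pages_by_item is hypothetical, and the sketch assumes that a page shared by two letters simply appears on two lines, once for each item, and that an optional BOOK/ITEM header line may be present.

```python
from collections import defaultdict


def group_pages_by_item(match_text):
    """Return {item_id: [page_id, ...]} from tab-delimited match text.

    Hypothetical helper: assumes two tab-separated fields per line and
    an optional BOOK/ITEM header line.
    """
    pages_by_item = defaultdict(list)
    # splitlines() accepts Windows (CRLF) and Unix (LF) line endings.
    for number, line in enumerate(match_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 2 or not all(fields):
            raise ValueError("line %d is not 'page<TAB>item': %r" % (number, line))
        page_id, item_id = fields
        if (page_id, item_id) == ("BOOK", "ITEM"):
            continue  # skip the header line
        pages_by_item[item_id].append(page_id)
    return dict(pages_by_item)
```

Under these assumptions, a page that holds the end of one letter and the start of the next would be listed under both item identifiers in the result.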


  • The script is named bookToItems and is in the scripts directory at the top level. At present it only looks in the in_progress folder.
  • It writes to an output file in the output directory within the scripts directory.
  • If it comes across a match file for which there is already a corresponding scans directory, it will ask whether you want to reprocess.
  • It expects match file names of the form yyyymmdd_match.txt, that is, the date followed by an underscore and "match.txt". For example, 20100720_match.txt would be a good name for a match file created on 20 July 2010.
  • It names the output directory Scans_match_yyyymmdd, using the date from the match.txt file name.
  • The match file needs to have, on each line, the old identifier followed by a tab followed by the new identifier.
  • The script expects a Windows newline to end each line.
  • The script expects the input tiffs to be in a Scans directory (or several Scans directories).
  • It will process multiple match files if they are present.
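The naming rules above can be sketched in Python as follows. This is only an illustration of the convention, not the bookToItems script itself. The function name output_dir_for and the choice to return None for a name that does not fit the pattern are assumptions made for this example.

```python
import re
from datetime import datetime

# Match file names look like 20100720_match.txt (yyyymmdd_match.txt).
MATCH_NAME = re.compile(r"^(\d{8})_match\.txt$")


def output_dir_for(match_filename):
    """Return the Scans_match_yyyymmdd directory name for a match file.

    Hypothetical helper: returns None when the name does not follow the
    yyyymmdd_match.txt convention or the date part is not a real date.
    """
    found = MATCH_NAME.match(match_filename)
    if found is None:
        return None
    date_part = found.group(1)
    try:
        # Reject impossible dates such as 20101345.
        datetime.strptime(date_part, "%Y%m%d")
    except ValueError:
        return None
    return "Scans_match_" + date_part
```

For example, output_dir_for("20100720_match.txt") gives "Scans_match_20100720", while a file named simply match.txt is not recognised.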