Matching

From UA Libraries Digital Services Planning and Documentation

Latest revision as of 14:59, 29 October 2010

From time to time, a physical book needs to be scanned, but the metadata supplied by the archivists describes each intellectual item in the book rather than the book as a whole.

It's easier to explain this by example:

  • Say we have a 3 page book (excluding covers).
  • This book is a letter book.
  • There is one letter per page.
  • The book is called An Example Book and its identifier is z9999_9999999_9999999.
  • We don't have metadata for An Example Book.
  • We *do* have metadata for each of the three letters (Letter 1 ... Letter 3); their identifiers are z9999_9999999_0000001, z9999_9999999_0000002, and z9999_9999999_0000003.
  • And we want to present this as a book *and* as the three letter items.


So what do we do?

  • It's probably going to be easier to scan the book as a book, so let's do that first.
  • Then we need to create a match file in tab-delimited form.
    • For our example above the text file would look like this:
      BOOK	ITEM
      z9999_9999999_9999999_0001	z9999_9999999_0000001
      z9999_9999999_9999999_0002	z9999_9999999_0000002
      z9999_9999999_9999999_0003	z9999_9999999_0000003
  • Then we'll use a Perl script called bookToItems and ImageMagick to make new image files based on the match file information.
    • In other words, with our example above, we would end up with three files (z9999_9999999_0000001.tif ... z9999_9999999_0000003.tif) in addition to our already-existing files (z9999_9999999_9999999_0001.tif ... z9999_9999999_9999999_0003.tif).
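
For the simple one-page-per-item case above, the matching step can be sketched roughly as follows. This is a minimal Python illustration, not the bookToItems script itself (which is Perl and is not reproduced on this page). The function names `read_match_pairs` and `plan_conversions` are hypothetical, skipping a `BOOK`/`ITEM` header row is an assumption, and the plain ImageMagick `convert input output` call stands in for whatever the real script actually runs.

```python
def read_match_pairs(text):
    """Parse tab-delimited match text into (book_page_id, item_id) pairs."""
    pairs = []
    for line in text.splitlines():  # splitlines() also handles Windows newlines
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-separated fields: {line!r}")
        book_page, item = (f.strip() for f in fields)
        # Assumption: a BOOK/ITEM header row, if present, is skipped.
        if (book_page, item) == ("BOOK", "ITEM"):
            continue
        pairs.append((book_page, item))
    return pairs


def plan_conversions(pairs, ext=".tif"):
    """Build one ImageMagick command per pair (nothing is executed here)."""
    # Assumption: each item image is a straight copy of one book page image.
    return [["convert", book_page + ext, item + ext] for book_page, item in pairs]


EXAMPLE = (
    "BOOK\tITEM\r\n"
    "z9999_9999999_9999999_0001\tz9999_9999999_0000001\r\n"
    "z9999_9999999_9999999_0002\tz9999_9999999_0000002\r\n"
    "z9999_9999999_9999999_0003\tz9999_9999999_0000003\r\n"
)

for command in plan_conversions(read_match_pairs(EXAMPLE)):
    print(" ".join(command))
```

The sketch only prints the commands it would run, so it can be tried without ImageMagick installed or any scans present.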


Of course, a page in a book may contain part of one letter and part of another.

So for a less-simple example, see http://www.lib.ua.edu/wiki/digcoll/images/1/18/20100720_match.txt, a real match file that we used in production in July 2010.

As for what the script expects, see this excerpt from an email sent by Jody DeRidder on 20 July 2010:

 The script is in the scripts directory at the top level, and is named bookToItems.  It will only look in the in_progress folder at present.
 It writes to an output file in the output directory within the scripts directory.
 If it runs across a match file for which there is already a corresponding scans directory, it will ask you if you want to reprocess.
 It expects the match files to be of the form Date (yyyymmdd) underscore “match.txt”  -- so, for example, 20100720_match.txt would be a good one for today.
 It will name the output directory Scans_match_yyyymmdd  to match the date of the match.txt file.
 The Match file needs to have, on each line, old identifier followed by a tab followed by a new identifier.
 The script expects a Windows newline to end each line.
 The script expects the input tiffs to be in a Scans directory (or several Scans directories).
 And – it will process multiple match files if they are present.
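
Before running the script, a match file could be checked against the conventions in the email with something like the following. This is a hedged Python sketch, not part of bookToItems: `output_dir_for` and `check_match_bytes` are hypothetical helper names, and the checks cover only the rules quoted above (file name, output directory name, one tab per line, Windows newlines).

```python
import re
from datetime import datetime

# File name convention from the email: yyyymmdd + "_match.txt"
MATCH_NAME = re.compile(r"^(\d{8})_match\.txt$")


def output_dir_for(match_filename):
    """Return the expected output directory name, or None if the name is invalid."""
    m = MATCH_NAME.match(match_filename)
    if m is None:
        return None
    try:
        datetime.strptime(m.group(1), "%Y%m%d")  # reject impossible dates
    except ValueError:
        return None
    return "Scans_match_" + m.group(1)


def check_match_bytes(data):
    """Return a list of (line_number, problem) tuples for raw match file bytes."""
    problems = []
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    for number, raw in enumerate(lines, start=1):
        if raw.endswith(b"\r"):
            body = raw[:-1]
        else:
            body = raw
            problems.append((number, "line does not end with a Windows newline"))
        parts = body.split(b"\t")
        if len(parts) != 2 or b"" in parts:
            problems.append((number, "expected old identifier, tab, new identifier"))
    return problems


print(output_dir_for("20100720_match.txt"))
print(check_match_bytes(b"old_0001\tnew_0001\r\nold_0002 new_0002\r\n"))
```

The file is read as bytes (for example via `open(path, "rb").read()`) so that the newline style is not silently normalized before it is checked.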