Archiving MODS

From UA Libraries Digital Services Planning and Documentation
Revision as of 08:42, 30 September 2011 by Jlderidder (Talk | contribs)

Beginning 9/30/11, we are capturing snapshots of our MODS records for our archive on a quarterly basis.

We do not capture snapshots more often because of the overhead cost of long-term storage (LOCKSS) for each new version of a file, and also for each new version of the Manifest from which the files are linked. Each collection has a Manifest, which tells the LOCKSS partners which files to copy. Every change to a Manifest creates a new version of it. For a small collection the Manifest is a small file; for a large collection it can be a large one. The more Manifest versions we have, the more space we have to pay for in LOCKSS on an annual basis (above and beyond storage and local backup costs).

Metadata, meanwhile, is constantly changing. While we provide a minimal MODS record for each item on upload, the Metadata Unit remediates these records as time permits, overwriting them with improved versions that contain, for example, LCSH (Library of Congress Subject Headings).

The snapshot-capturing software is a Perl script, currently called "getMODS", which resides in /srv/scripts/storing/MODS/ on libcontent1. When run, it explores the Acumen web directories in /srv/www/htdocs/content/ on the same server, below the collection level, looking for MODS files. For each one it finds, it checks:

  1. Is there a MODS file for this item in the archive? (/srv/archive/ area)
    1. If not:
      1. This one is copied over,
      2. versioned to version 1, and
      3. linked into the Manifest.
    2. If so, does this MODS match that one?
      1. If so: no action is taken. This version is already in the archive.
      2. If not, is there a version 2 already in the archive?
        1. If not:
          1. This one is copied over,
          2. versioned to version 2, and
          3. linked into the Manifest.
        2. If so:
          1. The existing MODS is backed up,
          2. this MODS is copied over,
          3. this MODS is versioned to version 2, and
          4. NOT linked in (version 2 is already linked, because it exists).
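
The decision steps above can be sketched as a small function. This is only an illustration in Python, not the getMODS Perl code itself: the names Plan and plan_snapshot are hypothetical, and it assumes that "match" means a byte-for-byte comparison against the most recent archived version, which this page does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """What to do with one MODS file found in the web directories."""
    copy_as_version: Optional[int]  # None means nothing is copied
    link_in_manifest: bool          # add a new link to the Manifest
    backup_existing: bool           # back up the archived copy first


def plan_snapshot(current: bytes,
                  archived_v1: Optional[bytes],
                  archived_v2: Optional[bytes]) -> Plan:
    # No MODS in the archive yet: copy, version 1, link.
    if archived_v1 is None:
        return Plan(copy_as_version=1, link_in_manifest=True, backup_existing=False)

    # Assumption: compare against the latest archived version.
    latest = archived_v2 if archived_v2 is not None else archived_v1
    if current == latest:
        return Plan(copy_as_version=None, link_in_manifest=False, backup_existing=False)

    # Changed, and no version 2 yet: copy, version 2, link.
    if archived_v2 is None:
        return Plan(copy_as_version=2, link_in_manifest=True, backup_existing=False)

    # Changed, and version 2 exists: back it up and overwrite it.
    # The Manifest already links version 2, so no new link is made.
    return Plan(copy_as_version=2, link_in_manifest=False, backup_existing=True)
```

Keeping the decision separate from the file copying makes each branch of the list easy to check on its own before any files are touched.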

All changed Manifests are backed up. Backups of Manifests and of MODS files carry the suffix "_LOCKSS_yyyy-mm-dd" if the collection has been harvested into LOCKSS, and the suffix "_yyyy-mm-dd" if not.

Backups with "_LOCKSS_" in the suffix are counted when we estimate our LOCKSS storage for yearly fees. Those without it may be discarded in the future if we determine we don't need them.
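
As a rough illustration of the suffix convention, the sketch below builds a backup name and tests whether a name would count toward the LOCKSS estimate. The function names and the example filename are hypothetical, and it assumes the suffix is appended to the end of the full filename; the actual scripts may differ.

```python
import re
from datetime import date


def backup_name(filename: str, when: date, harvested: bool) -> str:
    """Append "_LOCKSS_yyyy-mm-dd" or "_yyyy-mm-dd" to a filename."""
    stamp = when.strftime("%Y-%m-%d")
    marker = "_LOCKSS_" if harvested else "_"
    return f"{filename}{marker}{stamp}"


def counts_toward_lockss(name: str) -> bool:
    """True if the name ends in a "_LOCKSS_yyyy-mm-dd" suffix."""
    return re.search(r"_LOCKSS_\d{4}-\d{2}-\d{2}$", name) is not None


# Hypothetical filename, for illustration only.
harvested = backup_name("example.mods.xml", date(2011, 9, 30), True)
local_only = backup_name("example.mods.xml", date(2011, 9, 30), False)
```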

Since getMODS manages a huge quantity of files, one must be prepared for potential blips in the process. After running getMODS, and after any other storage procedure, always run "checkArchive" in /srv/scripts/storing/ (precede it with "nohup", as it may take several hours to run). This verifies that everything that *should* be in the Manifests is linked somewhere, and that everything in the Manifests exists in the directories at the place linked. The output is written to /srv/scripts/storing/ArchiveERRORS, which must be checked after running "checkArchive".
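
The two-way check described above amounts to comparing two sets of paths. The sketch below shows the idea only; the function name is hypothetical, and reading the Manifests and walking the archive directories (whose formats are not described here) are left out, so the inputs are assumed to be already collected.

```python
from typing import List, Set, Tuple


def check_manifest(linked: Set[str], on_disk: Set[str]) -> Tuple[List[str], List[str]]:
    """Compare Manifest links against the files present in the archive.

    Returns (not_linked, dangling):
      not_linked - files in the archive that no Manifest links to
      dangling   - Manifest links whose target file does not exist
    """
    not_linked = sorted(on_disk - linked)
    dangling = sorted(linked - on_disk)
    return not_linked, dangling


# Hypothetical paths, for illustration only.
not_linked, dangling = check_manifest(
    linked={"coll/a.mods.xml", "coll/b.mods.xml"},
    on_disk={"coll/a.mods.xml", "coll/c.mods.xml"},
)
```

Both lists being empty would correspond to a clean run; anything else is the kind of entry one would expect to find reported in ArchiveERRORS.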

If MODS files are listed in ArchiveERRORS, return to the /MODS subdirectory and run "getMissingMODS". It reads ArchiveERRORS, pulls out the MODS listings, and adds them to the correct Manifests. I do not yet know why there are blips in this process; they appear to happen only with the really huge collections, so it may be a memory error. This should be addressed in the next quarterly backup if it continues to be a problem. It may have been a problem only the first time, as our first capture was so huge (many thousands of files).

Currently, the schedule for MODS backups is September, December, March and June of each year.

