Archiving MODS

From UA Libraries Digital Services Planning and Documentation
Revision as of 09:16, 6 August 2013 by Kgmatheny (reflecting server switch from libcontent1 to libcontent)

Beginning 9/30/11, we have been capturing quarterly snapshots of our MODS records for our archive.

We do not capture snapshots more often because of the overhead cost of long-term storage (LOCKSS) for each new version of a file, and for each new version of the Manifest from which the files are linked. Each collection has a Manifest that tells the LOCKSS partners which files to copy. Every change to a Manifest creates a new version of it. For a small collection the Manifest is a small file; for a large collection it can be a large one. The more Manifest versions we have, the more space we must pay for in LOCKSS each year (above and beyond storage and local backup costs).

Metadata is also constantly changing. While we provide a minimal MODS record for each item on upload, the Metadata Unit remediates these as time permits, overwriting them with improved versions that contain, for example, LCSH (Library of Congress Subject Headings).

The snapshot-capturing software is a Perl script, currently called "getMODS", that resides in /srv/scripts/storing/MODS/ on libcontent. When run, it explores the Acumen web directories in /srv/www/htdocs/content/ on the same server, below the collection level, looking for MODS files. For each one it finds, it checks:

  1. Is there a MODS file for this thing in the archive? (/srv/archive/ area)
    1. If not:
      1. this one is copied over,
      2. versioned to version 1, and
      3. linked into the Manifest
    2. If so, does this MODS match that one?
      1. If so: no action is taken. This version is already in the archive
      2. If not, is there a version 2 already in the archive?
        1. If not:
          1. This one is copied over,
          2. versioned to version 2, and
          3. linked into the Manifest
        2. If so:
          1. The existing MODS is backed up
          2. This MODS is copied over
          3. This MODS is versioned to version 2, and
          4. NOT linked in (version 2 is already linked, because it exists)
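
The decision tree above can be sketched in Python as follows. This is only an illustration of the logic, not the getMODS script itself (which is written in Perl). The function name, its three boolean inputs, and the step labels are hypothetical, and how getMODS actually decides that two MODS files "match" is not covered here.

```python
from typing import List


def plan_mods_snapshot(in_archive: bool,
                       matches_archived: bool,
                       has_version_2: bool) -> List[str]:
    """Return the ordered steps for one MODS file found in the web directories.

    All names and step labels are illustrative assumptions; they only mirror
    the numbered list in the text.
    """
    if not in_archive:
        # No MODS for this item in the archive yet: store it as version 1.
        return ["copy", "version as 1", "link in Manifest"]
    if matches_archived:
        # This version is already in the archive.
        return []
    if not has_version_2:
        # First changed copy: store it as version 2 and link it.
        return ["copy", "version as 2", "link in Manifest"]
    # A version 2 already exists and is already linked:
    # back it up, then replace it, without touching the link.
    return ["back up existing MODS", "copy", "version as 2"]
```

For example, `plan_mods_snapshot(True, False, True)` returns the back-up-and-replace steps, and no linking step, because version 2 is already linked.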

Backups are made of all changed manifests. Backups of manifests and of MODS have the suffix "_LOCKSS_yyyy-mm-dd" if the collection has been harvested into LOCKSS, and have the suffix "_yyyy-mm-dd" if not.

Those with the "_LOCKSS_" in the suffix are counted when we estimate our LOCKSS storage for yearly fees. Those without that suffix may in the future be discarded if we determine we don't need them.
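
As a rough illustration of the naming convention, the Python sketch below builds a backup name and tests whether a name would count toward the LOCKSS estimate. It assumes the suffix is simply appended to the end of the original file name; the function names and the example file name "Manifest" are hypothetical and not taken from the actual scripts.

```python
import re
from datetime import date


def backup_name(name: str, harvested: bool, day: date) -> str:
    """Append "_LOCKSS_yyyy-mm-dd" or "_yyyy-mm-dd" to a file name.

    Assumption: the suffix goes at the very end of the name.
    """
    marker = "_LOCKSS_" if harvested else "_"
    return f"{name}{marker}{day.isoformat()}"  # isoformat() yields yyyy-mm-dd


# Matches only names ending in the "_LOCKSS_yyyy-mm-dd" suffix.
_LOCKSS_BACKUP = re.compile(r"_LOCKSS_\d{4}-\d{2}-\d{2}$")


def counts_toward_lockss(name: str) -> bool:
    """True if the backup name carries the "_LOCKSS_" dated suffix."""
    return _LOCKSS_BACKUP.search(name) is not None
```

Under these assumptions, `backup_name("Manifest", True, date(2012, 5, 1))` gives `Manifest_LOCKSS_2012-05-01`, which `counts_toward_lockss` accepts, while `Manifest_2012-05-01` is not counted.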

Since getMODS manages a huge quantity of files, one must be prepared for potential blips in the process. After running getMODS (and after all other storage procedures), always run "checkArchive" in /srv/scripts/storing/ (precede it with "nohup", as it may take several hours to run). This verifies that everything that *should* be in the Manifests is linked somewhere, and that everything linked in the Manifests exists in the directories at the linked location. The output is written to /srv/scripts/storing/ArchiveERRORS, which must be checked after each "checkArchive" run.
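
The two-way check described above amounts to two set differences. The Python sketch below shows the idea only; it is not the checkArchive script, and the function name, argument names, and result keys are hypothetical. It assumes the three path lists have already been gathered, since the Manifest format is not described here.

```python
from typing import Dict, Iterable, List


def check_archive(expected: Iterable[str],
                  linked: Iterable[str],
                  on_disk: Iterable[str]) -> Dict[str, List[str]]:
    """Compare what should be linked, what is linked, and what exists.

    expected: paths that should appear in some Manifest
    linked:   paths actually linked from the Manifests
    on_disk:  paths actually present in the archive directories
    """
    linked_set = set(linked)
    return {
        # Should be in a Manifest but is not linked anywhere.
        "not_linked": sorted(set(expected) - linked_set),
        # Linked from a Manifest but absent from the linked location.
        "missing_on_disk": sorted(linked_set - set(on_disk)),
    }
```

An empty list under both keys corresponds to a clean run; anything else is the kind of entry one would expect to see reported in ArchiveERRORS.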

If MODS files are listed in ArchiveERRORS, return to the /MODS subdirectory and run "getMissingMODS". It reads ArchiveERRORS, pulls out the MODS listings, and adds them to the correct manifests. The first time I ran this, some of the MODS were not added to the manifests; I think it was a memory error, as the first capture was huge (many thousands of files). Captures since then have not had this problem, but we should always verify that the run completed correctly, so I am retaining this procedure.

Currently, the schedule for MODS backups is September, December, March and June of each year.

Jlderidder 16:38, 1 May 2012 (CDT)