Archiving MODS

From UA Libraries Digital Services Planning and Documentation
== '''Beginning 9/30/11, and on a quarterly basis, we are capturing snapshots of our MODS records for our archive.''' ==
''The reason we do not capture more snapshots'' is the overhead cost of long-term storage (LOCKSS) for each new version of the file, and also for each new version of the Manifest from which the files are linked. Each collection has a Manifest which enables the LOCKSS partners to know what files to copy. Every time a Manifest changes, it is considered a new version of the Manifest. When the collections are small, this is a small file. When the collections are large, this can be a large file. The more versions we have of the Manifests, the more space we have to pay for in LOCKSS on an annual basis (above and beyond storage and local backup costs).

And metadata is constantly changing. While we provide a minimal MODS for each item on upload, the Metadata Unit is remediating these as time permits, overwriting them with improved versions which contain, for example, LCSH (Library of Congress Subject Headings).
'''The snapshot-capturing software is a Perl script currently called "getMODS"''' and resides in /srv/scripts/storing/MODS/ on libcontent1. 
 
 
'''Backups are made of all changed manifests.'''  Backups of manifests and of MODS have the suffix "_LOCKSS_yyyy-mm-dd" if the collection has been harvested into LOCKSS, and have the suffix "_yyyy-mm-dd" if not.

Those with "_LOCKSS_" in the suffix are counted when we estimate our LOCKSS storage for yearly fees. Those without it may in the future be discarded if we determine we don't need them.

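
The suffix convention above could be sketched as follows. This is only an illustration in Python, not the actual script; the function names are hypothetical, and where exactly the suffix is attached to a file name is not stated here, so the sketch builds only the suffix itself.

```python
from datetime import date


def backup_suffix(harvested_into_lockss: bool, day: date) -> str:
    """Build the backup suffix described above (hypothetical helper)."""
    stamp = day.strftime("%Y-%m-%d")  # the yyyy-mm-dd part
    if harvested_into_lockss:
        return f"_LOCKSS_{stamp}"
    return f"_{stamp}"


def counts_toward_lockss_fees(backup_name: str) -> bool:
    """Backups carrying "_LOCKSS_" are the ones counted for yearly fee estimates."""
    return "_LOCKSS_" in backup_name
```
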
'''Since getMODS manages a huge quantity of files''', one must be prepared for potential blips in the process.  After running getMODS (and after all storage procedures), one should always run "checkArchive" in /srv/scripts/storing/ (precede it with "nohup", as it may take several hours to run).  This verifies that everything that *should* be in the Manifests is linked somewhere, and that everything in the Manifests exists in the directories in the place linked.  The output is /srv/scripts/storing/ArchiveERRORS, which must be checked after running "checkArchive".

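
The two-way verification described above might look roughly like this. It is a minimal Python sketch of the idea only; it is not the "checkArchive" script, and the function name, its inputs, and the wording of the messages are assumptions (the real format of ArchiveERRORS is not documented here).

```python
from typing import Iterable, List


def find_archive_errors(
    should_be_linked: Iterable[str],
    manifest_links: Iterable[str],
    existing_files: Iterable[str],
) -> List[str]:
    """Report files missing from the Manifests and Manifest links with no file."""
    linked = set(manifest_links)
    errors = []
    # Everything that should be in a Manifest must be linked somewhere.
    for path in sorted(set(should_be_linked) - linked):
        errors.append(f"not linked in any Manifest: {path}")
    # Everything linked in a Manifest must exist in the place linked.
    for path in sorted(linked - set(existing_files)):
        errors.append(f"linked but missing on disk: {path}")
    return errors
```

An empty result would correspond to an ArchiveERRORS file with nothing to act on.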
'''If MODS files are listed in ArchiveERRORS''', return to the /MODS subdirectory and run "getMissingMODS".  It reads ArchiveERRORS, pulls out the MODS listings, and adds them to the correct manifests.  The first time I ran this, some of the MODS were not added to the manifests; I think it was a memory error, as the first capture was huge (many thousands of files).  Captures since then have not had this problem, but we should always verify that this ran correctly, so I am retaining this procedure.

'''Currently, the schedule for MODS backups is September, December, March and June of each year.'''

[[User:Jlderidder|Jlderidder]] 16:38, 1 May 2012 (CDT)

Latest revision as of 13:44, 24 March 2015

New and updated MODS are now captured by the upload scripts, as copies of modified MODS are deposited in the faceting directory for processing, and from there are copied to the deposits directory. New MODS are copied to both places (the faceting script will overwrite the ones in deposits if it runs before archiving).

The archiving of MODS has been incorporated into the regular archiving script; see the "for Archiving" page for more information. The logic used for what is archived is below.

Up to two copies of each metadata file are archived, with .v1 or .v2 added prior to the .xml extension, to indicate if the file is the initial metadata file (version 1 == v1) or a newer, modified version (v2). There will also be an unversioned copy kept in the Metadata directory in the archive, which is a copy of the most recent version (for easy access). Thus, if the metadata file is modified more than once, the v2 file will be overwritten with each change. And every time the metadata is modified, the unversioned copy is updated as well.

The v1 and v2 files are linked into the manifest for LOCKSS; the unversioned copy is not.
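
As an illustration of the naming rule, here is a small Python sketch. The function names and the example file name are hypothetical, not taken from the archiving script; only the ".v1"/".v2" placement before ".xml" and the linking rule come from the description above.

```python
from pathlib import PurePosixPath
from typing import List


def versioned_name(filename: str, version: int) -> str:
    """Insert .v1 or .v2 before the .xml extension (hypothetical helper)."""
    if version not in (1, 2):
        raise ValueError("only versions 1 and 2 are archived")
    path = PurePosixPath(filename)
    if path.suffix != ".xml":
        raise ValueError("expected a .xml metadata file")
    return f"{path.stem}.v{version}{path.suffix}"


def manifest_candidates(filename: str) -> List[str]:
    """Names eligible for the LOCKSS manifest: v1 and v2, never the unversioned copy."""
    return [versioned_name(filename, 1), versioned_name(filename, 2)]
```

For a made-up file "item_0001.xml", this yields "item_0001.v1.xml" and "item_0001.v2.xml", while the unversioned "item_0001.xml" stays out of the manifest.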


Should we need to capture MODS from the entire archive again, the script to use is "getMODS", which resides in /srv/scripts/storing/MODS/ on libcontent.

When run, it explores the Acumen web directories in /srv/www/htdocs/content/ on the same server, below the collection level, looking for MODS files. When it finds one, it checks to see:

  1. Is there a MODS file for this thing in the archive? (/srv/archive/ area)
    1. If not:
      1. this one is copied over,
      2. versioned to version 1, and
      3. linked into the Manifest
    2. If so, does this MODS match that one?
      1. If so: no action is taken. This version is already in the archive
      2. If not, is there a version 2 already in the archive?
        1. If not:
          1. This one is copied over,
          2. versioned to version 2, and
          3. linked into the Manifest
        2. If so:
          1. The existing MODS is backed up
          2. This MODS is copied over
          3. This MODS is versioned to version 2, and
          4. NOT linked in (version 2 is already linked, because it exists)
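
The decision tree above can be summarized in a short sketch. This is Python for illustration only, not the getMODS script itself; the names are hypothetical, and it assumes that "match" means a byte-for-byte comparison against the most recent archived copy, which the steps above do not spell out.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ARCHIVE_AS_V1 = "copy over, version as v1, link into the Manifest"
    NO_ACTION = "this version is already in the archive"
    ARCHIVE_AS_V2 = "copy over, version as v2, link into the Manifest"
    REPLACE_V2 = "back up existing MODS, copy over, version as v2, do not link"


def decide(web_mods: bytes, archived_mods: Optional[bytes], v2_exists: bool) -> Action:
    """Pick the action for one MODS file found in the web directories."""
    if archived_mods is None:
        # No MODS for this item in the archive yet.
        return Action.ARCHIVE_AS_V1
    if web_mods == archived_mods:
        # Assumed comparison: identical content means nothing to do.
        return Action.NO_ACTION
    if not v2_exists:
        return Action.ARCHIVE_AS_V2
    # A v2 is already linked in the Manifest, so only its content is replaced.
    return Action.REPLACE_V2
```

Keeping the decision separate from the copying and linking makes each branch easy to check against the numbered steps.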


Jlderidder (talk) 13:43, 24 March 2015 (CDT)
