For Archiving

From UA Libraries Digital Services Planning and Documentation

Latest revision as of 09:45, 25 June 2015

Currently, content being uploaded for archival storage is in a specific organization (specified here: Share_Drive_Protocols).

Once this content is placed into the /srv/deposits/content/ directory on libcontent (a Linux server), we:

  1. verify that it copied correctly across the network,
  2. check the content with quality control verification scripts (such as File:TestIncoming.txt),
  3. upload the collection information file content into the database, via a web-side PHP script, to provide access to the online collection, and
  4. archive it.

Archiving means that we weed out extraneous files, re-order content (via copy) according to our storage organization (specified here: Organization_of_completed_content_for_long-term_storage), and version the metadata, XML, or text files. Only the versioned copy is linked into the manifest; the updated file overwrites the unversioned copy in the directory. We then either create a LOCKSS manifest for this content or alter existing ones to include it.

We have separate versions of the archiving script (Relocating): one for Electronic Theses and Dissertations (in the bornDigital subdirectory), and one for all the rest of our content ("relocating" in /srv/scripts/storing/). They will soon be combined.

    • Before archiving, run findMissing (/home/ds/scripts).

The archiving scripts are located in /srv/scripts/storing/

findMissing locates items and pages that have no metadata, locates metadata that has no derivatives, and watches for item-level content that also has pages. It outputs an error message for each of these.
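The three checks findMissing performs can be pictured as set comparisons. The following is only a minimal sketch, not the actual script: the function name, the message wording, and the identifier convention (a page id is its item id plus "_" and a page number) are all assumptions made for illustration.

```python
def find_missing(content_ids, metadata_ids, derivative_ids):
    """Return error messages for the three conditions findMissing reports.

    All three arguments are sets of identifiers. The id convention used
    here (page id = item id + "_" + page number) is hypothetical.
    """
    errors = []
    # Items or pages that have content but no metadata.
    for cid in sorted(content_ids - metadata_ids):
        errors.append(f"NO METADATA: {cid}")
    # Metadata records that have no derivative.
    for mid in sorted(metadata_ids - derivative_ids):
        errors.append(f"NO DERIVATIVE: {mid}")
    # Item-level content that also has pages beneath it.
    page_parents = {cid.rsplit("_", 1)[0] for cid in content_ids if "_" in cid}
    for item in sorted(page_parents & content_ids):
        errors.append(f"ITEM ALSO HAS PAGES: {item}")
    return errors
```

An empty list means none of the three problems was found.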

1) First, look through the deposits directory for any odd file organization or oddly named or missing content. Correct any problems you find. (Check all admin folders for logs and XML files, check each collection number for the correct directories, etc.)

2) Run 'testDeposits' to look for things you might have missed. If errors are found, correct them before proceeding.

3) Next, run "waitCheck", which checks for:

  • EADs in /srv/deposits/EADs/new/. Each is compared with the last cached version. If the EAD has not changed, it is deleted. If it has changed, waitCheck creates a collection directory (if necessary) in /srv/deposits/content/ and moves the new EAD there (into a Metadata directory) for archiving.
  • tags in /srv/deposits/crowdsourcing/tags, which are handled the same way.
  • transcripts in /srv/deposits/crowdsourcing/transcriptions/, which are handled the same way.
  • ETDs in /srv/deposits/bornDigital
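The compare-then-delete-or-stage behavior described for EADs might look roughly like the sketch below. It is an illustration under stated assumptions, not the waitCheck script itself: the function name and its parameters are hypothetical, the cache is assumed to hold the last version under the same file name, and the collection name is supplied by the caller because the page does not say how it is derived.

```python
import filecmp
import shutil
from pathlib import Path


def stage_if_changed(new_file, cache_dir, content_root, collection):
    """Delete an unchanged file, or stage a changed one for archiving.

    Returns the staged path, or None if the file matched the cached copy.
    """
    new_file = Path(new_file)
    cached = Path(cache_dir) / new_file.name
    # shallow=False forces a byte-by-byte comparison of the two files.
    if cached.exists() and filecmp.cmp(new_file, cached, shallow=False):
        new_file.unlink()  # nothing new to archive
        return None
    # Create the collection's Metadata directory only if it is missing.
    target_dir = Path(content_root) / collection / "Metadata"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_file.name
    shutil.move(str(new_file), str(target))
    return target
```

This sketch does not update the cache; whether and when the real script does so is not stated here.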

4) Remove RelocateManifests. Uncomment $test = 1; in "relocating" and run it, so that it runs as a test.

5) Check the moveThese directory to make sure the files will be moved to the right location for each collection.

6) Check RelocateManifests to verify that the manifests will be written correctly. Be sure to look at the end of RelocateManifests for Manifests that need to be created by hand (for new holding areas), and check what is being added to existing holder Manifests.

7) Comment out $test = 1; again and re-run "relocating". This script will:

  • crawl through the /srv/deposits/content/ directories looking for content,
  • locate or create the necessary manifests (including creating new ones if the latest one is approaching a terabyte of content),
  • back up the manifests, and mark the ones that are in LOCKSS with a file naming convention (so we can collect the sizes of the manifests),
  • identify where in the manifests things need to be linked, and add them,
  • move or copy files to the correct location in the archive, and
  • document, in a file named for the collection number in the moveThese subdirectory, which file was copied where, for the archival files (wav and tiff) that need checksum verification prior to deletion.
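The rule of starting a new manifest when the latest one is approaching a terabyte could be sketched as follows. This is an assumption-laden illustration: the function name, the exact threshold, the "manifestN" naming, and the idea of passing in sizes as a list are all hypothetical and are not taken from the "relocating" script.

```python
TERABYTE = 10**12  # assumed threshold, in bytes


def choose_manifest(manifest_sizes, incoming_bytes, limit=TERABYTE):
    """Pick the manifest that incoming content should be linked into.

    manifest_sizes is a list of (name, bytes) pairs in creation order.
    Returns (name, is_new). The "manifestN" names are hypothetical.
    """
    if manifest_sizes:
        name, size = manifest_sizes[-1]
        # Reuse the latest manifest only while it stays under the limit.
        if size + incoming_bytes < limit:
            return name, False
    # No manifest yet, or the latest one would reach the limit.
    return f"manifest{len(manifest_sizes) + 1}", True
```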

8) Check parentMans output for any manifests that need to be created for new holding areas. If any are identified, create them to match the format of all other holding area Manifests, and insert the links captured for you in parentMans.

9) Run checkAll to verify that the archival files have been copied over correctly. It goes through the moveme file and does an md5 comparison of each old file against its new copy. If they are the same, it deletes the old one in the deposits directory. If they are not the same, it outputs an error and leaves the original untouched.
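The md5 comparison in this step can be sketched as below. This is a minimal illustration rather than the checkAll script: the function names are hypothetical, and because the format of the moveme file is not described here, the sketch takes a plain list of (old path, new path) pairs instead of parsing that file.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_delete(pairs):
    """Delete each old file whose archived copy has the same md5.

    pairs is an iterable of (old_path, new_path). Returns a list of error
    messages; files named in an error are left untouched.
    """
    errors = []
    for old, new in pairs:
        old, new = Path(old), Path(new)
        if old.exists() and new.exists() and md5sum(old) == md5sum(new):
            old.unlink()  # copy verified, safe to remove the deposit
        else:
            errors.append(f"MISMATCH: {old} -> {new}")
    return errors
```

Reading in chunks keeps memory use flat for large wav and tiff files.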

10) Use deleteEmpties to remove the empty directories there.
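Removing empty directories is easiest bottom-up, so that a parent that becomes empty once its children are gone is also removed. The sketch below shows one way to do that; it is not the deleteEmpties script, and the function name and the choice to keep the top-level directory itself are assumptions.

```python
import os


def delete_empties(root):
    """Remove every empty directory beneath root, deepest first.

    The root directory itself is kept. Returns the removed paths.
    """
    removed = []
    # topdown=False visits children before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-list the directory: children may have just been removed.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```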

11) Check the directories in /srv/deposits/content/ to make sure nothing is left. If anything is left in this location, there is a problem and you need to figure out what it is.

12) Using nohup (to keep this from dying if you get disconnected), run checkArchive, which will write to ArchiveERRORS. This takes hours. It verifies that everything in each manifest is in the archive, and that everything that should be linked in the manifests is indeed linked there properly.
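The two-way check in this step (manifest against archive, archive against manifest) can be pictured with the sketch below. It is an illustration only: it assumes a manifest is an HTML page of links and that each href can be compared directly with an archive path, neither of which is spelled out on this page. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def check_manifest(manifest_html, archive_paths):
    """Compare the links in one manifest with the files in the archive."""
    parser = LinkCollector()
    parser.feed(manifest_html)
    linked = set(parser.links)
    archive = set(archive_paths)
    return {
        # Linked in the manifest but not found in the archive.
        "linked_but_missing": sorted(linked - archive),
        # Present in the archive but never linked in the manifest.
        "present_but_unlinked": sorted(archive - linked),
    }
```

Both lists empty would correspond to a manifest with nothing to report in ArchiveERRORS.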

13) Review ArchiveERRORS for problems that need to be corrected.


NOTE: When we digitize multiple tiny collections, we may combine the spreadsheets, for simplicity. Then, however, they must be split out by collection for archiving: File:SplitExcel.txt
