Currently, content being uploaded for archival storage is arranged in a specific organization (specified here: Share_Drive_Protocols).
Once this content is placed into the /srv/deposits/content/ directory on libcontent (a Linux server), we:
- verify that it copied correctly across the network,
- check the content with quality control verification scripts (such as File:TestIncoming.txt),
- upload the collection information file content into the database to provide access to the online collection via a web-side php script, and
- then we archive it.
Archiving means that we weed out extraneous files, re-order content (via copy) according to our storage organization (specified here: Organization_of_completed_content_for_long-term_storage), version the metadata, XML, or text files, and either create a LOCKSS manifest for this content or alter existing ones to include it. When a file is versioned, only the versioned copy is linked into the manifest; the updated file overwrites the unversioned copy in the directory.
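The versioning rule described above can be sketched in Python as follows. This is only an illustration, not the actual archiving script: the function name store_versioned and the "name.vN.ext" naming pattern are hypothetical, since this page does not state the real version-naming convention.

```python
import re
import shutil
from pathlib import Path


def store_versioned(incoming: Path, archive_dir: Path) -> Path:
    """Archive a metadata/XML/text file as a new numbered version.

    Hypothetical naming: 'name.ext' is the unversioned copy and
    'name.v1.ext', 'name.v2.ext', ... are the versions.
    Returns the versioned path, which is the one to link in the manifest.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    stem, suffix = incoming.stem, incoming.suffix
    pattern = re.compile(rf"^{re.escape(stem)}\.v(\d+){re.escape(suffix)}$")

    # Find the highest version number already stored for this file.
    existing = [
        int(match.group(1))
        for path in archive_dir.iterdir()
        if (match := pattern.match(path.name))
    ]
    next_version = max(existing, default=0) + 1

    versioned = archive_dir / f"{stem}.v{next_version}{suffix}"
    shutil.copy2(incoming, versioned)                    # versioned copy, linked in the manifest
    shutil.copy2(incoming, archive_dir / incoming.name)  # unversioned copy is overwritten
    return versioned
```

Earlier versions are never touched, so a manifest that links only versioned copies keeps pointing at unchanged files.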
We have different versions of the archiving script (Relocating) for Electronic Theses and Dissertations (in the bornDigital subdirectory), and then for all the rest of our content ("relocating" in /srv/scripts/storing/). They will soon be combined.
- Before archiving, run findMissing (located in /home/ds/scripts).
The Archiving Scripts are located in /srv/scripts/storing/
findMissing locates items and pages that have NO metadata, locates metadata that has no derivatives, and watches for item-level content that also has pages. It outputs an error message for each of these.
1) First, look through the deposits directory for any odd file organization or oddly named or missing content. Correct if found. (Check all admin folders for logs and xmls, check each collection number for the correct directories etc.)
2) Run 'testDeposits' to look for things you might have missed. If errors are found, correct them before proceeding.
3) Next, run "waitCheck" which checks for:
- EADs in /srv/deposits/EADs/new/, and compares these to see if the EAD has changed from the last version cached. If not, it is deleted. If so, it creates a collection directory (if necessary) in /srv/deposits/content/ and moves the new EAD there for archiving (into a Metadata directory).
- tags in /srv/deposits/crowdsourcing/tags -- and does the same with these
- transcripts in /srv/deposits/crowdsourcing/transcriptions/ -- and does the same with these.
- ETDs in /srv/deposits/bornDigital
If you get an error message about ETD content already in the archive, do a diff between the incoming in /srv/deposits/bornDigital and the specified files in the archive. If they are for the same person, same degree, same title, they're okay (just an updated record). In this case, compare the versions in deposits/content with those in deposits/bornDigital -- if the deposits/content are the same plus faceted entries, delete the copy in deposits/bornDigital. Otherwise, overwrite the ones in deposits/content with those in deposits/bornDigital.
Otherwise (the new records are for different people or degrees/papers) -- contact the metadata librarians; they may have reused an identifier for a different ETD, and now what we have online has overwritten older content. We'll have to clean it up. Pull any related records out of /srv/deposits/content into a hold directory, before continuing with the archiving process.
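The EAD check in step 3 (delete an unchanged EAD, stage a changed one) might look roughly like the Python sketch below. It is not the waitCheck script itself. The function name, its parameters, and the idea that the caller already knows the cached copy's path and the collection number are all assumptions made for illustration.

```python
import filecmp
import shutil
from pathlib import Path


def process_new_ead(new_ead: Path, cached_ead: Path,
                    content_root: Path, collection: str) -> str:
    """Delete an unchanged EAD, or stage a changed one for archiving.

    Returns "deleted" or "staged".
    """
    # shallow=False forces a byte-by-byte comparison of the two files.
    if cached_ead.exists() and filecmp.cmp(new_ead, cached_ead, shallow=False):
        new_ead.unlink()
        return "deleted"

    # Changed or never seen: create <content_root>/<collection>/Metadata if needed.
    metadata_dir = content_root / collection / "Metadata"
    metadata_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(new_ead), str(metadata_dir / new_ead.name))
    return "staged"
```

On the server, content_root would be /srv/deposits/content/ and new_ead a file under /srv/deposits/EADs/new/.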
4) Remove RelocateManifests. Uncomment $test = 1 in "relocating" and run it.
5) Check the moveThese directory to make sure the files will be moved to the right location for each collection, and
6) Check the RelocateManifests to verify that the manifests will be written correctly. Be sure to look at the end of RelocateManifests for Manifests that need to be created by hand (for new holding areas), as well as to check what is being added to existing holder Manifests.
7) Comment $test = 1; back out and re-run "relocating". This script will:
- crawl through the /srv/deposits/content/ directories looking for content,
- locate or create the necessary manifests (including creating new ones if the latest one is approaching a terabyte of content),
- back up the manifests, and specify the ones that are in LOCKSS with a file naming convention (so we can collect the sizes of the manifests),
- identify where in the manifests things need to be linked, and add them,
- move or copy files to the correct location in the archive, and
- document in a file named for the collection number in the moveThese subdirectory, what file was copied where, for the archival files (wav and tiff) that need checksum verification prior to deletion
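The test-then-run pattern of steps 4 through 7 can be illustrated with a small Python sketch. It is not the "relocating" script. It assumes that test mode records the planned copies without touching any files, which is inferred from steps 4 to 6 rather than stated here, and the tab-separated log format is hypothetical.

```python
import shutil
from pathlib import Path


def relocate(moves, log_dir: Path, collection: str, test: bool = True) -> Path:
    """Record, and optionally perform, a list of (source, destination) copies.

    moves: iterable of (Path, Path) pairs.
    The log is written to <log_dir>/<collection>, one tab-separated
    'source<TAB>destination' pair per line (a hypothetical format).
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / collection
    with log_file.open("w", encoding="utf-8") as log:
        for source, destination in moves:
            log.write(f"{source}\t{destination}\n")
            if test:
                continue  # dry run: record the plan, touch nothing
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, destination)
    return log_file
```

Running once with test=True gives a log that can be reviewed by hand. Running again with test=False performs the copies and leaves the originals in place for later checksum verification.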
8) Check parentMans output for any manifests that need to be created for new holding areas. If any are identified, create them to match the format of all other holding area Manifests, and insert the links captured for you in parentMans.
9) Run checkAll to verify that the archival files have been copied over correctly. It goes through the moveme file and does an md5 comparison of the old file and the new one. If they're the same, it will delete the old one in the deposits directory. If they're not the same, it will output an error and leave the original untouched.
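The compare-then-delete logic of step 9 can be sketched in Python as below. This is a minimal illustration, not the checkAll script. It assumes the log lists one tab-separated "source, destination" pair per line, which is a hypothetical format, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_delete(log_file: Path) -> list:
    """Delete each source whose archived copy has the same MD5; report the rest."""
    errors = []
    for line in log_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            errors.append(f"unreadable log line: {line}")
            continue
        source, destination = Path(parts[0]), Path(parts[1])
        if not source.exists() or not destination.exists():
            errors.append(f"missing file: {line}")
            continue
        if md5sum(source) == md5sum(destination):
            source.unlink()  # copy verified, original no longer needed
        else:
            errors.append(f"checksum mismatch: {source} -> {destination}")
    return errors
```

The original is removed only after both digests match, so a failed or partial copy never costs the only good file.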
10) Use deleteEmpties to remove the empty directories there.
11) Check the /srv/deposits/content/ directories to make sure nothing is left. If anything is left in this location, there is a problem and you need to figure out what it is.
12) Using nohup (to keep this from dying if you get disconnected), run checkArchive, which will write to ArchiveERRORS. This takes hours, and will verify that everything in each manifest is in the archive, and everything that should be linked in the manifests is indeed linked there properly.
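Removing empty directories, as in step 10, can be done bottom-up so that directories containing only empty directories are also removed. The Python sketch below shows the idea. It is not the deleteEmpties script, and the function name is hypothetical.

```python
import os
from pathlib import Path


def delete_empty_dirs(root: Path) -> list:
    """Remove every empty directory below root and return the removed paths."""
    removed = []
    # topdown=False visits children before parents, so a directory that
    # held only empty directories is itself empty by the time it is reached.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        path = Path(dirpath)
        if path == root:
            continue  # keep the top-level directory itself
        # Re-check on disk: the listing from os.walk predates our deletions.
        if not any(path.iterdir()):
            path.rmdir()
            removed.append(path)
    return removed
```

Anything still present afterwards contains at least one file, which is what step 11 then looks for.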
13) Review ArchiveERRORS for problems that need to be corrected.
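The two-way check described in step 12 (everything linked exists, everything present is linked) can be sketched in Python as follows. This is an illustration only, not the checkArchive script. It assumes a manifest is an HTML page whose links are relative paths to files under a single archive directory; the real manifest layout is not described on this page, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def check_manifest(manifest: Path, archive_dir: Path) -> list:
    """Compare the links in one manifest with the files under archive_dir."""
    parser = LinkCollector()
    parser.feed(manifest.read_text(encoding="utf-8"))
    linked = {(archive_dir / href).resolve() for href in parser.links}

    # Every file in the archive except the manifest itself.
    on_disk = {
        path.resolve()
        for path in archive_dir.rglob("*")
        if path.is_file() and path.resolve() != manifest.resolve()
    }
    errors = [f"linked but missing: {p}" for p in sorted(linked - on_disk)]
    errors += [f"in archive but not linked: {p}" for p in sorted(on_disk - linked)]
    return errors
```

An empty list means the manifest and the directory agree. Each returned string corresponds to the kind of problem that would need review in step 13.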
NOTE: When we digitize multiple tiny collections, we may combine the spreadsheets, for simplicity. Then, however, they must be split out by collection for archiving: File:SplitExcel.txt