Checking Acumen and the Archive
In the UploadArea/scripts directory (/home/ds/UploadArea/scripts/) there is a script called "findMissing". This script crawls Acumen looking for:
- items without metadata
- metadata without items
- item-level content that also has pages (for example: u0003_0002345_0000005.tif and u0003_0002345_0000005_0001.tif -- this is an error)
findMissing should be run at least once a month, to ensure we have not created problems in Acumen with the most recent uploads.
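As an illustration of the third check only, here is a minimal Python sketch of how item-level content that also has pages could be detected from a list of filenames. It is not the findMissing script itself. The ID pattern (a letter and four digits, then two seven-digit parts, then an optional four-digit page suffix) is an assumption inferred from the example filenames above, and the function name items_with_pages is hypothetical.

```python
import re

# Assumed ID shape, inferred from the example filenames in this document:
#   item level: u0003_0002345_0000005
#   page level: u0003_0002345_0000005_0001
ID_RE = re.compile(r"^([a-z]\d{4}_\d{7}_\d{7})(?:_(\d{4}))?$")


def items_with_pages(filenames):
    """Return item IDs that exist both as an item-level file and as page-level files."""
    item_level = set()
    paged = set()
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the file extension, if any
        match = ID_RE.match(stem)
        if not match:
            continue  # not a name this sketch understands
        item_id, page = match.groups()
        if page is None:
            item_level.add(item_id)
        else:
            paged.add(item_id)
    # An item ID in both sets is the error case described above.
    return sorted(item_level & paged)
```

Calling items_with_pages on a directory listing returns the item IDs that need attention; an empty list means no conflicts were found among the names that matched the assumed pattern.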
Checking the Archive
In the /srv/scripts/storing directory, there's a script called "checkArchive". This script crawls the archive, looking for:
- content listed in the manifests that does not exist (or has bad links)
- content in the archive that should be in the manifest but isn't (metadata, transcripts and archival files)
- manifests listed in the upper levels of the hierarchy that are missing links to manifests the next level down
- manifests listed in the upper levels of the hierarchy that contain bad links (links to manifests that do not exist)
checkArchive should be run at least once a month, after archiving.
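The first two checks amount to comparing what a manifest lists against what is actually on disk. The Python sketch below shows that comparison in its simplest form. It is not the checkArchive script: the manifest format is not described here, so the sketch assumes the manifest has already been read into a list of paths relative to the directory being checked, and the function name check_manifest is hypothetical.

```python
from pathlib import Path


def check_manifest(root, listed):
    """Compare manifest entries against the files found under root.

    root   -- directory to check
    listed -- relative paths (with forward slashes) taken from a manifest;
              how they are parsed out of the manifest is not shown here
    Returns (missing, unlisted): entries with no file on disk, and files
    on disk that the manifest does not mention.
    """
    root = Path(root)
    listed = set(listed)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    missing = sorted(listed - on_disk)
    unlisted = sorted(on_disk - listed)
    return missing, unlisted
```

In practice the manifest file itself, and anything else that is legitimately not listed, would need to be filtered out of the second list.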
Checking Acumen Vs. the Archive
In general, we expect everything in Acumen to be in the archive, and vice versa (with some exceptions, noted below) -- so regularly comparing the contents of the two is a useful sanity check. We have had situations where tiffs were lost before or during transfer to the server, so we had content in Acumen without corresponding content in the archive. We have also had situations where content made it all the way to the archive but was never put online -- or the derivative-generating script was cut short by networking or other issues, so derivatives for some content were not in Acumen.
On libcontent, in the /srv/scripts/stats directory, there are two scripts for this purpose: acumenToArchiveDiff and archiveToAcumenDiff. The names reflect what they do. acumenToArchiveDiff looks at what's in Acumen and checks each item against the archive; archiveToAcumenDiff looks at what's in the archive and checks each item against Acumen. The result files are written to the output directory as datestamped "NotInAcumen" or "NotInArchive" files.
The comparison is between TIFFs and WAVs in the archive and large (2048 pixel) JPEGs and MP3s in Acumen. Since we're looking for a one-to-one comparison of ids, this process does NOT yet recognize MP3s when multiple ones are generated from a single WAV file. Those are listed in the results file under "OKAY TO DIFFER" -- and new collections that meet this description need to be noted in the scripts so that they are listed there too.
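To make the one-to-one comparison concrete, here is a minimal Python sketch of the idea. It is not the code of either script. It assumes that an id is the filename minus its extension, and that the large JPEG derivative carries a "_2048" suffix before ".jpg"; that suffix is an assumption, so check it against the actual derivative names before relying on it. All function names are hypothetical.

```python
ARCHIVE_EXTENSIONS = (".tif", ".wav")
# Assumption: the large Acumen JPEG is named <id>_2048.jpg.
ACUMEN_SUFFIXES = ("_2048.jpg", ".mp3")


def strip_suffix(name, suffixes):
    """Return name without the first matching suffix, or None if none match."""
    lower = name.lower()
    for suffix in suffixes:
        if lower.endswith(suffix):
            return name[: -len(suffix)]
    return None


def diff_ids(archive_files, acumen_files):
    """Compare ids one-to-one and report what each side is missing."""
    archive = {strip_suffix(n, ARCHIVE_EXTENSIONS) for n in archive_files}
    acumen = {strip_suffix(n, ACUMEN_SUFFIXES) for n in acumen_files}
    archive.discard(None)  # files that are not TIFF/WAV are ignored
    acumen.discard(None)   # other derivatives (smaller JPEGs, etc.) are ignored
    return {
        "NotInAcumen": sorted(archive - acumen),
        "NotInArchive": sorted(acumen - archive),
    }
```

Because the match is on exact ids, a WAV that yields several MP3s with different names would show up on both sides of the diff, which is the limitation described above.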
Other content that is expected to be missing from one area or the other includes the following:
Not in the archive
- u0015_0000002 Undergraduate research projects
- t0003 Tuskegee finding aids
- u0007_0000001 UA Video collection "Realizing the Dream" (copyright protected: we were required to put it online, but we do not have the rights to preserve it or make copies)
- Content that is in deposits awaiting archiving
Not in Acumen
- u0011_0000011 Publisher's Bindings Online (this is hosted elsewhere)
- u0005_0000002 County maps that were digitized by the cartographic lab (hosted elsewhere)
- Content that has been taken down; the reason should be noted in a problems.txt file in the collection's Documentation directory in the archive.
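When reviewing a results file, it can help to separate the known exceptions from entries that need investigation. The Python sketch below shows one way to do that using the collection ids listed above. The function name split_expected is hypothetical, and the sketch assumes ids begin with the collection id followed by an underscore. Content awaiting archiving and content that has been taken down cannot be recognized by prefix, so those still need to be checked by hand.

```python
# Collection ids taken from the lists above.
OKAY_NOT_IN_ARCHIVE = ("u0015_0000002", "t0003", "u0007_0000001")
OKAY_NOT_IN_ACUMEN = ("u0011_0000011", "u0005_0000002")


def split_expected(ids, okay_collections):
    """Split ids into (unexpected, expected) based on known exceptions.

    An id counts as expected when it equals one of the collection ids
    or starts with a collection id followed by an underscore.
    """
    unexpected, expected = [], []
    for item_id in ids:
        is_expected = any(
            item_id == coll or item_id.startswith(coll + "_")
            for coll in okay_collections
        )
        (expected if is_expected else unexpected).append(item_id)
    return unexpected, expected
```

For a "NotInArchive" list, pass OKAY_NOT_IN_ARCHIVE; for a "NotInAcumen" list, pass OKAY_NOT_IN_ACUMEN. Whatever lands in the first returned list is what needs a closer look.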