Tracking for the long term

From UA Libraries Digital Services Planning and Documentation
Revision as of 08:41, 26 April 2016 by Jlderidder (talk | contribs) (The InfoTrack database)

We're developing tracking databases for management of content over time. Recognizing that this is still software-dependent, and that relational databases are inherently unstable, the plan is to periodically print reports from the database and store these as flat files in the top levels of the storage system, as a form of manifest.
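The report step described above can be sketched as follows. This is a minimal illustration only, assuming a database reachable through Python's DB-API (SQLite stands in here) and a tab-separated output format; the function name `write_manifest` and the sample rows are hypothetical and do not come from the production scripts.

```python
import csv
import io
import sqlite3


def write_manifest(conn, table, out):
    """Write every row of `table` to `out` as tab-separated text, header first."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Sort by the first column so repeated reports are easy to compare.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


def demo():
    """Build a throwaway table (hypothetical columns) and return its flat-file report."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE itemSums (identifier TEXT, filename TEXT, md5 TEXT)")
    conn.executemany(
        "INSERT INTO itemSums VALUES (?, ?, ?)",
        [("item2", "file2.tif", "def"), ("item1", "file1.tif", "abc")],
    )
    out = io.StringIO()
    write_manifest(conn, "itemSums", out)
    return out.getvalue()
```

A plain delimited text file of this kind can be read without any database software, which is the point of keeping the reports alongside the content.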

InfoTrack and md5sums are located on [...]. The checkscripts database, which documents any errors encountered and when each script runs (and thus when the MD5 sums are verified), is described in Tracking_automated_scripts.

The InfoTrack database


The allColls table contains enough information about each collection for a PHP script to dynamically deliver an alphabetized list of current online collections and EAD finding aids, with descriptions, icons, and links to the content (and, where available, the associated finding aid and a link to it online). Most of this information comes from the Collection_Information file that Digital Services personnel create when they begin to digitize a collection, drawing on the spreadsheets filled out by archivists. It is uploaded by the collToDbase script described in step 10 of Moving_Content_To_Long-Term_Storage or by the moveContent script described in the 3rd section of Most Content. These scripts also add the expected canned link for retrieving content and specify whether the collection is online yet. More information about this table is available here: allColls
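The production listing is generated by a PHP script; the query behind such a listing might look like the sketch below, written in Python against SQLite for illustration. The column names `title`, `link`, and `online` are assumptions, as the actual allColls field names are documented elsewhere.

```python
import sqlite3


def online_collections(conn):
    """Return (title, link) pairs for collections flagged as online, A to Z."""
    rows = conn.execute(
        "SELECT title, link FROM allColls WHERE online = 1 "
        "ORDER BY title COLLATE NOCASE"  # case-insensitive alphabetical order
    )
    return [(title, link) for title, link in rows]


def demo():
    """Populate a throwaway allColls table with hypothetical rows and list it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE allColls (title TEXT, link TEXT, online INTEGER)")
    conn.executemany(
        "INSERT INTO allColls VALUES (?, ?, ?)",
        [
            ("Zeta Letters", "/zeta", 1),
            ("alpha photographs", "/alpha", 1),
            ("Beta Maps", "/beta", 0),  # not yet online, so it is left out
        ],
    )
    return online_collections(conn)
```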


This table documents each request to take down either an EAD or a digital collection, including when the request was made and by whom. The process for doing this is spelled out in TakeDowns.


Once a collection has been released for harvesting into LOCKSS, we must monitor its size and any additions to it, and avoid any changes. As content is communicated over the network, we log here the identifier, manifest number, and date (and, for collections which are subcollections of others, such as rare books, the subcollection title and parent identifier).


Referenced by allColls entries for donor, funders, metadata and digitization leads, and collection submitter, this table contains the following fields: ID (an auto-incrementing number), name, addressOrDepartment, contactInfo, and notes.


This table was created to support persistent identifiers. It is a lookup table that provides redirects. Each item for which we would like to provide this support is assigned a number, and the item is retrieved through a URL in which the number following the last forward slash is the number of the item. The actual URL is stored in this table (realurl), along with the assigned number (purlnum), an original identifier (id_2009), a datestamp, and a history of any changes (history). The table is accessed by a script which lives in the web server's cgi-bin (/srv/www/cgi-bin/). A URL rewrite and a virtual host configuration, along with the DNS registration of the host name, were the only other support necessary for this to work. Whenever a file must be moved, the database is updated, and the persistent URL continues to work. In this fashion, we may enter URLs into metadata, webpages, online catalogs, etc., and never have to change them.
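The lookup and update behavior described above can be sketched as follows. This is an illustration rather than the production cgi-bin script: the table name `purls`, the `/purl/` path prefix, the example.org URLs, and the wording of the history entries are all assumptions; only the field names realurl, purlnum, id_2009, and history come from the description.

```python
import sqlite3
from datetime import date


def resolve(conn, path):
    """Return the real URL for a persistent URL path, or None if it is unknown."""
    last = path.rstrip("/").rsplit("/", 1)[-1]  # text after the last forward slash
    if not (last.isascii() and last.isdigit()):
        return None
    row = conn.execute(
        "SELECT realurl FROM purls WHERE purlnum = ?", (int(last),)
    ).fetchone()
    return row[0] if row else None


def move_item(conn, purlnum, new_url, today=None):
    """Point a persistent number at a new location and note the old one in history."""
    today = today or date.today().isoformat()
    row = conn.execute(
        "SELECT realurl, history FROM purls WHERE purlnum = ?", (purlnum,)
    ).fetchone()
    if row is None:
        raise KeyError(purlnum)
    old_url, history = row
    entry = f"{today}: moved from {old_url}"
    history = f"{history}\n{entry}" if history else entry
    conn.execute(
        "UPDATE purls SET realurl = ?, history = ? WHERE purlnum = ?",
        (new_url, history, purlnum),
    )


def demo_db():
    """Create a throwaway lookup table holding one hypothetical item."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purls (purlnum INTEGER PRIMARY KEY, realurl TEXT, "
        "id_2009 TEXT, datestamp TEXT, history TEXT)"
    )
    conn.execute(
        "INSERT INTO purls VALUES (12, 'http://example.org/old/item.jpg', "
        "'item1', '2016-01-01', NULL)"
    )
    return conn
```

Because published links carry only the number, moving a file means changing one row; nothing that cites the persistent URL needs to be edited.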


This table tracks born digital content such as Electronic Theses and Dissertations, which may have embargoes on web delivery which must be tracked and supported. Fields here include the identifier (id_2009), first and last name of the primary author, collection number (collnum), datestamp entered, title, the date the content should be made available (dateAvailable, in form yyyy-mm-dd), and exceptions.
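A delivery script could test an embargo with a date comparison like the one below. This is a sketch under stated assumptions: the function name is hypothetical, an empty dateAvailable is taken to mean "no embargo", and the exceptions field is not interpreted because its contents are not described here.

```python
from datetime import date


def is_deliverable(date_available, today):
    """True once the embargo date (a yyyy-mm-dd string) has been reached."""
    if not date_available:
        # Assumption: no recorded date means the item is not embargoed.
        return True
    return date.fromisoformat(date_available) <= today
```

For example, `is_deliverable("2016-05-01", date(2016, 4, 26))` is false, because the item does not become available until May 1.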


Information extracted using the Google API is stored in the geocode table for particular item locations. This enables us to apply the same lat/long to other items with the same location without calling the API repeatedly (there are limits on calls, as well as server overhead).


This works with the geocode table, indicating the item location for each of the items in the database. The locationID here corresponds to the locationID in the geocode table.
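The caching idea behind these two tables can be sketched as follows: look the location up in the geocode table first, and call the external service only when it is missing. The column names `place`, `lat`, and `lon`, and the `fetch` callable that stands in for the API request, are assumptions for illustration; the coordinates in the demo are placeholders, not real data.

```python
import sqlite3


def coordinates_for(conn, place, fetch):
    """Return (locationID, lat, lon), calling `fetch` only for unseen places."""
    row = conn.execute(
        "SELECT locationID, lat, lon FROM geocode WHERE place = ?", (place,)
    ).fetchone()
    if row:
        return tuple(row)  # already cached: no API call needed
    lat, lon = fetch(place)  # stand-in for the external geocoding request
    cursor = conn.execute(
        "INSERT INTO geocode (place, lat, lon) VALUES (?, ?, ?)", (place, lat, lon)
    )
    return (cursor.lastrowid, lat, lon)


def demo():
    """Look up the same place twice and count how often the API stand-in runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE geocode (locationID INTEGER PRIMARY KEY, "
        "place TEXT UNIQUE, lat REAL, lon REAL)"
    )
    calls = []

    def fake_fetch(place):
        calls.append(place)
        return (1.5, -2.5)  # placeholder coordinates

    first = coordinates_for(conn, "Example Town", fake_fetch)
    second = coordinates_for(conn, "Example Town", fake_fetch)
    return first, second, len(calls)
```

The returned locationID is what the item-level table would store for each item at that location.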


Name authority database (MADS), described in For_Subjects_and_Names, which supports controlled vocabulary for names in descriptive metadata. The database numbers are referenced in the spreadsheets, then used in the makeMods script to ensure the MODS entry matches the controlled vocabulary in the database.


Subject authority database containing tagged versions of each controlled vocabulary entry; the tags indicate the applicable fields to use when constructing the subject in MODS. This is used in preparing spreadsheets for transformation into MODS.


This is a legacy table from when we were migrating content through Scripto to obtain transcripts. We used this table to document how many pages of particular items were transcribed, and how many were OCR'd, so we could get a sense of how complete our coverage was on items. The script that fills this table, numPagesAndTrans, was run periodically and can be found in /srv/scripts/transcripts/olderVersions/previous/. It is retained in case we need to collect this information again.


itemSums: This table contains the identifier, file name, file path, current MD5 checksum, byte size, number of times modified, date of first entry, and a notes field.

modified: If a file has been modified (which is indicated in the itemSums table), there will be an entry here for each modification. The timestamp of the change, the identifier, filename, previous MD5 checksum, the byte size of the original file, the reason modified, and by whom are recorded here.

imageTechMed: As images are stored, technical metadata is generated for them, and some administrative information from that process is stored here. This information includes the mime type, format, format version, test date, format registry, format registry key, whether or not the TIFF contains a thumbnail, and whether conflicts were found during the testing process. You can find more information about this in Image_Technical_Metadata.

audioTechMed: As audio files are stored, technical metadata is generated for them, and some administrative information from that process is stored here. This information includes the mime type, format, format version, test date, format registry, format registry key, number of channels, and whether conflicts were found during the testing process. You can find more information about this in Audio_Technical_Metadata.

typeContent: This is the beginning of a method of tracking types of content, where the type may impact the migration target format. We have yet to find a way to automate capture of whether an image is primarily of text, but when we do, we expect to capture that information here. Thus far, the only types stored are scrapbooks, as identified at the item level.
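The relationship between the itemSums and modified tables described above can be sketched as follows: when a file legitimately changes, the previous checksum and size are logged in modified before itemSums is updated. The column names used here (`md5`, `bytes`, `timesModified`, `changed`, `previousMd5`, `previousBytes`, `reason`, `who`) are assumptions for illustration, as are the function names and sample values.

```python
import hashlib
import sqlite3


def record_change(conn, identifier, filename, new_md5, new_size, reason, who, when):
    """Log the previous checksum and size in `modified`, then update `itemSums`."""
    row = conn.execute(
        "SELECT md5, bytes FROM itemSums WHERE identifier = ? AND filename = ?",
        (identifier, filename),
    ).fetchone()
    if row is None:
        raise KeyError((identifier, filename))
    old_md5, old_size = row
    conn.execute(
        "INSERT INTO modified (changed, identifier, filename, previousMd5, "
        "previousBytes, reason, who) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (when, identifier, filename, old_md5, old_size, reason, who),
    )
    conn.execute(
        "UPDATE itemSums SET md5 = ?, bytes = ?, timesModified = timesModified + 1 "
        "WHERE identifier = ? AND filename = ?",
        (new_md5, new_size, identifier, filename),
    )


def demo():
    """Record one change to a hypothetical file and return both table states."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE itemSums (identifier TEXT, filename TEXT, md5 TEXT, "
        "bytes INTEGER, timesModified INTEGER DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE modified (changed TEXT, identifier TEXT, filename TEXT, "
        "previousMd5 TEXT, previousBytes INTEGER, reason TEXT, who TEXT)"
    )
    old, new = b"first version", b"second version!"
    conn.execute(
        "INSERT INTO itemSums VALUES (?, ?, ?, ?, 0)",
        ("item1", "file1.tif", hashlib.md5(old).hexdigest(), len(old)),
    )
    record_change(
        conn, "item1", "file1.tif", hashlib.md5(new).hexdigest(), len(new),
        "rescanned", "staff", "2016-04-26 08:41:00",
    )
    current = conn.execute("SELECT md5, bytes, timesModified FROM itemSums").fetchone()
    logged = conn.execute("SELECT previousMd5, previousBytes, reason FROM modified").fetchone()
    return current, logged
```

Keeping the earlier checksum in modified means a mismatch found during verification can be told apart from a documented, intentional change.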


This database was started to track what has been selected out for inclusion in the Digital Preservation Network, since contributions should be of a certain size. Our first set of content was to be 5 TB, so we selected from content that is NOT yet in ADPNet LOCKSS, but which, combined, amounts to approximately 5 TB. To do so, we had to select only a portion of an ongoing (long-term) collection. Hence, inclusion of information down to the file level was necessary, to avoid duplication and to verify checksums. The fields in the "submitted" table include timestamp, identifier, item identifier, filename, file path, byte size, md5 checksum, collection number and manifest number (the latter refers to the LOCKSS manifest into which the file is linked for ADPNet pickup; multi-TB collections have multiple manifests, as we limit the size of each AU -- "Archival Unit" -- to one terabyte or less).
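The one-terabyte limit per AU implies that files in a large collection must be spread over several manifests. How manifest numbers are actually assigned is not described here; the sketch below shows one simple possibility, filling each manifest in file order until the next file would exceed the limit. The function name is hypothetical, and treating a terabyte as 10**12 bytes is an assumption.

```python
def assign_manifests(files, limit=10**12):
    """Assign (filename, byte_size) pairs to numbered manifests of at most `limit` bytes.

    Files are taken in the order given; a new manifest is started whenever the
    next file would push the current one over the limit.
    """
    assignments = []
    manifest, used = 1, 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the manifest limit")
        if used + size > limit:
            manifest += 1
            used = 0
        used += size
        assignments.append((name, manifest))
    return assignments
```

With a small limit for illustration, `assign_manifests([("a", 6), ("b", 5), ("c", 4)], limit=10)` places a in manifest 1 and both b and c in manifest 2.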