Tracking for the long term

From UA Libraries Digital Services Planning and Documentation
Revision as of 10:31, 5 July 2011 by Jlderidder (Talk | contribs)


We are developing tracking databases to manage content over time. Because this approach is still software-dependent, and relational databases are inherently unstable, the plan is to periodically print reports from the database and store them as flat files in the top levels of the storage system, as a form of manifest.
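As a minimal sketch of the periodic report, the following dumps one table to a tab-delimited flat file with a header row. It assumes the data can be reached through Python's DB-API; sqlite3 stands in for the real database server, and the function name export_table, the demo table, and its sample row are hypothetical.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as tab-delimited text.

    The first line written is a header of column names. Returns the
    number of data rows written. The table name is interpolated into
    the SQL, so only pass trusted, known table names.
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    rows = 0
    for row in cur:
        writer.writerow(row)
        rows += 1
    return rows


if __name__ == "__main__":
    # Hypothetical demo data; the real tables live in the tracking database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo (identifier TEXT, title TEXT)")
    conn.execute("INSERT INTO demo VALUES ('demo_0001', 'Example collection')")
    buf = io.StringIO()
    export_table(conn, "demo", buf)
    print(buf.getvalue(), end="")
```

In practice the output would be opened as a dated file in the top level of the storage system instead of an in-memory buffer.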

InfoTrack and md5sums are located on The checkscripts database, which documents any errors encountered and when each script runs (and thus when the MD5 sums are verified), is described in Tracking_automated_scripts.

The InfoTrack database


This table contains enough information about each collection for a PHP script to dynamically deliver an alphabetized list of current online collections and EAD finding aids, with descriptions, icons, and links to the content (and, where available, the associated finding aid and a link to its online version). Most of this information comes from the Collection_Information file that Digital Services personnel create when they begin to digitize a collection, drawing on the spreadsheets filled out by archivists. The information is uploaded by the collToDbase script described in step 10 of Moving_Content_To_Long-Term_Storage, or by the moveContent script described in the 3rd section of Most Content. These scripts also add the expected canned link for retrieving content and specify whether the collection is online yet. More information about this table is available here: allColls


Once a collection has been released for harvesting into LOCKSS, we must monitor its size and any additions to it, and avoid any changes. As content is communicated over the network, we log here the identifier, manifest number, and date (and, for collections which are subcollections of others, such as rare books, the subcollection title and parent identifier).


This table was created to support persistent identifiers. It is a lookup table that provides redirects. Each item for which we would like to provide this support is assigned a number. The item is retrieved using a URL of the form: where the number following the last forward slash is the number of the item. The actual URL is stored in this table (realurl), along with the assigned number (purlnum), an original identifier (id_2009), a datestamp, and a history of any changes (history). This table is accessed by the script which lives in the cgi-bin of A URL rewrite and a virtual host configuration, along with the DNS registration of, were the only other support necessary for this to work. Whenever a file must be moved, the database is updated, and the persistent URL continues to work. In this way, we may enter URLs into metadata, webpages, online catalogs, etc., and never have to change them.
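The lookup and update described above can be sketched as follows. This is an illustration only, not the production CGI script: the column names realurl, purlnum, id_2009, and history come from the description, while the table name purls, the column types, the function names, the history line format, and the example.org URLs are assumptions. sqlite3 stands in for the real database.

```python
import sqlite3
from datetime import date

# Assumed schema; only the column names are taken from the description.
SCHEMA = """
CREATE TABLE purls (
    purlnum   INTEGER PRIMARY KEY,
    realurl   TEXT NOT NULL,
    id_2009   TEXT,
    datestamp TEXT,
    history   TEXT
)
"""


def resolve(conn, purlnum):
    """Return the current real URL for a persistent number, or None if unknown."""
    row = conn.execute(
        "SELECT realurl FROM purls WHERE purlnum = ?", (purlnum,)
    ).fetchone()
    return row[0] if row else None


def move_item(conn, purlnum, new_url, today=None):
    """Point a persistent number at a new location and log the change in history."""
    today = today or date.today().isoformat()
    old_url = resolve(conn, purlnum)
    if old_url is None:
        raise KeyError(purlnum)
    note = f"{today}: {old_url} -> {new_url}\n"
    conn.execute(
        "UPDATE purls SET realurl = ?, history = COALESCE(history, '') || ? "
        "WHERE purlnum = ?",
        (new_url, note, purlnum),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO purls (purlnum, realurl, id_2009, datestamp) "
        "VALUES (1, 'http://example.org/old/path', 'demo_0001', '2011-07-05')"
    )
    move_item(conn, 1, "http://example.org/new/path")
    print(resolve(conn, 1))
```

A redirect script would call resolve with the number taken from the end of the requested URL and answer with an HTTP redirect to the result, so the published URL never changes when the file moves.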


This table tracks born-digital content such as Electronic Theses and Dissertations, which may have embargoes on web delivery that must be tracked and supported. Fields here include the identifier (id_2009), first and last name of the primary author, collection number (collnum), datestamp entered, title, and the date the content should be made available (dateAvailable, in the form yyyy-mm-dd).
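An embargo check against dateAvailable might look like the sketch below. The field names id_2009 and dateAvailable come from the description; the function names are hypothetical, and treating content as available on the embargo date itself (rather than the day after) is an assumption.

```python
from datetime import date


def is_available(date_available, today=None):
    """True once the yyyy-mm-dd embargo date has been reached (inclusive, by assumption)."""
    today = today or date.today()
    return date.fromisoformat(date_available) <= today


def releasable(records, today=None):
    """Return the id_2009 of every record whose embargo has expired."""
    return [
        rec["id_2009"]
        for rec in records
        if is_available(rec["dateAvailable"], today)
    ]


if __name__ == "__main__":
    # Hypothetical records standing in for rows of the table.
    rows = [
        {"id_2009": "demo_0001", "dateAvailable": "2011-01-01"},
        {"id_2009": "demo_0002", "dateAvailable": "2099-01-01"},
    ]
    print(releasable(rows))
```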


This table is designed to capture the format a file is in, a URL to information on that format, the quality of the capture, and the version of the format. For example, a TIFF version 6.0 file captured at 600 dpi would have a URL entered from the Unified Digital Format Registry ([[1]]), similar to this one now available via the PRONOM Digital Format Registry: [[2]].

By maintaining a record of the current format, version, and quality of each file in a database, we can easily identify and locate all files in need of migration or emulation in the face of approaching obsolescence.
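Locating the files that share an at-risk format then reduces to a single query. The sketch below assumes a table named formats with columns identifier, format, version, quality, and formatUrl; none of these names are given in the description, so all of them are hypothetical, as are the sample rows.

```python
import sqlite3

# Hypothetical schema: the description lists what is recorded, not the column names.
SCHEMA = """
CREATE TABLE formats (
    identifier TEXT,
    format     TEXT,
    version    TEXT,
    quality    TEXT,
    formatUrl  TEXT
)
"""


def files_in_format(conn, format_name, version):
    """Return the identifiers of all files recorded with this format and version."""
    cur = conn.execute(
        "SELECT identifier FROM formats "
        "WHERE format = ? AND version = ? ORDER BY identifier",
        (format_name, version),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO formats (identifier, format, version, quality) VALUES (?, ?, ?, ?)",
        [
            ("demo_0001", "TIFF", "6.0", "600 dpi"),
            ("demo_0002", "WAV", "1", "96 kHz"),
        ],
    )
    print(files_in_format(conn, "TIFF", "6.0"))
```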


This table is not yet in use, and will likely expand. Its purpose is to track which donors have provided support for digitization or processing, or have donated content, so that we can link to their websites and use their logos in the display of content.


itemSums This table contains the identifier, file name, file path, current MD5 checksum, number of times modified, date of first entry, and a notes field.

modified If a file was indeed modified (in which case that would have been indicated in the itemSums table), then there will be an entry here for each modification. The timestamp of the change, the identifier, filename, previous MD5 checksum, the reason modified, and by whom are recorded here.
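How the two tables work together can be sketched as follows: a file is verified by comparing a freshly computed MD5 sum with the one in itemSums, and an intentional change is logged in modified before the stored sum is replaced. The table names itemSums and modified come from the description; every column name, the function names, and the use of sqlite3 are assumptions, and this is not the actual check script.

```python
import hashlib
import sqlite3
from datetime import datetime

# Assumed column names; the description lists the contents, not the names.
SCHEMA = """
CREATE TABLE itemSums (
    identifier TEXT, filename TEXT, filepath TEXT, md5 TEXT,
    timesModified INTEGER DEFAULT 0, firstEntered TEXT, notes TEXT
);
CREATE TABLE modified (
    changed TEXT, identifier TEXT, filename TEXT,
    previousMd5 TEXT, reason TEXT, modifiedBy TEXT
);
"""


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(conn, identifier, path):
    """True if the file on disk still matches the checksum stored in itemSums."""
    row = conn.execute(
        "SELECT md5 FROM itemSums WHERE identifier = ?", (identifier,)
    ).fetchone()
    if row is None:
        raise KeyError(identifier)
    return row[0] == md5_of(path)


def record_modification(conn, identifier, path, reason, who, when=None):
    """Log the previous checksum in modified, then store the new one in itemSums."""
    when = when or datetime.now().isoformat(timespec="seconds")
    row = conn.execute(
        "SELECT filename, md5 FROM itemSums WHERE identifier = ?", (identifier,)
    ).fetchone()
    if row is None:
        raise KeyError(identifier)
    filename, previous = row
    conn.execute(
        "INSERT INTO modified VALUES (?, ?, ?, ?, ?, ?)",
        (when, identifier, filename, previous, reason, who),
    )
    conn.execute(
        "UPDATE itemSums SET md5 = ?, timesModified = timesModified + 1 "
        "WHERE identifier = ?",
        (md5_of(path), identifier),
    )
```

A failed verify with no matching entry in modified indicates an unexplained change to the file, which is the case the check scripts exist to catch.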

The dates on which the files are verified are stored in the checkscripts database described in Tracking_automated_scripts.
