Tracking for the long term

From UA Libraries Digital Services Planning and Documentation
Latest revision as of 10:06, 17 December 2015

We're developing tracking databases for management of content over time. Because this approach is still software-dependent, and relational databases are inherently unstable, we plan to periodically print reports from the database and store them as flat files in the top levels of the storage system, as a form of manifest.

InfoTrack and md5sums are located on libcontent.lib.ua.edu. The checkscripts database, which documents any errors encountered and also when each script runs (and thus when the MD5 sums are verified), is described in Tracking_automated_scripts.

The InfoTrack database

allColls

This table contains enough information about each collection for a PHP script to dynamically deliver an alphabetized list of current online collections and EAD finding aids, with descriptions, icons, and links to the content (and, where available, the associated finding aid and the link to it online). Most of this information comes from the Collection_Information file created by Digital Services personnel when they begin to digitize a collection, drawing information from the spreadsheets filled out by archivists. This information is uploaded by the collToDbase script described in step 10 of Moving_Content_To_Long-Term_Storage or the moveContent script described in the 3rd section of Most Content. These scripts also add the expected canned link for retrieving content, and specify whether the collection is online yet. More information about this table is available here: allColls

removedColls

This table documents when we've been asked to take down either an EAD or a digital collection, when and by whom. The process for doing this is spelled out in TakeDowns.

inLOCKSS

Once a collection has been released for harvesting into LOCKSS, we must monitor the size of, and additions to, that content, and avoid any changes. As content is communicated over the network, we log here the identifier, manifest number, and date (and for collections which are subcollections of others, such as rare books, the subcollection title and parent identifier).

lookup

This table was created to support persistent identifiers. It's a lookup table to provide redirects. Each item for which we would like to provide this support is assigned a number. That item is then retrieved using a URL of the form http://purl.lib.ua.edu/3234, where the number following the last forward slash is the number of the item. The actual URL is stored in this table (realurl), along with the assigned number (purlnum), an original identifier (id_2009), a datestamp, and a history of any changes (history). This table is accessed by the script redirect.pl, which lives in the cgi-bin of libcontent.lib.ua.edu (/srv/www/cgi-bin/). A URL rewrite and a virtual host configuration, along with the DNS registration of purl.lib.ua.edu, were the only other support necessary for this to work. Whenever a file must be moved, the database is updated, and the persistent URL continues to work. In this fashion, we may enter URLs into metadata, webpages, online catalogs, etc., and never have to change them.
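The redirect logic can be sketched as follows. This is a Python illustration only, not the contents of redirect.pl: the field names (purlnum, realurl, id_2009, history) come from the description above, while the SQLite stand-in, the function names resolve_purl and move_item, and the example row are assumptions.

```python
import sqlite3
from typing import Optional


def resolve_purl(conn: sqlite3.Connection, path: str) -> Optional[str]:
    """Return the stored realurl for a request path such as '/3234', or None."""
    number = path.rstrip("/").rsplit("/", 1)[-1]
    if not (number.isascii() and number.isdigit()):
        return None
    row = conn.execute(
        "SELECT realurl FROM lookup WHERE purlnum = ?", (int(number),)
    ).fetchone()
    return row[0] if row else None


def move_item(conn: sqlite3.Connection, purlnum: int, new_url: str, note: str) -> None:
    """Point an existing number at a new location and append a note to its history."""
    conn.execute(
        "UPDATE lookup SET realurl = ?, history = COALESCE(history, '') || ? "
        "WHERE purlnum = ?",
        (new_url, note + "\n", purlnum),
    )
    conn.commit()


# Stand-in table with one hypothetical row (the identifier and URL are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lookup (purlnum INTEGER PRIMARY KEY, realurl TEXT, "
    "id_2009 TEXT, datestamp TEXT, history TEXT)"
)
conn.execute(
    "INSERT INTO lookup VALUES "
    "(3234, 'http://example.org/old/item.jpg', 'example_id', '2015-12-17', NULL)"
)

print(resolve_purl(conn, "/3234"))
```

Calling move_item after a file is relocated changes what resolve_purl returns for the same number, which is why the published URL never needs editing.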


bornDigital

This table tracks born digital content such as Electronic Theses and Dissertations, which may have embargoes on web delivery which must be tracked and supported. Fields here include the identifier (id_2009), first and last name of the primary author, collection number (collnum), datestamp entered, title, the date the content should be made available (dateAvailable, in form yyyy-mm-dd), and exceptions.
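A delivery script could test an embargo along these lines. This is a hedged Python sketch, not our production code: the function name is_deliverable is hypothetical, and treating an empty dateAvailable as "no embargo" is an assumption. The exceptions field is not modeled here.

```python
from datetime import date
from typing import Optional


def is_deliverable(date_available: Optional[str], today: date) -> bool:
    """True once the embargo date (yyyy-mm-dd) has been reached.

    Assumption: a missing or empty dateAvailable means there is no embargo.
    """
    if not date_available:
        return True
    return date.fromisoformat(date_available) <= today


print(is_deliverable("2016-05-01", date(2015, 12, 17)))
```

Because dateAvailable is stored as yyyy-mm-dd, it parses directly with date.fromisoformat and compares cleanly against the current date.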


geocode

Information extracted using the Google API is stored here for particular item locations. This enables us to apply the same lat/long to other items with the same location without calling the API repeatedly (there are limits on calls, as well as server overhead).

itemLocation

This works with the geocode table, indicating the item location for each of the items in the database. The locationID here corresponds to the locationID in the geocode table.
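The way these two tables avoid repeated API calls can be sketched as a small cache. This Python illustration makes several assumptions: the class name GeocodeCache, the normalization of location strings, and the fake_fetch stand-in (which returns made-up coordinates instead of calling any real geocoding service) are all hypothetical.

```python
from typing import Callable, Dict, Tuple

LatLong = Tuple[float, float]


class GeocodeCache:
    """Remember coordinates per location so each location is fetched only once."""

    def __init__(self, fetch: Callable[[str], LatLong]) -> None:
        self._fetch = fetch                      # stand-in for the remote API call
        self._by_location: Dict[str, LatLong] = {}
        self.api_calls = 0

    def lat_long(self, location: str) -> LatLong:
        # Assumption: locations match after case and whitespace normalization.
        key = " ".join(location.lower().split())
        if key not in self._by_location:
            self.api_calls += 1
            self._by_location[key] = self._fetch(location)
        return self._by_location[key]


def fake_fetch(location: str) -> LatLong:
    """Hypothetical fetcher returning placeholder coordinates."""
    return (1.0, 2.0)


cache = GeocodeCache(fake_fetch)
first = cache.lat_long("Tuscaloosa, Alabama")
second = cache.lat_long("tuscaloosa,  alabama")
print(cache.api_calls)
```

In the database, the dictionary corresponds to the geocode table, and each item's row in itemLocation carries the locationID that points at the cached entry.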

names

Name authority database (MADS), described in For_Subjects_and_Names, to support controlled vocabulary in the use of names in descriptive metadata. The database numbers are referenced in the spreadsheets, then used in the makeMods script to ensure the MODS entry matches the controlled vocabulary in the database.

subjects

Subject authority database containing tagged versions of each controlled vocabulary entry; the tags indicate the applicable fields to use when constructing the subject in MODS. This is used in preparing spreadsheets for transformation into MODS.

numItemPages

This is a legacy table from when we were migrating content through Scripto to obtain transcripts. We used this table to document how many pages of particular items were transcribed, and how many OCR'd, so we could get a sense of how complete our coverage was on items. The script that fills this table (which had been run periodically) is numPagesAndTrans, found in /srv/scripts/transcripts/olderVersions/previous/. It is retained in case we need to collect this information again.

md5sums

itemSums

This table contains the identifier, file name, file path, current MD5 checksum, byte size, number of times modified, date of first entry, and a notes field.

modified

If a file was indeed modified (in which case that would have been indicated in the itemSums table), then there will be an entry here for each modification. The timestamp of the change, the identifier, filename, previous MD5 checksum, the byte size of the original file, the reason modified, and by whom are recorded here.
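Verifying a file against its itemSums entry amounts to recomputing the MD5 checksum and byte size and comparing both with the stored values. The following Python sketch shows the idea under stated assumptions: the function names md5_and_size and matches_record are hypothetical, and the temporary file stands in for a stored item.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def md5_and_size(path: str, chunk_size: int = 1 << 20) -> Tuple[str, int]:
    """Return (hex MD5 digest, byte size), reading in chunks to bound memory use."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_record(path: str, stored_md5: str, stored_size: int) -> bool:
    """True when the file still has the checksum and size on record."""
    return md5_and_size(path) == (stored_md5.lower(), stored_size)


# Demonstration on a throwaway file standing in for a stored item.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    sample_path = tmp.name

sample_md5, sample_size = md5_and_size(sample_path)
still_matches = matches_record(sample_path, sample_md5, sample_size)
os.remove(sample_path)
print(sample_md5, sample_size, still_matches)
```

A mismatch that was not intended would be reported as an error, while an intended change would get a new row in the modified table carrying the previous checksum and size.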

imageTechMed

As images are stored, technical metadata is generated for them, and some administrative information from that process is stored here. This information includes the mime type, format, format version, test date, format registry, format registry key, whether or not the TIFF contains a thumb, and whether conflicts were found during the testing process. You can find more information about this in Image_Technical_Metadata.

audioTechMed

As audio files are stored, technical metadata is generated for them, and some administrative information from that process is stored here. This information includes the mime type, format, format version, test date, format registry, format registry key, number of channels, and whether conflicts were found during the testing process. You can find more information about this in Audio_Technical_Metadata.

typeContent

This is the beginning of a method of tracking types of content, where the type may impact the migration target format. We have yet to find a way to automate capture of whether an image is of primarily text, but when we do, we expect to capture that information here. Thus far, the only types stored are scrapbooks, as identified at the item level.

dpn

This database was started to track what has been selected for inclusion in the Digital Preservation Network, since contributions should be of a certain size. Our first set of content was to be 5 TB, so we selected from content that is NOT yet in ADPNet LOCKSS, but which, combined, amounts to approximately 5 TB. To do so, we had to select only a portion of an ongoing (long-term) collection. Hence, inclusion of information down to the file level was necessary, to avoid duplication and to verify checksums. The fields in the "submitted" table include timestamp, identifier, item identifier, filename, file path, byte size, MD5 checksum, collection number and manifest number (the latter refers to the LOCKSS manifest into which the file is linked for ADPNet pickup; multi-TB collections have multiple manifests, as we limit the size of each AU -- "Archival Unit" -- to one terabyte or less).
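The one-terabyte limit per AU means a multi-TB collection has to be divided across several manifests. One simple way to do that is sketched below in Python. It is an illustration, not our selection script: the function name assign_manifests, the in-order greedy strategy, and the decimal definition of a terabyte are all assumptions.

```python
from typing import Iterable, List, Tuple

ONE_TB = 10 ** 12  # assumption: decimal terabyte, in bytes


def assign_manifests(
    files: Iterable[Tuple[str, int]], limit: int = ONE_TB
) -> List[List[str]]:
    """Group (filename, byte size) pairs, in order, into manifests of at most `limit` bytes."""
    manifests: List[List[str]] = [[]]
    used = 0
    for name, size in files:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the manifest limit")
        if manifests[-1] and used + size > limit:
            manifests.append([])  # start the next manifest
            used = 0
        manifests[-1].append(name)
        used += size
    return [group for group in manifests if group]


# Tiny illustrative sizes with a limit of 10 bytes, so the grouping is easy to follow.
groups = assign_manifests([("a", 6), ("b", 5), ("c", 4), ("d", 10)], limit=10)
print(groups)
```

The position of each group in the result plays the role of the manifest number recorded for every file in the "submitted" table.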
