TrackingFiles

From UA Libraries Digital Services Planning and Documentation
Revision as of 15:18, 5 May 2010 by Scpstu32 (Talk | contribs)

Jump to: navigation, search

Contents

TrackingFiles

Scope

Since approximately the middle of 2008, an Excel spreadsheet named Trackingfiles.xlsx has been used as a central data source for digitization and scanning activities - i.e. data not found in the Descriptive Metadata sheets or other metadata sources. Typical data recorded includes, but is not limited to, the object name, date of scan, and scanning technician.

When a collection is completely scanned and ready for storage and online presentation, a tab delimited text export (unicode) of this spreadsheet will be placed into the collection Admin folder as:

collection_number.log.txt

ex: u0008_0000001.log.txt

  Use Excel to make this text file. Potential problems exist if using OpenOffice in that some of the formula values may not get converted correctly.

Problems

With an increase in the number of student workers, the limitations of using a singular file, consisting of multiple and non-standardized worksheets (one for each collection) began to manifest themselves.

Namely:

1. "traffic jams" : students scanning/working simultaneously needed to coordinate with each other their collective real-time use of the spreadsheet given that when one user had the spreadsheet open on their desktop, no one else could edit the spreadsheet.

2. The spreadsheet provided a source for report-worthy data - scanning activity, total scans by date, etc. However, its non-standardized design precluded batch information harvesting via scripts and/or Excel formulas.


Solutions

In the summer of 2009, it was decided to consider two changes that could eliminate the aforementioned problems:

Namely:

1. Separating the worksheets into separate workbooks (files) and contain them all in one centralized folder within S:\Digital Projects\Organization\Digital Program Logs. This would eliminate the "traffic jams" as users would no longer vie for access to the same file. The exception would be the instance when two or more students were scanning/working from the same box/collection. Still, this would be marked decrease from the previous situation.

2. Standardizing the data fields. Standardizing the spreadsheet could allow for import into a master Excel workbook and/or database application in which reoccurring formulas and/or queries could be created and saved, allowing for "one stop shopping" in regard to the collection of report-worthy data (for monthly and yearly reports, etc.).


Standardization of TrackingFiles

Worksheets per Workbook

Each trackingfiles spreadsheet file will contain two worksheets:

1. "Log" : this is where all the fields mentioned in the next section get recorded.

  DO NOT sort this sheet unless you know how to sort ONLY the scanning log data. That is to say, DO NOT sort column A nor any of the columns with formulas that tally scan totals, etc.

2. "QC" : this is the place where Quality Control Notes can go as the collections is being checked for quality.

  • As none of the QC information will get exported to a tab delimited text file when the collection is ready for storage, any important information from this worksheet needs to be copied into the "Collection Notes" cloumn of the "Log" sheet, mentioned below.

3. "scanner_types" : this sheet exists only to provide the list if scanner types so that the drop down list for the field "Scanner Type" in the Log sheet can function. If scanning devises are every discarded or new ones purchased, this sheet needs to be updated prior to the creation of a tracking files file for a new digital collection.

Current scanner types:

  • audio (Fostex)
  • batch scanner (HP)
  • bookdrive (Atiz)
  • film strip (Nikon)
  • flatbed (Epson)
  • overhead (Phase One)
  • scanback (Better Light)
  • other
  avoid using "other" at all costs, only use this for old collections for which discarded equipment was likely used.
  with the sheet "unprotected" it's possible to choose "other" and search/replace it with a customized scanner type should the need to do so arise.

Standardized Fields

The template for the TrackingFiles Excel files is listed below in the format:

Field Title: brief description of the Field.

Fields marked with + are mandatory.

Fields marked ++ are only mandatory if a Transcript Scan/s and/or Transcript Text file/s exists/exist.

Fields marked +++ are only mandatory for audio collections.

Fields marked ++++ are derived from formulas and require no human entry.


Collection Notes: general notes field regarding any pertinent information regarding the whole collection, or parts thereof. May also include any QC data that should be retained. Does not correlate with other columns, so that General Notes in cell A2 do not necessarily pertain only to B2, C2, etc. In certain circumstances such as information regarding what items were not scanned for a given collection, this information should *also* be included in the "Notes" fields in the Descriptive Metadata, though at the item level for the latter (Descriptive Metadata). This is per Jody 7/2009.

+Object Filename: name of object (does not include file extension).

+Number of Scans: number of scans per item. i.e. 5 for a five page letter.

+Scanning Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one object, separate initials by SEMICOLON+SPACE. ex: NA; AAP

  as the data validation will force you to enter only 2-3 letters, you may enter "ABC" and then search/replace that with "NA; AAP". The sheet must not be protected in order to do this.

+Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

+Scanner Type: drop down list of scanner that scans were made on. use this Excel formula "=scanner_types!A2:A9".

+Binding Status: used to indicate if item is bound or not for ARL stats. 0=unbound, 1=bound. Leaving the cell blank will also equate to being unbound, but entering "0" is the preferred entry.

Scan Notes: any textual information necessary regarding the scanning of the item or item itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

++Number of Transcript Scans: number of scans per item. i.e. 5 for a five page transcript.

++Transcript Scan Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP

++Transcript Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

Transcript Scan Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

+OCR?: used to indicate if item of part of item (50% or more) is an OCR candidate. 0=don't OCR, 1=OCR. Leaving the cell blank equates to a "0", but entering "0" is the preferred entry. This DOES NOT relate to transcripts found in the box next to the item itself.

  If you find an OCRable transcript in the box alongside the item: you must OCR the transcript in real-time and clean up the files to be ".txt" files.
  Then make sure everything matches along with the content in the primary item (same words on same pages) and not the transcript since we won’t be keeping scans of transcripts in most cases. 
  If we suspect that it isn’t doing all this work for a set of transcripts we should consult with an Archivist and ask them if they think the item is super-special
  in which case we would go ahead and do all the OCR/clean up work if they say the item is special.
 "Note there are 3 possibilities regarding OCRing transcripts: either A) we only scan them and make the transcript scans align in filename to the primary item scans and then raw OCR (ocr.txt)
  these transcript scans. We save the scans too for future re-OCR when the OCR technology improves or B) we scan them and make remediated .txt files for the transcripts and delete the 
  transcript scans, or C) we scan them and make remediated .txt files for the transcripts and preserve the transcript scans if they are themselves ""important"". 
  We do whatever we need to with the scans (duplicate and rename, etc) to make them match with the layout of the primary item."
  This last option above is VERY RARELY going to occur. Generally it will be the second option, though the first may be quite common with audio transcripts. Of course, if we have good OCR
  of the item itself, we aren't interested in OCRing the transcripts themselves at all (unless they are really super-special.) Is this confusing enough? :-)
  
  For information on how to OCR existing transcripts and creating cleaned up .txt files see:
  http://www.lib.ua.edu/wiki/digcoll/index.php/Transcripts


++ Number of Transcript Text Files: number of scans per item. i.e. 5 for a five page transcript.

++Transcript Text File Technician: 2 or 3 letter initials of employee responsible for OCR work, etc. If more than one person has assisted in the completion of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP

++Transcript Text File Dates: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

Transcript Text File Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

+++Audio QC Technician: 2 or 3 letter initials of employee responsible for audio quality control. If more than one person is responsible, separate initials by SEMICOLON+SPACE. ex: NA; AAP

+++Audio QC Date: date of audio QC entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

+++Audio Decision List: free text area for 'ADL' notes. ex: [Working Lives ADL]

  the ADL column is deprecated. Do not use this column. It has been black-ed out. The column must still exist for easy import into Access.

++++Collection Number: = the first 13 characters of the object filename. function: =LEFT(B:B,13)

++++Total Scans (Transcripts, too): = SUM of (Number of Scans + Number of Transcript Scans).

++++Total Objects: = total cells in "Number of Scans" column that have at least 1 listed for the number of scans. So if there's a row for a yet-unscanned object, it won't not count toward the Total Objects until an actual number of scans is recorded for the item. function: COUNTIF(C:C,">=1")

  Formerly this formula was based on the total non-blank rows in spreadsheet minus 1 (the header row). function: =COUNTIF(B:B,"*")-1. 
  The was replaced by the newer formula when we discovered people were making rows for objects and writing in the "Scans Notes" column phrases like "Not scanned, needs to be scanned on overhead instead.".

++++Total Bound Items: = sum of total items marked as bound.


  • it should be mentioned that the date formats are formatted to DATE data type in Excel and if 06/06/09 is entered, only 6/6/09 will be displayed.
  • should there be a need to add columns in terms of helping technicians track their work more easily, these non-standard columns MUST be add after the last standard field (i.e. "Total Objects"). This ensures that the database queries will work properly when all the trackingfiles spreadsheets are merged.
  • if it isn't known who the scanning technician was, please use "??" in the appropriate column to indicate that this is unknown information.

Filenaming Convention

Individual TrackingFiles spreadsheets will be named as follows: collection_number.log.xlsx

ex: u0001_0000001.log.xlsx


From individual TrackingFiles files to Database

Background

With numerous trackingfiles spreadsheets, the "traffic jam" problem is virtually eliminated. And the standardization of the spreadsheets makes it possible to query the data for reporting purposes, but there is no simply way to query all the data unless it exists as one table.

For this reason, a search began for a free, pre-exisitng Visual Basic Macro that would allow one of the following methods:

1. A Microsoft Excel macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular, temporary master spreadsheet (master). This master spreadsheet could then be linked by a desktop database application (Microsoft Access of OpenOffice Base) for purposing of running common queries. Note that the master spreadsheet would be temporary in the sense that it needs to be made every time there is a need for current queries/reports. While it's possible to have the master spreadsheet have a "live" connection, continually being fed updated from the children spreadsheets, this did not seem like a good idea given that we may decided to move older, languishing children spreadsheets out of the main folder, thereby potentially breaking the live connection.

2. A Microsoft Access macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular table within Access.

We were not able to find a macro that satisfied method #2, but we found a script that installs as an Excel Add-In called RDBMerge, written by Ron de Bruin. The corresponding source code is here, with explanations here.

As an Add-in RDBMerge provides a GUI interface, is simple to use, and allows us to employ method #1. The sequence for using this tool and working with the data in a database is as follows:

  an OpenOffice equivalent seems to be available from http://www.sobolsoft.com/calcjoinfile/, but we have not tested it.


Workflow

Ensure all trackingfiles spreadsheets are in a single folder (S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\). No spreadsheet of any other type may exist in this folder.

1. At the end of every month do the following:

2. Open Excel.

3. Open a new, blank spreadsheet.

4. Run RDBMerge to combine all the spreadsheets within (\\Libfs1\share\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files).

Use the following RDBMerge settings:


Rdbmerge.PNG


  • Notice the setting for "Which Range". By placing A2 in the box the program will merge all cells in all spreadsheets without the header rows. We'll have to paste the header row into the merged spreadsheet in the latter steps.

5. RDBMerge will have created a new workbook with 2 worksheets: "Combine Sheet" and "Log Sheet".

6. Go ahead and check "Log Sheet" to make sure there were no errors.

  • It is OK if it reports an errors such as below because the "~" indicates a temporary file and not one of our real spreadsheets.

\tracking_files\~$tracking_files_template1.xlsx Error

7. Now open any one of our standard TrackingFiles workbooks and copy/paste the header row into "Combine Sheet".

8. Save "Combine Sheet" as trackingfiles_merge.xlsx to the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

  • A database file called "trackingfiles.accdb" will exist in the same folder that will know to automatically connect to a file called "trackingfiles_merge.xlsx". It is therefore imperative to name the file correctly in step #7.

9. Close Excel.

10. Open Access and create queries or run pre-existing ones for reporting purposes.


Setting up the Access Database : shortcut

1. Open "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

2. Delete any pre-existing tables.

3. Choose External Data>Saved Imports

4. Run the saved import called "Import-trackingfiles_merge".

  This will automatically import "trackingfiles_merge.xlsx" from the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.
  This will automatically format the numerically-based columns (i.e. Total Scans, etc.) as integers so that the SUM functions will work correctly.


Setting up the Access Database : from scratch

The database file "trackingfiles.accdb" should need little to no alternation once set up (and the following instructions are cursory), but in the event that the file is lost simply do the following to make a new one:

1. Open Access and create a new database called "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

2. Delete any pre-existing tables.

3. Choose External Data>Excel

4. Select "trackingfiles_merge.xlsx" and choose "Import Source Data into New Table" as the import option.

5. Click "Next" until Access asks what to name the table. Call it "data".

  • as you import the Excel file make sure that the box is checked which specifies that the first row is a header row
  • also you must make sure Number of Scans, Binding Status, Number of Transcript Scans, OCR?, Number of Transcript Text Files, Total Scans, Total Objects, Total Bound Items, etc. are imported as Integers. Otherwise you cannot tally the SUMs of these columns. Let Access determine the type for the other columns.
  • let Access add a key (i.e. an autonumber column)

Setting up the Access Database : useful queries

Here are some useful queries in SQL view. They can be pasted into Access to be re-created (use the SQL view).


Total Scans by User Defined Date Range:

SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status] FROM data GROUP BY data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status], data.[Collection Notes], data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects] HAVING (((data.[Scan Date]) Between [Enter Start Date as MM/DD/YYYY, but do not precede single digits with Zero] And [Enter End Date as MM/DD/YYYY, but do not precede single digits with Zero]));


Item Level Info by User Defined Collection Number:

SELECT data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects], data.[Total Bound Items] FROM data WHERE (((data.[Collection Number])=[enter a collection number]));


Total Scans by User Defined Month and Year:

SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scan Notes], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Total Bound Items] FROM data WHERE (((data.[Scan Date]) Like [enter month as # for Jan-Sep or ## for Oct-Dec]+"*"+[enter year as ####]));


If you are interested in learning SQL, the W3C Schools SQL tutorial is a good place to start.

The Tizag site has a small introduction to Access that might be helpful, too.

VBA : batch updates to Tracking Files

All tracking files can be altered in batch via Microsoft's Visual Basic for Applications.

By using the template script here, one can simply embed the resultant coding from a Macro record session. Some tweaking of the template code is required in order to point the script at our Tracking Files folder and to search for our "*.log.xls*" files.

Note the snippet below shows a line break that isn’t present in the blogger’s post, perhaps due to their blogging software’s formatting.

   ‘will start LOOP until all files in folder sPath have been looped through
   Set oWbk = Workbooks.Open(sPath & "\" & sFil)

It appears that VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be ‘unprotected’ for the code to work, but it’s quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you’ve created.

  • We are using a suffix to identify shared files … "filename.share.xlsx". Thus, sharing can be turned off manually or a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Trying to turn sharing off for a non-shared Excel file will break the script.


Here's an example of a script that edits the "QC" sheet in trackingFiles to add some columns.

Note, this script must not be run as the changes have already been implemented. This is just an example.

' Disclaimer:
' DO NOT use this macro. It has already been used. It is just saved as an example.
' Notice how the only subroutine is "Sub Open_All_Files()" and the last line, "End Sub", applies to this subroutine
' DO NOT use subroutines elsehwere. Just combine all the recorded macro components under the umbrella of the one subroutine.
'
' Description:
' this VBA script will open the "QC" sheet in the given trackingFiles folder (see "sPath" variable below)
' then it will add columns to the QC sheet and close the file for each Excel file
'
'
Option Explicit
Sub Open_All_Files()
Dim oWbk As Workbook
Dim sFil As String
Dim sPath As String
'
sPath = "C:\Documents and Settings\digitalav\Desktop\QC_update\" 'location of files
ChDir sPath
sFil = Dir("*.xls*") 'change or add formats
Do While sFil <> "" 'will start LOOP until all files in folder sPath have been looped through
Set oWbk = Workbooks.Open(sPath & "\" & sFil) 'opens the file
'
' paste recorded MACRO below
'
' Opens the "QC" sheet inside the trackingFiles spreadsheet
   Sheets("QC").Select
   Sheets("QC").Name = "QC"
   Sheets("QC").Select
' Inserts the Columns and adds Data Validation to some columns
   Columns("B:B").Select
   Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove
   Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove
   Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove
   Range("B1").Select
   ActiveCell.FormulaR1C1 = "Ready to go?"
   Columns("B:B").Select
   With Selection.Validation
       .Delete
       .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _
       Operator:=xlBetween, Formula1:="1", Formula2:="1"
       .IgnoreBlank = True
       .InCellDropdown = True
       .InputTitle = "Ready to go?"
       .ErrorTitle = ""
       .InputMessage = "Enter ""y"" or ""n""."
       .ErrorMessage = ""
       .ShowInput = True
       .ShowError = True
   End With
   Range("C1").Select
   ActiveCell.FormulaR1C1 = "Intials (last QC Technician)"
   Columns("C:C").Select
   With Selection.Validation
       .Delete
       .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _
       Operator:=xlBetween, Formula1:="2", Formula2:="3"
       .IgnoreBlank = True
       .InCellDropdown = True
       .InputTitle = ""
       .ErrorTitle = ""
       .InputMessage = ""
       .ErrorMessage = ""
       .ShowInput = True
       .ShowError = True
   End With
   With Selection.Validation
       .Delete
       .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _
       Operator:=xlBetween, Formula1:="2", Formula2:="3"
       .IgnoreBlank = True
       .InCellDropdown = True
       .InputTitle = "Initials"
       .ErrorTitle = ""
       .InputMessage = "Enter your 2 or 3 letter initials."
       .ErrorMessage = ""
       .ShowInput = True
       .ShowError = True
   End With
   Range("D1").Select
   ActiveCell.FormulaR1C1 = "Date of last QC activity"
   Columns("D:D").Select
   With Selection.Validation
       .Delete
       .Add Type:=xlValidateDate, AlertStyle:=xlValidAlertStop, Operator:= _
       xlBetween, Formula1:="1/1/2009", Formula2:="1/1/2121"
       .IgnoreBlank = True
       .InCellDropdown = True
       .InputTitle = "Date"
       .ErrorTitle = ""
       .InputMessage = "Enter date in MM/DD/YYYY format."
       .ErrorMessage = ""
       .ShowInput = True
       .ShowError = True
   End With
   Range("B1:D1").Select
   With Selection.Validation
       .Delete
       .Add Type:=xlValidateInputOnly, AlertStyle:=xlValidAlertStop, Operator _
       :=xlBetween
       .IgnoreBlank = True
       .InCellDropdown = True
       .InputTitle = ""
       .ErrorTitle = ""
       .InputMessage = ""
       .ErrorMessage = ""
       .ShowInput = True
       .ShowError = True
   End With
   Range("A2").Select
   Columns("C:C").EntireColumn.AutoFit
   Range("A2").Select
'
' DO NOT paste recorded macro elements below	
'	
 oWbk.Close True 'close the workbook, saving changes
 sFil = Dir
 Loop ' End of LOOP
 End Sub

Troubleshooting

We have occasionally run into problems with Excel files with the "share" feature turned on. That is to say, sometime access issues arise in terms of being able to open these files from certain computers under certain user profiles. A fix to this problem seems to be creating a new "tracking files" spreadsheet and pasting the data from the corrupt one into the new file, which replaces the old/corrupt one.

Personal tools