TrackingFiles

From UA Libraries Digital Services Planning and Documentation
Revision as of 17:25, 17 December 2009 by Scpstu32 (Talk | contribs)

Jump to: navigation, search

Contents

TrackingFiles

Scope

Since approximately the middle of 2008, an Excel spreadsheet named Trackingfiles.xlsx has been used as a central data source for digitization and scanning activities - i.e. data not found in the Descriptive Metadata sheets or other metadata sources. Typical data recorded includes, but is not limited to, the object name, date of scan, and scanning technician.

When a collection is completely scanned and ready for storage and online presentation, a tab delimited text export (unicode) of this spreadsheet will be placed into the collection Admin folder as:

collection_number.log.txt

ex: u0008_0000001.log.txt

  Use Excel to make this text file. Potential problems exist if using OpenOffice in that some of the formula values may not get converted correctly.

Problems

With an increase in the number of student workers, the limitations of using a singular file, consisting of multiple and non-standardized worksheets (one for each collection) began to manifest themselves.

Namely:

1. "traffic jams" : students scanning/working simultaneously needed to coordinate with each other their collective real-time use of the spreadsheet given that when one user had the spreadsheet open on their desktop, no one else could edit the spreadsheet.

2. The spreadsheet provided a source for report-worthy data - scanning activity, total scans by date, etc. However, its non-standardized design precluded batch information harvesting via scripts and/or Excel formulas.


Solutions

In the summer of 2009, it was decided to consider two changes that could eliminate the aforementioned problems:

Namely:

1. Separating the worksheets into separate workbooks (files) and contain them all in one centralized folder within S:\Digital Projects\Organization\Digital Program Logs. This would eliminate the "traffic jams" as users would no longer vie for access to the same file. The exception would be the instance when two or more students were scanning/working from the same box/collection. Still, this would be marked decrease from the previous situation.

2. Standardizing the data fields. Standardizing the spreadsheet could allow for import into a master Excel workbook and/or database application in which reoccurring formulas and/or queries could be created and saved, allowing for "one stop shopping" in regard to the collection of report-worthy data (for monthly and yearly reports, etc.).


Standardization of TrackingFiles

Worksheets per Workbook

Each trackingfiles spreadsheet file will contain two worksheets:

1. "Log" : this is where all the fields mentioned in the next section get recorded.

2. "QC" : this is the place where Quality Control Notes can go as the collections is being checked for quality.

  • As none of the QC information will get exported to a tab delimited text file when the collection is ready for storage, any important information from this worksheet needs to be copied into the "Collection Notes" cloumn of the "Log" sheet, mentioned below.

3. "scanner_types" : this sheet exists only to provide the list if scanner types so that the drop down list for the field "Scanner Type" in the Log sheet can function. If scanning devises are every discarded or new ones purchased, this sheet needs to be updated prior to the creation of a tracking files file for a new digital collection.

Current scanner types:

  • audio (Fostex)
  • batch scanner (HP)
  • bookdrive (Atiz)
  • film strip (Nikon)
  • flatbed (Epson)
  • overhead (Phase One)
  • scanback (Better Light)
  • other
  avoid using "other" at all costs, only use this for old collections for which discarded equipment was likely used.
  with the sheet "unprotected" it's possible to choose "other" and search/replace it with a customized scanner type should the need to do so arise.


Standardized Fields

The template for the TrackingFiles Excel files is listed below in the format:

Field Title: brief description of the Field.

Fields marked with + are mandatory.

Fields marked ++ are only mandatory if a Transcript Scan/s and/or Transcript Text file/s exists/exist.

Fields marked +++ are only mandatory for audio collections.

Fields marked ++++ are derived from formulas and require no human entry.


Collection Notes: general notes field regarding any pertinent information regarding the whole collection, or parts thereof. May also include any QC data that should be retained. Does not correlate with other columns, so that General Notes in cell A2 do not necessarily pertain only to B2, C2, etc. In certain circumstances such as information regarding what items were not scanned for a given collection, this information should *also* be included in the "Notes" fields in the Descriptive Metadata, though at the item level for the latter (Descriptive Metadata). This is per Jody 7/2009.

+Object Filename: name of object (does not include file extension).

+Number of Scans: number of scans per item. i.e. 5 for a five page letter.

+Scanning Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one object, separate initials by SEMICOLON+SPACE. ex: NA; AAP

  as the data validation will force you to enter only 2-3 letters, you may enter "ABC" and then search/replace that with "NA; AAP". The sheet must not be protected in order to do this.

+Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

+Scanner Type: drop down list of scanner that scans were made on. use this Excel formula "=scanner_types!A2:A9".

+Binding Status: used to indicate if item is bound or not for ARL stats. 0=unbound, 1=bound. Leaving the cell blank will also equate to being unbound, but entering "0" is the preferred entry.

Scan Notes: any textual information necessary regarding the scanning of the item or item itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

++Number of Transcript Scans: number of scans per item. i.e. 5 for a five page transcript.

++Transcript Scan Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP

++Transcript Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

Transcript Scan Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

+OCR?: used to indicate if item of part of item is an OCR candidate. 0=don't OCR, 1=OCR. Leaving the cell blank equates to a "0", but entering "0" is the preferred entry.

++ Number of Transcript Text Files: number of scans per item. i.e. 5 for a five page transcript.

++Transcript Text File Technician: 2 or 3 letter initials of employee responsible for OCR work, etc. If more than one person has assisted in the completion of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP

++Transcript Text File Dates: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

Transcript Text File Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.

+++Audio QC Technician: 2 or 3 letter initials of employee responsible for audio quality control. If more than one person is responsible, separate initials by SEMICOLON+SPACE. ex: NA; AAP

+++Audio QC Date: date of audio QC entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.

+++Audio Decision List: free text area for 'ADL' notes. ex: [Working Lives ADL]

++++Collection Number: = the first 13 characters of the object filename. function: =LEFT(B:B,13)

++++Total Scans (Transcripts, too): = SUM of (Number of Scans + Number of Transcript Scans).

++++Total Objects: = total cells in "Number of Scans" column that have at least 1 listed for the number of scans. So if there's a row for a yet-unscanned object, it won't not count toward the Total Objects until an actual number of scans is recorded for the item. function: COUNTIF(C:C,">=1")

  Formerly this formula was based on the total non-blank rows in spreadsheet minus 1 (the header row). function: =COUNTIF(B:B,"*")-1. 
  The was replaced by the newer formula when we discovered people were making rows for objects and writing in the "Scans Notes" column phrases like "Not scanned, needs to be scanned on overhead instead.".

++++Total Bound Items: = sum of total items marked as bound.


  • it should be mentioned that the date formats are formatted to DATE data type in Excel and if 06/06/09 is entered, only 6/6/09 will be displayed.
  • should there be a need to add columns in terms of helping technicians track their work more easily, these non-standard columns MUST be add after the last standard field (i.e. "Total Objects"). This ensures that the database queries will work properly when all the trackingfiles spreadsheets are merged.
  • if it isn't known who the scanning technician was, please use "??" in the appropriate column to indicate that this is unknown information.


Filenaming Convention

Individual TrackingFiles spreadsheets will be named as follows: collection_number.log.xlsx

ex: u0001_0000001.log.xlsx


From individual TrackingFiles files to Database

Background

With numerous trackingfiles spreadsheets, the "traffic jam" problem is virtually eliminated. And the standardization of the spreadsheets makes it possible to query the data for reporting purposes, but there is no simply way to query all the data unless it exists as one table.

For this reason, a search began for a free, pre-exisitng Visual Basic Macro that would allow one of the following methods:

1. A Microsoft Excel macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular, temporary master spreadsheet (master). This master spreadsheet could then be linked by a desktop database application (Microsoft Access of OpenOffice Base) for purposing of running common queries. Note that the master spreadsheet would be temporary in the sense that it needs to be made every time there is a need for current queries/reports. While it's possible to have the master spreadsheet have a "live" connection, continually being fed updated from the children spreadsheets, this did not seem like a good idea given that we may decided to move older, languishing children spreadsheets out of the main folder, thereby potentially breaking the live connection.

2. A Microsoft Access macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular table within Access.

We were not able to find a macro that satisfied method #2, but we found a script that installs as an Excel Add-In called RDBMerge, written by Ron de Bruin. The corresponding source code is here, with explanations here.

As an Add-in RDBMerge provides a GUI interface, is simple to use, and allows us to employ method #1. The sequence for using this tool and working with the data in a database is as follows:

  an OpenOffice equivalent seems to be available from http://www.sobolsoft.com/calcjoinfile/, but we have not tested it.


Workflow

Ensure all trackingfiles spreadsheets are in a single folder (S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\). No spreadsheet of any other type may exist in this folder.

1. At the end of every month do the following:

2. Open Excel.

3. Open a new, blank spreadsheet.

4. Run RDBMerge to combine all the spreadsheets within (\\Libfs1\share\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files).

Use the following RDBMerge settings:


Rdbmerge.PNG


  • Notice the setting for "Which Range". By placing A2 in the box the program will merge all cells in all spreadsheets without the header rows. We'll have to paste the header row into the merged spreadsheet in the latter steps.

5. RDBMerge will have created a new workbook with 2 worksheets: "Combine Sheet" and "Log Sheet".

6. Go ahead and check "Log Sheet" to make sure there were no errors.

  • It is OK if it reports an errors such as below because the "~" indicates a temporary file and not one of our real spreadsheets.

\tracking_files\~$tracking_files_template1.xlsx Error

7. Now open any one of our standard TrackingFiles workbooks and copy/paste the header row into "Combine Sheet".

8. Save "Combine Sheet" as trackingfiles_merge.xlsx to the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

  • A database file called "trackingfiles.accdb" will exist in the same folder that will know to automatically connect to a file called "trackingfiles_merge.xlsx". It is therefore imperative to name the file correctly in step #7.

9. Close Excel.

10. Open Access and create queries or run pre-existing ones for reporting purposes.


Setting up the Access Database : shortcut

1. Open "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

2. Delete any pre-existing tables.

3. Choose External Data>Saved Imports

4. Run the saved import called "Import-trackingfiles_merge".

  This will automatically import "trackingfiles_merge.xlsx" from the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.
  This will automatically format the numerically-based columns (i.e. Total Scans, etc.) as integers so that the SUM functions will work correctly.


Setting up the Access Database : from scratch

The database file "trackingfiles.accdb" should need little to no alternation once set up (and the following instructions are cursory), but in the event that the file is lost simply do the following to make a new one:

1. Open Access and create a new database called "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.

2. Delete any pre-existing tables.

3. Choose External Data>Excel

4. Select "trackingfiles_merge.xlsx" and choose "Import Source Data into New Table" as the import option.

5. Click "Next" until Access asks what to name the table. Call it "data".

  • as you import the Excel file make sure that the box is checked which specifies that the first row is a header row
  • also you must make sure Number of Scans, Binding Status, Number of Transcript Scans, OCR?, Number of Transcript Text Files, Total Scans, Total Objects, Total Bound Items, etc. are imported as Integers. Otherwise you cannot tally the SUMs of these columns. Let Access determine the type for the other columns.
  • let Access add a key (i.e. an autonumber column)

Setting up the Access Database : useful queries

Here are some useful queries in SQL view. They can be pasted into Access to be re-created.


Total Scans by User Defined Date Range:

SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status] FROM data GROUP BY data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status], data.[Collection Notes], data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects] HAVING (((data.[Scan Date]) Between [Enter Start Date as MM/DD/YYYY, but do not precede single digits with Zero] And [Enter End Date as MM/DD/YYYY, but do not precede single digits with Zero]));


Item Level Info by User Defined Collection Number:

SELECT data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects], data.[Total Bound Items] FROM data WHERE (((data.[Collection Number])=[enter a collection number]));


Total Scans by User Defined Month and Year:

SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scan Notes], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Total Bound Items] FROM data WHERE (((data.[Scan Date]) Like [enter month as # for Jan-Sep or ## for Oct-Dec]+"*"+[enter year as ####]));

VBA : batch updates to Tracking Files

All tracking files can be altered in batch via Microsoft's Visual Basic for Applications.

By using the template script here, one can simply embed the resultant coding from a Macro record session. Some tweaking of the template code is required in order to point the script at our Tracking Files folder and to search for our "*.log.xls*" files.

Note the snippet below shows a line break that isn’t present in the blogger’s post, perhaps due to their blogging software’s formatting.

   ‘will start LOOP until all files in folder sPath have been looped through
   Set oWbk = Workbooks.Open(sPath & "\" & sFil)

It appears that VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be ‘unprotected’ for the code to work, but it’s quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you’ve created.

  • We are using a suffix to identify shared files … "filename.share.xlsx". Thus, sharing can be turned off manually or a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Trying to turn sharing off for a non-shared Excel file will break the script.
Personal tools