Since approximately the middle of 2008, an Excel spreadsheet named Trackingfiles.xlsx has been used as a central data source for digitization and scanning activities - i.e. data not found in the Descriptive Metadata sheets or other metadata sources. Typical data recorded includes, but is not limited to, the object name, date of scan, and scanning technician.
When a collection (or part-of) is ready for storage and online presentation, a tab delimited text export of this spreadsheet will be placed into the collection Admin folder as:
Use Excel to make this text file. Potential problems exist if using OpenOffice in that some of the formula values may not get converted correctly. If diacritics exist, see Diacritics prior to export.
With an increase in the number of student workers, the limitations of using a singular file, consisting of multiple and non-standardized worksheets (one for each collection) began to manifest themselves.
1. "traffic jams" : students scanning/working simultaneously needed to coordinate with each other their collective real-time use of the spreadsheet given that when one user had the spreadsheet open on their desktop, no one else could edit the spreadsheet.
2. The spreadsheet provided a source for report-worthy data - scanning activity, total scans by date, etc. However, its non-standardized design precluded batch information harvesting via scripts and/or Excel formulas.
In the summer of 2009, it was decided to consider two changes that could eliminate the aforementioned problems:
1. Separating the worksheets into separate workbooks (files) and contain them all in one centralized folder within S:\Digital Projects\Organization\Digital Program Logs. This would eliminate the "traffic jams" as users would no longer vie for access to the same file. The exception would be the instance when two or more students were scanning/working from the same box/collection. Still, this would be marked decrease from the previous situation.
2. Standardizing the data fields. Standardizing the spreadsheet could allow for import into a master Excel workbook and/or database application in which reoccurring formulas and/or queries could be created and saved, allowing for "one stop shopping" in regard to the collection of report-worthy data (for monthly and yearly reports, etc.).
Standardization of TrackingFiles
Worksheets per Workbook
Each trackingfiles spreadsheet file will contain two worksheets:
1. "Log" : this is where all the fields mentioned in the next section get recorded.
DO NOT sort this sheet unless you know how to sort ONLY the scanning log data. That is to say, DO NOT sort column A nor any of the columns with formulas that tally scan totals, etc.
2. "QC" : this is the place where Quality Control Notes can go as the collections is being checked for quality.
- As none of the QC information will get exported to a tab delimited text file when the collection is ready for storage, any important information from this worksheet needs to be copied into the "Collection Notes" cloumn of the "Log" sheet, mentioned below.
3. "scanner_types" : this sheet exists only to provide the list if scanner types so that the drop down list for the field "Scanner Type" in the Log sheet can function. If scanning devises are every discarded or new ones purchased, this sheet needs to be updated prior to the creation of a tracking files file for a new digital collection.
Current scanner types:
- audio (Fostex)
- batch scanner (HP)
- bookdrive (Atiz)
- film strip (Nikon)
- flatbed (Epson)
- overhead (Phase One)
- scanback (Better Light)
avoid using "other" at all costs, only use this for old collections for which discarded equipment was likely used. with the sheet "unprotected" it's possible to choose "other" and search/replace it with a customized scanner type should the need to do so arise.
The template for the TrackingFiles Excel files is listed below in the format:
Field Title: brief description of the Field.
Fields marked with + are mandatory.
Fields marked ++ are only mandatory if a Transcript Scan/s and/or Transcript Text file/s exists/exist.
Fields marked +++ are only mandatory for audio collections.
Fields marked ++++ are derived from formulas and require no human entry.
Collection Notes: general notes field regarding any pertinent information regarding the whole collection, or parts thereof. May also include any QC data that should be retained. Does not correlate with other columns, so that General Notes in cell A2 do not necessarily pertain only to B2, C2, etc. In certain circumstances such as information regarding what items were not scanned for a given collection, this information should *also* be included in the "Notes" fields in the Descriptive Metadata, though at the item level for the latter (Descriptive Metadata). This is per Jody 7/2009.
+Object Filename: name of object (does not include file extension).
+Number of Scans: number of scans per item. i.e. 5 for a five page letter.
+Scanning Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one object, separate initials by SEMICOLON+SPACE. ex: NA; AAP
as the data validation will force you to enter only 2-3 letters, you may enter "ABC" and then search/replace that with "NA; AAP". The sheet must not be protected in order to do this.
+Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.
+Scanner Type: drop down list of scanner that scans were made on. use this Excel formula "=scanner_types!A2:A9".
+Binding Status: used to indicate if item is bound or not for ARL stats. 0=unbound, 1=bound. Leaving the cell blank will also equate to being unbound, but entering "0" is the preferred entry.
Scan Notes: any textual information necessary regarding the scanning of the item or item itself as it relates to digitization. i.e. This is not the place for descriptive metadata.
++Number of Transcript Scans: number of scans per item. i.e. 5 for a five page transcript.
++Transcript Scan Technician: 2 or 3 letter initials of employee responsible for scanning. If more than one person has assisted in the complete scanning of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP
++Transcript Scan Date: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.
Transcript Scan Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.
+OCR?: used to indicate if item of part of item (50% or more) is an OCR candidate. 0=don't OCR, 1=OCR. Leaving the cell blank equates to a "0", but entering "0" is the preferred entry. This DOES NOT relate to transcripts found in the box next to the item itself. THIS IS FOR COLLECTIONS WHICH CONTAIN SOME (but not all) TYPEWRITTEN DOCUMENTS. When batching out (or completing the collection), hide the columns between the item identifier and the OCR column. Copy and paste these 2 columns (representing what you digitized) into a separate spreadsheet. Export as tab-delimited text, named collnum.ocrList.txt (example: u0003_0001577.ocrList.txt) and place in the Admin folder for upload. Jody will OCR the items that you've marked a "1" for after upload to the server.
IF HOWEVER THE ENTIRE COLLECTION IS TYPEWRITTEN: During makeJpegs, the script will ask if you want to OCR the entire collection (if it finds no ocrList.txt). Say "yes", and the script will OCR all the TIFFS in the collection. Other options offered are to OCR everything ELSE in the collection (for example, you're on batch 3 before you realize batches 1 and 2 should have been OCR'd, and they're already online; or to OCR the entire collection (what you're uploading as well as anything already online). The ocrList simply allows a more granular control over what is OCR'd.
If you find an OCRable transcript in the box alongside the item: Capture the tiffs, place in a Transcript directory, and let Jody know; she'll run a script over them and create OCR before the upload. Then make sure everything matches along with the content in the primary item (same words on same pages) and not the transcript since we won’t be keeping scans of transcripts in most cases. If we suspect that it isn’t doing all this work for a set of transcripts we should consult with an Archivist and ask them if they think the item is super-special in which case we would go ahead and do all the OCR/clean up work if they say the item is special.
Note there are 3 possibilities regarding OCRing transcripts: either A) we only scan them and then raw OCR (ocr.txt) these transcript scans or B) we scan them and make remediated .txt files for the transcripts, or C) we scan them and make remediated .txt files for the transcripts and preserve the transcript scans if they are themselves "important". We do whatever we need to with the scans (duplicate and rename, etc) to make them match with the layout of the primary item. This last option above is VERY RARELY going to occur. Generally it will be the second option, though the first may be quite common with audio transcripts. Of course, if we have good OCR of the item itself, we aren't interested in OCRing the transcripts themselves at all (unless they are really super-special.) For information on how to OCR existing transcripts and creating cleaned up .txt files see: http://www.lib.ua.edu/wiki/digcoll/index.php/Transcripts
++ Number of Transcript Text Files: number of scans per item. i.e. 5 for a five page transcript.
++Transcript Text File Technician: 2 or 3 letter initials of employee responsible for OCR work, etc. If more than one person has assisted in the completion of one transcription, separate initials by SEMICOLON+SPACE. ex: NA; AAP
++Transcript Text File Dates: date of scan entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.
Transcript Text File Notes: any textual information necessary regarding the scanning of the transcript or transcript itself as it relates to digitization. i.e. This is not the place for descriptive metadata.
+++Audio QC Technician: 2 or 3 letter initials of employee responsible for audio quality control. If more than one person is responsible, separate initials by SEMICOLON+SPACE. ex: NA; AAP
+++Audio QC Date: date of audio QC entered in MM/DD/YYYY format. If more than one person has assisted in the complete scanning of one object, use the most recent date.
+++Number of Tracks: the number of timecoded tracks (i.e. defined intellectual items) within the archival audio files. If a single MP3 derivative contains 5 intellectual items, then the number of tracks is 5 even though there is only one MP3 file.
++++Collection Number: = the first 13 characters of the object filename. function: =LEFT(B:B,13)
++++Total Scans (Transcripts, too): = SUM of (Number of Scans + Number of Transcript Scans).
++++Total Objects: = total cells in "Number of Scans" column that have at least 1 listed for the number of scans. So if there's a row for a yet-unscanned object, it won't not count toward the Total Objects until an actual number of scans is recorded for the item. function: COUNTIF(C:C,">=1")
Formerly this formula was based on the total non-blank rows in spreadsheet minus 1 (the header row). function: =COUNTIF(B:B,"*")-1. The was replaced by the newer formula when we discovered people were making rows for objects and writing in the "Scans Notes" column phrases like "Not scanned, needs to be scanned on overhead instead.".
++++Total Bound Items: = sum of total items marked as bound.
- it should be mentioned that the date formats are formatted to DATE data type in Excel and if 06/06/09 is entered, only 6/6/09 will be displayed.
- should there be a need to add columns in terms of helping technicians track their work more easily, these non-standard columns MUST be add after the last standard field (i.e. "Total Objects"). This ensures that the database queries will work properly when all the trackingfiles spreadsheets are merged.
- if it isn't known who the scanning technician was, please use "??" in the appropriate column to indicate that this is unknown information.
Individual TrackingFiles spreadsheets will be named as follows: collection_number.log.xlsx
From individual TrackingFiles files to Database
With numerous trackingfiles spreadsheets, the "traffic jam" problem is virtually eliminated. And the standardization of the spreadsheets makes it possible to query the data for reporting purposes, but there is no simply way to query all the data unless it exists as one table.
For this reason, a search began for a free, pre-exisitng Visual Basic Macro that would allow one of the following methods:
1. A Microsoft Excel macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular, temporary master spreadsheet (master). This master spreadsheet could then be linked by a desktop database application (Microsoft Access of OpenOffice Base) for purposing of running common queries. Note that the master spreadsheet would be temporary in the sense that it needs to be made every time there is a need for current queries/reports. While it's possible to have the master spreadsheet have a "live" connection, continually being fed updated from the children spreadsheets, this did not seem like a good idea given that we may decided to move older, languishing children spreadsheets out of the main folder, thereby potentially breaking the live connection.
2. A Microsoft Access macro that combined all the trackingfiles spreadsheets (children) within the one trackingfiles directory into a singular table within Access.
We were not able to find a macro that satisfied method #2, but we found a script that installs as an Excel Add-In called RDBMerge, written by Ron de Bruin. The corresponding source code is here, with explanations here.
As an Add-in RDBMerge provides a GUI interface, is simple to use, and allows us to employ method #1. The sequence for using this tool and working with the data in a database is as follows:
an OpenOffice equivalent seems to be available from http://www.sobolsoft.com/calcjoinfile/, but we have not tested it.
Ensure all trackingfiles spreadsheets are in a single folder (S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\). No spreadsheet of any other type may exist in this folder.
1. At the end of every month do the following:
2. Open Excel.
3. Open a new, blank spreadsheet.
4. Instruct all employees to close and open instances of a trackingFiles Excel file.
5. Run RDBMerge to combine all the spreadsheets within (\\Libfs1\share\Digital Projects\Organization\Digital Program Logs\TrackingFiles\).
Use the following RDBMerge settings:
- Notice the setting for "Which Range". By placing A2 in the box the program will merge all cells in all spreadsheets without the header rows. We'll have to paste the header row into the merged spreadsheet in the latter steps.
6. RDBMerge will have created a new workbook with 2 worksheets: "Combine Sheet" and "Log Sheet".
7. Go ahead and check "Log Sheet" to make sure there were no errors.
- It is OK if it reports an errors such as below because the "~" indicates a temporary file and not one of our real spreadsheets.
8. Now open any one of our standard TrackingFiles workbooks and copy/paste the header row into "Combine Sheet".
9. Save "Combine Sheet" as trackingfiles_merge.xlsx to the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.
- A database file called "trackingfiles.accdb" will exist in the same folder that will know to automatically connect to a file called "trackingfiles_merge.xlsx". It is therefore imperative to name the file correctly in step #7.
10. Close Excel.
11. Open Access and create queries or run pre-existing ones for reporting purposes.
Setting up the Access Database : shortcut
1. Open "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.
2. Delete any pre-existing tables.
3. Choose External Data>Saved Imports
4. Run the saved import called "Import-trackingfiles_merge".
This will automatically import "trackingfiles_merge.xlsx" from the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory. This will automatically format the numerically-based columns (i.e. Total Scans, etc.) as integers so that the SUM functions will work correctly.
Setting up the Access Database : from scratch
The database file "trackingfiles.accdb" should need little to no alternation once set up (and the following instructions are cursory), but in the event that the file is lost simply do the following to make a new one:
1. Open Access and create a new database called "trackingfiles.accdb" in the S:\Digital Projects\Organization\Digital Program Logs\TrackingFiles\TrackingFiles_database_files directory.
2. Delete any pre-existing tables.
3. Choose External Data>Excel
4. Select "trackingfiles_merge.xlsx" and choose "Import Source Data into New Table" as the import option.
5. Click "Next" until Access asks what to name the table. Call it "data".
- as you import the Excel file make sure that the box is checked which specifies that the first row is a header row
- also you must make sure Number of Scans, Binding Status, Number of Transcript Scans, OCR?, Number of Transcript Text Files, Total Scans, Total Objects, Total Bound Items, etc. are imported as Integers. Otherwise you cannot tally the SUMs of these columns. Let Access determine the type for the other columns.
- let Access add a key (i.e. an autonumber column)
Setting up the Access Database : useful queries
Here are some useful queries in SQL view. They can be pasted into Access to be re-created (use the SQL view).
Total Scans by User Defined Date Range:
SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status] FROM data GROUP BY data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Binding Status], data.[Collection Notes], data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects] HAVING (((data.[Scan Date]) Between [Enter Start Date as MM/DD/YYYY, but do not precede single digits with Zero] And [Enter End Date as MM/DD/YYYY, but do not precede single digits with Zero]));
Item Level Info by User Defined Collection Number:
SELECT data.[Collection Number], data.[Total Scans (Scans + Transcript Scans)], data.[Total Objects], data.[Total Bound Items] FROM data WHERE (((data.[Collection Number])=[enter a collection number]));
Total Scans by User Defined Month and Year:
SELECT data.[Object Filename], data.[Number of Scans], data.[Scanning Technician], data.[Scan Date], data.[Scan Notes], data.[Scanner Type], data.[Number of Transcript Scans], data.[Transcript Scan Technician], data.[Transcript Scan Date], data.[Transcript Scan Notes], data.[Total Bound Items] FROM data WHERE (((data.[Scan Date]) Like [enter month as # for Jan-Sep or ## for Oct-Dec]+"*"+[enter year as ####]));
If you are interested in learning SQL, the W3C Schools SQL tutorial is a good place to start.
The Tizag site has a small introduction to Access that might be helpful, too.
VBA : batch updates to Tracking Files
All tracking files can be altered in batch via Microsoft's Visual Basic for Applications.
By using the template script here, one can simply embed the resultant coding from a Macro record session. Some tweaking of the template code is required in order to point the script at our Tracking Files folder and to search for our "*.log.xls*" files.
Note the snippet below shows a line break that isn’t present in the blogger’s post, perhaps due to their blogging software’s formatting.
‘will start LOOP until all files in folder sPath have been looped through
Set oWbk = Workbooks.Open(sPath & "\" & sFil)
It appears that VBA scripts break on Excel files for which sharing is turned on (simultaneous multi-user editing). Those files need to be identified* and temporarily isolated into another folder or the sharing needs to temporarily be turned off. Also, sheets need to be ‘unprotected’ for the code to work, but it’s quite simple to add a line in your macro that unprotects the sheet prior to performing whatever macro you’ve created.
- We are using a suffix to identify shared files … "filename.share.xlsx". Thus, sharing can be turned off manually or a VBA script could be run to turn off sharing for all shared files (*.share.xls*). Trying to turn sharing off for a non-shared Excel file will break the script.
Here's an example of a script that edits the "QC" sheet in trackingFiles to add some columns.
Note, this script must not be run as the changes have already been implemented. This is just an example.
' Disclaimer: ' DO NOT use this macro. It has already been used. It is just saved as an example. ' Notice how the only subroutine is "Sub Open_All_Files()" and the last line, "End Sub", applies to this subroutine ' DO NOT use subroutines elsehwere. Just combine all the recorded macro components under the umbrella of the one subroutine. ' ' Description: ' this VBA script will open the "QC" sheet in the given trackingFiles folder (see "sPath" variable below) ' then it will add columns to the QC sheet and close the file for each Excel file ' ' Sub Open_All_Files() Dim oWbk As Workbook Dim sFil As String Dim sPath As String ' sPath = "C:\Documents and Settings\digitalav\Desktop\QC_update\" 'location of files ChDir sPath sFil = Dir("*.xls*") 'change or add formats Do While sFil <> "" 'will start LOOP until all files in folder sPath have been looped through Set oWbk = Workbooks.Open(sPath & "\" & sFil) 'opens the file ' ' paste recorded MACRO below ' ' Opens the "QC" sheet inside the trackingFiles spreadsheet Sheets("QC").Select Sheets("QC").Name = "QC" Sheets("QC").Select ' Inserts the Columns and adds Data Validation to some columns Columns("B:B").Select Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove Selection.Insert Shift:=xlToRight, CopyOrigin:=xlFormatFromLeftOrAbove Range("B1").Select ActiveCell.FormulaR1C1 = "Ready to go?" Columns("B:B").Select With Selection.Validation .Delete .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _ Operator:=xlBetween, Formula1:="1", Formula2:="1" .IgnoreBlank = True .InCellDropdown = True .InputTitle = "Ready to go?" .ErrorTitle = "" .InputMessage = "Enter ""y"" or ""n""." .ErrorMessage = "" .ShowInput = True .ShowError = True End With Range("C1").Select ActiveCell.FormulaR1C1 = "Intials (last QC Technician)" Columns("C:C").Select With Selection.Validation .Delete .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _ Operator:=xlBetween, Formula1:="2", Formula2:="3" .IgnoreBlank = True .InCellDropdown = True .InputTitle = "" .ErrorTitle = "" .InputMessage = "" .ErrorMessage = "" .ShowInput = True .ShowError = True End With With Selection.Validation .Delete .Add Type:=xlValidateTextLength, AlertStyle:=xlValidAlertStop, _ Operator:=xlBetween, Formula1:="2", Formula2:="3" .IgnoreBlank = True .InCellDropdown = True .InputTitle = "Initials" .ErrorTitle = "" .InputMessage = "Enter your 2 or 3 letter initials." .ErrorMessage = "" .ShowInput = True .ShowError = True End With Range("D1").Select ActiveCell.FormulaR1C1 = "Date of last QC activity" Columns("D:D").Select With Selection.Validation .Delete .Add Type:=xlValidateDate, AlertStyle:=xlValidAlertStop, Operator:= _ xlBetween, Formula1:="1/1/2009", Formula2:="1/1/2121" .IgnoreBlank = True .InCellDropdown = True .InputTitle = "Date" .ErrorTitle = "" .InputMessage = "Enter date in MM/DD/YYYY format." .ErrorMessage = "" .ShowInput = True .ShowError = True End With Range("B1:D1").Select With Selection.Validation .Delete .Add Type:=xlValidateInputOnly, AlertStyle:=xlValidAlertStop, Operator _ :=xlBetween .IgnoreBlank = True .InCellDropdown = True .InputTitle = "" .ErrorTitle = "" .InputMessage = "" .ErrorMessage = "" .ShowInput = True .ShowError = True End With Range("A2").Select Columns("C:C").EntireColumn.AutoFit Range("A2").Select ' ' DO NOT paste recorded macro elements below ' oWbk.Close True 'close the workbook, saving changes sFil = Dir Loop ' End of LOOP End Sub
Once you have the VBA code in hand, do the following to run it:
- BACKUP the tracking files (just in case something goes wrong ...)
- create a new Excel file
- record a macro and immediately stop recording it
- from the macros list, EDIT the macro you just made
- paste in your VBA code (overwrite all pre-existing code in the macro you just recorded)
- save the Excel file as a macro-enabled file
- run the macro
- delete the macro-enabled Excel file after you've verified everything worked OK
- We have occasionally run into problems with Excel files with the "share" feature turned on. That is to say, sometime access issues arise in terms of being able to open these files from certain computers under certain user profiles. A fix to this problem seems to be creating a new "tracking files" spreadsheet and pasting the data from the corrupt one into the new file, which replaces the old/corrupt one.
- An alternative to this is to first attempt to unshare and reshare the Excel file on a computer or profile with administrative privileges. If unsuccessful, re-creating the file as mentioned above may be the best solution.
- There also might be times when the data validation (that which forces integers to be entered in certain columns, etc.) seems to "go bad". To remedy this, one can copy a cell with the desired and functional validation and then select the column into which they wish to insert this validation. By choosing "Paste>Paste Special>Paste validation" one can paste in only the validation across the entire column without overwriting the cells' data values.
- Excel keeps an open temp file on a shared document even after it has closed the document. this temp file is open untill excel is quit. this can cause problem when several people have had a shared file open during a day. if the shared permissions get confused and the sheet can not be opened or only opened in read only mode. ALL of the instances of excel must be closed so that there is no instance of excel on any computer in the lab that has had the problem sheet open still running. then the sheet can be reopened on a windows machine first, then on any number of mac or windows machines that is required, opening the shared sheet on a mac first will not allow the sharing process to work.
- If the shared sheet becomes corrupted and can not be opened by even a single user in any mode other then read only, after doing the previous step. making a copy of the file by opening it as read only and doing a "save as" and then working from the copy until the windows server allows you to delete the corrupted file has been the only solution we have found for that problem