Harvesting Tags

From UA Libraries Digital Services Planning and Documentation
Revision as of 15:20, 3 October 2012 by Jlderidder (talk | contribs) (During the rotation in the tagging software:)


During the rotation in the tagging software:

Use the `updateAcumenOnly` script in /srv/scripts/tagging/. The script expects one or more collection numbers on the command line, specifying which collection(s) to extract.

`updateAcumenOnly` extracts tags and image names from the steve_museum software, dedupes them, and checks for the tags XML file for each item in Acumen. If no XML file exists, the script creates one, inserting the tags and the number of times each tag was used this round. If an XML file already exists, the script checks whether it contains any of the new tags; if so, it updates the count for those tags, and it adds any tags not yet entered. This version does NOT log retrieval in the InfoTrack database.
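The update rule described above can be sketched in a few lines of Python. This is only an illustration, not the code of `updateAcumenOnly`: the function name `merge_tag_counts` and the plain-dictionary representation are hypothetical, and the sketch assumes that "updating the count" means adding this round's uses to the count already on file.

```python
from collections import Counter


def merge_tag_counts(existing, harvested):
    """Merge one round of harvested tags into the counts already on file.

    existing  -- dict mapping tag text to its recorded count (empty if the
                 item has no tags file yet)
    harvested -- list of tag strings applied to the item this round; a tag
                 appears once for each time it was applied
    Assumption: counts from the new round are ADDED to the stored counts.
    """
    # Collapse repeated tags into one entry per tag with a per-round count.
    round_counts = Counter(tag.strip() for tag in harvested if tag.strip())

    merged = dict(existing)  # leave the caller's dictionary untouched
    for tag, count in round_counts.items():
        # Known tags get their count updated; unseen tags are added.
        merged[tag] = merged.get(tag, 0) + count
    return merged
```

For example, merging the round `["barn", "horse", "horse"]` into existing counts `{"barn": 2}` yields `{"barn": 3, "horse": 2}`.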

Copies of the tag files are written to two locations. In /srv/deposits/crowdsourcing/tags/ the files are versioned for archiving. In the Digital_Program_files/Tags/ directory on the Share drive in Special Collections, where the archivists access them, older versions of the tag files are overwritten by new ones.

The software outputs a datestamped tab-delimited text file in the ./output directory detailing:

  1. Collnum: the collection number
  2. Total Items: total items in collection
  3. ItemsTagged: number of items that received tags
  4. NumTags: number of distinct tags obtained
  5. Date Captured: today's date
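As a rough illustration of how one row of this report could be computed, here is a minimal Python sketch. It is not the script's own code: the function name `summary_row`, the `item_tags` mapping, the ISO date format, and the collection number in the usage note are all assumptions made for the example. Only the five column names come from the list above.

```python
import datetime

# Column names as listed in the report description.
HEADER = ["Collnum", "Total Items", "ItemsTagged", "NumTags", "Date Captured"]


def summary_row(collnum, total_items, item_tags, captured):
    """Build one tab-delimited report row for a collection.

    item_tags -- dict mapping each item identifier to the list of tags
                 harvested for it (hypothetical structure)
    captured  -- datetime.date of the harvest
    """
    # Items that received at least one tag.
    items_tagged = sum(1 for tags in item_tags.values() if tags)
    # Distinct tags across the whole collection.
    distinct_tags = {tag for tags in item_tags.values() for tag in tags}
    fields = [
        collnum,
        str(total_items),
        str(items_tagged),
        str(len(distinct_tags)),
        captured.isoformat(),  # assumption: the real date format may differ
    ]
    return "\t".join(fields)
```

With a made-up collection number `c0001`, 10 total items, and tags `{"a": ["x", "y"], "b": ["x"], "c": []}` captured on 3 October 2012, the row contains `c0001`, `10`, `2`, `2`, and `2012-10-03` separated by tabs. A header line can be produced with `"\t".join(HEADER)`.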

When the rotation is complete:

Use the `getCollTags` script in /srv/scripts/tagging/. The script expects a collection name on the command line, specifying which collection to extract.

`getCollTags` does the same as `updateAcumenOnly`, but ALSO logs retrieval in the InfoTrack database. Both scripts add a recordModificationDate in a comment within the tags XML file.


Tags are stored in separate XML files utilizing this simple schema. In Acumen, these files are located in the Metadata folder at the item level, and they are named according to the format itemID.tags.xml, where itemID is the identifier of the item being described. The "confidenceLevel" is the number of times a specific tag has been applied to this item by users.
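The naming convention can be expressed as a small Python helper. This is a sketch only: the function name `tags_file_path` and the idea of passing the item-level directory as an argument are assumptions, and the directory and identifier in the usage note are placeholders rather than real Acumen values.

```python
from pathlib import PurePosixPath


def tags_file_path(item_dir, item_id):
    """Return the expected location of an item's tags file.

    item_dir -- the item-level directory in Acumen (hypothetical argument)
    item_id  -- the identifier of the item being described
    """
    # Tags files live in the Metadata folder and are named itemID.tags.xml.
    return PurePosixPath(item_dir) / "Metadata" / f"{item_id}.tags.xml"
```

For a placeholder item directory `/path/to/item` and identifier `item_0001`, the helper returns `/path/to/item/Metadata/item_0001.tags.xml`.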