Metadata reports

From UA Libraries Digital Services Planning and Documentation
Revision as of 10:10, 11 March 2015 by Jlderidder (talk | contribs) (Subjects)

From time to time, it helps to generate reports on the status of content in Acumen. These reports are also the source for any visualization of Acumen content, so they must be regenerated before such displays can be updated.

  • The automated reports cover new collections and the persistent URLs assigned to them; they are delivered via email (the email addresses must be kept up to date in the scripts).
  • The other reports are run on demand, as needed for various purposes. Their output must be copied or moved to a location where it can be widely accessed, such as a directory in S:\Public\DigitalServices\ContentAnalysis via /cifs-mount12/.
  • Monthly count reports are semi-automated. For more information on those, see Monthly Count.

Automated (Metadata-related) Reports

  • acumenMonthlyReport in /srv/scripts/stats/acumenMonthlyReport -- creates a tab-delimited spreadsheet of collections in Acumen, including collection number, title, description, number of items, number of files, and whether there is an EAD; it also totals the columns
  • newCollPurls in /srv/scripts/purls/newCollPurls -- provides a list of new collections (since the previous month) with persistent URLs (PURLs), title, description, identifier, and real URL, in both text and XML, according to who asked for what (archivists wanted text, metadata librarians wanted XML)
  • allCollPurls in /srv/scripts/purls/allCollPurls -- provides the same information, but for all the collections in Acumen
  • checkEmbargo in /srv/scripts/etds/checkEmbargo -- sends, a week in advance, both text and XML versions of information about ETDs to be released to the web on the first of the next month
  • eadModsTester in /srv/scripts/eads/eadModsTester -- provides a summary of the results of testing EADs (new, remediated, or which have newly digitized content) for the ability to link in digitized content
  • cleanUpLinkingOutput in /srv/scripts/eads/cleanUpLinkingOutput -- provides warnings about changes in EAD linkability compared to previous month(s)
  • eadsLostLinks4 in /srv/scripts/eadslostLinks4 -- provides details about items that are no longer linked in the EADs, including the item numbers and the boxes and folders to which they had been assigned, and the reference numbers
  • acumenFacets in /srv/scripts/metadata/faceting/acumenFacets -- provides error messages about the files for which it could not generate facets due to problems in the metadata entries

Genres and more

Within /srv/scripts/metadata/genreStuff are a number of scripts which generate different versions of reports about genre values in the MODS in Acumen.

For Genre reports:

  • genreReportCollCt -- reports on genre values regardless of authority, listing collection numbers and the number of items in each collection; includes ALL genre fields
  • genreSpacesCollCt -- reports on genre values that have spaces before them, listing collection numbers and the number of items in each collection; includes ALL genre fields
  • multGenres -- finds all items that have more than one <genre> tag and lists, for each, the number of genres, the item number, and the genre values
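
As a rough illustration of what multGenres and genreSpacesCollCt check for, here is a minimal Python sketch. It is not the actual script: the function names, the in-memory dictionary of records, and the item identifiers are all hypothetical, and it assumes the MODS records use the standard MODS v3 namespace.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace; assumed to be what the Acumen MODS files declare.
MODS_NS = "{http://www.loc.gov/mods/v3}"


def genre_values(mods_xml: str) -> list[str]:
    """Return the raw text of every <genre> element in one MODS record."""
    root = ET.fromstring(mods_xml)
    return [(el.text or "") for el in root.iter(MODS_NS + "genre")]


def has_leading_space(value: str) -> bool:
    """True when a genre value starts with whitespace."""
    return value != value.lstrip()


def multiple_genre_report(records: dict[str, str]) -> list[tuple[str, int, list[str]]]:
    """List (item id, genre count, genre values) for items with more than one genre.

    `records` maps a hypothetical item identifier to its MODS XML text.
    """
    report = []
    for item_id, mods_xml in sorted(records.items()):
        genres = genre_values(mods_xml)
        if len(genres) > 1:
            report.append((item_id, len(genres), genres))
    return report


if __name__ == "__main__":
    sample = {
        "item_a": '<mods xmlns="http://www.loc.gov/mods/v3">'
                  "<genre>letters</genre><genre> photographs</genre></mods>",
        "item_b": '<mods xmlns="http://www.loc.gov/mods/v3">'
                  "<genre>letters</genre></mods>",
    }
    for item_id, count, genres in multiple_genre_report(sample):
        print(item_id, count, genres, sep="\t")
```

The real scripts read MODS files from Acumen and group results by collection; the sketch only shows the per-item test.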

For displays:

  • genreAndDate -- collects the genre value and key date of all items in Acumen that have one of the top genre values identified in genreLabelMap (which categorizes the genres in highest use); outputs genreDates (label, date, number of items in that year) and also nokeydates (which lists, for each label, the number of items that have no key date, and their item numbers)
  • genreAndYear -- an improvement on genreAndDate, to output by year instead of by date
  • genreAndDecade -- the above, modified to output genreDecade (count of items per label per decade) instead of by date or year
  • topicAndDecade -- a rewrite of the above to work with grouped topics by decade (outputs topicDecade). Depends on topicLabelMap which lists the topic in the first column, and the grouping, or category, in the second. topicLabelMap was created by hand in Excel and exported as tab delimited.
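
The decade counts can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in genreStuff: it assumes a label map laid out like topicLabelMap (value in the first tab-delimited column, grouping in the second) and key dates that begin with a four-digit year. The function names and the sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def load_label_map(text: str) -> dict[str, str]:
    """Parse tab-delimited lines of the form value<TAB>label."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        value, label = line.split("\t", 1)
        mapping[value.strip()] = label.strip()
    return mapping


def decade_of(key_date: str) -> Optional[int]:
    """Return the first year of the decade (e.g. 1860), or None if no usable year."""
    year = key_date.strip()[:4]
    if len(year) == 4 and year.isascii() and year.isdigit():
        return int(year) // 10 * 10
    return None


def count_by_decade(items: Iterable[tuple[str, str]],
                    label_map: dict[str, str]) -> Counter:
    """Count items per (label, decade); items are (value, key date) pairs."""
    counts = Counter()
    for value, key_date in items:
        label = label_map.get(value)
        if label is None:
            continue  # value is not in the label map
        decade = decade_of(key_date or "")
        if decade is None:
            continue  # no key date; genreAndDate reports these separately in nokeydates
        counts[(label, decade)] += 1
    return counts


if __name__ == "__main__":
    label_map = load_label_map("Cotton\tAgriculture\nFarms\tAgriculture\n")
    items = [("Cotton", "1861-04-12"), ("Farms", "1865"), ("Cotton", "1872")]
    for (label, decade), n in sorted(count_by_decade(items, label_map).items()):
        print(label, decade, n, sep="\t")
```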

For Building A Browse of Correspondence:

  • getCorrespondence -- finds all MODS in Acumen that have a genre value of "correspondence" and copies them to a local directory for analysis
  • getMore -- goes through the collected correspondence MODS and pulls out information about correspondent, recipient, locations of each, date created and topics -- creating a spreadsheet called correspondence.xls
  • getMoreSplit -- does the same, but splits these into multiple spreadsheets; those with sender and recipient locations plus date; those with only sender location and date; those with only recipient location and date; those with both locations but no date; those with both names and a date; and all others (for further analysis)
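
The split performed by getMoreSplit can be pictured as sorting each record into one of six buckets. The sketch below is a hypothetical reconstruction from the description above: the `Letter` fields, the bucket names, and the order in which the conditions are tested are assumptions, and the real script writes spreadsheets rather than returning a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Letter:
    """Hypothetical record holding the fields pulled from one correspondence MODS."""
    sender: str = ""
    recipient: str = ""
    sender_location: str = ""
    recipient_location: str = ""
    date: str = ""


def bucket_for(letter: Letter) -> str:
    """Pick the output bucket for one letter (bucket names are illustrative)."""
    has_from = bool(letter.sender_location)
    has_to = bool(letter.recipient_location)
    has_date = bool(letter.date)
    if has_from and has_to and has_date:
        return "both_locations_and_date"
    if has_from and not has_to and has_date:
        return "sender_location_and_date"
    if has_to and not has_from and has_date:
        return "recipient_location_and_date"
    if has_from and has_to:
        return "both_locations_no_date"
    if letter.sender and letter.recipient and has_date:
        return "names_and_date"
    return "other"  # everything else, kept for further analysis


def split_into_buckets(letters: list[Letter]) -> dict[str, list[Letter]]:
    """Group letters by bucket name."""
    buckets: dict[str, list[Letter]] = {}
    for letter in letters:
        buckets.setdefault(bucket_for(letter), []).append(letter)
    return buckets
```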


The subject report scripts are in /srv/scripts/metadata/subjectStuff/:

  • countAuths -- looks through Acumen for item-level MODS needing subjects; outputs collection number, item number, and whether the item has no subjects at all or no LCSH subjects
  • getSubjects -- pulls all existing subjects out of item-level MODS, dedupes, counts how many times each is used, and orders alphabetically; generates the alphaSubjects report
  • getSubjectsSplit -- does the same as above, but splits the output into lcshSubs, nonLCSHsubs, an analysis called breakdown (like what countAuths gives, but only a count by collection), and a listToFix (exactly what countAuths gives: a list of items)
  • getSubjectsSplitTypes -- provides a breakout of how many geographic, temporal, title, genre, and occupation subjects there are, what they are, and how often each is used
  • getSubjectsSplitTypesColls -- does the same as above and collects collection identifiers as well
  • splitSubjects -- this one splits compound LCSH subjects into separate subjects, combines with all other subjects, dedupes, orders alphabetically, counts the number of times each is used, and outputs alphaSeparateSubjects report.
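
The split, dedupe, count, and sort steps of splitSubjects might look like the sketch below. It is a simplified illustration, not the actual script: it assumes compound LCSH subjects arrive as single strings whose parts are joined by "--", and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def split_compound(subject: str) -> list[str]:
    """Split a compound subject on "--" and drop empty parts."""
    parts = [part.strip() for part in subject.split("--")]
    return [part for part in parts if part]


def alpha_subject_counts(subjects: Iterable[str]) -> list[tuple[str, int]]:
    """Return (subject, times used) pairs, deduped and in alphabetical order."""
    counts = Counter()
    for subject in subjects:
        for part in split_compound(subject):
            counts[part] += 1
    # Case-insensitive sort, with the original string as a tie-breaker.
    return sorted(counts.items(), key=lambda kv: (kv[0].lower(), kv[0]))


if __name__ == "__main__":
    sample = [
        "Alabama--History--Civil War, 1861-1865",
        "Alabama--Politics and government",
        "Cotton trade",
    ]
    for subject, n in alpha_subject_counts(sample):
        print(subject, n, sep="\t")
```

Splitting on "--" treats every subdivision as a subject in its own right, which is what lets "Alabama" be counted once per heading it appears in.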