Database Relations in Acumen

From UA Libraries Digital Services Planning and Documentation
(Difference between revisions)
Jump to: navigation, search
(asset_type table)
Line 79: Line 79:
  
 
As you can see, fields in some tables refer to fields in others.  Notably, the different types of IDs are used to pull information together.  That's why this is called a relational database.
 
As you can see, fields in some tables refer to fields in others.  Notably, the different types of IDs are used to pull information together.  That's why this is called a relational database.
 +
 +
 +
===authority===
 +
The authority table brings together ID values from the file table, the asset table, the authority_type table, and matches them to a value.
 +
So if you wanted a list of all titles, you could query: "select value from authority where authority_type_id = 1;"  since in authority_type, title is assigned ID number 1.  (Normally, though, we want to find a value for a particular file; for that, you'll need to look up the file_id in the file table.)
 +
 +
| Field            | Type      | Null | Key | Default | Extra          |
 +
+-------------------+------------+------+-----+---------+----------------+
 +
| id                | bigint(20) | NO  | PRI | NULL    | auto_increment | this is the authority_id used elsewhere
 +
| file_id          | bigint(20) | NO  | MUL | NULL    |                | this is the ID from the file table to specify which file exactly
 +
| asset_id          | bigint(20) | YES  | MUL | NULL    |                | this is the ID from the asset table to specify which asset exactly
 +
| authority_type_id | int(11)    | NO  | MUL | NULL    |                | this is the ID from the authority_type table below, for which field it is
 +
| value            | text      | NO  |    | NULL    |                | this is the actual text value of the field in question
 +
 +
So if I've already looked in the file table and found that the file_id for u0003_0000581_0000001.mods.xml is 150791, and I want to look at the title for that record (titles are authority_type_id of 1 from the authority_type table above, I can query:
 +
 +
MariaDB [acumen]> select * from authority where file_id=150791 and authority_type_id=1;
 +
+----------+---------+----------+-------------------+-------------------------------------------------------------+
 +
| id      | file_id | asset_id | authority_type_id | value                                                      |
 +
+----------+---------+----------+-------------------+-------------------------------------------------------------+
 +
| 23391915 |  150791 |      -1 |                1 | Letter from Henry L. Abbot to William C. Gorgas, 1904-02-11 |
 +
+----------+---------+----------+-------------------+-------------------------------------------------------------+
 +
 +
As you can see, the asset_id value is -1 because this value does not apply to an asset in this case, but instead to a file.
 +
 +
 +
===authority_type===
 +
In authority_type, each field (or type) is assigned an id number:
 +
# title
 +
# creator
 +
# archivist
 +
# subject
 +
# date 
 +
... and so on.  The ID number assigned here is used to refer to the value stored in the authority table.
 +
 +
===file===
 +
 +
The file table and the asset table are the heart of how content is managed in Acumen.  The entries in each one have to relate to the other entries (including the ones in the other table) properly. If they do not, Acumen gets very confused, and delivers garbage online.  Check first to see if files are out of order in the web directories.  If they are not, then the database has become corrupt.  You'll have to find the cause, and fix it.  When I found it corrupted in mid 2016, many assets had the wrong file_id entry, and many file_name entries had the wrong parent_id.
 +
 +
| Field              | Type                | Null | Key | Default | Extra          |
 +
+--------------------+----------------------+------+-----+---------+----------------+
 +
| id                | bigint(11)          | NO  | PRI | NULL    | auto_increment |  this is the file_id used elsewhere
 +
| parent_id          | bigint(11)          | YES  | MUL | NULL    |                |  the file_id of the file that hierarchically is above this one
 +
| file_type_id      | int(11)              | NO  |    | NULL    |                |  the file_type id from the file_type table
 +
| title              | varchar(255)        | NO  | MUL | NULL    |                |  the title again; a duplication from the authority table (why??)
 +
| file_name          | varchar(255)        | NO  | MUL | NULL    |                |  the file name WITH EXTENSION
 +
| file_path          | varchar(255)        | NO  |    | NULL    |                |  the SERVER full path to the file, including file name
 +
| file_size          | int(11)              | NO  |    | NULL    |                |  size in bytes
 +
| file_last_modified | bigint(11)          | NO  |    | NULL    |                |  in seconds from the epoch
 +
| status_type_id    | int(11)              | NO  | MUL | NULL    |                |  see status_type for ID value; 1 is good!  others are not
 +
| found              | smallint(5) unsigned | NO  | MUL | NULL    |                |  Whether Acumen found this file the last time it indexed
 +
 +
Parent_Id for a page level MODS should correspond to the item-level MODS.  Parent_id for the item_level MODS should correspond to the EAD (or if there isn't one, the collection xml file).  Parent_id for the EAD (or collection xml file) should be the holder level, such as u0003 or w0001.
 +
 +
 +
'''One way to force a reindex of metadata when things are badly screwed up, is to set all the file_last_modified fields = 0;  then reindex.'''  That way, the indexer will check each and every metadata record, and not assume anything.  Will Jones taught me this one.
 +
 +
 +
 +
===file_type===
 +
 +
Here's the file_type IDs.  "Release" means it's good, it's online.
 +
 +
| id | type          |
 +
+----+----------------+
 +
|  1 | release        |
 +
|  2 | development    |
 +
|  3 | obsolete      |
 +
|  4 | missing        |
 +
|  5 | child_metadata |
 +
|  6 | broken        |
 +
 +
 +
===status_type===
 +
===tags===
 +
===tags_assets===
 +
===transcripts===

Revision as of 12:42, 5 July 2017

Acumen has numerous tables, some of which I'm not even sure are in use. First I'll list the tables, and then I'll talk about the ones that I've worked with to sort out problems. Here's the tables (remember, this is from all 3 acumen databases -- one for live, one for staging, and one for dev). The ones in bold I will describe further:

| aauth_group_to_group | (empty)
| aauth_groups         | (3 entries; clearly Will or Tonio was hoping to set up multiple levels of access)
| aauth_login_attempts | (empty)
| aauth_perm_to_group  | (empty)
| aauth_perm_to_user   | (empty)
| aauth_perms          | (empty)
| aauth_pms            | (empty)
| aauth_user_to_group  | (2 entries)
| aauth_user_variables | (empty)
| aauth_users          | (1 entry)
| asset                |
| asset_orphan         | (empty)
| asset_type           |
| authority            |
| authority_type       |
| crowd_status         | (to be used if anyone bothered to review tags/transcripts and approve them)
| featured             |  (to be used with what Will was setting up in dev, to showcase collections)
| file                 |
| file_type            |
| note                 | (empty)
| note_type            | (empty)
| privilege            | only one entry
| privilege_type       | only one entry
| scan_task            | (empty)
| scratch              |  (not used often, and not sure how/why it's used)
| search_category      | (empty)
| status_type          |
| tags                 |
| tags_assets          |
| transcripts          |
| user                 | (two test entries)
| user_action          | (empty)
| user_action_type     | (empty)

Contents

Tables of Interest

  • asset
  • asset_type
  • authority
  • authority_type
  • file
  • file_type
  • status_type
  • tags
  • tags_assets
  • transcripts

asset_type table

This table lists the types of digital content (apart from metadata) that we currently support in Acumen, with the extensions to be expected for the delivery file and the extensions to be expected for the thumbnails. Each type of asset has an assigned ID:

| id | type     | asset_tails               | thumb_tails                     | parent_folder |
+----+----------+---------------------------+---------------------------------+---------------+
|  1 | image    | .tif,.tiff,_2048.jpg      | _512.jpg,_128.jpg,.txt,.ocr.txt | NULL          |
|  3 | video    | _512kb.mp4,.mp4           | _128v.jpg,_800s.jpg             | NULL          |
|  4 | audio    | _64kb.mp3,_256kb.mp3,.mp3 | .txt,.ocr.txt                   | NULL          |
|  2 | document | .pdf                      | _128c.jpg                       | NULL          |
|  6 | ignore   | .icon.jpg                 |                                 | NULL          |

If you need Acumen to support new types of files, add them here. If they will exist in a subdirectory (such as SupplementalFiles), I think that would go into the parent_folder field.

asset table

Here's the columns in the asset table. This table tracks every single image, audio, video and text file, each of which is considered an "asset" to a metadata record somewhere, in some capacity:

| Field              | Type                 | Null | Key | Default | Extra          |
+--------------------+----------------------+------+-----+---------+----------------+
| id                 | bigint(20)           | NO   | PRI | NULL    | auto_increment | This is referred to in other tables as file_id
| asset_type_id      | int(11)              | NO   |     | NULL    |                | See asset_type ID values in asset_type table above
| name               | varchar(255)         | NO   | UNI | NULL    |                | This is of the form: u0003_0000581_0000001
| orig_path          | varchar(255)         | NO   |     | NULL    |                | This is the acumen path for web delivery, including file name
| thumb_path         | varchar(255)         | NO   |     | NULL    |                | This is the acumen path for web delivery of thumb (no file name)
| file_id            | bigint(20)           | YES  | MUL | NULL    |                | See file_type ID values in file_type table below
| file_size          | bigint(20)           | NO   |     | NULL    |                | size in bytes
| file_last_modified | bigint(20)           | NO   |     | NULL    |                | this is in seconds since the Epoch
| status_type_id     | int(11)              | NO   | MUL | NULL    |                | See status_type ID values in status_type table below
| found              | smallint(5) unsigned | NO   | MUL | NULL    |                | If "1" the system found the file; if "0" there's a problem.

As you can see, fields in some tables refer to fields in others. Notably, the different types of IDs are used to pull information together. That's why this is called a relational database.


authority

The authority table brings together ID values from the file table, the asset table, the authority_type table, and matches them to a value. So if you wanted a list of all titles, you could query: "select value from authority where authority_type_id = 1;" since in authority_type, title is assigned ID number 1. (Normally, though, we want to find a value for a particular file; for that, you'll need to look up the file_id in the file table.)

| Field             | Type       | Null | Key | Default | Extra          |
+-------------------+------------+------+-----+---------+----------------+
| id                | bigint(20) | NO   | PRI | NULL    | auto_increment | this is the authority_id used elsewhere
| file_id           | bigint(20) | NO   | MUL | NULL    |                | this is the ID from the file table to specify which file exactly
| asset_id          | bigint(20) | YES  | MUL | NULL    |                | this is the ID from the asset table to specify which asset exactly
| authority_type_id | int(11)    | NO   | MUL | NULL    |                | this is the ID from the authority_type table below, for which field it is
| value             | text       | NO   |     | NULL    |                | this is the actual text value of the field in question

So if I've already looked in the file table and found that the file_id for u0003_0000581_0000001.mods.xml is 150791, and I want to look at the title for that record (titles are authority_type_id of 1 from the authority_type table above, I can query:

MariaDB [acumen]> select * from authority where file_id=150791 and authority_type_id=1;
+----------+---------+----------+-------------------+-------------------------------------------------------------+
| id       | file_id | asset_id | authority_type_id | value                                                       |
+----------+---------+----------+-------------------+-------------------------------------------------------------+
| 23391915 |  150791 |       -1 |                 1 | Letter from Henry L. Abbot to William C. Gorgas, 1904-02-11 |
+----------+---------+----------+-------------------+-------------------------------------------------------------+

As you can see, the asset_id value is -1 because this value does not apply to an asset in this case, but instead to a file.


authority_type

In authority_type, each field (or type) is assigned an id number:

  1. title
  2. creator
  3. archivist
  4. subject
  5. date

... and so on. The ID number assigned here is used to refer to the value stored in the authority table.

file

The file table and the asset table are the heart of how content is managed in Acumen. The entries in each one have to relate to the other entries (including the ones in the other table) properly. If they do not, Acumen gets very confused, and delivers garbage online. Check first to see if files are out of order in the web directories. If they are not, then the database has become corrupt. You'll have to find the cause, and fix it. When I found it corrupted in mid 2016, many assets had the wrong file_id entry, and many file_name entries had the wrong parent_id.

| Field              | Type                 | Null | Key | Default | Extra          |
+--------------------+----------------------+------+-----+---------+----------------+
| id                 | bigint(11)           | NO   | PRI | NULL    | auto_increment |  this is the file_id used elsewhere
| parent_id          | bigint(11)           | YES  | MUL | NULL    |                |  the file_id of the file that hierarchically is above this one
| file_type_id       | int(11)              | NO   |     | NULL    |                |  the file_type id from the file_type table
| title              | varchar(255)         | NO   | MUL | NULL    |                |  the title again; a duplication from the authority table (why??)
| file_name          | varchar(255)         | NO   | MUL | NULL    |                |  the file name WITH EXTENSION
| file_path          | varchar(255)         | NO   |     | NULL    |                |  the SERVER full path to the file, including file name
| file_size          | int(11)              | NO   |     | NULL    |                |  size in bytes
| file_last_modified | bigint(11)           | NO   |     | NULL    |                |  in seconds from the epoch
| status_type_id     | int(11)              | NO   | MUL | NULL    |                |  see status_type for ID value; 1 is good!  others are not
| found              | smallint(5) unsigned | NO   | MUL | NULL    |                |  Whether Acumen found this file the last time it indexed

Parent_Id for a page level MODS should correspond to the item-level MODS. Parent_id for the item_level MODS should correspond to the EAD (or if there isn't one, the collection xml file). Parent_id for the EAD (or collection xml file) should be the holder level, such as u0003 or w0001.


One way to force a reindex of metadata when things are badly screwed up, is to set all the file_last_modified fields = 0; then reindex. That way, the indexer will check each and every metadata record, and not assume anything. Will Jones taught me this one.


file_type

Here's the file_type IDs. "Release" means it's good, it's online.

| id | type           |
+----+----------------+
|  1 | release        |
|  2 | development    |
|  3 | obsolete       |
|  4 | missing        |
|  5 | child_metadata |
|  6 | broken         |


status_type

tags

tags_assets

transcripts

Personal tools