How to build a Database

If the following are true, you may want to consider moving to a database

  • Your spreadsheets have more than 1000 entries
  • Your excel files have multiple tabs that relate to each other
  • You are concerned about overwriting your data
  • You need multiple users to access the data

Below are several options to help you get started with relational databases. They are provided as a combination of the database software and a graphical user interface (GUI).

As always, don’t hesitate to contact us with any questions as you implement and modify this starting database schema.

DB Software and GUISQLite and DB BrowserMySQL Server and MySQL WorkbenchAWS hosted database server and MySQL Workbench
Need Server?No, file is hosted on your computerYes. Your computer can act as the server.AWS hosts the server
MultiuserNoYesYes
Data TypesLimitedWide rangeWide range
SetupPoint-and-clickPoint-and-clickPoint-and-click
QuerySQLSQLSQL
CostFreeServer maintenance fees (your computer can be the server)storage fees + data transfer fees (12 month grace period)
MaterialsSQLite
DB browser
Install MySQL
About Workbench
Get an AWS account through UC Davis
Create a database on AWS
About Workbench
Getting-started DifficultyEasiestModerateEasy

This tutorial will help you set up a simple database that will help you keep track of cohort vs individual data. This structure will be useful if you are working with cohorts plants or animals that cannot be tagged, but are still measured, and then continue to track the individuals from the cohort when they reach a size that can be tagged. This database allows you to connect measurements of individuals to measurements of their cohort.

First, you want to set up your cohort table. Both MySQL Workbench and DB Browser will allow you to create new tables in the GUI. They differ in the data types they allow, so the table creation steps are outlined below. The featured schema illustration was created in MySQL Workbench.

 

One of the most important steps in setting up and using a database is documenting it. Database documentation is called “Data dictionary” and contains the following information:

  • Table name
  • Field name
  • Data type or allowed values for the field name
  • Any other information regarding the field: is it a key, does it have to be unique, does it auto-increment, does it have a default value
  • A couple of sentences explaining the field. Do not assume your variable name is sufficient to convey the information.

The data_dictionary for the database structure in MySQL can be downloaded and re-purposed for other databases.

In MySQL Workbench you will connect to MySQL Server first. In the Navigator window, find your database. You will see “Tables” nested under it. Right-click and select “Create Table”. First, name the table “Cohort”. Add comments about the content of the table. Click on the first row to add a field name. We are going to follow a naming convention in which the field name is going to contain the table name (all fields in this table will start with cohort_). We are doing this to prevent duplication of field names. The first field is always going to be the unique identifier for the record. Here we call it cohort_id. In this case it is of data type VARCHAR(6)–select it from the drop-down menu and modify the number in the parenthesis to 6. This field now can contain up to six alpha-numeric characters. It is going to be a “Primary key” and “Not NULL”, which means it will unique and always present in each record you add. After this add fields cohort_cross_date, with data type “DATE”, cohort_parent_1 and cohort_parent_2, both as VARCHAR(45). Next, we will add a field with validation for text. This means that the field is going to accept a limited number of options. Ideally, all of your text fields will be designed this way. The way to constrict text options is to select data type ENUM from the drop-down menu and type in the allowed values in single quotation marks, separated by commas in the parentheses like this: ENUM(‘Location_1’, ‘Location_2’). Click “Apply” and MySQL Workbench will generate the SQL code to create the table.

The individual table is created in a similar manner, but here we will introduce a foreign key to the cohort table. This means, we will link it to the cohort table, so that only the id’s of cohorts that have been created can be entered in this table. This keeps the data clean, and improves the speed of any table joins you may choose to do in the future. Once you create the fields “individual_id”, “ind_cohort_id” and “location” with the data types as shown in the schema, click apply to create the table. Find the “Foreign Keys” tab. Name your foreign key “ind_to_cohort” and under Referenced Table select “cohort”. In the adjacent window in Column select “ind_to_cohort” and in the adjacent Referenced Column select cohort_id. For foreign key options, for “On Update” set to “CASCADE”. Let’s say you discovered you misspelled your cohort_id. Once you fix it in the cohort table, it will update with the appropriate individual records. “On Delete” I usually “SET NULL”, but you can also choose “NO ACTION” if you want to keep the reference to the cohort in the individual table even if the cohort record itself is deleted.

Finally, set up the “physical” table. You will notice new data types. First of all, the physical_id is an integer (INT) and is also auto-incremental (AI). This ensures that every measurement you make will be treated as unique. Whenever you are working with data that comes as whole numbers, such as counts, make sure to validate the data to ensure they are integers. The field “physical_cohort_or_ind” ascertains whether the measured organism is being tracked as an individual or only as part of the cohort (because it is too small to be tagged and distinguished from the other members of the cohort). This data type is ENUM(‘cohort’, ‘ind’). Finally, the measurements of weight and length are of DOUBLE data type. Note that the measurement unit is encoded in the field name. You can also have a separate field for units if you expect to be working with different instruments and on different scales. The important part is to keep units separate from the numeric measurement, so that your data are tidy and ready for transformations and analysis.

In DB Browser you will create a database first (see training materials). In the Database Structure tab, select “Create Table”. First, name the table “Cohort”. Further down in Fields, click on “Add Field”. We are going to follow a naming convention in which the field name is going to contain the table name (all fields in this table will start with cohort_). We are doing this to prevent duplication of field names. The first field is always going to be the unique identifier for the record. Here we call it cohort_id. You will see in the comparison table above that SQLite can handle a limited number of data types. In this case our choice is text–select it from the drop-down menu. It is going to be a “Primary key” and “Not NULL”, which means it will unique and always present in each record you add. After this add fields cohort_cross_date, cohort_parent_1 and cohort_parent_2, all three as text.┬áNext, we will add a field called “cohort_location” with validation for text. This means that the field is going to accept a limited number of options. Ideally, all of your text fields will be designed this way. The way to do this in DB Brower is to program a check to see if the input data is one of the options you have given it beforehand. Scroll to the right and locate the “Check” column. Enter cohort_location==”tank_1″ OR cohort_location==”tank_2″. Essentially you are stringing together the allowed values for cohort_location with the Boolean OR. Also note that you need to use the double equal sign. The syntax for this is field_name == “option” OR field_name==”some_other_option”, and you can keep adding OR’s.

The individual table is created in a similar manner, but here we will introduce a foreign key to the cohort table. This means, we will link it to the cohort table, so that only the id’s of cohorts that have been created can be entered in this table. This keeps the data clean, and improves the speed of any table joins you may choose to do in the future. First, create the field “individual_id” as a TEXT data type and as Primary Key, Not NULL and Unique. Next, when you create “ind_cohort_id” as a TEXT data type, scroll to column Foreign Key. When you click the appropriate field, a drop-down menus will appear. In the first drop-down menu choose the table (in this case, cohort) and in the next–the field (“cohort_id”). In the txt box type ON DELETE SET NULL ON UPDATE CASCADE. You can also choose “NO ACTION” if you want to keep the reference to the cohort in the individual table even if the cohort record itself is deleted. Watch the SQL script window below. It should show FOREIGN KEY(`ind_cohort_id`) REFERENCES `cohort`(`cohort_id`). If this line doesn’t appear, unclick and click back the PK, U and NN fields. You also have to set up the foreign key when you create the table. This is a quirk of DB Browser (or a bug). Finally, finish with “location” as a TEXT data type.

Finally, set up the “physical” table. You will notice new data types. First of all, the physical_id is an integer (INT) and is also auto-incremental (AI). This ensures that every measurement you make will be treated as unique. Whenever you are working with data that comes as whole numbers, such as counts, make sure to validate the data to ensure they are integers. The field “physical_cohort_or_ind” ascertains whether the measured organism is being tracked as an individual or only as part of the cohort (because it is too small to be tagged and distinguished from the other members of the cohort). This data type is TEXT, with Check (physical_cohort_or_ind==’cohort’ OR physical_cohort_or_ind==’ind’). Finally, the measurements of weight and length are of NUMERIC data type. Note that the measurement unit is encoded in the field name. You can also have a separate field for units if you expect to be working with different instruments and on different scales. The important part is to keep units separate from the numeric measurement, so that your data are tidy and ready for transformations and analysis.