The R class: R programming for biologists

03 -- `data.frame` continued

Getting back where we started

Let’s start with a clean working directory. Create the folder data/ within your working directory, and download the 2 datasets:

download.file("http://r-bio.github.io/data/surveys.csv",
              "data/surveys.csv")
download.file("http://r-bio.github.io/data/species.csv",
              "data/species.csv")

and then check that you can load the surveys dataset (for now) into R with read.csv(), and inspect it with head() and str():

##   X record_id month day year plot species_id sex wgt
## 1 1         1     7  16 1977    2         NL   M  NA
## 2 2         2     7  16 1977    3         NL   M  NA
## 3 3         3     7  16 1977    2         DM   F  NA
## 4 4         4     7  16 1977    7         DM   M  NA
## 5 5         5     7  16 1977    3         DM   M  NA
## 6 6         6     7  16 1977    1         PF   M  NA
## 'data.frame':	35549 obs. of  9 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ record_id : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ day       : int  16 16 16 16 16 16 16 16 16 16 ...
##  $ year      : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
##  $ plot      : int  2 3 2 7 3 1 2 1 1 6 ...
##  $ species_id: chr  "NL" "NL" "DM" "DM" ...
##  $ sex       : chr  "M" "M" "F" "M" ...
##  $ wgt       : int  NA NA NA NA NA NA NA NA NA NA ...

About the data.frame class

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.

A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different class (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

Most often, you will create data.frame objects by importing spreadsheets from your hard drive (or the web) with the functions read.csv() or read.table().

You can also create a data.frame manually with the function data.frame(). This function can also take the argument stringsAsFactors. Compare the output of these examples:

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8))
str(example_data)
## 'data.frame':	4 obs. of  3 variables:
##  $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
##  $ feel  : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
##  $ weight: num  45 8 1.1 0.8

Here you can observe the default behavior of the data.frame() function in R versions before 4.0.0: the columns animal and feel are of class factor, because data.frame() converts (= coerces) columns that contain characters (i.e., text) into vectors of class factor. Since R 4.0.0, the default is stringsAsFactors=FALSE, so character columns stay as character unless you set stringsAsFactors=TRUE. Depending on what you want to do with the data, you may want to keep these columns as character. To do so, data.frame(), read.csv() and read.table() have an argument called stringsAsFactors which can be set to FALSE:

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
## 'data.frame':	4 obs. of  3 variables:
##  $ animal: chr  "dog" "cat" "sea cucumber" "sea urchin"
##  $ feel  : chr  "furry" "furry" "squishy" "spiny"
##  $ weight: num  45 8 1.1 0.8

If you want to manually change the class of one of the columns, you can use the function as.factor() (below we’ll cover in more detail how to work with columns):

example_data$feel <- as.factor(example_data$feel)
str(example_data)
## 'data.frame':	4 obs. of  3 variables:
##  $ animal: chr  "dog" "cat" "sea cucumber" "sea urchin"
##  $ feel  : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
##  $ weight: num  45 8 1.1 0.8
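
Because the default value of stringsAsFactors depends on the R version, it can help to set it explicitly. The following is a minimal sketch on a small made-up data.frame (the object names animals_chr and animals_fct are hypothetical, not part of the lesson's datasets). It shows both settings, and the reverse conversion with as.character():

```r
# Hypothetical toy data, used only for illustration
animals_chr <- data.frame(animal = c("dog", "cat"),
                          weight = c(45, 8),
                          stringsAsFactors = FALSE)
animals_fct <- data.frame(animal = c("dog", "cat"),
                          weight = c(45, 8),
                          stringsAsFactors = TRUE)

class(animals_chr$animal)   # "character"
class(animals_fct$animal)   # "factor"

# Keep a copy of the factor levels before converting back
animal_levels <- levels(animals_fct$animal)   # "cat" "dog"

# as.character() converts a factor column back to text
animals_fct$animal <- as.character(animals_fct$animal)
class(animals_fct$animal)   # "character"
```

Setting the argument explicitly makes the result the same regardless of the R version that runs the script.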

Challenge

  1. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
                          author_last=c(Darwin, Mayr, Dobzhansky),
                          year=c(1942, 1970))
  2. Can you predict the class for each of the columns in the following example? Check your guesses using str(country_climate). Are they what you expected? Why? Why not?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
                              climate=c("cold", "hot", "temperate", "hot/temperate"),
                              temperature=c(10, 30, 18, "15"),
                              north_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
                              has_kangaroo=c(FALSE, FALSE, FALSE, 1))

When a vector mixes data types, R coerces (when possible) all of its elements to the most general type present, following the order logical, integer, numeric, character. The notes from the second lecture cover the coercion rules R uses.
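
As a quick reminder of these rules, here is a small sketch using made-up vectors (the object names are hypothetical). Each call to c() mixes two types, and class() reports the type R settled on:

```r
# Mixing types inside c() forces a single common type
logical_and_integer <- c(TRUE, 2L)
logical_and_number  <- c(TRUE, 2.5)
number_and_text     <- c(2.5, "a")
logical_and_text    <- c(TRUE, "a")

class(logical_and_integer)  # "integer"
class(logical_and_number)   # "numeric"
class(number_and_text)      # "character"
class(logical_and_text)     # "character"

# The values themselves are converted too
logical_and_number          # 1.0 2.5
logical_and_text            # "TRUE" "a"
```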

Inspecting data.frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

  • Size:
    • dim() - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
  • Names:
    • names() - returns the column names (synonym of colnames() for data.frame objects)
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column
    • summary() - summary statistics for each column

Note: most of these functions are “generic”: they can be used on other types of objects besides data.frame.
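
To see what these functions return before trying them on surveys, here is a short sketch on a small made-up data.frame (toy and its values are hypothetical):

```r
# Hypothetical toy data: 8 rows, 2 columns
toy <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7, 8),
                  wgt  = c(40, NA, 48, 29, 46, 36, 52, 44))

dim(toy)       # 8 2 (rows, then columns)
nrow(toy)      # 8
ncol(toy)      # 2
names(toy)     # "plot" "wgt"
head(toy)      # the first 6 rows
tail(toy, 2)   # the last 2 rows (the second argument sets how many)
str(toy)       # class, length and content of each column
summary(toy)   # summary statistics for each column
```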

Challenge

Use these functions on the surveys data set loaded in R.

Slicing data

Our surveys data frame has rows and columns (it’s a 2-dimensional object). If we want to extract some specific data from it (a slice of it), we need to specify the “coordinates” of the data we want. To do this, we use the square bracket notation (just like with vectors), except that we need to add a comma to separate the rows from the columns we want. Row numbers come first, followed by column numbers. Here are some examples:

surveys[1, 1]   # first element in the first column of the data frame
surveys[1, 6]   # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ]    # the 3rd row, with all columns
surveys[, 8]    # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
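
The kind of object you get back depends on what you slice. The sketch below uses a small made-up data.frame (toy is hypothetical) to illustrate that a single column comes back as a vector, while a row comes back as a data.frame:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(id  = 1:5,
                  sex = c("M", "F", "F", "M", "F"),
                  wgt = c(40, 35, 48, 29, 46),
                  stringsAsFactors = FALSE)

toy[2, 3]            # 35: row 2, column 3
toy[1:3, 3]          # 40 35 48: rows 1 to 3 of column 3

one_row <- toy[3, ]  # the 3rd row, with all columns
one_col <- toy[, 3]  # the entire 3rd column

class(one_row)       # "data.frame"
class(one_col)       # "numeric"
```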

Challenge

  1. The function nrow() on a data.frame returns the number of rows. Use it, in conjunction with seq(), to create a new data.frame called surveys_by_10 that includes every 10th row of the surveys data frame starting at row 10 (10, 20, 30, …)

Subsetting data

Particularly for larger datasets, it can be tricky to remember the column number that corresponds to a particular variable (Are species names in column 6 or 8? Oh, right… they are in column 7), and using the column number to extract the data (i.e., surveys[, 7]) may not be practical. The position of a variable can also change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable: it makes your code easier to read and your intentions clearer.

You can do operations on a particular column by selecting it using the $ sign. In this case, the entire column is a vector. For instance, to extract all the weights from our dataset, we can use: surveys$wgt. You can use names(surveys) or colnames(surveys) to remind yourself of the column names.

In some cases, you may want to select more than one column. You can do this using the square brackets: surveys[, c("wgt", "sex")].

When analyzing data, though, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per plot.

One way to do this is to select the data we want and create a new temporary data.frame using the subset() function. For instance, if we just want to look at the animals of the species “DO”:

surveys_DO <- subset(surveys, species_id == "DO")
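
As an illustration of selecting by name and computing a partial statistic, here is a minimal sketch on a made-up data.frame (toy and its values are hypothetical; the column names mirror those in the str() output of surveys). The argument na.rm=TRUE tells max() and mean() to ignore missing values:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(species_id = c("DO", "DM", "DO", "DM", "DO"),
                  plot       = c(1, 1, 2, 2, 3),
                  wgt        = c(52, 40, NA, 44, 48),
                  stringsAsFactors = FALSE)

toy$wgt                         # one column, as a vector
toy[, c("species_id", "wgt")]   # two columns, as a data.frame

# Keep only the rows for which the condition is TRUE
toy_DO <- subset(toy, species_id == "DO")
nrow(toy_DO)                    # 3

max(toy_DO$wgt, na.rm = TRUE)   # 52
mean(toy_DO$wgt, na.rm = TRUE)  # 50
```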

Challenge

  1. What does the following do (Try to guess without executing it)? surveys_DO$month[2] <- 8

  2. Use the function subset to create a data.frame that contains all individuals of the species “DM” that were collected in 2002. How many individuals of the species “DM” were collected in 2002?

Adding a column to our dataset

Sometimes, you may have to add a new column to your dataset that represents a new variable. You can add columns to a data.frame using the function cbind() (column bind). Beware: the additional column must have the same number of elements as there are rows in the data.frame.

In our surveys dataset, the species are represented by a 2-letter code (e.g., “AB”); however, we would like to include the species name. The correspondence between the 2-letter codes and the names is in the file species.csv. In this file, one column includes the genus and another includes the species. First, we are going to import this file into memory:

species <- read.csv("data/species.csv", stringsAsFactors=FALSE)

We are then going to use the function match() to create a vector that contains the genus names for all our observations. The function match() takes at least 2 arguments: the values to be matched (in our case the 2-letter code from the surveys data frame, held in the column species_id), and the table that contains the values to be matched against (in our case the 2-letter code in the species data frame, also held in a column named species_id). The function returns the position of the matches in the table, and this can be used to retrieve the genus names:

surveys_spid_index <- match(surveys$species_id, species$species_id)
surveys_genera <- species$genus[surveys_spid_index]
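
To make the two steps easier to follow, here is the same idea on a tiny made-up example (codes, lookup, idx and genera are hypothetical names, and "ZZ" is an invented code that is deliberately absent from the table). Note that match() returns NA for values it cannot find:

```r
# Hypothetical observations and lookup table
codes  <- c("DM", "NL", "DM", "PF", "ZZ")
lookup <- data.frame(species_id = c("DM", "NL", "PF"),
                     genus      = c("Dipodomys", "Neotoma", "Perognathus"),
                     stringsAsFactors = FALSE)

# Position of each code in the lookup table
idx <- match(codes, lookup$species_id)
idx      # 1 2 1 3 NA

# Use the positions to pull out the corresponding genus names
genera <- lookup$genus[idx]
genera   # "Dipodomys" "Neotoma" "Dipodomys" "Perognathus" NA
```

The result has one element per observation, which is what cbind() needs in order to add it as a column.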

Now that we have our vector of genus names, we can add it as a new column to our surveys object:

surveys <- cbind(surveys, genus=surveys_genera)

Challenge

  • Use the same approach to also include the species names in the surveys data frame.

  • Use the help in R to understand what the function paste() does. Use it to add a new column called genus_species into the species data.frame.
  • Use the help to understand what the function merge() does. Use it to create a new data.frame that combines the content of surveys and the modified version of species.
  • Use this data set to answer the following:
    • How many birds have been captured?
    • How many individuals of the genus Dipodomys have been captured?

Adding rows

Let’s create a data.frame that contains the information only for the species “DO” and “DM”. We know how to create the data set for each species with the function subset():

surveys_DO <- subset(surveys, species_id == "DO")
surveys_DM <- subset(surveys, species_id == "DM")

Similarly to cbind() for columns, there is a function rbind() (row bind) that puts together two data.frame objects. With rbind(), the number of columns and their names must be identical between the two objects:

surveys_DO_DM <- rbind(surveys_DO, surveys_DM)
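
A quick way to check the result of rbind() is to compare row counts. The sketch below does this on a small made-up data.frame (toy and its values are hypothetical):

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(species_id = c("DO", "DM", "NL", "DM", "DO"),
                  wgt        = c(52, 40, 180, 44, 48),
                  stringsAsFactors = FALSE)

toy_DO <- subset(toy, species_id == "DO")
toy_DM <- subset(toy, species_id == "DM")

# Stack the two subsets on top of each other
toy_DO_DM <- rbind(toy_DO, toy_DM)

# The combined object has as many rows as the two pieces together
nrow(toy_DO_DM) == nrow(toy_DO) + nrow(toy_DM)  # TRUE
unique(toy_DO_DM$species_id)                    # "DO" "DM"
```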

Challenge

  • Using a similar approach, construct a new data.frame that only includes data for the years 2000 and 2001.

  • How does it differ from subset(surveys, species_id == "DO" | species_id == "DM")?

Removing columns

Just like you can select columns by their positions in the data.frame or by their names, you can remove them similarly.

To remove columns by number:

surveys_noDate <- surveys[, -c(3:5)]
colnames(surveys)
##  [1] "X"            "record_id"    "month"        "day"         
##  [5] "year"         "plot"         "species_id"   "sex"         
##  [9] "wgt"          "genus"        "species_name"
colnames(surveys_noDate)
## [1] "X"            "record_id"    "plot"         "species_id"  
## [5] "sex"          "wgt"          "genus"        "species_name"

The easiest way to remove by name is to use the subset() function. This time we need to specify explicitly the argument select as the default is to subset on rows (as above). The minus sign indicates the names of the columns to remove (note that the column names should not be quoted):

surveys_noDate2 <- subset(surveys, select=-c(month, day, year))
colnames(surveys_noDate2)
## [1] "X"            "record_id"    "plot"         "species_id"  
## [5] "sex"          "wgt"          "genus"        "species_name"
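
The two approaches can be compared side by side on a small made-up data.frame (toy, by_position and by_name are hypothetical names). This sketch only checks that both leave the same columns:

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(record_id = 1:3,
                  month     = c(7, 7, 8),
                  day       = c(16, 17, 1),
                  year      = c(1977, 1977, 1978),
                  wgt       = c(40, NA, 48))

# Remove columns 2 to 4 by position
by_position <- toy[, -c(2:4)]

# Remove the same columns by (unquoted) name
by_name <- subset(toy, select = -c(month, day, year))

colnames(by_position)   # "record_id" "wgt"
colnames(by_name)       # "record_id" "wgt"
```

Removing by name keeps working if columns are later added or reordered, whereas the positions 2 to 4 would then point at different variables.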