We are going to use the package stringr to learn some common basic string manipulation. In biology, these operations are often needed to clean messy datasets, to work with sequence data, or to extract data from unformatted text.
In base R, there are equivalent functions to the ones we will cover today. They are a little more challenging to use but may provide additional flexibility if you have very specific needs.
To learn the functions we’ll cover today, we will work again with the list of sea cucumber specimens downloaded from iDigBio.
Let’s get set up:
## if you need to download the data
## download.file("http://r-bio.github.io/data/holothuriidae-specimens.csv",
## "data/holothuriidae-specimens.csv")
hol <- read.csv(file="data/holothuriidae-specimens.csv", stringsAsFactors=FALSE)
library(stringr)
The function str_length() gives the number of characters in a string:
str_length(c("cat", "dog", "giraffe", "cute dog", "very cute kitten"))
## [1] 3 3 7 8 16
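As a small sketch of how str_length() compares with its base R counterpart nchar(), the example below uses made-up inputs (not the sea cucumber data). The differences shown are the ones I understand from the stringr documentation: missing values stay missing, and factors are measured by their labels.

```r
library(stringr)

animals <- c("cat", NA, "giraffe")   # hypothetical input with a missing value

# str_length() keeps the missing value as NA
lengths_stringr <- str_length(animals)
lengths_stringr

# for a character vector, nchar() gives the same counts
nchar(animals)

# but a bare (logical) NA is treated by nchar() as the two-letter string "NA"
nchar(NA)
str_length(NA)

# str_length() also accepts factors and counts the characters of each label
str_length(factor(c("cat", "giraffe")))
```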
Challenge
Which country where sea cucumbers have been collected has the most letters in its name?
The function str_sub() can take 3 arguments: a vector of class “character”, a start position, and an end position. Negative numbers indicate positions counted from the end of the string (the last character being -1).
str_sub("a very cute kitten")
## [1] "a very cute kitten"
str_sub("a very cute kitten", start=1L, end=-1L)
## [1] "a very cute kitten"
str_sub("a very cute kitten", start=8)
## [1] "cute kitten"
str_sub("a very cute kitten", start=-6)
## [1] "kitten"
str_sub("a very cute kitten", start=3, end=6)
## [1] "very"
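Because start and end can themselves be vectors, str_sub() can cut one string into several pieces in a single call. The sketch below splits a short, made-up DNA sequence into codons; the sequence and the variable names are assumptions for illustration only.

```r
library(stringr)

dna <- "ATGGCCTAA"   # hypothetical 9-base sequence

# start position of each codon: 1, 4, 7
starts <- seq(1, str_length(dna), by = 3)

# str_sub() recycles the string over the vectors of start and end positions
codons <- str_sub(dna, start = starts, end = starts + 2)
codons
## [1] "ATG" "GCC" "TAA"
```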
Challenge
A common mistake in taxonomic data is that the wrong suffixes are used in order or class names. Check that the last 4 letters are the same for all records in this dataset.
str_sub() can also be used to replace parts of a string:
cutest <- "The cutest animals are puppies"
str_sub(cutest, -7) <- "kittens"
cutest
## [1] "The cutest animals are kittens"
str_c() is equivalent to paste(), but by default it uses the empty string as separator:
str_c("the cutest are ", c("cats", "dogs"), collapse=" and not ")
## [1] "the cutest are cats and not the cutest are dogs"
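To make the roles of sep and collapse more concrete, here is a minimal sketch with made-up genus and species vectors. It also shows one behavior that, as far as I know, differs from paste(): a missing piece makes the whole result missing.

```r
library(stringr)

genus   <- c("Holothuria", "Actinopyga")   # hypothetical values
species <- c("atra", "mauritiana")

# sep goes between the pieces of each element
binomials <- str_c(genus, species, sep = " ")
binomials
## [1] "Holothuria atra"       "Actinopyga mauritiana"

# collapse then joins the elements into a single string
joined <- str_c(binomials, collapse = "; ")
joined
## [1] "Holothuria atra; Actinopyga mauritiana"

# a missing piece makes the whole result missing, unlike paste()
str_c("Holothuria", NA, sep = " ")
## [1] NA
paste("Holothuria", NA)
## [1] "Holothuria NA"
```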
str_dup() replicates a string as many times as specified:
str_dup(c("wow ", "amazing "), 3)
## [1] "wow wow wow " "amazing amazing amazing "
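The number of repetitions does not have to be the same for every element. A short sketch, using a made-up two-base motif:

```r
library(stringr)

# the number of repetitions can differ for each element of the result
repeats <- str_dup("AT", 1:3)
repeats
## [1] "AT"     "ATAT"   "ATATAT"

# str_length() confirms the resulting lengths
str_length(repeats)
## [1] 2 4 6
```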
str_trim() removes leading and trailing whitespace. It’s very common when importing data (or cleaning up data entered in a form) that there are spaces that you don’t want to have to deal with. This function removes them.
str_pad() adds whitespace (or other characters) on the right, left, or both sides to make a string a given length (width):
str_trim(str_dup(str_pad(c("wow ", "amazing "), width = str_length("amazing"), side="right", pad = "!"), 3))
## [1] "wow !!!wow !!!wow !!!" "amazing amazing amazing"
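The nested call above does a lot at once, so here is a simpler sketch of each function on its own. The species name and the catalog numbers are made up for illustration.

```r
library(stringr)

messy <- "  Holothuria atra "   # hypothetical value with stray spaces

trimmed_both <- str_trim(messy)                  # both sides (the default)
trimmed_both
## [1] "Holothuria atra"

trimmed_left <- str_trim(messy, side = "left")   # leading spaces only
trimmed_left
## [1] "Holothuria atra "

# pad hypothetical catalog numbers with zeros on the left, to a width of 5
padded <- str_pad(c("7", "42", "123"), width = 5, side = "left", pad = "0")
padded
## [1] "00007" "00042" "00123"
```

Zero-padding like this is handy when identifiers need to sort correctly as text.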
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. (Jamie Zawinski)
## count the scientific names that contain "holothuria", regardless of case
sum(str_detect(hol$dwc.scientificName, regex("holothuria", ignore_case = TRUE)))
## [1] 2899
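To see what the case-insensitive match does without loading the full dataset, here is a minimal sketch on a made-up vector of names. It assumes a version of stringr that provides regex() with the ignore_case argument.

```r
library(stringr)

# hypothetical names with inconsistent capitalization and one missing value
sci_names <- c("Holothuria atra", "HOLOTHURIA edulis", "Actinopyga mauritiana", NA)

# by default the match is case sensitive, so nothing matches the lowercase pattern
str_detect(sci_names, "holothuria")
## [1] FALSE FALSE FALSE    NA

# regex(..., ignore_case = TRUE) makes the match case insensitive
is_holothuria <- str_detect(sci_names, regex("holothuria", ignore_case = TRUE))
is_holothuria
## [1]  TRUE  TRUE FALSE    NA

# missing values propagate, so drop them when counting
n_holothuria <- sum(is_holothuria, na.rm = TRUE)
n_holothuria
## [1] 2
```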
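library(dplyr) ## provides the %>% pipe
authors <- hol$dwc.scientificNameAuthorship %>%
  ## transliterate a few accented characters
  str_replace(pattern = "å", replacement = "aa") %>%
  str_replace(pattern = "ä", replacement = "ae") %>%
  str_replace(pattern = "é", replacement = "e") %>%
  ## expand initials and drop the "Krauss in" prefix
  str_replace("^HL", "Hubert Lyman") %>%
  str_replace("^Krauss in", "") %>%
  ## split multi-author strings into individual names
  str_split("&") %>% unlist() %>%
  str_split(", ") %>% unlist() %>%
  ## keep the part of each piece that starts and ends with a letter
  str_extract("[[:alpha:]]+(.+[[:alpha:]])?") %>%
  unique() %>% sort()
authors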
## [1] "Augustin" "Bell" "Brandt"
## [4] "Caso" "Caycedo" "Cherbonnier"
## [7] "Chiaje" "Clark" "Deichmann"
## [10] "Delle Chiaje" "Domantay" "Erwe"
## [13] "Feral" "Fisher" "Forskaal"
## [16] "Gaimard" "Gmelin" "Hubert Lyman Clark"
## [19] "Jaeger" "Kerr" "Laguarda-Figueras"
## [22] "Lampert" "Lesson" "Linnaeus"
## [25] "Ludwig" "Marenzeller" "Massin"
## [28] "Miller" "Mitsukuri" "Pawson"
## [31] "Pourtales" "Pourtalès" "Purcell"
## [34] "Quoy" "Rowe" "Samyn"
## [37] "Selenka" "Semper" "Sluiter"
## [40] "Solís-Marín" "Tan Tiu" "Theel"
## [43] "Tomascik" "Uthicke" "von Marenzeller"
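The list above still contains near-duplicates such as "Pourtales" and "Pourtalès", because only the characters named in the pipeline are transliterated. To make the individual steps easier to follow, here is a sketch of the same split-and-extract idea on three made-up authorship strings, written without the pipe. The input values and variable names are assumptions for illustration.

```r
library(stringr)

# hypothetical authorship strings, in the style of the dataset
raw <- c("Quoy & Gaimard, 1833", "(Selenka, 1867)", "Delle Chiaje")

# str_split() returns a list (one element per input string), hence unlist()
pieces <- unlist(str_split(raw, "&"))
pieces <- unlist(str_split(pieces, ", "))
pieces
## [1] "Quoy "        " Gaimard"     "1833"         "(Selenka"     "1867)"
## [6] "Delle Chiaje"

# keep the part that starts and ends with a letter; pieces without
# any letter (the years) become NA
names_only <- str_extract(pieces, "[[:alpha:]]+(.+[[:alpha:]])?")

# sort() drops the NA values by default
authors_demo <- sort(unique(names_only))
authors_demo
## [1] "Delle Chiaje" "Gaimard"      "Quoy"         "Selenka"

# str_replace() changes only the first match in each string;
# str_replace_all() changes every match
str_replace("a-b-c", "-", " ")
## [1] "a b-c"
str_replace_all("a-b-c", "-", " ")
## [1] "a b c"
```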
stringr has a good vignette (on which this lecture is based).
rex provides an easier way of writing regular expressions.