2 May 2011

Import dbf to R, Manipulate Strings with the grep & sub Functions

Here's a set of historical species presence records for a certain geographical region (data-link). I wanted to manipulate and simplify the strings (species names) and get an overview of the data.
The tasks were to split genera and epitheta, to exclude species whose names contain specific strings, and to get rid of unwanted text (author names). To present the species record history graphically, I drew a plot with segments spanning the first and last year in which each species was recorded.
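
The three string tasks can be tried on a handful of names before running them on the full table. This is only a minimal sketch: the names in `nm` are made up for illustration and are not taken from the TLMFB_IL data, and `%in%` is used here as a compact stand-in for a chain of `!=` comparisons.

```r
# Made-up example names (not from the TLMFB_IL data)
nm <- c("Abies alba Mill.",
        "Carex flava L.",
        "Salix x rubens Schrank",
        "Festuca spec.")

# Drop hybrids, marked by " x " between genus and epithet
nm <- nm[!grepl(" x ", nm)]

# Genus: everything before the first space
gen <- sub(" .*", "", nm)

# Epithet: skip the genus plus one space, then cut again at the first space
epi <- sub(" .*", "", substring(nm, nchar(gen) + 2))

# Drop uncertain determinations, then rebuild the name without the author
keep <- !(epi %in% c("spec.", "sp.", "cf.", "", "?"))
sp <- paste(gen[keep], epi[keep])
print(sp)
```

Because `sub()` replaces only the first match, `" .*"` removes everything from the first space onward, which is what isolates the genus and, on the shortened string, the epithet.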

# for dbf import you'll need:
require(foreign)

# change to your directory:
TLMFB_IL <- read.dbf("D:\\Downloads\\TLMFB_IL.dbf")

str(TLMFB_IL)

# get rid of hybrids (grepl keeps all rows even if nothing matches):
dat_sub1 <- TLMFB_IL[!grepl(" x ", TLMFB_IL$NAME), ]

# check how many were dismissed:
length(TLMFB_IL$NAME) - length(dat_sub1$NAME)

# get genera and epitheta:
Gen <- sub(" .*", "", dat_sub1$NAME)
Epi <- sub(" .*", "", substring(dat_sub1$NAME, nchar(Gen)+2))

# get rid of species with unsure determination:
dat_sub2 <- dat_sub1[Epi != "spec." &
Epi != "sp." &
Epi != "cf." &
Epi != "" &
Epi != "?", ]

length(TLMFB_IL$NAME) - length(dat_sub2$NAME)

# check:
dat_sub2$NAME
TLMFB_IL$NAME

table(Epi != "spec." &
Epi != "sp." &
Epi != "cf." &
Epi != "" &
Epi != "?")

length(grep(" x ", TLMFB_IL$NAME))
length(dat_sub2$NAME)

# get genera and epitheta:
Gen <- sub(" .*", "", dat_sub2$NAME)
Epi <- sub(" .*", "", substring(dat_sub2$NAME, nchar(Gen)+2))

# get rid of authors:
sp <- paste(Gen, Epi)
length(sp)

# check arbitrary sample of 100 rows; the window should be large
# enough to show the columns next to each other:
id <- sample(length(sp), 100)
data.frame(sp = sp, orig = dat_sub2$NAME)[id,]

# add species names without authors to dataframe
dat_sub2$Sp <- sp

# there are some erroneous values that should be discarded:
dat_sub3 <- dat_sub2[dat_sub2$date_long < 2010,]
str(dat_sub3)

# get max and min year at which species were recorded
y_min <- aggregate(. ~ Sp, min, data = dat_sub3[,c("Sp","date_long")])
y_max <- aggregate(. ~ Sp, max, data = dat_sub3[,c("Sp","date_long")])

# plot each species first and last record in line plot, data:
head(pldat <- data.frame(Sp = y_min[,1], y_min = y_min[,2],
y_max = y_max[,2], span = y_max[,2] - y_min[,2]))

# plot:
# example("segments")

# new ordering for plot:
pldat <- pldat[order(pldat$y_min, pldat$span),]
plot(x = c(min(pldat$y_min), max(pldat$y_max)), y = c(1, nrow(pldat)),
type = 'n', xlab = "Year", ylab = "", axes = FALSE, frame.plot = FALSE)

segments(x0 = pldat$y_min, x1 = pldat$y_max, y0 = 1:nrow(pldat),
y1 = 1:nrow(pldat), col = "gray60")

axis(1, pretty(1800:2000), cex.axis = 0.75)
mtext(paste("Species Records Innsbruck\n", " (Sp - N = ", nrow(pldat), ")",
sep = ""), side = 2, line = -2)
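
The first/last-year table behind the plot can be checked on a small made-up data frame. In this sketch only the column names `Sp` and `date_long` follow the script; the records and the object names `rec`, `first`, `last` and `yrs` are hypothetical. It joins the two aggregates with `merge()` on the species name, as an alternative to combining them by row position.

```r
# Made-up records; column names follow the script (Sp, date_long)
rec <- data.frame(
  Sp = c("Abies alba", "Abies alba", "Carex flava", "Carex flava", "Carex flava"),
  date_long = c(1950, 1870, 1965, 1990, 1912)
)

# Earliest and latest year per species
first <- aggregate(date_long ~ Sp, data = rec, FUN = min)
last  <- aggregate(date_long ~ Sp, data = rec, FUN = max)

# Join on the species name instead of relying on identical row order
yrs <- merge(first, last, by = "Sp", suffixes = c("_min", "_max"))
yrs$span <- yrs$date_long_max - yrs$date_long_min
print(yrs)
```

Both approaches give the same table as long as the two `aggregate()` results list the species in the same order; the explicit join simply does not depend on that.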


# especially between 1960 and 1980 many species were
# re-recorded or newly added by only a few authors:

sixt_eight <- dat_sub3[dat_sub3$date_long > 1960 & dat_sub3$date_long < 1980, c("AUTOR", "Sp")]

data.frame(table(as.character(sixt_eight$AUTOR)))
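
A side note on filtering rows by pattern, as in the hybrid step near the top of the script: negative indexing with `grep()` and logical indexing with `grepl()` behave differently when the pattern matches nothing. The sketch below uses made-up names to show the difference.

```r
# Made-up names without any hybrid among them
nm <- c("Abies alba", "Carex flava", "Poa annua")

# grep() returns integer(0) when nothing matches, and x[-integer(0)] is empty,
# so every name is dropped instead of none
dropped_all <- nm[-grep(" x ", nm)]

# grepl() returns one TRUE/FALSE per element, so negation keeps all names
kept_all <- nm[!grepl(" x ", nm)]

print(length(dropped_all))
print(length(kept_all))
```

With data that does contain hybrids, both forms return the same rows; the difference only shows up in the no-match case.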

To cite package ‘foreign’ in publications use:
R-core members, Saikat DebRoy, Roger Bivand and others: see COPYRIGHTS file in the sources. (2011). foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, .... R package version 0.8-44.
http://CRAN.R-project.org/package=foreign

1 comment :

  1. Very nice. I was overwhelmed when 'grepping' and your example helped. Thank you!
