29 Aug 2011

Comparing Two Distributions

Here I compare two distributions: the flowering duration of indigenous and allochthonous plant species. The hypothesis is that alien plant species exhibit longer flowering periods than indigenous ones.

download data*

*Data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.



## comparing flowering time in indigenous and alien species by
## use of a quantile-quantile plot (qqplot), the KS-test, by
## testing for a shift in location (Wilcoxon rank-sum test), by
## testing for a difference of means (t-test; as we deal with an
## integer and not a continuous variable it is not the most
## appropriate choice) as well as by a chi-square test:

dat <- read.csv("E:\\R\\Data\\flowering_alien_vs_indigen.csv",
                sep = ";")

library(lattice)

histogram(~ Flowering|Status, data = dat, col = "gray60", layout = c(1, 2),
          xlab = list("Months of flowering"),
          ylab = list("Percentage of total"),
          scales = list(y = list(alternating = F)),
          strip = strip.custom(factor.levels = c("alien", "indigenous")))

qqplot(dat$Flowering[dat$Status == "indigen"],
       dat$Flowering[dat$Status == "Neophyt"])
abline(a = 0, b = 1, lty = 3)

ks.test(dat$Flowering[dat$Status == "indigen"],
        dat$Flowering[dat$Status == "Neophyt"])

wilcox.test(Flowering ~ Status, data = dat)

t.test(Flowering ~ Status, data = dat)

## Note that in the two-sample case the estimator for the
## difference in location parameters does not estimate
## the difference in medians (a common misconception) but
## rather the median of the difference between a sample
## from x and a sample from y.
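
## a quick illustration of the note above, using simulated,
## continuous data (so that the exact test is used): the estimate
## reported by wilcox.test() with conf.int = TRUE is the median of
## all pairwise differences (the Hodges-Lehmann estimator), not the
## difference of the sample medians:

set.seed(1)
x <- rnorm(30)
y <- rnorm(30, 0.5)
wilcox.test(x, y, conf.int = TRUE)$estimate  # Hodges-Lehmann estimate
median(outer(x, y, "-"))                     # median of all pairwise differences
median(x) - median(y)                        # difference of medians (in general not the same)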

## as we deal with a limited number of classes (1-12 months)
## and the sample size is big enough, my favourite would be a
## chi-square test:

m <- table(dat$Status, dat$Flowering)

(Xsq <- chisq.test(m))  # Prints test summary
Xsq$observed   # observed counts (same as m)
Xsq$expected   # expected counts under the null
Xsq$residuals  # Pearson residuals
Xsq$stdres     # standardized residuals
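
## a quick cross-check: the expected counts are just the products
## of the row and column margins divided by the grand total, and a
## shaded mosaic plot visualises the Pearson residuals:

outer(rowSums(m), colSums(m)) / sum(m)  # reproduces Xsq$expected
mosaicplot(m, shade = TRUE)             # cells shaded by residual size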

5 comments:

  1. Thank you for posting this! I have a similar type of data set and these comments/codes were vastly helpful.
    -JW, California, USA

  2. Hi,

    I am unable to access the data you uploaded. Can you check your link please?

  3. Hello, and thank you for your post, it's been very helpful.

    I'm also trying to compare two frequency distributions (monkey subgroup-size during dry "S_12" vs. rainy "LL_13" season):

    n <- table(sg$sgs, sg$temp) #Contingency table for subgroup size (sgs) per season (temp)

    sgs  S_12  LL_13
      1   182    185
      2   230    262
      3   190    229
      4   190    162
      5    94    130
      6    79     73
      7    56    113
      8    30     46
      9    34     37
     10     7     24
     11     7     24
     12     4     18
     13     2      9
     14     1      4
    I basically want to see if subgroup size frequencies differed between seasons. After running a chi-squared test using your approach, I got a significant difference:

    Xsq <- chisq.test(n)
    Xsq

    Pearson's Chi-squared test
    data: n
    X-squared = 52.7027, df = 13, p-value = 1.018e-06

    However, I'm unclear on something and I would appreciate any insight you can share with me. From what I understand, when we use the chi-squared test to compare two observed distributions, one of them serves as the "expected" one and is then compared to the other, "observed" one. But if this is the case, why do we get a separate set of expected values in the summary of the test?

    This is what I get for expected values:

    Xsq$expected # expected counts under the null

    sgs        S_12       LL_13
      1  167.589595  199.410405
      2  224.670520  267.329480
      3  191.335260  227.664740
      4  160.739884  191.260116
      5  102.289017  121.710983
      6   69.410405   82.589595
      7   77.173410   91.826590
      8   34.705202   41.294798
      9   32.421965   38.578035
     10   14.156069   16.843931
     11   14.156069   16.843931
     12   10.046243   11.953757
     13    5.023121    5.976879
     14    2.283237    2.716763

    I'm worried that what R is actually doing is comparing both sets of distributions with a model based on the default p value for the function (p = rep(1/length(x), length(x))). Forgive me if this is a dumb question, but I've been going through the "?chisq.test" help file and other references and still can't clear this up in my head, so any help or guidance will be appreciated.

    Replies
    1. sgs S_12 LL_13
      1 182 185
      2 230 262
      3 190 229
      4 190 162
      5 94 130
      6 79 73
      7 56 113
      8 30 46
      9 34 37
      10 7 24
      11 7 24
      12 4 18
      13 2 9
      14 1 4

      # I reconstructed your dataset here (best for sharing data is dput(), which saves us much headache ;)

      a <- readClipboard()
      b <- do.call(cbind, sapply(a[2:15], FUN = function(x) strsplit(x, " ") ))

      S_12_sgs <- rep(1:14, as.numeric(b[2,]))
      LL_13_sgs <- rep(1:14, as.numeric(b[3,]))
      temp <- c(rep("LL_13", length(LL_13_sgs)), rep("S_12", length(S_12_sgs)))

      sg <- data.frame(temp=temp, sgs=c(LL_13_sgs, S_12_sgs))

      (n <- table(sg$sgs, sg$temp))

      Xsq <- chisq.test(n)

      # Here's how you get the expected frequencies under the null
      # hypothesis, which assumes that both sets of frequencies were
      # collected from data with the same distribution!
      # For this you simply take the row and column sums and compute
      # the probabilities and expected frequencies

      sgs_margins <- margin.table(n, 1)
      LL_13_sum <- margin.table(n, 2)[1]
      S_12_sum <- margin.table(n, 2)[2]
      total_sum <- margin.table(n)

      (sgs_margins / total_sum * LL_13_sum / total_sum) * total_sum
      (sgs_margins / total_sum * S_12_sum / total_sum) * total_sum

      # the same can be grabbed from our Xsq-object
      Xsq$expected

      # the "observed" frequencies are merely the ones collected
      Xsq$observed

      # for details on the Chi-squared test and p-values just look up any basic statistics literature
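
      # regarding the p = rep(1/length(x), length(x)) default you mention:
      # it is only used when chisq.test() is given a single vector of
      # counts (a goodness-of-fit test). With a matrix like n the p
      # argument is ignored and a test of independence is run, with the
      # expected counts taken from the margins as shown above.
      # For illustration, a goodness-of-fit test of the dry-season counts
      # against equal probabilities for the 14 subgroup sizes would be:

      chisq.test(n[, "S_12"])  # here p = rep(1/14, 14) is actually used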
