R for Biologists
Version 1.1 – December 2009

Marco Martinez Supported by the University of Tennessee, Knoxville and the National Science Foundation through Cooperative Agreement EF-0832858 to the University of Tennessee.

Preface

This material is intended as an introductory guide to data analysis with the R system, to assist in statistical computing training for life science researchers. It was produced as companion material for a seminar (R Tutorial for Life Sciences) given at The University of Tennessee in the spring of 2009, sponsored by The National Institute for Mathematical and Biological Synthesis (NIMBioS), for graduate students in different biological areas. It is intended as a basic introduction to R for researchers and students visiting NIMBioS; many more advanced guides are available through the R web site and in various books.

The principal aim is to provide a step-by-step guide to using R to carry out statistical analyses and techniques widely used in the life sciences. In each section we give a detailed explanation of a command in R, followed by a biological example with all the instructions (in red) needed to run the test and the corresponding output from R (in blue). In several sections we leave some questions or additional analyses as exercises, and at the end of some sections we list other R commands related to the topics covered. We assume some previous knowledge of statistics and experimental design, essentially corresponding to a basic undergraduate introductory statistics course. These notes were written to take advantage of R version 2.8.1 or later, under a Windows operating system.

This is version 1.1 of these notes, generated in May 2009 and edited for style and content by NIMBioS Director Louis Gross in December 2009. This document is available for download from the NIMBioS.org site and is provided free of charge with no warranty for its use. It is not to be modified from this form without explicit authorization from the author.

Marco Martinez
Graduate Student - Mathematical Ecology
Department of Mathematics
The University of Tennessee - Knoxville
[email protected]
December 2009


Contents

1. An introduction to R
  1.1 What is R?
  1.2 How to install R?
    1.2.1 The base system
    1.2.2 Packages
  1.3 A sample session
  1.4 How to get help
  1.5 Documentation
    1.5.1 Free documentation
    1.5.2 Books
  1.6 Data import
2. Descriptive statistics
3. One and two sample tests
  3.1 One sample test
  3.2 Two sample tests
    3.2.1 t test, unequal variance
    3.2.2 Paired t test
4. Single factor analysis of variance
  4.1 Parametric analysis of variance (ANOVA)
  4.2 Assumptions
    4.2.1 Normality
    4.2.2 Homogeneity of variances (homoscedasticity)
  4.3 Nonparametric analysis (Kruskal-Wallis)
5. Multiple comparison tests
  5.1 Tukey test
  5.2 Least significant difference (LSD) test
6. Other analysis of variance
  6.1 Randomized block design
    6.1.1 Parametric analysis of variance
    6.1.2 Nonparametric analysis of variance (Friedman)
  6.2 Factorial structure
7. Regression and correlation
  7.1 Regression
  7.2 Correlation
Bibliography



1. An introduction to R

1.1 What is R?

R is a statistical computer program made available through the Internet under the General Public License (GPL). That is, it is supplied with a license that allows you to use it freely, distribute it, or even sell it, as long as the receiver has the same rights and the source code is freely available. It is available for Microsoft Windows XP or later, for a variety of Unix and Linux platforms, and for Apple Macintosh OS X (Dalgaard, 2002).

R is an integrated suite of software facilities for data manipulation, calculation and graphical display (Venables et al. 2009). There is a difference in philosophy between R and some other statistical software: in R a statistical analysis is normally done as a series of steps, with intermediate results being stored as objects. Thus whereas SAS and SPSS will give copious output from a regression, R will give minimal output and store the results in an object (a statistical “fit”) for subsequent interrogation by further R functions (Venables et al. 2009).

1.2 How to install R?

These instructions are given for R version 2.8.1.

1.2.1 The base system

Here we give detailed instructions to download R. Please be aware that new versions may be released with some differences – the main R web page has details on new versions.

1. Go to the R web page: http://www.r-project.org/
2. On the left side of the page, find CRAN and click there.


3. Now you can choose a mirror site to download the program. You can choose any mirror - here we illustrate using the first one in the USA, University of California, Berkeley.

4. Choose your operating system (Windows in our example), then click on “base”.


5. Finally, download the file R-2.8.1-win32.exe and run it to install R.

1.2.2 Packages

An R installation contains one or more libraries of packages. Some of these packages are part of the base installation. Others can be downloaded from CRAN (see Appendix A), which currently hosts over 1000 packages for various purposes (Dalgaard, 2002). A package can contain functions written in the R language, and data sets. Most packages implement functionality that users will probably not need to have loaded all the time (Dalgaard, 2002). Once the base installation is finished, you can install packages as follows: open R and, from the R window, go to the menu Packages.


Then select Install package(s), choose a mirror, and select the package(s) that you need to install. This process needs to be done only once for each package.

To use a package, you must load it: go to the menu Packages, select Load package and choose the package(s) that you need. This has to be done each time you open a new session and wish to use a function from a package that is not in the base system.

1.3 A sample session

Before starting this section, be sure that you have a working installation of R. When you open R you should see the console window.

R works fundamentally on a question-and-answer model: you enter a line with a command and press Enter. The program does something, prints the result if relevant, and asks for more input. When R is ready for input, it prints its prompt, a “>” symbol (Dalgaard, 2002). One of the simplest possible tasks in R is to enter an arithmetic expression and receive a result (Dalgaard, 2002). Throughout these notes, lines in red (Courier New font) are the inputs or instructions that we type, and lines in blue (Courier New font) are the outputs or answers from R.

> 3+2
[1] 5


We can also perform other standard arithmetic calculations:

> 4^2
[1] 16


> sqrt(36)
[1] 6
> pi
[1] 3.141593
> exp(1)
[1] 2.718282

The number in brackets is the index of the first number on that line (Dalgaard, 2002). Consider the case of generating the sequence of integers from 50 to 100:

> 50:100
 [1]  50  51  52  53  54  55  56  57  58  59  60  61  62  63
[15]  64  65  66  67  68  69  70  71  72  73  74  75  76  77
[29]  78  79  80  81  82  83  84  85  86  87  88  89  90  91
[43]  92  93  94  95  96  97  98  99 100

Here [15] indicates that 64 is the fifteenth element in the vector of output from this command.

One of the most common procedures in R is to store numbers or results. R, like other computer languages, has symbolic variables: names that can be used to represent values (Dalgaard, 2002). To assign the value 10 to the variable a:

> a <- 10
> a/5
[1] 2
> a+2
[1] 12

R allows overwriting variables, without providing any warning that you are redefining a variable that had previously been assigned a value.

> a <- 234
> a
[1] 234
> a <- 456.43
> a
[1] 456.43

Technically, R is an expression language with a very simple syntax. It is case sensitive, so A and a are different symbols and would refer to different variables. The set of symbols which can be used in R names depends on the operating system. Normally all alphanumeric symbols are allowed plus ‘.’ and ‘_’ (Venables et al. 2009). There is, however, the limitation that a name must not start with a digit or with a period followed by a digit (Dalgaard, 2002). Since we have assigned a value only to a, asking for A gives an error:

> A
Error: object "A" not found

To check the homogeneity of variances assumption graphically, we can plot the residuals of the fitted model against the fitted values:

> plot(exit_pigs$fitted.values, exit_pigs$residuals, main =
+ "Residuals vs Fitted", xlab="Fitted", ylab="Residuals")

Constancy of the residual variance is shown in these plots by about the same spread of the residuals around zero for each factor level or treatment (Kutner et al. 2005). In this particular case the graph indicates that we have homogeneity of variances.

Other commands in R: levene.test (package car)

4.3 Nonparametric Analysis (Kruskal-Wallis)

If a set of data is collected according to a completely randomized design, it is possible to test nonparametrically for differences among groups. This may be done by the Kruskal-Wallis test, often called an analysis of variance by ranks. This test may be used in situations where the parametric single-factor ANOVA is not applicable (Zar, 1999).

Command in R: kruskal.test(formula, data)
formula: Formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
data: Data frame containing the variables in the formula.


Example (Zar 1999, p. 197)
An entomologist is studying the vertical distribution of a fly species in a deciduous forest and obtains five collections of the flies in three different vegetation layers: herb, shrub and tree. The entomologist wants to test the H0: the abundance of the flies is the same in all three vegetation layers.

Herbs   14    12.1  9.6   8.2   10.2
Shrubs  8.4   5.1   5.5   6.6   6.3
Trees   6.9   7.3   5.8   4.1   5.4

The data have to be in a txt file in the following way:

abundance  layers
14         Herbs
12.1       Herbs
…          …
4.1        Trees
5.4        Trees

We create this file in our working directory with the name flies (see section 1.6):

> flies <- read.table("flies.txt", header=TRUE)
> bartlett.test(abundance~layers, data=flies)
Bartlett test of homogeneity of variances
data: abundance by layers
Bartlett's K-squared = 1.7057, df = 2, p-value = 0.4262

The Bartlett test does not reject the hypothesis of homogeneous variances here; nevertheless, to illustrate the nonparametric alternative on these data, we perform a Kruskal-Wallis test.

> kruskal.test(abundance~layers, data=flies)
Kruskal-Wallis rank sum test
data: abundance by layers
Kruskal-Wallis chi-squared = 8.72, df = 2, p-value = 0.01278

We can conclude that the abundance of the flies differs among the three layers of vegetation.
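As a quick visual check of the group differences (a sketch not in the original text), side-by-side boxplots can be drawn with base R's boxplot(); here the data frame is rebuilt inline from the table above instead of being read from the flies file:

```r
# Rebuild the fly-abundance data from the table above
flies <- data.frame(
  abundance = c(14, 12.1, 9.6, 8.2, 10.2,   # Herbs
                8.4, 5.1, 5.5, 6.6, 6.3,    # Shrubs
                6.9, 7.3, 5.8, 4.1, 5.4),   # Trees
  layers = rep(c("Herbs", "Shrubs", "Trees"), each = 5)
)

# One box per vegetation layer; the medians echo the
# Kruskal-Wallis result (Herbs clearly above Trees)
boxplot(abundance ~ layers, data = flies,
        xlab = "Vegetation layer", ylab = "Abundance")
```

The same formula interface (response ~ groups) used by kruskal.test works here, which makes this a convenient companion plot for any of the group comparisons in this section.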


5. Multiple Comparison Tests

Suppose that in conducting an analysis of variance the null hypothesis is rejected. There are then differences among the treatment means, but exactly which means differ is not specified. In this situation, further comparison and analysis among groups of treatment means may be useful. The procedures for making these comparisons are usually called multiple comparison methods (Montgomery, 2001). In general these methods consider the null hypothesis H0: µA = µB versus the alternative HA: µA ≠ µB, where the subscripts denote any possible pair of groups (Zar, 1999).

5.1 Tukey test

A much-used multiple comparison procedure is the Tukey test, also known as the honestly significant difference test (Zar, 1999).

Command in R: HSD.test(y, trt, DFerror, MSerror, alpha = 0.05, group=TRUE)
y: Response variable
trt: Treatments
DFerror: Degrees of freedom of the residuals. Taken from the ANOVA table.
MSerror: Mean square error of the residuals. Taken from the ANOVA table.
alpha: Significance level.
group: TRUE or FALSE. Always use TRUE.
For this command we need the package agricolae (see section 1.2.2).

Example (Quinn & Keough 2002, p. 174)
Medley & Clements (1998) sampled a number of stations (between four and seven) on six streams known to be polluted by heavy metals in the Rocky Mountain region of Colorado, USA. They recorded zinc concentration, species richness and species diversity of the diatom community, and the proportion of diatom cells that were the early-successional species Achnanthes minutissima. The first analysis compares mean diatom species diversity (response variable) across the four zinc-level groups (categorical predictor variable), with zinc level treated as a fixed factor. The H0 was no difference in mean diatom species diversity between zinc-level groups.

ZINC  DIV    ZINC  DIV    ZINC  DIV    ZINC  DIV    ZINC  DIV
BACK  2.27   MED   2.19   LOW   1.83   MED   1.75   HIGH  1.04
HIGH  1.25   MED   2.1    LOW   1.88   LOW   2.83   LOW   2.18
HIGH  1.15   BACK  2.2    MED   2.02   BACK  1.53   BACK  1.89
MED   1.62   MED   2.06   MED   1.94   BACK  0.76   HIGH  1.37
BACK  1.7    HIGH  1.9    LOW   2.1    MED   0.8    LOW   1.4
HIGH  0.63   HIGH  1.88   LOW   2.38   LOW   1.66   BACK  1.98
BACK  2.05   HIGH  0.85   HIGH  1.43   MED   0.98

The data have to be in a txt file in the following way:

ZINC  DIV
BACK  2.27
HIGH  1.25
…     …
LOW   1.66
MED   0.98

We create this file in our working directory with the name streams (see section 1.6).

> streams <- read.table("streams.txt", header=TRUE)
> exit_streams <- aov(DIV~ZINC, data=streams)
> summary(exit_streams)
            Df Sum Sq Mean Sq F value  Pr(>F)
ZINC         3 2.5666  0.8555  3.9387 0.01756 *
Residuals   30 6.5164  0.2172
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can reject H0. Now, to see the differences between the levels of zinc, we perform a Tukey test.

> HSD.test(streams$DIV, streams$ZINC, 30, 0.2172, group=T)
HSD Test for streams$DIV
......
Alpha                                0.050000
Error Degrees of Freedom            30.000000
Error Mean Square                    0.217200
Critical Value of Studentized Range  3.845401

Treatment Means
  streams.ZINC streams.DIV   std.err replication
1         BACK    1.797500 0.1715658           8
2         HIGH    1.277778 0.1422906           9
3          LOW    2.032500 0.1573298           8
4          MED    1.717778 0.1676701           9

Honestly Significant Difference 0.6157647
Harmonic Mean of Cell Sizes 8.470588
Different HSD for each comparison
Means with the same letter are not significantly different.

Groups, Treatments and means
a    LOW   2.0325
ab   BACK  1.7975
ab   MED   1.717778
b    HIGH  1.277778

   trt    means  M        N   std.err
1  LOW  2.032500  a 8.470588 0.1573298
2 BACK  1.797500 ab 8.470588 0.1715658
3  MED  1.717778 ab 8.470588 0.1676701
4 HIGH  1.277778  b 8.470588 0.1422906

The only H0 to be rejected is that of no difference in diatom diversity between sites with low zinc and sites with high zinc (Samuels & Witmer, 2003). We leave the following as an exercise: check the assumptions (normality and homoscedasticity) of the model, and perform the Tukey test for the example in section 4.1.


5.2 Least significant difference (LSD) test

Command in R: LSD.test(y, trt, DFerror, MSerror, alpha = 0.05, group=TRUE)
y: Response variable
trt: Treatments
DFerror: Degrees of freedom of the residuals. Taken from the ANOVA table.
MSerror: Mean square error of the residuals. Taken from the ANOVA table.
alpha: Significance level.
group: TRUE or FALSE. Always use TRUE.
For this command we need the package agricolae (see section 1.2.2).

Example (Zar 1999, p. 210)
Researchers want to produce an ANOVA table and perform a multiple comparison test for the following data, which are strontium concentrations (mg/ml) in five different bodies of water:

Grayson's Pond   28.2  33.2  36.4  34.6  29.1  31
Beaver Lake      39.6  40.8  37.9  37.1  43.6  42.4
Angler's Cove    46.3  42.1  43.5  48.8  43.7  40.1
Appletree Lake   41    44.1  46.4  40.2  38.6  36.3
Rock River       56.3  54.1  59.4  62.7  60    57.3

The data have to be in a txt file in the following way:

strontium  bodies
28.2       Grayson
33.2       Grayson
…          …
60         Rock
57.3       Rock

We create this file in our working directory with the name water (see section 1.6).

> water <- read.table("water.txt", header=TRUE)
> exit_water <- aov(strontium~bodies, data=water)
> summary(exit_water)
            Df  Sum Sq Mean Sq F value    Pr(>F)
bodies       4 2193.44  548.36  56.155 3.948e-12 ***
Residuals   25  244.13    9.77
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Because a significant p value resulted from the analysis of variance, the LSD test is now applied to the means.

> LSD.test(water$strontium, water$bodies, 25, 9.77, group=T)
LSD t Test for water$strontium
......


Alpha                     0.050000
Error Degrees of Freedom 25.000000
Error Mean Square         9.770000
Critical Value of t       2.059539

Treatment Means
  water.bodies water.strontium  std.err replication
1       Angler        44.08333 1.257621           6
2        Apple        41.10000 1.496663           6
3       Beaver        40.23333 1.033011           6
4      Grayson        32.08333 1.308540           6
5         Rock        58.30000 1.239624           6

Least Significant Difference 3.716692
Means with the same letter are not significantly different.

Groups, Treatments and means
a    Rock     58.3
b    Angler   44.08333
bc   Apple    41.1
c    Beaver   40.23333
d    Grayson  32.08333

      trt    means  M N  std.err
1    Rock 58.30000  a 6 1.239624
2  Angler 44.08333  b 6 1.257621
3   Apple 41.10000 bc 6 1.496663
4  Beaver 40.23333  c 6 1.033011
5 Grayson 32.08333  d 6 1.308540

We leave the following as an exercise: give the appropriate conclusions from this analysis, check the assumptions (normality and homoscedasticity) of the model, and perform the Tukey test for the example in this section.

Other commands in R: TukeyHSD, waller.test (package agricolae)
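As a sketch of the base-R alternative just mentioned, TukeyHSD() needs no extra package and works directly on an aov fit. Applied to the strontium data (re-entered inline here, rather than read from the water file), it returns, for every pair of water bodies, the mean difference with its adjusted p value and confidence interval:

```r
# Rebuild the strontium data from the table above
water <- data.frame(
  strontium = c(28.2, 33.2, 36.4, 34.6, 29.1, 31,    # Grayson's Pond
                39.6, 40.8, 37.9, 37.1, 43.6, 42.4,  # Beaver Lake
                46.3, 42.1, 43.5, 48.8, 43.7, 40.1,  # Angler's Cove
                41, 44.1, 46.4, 40.2, 38.6, 36.3,    # Appletree Lake
                56.3, 54.1, 59.4, 62.7, 60, 57.3),   # Rock River
  bodies = rep(c("Grayson", "Beaver", "Angler", "Apple", "Rock"),
               each = 6)
)

# Fit the one-way ANOVA, then get all pairwise comparisons
# with family-wise (Tukey-adjusted) confidence intervals
exit_water <- aov(strontium ~ bodies, data = water)
TukeyHSD(exit_water)
```

Unlike HSD.test, TukeyHSD reports adjusted p values for each pair rather than letter groupings, but the conclusions should agree.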


6. Other Analysis of Variance

6.1 Randomized block design

In any experiment, variability arising from a nuisance factor can affect the results. We define a nuisance factor as a design factor that probably has an effect on the response but whose effect we are not interested in. When the nuisance source of variability is known and controllable, a design technique called blocking can be used to systematically eliminate its effect on the statistical comparisons among treatments (Montgomery, 2001). The statistical term “block” is conceptually an extension of the term “pair” introduced in section 3.2.2 (Zar, 1999).

6.1.1 Parametric Analysis of variance

The randomized block ANOVA is an extension of the one-way ANOVA presented in section 4.1.

Command in R: aov(formula, data = NULL)
formula: A formula specifying the model, of the form a ~ b + c, where a gives the data values and b and c the corresponding groups and blocks, respectively.
data: A data frame in which the variables specified in the formula will be found.

Example (from Samuels & Witmer 2003, p. 487)
Researchers were interested in the effect that acid has on the growth rate of alfalfa plants. They created three treatment groups in an experiment: low acid, high acid and control. The response variable in their experiment was the average height of the alfalfa plants in a Styrofoam cup after five days of growth. The observational unit was a cup, rather than individual plants. They had 5 cups for each of the 3 treatments, for a total of 15 observations. However, the cups were arranged near a window and they wanted to account for the effect of differing amounts of sunlight. Thus they created 5 blocks and randomly assigned the 3 treatments within each block.
The data are given in the following table:

block   low    high   control
1       1.58   1.1    2.47
2       1.15   1.05   2.15
3       1.27   0.5    1.46
4       1.25   1      2.36
5       1      1.5    1

The data have to be in a txt file in the following way:

height  trt      block
2.47    control  block1
2.15    control  block2
1.46    control  block3
2.36    control  block4
1       control  block5
…       …        …
1.1     high     block1


1.05    high     block2
0.5     high     block3
1       high     block4
1.5     high     block5

We create this file in our working directory with the name alfalfa (see section 1.6).

> alfalfa <- read.table("alfalfa.txt", header=TRUE)
> exit_alfalfa <- aov(height~trt+block, data=alfalfa)
> summary(exit_alfalfa)
            Df  Sum Sq Mean Sq F value  Pr(>F)
trt          2 1.98601 0.99301  5.4709 0.03182 *
block        4 0.83963 0.20991  1.1565 0.39740
Residuals    8 1.45205 0.18151
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p value for the treatment (trt) is small, indicating that the differences between the three sample means are greater than would be expected by chance alone (Samuels & Witmer, 2003).

6.1.2 Nonparametric analysis of variance (Friedman)

Friedman's test is a nonparametric analysis that may be performed on a randomized block experimental design, and it is especially useful with data which do not meet the parametric analysis of variance assumptions of normality and homoscedasticity (Zar, 1999).

Command in R: friedman.test(formula, data)
formula: Formula of the form a ~ b | c, where a, b and c give the data values and the corresponding groups and blocks, respectively.
data: Data frame in which the variables specified in the formula will be found.

Example (from Zar 1999, p. 264)
We want to investigate the H0 that the mean weight gain of guinea pigs is the same on each of four specified diets. Each guinea pig is housed in a separate cage. A block consists of a group of four animals that we can be reasonably assured will experience identical environmental conditions (light, temperature, draft, noise, etc.). Each of the four animals in a given block is assigned at random to one of the four experimental diets, so that each animal in a block receives a different diet. The data (weight gains, in grams) are summarized in the following table.

                diets
Blocks    1      2      3      4
1         7      5.3    4.9    8.8
2         9.9    5.7    7.6    8.9
3         8.5    4.7    5.5    8.1
4         5.1    3.5    2.8    3.3
5        10.3    7.7    8.4    9.1


The data have to be in a txt file in the following way:

weight  diets  blocks
7       diet1  block1
9.9     diet1  block2
…       …      …
3.3     diet4  block4
9.1     diet4  block5

We create this file in our working directory with the name guinea (see section 1.6).

> guinea <- read.table("guinea.txt", header=TRUE)
> friedman.test(weight ~ diets | blocks, data=guinea)
Friedman rank sum test
data: weight and diets and blocks
Friedman chi-squared = 10.68, df = 3, p-value = 0.01359

Therefore, we reject H0.

6.2 Factorial structure

Many experiments involve the study of the effects of two or more factors. In general, factorial designs are most efficient for this type of experiment. By a factorial design, we mean that in each complete trial or replication of the experiment all possible combinations of the levels of the factors are investigated (Montgomery, 2001).

Command in R: aov(formula, data = NULL)
formula: A formula specifying the model, of the form a ~ b * c, where a gives the data values and b and c the corresponding factor 1 and factor 2, respectively.
data: A data frame in which the variables specified in the formula will be found.

Example (from Samuels & Witmer 2003, p. 6)
Before new drugs are given to human subjects, it is common practice to test them first in dogs or other animals. In part of one study, a new drug under investigation was given to 4 male and 4 female dogs, at doses of 8 mg/kg and 25 mg/kg. Alkaline phosphatase level (measured in U/Li) was measured from blood samples in order to screen for toxicity problems in dogs before starting with humans. The design of this experiment allows for the investigation of the interaction of two factors: sex of the dog and dose. Data are shown in the following table:

Dose    Male              Female
8       191 154 194 183   150 127 152 105
25       80  49  78  71   141 153 171 197


The data have to be in a txt file in the following way:

Sex  Dose    Level
M    dose8   191
M    dose8   154
M    dose8   194
M    dose8   183
…    …       …
F    dose25  141
F    dose25  153
F    dose25  171
F    dose25  197

We create this file in our working directory with the name dogs (see section 1.6).

> dogs <- read.table("dogs.txt", header=TRUE)
> exit_dogs <- aov(Level~Dose*Sex, data=dogs)
> summary(exit_dogs)
            Df  Sum Sq Mean Sq F value    Pr(>F)
Dose         1  6241.0  6241.0 15.4289  0.002006 **
Sex          1  2401.0  2401.0  5.9357  0.031367 *
Dose:Sex     1 20449.0 20449.0 50.5538 1.230e-05 ***
Residuals   12  4854.0   404.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From this analysis we can reject all three H0. The factors sex and dose interact in the following sense: for females the effect of increasing the dose from 8 to 25 was positive (the average increases), but for males the effect of increasing the dose from 8 to 25 was negative (the average decreases). We can see this in the interaction plot.

> interaction.plot(dogs$Dose, dogs$Sex, dogs$Level,
+ xlab="Dose", ylab="Level", trace.label="Sex")
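The direction of the interaction can also be checked numerically (a small sketch, not in the original text; the data are re-entered inline rather than read from the dogs file) by computing the four cell means with tapply():

```r
# Re-enter the alkaline phosphatase data from the table above
dogs <- data.frame(
  Level = c(191, 154, 194, 183,   # male, 8 mg/kg
             80,  49,  78,  71,   # male, 25 mg/kg
            150, 127, 152, 105,   # female, 8 mg/kg
            141, 153, 171, 197),  # female, 25 mg/kg
  Sex  = rep(c("M", "F"), each = 8),
  Dose = rep(rep(c("dose8", "dose25"), each = 4), times = 2)
)

# Mean level for each Sex x Dose cell: males drop from 180.5
# to 69.5 as the dose increases, females rise from 133.5 to 165.5
tapply(dogs$Level, list(dogs$Sex, dogs$Dose), mean)
```

These cell means are exactly what interaction.plot draws: non-parallel lines across doses for the two sexes correspond to the large Dose:Sex term in the ANOVA table.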


Example (from Samuels & Witmer 2003, p. 490)
A plant physiologist investigated the effect of mechanical stress on the growth of soybean plants. Individually potted seedlings were randomly allocated to four treatment groups of 13 seedlings each. Seedlings in two groups were stressed by shaking for 20 minutes twice daily, while two control groups were not stressed. Thus, the first factor in the experiment was presence or absence of stress. Also, plants were grown in either low or moderate light; thus, the second factor was the amount of light. This experiment is an example of a 2*2 factorial experiment.

Low                 Moderate
control  stress     control  stress
264      235        314      283
200      188        320      312
225      195        310      291
268      205        340      259
215      212        299      216
241      214        268      201
232      182        345      267
256      215        271      326
229      272        285      241
288      163        309      291
253      230        337      269
288      255        282      282
230      202        273      257

The data have to be in a txt file in the following way:

area  shaking  light
264   control  low
200   control  low
225   control  low
268   control  low
…     …        …
291   stress   moderate
269   stress   moderate
282   stress   moderate
257   stress   moderate

We create this file in our working directory with the name soybean (see section 1.6).

> soybean <- read.table("soybean.txt", header=TRUE)
> exit_soybean <- aov(area~shaking*light, data=soybean)
> summary(exit_soybean)
              Df Sum Sq Mean Sq F value    Pr(>F)
shaking        1  14858   14858 16.5954 0.0001725 ***
light          1  42752   42752 47.7490 1.010e-08 ***
shaking:light  1     26      26  0.0294 0.8645695
Residuals     48  42976     895
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> interaction.plot(soybean$shaking, soybean$light, soybean$area,
+ xlab="shaking", ylab="area", trace.label="light")

Conclusions from this example are left as an exercise.


7. Regression and correlation

7.1 Regression

In many problems there are two or more variables that are related, and it is of interest to model and explore this relationship. In general, suppose that there is a single dependent or response variable y that depends on k independent variables, for example x1, x2, …, xk. The relationship between these variables is characterized by a mathematical model called a regression model (Montgomery, 2001).

Command in R: lm(formula, data = NULL)
formula: A formula specifying the model.
data: A data frame in which the variables specified in the formula will be found.

Example (from Quinn & Keough 2002, p. 79)
Christensen et al. (1996) studied the relationships between coarse woody debris (CWD) and shoreline vegetation and lake development in a sample of 16 lakes in North America. The main variables of interest are the density of cabins (no. km-1), density of riparian trees (trees km-1), the basal area of riparian trees (m2 km-1), density of coarse woody debris (no. km-1) and basal area of coarse woody debris (m2 km-1). The researchers are interested in fitting a linear regression model of CWD basal area against riparian tree density (RTD).

LAKE        CWD  RTD     LAKE        CWD  RTD     LAKE        CWD  RTD
Bay         121  1270    Palmer       65  1330    Lake_hills   97   976
Bergner      41  1210    Street       52   964    Towanda       1   771
Crampton    183  1800    Laura        12   961    Black oak     4   833
Long        130  1875    Annabelle    46  1400    Johnson       1   883
Roach       127  1300    Joyce        54  1280    Arrowhead     4   956
Tenderfoot  134  2150

> CWD <- c(121, 41, 183, 130, 127, 134, 65, 52, 12, 46, 54, 97, 1, 4, 1, 4)
> RTD <- c(1270, 1210, 1800, 1875, 1300, 2150, 1330, 964, 961, 1400, 1280, 976, 771, 833, 883, 956)
> exit_lakes <- lm(CWD~RTD)
> summary(exit_lakes)
Call:
lm(formula = CWD ~ RTD)

Residuals:
   Min     1Q Median     3Q    Max
-38.62 -22.41 -13.33  26.15  61.36

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -77.09908   30.60801  -2.519 0.024552 *
RTD           0.11552    0.02343   4.930 0.000222 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36.32 on 14 degrees of freedom
Multiple R-squared: 0.6345, Adjusted R-squared: 0.6084
F-statistic: 24.3 on 1 and 14 DF, p-value: 0.0002216
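The fitted model can also be asked for predictions directly with predict() (a sketch not in the original text; the data vectors are re-entered here from the table above):

```r
# Re-enter the lake data from the table above
CWD <- c(121, 41, 183, 130, 127, 134, 65, 52, 12, 46, 54, 97, 1, 4, 1, 4)
RTD <- c(1270, 1210, 1800, 1875, 1300, 2150, 1330, 964, 961,
         1400, 1280, 976, 771, 833, 883, 956)

exit_lakes <- lm(CWD ~ RTD)

# Predicted CWD basal area for a new lake with 1500 riparian
# trees per km; equivalent to -77.09908 + 0.11552 * 1500
predict(exit_lakes, newdata = data.frame(RTD = 1500))
```

This should agree with the hand calculation in the text up to the rounding of the coefficients.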

The t test and the ANOVA F test cause us to reject the H0 that β1 equals zero. We would also reject the H0 that β0 equals zero, although this test is of little biological interest. The r2 value (0.634) indicates that we can explain about 63% of the total variation in CWD basal area by the linear regression with riparian tree density. We can predict CWD basal area for a new lake with 1500 trees km-1 in the riparian zone. Plugging 1500 into our fitted regression model: CWD basal area = -77.099 + 0.116*1500 The predicted basal area of CWD is 96.901 m2 km-1 (Quinn & Keough 2002). 7.2 Correlation In many kinds of biological data, the relationship between two (or more variables) is not of clearly of strict dependence between the variables. In such cases, the magnitude of one of the variables changes as the magnitude of the second variable changes, but it is not reasonable to consider there to be an independent variable and a dependent variable whch is causally related to it. In such situations, correlation, rather than regression, analysis is called for (Zar, 1999). Command in R: cor.test(x, y, method = c("pearson", "kendall", "spearman"), conf.level = 0.95) x, y: Numeric vectors of data values. x and y must have the same length. method: Character string indicating which correlation coefficient is to be used for the test. One of "pearson", "kendall", or "spearman", can be abbreviated. Default Pearson. conf.level: confidence level for the returned confidence interval. Default 0.95. Example (from Quinn & Keough 2002, p.73) Green (1997) studied the ecology of red land crabs on Christmas Island and examined the relationship between the total biomass of red land crabs and the density of their burrows within 25 m2 quadrats (sampling units) at five forested sites on the island. We will look at two of these sites: there were ten quadrats at Lower Site (LS) and eight quadrats at Drumsite (DS). 
Pearson’s correlation coefficient was considered appropriate for these data, although more robust correlations were calculated for comparison.

DS                      LS
TOTMASS  BURROWS        TOTMASS  BURROWS
2.15     39             4.36     38
2.27     38             4.01     37
4.31     61             3.33     27
2.58     79             2.63     18
3.23     35             4.46     41
1.83     39             3.96     33
1.54     45             4.18     40
2.00     28             4.21     29
                        2.54     25
                        4.29     38

> BURROWS <- c(39, 38, 61, 79, 35, 39, 45, 28)
> TOTMASS <- c(2.15, 2.27, 4.31, 2.58, 3.23, 1.83, 1.54, 2.00)
> cor.test(BURROWS, TOTMASS)

        Pearson's product-moment correlation

data:  BURROWS and TOTMASS
t = 1.0428, df = 6, p-value = 0.3372
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4322803  0.8592175
sample estimates:
      cor
0.3917155

> cor.test(BURROWS, TOTMASS, method=c("spearman"))

        Spearman's rank correlation rho

data:  BURROWS and TOTMASS
S = 69.9159, p-value = 0.6915
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.1676677

> cor.test(BURROWS, TOTMASS, method=c("kendall"))

        Kendall's rank correlation tau

data:  BURROWS and TOTMASS
z = 0.1247, p-value = 0.9008
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau
0.03636965

The H0 of no linear relationship between total crab biomass and number of burrows at DS could not be rejected. The same conclusion applies to the monotonic relationships measured by the Spearman and Kendall coefficients. So there was no evidence for any linear, or more general monotonic, relationship between burrow density and total crab biomass at site DS (Quinn & Keough 2002). We leave the analysis of site LS as an exercise.

Other commands in R: cor.
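When only the coefficient itself is needed, without the test statistic or confidence interval, cor() is the lighter-weight alternative to cor.test(). A short sketch using the DS data from the table above:

```r
# DS data from the table above
TOTMASS <- c(2.15, 2.27, 4.31, 2.58, 3.23, 1.83, 1.54, 2.00)
BURROWS <- c(39, 38, 61, 79, 35, 39, 45, 28)

# Pearson correlation coefficient only (no test, no confidence interval)
cor(BURROWS, TOTMASS)  # 0.3917155, matching the cor.test() output above

# Rank-based alternatives
cor(BURROWS, TOTMASS, method = "spearman")
cor(BURROWS, TOTMASS, method = "kendall")

# cor() also accepts a matrix or data frame and returns a correlation
# matrix, which is convenient when screening many variables at once
cor(cbind(BURROWS, TOTMASS))
```

Note that cor() performs no hypothesis test; for a p-value and confidence interval, cor.test() is still required.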


Bibliography

Christensen, D.L., Herwig, B.R., Schindler, D.E. and Carpenter, S.R. 1996. Impacts of lakeshore residential development on coarse woody debris in north temperate lakes. Ecological Applications 6: 1143–1149.
Dalgaard, P. 2002. Introductory Statistics with R. Springer. USA. 267p.
Elgar, M.A., Allan, R.A. and Evans, T.A. 1996. Foraging strategies in orb-spinning spiders: ambient light and silk decorations in Argiope aetherea Walckenaer (Araneae: Araneoidea). Australian Journal of Ecology 21: 464–467.
Furness, R.W. and Bryant, D.M. 1996. Effect of wind on field metabolic rates of breeding Northern Fulmars. Ecology 77: 1181–1188.
Green, P.T. 1997. Red crabs in rain forest on Christmas Island, Indian Ocean: activity patterns, density and biomass. Journal of Tropical Ecology 13: 17–38.
Kutner, M., Nachtsheim, C., Neter, J. and Li, W. 2005. Applied Linear Statistical Models. Fifth edition. McGraw-Hill/Irwin. USA. 1396p.
Montgomery, D. 2001. Design and Analysis of Experiments. Fifth edition. John Wiley & Sons, Inc. USA. 684p.
Quinn, G. and Keough, M. 2002. Experimental Design and Data Analysis for Biologists. Cambridge University Press. USA. 557p.
Samuels, M. and Witmer, J. 2003. Statistics for the Life Sciences. Third edition. Pearson Education. USA. 724p.
Selvin, S. 2004. Biostatistics: How It Works. Pearson Education. USA. 394p.
Venables, W.N., Smith, D.M. and the R Development Core Team. 2009. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics, Version 2.9.0. URL http://cran.cnr.berkeley.edu/doc/manuals/R-intro.pdf
Verzani, J. 2002. Simple R – Using R for Introductory Statistics. URL http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
Zar, J. 1999. Biostatistical Analysis. Fourth edition. Prentice Hall. USA. 663p.
