Biostatistical analyses of RNA-seq data

Nathalie Vialaneix (MIAT, INRA), in collaboration with Ignacio Gonzàles and Annick Moisan

http://www.nathalievialaneix.eu

Toulouse, 11/12 February 2019

NV2 (INRA)

RNA-seq biostatistics

Outline

1. Exploratory analysis
   - Introduction
   - Experimental design
   - Data exploration and quality assessment
2. Normalization
   - Raw data filtering
   - Interpreting read counts
3. Differential expression analysis
   - Hypothesis testing and correction for multiple tests
   - Differential expression analysis for RNA-seq data
   - Interpreting and improving the analysis


A typical transcriptomic experiment


A typical RNA-seq experiment

1. collect samples
2. generate libraries of cDNA fragments
3. assign the material to one lane of one flow cell

Steps in RNAseq data analysis


Part I: Experimental design


Confounded effects: a simple example

Basic experiment: find differences between control and treated plants.

[Figure legend: control group plant, treated group plant]

A bad experimental design: grow all control group plants in one field (Field 1) and all treated group plants in another field (Field 2). It is bad because differences due to the field cannot be distinguished from differences due to the treatment ⇒ confounded effects.

A good experimental design: grow half of the control group plants (chosen at random) and half of the treated group plants in one field (Field 1), and the rest in the other field (Field 2). Differences due to the field and differences due to the treatment can then be estimated separately.
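The balanced random assignment described for the good design can be sketched as follows. This is only an illustration: the function name `assign_fields`, the plant labels and the seed are hypothetical and do not come from the slides.

```python
import random


def assign_fields(plants_by_group, fields=("Field 1", "Field 2"), seed=None):
    """Randomly split each treatment group evenly across the fields."""
    rng = random.Random(seed)
    layout = {field: [] for field in fields}
    for group, plants in plants_by_group.items():
        shuffled = list(plants)
        rng.shuffle(shuffled)
        # Deal the shuffled plants to the fields in turn, so that every field
        # receives (almost) the same number of plants from every group.
        for i, plant in enumerate(shuffled):
            layout[fields[i % len(fields)]].append((group, plant))
    return layout


# Hypothetical plant labels: 6 control plants and 6 treated plants.
plants = {
    "control": [f"C{i}" for i in range(1, 7)],
    "treated": [f"T{i}" for i in range(1, 7)],
}
layout = assign_fields(plants, seed=2019)
for field, content in layout.items():
    print(field, sorted(content))
```

Because each field contains both groups, the field and the treatment are no longer confounded.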

In summary, what is a good experimental design?

Experimental designs are usually not as simple as this example: they can include multiple experimental factors (day of experiment, flow cell, ...) and multiple covariates (sex, parents, ...).

⇒ The experimental design must be carefully thought out before starting the experiment, and confounded effects must be searched for systematically.

Effect & Variation

2 conditions, 2 genes whose expression distributions are compared [figure not reproduced]:

- first gene: different median levels between the two groups but large variance: the difference may not be significant
- second gene: different median levels between the two groups but very small variance: the difference may be significant
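To make the point concrete, here is a minimal sketch with made-up expression values: both genes show the same shift between groups, but the test statistic is very different. It uses a Welch t statistic on means as a stand-in for the comparison of levels shown on the slide; the values and the helper name `welch_t` are assumptions for illustration only.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch t statistic: mean difference divided by its standard error."""
    mean_diff = statistics.mean(group_b) - statistics.mean(group_a)
    std_err = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return mean_diff / std_err


# Hypothetical expression values: the shift between groups is +4 for both genes.
gene1_control = [2.0, 9.0, 5.0, 12.0, 7.0]       # large variance
gene1_treated = [6.0, 13.0, 9.0, 16.0, 11.0]
gene2_control = [6.8, 7.1, 7.0, 6.9, 7.2]        # very small variance
gene2_treated = [10.8, 11.1, 11.0, 10.9, 11.2]

print("gene 1:", welch_t(gene1_control, gene1_treated))
print("gene 2:", welch_t(gene2_control, gene2_treated))
```

The statistic is far larger for the second gene, although the difference in level is identical: the same effect is easier to detect when the within-group variation is small.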

Sources of variation in RNA-seq experiments

1. at the top layer: biological variation (i.e., individual differences due to, e.g., environmental or genetic factors)
2. at the middle layer: technical variation (library preparation effect)
3. at the bottom layer: technical variation (lane and flow cell effects)

lane effect < flow cell effect < library preparation effect ≪ biological effect

Advised policy for biological/technical replicates

- biological replicates: different biological samples, processed separately (advised: ≥ 3 per condition)
- technical replicates: same biological material but independent replications of the technical steps (starting from library preparation; not mandatory in most cases)

6 biological replicates are better than 3 biological replicates × 2 technical replicates! See [Liu et al., 2014] for further discussion.
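One way to see why, under a simple nested model that is an assumption of this sketch (not taken from the slides or from Liu et al.): each measurement is the condition mean plus a biological effect shared by all technical replicates of a sample, plus independent technical noise. The variance of the estimated condition mean is then var_bio / n_bio + var_tech / (n_bio × n_tech). The variance values below are hypothetical.

```python
def variance_of_condition_mean(n_bio, n_tech, var_bio, var_tech):
    """Variance of the average of n_bio * n_tech measurements of one condition.

    Assumed nested model: technical replicates of the same sample share one
    biological effect, so only the technical noise is averaged over n_tech.
    """
    return var_bio / n_bio + var_tech / (n_bio * n_tech)


# Hypothetical variances, with biological variation dominating.
var_bio, var_tech = 1.0, 0.2

six_bio = variance_of_condition_mean(6, 1, var_bio, var_tech)
three_by_two = variance_of_condition_mean(3, 2, var_bio, var_tech)
print("6 biological replicates:      ", six_bio)
print("3 biological x 2 technical:   ", three_by_two)
```

Both designs use 12 libraries per two-condition experiment, but technical replicates never reduce the biological term var_bio / n_bio, so the design with 6 biological replicates gives the more precise estimate whenever var_bio > 0.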

In summary: a simple two-condition experimental design for RNA-seq

Typical RNA-seq design: independent samples are sequenced in different lanes (the lane effect cannot be estimated, but within-sample comparison is preserved).

[Figure: RNA extraction, then library preparation, then sequencing in Lane 1 / Lane 2]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate (C = control, T = treated; first index = biological replicate, second index = technical replicate):

             Lane 1   Lane 2   Lane 3   Lane 4
Flow cell 1  C11      C21      T11      T21
Flow cell 2  C12      C22      T12      T22
Flow cell 3  C13      C23      T13      T23
Flow cell 4  C14      C24      T14      T24

Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane.

[Figure: RNA extraction, then library preparation and barcoding (barcoded adapters), then pooling, then sequencing in Lane 1 / Lane 2]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, split over 4 flow cells (each lane contains the four barcoded samples listed in its column):

             Lane 1   Lane 2   Lane 3   Lane 4
Flow cell 1  C111     C121     C131     C141
             C211     C221     C231     C241
             T111     T121     T131     T141
             T211     T221     T231     T241
Flow cell 2  C112     C122     C132     C142
             C212     C222     C232     C242
             T112     T122     T132     T142
             T212     T222     T232     T242
Flow cell 3  C113     C123     C133     C143
             C213     C223     C233     C243
             T113     T123     T133     T143
             T213     T223     T233     T243
Flow cell 4  C114     C124     C134     C144
             C214     C224     C234     C244
             T114     T124     T134     T144
             T214     T224     T234     T244
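The multiplexed layout can be generated programmatically, which helps when checking that every lane contains all conditions. The sketch below assumes that a label Xijk reads as condition X, biological replicate i, technical replicate j (sequenced in lane j) and flow cell k, which is how the labels on the slide appear to be built; the function name `multiplex_layout` is hypothetical.

```python
from itertools import product


def multiplex_layout(conditions=("C", "T"), n_bio=2, n_tech=4, n_flow_cells=4):
    """Map each (flow cell, lane) pair to the barcoded samples pooled in it."""
    layout = {}
    for flow_cell, lane in product(range(1, n_flow_cells + 1), range(1, n_tech + 1)):
        # Technical replicate `lane` of every biological sample is pooled in
        # the same lane, and the pool is repeated on every flow cell.
        layout[(flow_cell, lane)] = [
            f"{cond}{bio}{lane}{flow_cell}"
            for cond in conditions
            for bio in range(1, n_bio + 1)
        ]
    return layout


layout = multiplex_layout()
for (flow_cell, lane), samples in sorted(layout.items()):
    print(f"Flow cell {flow_cell}, lane {lane}: {' '.join(samples)}")
```

Since every lane holds both conditions, lane and flow cell effects are not confounded with the condition effect.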

Part II: Exploratory analysis


Some features of RNA-seq data

What must be taken into account?

- discrete, non-negative data (total number of aligned reads)
- skewed data
- overdispersion (variance ≫ mean)

[Figure: the black line is "variance = mean"]
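Overdispersion can be checked by comparing the per-gene variance with the per-gene mean: for Poisson counts the two are equal, so a ratio well above 1 indicates overdispersion. The counts below are made up for illustration and are not taken from the dataset of the slides; the helper name `mean_variance` is hypothetical.

```python
import statistics


def mean_variance(counts_by_gene):
    """Return {gene: (mean, sample variance)} computed across samples."""
    return {
        gene: (statistics.mean(counts), statistics.variance(counts))
        for gene, counts in counts_by_gene.items()
    }


# Hypothetical read counts for 3 genes in 6 samples.
counts = {
    "geneA": [12, 30, 7, 55, 21, 3],
    "geneB": [150, 410, 95, 620, 280, 60],
    "geneC": [0, 2, 1, 0, 5, 1],
}

summary = mean_variance(counts)
for gene, (mean, var) in summary.items():
    # A variance/mean ratio close to 1 would be compatible with a Poisson model.
    print(f"{gene}: mean={mean:.1f} variance={var:.1f} ratio={var / mean:.1f}")
```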

Dataset used in the examples

Dataset provided by courtesy of the transcriptomic platform of IPS2. Three files:

- D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples; gene names have been shuffled);
- D1-genesLength.txt contains information about gene lengths;
- D1-targets.txt contains information about the samples and the experimental design.

Dataset used in the examples

These text files are loaded raw_ counts