Biostatistical analyses of RNA-seq data
Nathalie Vialaneix (MIAT, INRA), in collaboration with Ignacio Gonzàles and Annick Moisan
http://www.nathalievialaneix.eu
Toulouse, 11-12 February 2019
Outline
1. Exploratory analysis
   Introduction
   Experimental design
   Data exploration and quality assessment
2. Normalization
   Raw data filtering
   Interpreting read counts
3. Differential Expression analysis
   Hypothesis testing and correction for multiple tests
   Differential expression analysis for RNAseq data
   Interpreting and improving the analysis
A typical transcriptomic experiment
A typical RNA-seq experiment
1. collect samples
2. generate libraries of cDNA fragments
3. assign the material to one lane in one flow cell
Steps in RNAseq data analysis
Part I: Experimental design
Confounded effects: a simple example
Basic experiment: find differences between control and treated plants.
[Figure legend: control group plant, treated group plant]
A bad experimental design: grow all control group plants in one field (Field 1) and all treated group plants in another field (Field 2).
This is bad because you cannot distinguish differences due to the field from differences due to the treatment ⇒ confounded effects.
A good experimental design: grow half of the control group plants and half of the treated group plants (chosen at random) in one field (Field 1), and the rest in the other field (Field 2).
Differences due to the field and differences due to the treatment can then be estimated separately.
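The random allocation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plant names, the group size (6 per group) and the helper `randomized_block_design` are hypothetical and do not come from the course material.

```python
import random

def randomized_block_design(groups, fields, seed=None):
    """Split each treatment group at random and evenly across the fields."""
    rng = random.Random(seed)
    design = {field: [] for field in fields}
    for units in groups.values():
        shuffled = list(units)
        rng.shuffle(shuffled)  # random order within the treatment group
        for k, field in enumerate(fields):
            # deal the shuffled units to the fields in turn
            design[field].extend(shuffled[k::len(fields)])
    return design

# Hypothetical plants: 6 control and 6 treated.
control = [f"control_{i}" for i in range(1, 7)]
treated = [f"treated_{i}" for i in range(1, 7)]

design = randomized_block_design(
    {"control": control, "treated": treated},
    fields=["Field 1", "Field 2"],
    seed=2019,
)
for field, plants in design.items():
    print(field, sorted(plants))
```

Each field receives half of each group, so the field effect and the treatment effect are no longer confounded.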
In summary, what is a good experimental design?
Experimental designs are usually not as simple as this example: they can include multiple experimental factors (day of experiment, flow cell, ...) and multiple covariates (sex, parents, ...).
⇒ The experimental design must be carefully thought out before starting the experiment, and confounded effects must be searched for systematically.
Effect & Variation
2 conditions, 2 genes with the following expression distributions:
first gene: different median levels between the two groups but large variance: the difference may not be significant
second gene: different median levels between the two groups but very small variance: the difference may be significant
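A small numerical illustration of this point, assuming made-up expression values (not taken from the course dataset). Welch's t statistic is used here only as a familiar way of scaling a difference by its variability; it is not presented as the test to use for RNA-seq counts.

```python
import math
import statistics

def welch_t(x, y):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

# Hypothetical expression values; both genes differ by 4 units between groups.
gene1_a = [2.0, 9.0, 5.0, 12.0, 7.0]       # large within-group variance
gene1_b = [6.0, 13.0, 9.0, 16.0, 11.0]
gene2_a = [6.9, 7.0, 7.1, 7.0, 7.0]        # very small within-group variance
gene2_b = [10.9, 11.0, 11.1, 11.0, 11.0]

t_gene1 = welch_t(gene1_a, gene1_b)
t_gene2 = welch_t(gene2_a, gene2_b)
print(f"gene 1: t = {t_gene1:.2f}")
print(f"gene 2: t = {t_gene2:.2f}")
```

The two genes have the same difference in medians, but the statistic is far larger in absolute value for the second gene because its within-group variance is much smaller.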
Sources of variation in RNA-seq experiments
1. at the top layer: biological variations (i.e., individual differences due to, e.g., environmental or genetic factors)
2. at the middle layer: technical variations (library preparation effect)
3. at the bottom layer: technical variations (lane and flow cell effects)
lane effect < flow cell effect < library preparation effect ≪ biological effect
Advised policy for biological/technical replicates
biological replicates: different biological samples, processed separately (advised: ≥ 3 per condition)
technical replicates: same biological material but independent replications of the technical steps (from library preparation onwards; not mandatory in most cases)
6 biological replicates are better than 3 biological replicates × 2 technical replicates! See [Liu et al., 2014] for further discussion.
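One way to see why 6 biological replicates beat 3 × 2 is a simple two-level variance model (an assumption made here for illustration, not a model stated in the slides): each measurement is a condition mean plus a biological deviation of variance `sigma2_bio` plus a technical deviation of variance `sigma2_tech`. The variance values below are hypothetical.

```python
def variance_of_mean(sigma2_bio, sigma2_tech, n_bio, n_tech):
    """Variance of the estimated condition mean under a nested model:
    n_bio biological replicates, each measured n_tech times."""
    return sigma2_bio / n_bio + sigma2_tech / (n_bio * n_tech)

# Hypothetical variances, with biological variation dominating.
sigma2_bio, sigma2_tech = 1.0, 0.1

# Both designs sequence 6 libraries per condition.
v_6x1 = variance_of_mean(sigma2_bio, sigma2_tech, n_bio=6, n_tech=1)
v_3x2 = variance_of_mean(sigma2_bio, sigma2_tech, n_bio=3, n_tech=2)

print(f"6 biological x 1 technical: {v_6x1:.3f}")
print(f"3 biological x 2 technical: {v_3x2:.3f}")
```

Under this model the technical term is identical in both designs (`sigma2_tech / 6`), while the biological term is halved with 6 biological replicates, so technical replicates never compensate for missing biological ones.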
In summary: a simple two-condition experimental design for RNA-seq
Typical RNA-seq design: independent samples are sequenced in different lanes (the lane effect cannot be estimated, but within-sample comparison is preserved).
[Figure: RNA extraction → library preparation → Lane 1, Lane 2]
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate:

              Lane 1   Lane 2   Lane 3   Lane 4
Flow cell 1   C11      C21      T11      T21
Flow cell 2   C12      C22      T12      T22
Flow cell 3   C13      C23      T13      T23
Flow cell 4   C14      C24      T14      T24
Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane.
[Figure: RNA extraction → library preparation and barcoding (barcoded adapters) → pooling → Lane 1, Lane 2]
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, split across 4 flow cells:

              Lane 1                Lane 2                Lane 3                Lane 4
Flow cell 1   C111 C211 T111 T211   C121 C221 T121 T221   C131 C231 T131 T231   C141 C241 T141 T241
Flow cell 2   C112 C212 T112 T212   C122 C222 T122 T222   C132 C232 T132 T232   C142 C242 T142 T242
Flow cell 3   C113 C213 T113 T213   C123 C223 T123 T223   C133 C233 T133 T233   C143 C243 T143 T243
Flow cell 4   C114 C214 T114 T214   C124 C224 T124 T224   C134 C234 T134 T234   C144 C244 T144 T244
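The multiplexed layout can be enumerated programmatically. The sketch below assumes that a label such as C121 reads as condition C, biological replicate 1, lane 2, flow cell 1; this reading is inferred from the label pattern and is not stated explicitly in the slides.

```python
from itertools import product

conditions = ["C", "T"]      # control / treated
bio_reps = [1, 2]            # biological replicates per condition
lanes = [1, 2, 3, 4]         # lanes per flow cell
flow_cells = [1, 2, 3, 4]

# Every lane of every flow cell receives a pool with one barcoded
# library from each of the 4 biological samples.
layout = {
    (fc, lane): [f"{cond}{rep}{lane}{fc}" for cond, rep in product(conditions, bio_reps)]
    for fc, lane in product(flow_cells, lanes)
}

for (fc, lane), pool in layout.items():
    print(f"flow cell {fc}, lane {lane}: {' '.join(pool)}")
```

Because every sample appears in every lane, lane and flow cell effects are balanced across conditions rather than confounded with them.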
Part II: Exploratory analysis
Some features of RNA-seq data
What must be taken into account?
discrete, non-negative data (total number of aligned reads)
skewed data
overdispersion (variance ≫ mean)
[Figure: the black line is "variance = mean"]
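Overdispersion can be checked gene by gene by comparing the sample variance with the mean: under a Poisson model the two would be equal. The counts below are hypothetical and only serve to illustrate the computation; the gene names are placeholders.

```python
import statistics

# Hypothetical raw counts: gene -> counts across 6 samples.
counts = {
    "geneA": [12, 30, 7, 45, 19, 3],
    "geneB": [510, 820, 300, 1290, 640, 150],
    "geneC": [0, 2, 1, 0, 3, 1],
}

# Ratio variance / mean: about 1 for Poisson-like counts, above 1 when overdispersed.
ratios = {}
for gene, values in counts.items():
    mean = statistics.mean(values)
    var = statistics.variance(values)   # sample variance (n - 1 denominator)
    ratios[gene] = var / mean
    print(f"{gene}: mean = {mean:.1f}, variance = {var:.1f}, ratio = {ratios[gene]:.1f}")
```

In this toy table the ratio grows with the expression level, which is the pattern the variance-versus-mean plot is meant to reveal.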
Dataset used in the examples
Dataset provided by courtesy of the transcriptomic platform of IPS2. Three files:
D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples; gene names have been shuffled);
D1-genesLength.txt contains information about gene lengths;
D1-targets.txt contains information about the samples and the experimental design.
These text files are loaded raw_ counts