Fabrice Armougom, MIO
Marc Garel, MIO
Nicolas Henry, ABiMS
August 1, 2022
To begin with, download the course repository on your computer or virtual machine.
To do so, visit the repository on github (ANF-metabarcoding) and download it as a zip file.
Once on your machine, unzip the file and place the resulting folder in the most convenient location (~/Documents/ for example).
Save as a variable the path to the folder where you will place the reference databases.
refdb_folder <- here::here("data", "refdb")
refdb_folder
[1] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/refdb"
The reason why we use here::here() is that when you render an R Markdown file, the working directory is the folder where the R Markdown file is, whereas here::here() points to the root of the R project:
[1] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic"
Now, let’s create the folder directly from R:
You can also create the folder from RStudio in the Files window.
Tip: You can access the documentation of any R function using ? in the console. If you want to know everything about the function dir.create(), simply run ?dir.create().
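A minimal sketch of creating the folder from R (the recursive and showWarnings arguments are optional choices, not requirements):

```r
# create the refdb folder, including any missing parent folders;
# showWarnings = FALSE silences the warning if the folder already exists
dir.create(refdb_folder, recursive = TRUE, showWarnings = FALSE)
```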
The Silva reference database, commonly used to assign 16S metabarcoding data, will be used in this practical.
In case you are working with 18S sequences, you will get better assignments using PR2 (https://pr2-database.org/) or EukRibo (https://doi.org/10.5281/zenodo.6327891), especially if you are interested in protists.
The following code downloads the dada2-formatted Silva reference databases. If you are not comfortable with it, you can simply download the reference database files from your web browser here and here.
getOption("timeout") # default download timeout, in seconds
[1] 60
# 60 seconds may be too short for large downloads,
# so we change the timeout to 20 minutes
options(timeout = 1200)
# we save in variable the path to the refdb
# in the working space
silva_train_set <- file.path(refdb_folder,
"silva_nr99_v138.1_train_set.fa.gz")
silva_species_assignment <- file.path(refdb_folder,
"silva_species_assignment_v138.1.fa.gz")
# then we download the files if they don't already exist
if (!file.exists(silva_train_set)) {
download.file(
"https://zenodo.org/record/4587955/files/silva_nr99_v138.1_train_set.fa.gz",
silva_train_set,
quiet = TRUE
)
}
if (!file.exists(silva_species_assignment)) {
download.file(
"https://zenodo.org/record/4587955/files/silva_species_assignment_v138.1.fa.gz",
silva_species_assignment,
quiet = TRUE
)
}
In this practical we will use R functions especially written for this course. The “classic” way to import functions is to use source() with the path to the R script to source.
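For example, assuming the course functions live in R/preprocessing.R at the root of the project (the path used later in this practical), sourcing them could look like:

```r
# import the custom functions written for this course
source(here::here("R", "preprocessing.R"))
```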
Save the path to the directory containing your raw data (paired-end sequencing fastq files) in a variable named path_to_fastqs
The gzipped (compressed) FASTQ formatted “forward” (R1) and “reverse” (R2) files are named as follows:
${SAMPLENAME}_R1.fastq.gz for the forward files
${SAMPLENAME}_R2.fastq.gz for the reverse files
We list the forward files using the function list.files(). The argument pattern gives you the possibility to select only file names matching a regular expression. In our case, we select all file names ending with _R1.fastq.gz.
We do the same for reverse samples.
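Put together, listing both read directions might look like this sketch (fnFs and fnRs are the variable names used below; path_to_fastqs was defined earlier):

```r
# forward reads: all file names ending with _R1.fastq.gz
fnFs <- sort(list.files(path_to_fastqs,
                        pattern = "_R1.fastq.gz",
                        full.names = TRUE))
# reverse reads: all file names ending with _R2.fastq.gz
fnRs <- sort(list.files(path_to_fastqs,
                        pattern = "_R2.fastq.gz",
                        full.names = TRUE))
```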
To understand: what do the fnFs and fnRs variables contain?
[1] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S11B_R1.fastq.gz"
[2] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S1B_R1.fastq.gz"
[3] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S2B_R1.fastq.gz"
[4] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S2S_R1.fastq.gz"
[5] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S3B_R1.fastq.gz"
[6] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/data/raw/S3S_R1.fastq.gz"
To understand:
basename(): removes the path to keep only the file name.
|>: the R “pipe”. It allows you to chain functions, avoiding intermediate variables and nested parentheses. It basically transfers the output of the left expression to the input of the right expression. You need R ≥ 4.1 to use this pipe; otherwise use %>% from magrittr.
strsplit(): splits a character string according to a defined pattern. See ?strsplit for documentation.
sapply(): applies a function to each element of a list or vector. The output is simplified to a vector.
Let’s go step by step. First list the R1 file names.
[1] "S11B_R1.fastq.gz" "S1B_R1.fastq.gz" "S2B_R1.fastq.gz" "S2S_R1.fastq.gz"
[5] "S3B_R1.fastq.gz" "S3S_R1.fastq.gz"
We can see that the sample name comes before the first _. With strsplit(), we can split each file name into a two-element vector. The result is a list of two-element vectors.
[[1]]
[1] "S11B" "R1.fastq.gz"
[[2]]
[1] "S1B" "R1.fastq.gz"
[[3]]
[1] "S2B" "R1.fastq.gz"
[[4]]
[1] "S2S" "R1.fastq.gz"
[[5]]
[1] "S3B" "R1.fastq.gz"
[[6]]
[1] "S3S" "R1.fastq.gz"
Now, we just have to extract the first element for each file.
[1] "S11B" "S1B" "S2B" "S2S" "S3B" "S3S"
Regular expressions are extremely useful. If you are keen to learn how to use them, have a look here
We use a custom function, qualityprofile(), implemented in R/preprocessing.R, to check the quality of the raw sequences.
Run ?qualityprofile to know more about this function.
Open the pdf file generated by qualityprofile().
We first create a folder where to save the reads once they are trimmed:
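A sketch of this step (the outputs/dada2/trimmed folder matches the paths shown below; the variable name path_to_trimmed_reads is an assumption):

```r
# folder where the primer-trimmed reads will be written
path_to_trimmed_reads <- here::here("outputs", "dada2", "trimmed")
dir.create(path_to_trimmed_reads, recursive = TRUE, showWarnings = FALSE)
```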
Then we prepare the list of paths where the results (i.e. the sequences without primers) will be saved:
nopFw <- file.path(path_to_trimmed_reads, basename(fnFs))
nopRv <- file.path(path_to_trimmed_reads, basename(fnRs))
head(nopFw)
[1] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S11B_R1.fastq.gz"
[2] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S1B_R1.fastq.gz"
[3] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S2B_R1.fastq.gz"
[4] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S2S_R1.fastq.gz"
[5] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S3B_R1.fastq.gz"
[6] "/Users/marcgarel/OSU/MIO/2024/Formation_bioinfo_20224/formation_metabarcoding_omic/outputs/dada2/trimmed/S3S_R1.fastq.gz"
The data you are working with correspond to the V3-V4 region, amplified using the primers Pro341F (CCTACGGGNBGCASCAG) and Pro805R (GACTACNVGGGTATCTAAT). Save the forward and reverse primers into variables.
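The primer sequences come from the text above; the variable names are only illustrative:

```r
# V3-V4 primers: Pro341F (forward) and Pro805R (reverse)
primer_fwd <- "CCTACGGGNBGCASCAG"
primer_rev <- "GACTACNVGGGTATCTAAT"
```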
Same as before, create a folder and list paths:
To make the link between files and sample names, simply name the vector of file names using the sample names.
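Assuming the sample names extracted earlier are stored in sample_names, this is simply:

```r
# name each file path after its sample so downstream tables use sample names
names(filtFs) <- sample_names
names(filtRs) <- sample_names
```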
dada2::filterAndTrim()
Let’s have a look at what the function is doing. To do so, type ?dada2::filterAndTrim in the console.
Let’s run the function.
(out <- dada2::filterAndTrim(
fwd = nopFw,
filt = filtFs,
rev = nopRv,
filt.rev = filtRs,
minLen = 150,
matchIDs = TRUE,
maxN = 0,
maxEE = c(3, 3),
truncQ = 2
))
reads.in reads.out
S11B_R1.fastq.gz 1985 994
S1B_R1.fastq.gz 1973 1051
S2B_R1.fastq.gz 1985 1058
S2S_R1.fastq.gz 1981 1046
S3B_R1.fastq.gz 1986 1044
S3S_R1.fastq.gz 1987 1132
S4B_R1.fastq.gz 1980 1058
S4S_R1.fastq.gz 1983 1135
S5B_R1.fastq.gz 1984 1049
S5S_R1.fastq.gz 1988 1055
S6B_R1.fastq.gz 1979 1035
S6S_R1.fastq.gz 1979 1025
S7B_R1.fastq.gz 1985 997
S7S_R1.fastq.gz 1985 1008
S8B_R1.fastq.gz 1992 976
S8S_R1.fastq.gz 1980 1051
S9B_R1.fastq.gz 1979 956
S9S_R1.fastq.gz 1979 1045
What happened?
Details about the function arguments:
* nopFw: input, where the forward reads without primers are (path)
* filtFs: output, where the forward filtered reads are written (path)
* nopRv and filtRs: same as above, but with the reverse reads
* truncLen: truncate reads after truncLen bases; reads shorter than that are discarded. Example: truncLen = c(200, 150) means forward and reverse reads are cut at 200 bp and 150 bp respectively.
* trimLeft: number of nucleotides to remove from the start of the reads
* trimRight: number of nucleotides to remove from the end of the reads
* maxN: maximum number of ambiguous bases (N) accepted
* maxEE: expected errors (EE) threshold. The EE of a read is the sum of the error probabilities of each of its bases. Increase that value to accept more low-quality reads. The first value refers to the forward reads and the second to the reverse reads.
* truncQ = 2: truncate reads at the first instance of a quality score less than or equal to truncQ.
To be able to denoise your data, you need an error model. The error model tells you at which rate a nucleotide is replaced by another for a given quality score: for example, for a quality score Q of 30, what is the probability of an A being wrongly read as a T?
This error model can be learnt directly from the data with the function dada2::learnErrors(). You can come back to the course for more details about the maths behind it.
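A sketch of the call for both read directions (errF and errR are illustrative names; multithread is optional):

```r
# learn the error model from the filtered forward and reverse reads
errF <- dada2::learnErrors(filtFs, multithread = TRUE)
errR <- dada2::learnErrors(filtRs, multithread = TRUE)
```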
5155290 total bases in 18715 reads from 18 samples will be used for learning the error rates.
5277647 total bases in 18715 reads from 18 samples will be used for learning the error rates.
You can visualise the resulting error model using the function dada2::plotErrors()
Before denoising, we need to dereplicate the sequences: that is, for each unique sequence, count the number of reads.
The dereplication is achieved using the function dada2::derepFastq()
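A sketch of the dereplication step (variable names are illustrative):

```r
# dereplicate: keep one entry per unique sequence, with its read count
derepFs <- dada2::derepFastq(filtFs)
derepRs <- dada2::derepFastq(filtRs)
```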
Now we are ready to run the denoising algorithm with dada2::dada()
. As input, we need the error model and the dereplicated sequences.
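Assuming the error models are stored in errF/errR and the dereplicated reads in derepFs/derepRs (illustrative names), the call might look like:

```r
# denoise each read direction with its own error model
dadaFs <- dada2::dada(derepFs, err = errF)
dadaRs <- dada2::dada(derepRs, err = errR)
```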
Sample 1 - 994 reads in 633 unique sequences.
Sample 2 - 1051 reads in 683 unique sequences.
Sample 3 - 1058 reads in 671 unique sequences.
Sample 4 - 1046 reads in 661 unique sequences.
Sample 5 - 1044 reads in 670 unique sequences.
Sample 6 - 1132 reads in 665 unique sequences.
Sample 7 - 1058 reads in 649 unique sequences.
Sample 8 - 1135 reads in 563 unique sequences.
Sample 9 - 1049 reads in 668 unique sequences.
Sample 10 - 1055 reads in 578 unique sequences.
Sample 11 - 1035 reads in 597 unique sequences.
Sample 12 - 1025 reads in 563 unique sequences.
Sample 13 - 997 reads in 586 unique sequences.
Sample 14 - 1008 reads in 578 unique sequences.
Sample 15 - 976 reads in 595 unique sequences.
Sample 16 - 1051 reads in 604 unique sequences.
Sample 17 - 956 reads in 566 unique sequences.
Sample 18 - 1045 reads in 582 unique sequences.
Sample 1 - 994 reads in 730 unique sequences.
Sample 2 - 1051 reads in 763 unique sequences.
Sample 3 - 1058 reads in 780 unique sequences.
Sample 4 - 1046 reads in 747 unique sequences.
Sample 5 - 1044 reads in 762 unique sequences.
Sample 6 - 1132 reads in 796 unique sequences.
Sample 7 - 1058 reads in 763 unique sequences.
Sample 8 - 1135 reads in 733 unique sequences.
Sample 9 - 1049 reads in 777 unique sequences.
Sample 10 - 1055 reads in 718 unique sequences.
Sample 11 - 1035 reads in 727 unique sequences.
Sample 12 - 1025 reads in 672 unique sequences.
Sample 13 - 997 reads in 700 unique sequences.
Sample 14 - 1008 reads in 709 unique sequences.
Sample 15 - 976 reads in 699 unique sequences.
Sample 16 - 1051 reads in 762 unique sequences.
Sample 17 - 956 reads in 674 unique sequences.
Sample 18 - 1045 reads in 724 unique sequences.
Once forward and reverse reads have been denoised, we can merge them with dada2::mergePairs().
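A sketch of the merging step, assuming the denoised reads are stored in dadaFs/dadaRs and the dereplicated reads in derepFs/derepRs:

```r
# merge denoised forward and reverse reads into full-length amplicons
mergers <- dada2::mergePairs(
  dadaF = dadaFs, derepF = derepFs,
  dadaR = dadaRs, derepR = derepRs
)
```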
At this point we have ASVs and we know their number of reads in each sample. We have enough information to build an ASV table.
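Building the table is a single call (seqtab is an illustrative name):

```r
# rows = samples, columns = ASV sequences, values = read counts
seqtab <- dada2::makeSequenceTable(mergers)
```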
Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. We find and remove bimeras (two-parent chimeras) using the function dada2::removeBimeraDenovo()
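A sketch of the chimera removal step (the "consensus" method is an assumption; seqtab_nochim is the variable name used below):

```r
# flag bimeras sample by sample and remove those detected in a
# sufficient fraction of samples
seqtab_nochim <- dada2::removeBimeraDenovo(seqtab, method = "consensus")
```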
Statistics about each preprocessing step can also be exported. First, this table needs to be assembled:
getN <- function(x) sum(dada2::getUniques(x))
summarytab <- data.frame(Samples = sample_names,
                         input = out[, 1],
                         filtered = out[, 2],
                         denoised = sapply(dadaFs, getN),
                         merged = sapply(mergers, getN),
                         nonchimeric = rowSums(seqtab_nochim),
                         Final_retained = rowSums(seqtab_nochim) / out[, 1] * 100)
summarytab
                 Samples input filtered denoised merged nonchimeric
S11B_R1.fastq.gz S11B 1985 994 884 749 749
S1B_R1.fastq.gz S1B 1973 1051 823 708 708
S2B_R1.fastq.gz S2B 1985 1058 882 672 672
S2S_R1.fastq.gz S2S 1981 1046 917 783 783
S3B_R1.fastq.gz S3B 1986 1044 857 672 672
S3S_R1.fastq.gz S3S 1987 1132 940 776 776
S4B_R1.fastq.gz S4B 1980 1058 887 687 687
S4S_R1.fastq.gz S4S 1983 1135 967 870 870
S5B_R1.fastq.gz S5B 1984 1049 933 770 770
S5S_R1.fastq.gz S5S 1988 1055 892 767 766
S6B_R1.fastq.gz S6B 1979 1035 948 817 817
S6S_R1.fastq.gz S6S 1979 1025 870 738 738
S7B_R1.fastq.gz S7B 1985 997 868 684 684
S7S_R1.fastq.gz S7S 1985 1008 874 754 754
S8B_R1.fastq.gz S8B 1992 976 811 694 694
S8S_R1.fastq.gz S8S 1980 1051 903 742 742
S9B_R1.fastq.gz S9B 1979 956 840 650 650
S9S_R1.fastq.gz S9S 1979 1045 904 746 746
Final_retained
S11B_R1.fastq.gz 37.73300
S1B_R1.fastq.gz 35.88444
S2B_R1.fastq.gz 33.85390
S2S_R1.fastq.gz 39.52549
S3B_R1.fastq.gz 33.83686
S3S_R1.fastq.gz 39.05385
S4B_R1.fastq.gz 34.69697
S4S_R1.fastq.gz 43.87292
S5B_R1.fastq.gz 38.81048
S5S_R1.fastq.gz 38.53119
S6B_R1.fastq.gz 41.28348
S6S_R1.fastq.gz 37.29156
S7B_R1.fastq.gz 34.45844
S7S_R1.fastq.gz 37.98489
S8B_R1.fastq.gz 34.83936
S8S_R1.fastq.gz 37.47475
S9B_R1.fastq.gz 32.84487
S9S_R1.fastq.gz 37.69581
The ASV table is ready. But without a clue about the ASVs’ taxonomic identity, we won’t go far in ecological interpretations. We can get an idea of each ASV’s taxonomic identity by comparing its sequence to reference databases such as SILVA.
The taxonomic assignment is done in two steps.
First, each ASV is assigned a taxonomy using the RDP naive Bayesian classifier algorithm described in Wang et al. 2007, called by the function dada2::assignTaxonomy().
The method is robust; however, it often fails to assign at the species level.
If you consider that an ASV which is 100% similar to a reference sequence belongs to the same species, then you can use dada2::addSpecies(). This function assigns to the species level those ASVs that are identical to a reference sequence.
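The two steps can be sketched as follows, reusing the reference database paths saved earlier (multithread is optional):

```r
# step 1: naive Bayesian classifier against the SILVA training set
taxonomy <- dada2::assignTaxonomy(seqtab_nochim,
                                  refFasta = silva_train_set,
                                  multithread = TRUE)
# step 2: exact-match (100% identity) species-level assignment
taxonomy <- dada2::addSpecies(taxonomy, silva_species_assignment)
```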
We will also need some information about the samples. The corresponding file is located in another folder.
context <- read.table(here::here("data", "mapfileFA.txt"),
                      header = TRUE, row.names = 1)
#See
context
SampName Geo Description groupe Pres PicoEuk Synec Prochloro NanoEuk
S1B S1B North North1B NBF 52 660 32195 10675 955
S2B S2B North North2B NBF 59 890 25480 16595 670
S2S S2S North North2S NBS 0 890 25480 16595 670
S3B S3B North North3B NBF 74 835 13340 25115 1115
S3S S3S North North3S NBS 0 715 26725 16860 890
S4B S4B North North4B NBF 78 2220 3130 29835 2120
S4S S4S North North4S NBS 78 2220 3130 29835 2120
S5B S5B North North5B NBF 42 1620 55780 23795 2555
S5S S5S North North5S NBS 0 1620 56555 22835 2560
S6B S6B South South1B SGF 13 2520 39050 705 3630
S6S S6S South South1S SGS 0 2435 35890 915 3735
S7B S7B South South2B SGF 26 0 0 0 4005
S7S S7S South South2S SGS 0 4535 26545 1340 6585
S8B S8B South South3B SGF 33 0 0 0 5910
S8S S8S South South3S SGS 0 4260 36745 985 5470
S9B S9B South South4B SGF 25 4000 31730 485 4395
S9S S9S South South4S SGS 0 5465 32860 820 5045
S11B S11B South South5B SGF 35 5370 46830 580 6010
Crypto SiOH4 NO2 NO3 NH4 PO4 NT PT Chla T S
S1B 115 1.813 0.256 0.889 0.324 0.132 9.946 3.565 0.0000 22.7338 37.6204
S2B 395 2.592 0.105 1.125 0.328 0.067 9.378 3.391 0.0000 22.6824 37.6627
S2S 395 3.381 0.231 0.706 0.450 0.109 8.817 3.345 0.0000 22.6854 37.6176
S3B 165 1.438 0.057 1.159 0.369 0.174 8.989 2.568 0.0000 21.5296 37.5549
S3S 200 1.656 0.098 0.794 0.367 0.095 7.847 2.520 0.0000 22.5610 37.5960
S4B 235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S4S 235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S5B 1355 2.028 0.103 1.135 0.216 0.128 8.623 3.137 0.0102 24.1905 38.3192
S5S 945 2.669 0.136 0.785 0.267 0.114 9.146 3.062 0.0000 24.1789 38.3213
S6B 1295 2.206 0.249 0.768 0.629 0.236 9.013 3.455 0.0000 22.0197 39.0877
S6S 1300 3.004 0.251 0.927 0.653 0.266 8.776 3.230 0.0134 22.0515 39.0884
S7B 1600 3.016 0.157 0.895 0.491 0.176 8.968 4.116 0.0000 23.6669 38.9699
S7S 1355 1.198 0.165 1.099 0.432 0.180 8.256 3.182 0.0000 23.6814 38.9708
S8B 1590 3.868 0.253 0.567 0.533 0.169 8.395 3.126 0.0000 23.1236 39.0054
S8S 2265 3.639 0.255 0.658 0.665 0.247 8.991 3.843 0.0132 23.3147 38.9885
S9B 1180 3.910 0.107 0.672 0.490 0.134 8.954 4.042 0.0172 22.6306 38.9094
S9S 1545 3.607 0.139 0.644 0.373 0.167 9.817 3.689 0.0062 22.9545 38.7777
S11B 1690 3.324 0.083 0.756 0.467 0.115 9.539 4.138 0.0182 23.0308 38.9967
Sigma_t
S1B 26.0046
S2B 26.0521
S2S 26.0137
S3B 26.2987
S3S 26.0332
S4B 26.9415
S4S 26.9415
S5B 26.1037
S5S 26.1065
S6B 27.3241
S6S 27.3151
S7B 26.7536
S7S 26.7488
S8B 26.9423
S8S 26.8713
S9B 27.0131
S9S 26.8172
S11B 26.9631
physeq <- phyloseq::phyloseq(
phyloseq::otu_table(seqtab_nochim,taxa_are_rows = FALSE),
phyloseq::tax_table(taxonomy),
phyloseq::sample_data(context)
)
#See
physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 155 taxa and 18 samples ]
sample_data() Sample Data: [ 18 samples by 21 sample variables ]
tax_table() Taxonomy Table: [ 155 taxa by 7 taxonomic ranks ]
The function will add the sequences AND change the labels automatically to ASV1, ASV2, etc.
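If you prefer to do it by hand, here is a common sketch of the same operation (not necessarily the course’s helper function):

```r
# store the ASV sequences in the refseq slot, then relabel taxa ASV1, ASV2, ...
dna <- Biostrings::DNAStringSet(phyloseq::taxa_names(physeq))
names(dna) <- phyloseq::taxa_names(physeq)
physeq <- phyloseq::merge_phyloseq(physeq, dna)
phyloseq::taxa_names(physeq) <- paste0("ASV", seq(phyloseq::ntaxa(physeq)))
```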
phyloseq-class experiment-level object
otu_table() OTU Table: [ 155 taxa and 18 samples ]
sample_data() Sample Data: [ 18 samples by 21 sample variables ]
tax_table() Taxonomy Table: [ 155 taxa by 7 taxonomic ranks ]
refseq() DNAStringSet: [ 155 reference sequences ]
Knowing the ASVs’ phylogenetic relatedness will help you get a better understanding of the communities you’re studying.
To quote the evolutionary biologist Theodosius Dobzhansky:
Nothing in Biology Makes Sense Except in the Light of Evolution.
A phylogenetic tree reconstructed from the ASV sequences will be used to measure their relatedness.
Before reconstructing a phylogenetic tree, we need to align the ASV sequences.
Once it is done, you can visualise the alignment using the function DECIPHER::BrowseSeqs()
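A sketch of the alignment step with DECIPHER, assuming the ASV sequences are available from the phyloseq object’s refseq slot:

```r
# align the ASV sequences (AlignSeqs works on a DNAStringSet)
seqs <- phyloseq::refseq(physeq)
alignment <- DECIPHER::AlignSeqs(seqs, anchor = NA)
# open the alignment in your web browser for a visual check
DECIPHER::BrowseSeqs(alignment)
```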
We will infer a phylogenetic tree from our alignment using the library phangorn.
First, let’s convert our DNAStringSet alignment to the phangorn phyDat format.
Then, we compute pairwise distances of our aligned sequences using equal base frequencies (JC69 model used by default).
Finally, we reconstruct a neighbour joining tree.
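The three steps might be sketched as follows (alignment is assumed to hold the aligned DNAStringSet; the other names are illustrative):

```r
# convert the alignment to phangorn's phyDat format
phydat <- phangorn::phyDat(as(alignment, "matrix"), type = "DNA")
# pairwise distances under equal base frequencies (JC69, the default)
dm <- phangorn::dist.ml(phydat)
# neighbour joining tree
nj_tree <- phangorn::NJ(dm)
```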
Other approaches to reconstruct a phylogenetic tree exist. If you want to try them with phangorn, have a look here
We need the tree to be rooted for future analysis. We can do that using the function phangorn::midpoint()
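Assuming the neighbour joining tree is stored in nj_tree (an illustrative name):

```r
# root the tree at its midpoint
rooted_tree <- phangorn::midpoint(nj_tree)
```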
Change the labels (leaves) of the tree (important):
Once we have a rooted tree, we can add it to the phyloseq object.
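A sketch of this last step, assuming the rooted tree (with leaves relabelled to match the ASV names) is stored in rooted_tree:

```r
# add the phylogenetic tree to the existing phyloseq object
physeq <- phyloseq::merge_phyloseq(physeq, phyloseq::phy_tree(rooted_tree))
```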
phyloseq-class experiment-level object
otu_table() OTU Table: [ 155 taxa and 18 samples ]
sample_data() Sample Data: [ 18 samples by 21 sample variables ]
tax_table() Taxonomy Table: [ 155 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 155 tips and 154 internal nodes ]
refseq() DNAStringSet: [ 155 reference sequences ]