PARTII_alpha_Diversity_ANF

Authors

Fabrice Armougom, MIO

Marc Garel, MIO

Prepare workspace

Load libraries

library(phyloseq)
library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.3.2

library(patchwork)

Load custom functions

source(here::here("scripts", "R", "alpha_diversity.R"))
source(here::here("scripts", "R", "utils.R"))

Define output folder

output_alpha <- here::here("outputs", "alpha_diversity")
if (!dir.exists(output_alpha)) dir.create(output_alpha, recursive = TRUE)

Load the data and inspect the phyloseq object

physeq <- readRDS(here::here("data",
                             "phyloseq_object_alpha_beta_div.rds"))

Data Structure

Phyloseq object

physeq

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 213 taxa and 18 samples ]
sample_data() Sample Data:       [ 18 samples by 21 sample variables ]
tax_table()   Taxonomy Table:    [ 213 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 213 tips and 212 internal nodes ]
refseq()      DNAStringSet:      [ 213 reference sequences ]

Composition of our phyloseq object `physeq`

An ASV table with the absolute counts

Be careful: Rows are samples, columns are ASVs

physeq@otu_table[1:10,1:10]

OTU Table:          [10 taxa and 10 samples]
                     taxa are columns
     ASV1 ASV2 ASV3 ASV4 ASV5 ASV6 ASV7 ASV8 ASV9 ASV10
S11B  117   25   85   70   87   40   57   34   41     0
S1B    67    0   23    0   51   48    0    0   27    58
S2B    43    0   35   15   42   52    0    0    0    43
S2S   103   87   76   12   99   43   36   72   46     0
S3B    59    0   32    0   49   73    0    0    0    57
S3S    81   10    0   20   36    0    0    0    0    50
S4B    11    6   38   33   43   46    0    8    0    37
S4S    68    6   38    0   62    0    0   11   30    46
S5B   176   18   62  109    0   35   56   13   33    82
S5S   182    0   36  101   51    0   33    0   29    42

A metadata table with information (e.g. physicochemical, categorical variables) about samples

physeq@sam_data

     SampName   Geo Description groupe Pres PicoEuk Synec Prochloro NanoEuk
S11B     S11B South     South5B    SGF   35    5370 46830       580    6010
S1B       S1B North     North1B    NBF   52     660 32195     10675     955
S2B       S2B North     North2B    NBF   59     890 25480     16595     670
S2S       S2S North     North2S    NBS    0     890 25480     16595     670
S3B       S3B North     North3B    NBF   74     835 13340     25115    1115
S3S       S3S North     North3S    NBS    0     715 26725     16860     890
S4B       S4B North     North4B    NBF   78    2220  3130     29835    2120
S4S       S4S North     North4S    NBS   78    2220  3130     29835    2120
S5B       S5B North     North5B    NBF   42    1620 55780     23795    2555
S5S       S5S North     North5S    NBS    0    1620 56555     22835    2560
S6B       S6B South     South1B    SGF   13    2520 39050       705    3630
S6S       S6S South     South1S    SGS    0    2435 35890       915    3735
S7B       S7B South     South2B    SGF   26       0     0         0    4005
S7S       S7S South     South2S    SGS    0    4535 26545      1340    6585
S8B       S8B South     South3B    SGF   33       0     0         0    5910
S8S       S8S South     South3S    SGS    0    4260 36745       985    5470
S9B       S9B South     South4B    SGF   25    4000 31730       485    4395
S9S       S9S South     South4S    SGS    0    5465 32860       820    5045
     Crypto SiOH4   NO2   NO3   NH4   PO4    NT    PT   Chla       T       S
S11B   1690 3.324 0.083 0.756 0.467 0.115 9.539 4.138 0.0182 23.0308 38.9967
S1B     115 1.813 0.256 0.889 0.324 0.132 9.946 3.565 0.0000 22.7338 37.6204
S2B     395 2.592 0.105 1.125 0.328 0.067 9.378 3.391 0.0000 22.6824 37.6627
S2S     395 3.381 0.231 0.706 0.450 0.109 8.817 3.345 0.0000 22.6854 37.6176
S3B     165 1.438 0.057 1.159 0.369 0.174 8.989 2.568 0.0000 21.5296 37.5549
S3S     200 1.656 0.098 0.794 0.367 0.095 7.847 2.520 0.0000 22.5610 37.5960
S4B     235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S4S     235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S5B    1355 2.028 0.103 1.135 0.216 0.128 8.623 3.137 0.0102 24.1905 38.3192
S5S     945 2.669 0.136 0.785 0.267 0.114 9.146 3.062 0.0000 24.1789 38.3213
S6B    1295 2.206 0.249 0.768 0.629 0.236 9.013 3.455 0.0000 22.0197 39.0877
S6S    1300 3.004 0.251 0.927 0.653 0.266 8.776 3.230 0.0134 22.0515 39.0884
S7B    1600 3.016 0.157 0.895 0.491 0.176 8.968 4.116 0.0000 23.6669 38.9699
S7S    1355 1.198 0.165 1.099 0.432 0.180 8.256 3.182 0.0000 23.6814 38.9708
S8B    1590 3.868 0.253 0.567 0.533 0.169 8.395 3.126 0.0000 23.1236 39.0054
S8S    2265 3.639 0.255 0.658 0.665 0.247 8.991 3.843 0.0132 23.3147 38.9885
S9B    1180 3.910 0.107 0.672 0.490 0.134 8.954 4.042 0.0172 22.6306 38.9094
S9S    1545 3.607 0.139 0.644 0.373 0.167 9.817 3.689 0.0062 22.9545 38.7777
     Sigma_t
S11B 26.9631
S1B  26.0046
S2B  26.0521
S2S  26.0137
S3B  26.2987
S3S  26.0332
S4B  26.9415
S4S  26.9415
S5B  26.1037
S5S  26.1065
S6B  27.3241
S6S  27.3151
S7B  26.7536
S7S  26.7488
S8B  26.9423
S8S  26.8713
S9B  27.0131
S9S  26.8172

A table of taxonomic classification level of each ASV

physeq@tax_table[1:10,]

Taxonomy Table:     [10 taxa by 7 taxonomic ranks]:
      Kingdom    Phylum             Class                 Order             
ASV1  "Bacteria" "Cyanobacteria"    "Cyanobacteriia"      "Synechococcales" 
ASV2  "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Enterobacterales"
ASV3  "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"     
ASV4  "Archaea"  "Thermoplasmatota" "Thermoplasmata"      "Marine Group II" 
ASV5  "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"     
ASV6  "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"     
ASV7  "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhodospirillales"
ASV8  "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Enterobacterales"
ASV9  "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"     
ASV10 "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "SAR11 clade"     
      Family                    Genus                  Species
ASV1  "Cyanobiaceae"            "Synechococcus CC9902" NA     
ASV2  "Pseudoalteromonadaceae"  "Pseudoalteromonas"    NA     
ASV3  "Clade I"                 "Clade Ia"             NA     
ASV4  NA                        NA                     NA     
ASV5  "Clade I"                 "Clade Ia"             NA     
ASV6  "Clade II"                NA                     NA     
ASV7  "AEGEAN-169 marine group" NA                     NA     
ASV8  "Pseudoalteromonadaceae"  "Pseudoalteromonas"    NA     
ASV9  "Clade I"                 "Clade Ia"             NA     
ASV10 "Clade I"                 "Clade Ia"             NA

A Phylogenetic tree

physeq@phy_tree


Phylogenetic tree with 213 tips and 212 internal nodes.

Tip labels:
  ASV1, ASV2, ASV3, ASV4, ASV5, ASV6, ...

Rooted; includes branch lengths.

A table with the ASV sequences

physeq@refseq

DNAStringSet object of length 213:
      width seq                                             names               
  [1]   402 GGAATTTTCCGCAATGGGCGAA...CGAAAGCCAGGGGAGCGAAAGG ASV1
  [2]   425 GGAATATTGCACAATGGGCGCA...CGAAAGCGTGGGGAGCAAACAG ASV2
  [3]   400 GGAATCTTGCACAATGGAGGAA...CGAAAGCATGGGTAGCGAAGAG ASV3
  [4]   383 CGAAAACTTGACAATGCGAGCA...CGAAGCCTAGGGGCACGAACCG ASV4
  [5]   400 GGAATCTTGCACAATGGAGGAA...CGAAAGCATGGGTAGCGAAGAG ASV5
  ...   ... ...
[209]   426 GGAATTTTGCGCAATGGACGAA...CGAAAGCGTGGGGAGCGAACAG ASV209
[210]   403 GGAATATTGCACAATGGGCGCA...GGTCAACACTGACGCTCATGTA ASV210
[211]   360 CGAAAACTTCACACTGCAGGAA...GAACGGATCCGACGGTCAGGGA ASV211
[212]   400 GGAATATTGGACAATGGGCGAA...CGAAAGCGTGGGTAGCAAACAG ASV212
[213]   404 GGAATATTGCACAATGGGCGCA...GTCAACACTGACGCTCATGTAC ASV213

Subsampling normalization

Rarefaction Curves

Before normalization by sub-sampling, let’s have a look at rarefaction curves, evaluate your sequencing effort and make decisions

Identify your minimum sample size

phyloseq::sample_sums(physeq)

S11B  S1B  S2B  S2S  S3B  S3S  S4B  S4S  S5B  S5S  S6B  S6S  S7B  S7S  S8B  S8S 
 975  837  893  983  878  889  917 1077 1018 1006 1076  937  878  936  846  958 
 S9B  S9S 
 888  991

What is the minimum sample size?

Run rarefaction curves using our custom function `ggrare()` (defined in `R/alpha_diversity.R`)

#Make rarefaction curves & Add min sample size line
ggrare(physeq, step = 10, color = "Description", se = FALSE) +
  geom_vline(xintercept = min(sample_sums(physeq)), color = "gray60")

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 4

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 3

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 4

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 4

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.

Do you think is a good idea to normalize your data using this minimal sample size?

Normalization process for alpha diversity: sub-sampling

physeq_rar <- phyloseq::rarefy_even_depth(physeq, rngseed = TRUE)

Check the number of sequences for each sample using sample_sums function

Did you lost a lot of ASVs?

Run rarefaction curves on normalized data

p0 <- ggrare(physeq_rar, step = 10, color = "Description", se = TRUE)

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 2

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 3

Warning in vegan::rarefy(x[i, , drop = FALSE], n, se = se): most observed count
data have counts 1, but smallest count is 4

Group separation

p0 + facet_wrap(~Geo, ncol = 2)

Rank Abundance Curves

tableASV <- data.frame(phyloseq::otu_table(physeq_rar))
metadonnees <- data.frame(physeq_rar@sam_data)
metadonnees$Geo <- factor(metadonnees$Geo)
ab_ranktab <- rankabuncomp(tableASV, y=metadonnees, factor='Geo',scale='logabun', legend="topright")

IV-Alpha Diversity

Indices

Get taxonomy-based diversity indices

#Get indices with alpha function (NB: index="all" if you want all the indices)
alpha_indices <- microbiome::alpha(
  physeq_rar,
  index = c("observed", "diversity_gini_simpson",
          "diversity_shannon", "evenness_pielou",
          "dominance_relative")
)

#save
write.table(alpha_indices,
            file = file.path(output_alpha, "indices_alpha_resultat.txt"),
            sep = "\t")

#which type?
class(alpha_indices)

[1] "data.frame"

#see
alpha_indices

     observed diversity_gini_simpson diversity_shannon evenness_pielou
S11B       36             0.94478631         3.1464802      0.87804201
S1B        35             0.94779244         3.1777661      0.89379888
S2B        43             0.95777575         3.4080222      0.90609969
S2S        36             0.94145759         3.1120657      0.86843848
S3B        41             0.95039889         3.2759058      0.88214412
S3S        38             0.94913135         3.2468023      0.89257055
S4B        40             0.95702493         3.3506939      0.90832296
S4S        40             0.93634460         3.0974738      0.83967878
S5B        32             0.92715778         2.9785407      0.85942517
S5S        33             0.92532499         2.9933406      0.85609441
S6B        41             0.94395556         3.2135081      0.86534151
S6S        28             0.82478100         2.4409713      0.73253948
S7B        36             0.94359014         3.1577021      0.88117355
S7S        44             0.94887441         3.2782965      0.86631400
S8B        33             0.94521168         3.1084743      0.88902263
S8S        39             0.95175779         3.3076444      0.90284939
S9B        36             0.94430384         3.1059084      0.86672024
S9S        33             0.94875451         3.1803927      0.90959124
     dominance_relative
S11B        0.102747909
S1B         0.102747909
S2B         0.096774194
S2S         0.109916368
S3B         0.109916368
S3S         0.109916368
S4B         0.087216249
S4S         0.143369176
S5B         0.166069295
S5S         0.188769415
S6B         0.136200717
S6S         0.383512545
S7B         0.130227001
S7S         0.106332139
S8B         0.094384707
S8S         0.096774194
S9B         0.094384707
S9S         0.103942652

What can you notice for one sample?

How to show this graphically?

Add the alpha indices result to your metadata (sample_data) phyloseq object

Important because many times you will probably want to add new variables in the phyloseq class object!!!

#Turn into sample_data object : sample_data function
alpha_indices <- phyloseq::sample_data(alpha_indices)
#See
class(alpha_indices)

[1] "sample_data"
attr(,"package")
[1] "phyloseq"

#Add alpha_indices to phyloseq sample_data object: merge_phyloseq function!
physeq_rar <- phyloseq::merge_phyloseq(physeq_rar, alpha_indices)
#See the result
sample_data(physeq_rar)

Get phylogeny based diversity indices: get_NRI_NTI function

#CalculateNRI,NTI,PD...: get_NRI_NTI function
ind_comp <- MicrobiotaProcess::get_NRI_NTI(physeq_rar,
                                           abundance.weighted = FALSE,
                                           metric = "all",
                                           seed = 123)

#Retrieve only those of interest :select function, results are in ind_comp@alpha
indi_comp <- as.data.frame(ind_comp@alpha)
NRI_NTI_PB <- dplyr::select(indi_comp, NRI:PD)
#see
NRI_NTI_PB

              NRI            NTI        PD
S11B -1.700640519 -1.73488080003 3.5323774
S1B   1.176239678 -0.76340361000 3.3155461
S2B  -0.261257404 -0.66474586424 3.6970066
S2S  -2.455150637 -0.74180248862 3.4222234
S3B   0.620451099 -1.29874536798 3.6834814
S3S  -0.141085299 -0.73453694988 3.4403182
S4B  -0.485562292  0.00455541779 3.3171476
S4S   1.773462103 -0.91361917079 3.4365770
S5B   0.156984003  0.13152798473 2.9592082
S5S  -0.134251689  1.89245043841 2.4738611
S6B   0.218507592  1.08807022489 3.1227876
S6S   2.631228189  0.00037617274 2.4229824
S7B  -1.470358724 -0.63713654961 3.1692736
S7S  -2.518357417  0.70828292578 3.4530890
S8B  -0.091630593  0.74473762595 2.9358002
S8S  -1.205969382  0.44642821613 3.3017637
S9B  -0.073949264  0.31007463902 2.8807192
S9S   0.443035146  2.45614696757 2.4373593

Again!!! Add the phylogenetic indices to your metadata (sample_data) phyloseq object

#Turn into sample_data object : sample_data function
NRI_NTI_PB <- phyloseq::sample_data(NRI_NTI_PB)
#Add alpha_indices to phyloseq sample_data object: merge_phyloseq function!
physeq_rar <- phyloseq::merge_phyloseq(physeq_rar, NRI_NTI_PB)
#See the result with all the indices included
sample_data(physeq_rar)

Can you give me one of the most diversified sample based on Simpson/Shannon/Richness/Pielou/PD values observed?

Alpha diversity representations

This section will show you how to plot by different ways the alpha diversity and its customization. Understand how it works!

Alpha representations using phyloseq::plot_richness()

You are limited to the indices calculated by the phyloseq::estimate_richness function (i.e.”Observed”, “Chao1”, “ACE”, “Shannon”, “Simpson”, “InvSimpson”, “Fisher”).

Selected indices + SampName

x allow you to choose the column from sample_data(physeq_rar) for applying the label

phyloseq::plot_richness(physeq_rar, x = "SampName",
                        measures = c("Observed", "Shannon", "Simpson"))

Color by group: `color = Geo` & change sample name

For color option pass the column of sample_data(physeq_rar) that you want. Here different colors is applied depending on Geo (which is North and South, so 2 different colors)

phyloseq::plot_richness(physeq_rar,
                        x = "Description",
                        color="Geo",
                        measures=c("Observed", "Shannon", "Simpson"))

Make box_plot by adding geom_boxplot function

phyloseq::plot_richness(physeq_rar,
                        x="Geo",
                        color="Geo",
                        measures=c("Observed", "Shannon", "Simpson")) +
  ggplot2::geom_boxplot()

Make box_plot : geom_boxplot + fill color of boxplot (fill) + transparency (with alpha)

phyloseq::plot_richness(physeq_rar,
                        x = "Geo",
                        measures = c("Observed", "Shannon", "Simpson")) +
  ggplot2::geom_boxplot(aes(fill = Geo), alpha = 0.4)

Alpha representations using `Microbiome::boxplot_alpha` (not shown)

Again, you are limited to the indices calculated by the Microbiome::alpha function

Alpha representations using ggplot2

Interest: Freedom!! you can use ANY indices that you have calculated from different packages & included in sample_data

#Before : Change your phyloseq class oject sample_data as a dataframe
metadata <- data.frame(sample_data(physeq_rar))

basic: points & color

#You use the columns of the metadata (Geo, observed, groupe etc)
ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_point(aes(color = groupe, fill = groupe))

Deals with superposed points: `geom_dotplot()`

ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_dotplot(binaxis = "y", stackdir = "center", stackgroups = TRUE,
               binwidth = 0.5, aes(color = groupe, fill = groupe)) +
  xlab("Geographic position") +
  ylab("Number of Observed ASVs")

Boxplot & color control : `scale_fill` & `scale_color`

ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_boxplot(alpha = 0.7, aes(color = Geo, fill = Geo)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800"))

ggplot(metadata, aes(x = Geo, y = observed)) + 
  geom_boxplot(alpha = 0.7, aes(color = Geo, fill = Geo)) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800")) +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))

Boxplot, color control & points: `geom_jitter()`

ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_boxplot(alpha = 0.6,
               fill = c("#00AFBB", "#E7B800"),
               color = c("#00AFBB", "#E7B800"))+
  geom_jitter(aes(colour = groupe), position = position_jitter(0.07), cex = 2.2)

Boxplot, color control, points and Mean SD: `stat_summary()`

ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_boxplot(alpha = 0.6,
               fill = c("#00AFBB", "#E7B800"),
               color=c("#00AFBB", "#E7B800")) +
  geom_jitter(aes(colour = groupe), position = position_jitter(0.07), cex = 2.2) +
  stat_summary(fun = mean, geom = "point", shape = 17, size = 3, color = "white") +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1, color = "white")

Combine graphs on same figure: `patchwork`

#Put your  graphs in different variables P1,P2,P3
p1 <- ggplot(metadata, aes(x = Geo, y = observed)) +
  geom_boxplot(alpha = 0.6,
               fill = c("#00AFBB","#E7B800"),
               color=c("#00AFBB","#E7B800")) +
  geom_jitter(aes(colour = groupe), position = position_jitter(0.07), cex = 2.2) +
  theme(axis.title.x = element_blank())

p2 <- ggplot(metadata, aes(x = Geo, y = evenness_pielou)) +
  geom_boxplot(alpha = 0.6,
               fill = c("#00AFBB", "#E7B800"),
               color = c("#00AFBB", "#E7B800")) +
  geom_jitter(aes(colour = groupe), position = position_jitter(0.07), cex = 2.2) +
  theme(axis.title.x = element_blank())

p3 <- ggplot(metadata, aes(x = Geo, y = diversity_gini_simpson)) +
  geom_boxplot(alpha = 0.6,
               fill = c("#00AFBB", "#E7B800"),
               color = c("#00AFBB", "#E7B800")) +
  geom_jitter(aes(colour = groupe), position = position_jitter(0.07), cex = 2.2) +
  theme(axis.title.x = element_blank())

#Put the graph of p1, p2 and p3 on same Figure
p1 + p2 + p3 +
  patchwork::plot_annotation(tag_levels = "A") +
  patchwork::plot_layout(guides = "collect")

Statistical hypothesis for alpha diversity

Normality test: Check the Normal or not normal distribution of your data to choose the right test!

Shapiro test: H0 Null Hypothesis: follows Normal distribution!

Means if p<0.05 -> reject the H0 (so does not follow a normal distribution)

Q-Qplots: Compare your distribution with a theoretical normal distribution

If your data follow a normal distribution, you’re expecting a linear relationship theoritical vs. experimental

Our custom function indices_normality() (defined in R/alpha_diversity.R) plots the results of Shapiro test as well as Q-Qplots.

Select indices to test & run normality check

metadata |>
  dplyr::select(observed,
                diversity_gini_simpson,
                diversity_shannon,
                evenness_pielou,
                PD) |>
  indices_normality(nrow = 3, ncol = 2)

What are your conclusions?

ANOVA: parametric (follows normal distribution) AND at least 3 groups

Anova for Observed ASV and 4 groups

# How many groups used? See the column "groupe" of metadata:
factor(metadata$groupe)

 [1] SGF NBF NBF NBS NBF NBS NBF NBS NBF NBS SGF SGS SGF SGS SGF SGS SGF SGS
Levels: NBF NBS SGF SGS

Variance

# Check homogeneity of variance between groups
# (avoid bias in ANOVA result & keep the power of the test)
# H0= equality of variances in the different populations
stats::bartlett.test(observed ~ groupe, metadata)


    Bartlett test of homogeneity of variances

data:  observed by groupe
Bartlett's K-squared = 3.17979, df = 3, p-value = 0.36473

Conclusion?

Alternative to Bartlett : Levene test (package `car`), less sensitive to normality deviation

Global Test: Anova tell you if that some of the group means are different, but you don’t know which pairs of groups are different!

aov_observed <- stats::aov(observed ~ groupe, metadata)
summary(aov_observed)

            Df  Sum Sq Mean Sq F value Pr(>F)
groupe       3  13.028  4.3426  0.2105 0.8874
Residuals   14 288.750 20.6250

Which pairs of groups are different? -> Post-hoc test: Tukey multiple pairwise-comparisons

signif_pairgroups <- stats::TukeyHSD(aov_observed, method = "bh")
signif_pairgroups

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: stats::aov(formula = observed ~ groupe, data = metadata)

$groupe
         diff        lwr       upr      p adj
NBS-NBF -1.45 -10.304898 7.4048980 0.96316788
SGF-NBF -1.80 -10.148478 6.5484779 0.92176565
SGS-NBF -2.20 -11.054898 6.6548980 0.88664240
SGF-NBS -0.35  -9.204898 8.5048980 0.99943022
SGS-NBS -0.75 -10.083882 8.5838820 0.99530194
SGS-SGF -0.40  -9.254898 8.4548980 0.99915104

Kruskal-Wallis: non-parametric & at least three groups

Kruskal for diversity_shannon and 4 groups

Global test

stats::kruskal.test(diversity_gini_simpson ~ groupe, data = metadata)


    Kruskal-Wallis rank sum test

data:  diversity_gini_simpson by groupe
Kruskal-Wallis chi-squared = 3.92807, df = 3, p-value = 0.26934

Post hoc test: Dunn test (pairwise group test)

signifgroup <- FSA::dunnTest(diversity_gini_simpson ~ groupe,
                           data = metadata,
                           method = "bh")

Warning: groupe was coerced to a factor.

#See
signifgroup

  Comparison           Z     P.unadj      P.adj
1  NBF - NBS  1.82899540 0.067400297 0.40440178
2  NBF - SGF  1.42163731 0.155131569 0.46539471
3  NBS - SGF -0.48866289 0.625080379 0.62508038
4  NBF - SGS  0.64224266 0.520715639 0.62485877
5  NBS - SGS -1.12585250 0.260227956 0.52045591
6  SGF - SGS -0.69808985 0.485121007 0.72768151

T-test: parametric, 2 groups (i.e North Vs. Sud)

stats::bartlett.test(observed ~ Geo, metadata)


    Bartlett test of homogeneity of variances

data:  observed by Geo
Bartlett's K-squared = 0.38191, df = 1, p-value = 0.53658

observed_ttest <- stats::t.test(observed ~ Geo, data = metadata)
#see
observed_ttest


    Welch Two Sample t-test

data:  observed by Geo
t = 0.660078, df = 15.246, p-value = 0.51905
alternative hypothesis: true difference in means between group North and group South is not equal to 0
95 percent confidence interval:
 -2.9660719  5.6327386
sample estimates:
mean in group North mean in group South 
          37.555556           36.222222

Wilcoxon rank sum: non-parametric & 2 Groups

pairwise_test <- ggpubr::compare_means(diversity_gini_simpson ~ Geo,
                                       metadata,
                                       method = "wilcox.test")
#See
pairwise_test

# A tibble: 1 × 8
  .y.                    group1 group2     p p.adj p.format p.signif method  
  <chr>                  <chr>  <chr>  <dbl> <dbl> <chr>    <chr>    <chr>   
1 diversity_gini_simpson South  North  0.796   0.8 0.8      ns       Wilcoxon

Boxplot representation with p-value information

#Boxplot as previously seen
graph_shan <- ggplot(metadata, aes(x = Geo, y = diversity_gini_simpson)) + 
  geom_boxplot(alpha=0.6,
               fill = c("#00AFBB", "#E7B800"),
               color = c("#00AFBB", "#E7B800")) +
  geom_jitter(aes(colour = groupe),
              position = position_jitter(0.02) ,
              cex=2.2)+
  stat_summary(fun = mean, geom = "point",
               shape = 17, size = 3,
               color = "white")

#Add p-value on graph
graph_shan + ggpubr::stat_pvalue_manual(
  pairwise_test,
  y.position = 1,
  label = "p.adj = {p.adj}",
  color = "blue",
  linetype = 1,
  tip.length = 0.01
)

Correlation & linear Regression

Correlation analysis

Methods available are spearman, kendall and pearson. Correlation coefficient r is independent of change of origin and scale (So no data transformation!!). Correlation analysis describes the nature (strength (0->1) and direction +/-) of the relationship between two variables (r), whatever the range and the measurement units of them.

Considerations for statistical tests (test of the value being zero): * Pearson’s test is parametric (normal distribution required) * Spearman’s and Kendall’s tests are non-parametric

Select variables

#Select variables for bivariate correlation
myvariables <- dplyr::select(metadata, SiOH4:PO4,diversity_shannon)
#see
myvariables

     SiOH4   NO2   NO3   NH4   PO4 diversity_shannon
S11B 3.324 0.083 0.756 0.467 0.115         3.1464802
S1B  1.813 0.256 0.889 0.324 0.132         3.1777661
S2B  2.592 0.105 1.125 0.328 0.067         3.4080222
S2S  3.381 0.231 0.706 0.450 0.109         3.1120657
S3B  1.438 0.057 1.159 0.369 0.174         3.2759058
S3S  1.656 0.098 0.794 0.367 0.095         3.2468023
S4B  2.457 0.099 1.087 0.349 0.137         3.3506939
S4S  2.457 0.099 1.087 0.349 0.137         3.0974738
S5B  2.028 0.103 1.135 0.216 0.128         2.9785407
S5S  2.669 0.136 0.785 0.267 0.114         2.9933406
S6B  2.206 0.249 0.768 0.629 0.236         3.2135081
S6S  3.004 0.251 0.927 0.653 0.266         2.4409713
S7B  3.016 0.157 0.895 0.491 0.176         3.1577021
S7S  1.198 0.165 1.099 0.432 0.180         3.2782965
S8B  3.868 0.253 0.567 0.533 0.169         3.1084743
S8S  3.639 0.255 0.658 0.665 0.247         3.3076444
S9B  3.910 0.107 0.672 0.490 0.134         3.1059084
S9S  3.607 0.139 0.644 0.373 0.167         3.1803927

Apply the method

#Apply method pearson
matrixCor <- stats::cor(myvariables, method = "pearson")
#see
matrixCor

                        SiOH4         NO2          NO3         NH4         PO4
SiOH4              1.00000000  0.26802252 -0.727703077  0.44354653  0.11949381
NO2                0.26802252  1.00000000 -0.484402421  0.62864562  0.57757348
NO3               -0.72770308 -0.48440242  1.000000000 -0.49147592 -0.16912716
NH4                0.44354653  0.62864562 -0.491475923  1.00000000  0.77443570
PO4                0.11949381  0.57757348 -0.169127163  0.77443570  1.00000000
diversity_shannon -0.20728823 -0.29316236  0.094763958 -0.24014733 -0.37791579
                  diversity_shannon
SiOH4                  -0.207288231
NO2                    -0.293162363
NO3                     0.094763958
NH4                    -0.240147333
PO4                    -0.377915793
diversity_shannon       1.000000000

# we use a function defined in R/utils.R
# to move the row names content to a new column

df_export(matrixCor, new_rn = "variable")

           variable       SiOH4         NO2          NO3         NH4
1             SiOH4  1.00000000  0.26802252 -0.727703077  0.44354653
2               NO2  0.26802252  1.00000000 -0.484402421  0.62864562
3               NO3 -0.72770308 -0.48440242  1.000000000 -0.49147592
4               NH4  0.44354653  0.62864562 -0.491475923  1.00000000
5               PO4  0.11949381  0.57757348 -0.169127163  0.77443570
6 diversity_shannon -0.20728823 -0.29316236  0.094763958 -0.24014733
          PO4 diversity_shannon
1  0.11949381      -0.207288231
2  0.57757348      -0.293162363
3 -0.16912716       0.094763958
4  0.77443570      -0.240147333
5  1.00000000      -0.377915793
6 -0.37791579       1.000000000

# we can now export

write.table(df_export(matrixCor, new_rn = "variable"),
            file.path(output_alpha, "correlation_matrix.tsv"),
            row.names = FALSE,
            sep = "\t",
            quote = FALSE)

Plot results: corrplot function

corrplot::corrplot(
  matrixCor,
  method="circle",
  type="lower",
  order='hclust',
  tl.col = "black",
  tl.srt = 45,
  tl.cex=0.9,
  diag = FALSE
)

Is the correlation is due to chance? Significance test!

The idea: Test the correlation at the population scale (=Rho) and compare to r (your samples). HO is : there is not a significant linear correlation between x and y in the population. For instance t-test allows to use sample data to generalize an assumption to an entire population.

#Test stats
ptest <- corrplot::cor.mtest(matrixCor, conf.level = .95)
#The p-value are stored in ptest$p
#see
ptest$p

                         SiOH4         NO2          NO3         NH4         PO4
SiOH4             0.0000000000 0.188313687 0.0045596985 0.110136746 0.425648355
NO2               0.1883136871 0.000000000 0.0721195375 0.012780256 0.035320670
NO3               0.0045596985 0.072119538 0.0000000000 0.051193376 0.300785372
NH4               0.1101367463 0.012780256 0.0511933761 0.000000000 0.013913210
PO4               0.4256483554 0.035320670 0.3007853722 0.013913210 0.000000000
diversity_shannon 0.3493471499 0.114960693 0.4193117807 0.112842904 0.052189815
                  diversity_shannon
SiOH4                   0.349347150
NO2                     0.114960693
NO3                     0.419311781
NH4                     0.112842904
PO4                     0.052189815
diversity_shannon       0.000000000

Show only correlations with significant p-values

corrplot::corrplot(
  matrixCor,
  p.mat = ptest$p,
  sig.level = .05,
  method = "circle",
  type = "lower",
  order = 'hclust',
  tl.col = "black",
  tl.srt = 45,
  tl.cex = 0.7,
  diag = FALSE
)

Linear regression

Determination coefficient R² provides percentage variation in y which is explained by all the x together. Its value is (usually) between 0 and 1 and it indicates strength of Linear Regression model. Higher the R² value, data points are less scattered so it is a good model. Lesser the R² value is more scattered the data points.

Shannon ~ Observed

ggplot(metadata, aes(x = observed, y = diversity_shannon)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  ggpmisc::stat_poly_eq(aes(label = paste(after_stat(rr.label),
                                          after_stat(p.value.label),
                                          sep = "*\", \"*")))

What should be your conclusions…be careful…

What is the r value?

Taxonomy: barplot graph

Abundance Transformation

Counts in percentage using `phyloseq::transform_sample_counts()`

pourcentS <- phyloseq::transform_sample_counts(physeq_rar, function(x) x/sum(x) * 100)

See plot:

phyloseq::plot_bar(pourcentS)

What are the separation lines?

Summarise at a given taxonomic level with `phyloseq::tax_glom()`

Remember ranks can be obtained with phyloseq::rank_names()

phyloseq::rank_names(pourcentS)

[1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

Phylum_glom <- phyloseq::tax_glom(pourcentS,
                                  taxrank = "Phylum",
                                  NArm = FALSE)

#Plot at Phylum taxonomic rank, with color
phyloseq::plot_bar(Phylum_glom, fill = "Phylum")

NArm?

Filter phylum (mean of the line): `phyloseq::filter_taxa()`

Let’s filter out the phylums with a mean relative abundance inferior to 1%

Phylum_1 <- phyloseq::filter_taxa(Phylum_glom,
                                  flist = function(x) mean(x) >= 1,
                                  prune = TRUE)

#Plot at Phylum taxonomic rank, with color
phyloseq::plot_bar(Phylum_1, fill = "Phylum")

How to save a table into a file: exemple of phylum taxonomic table

write.table(df_export(otu_table(Phylum_glom)),
            row.names = FALSE,
            file = file.path(output_alpha, "Phylum_pourcent.tsv"),
            sep = "\t")

Remove black lines

phyloseq::plot_bar(Phylum_glom, "Description", fill = "Phylum") +
  geom_bar(aes(colour = Phylum), stat = "identity")

Microbiome package

`microbiome::aggregate_taxa()`

# Order Rank
Order_microb <- microbiome::aggregate_taxa(pourcentS, "Order")

#Filter at 1%
Order1 <- phyloseq::filter_taxa(Order_microb, function(x) mean(x) >= 1, prune = TRUE)

`microbiome::plot_composition()`

p_order <- microbiome::plot_composition(Order1,
                                        otu.sort = "abundance",
                                        sample.sort = "Description",
                                        x.label = "Description",
                                        plot.type = "barplot",
                                        verbose = FALSE) +
  ggplot2::labs(x = "", y = "Relative abundance (%)")
#see
p_order

#Average by group :average_by option
p_order_groupe <- microbiome::plot_composition(Order1,
                                               otu.sort = "abundance",
                                               sample.sort = "Description",
                                               x.label = "Description",
                                               plot.type = "barplot",
                                               verbose = FALSE,
                                               average_by = "Geo") +
  ggplot2::labs(x = "", y = "Relative abundance (%)")

#see
p_order_groupe

Interactive barplot with `plotly::ggplotly()`

plotly::ggplotly(p_order)

How to manage colors in barplots

With the number of Phyla, Order etc a barplot can become very confusing… Need to have distinct color for each taxonomic groups.

Use the library RColorBrewer et scale_fill_manual() See here to understand the possibilities

You can visualise RColorBrewer’s palettes with the following command:

RColorBrewer::display.brewer.all()

Build your own palette

Let’s assemble from two RColorBrewer’s palettes a single 13 colors palette

#See Set2 colors
(col1 <- RColorBrewer::brewer.pal(name = "Set2", n = 8))

[1] "#66C2A5" "#FC8D62" "#8DA0CB" "#E78AC3" "#A6D854" "#FFD92F" "#E5C494"
[8] "#B3B3B3"

#See  Paired colors 
(col2 <- RColorBrewer::brewer.pal(name = "Paired", n = 5))

[1] "#A6CEE3" "#1F78B4" "#B2DF8A" "#33A02C" "#FB9A99"

#Build your set of colors using brewer.pal or your own colors
mycolors <- c(col1, col2)

Use your palette in the p_order barplot

#Use scale_fill_manual
p_order +
  ggplot2::scale_fill_manual("Order", values = mycolors) +
  theme(legend.position = "right",
        legend.text = element_text(size=8))

Other data Manipulation : select specific taxa, merge samples

Select Actinobacteria AND Bacteroidetes: `phyloseq::subset_taxa()`

myselection1 <- phyloseq::subset_taxa(Phylum_glom, Phylum == "Actinobacteriota" | Phylum == "Bacteroidota")

phyloseq::plot_bar(myselection1, x = "Description", fill = "Phylum")

phyloseq::plot_bar(myselection1, x = "Description",
                   fill="Phylum", facet_grid = ~Phylum)

Keep all with the exception of a class, a genus etc (e.g. contamination)

myselection2 <- phyloseq::subset_taxa(physeq, Class != "Thermoplasmata" | is.na(Class))

Understand:

! = means IS NOT

| = AND

Is.na = do not remove the NA (Not Assigned at the Class rank), by default it will be removed. be careful!

Merge samples (groups, duplicates etc)

Use a column from metadata to group/merge samples (North & South)

(NordSud <- phyloseq::merge_samples(physeq_rar, "Geo"))

Warning in asMethod(object): NAs introduced by coercion

Warning in asMethod(object): NAs introduced by coercion

Warning in asMethod(object): NAs introduced by coercion

Warning in asMethod(object): NAs introduced by coercion

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 209 taxa and 2 samples ]
sample_data() Sample Data:       [ 2 samples by 29 sample variables ]
tax_table()   Taxonomy Table:    [ 209 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 209 tips and 208 internal nodes ]

Sample selection: `phyloseq::subset_samples()`

(sub_North <- phyloseq::subset_samples(pourcentS, Geo == "North"))

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 209 taxa and 9 samples ]
sample_data() Sample Data:       [ 9 samples by 29 sample variables ]
tax_table()   Taxonomy Table:    [ 209 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 209 tips and 208 internal nodes ]
refseq()      DNAStringSet:      [ 209 reference sequences ]

Alternative way: `phyloseq::prune_samples`

Define what you want to keep

keep <- c("S1B", "S2S")

Then extract these samples from pourcentS phyloseq object

keep2samples <- phyloseq::prune_samples(keep, pourcentS)
sample_names(keep2samples)

[1] "S1B" "S2S"

Retrieve sequences from a phyloseq object

One sequence:

Biostrings::writeXStringSet(physeq_rar@refseq["ASV1"],
                            filepath = file.path(output_alpha,"ASV1.fasta"),
                            format = "fasta")

By name

listASV <- c("ASV2", "ASV8", "ASV32", "ASV58")

Biostrings::writeXStringSet(physeq_rar@refseq[listASV],
                            filepath = file.path(output_alpha,"several_asvs.fasta"),
                            format = "fasta")

From a selection

Let’s export a fasta files of all ASVs with a maximum relative abundance superior to 10% in North samples:

phyloseq::subset_samples(pourcentS, Geo == "North") |>
  phyloseq::filter_taxa(flist = function(x) max(x) >= 10, prune = TRUE) |>
  phyloseq::refseq() |>
  Biostrings::writeXStringSet(
    filepath = file.path(output_alpha, "fancy_selection_asvs.fasta"),
    format = "fasta"
  )

Retrieve all sequences

Biostrings::writeXStringSet(physeq_rar@refseq,
                            filepath = file.path(output_alpha,"all_asvs.fasta"),
                            format = "fasta")

Core microbiota analysis

Identify the taxa names of the core microbiota

Which core? Compare North & South core microbiota

#Create 2 phyloseq objects for North and South sample groups
sub_North <- phyloseq::subset_samples(pourcentS, Geo == "North")
sub_South <- phyloseq::subset_samples(pourcentS, Geo == "South")

#Check group North ok
sub_North@sam_data

    SampName   Geo Description groupe Pres PicoEuk Synec Prochloro NanoEuk
S1B      S1B North     North1B    NBF   52     660 32195     10675     955
S2B      S2B North     North2B    NBF   59     890 25480     16595     670
S2S      S2S North     North2S    NBS    0     890 25480     16595     670
S3B      S3B North     North3B    NBF   74     835 13340     25115    1115
S3S      S3S North     North3S    NBS    0     715 26725     16860     890
S4B      S4B North     North4B    NBF   78    2220  3130     29835    2120
S4S      S4S North     North4S    NBS   78    2220  3130     29835    2120
S5B      S5B North     North5B    NBF   42    1620 55780     23795    2555
S5S      S5S North     North5S    NBS    0    1620 56555     22835    2560
    Crypto SiOH4   NO2   NO3   NH4   PO4    NT    PT   Chla       T       S
S1B    115 1.813 0.256 0.889 0.324 0.132 9.946 3.565 0.0000 22.7338 37.6204
S2B    395 2.592 0.105 1.125 0.328 0.067 9.378 3.391 0.0000 22.6824 37.6627
S2S    395 3.381 0.231 0.706 0.450 0.109 8.817 3.345 0.0000 22.6854 37.6176
S3B    165 1.438 0.057 1.159 0.369 0.174 8.989 2.568 0.0000 21.5296 37.5549
S3S    200 1.656 0.098 0.794 0.367 0.095 7.847 2.520 0.0000 22.5610 37.5960
S4B    235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S4S    235 2.457 0.099 1.087 0.349 0.137 8.689 3.129 0.0000 18.8515 37.4542
S5B   1355 2.028 0.103 1.135 0.216 0.128 8.623 3.137 0.0102 24.1905 38.3192
S5S    945 2.669 0.136 0.785 0.267 0.114 9.146 3.062 0.0000 24.1789 38.3213
    Sigma_t observed diversity_gini_simpson diversity_shannon evenness_pielou
S1B 26.0046       35             0.94779244         3.1777661      0.89379888
S2B 26.0521       43             0.95777575         3.4080222      0.90609969
S2S 26.0137       36             0.94145759         3.1120657      0.86843848
S3B 26.2987       41             0.95039889         3.2759058      0.88214412
S3S 26.0332       38             0.94913135         3.2468023      0.89257055
S4B 26.9415       40             0.95702493         3.3506939      0.90832296
S4S 26.9415       40             0.93634460         3.0974738      0.83967878
S5B 26.1037       32             0.92715778         2.9785407      0.85942517
S5S 26.1065       33             0.92532499         2.9933406      0.85609441
    dominance_relative         NRI           NTI        PD
S1B        0.102747909  1.17623968 -0.7634036100 3.3155461
S2B        0.096774194 -0.26125740 -0.6647458642 3.6970066
S2S        0.109916368 -2.45515064 -0.7418024886 3.4222234
S3B        0.109916368  0.62045110 -1.2987453680 3.6834814
S3S        0.109916368 -0.14108530 -0.7345369499 3.4403182
S4B        0.087216249 -0.48556229  0.0045554178 3.3171476
S4S        0.143369176  1.77346210 -0.9136191708 3.4365770
S5B        0.166069295  0.15698400  0.1315279847 2.9592082
S5S        0.188769415 -0.13425169  1.8924504384 2.4738611

Change first column name of taxonomy rank

Replace “Kingdom” by “Domain”, needed for the use of add_best function

#Before
colnames(sub_North@tax_table)[1]

[1] "Kingdom"

#Apply change for North
colnames(sub_North@tax_table)[1] <- "Domain"
#See 
colnames(sub_North@tax_table)[1]

[1] "Domain"

Add the lowest taxonomy classification

sub_North <- microbiome::add_besthit(sub_North, sep = ":")

See the transformation of tax_table

head(sub_North@tax_table)

Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
                          Domain     Phylum             Class                
ASV1:Synechococcus CC9902 "Bacteria" "Cyanobacteria"    "Cyanobacteriia"     
ASV2:Pseudoalteromonas    "Bacteria" "Proteobacteria"   "Gammaproteobacteria"
ASV3:Clade Ia             "Bacteria" "Proteobacteria"   "Alphaproteobacteria"
ASV4:Marine Group II      "Archaea"  "Thermoplasmatota" "Thermoplasmata"     
ASV5:Clade Ia             "Bacteria" "Proteobacteria"   "Alphaproteobacteria"
ASV6:Clade II             "Bacteria" "Proteobacteria"   "Alphaproteobacteria"
                          Order              Family                  
ASV1:Synechococcus CC9902 "Synechococcales"  "Cyanobiaceae"          
ASV2:Pseudoalteromonas    "Enterobacterales" "Pseudoalteromonadaceae"
ASV3:Clade Ia             "SAR11 clade"      "Clade I"               
ASV4:Marine Group II      "Marine Group II"  NA                      
ASV5:Clade Ia             "SAR11 clade"      "Clade I"               
ASV6:Clade II             "SAR11 clade"      "Clade II"              
                          Genus                  Species
ASV1:Synechococcus CC9902 "Synechococcus CC9902" NA     
ASV2:Pseudoalteromonas    "Pseudoalteromonas"    NA     
ASV3:Clade Ia             "Clade Ia"             NA     
ASV4:Marine Group II      NA                     NA     
ASV5:Clade Ia             "Clade Ia"             NA     
ASV6:Clade II             NA                     NA

Identify Core microbiota

#North
(core_taxa_north <- microbiome::core_members(sub_North,
                                            detection = 0.0001,
                                            prevalence = 50/100))

 [1] "ASV1:Synechococcus CC9902"            
 [2] "ASV2:Pseudoalteromonas"               
 [3] "ASV3:Clade Ia"                        
 [4] "ASV4:Marine Group II"                 
 [5] "ASV5:Clade Ia"                        
 [6] "ASV6:Clade II"                        
 [7] "ASV9:Clade Ia"                        
 [8] "ASV10:Clade Ia"                       
 [9] "ASV11:AEGEAN-169 marine group"        
[10] "ASV12:Prochlorococcus MIT9313.marinus"
[11] "ASV16:AEGEAN-169 marine group"        
[12] "ASV18:Clade Ib"                       
[13] "ASV22:Clade Ia"                       
[14] "ASV23:Clade Ia"                       
[15] "ASV26:Chloroplast"                    
[16] "ASV30:Marine Group II"                
[17] "ASV32:Dadabacteriales"                
[18] "ASV33:SAR324 clade(Marine group B)"   
[19] "ASV35:Clade IV"                       
[20] "ASV37:Marine Group III"               
[21] "ASV49:SAR202 clade"                   
[22] "ASV53:AEGEAN-169 marine group"

Get core microbiota phyloseq object

Get the phyloseq object with also sequences, phylo tree etc.

(phyloseq_core_north <- microbiome::core(sub_North,
                                        detection = 0.0001,
                                        prevalence = .5))

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 22 taxa and 9 samples ]
sample_data() Sample Data:       [ 9 samples by 29 sample variables ]
tax_table()   Taxonomy Table:    [ 22 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 22 tips and 21 internal nodes ]
refseq()      DNAStringSet:      [ 22 reference sequences ]

See full taxanomy of core members

(tax_mat <- as.data.frame(phyloseq::tax_table(phyloseq_core_north)))

                                        Domain                       Phylum
ASV1:Synechococcus CC9902             Bacteria                Cyanobacteria
ASV2:Pseudoalteromonas                Bacteria               Proteobacteria
ASV3:Clade Ia                         Bacteria               Proteobacteria
ASV4:Marine Group II                   Archaea             Thermoplasmatota
ASV5:Clade Ia                         Bacteria               Proteobacteria
ASV6:Clade II                         Bacteria               Proteobacteria
ASV9:Clade Ia                         Bacteria               Proteobacteria
ASV10:Clade Ia                        Bacteria               Proteobacteria
ASV11:AEGEAN-169 marine group         Bacteria               Proteobacteria
ASV12:Prochlorococcus MIT9313.marinus Bacteria                Cyanobacteria
ASV16:AEGEAN-169 marine group         Bacteria               Proteobacteria
ASV18:Clade Ib                        Bacteria               Proteobacteria
ASV22:Clade Ia                        Bacteria               Proteobacteria
ASV23:Clade Ia                        Bacteria               Proteobacteria
ASV26:Chloroplast                     Bacteria                Cyanobacteria
ASV30:Marine Group II                  Archaea             Thermoplasmatota
ASV32:Dadabacteriales                 Bacteria                 Dadabacteria
ASV33:SAR324 clade(Marine group B)    Bacteria SAR324 clade(Marine group B)
ASV35:Clade IV                        Bacteria               Proteobacteria
ASV37:Marine Group III                 Archaea             Thermoplasmatota
ASV49:SAR202 clade                    Bacteria                  Chloroflexi
ASV53:AEGEAN-169 marine group         Bacteria               Proteobacteria
                                                    Class            Order
ASV1:Synechococcus CC9902                  Cyanobacteriia  Synechococcales
ASV2:Pseudoalteromonas                Gammaproteobacteria Enterobacterales
ASV3:Clade Ia                         Alphaproteobacteria      SAR11 clade
ASV4:Marine Group II                       Thermoplasmata  Marine Group II
ASV5:Clade Ia                         Alphaproteobacteria      SAR11 clade
ASV6:Clade II                         Alphaproteobacteria      SAR11 clade
ASV9:Clade Ia                         Alphaproteobacteria      SAR11 clade
ASV10:Clade Ia                        Alphaproteobacteria      SAR11 clade
ASV11:AEGEAN-169 marine group         Alphaproteobacteria Rhodospirillales
ASV12:Prochlorococcus MIT9313.marinus      Cyanobacteriia  Synechococcales
ASV16:AEGEAN-169 marine group         Alphaproteobacteria Rhodospirillales
ASV18:Clade Ib                        Alphaproteobacteria      SAR11 clade
ASV22:Clade Ia                        Alphaproteobacteria      SAR11 clade
ASV23:Clade Ia                        Alphaproteobacteria      SAR11 clade
ASV26:Chloroplast                          Cyanobacteriia      Chloroplast
ASV30:Marine Group II                      Thermoplasmata  Marine Group II
ASV32:Dadabacteriales                       Dadabacteriia  Dadabacteriales
ASV33:SAR324 clade(Marine group B)                   <NA>             <NA>
ASV35:Clade IV                        Alphaproteobacteria      SAR11 clade
ASV37:Marine Group III                     Thermoplasmata Marine Group III
ASV49:SAR202 clade                        Dehalococcoidia     SAR202 clade
ASV53:AEGEAN-169 marine group         Alphaproteobacteria Rhodospirillales
                                                       Family
ASV1:Synechococcus CC9902                        Cyanobiaceae
ASV2:Pseudoalteromonas                 Pseudoalteromonadaceae
ASV3:Clade Ia                                         Clade I
ASV4:Marine Group II                                     <NA>
ASV5:Clade Ia                                         Clade I
ASV6:Clade II                                        Clade II
ASV9:Clade Ia                                         Clade I
ASV10:Clade Ia                                        Clade I
ASV11:AEGEAN-169 marine group         AEGEAN-169 marine group
ASV12:Prochlorococcus MIT9313.marinus            Cyanobiaceae
ASV16:AEGEAN-169 marine group         AEGEAN-169 marine group
ASV18:Clade Ib                                        Clade I
ASV22:Clade Ia                                        Clade I
ASV23:Clade Ia                                        Clade I
ASV26:Chloroplast                                        <NA>
ASV30:Marine Group II                                    <NA>
ASV32:Dadabacteriales                                    <NA>
ASV33:SAR324 clade(Marine group B)                       <NA>
ASV35:Clade IV                                       Clade IV
ASV37:Marine Group III                                   <NA>
ASV49:SAR202 clade                                       <NA>
ASV53:AEGEAN-169 marine group         AEGEAN-169 marine group
                                                        Genus Species
ASV1:Synechococcus CC9902                Synechococcus CC9902    <NA>
ASV2:Pseudoalteromonas                      Pseudoalteromonas    <NA>
ASV3:Clade Ia                                        Clade Ia    <NA>
ASV4:Marine Group II                                     <NA>    <NA>
ASV5:Clade Ia                                        Clade Ia    <NA>
ASV6:Clade II                                            <NA>    <NA>
ASV9:Clade Ia                                        Clade Ia    <NA>
ASV10:Clade Ia                                       Clade Ia    <NA>
ASV11:AEGEAN-169 marine group                            <NA>    <NA>
ASV12:Prochlorococcus MIT9313.marinus Prochlorococcus MIT9313 marinus
ASV16:AEGEAN-169 marine group                            <NA>    <NA>
ASV18:Clade Ib                                       Clade Ib    <NA>
ASV22:Clade Ia                                       Clade Ia    <NA>
ASV23:Clade Ia                                       Clade Ia    <NA>
ASV26:Chloroplast                                        <NA>    <NA>
ASV30:Marine Group II                                    <NA>    <NA>
ASV32:Dadabacteriales                                    <NA>    <NA>
ASV33:SAR324 clade(Marine group B)                       <NA>    <NA>
ASV35:Clade IV                                           <NA>    <NA>
ASV37:Marine Group III                                   <NA>    <NA>
ASV49:SAR202 clade                                       <NA>    <NA>
ASV53:AEGEAN-169 marine group                            <NA>    <NA>

Visualise core microbiome with `microbiome::plot_core()`

Visualise the core microbiome of North samples

microbiome::plot_core(phyloseq_core_north,
                      plot.type = "heatmap",
                      colours = rev(RColorBrewer::brewer.pal(8, "RdBu")),
                      prevalences = seq(from = 0, to = 1, by = .1),
                      detections = seq(from = 0.1, to = 5, by = 0.2)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2))+
  xlab("Detection Threshold (Relative Abundance (%))") +
  ylab("ASVs")

Warning in microbiome::plot_core(phyloseq_core_north, plot.type = "heatmap", : The plot_core function is typically used with compositional 
                data. The data is not compositional. Make sure that you
                intend to operate on non-compositional data.

Do the same for the South samples .. please!

What are your conclusions about the comparison between North & South core micobiota at the ASV level?