MicrobiomeAnalyst is a user-friendly, comprehensive web-based tool for analyzing data sets generated
from microbiome studies (16S rRNA, metagenomics or metatranscriptomics data). The focus of this tool is to perform
statistical analysis, visual exploration, and data integration.
It is designed to support four kinds of analysis:
Compositional profiling: (16S rRNA marker gene data) provides various common statistical methods for alpha and beta diversity analysis,
coupled with different types of visualization support (bar charts, stacked area plots and pie-charts) to help users obtain a
comprehensive view of the composition of their uploaded microbial communities.
Functional profiling: (16S rRNA data) supports the prediction of metabolic potentials; (shotgun metagenomics or metatranscriptomics data)
supports direct functional profiling at various levels: metabolic modules, metabolic pathways or higher functional (COG or EC) levels.
Comparative analysis: supporting multiple data filtering and normalization techniques coupled with differential analysis methods
(LEfSe, metagenomeSeq, edgeR, DESeq2, etc.) to identify features that are significantly different between conditions under study.
Meta analysis: allows users to directly compare their data sets with published human microbiome data in the projection with public data module (PPD),
or compare important features with those defined or identified from literature or public resources in the taxon set enrichment analysis (TSEA) module.
MicrobiomeAnalyst supports common data formats generated in microbiome studies including
OTU tables, BIOM files, as well as outputs from both QIIME and mothur, together with associated meta-data files.
Detailed descriptions, screenshots, and example datasets can be found on the
Data Format page.
For 16S rRNA data, there are several pipelines for processing the raw sequence data into an OTU table. For instance,
QIIME: script make_otu_table.py produces an OTU table in biom format.
Mothur: commands such as make.shared and make.biom are used to create OTU tables in shared and biom formats respectively.
UPARSE: an OTU table can be generated using the usearch_global command in biom (QIIME) or shared (mothur) format.
For shotgun sequencing data, HUMAnN and MG-RAST are the two main platforms that can be used for data annotation to
generate a KO count table. More recently, QIIME also offers support for shotgun metagenomics data using its established pipelines (details).
Maximum and minimum read counts may not correspond to the original ASV/OTU/KO table because the original data file goes through some
pre-processing steps during the Sanity Check. First, features whose counts sum to 1 across all samples are removed (they would be uninformative downstream).
Additionally, features that have a variance of 0 are dropped. Then the min/max number of OTUs #>2 are calculated from this filtered table.
Note, the minimum sample count is the highest non-zero value.
The purpose of data filtering is to allow users to remove low quality and/or uninformative features to improve
downstream statistical analysis. MicrobiomeAnalyst contains three data filtering procedures:
Minimal data filtering (applied to all analysis) - the procedure will remove features containing all zeros, or appearing in only one sample.
These extremely rare features should be removed from consideration;
Low abundance features - they may be due to sequencing errors or low-level contaminations;
Low variance features - they are unlikely to be associated with the conditions under study;
Note, the last two options are not used for within-sample (alpha diversity) profiling, but are strongly recommended for comparative analysis.
Users can choose from low count filter options to filter features with very small counts in very few samples (i.e. low prevalence).
If the primary purpose is comparative analysis, users should filter features that exhibit low variance based on
interquartile range, standard deviation or coefficient of variation, as they are very unlikely to be significant
in the comparative analysis.
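As a rough illustration of the low count and low variance filters described above, the sketch below applies both to a small feature-by-sample count table using plain Python. The function names, the thresholds (a minimum count of 4 in at least 20% of samples, removal of a chosen fraction of features ranked by interquartile range) and the toy data are assumptions made for this example; they are not the code or necessarily the defaults used by MicrobiomeAnalyst.

```python
from statistics import quantiles

def low_count_filter(table, min_count=4, min_prevalence=0.2):
    """Keep features with at least `min_count` reads in at least
    `min_prevalence` (a fraction) of the samples."""
    kept = {}
    for feature, counts in table.items():
        n_pass = sum(1 for c in counts if c >= min_count)
        if n_pass / len(counts) >= min_prevalence:
            kept[feature] = counts
    return kept

def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def low_variance_filter(table, fraction_removed=0.1):
    """Drop the `fraction_removed` share of features with the smallest IQR."""
    ranked = sorted(table, key=lambda f: iqr(table[f]))
    n_drop = int(len(ranked) * fraction_removed)
    dropped = set(ranked[:n_drop])
    return {f: c for f, c in table.items() if f not in dropped}

# Hypothetical toy table: feature -> counts across five samples
toy = {
    "OTU_1": [120, 98, 143, 110, 87],
    "OTU_2": [0, 0, 3, 0, 0],        # low count and low prevalence
    "OTU_3": [15, 15, 15, 16, 15],   # nearly constant across samples
    "OTU_4": [5, 40, 0, 62, 9],
}
after_count = low_count_filter(toy)
after_variance = low_variance_filter(after_count, fraction_removed=0.34)
```

In this toy table the low count filter removes OTU_2, and the variance filter then removes OTU_3, whose counts barely change between samples.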
Metagenomic data possess some unique characteristics such as vast differences in sequencing depth,
sparsity (containing many zeros) and large variance in distributions (overdispersion). These unique attributes
have made it inappropriate to directly apply methods developed in other omics fields to perform differential
analysis on metagenomic data. To account for these issues, MicrobiomeAnalyst includes various
normalization methods such as:
Rarefaction and scaling methods: these methods deal with uneven sequencing depths by bringing samples to the same scale for comparison.
Transformation methods: these include methods to deal with sparsity, compositionality, and large variations within the data.
MicrobiomeAnalyst supports a variety of methods for data normalization. A brief summary is provided below:
Total Sum Scaling (TSS) normalization: this method removes technical bias related to different
sequencing depth in different libraries via simply dividing each feature count with the
total library size to yield relative proportion of counts for that feature.
For easier interpretation, we can multiply it by 1,000,000 to get the number of reads
corresponding to that feature per million reads. LEfSe utilizes this kind of approach.
Relative log expression (RLE) normalization: this is the scaling factor method
proposed by Anders and Huber (2010). This method calculates the median library from the geometric
mean of all columns. The median ratio of each sample to the median library is taken as the
scaling factor.
Trimmed mean of M-values (TMM) normalization: this is the weighted trimmed mean of M-values proposed by Robinson and Oshlack
(2010), where the weights are from the delta method on Binomial data.
Upper Quantile normalization: this is the upper-quartile normalization method of Bullard et al (2010),
in which the scale factors are calculated from the 75% quantile of the counts for each library,
after removing features which are zero in all libraries. This idea is generalized here to allow
scaling by any quantile of the distributions.
Cumulative Sum Scaling (CSS) normalization: it calculates the quantile of the count distribution of samples
where they all should be roughly equivalent and independent of each other up to this quantile under
the assumption that, at this range, counts are derived from a common distribution.
By default, metagenomeSeq utilizes this approach for differential analysis.
Centered Log-Ratio (CLR) Transformation: This method is specially designed to normalize compositional
data. It converts the relative abundances of each part, or the values in the table of counts for each part,
to ratios between all parts by calculating the geometric mean of all values. This method is reliable only
if the data set is not sparse, because the geometric mean cannot be computed if any of the feature counts are zero.
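To make two of the methods above concrete, the following minimal sketch computes TSS and CLR values for a single sample in plain Python. The function names and the toy counts are hypothetical, and the pseudocount added before taking logarithms in the CLR function is an assumption of this example (one common way to handle zeros), not a statement of how MicrobiomeAnalyst treats them.

```python
import math

def tss(counts, scale=1_000_000):
    """Total sum scaling: divide each count by the library size,
    then rescale to counts per `scale` reads."""
    total = sum(counts)
    return [c / total * scale for c in counts]

def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the mean of the logs,
    i.e. the log of the ratio to the geometric mean.
    The pseudocount is an assumption of this sketch so that zero counts
    do not break the logarithm."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

sample = [500, 300, 150, 50, 0]   # hypothetical counts for one sample
per_million = tss(sample)         # scaled values add up to one million
centered = clr(sample)            # CLR values add up to (about) zero
```

The two results have different properties: TSS values are proportions and always sum to the chosen scale, while CLR values are centered on zero and can be negative.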
Please note, data normalization is mainly used for visual data exploration such as beta-diversity and clustering analysis.
It is also used for comparative analysis using statistical methods without known normalization procedures that work
best (univariate statistics and LEfSe). Meanwhile, other comparative analyses will use their own specific
normalization methods. For example, cumulative sum scaling (CSS) normalization is used for metagenomeSeq,
and trimmed mean of M-values (TMM) is applied for edgeR.
At the moment, there is no consensus guideline with regard to which normalization should be used. Users are advised to explore different approaches and
then visually examine the separation patterns (e.g. PCoA plot) to assess the effects of different normalization procedures with regard to experimental conditions
or other meta-data of interest. For detailed discussion about these methods, users are referred to two recent papers
Paul J. McMurdie et al. and
Jonathan Thorsen et al.
Whenever the sequencing depth of your samples differs too much (i.e. >10X), it is recommended to perform
rarefaction before normalizing your data. Note, users should also consider
removing the shallowly sequenced samples, as such a gross difference could be due to experimental failure. For more details, please
refer to the paper by Weiss, S et al. In MicrobiomeAnalyst,
users can directly visualize the sample size (available on the Data Inspection page) to check the need of rarefaction for their data.
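For readers unfamiliar with rarefaction, the sketch below shows the basic idea: reads are drawn at random without replacement from each sample until every sample has the same depth. The function name, the fixed random seed and the toy samples are assumptions of this example, and the choice of the smallest library size as the target depth is only one possible convention.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from one sample.
    `counts` is a list of per-feature read counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the library size of this sample")
    # Expand counts into one entry per read, labelled by feature index
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    rng = random.Random(seed)
    picked = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for i in picked:
        rarefied[i] += 1
    return rarefied

# Hypothetical samples with a 10-fold difference in sequencing depth
deep = [4000, 3000, 2000, 900, 100]   # 10,000 reads
shallow = [400, 300, 200, 90, 10]     # 1,000 reads
depth = min(sum(deep), sum(shallow))
deep_rarefied = rarefy(deep, depth)
shallow_rarefied = rarefy(shallow, depth)
```

After rarefaction both samples contain 1,000 reads; the shallow sample is unchanged because all of its reads are kept.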
Outliers refer to those samples that are significantly different from the "majority".
The potential outlier will distinguish itself as the sample located far away from major
clusters or trends formed by the remaining data. To discover potential outliers, users can
use a variety of summary plots to visualize their data. For instance, a sample with extreme
diversity (alpha or beta) or very low sequencing depth (rarefaction curve analysis or sample size viewer)
may be considered a potential outlier. Outliers may arise due to either biological or technical reasons. To deal with outliers,
the first step is to check if those samples were measured properly. In many cases, outliers are
the result of operational errors during the analytical process. If those values cannot be corrected
(i.e. by normalization procedures), they should be removed from your analysis via Sample Editor.
Yes. The data files you upload for analysis as well as any analysis results, are not downloaded or examined in any way by
the administrators, unless required for system maintenance and troubleshooting. All files are deleted from the server after no
more than 72 hours, and no archives or backups are kept. You are advised to download your results as a zip file immediately
after performing an analysis.
The Not_Assigned category contains features that have NA at that certain rank.
For instance if you're looking at the Species level, but 30% of features do not have a Species rank,
these are lumped together as Not_Assigned.
MicrobiomeAnalyst is intended for the comparative analysis of microbiome data. Therefore, we require
a minimum of two different taxa at each taxonomy level. For instance, if you upload a taxonomy table including
Phylum -> Species information but only see Species in the "Taxonomy level" drop-down menus, this means that
the Phylum, Class, Order, Family and Genus taxa are all the same, and therefore cannot be used for comparative analysis.
Prepending higher taxonomic names to the current taxa names can change results if many of your features
are unassigned at the current taxonomic resolution. This is because MicrobiomeAnalyst aggregates all unassigned
features into one when plotting the graphs. By prepending a higher taxonomic name, this will distinguish the unassigned
features into different features.
Alpha diversity is used to measure the diversity within a sample or an ecosystem. The two most commonly used alpha-diversity
measurements are Richness (count) and Evenness (distribution). MicrobiomeAnalyst provides many metrics to
calculate diversity within samples. Most commonly used ones are listed below:
Richness: the observed or estimated number of unique species in the community (abundant and rare species are treated equally).
Observed: the number of unique OTUs found in each sample.
ACE and Chao1: also account for unobserved species based on low-abundance OTUs.
Evenness: These metrics account for both richness and abundance.
Simpson and Shannon diversity measures are commonly used. Simpson diversity gives more weight to species with higher frequency in a sample,
while the Shannon method gives more weight to rare species.
Further details on different measures of diversity can be found here.
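The metrics listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, Chao1 is written in its bias-corrected form, Simpson is expressed as 1 minus the sum of squared proportions (one common convention), and ACE is omitted for brevity. The tool itself may use different variants.

```python
import math

def observed(counts):
    """Number of features with a non-zero count."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singletons and doubletons."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

def shannon(counts):
    """Shannon index: -sum(p * ln(p)) over features with p > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson index in the form 1 - sum(p^2)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

even_sample = [10, 10, 10, 10]     # hypothetical, perfectly even
skewed_sample = [37, 1, 1, 1]      # hypothetical, one dominant feature
```

Both toy samples have the same observed richness (4), but the even sample scores higher on Shannon and Simpson, which is the distinction between richness and evenness drawn above.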
These bars basically represent the Standard Error (SE) in calculating the alpha-diversity.
Methods that calculate a SE include ACE and Chao1, which are used as similarity estimators, allowing
for rigorous comparison of two or more similarity index values. SEs are computed by a bootstrap procedure,
which requires resampling the observed data and recomputing the estimators many times.
MicrobiomeAnalyst allows users to compute diversity based on either original OTUs or collapsing the data at different taxonomy
levels. Note, in the latter case, OTUs without taxa designation will be collapsed into a “Not_Assigned” category,
which could be an arbitrary mix of OTUs from across different levels. In some cases, OTUs without genus/species information
are frequently both more abundant and more representative of total diversity than are OTUs with genus/species names.
Because of these issues, to understand the real diversity, it is recommended to first perform diversity analysis at OTU level
before collapsing data by taxonomy assignment.
When OTUs are well annotated or the selected taxonomy level includes the majority of the OTUs, it is biologically
useful to perform diversity analysis at higher taxa levels for both data reduction and hypothesis generation.
Good's coverage is defined as 1 - (F1/N) where F1 is the number of singleton OTUs and N is the sum of counts for all OTUs.
If a sample has a Good's coverage of 0.98, this means that 2% of your reads in that sample are from OTUs that appear only once in that sample.
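The definition above translates directly into code. The short sketch below (function name and toy counts are hypothetical) reproduces the 0.98 example: a sample of 100 reads in which two OTUs are singletons.

```python
def goods_coverage(counts):
    """Good's coverage: 1 - (F1 / N), where F1 is the number of
    singleton OTUs and N is the total number of reads."""
    singletons = sum(1 for c in counts if c == 1)
    total = sum(counts)
    return 1 - singletons / total

sample = [60, 30, 8, 1, 1]          # 100 reads, two singleton OTUs
coverage = goods_coverage(sample)   # approximately 0.98
```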
Beta diversity represents the explicit comparison of microbial communities (between-samples) based on the
measure of the distance or dissimilarity between each sample pair. Beta diversity is calculated for every pair of
samples to generate a distance or dissimilarity matrix, to which ordination-based methods such as
Nonmetric Multidimensional Scaling (NMDS) or Principal Coordinates Analysis (PCoA) are then applied for visual representation
in low-dimensional space (i.e. 2D-3D plots).
MicrobiomeAnalyst allows users to choose from various beta diversity metrics, including:
Bray-Curtis dissimilarity is simply calculated as: 1-(2w/(a+b)) where w is the sum
of the lesser scores for only those species which are present in both communities, a is the sum
of the measures of taxa in one community and b is the sum of the measures of taxa in the other community.
Jaccard index is computed as 2B /(1+B), where B is Bray–Curtis dissimilarity.
UniFrac distances are based on branches in a phylogenetic tree that are either shared or unique amongst samples.
The unweighted UniFrac only considers the presence or absence of the species, while weighted UniFrac considers the actual
abundance. The key feature of phylogenetic tree-based distance is that differences in community structure attributable to closely
related organisms are weighted less heavily than are differences arising from distantly related organisms.
Further details on different measures of diversity can be found here.
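The Bray-Curtis and Jaccard formulas given above can be checked with a small worked example. The sketch below is illustrative only: the function names and the two toy communities are hypothetical, and the "direct" Jaccard function uses the quantitative form 1 - sum(min)/sum(max), included here to show that it agrees with the 2B/(1+B) conversion.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 1 - 2w / (a + b), where w is the sum of
    the lesser counts of each shared taxon and a, b are the sample totals."""
    w = sum(min(xi, yi) for xi, yi in zip(x, y))
    return 1 - 2 * w / (sum(x) + sum(y))

def jaccard_from_bray(b):
    """Jaccard index obtained from Bray-Curtis dissimilarity B as 2B / (1 + B)."""
    return 2 * b / (1 + b)

def jaccard_direct(x, y):
    """Quantitative Jaccard computed directly: 1 - sum(min) / sum(max)."""
    shared = sum(min(xi, yi) for xi, yi in zip(x, y))
    union = sum(max(xi, yi) for xi, yi in zip(x, y))
    return 1 - shared / union

# Hypothetical counts of four taxa in two communities
community_a = [10, 0, 5, 5]
community_b = [5, 5, 0, 10]
b = bray_curtis(community_a, community_b)   # w = 10, a = b = 20, so B = 0.5
j = jaccard_from_bray(b)                    # 2 * 0.5 / 1.5, about 0.667
```

Here w = 10 (5 from the first taxon and 5 from the fourth), so B = 1 - 20/40 = 0.5, and both Jaccard routes give 2/3.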
UniFrac distance is based on the phylogenetic tree you provided during data upload. If this tree is not found, the UniFrac
(weighted or unweighted) will fail. A phylogenetic tree can be generated using QIIME or mothur by following their manuals.
After displaying the samples on PCoA or NMDS plots, they can be colored based on:
Particular experimental factor from the meta-data (default option);
Their alpha diversity measures (shown as color gradients);
Abundance levels (shown as color gradients) of a particular feature (i.e. particular phylum, family, species, etc.) that they contain.
Note the use of this coloring option can be combined with the "label by" option to reveal more details about the
available patterns (see beta-diversity tutorial).
Depending on the sequencing technologies used (which determine the taxonomic resolution that can be obtained), it is possible
that more OTUs can be confidently assigned at the family level, but fewer of them can be assigned at the species level, and perhaps none at the strain level.
In such cases, the unassigned species will be collapsed under "Not_Assigned" category which vastly reduces the number of features available for comparison and exploration.
This method is used to present relationship between number of OTUs and number of sequences.
It can show species richness from results of sampling.
It can indicate whether the reads of a sample are enough to reach a plateau, which means that as more sequences are added, the gain of newly discovered OTUs is limited.
If the sequencing depth of some samples is not enough, you may consider resequencing these samples or removing them from downstream analysis.
A phylogenetic tree is a diagram which represents evolutionary relationships among species.
This method is used to determine the evolutionary relationship among taxonomic groups.
It reflects how they evolved from common ancestors.
If two organisms share a more recent common ancestor, they are more related.
Therefore, a phylogenetic tree can be used to represent distances between them, which will be used for UniFrac distance based analysis, such as phylogenetic beta diversity.
This method is used to compare abundance of different taxonomic levels for each pair of factors in a metadata variable.
Heat Tree uses the hierarchical structure of taxonomic classifications to quantitatively (median abundance) and statistically (non-parametric Wilcoxon Rank Sum test) depict taxon differences among communities.
It generates a differential heat tree to show which taxa are more or less abundant in each group.
There exist several simple methods for computing correlation networks such as Pearson’s correlation, which determines if linear relationships exist between two taxa,
or Spearman’s and Kendall’s rank correlation, which measure rank relationships between pairs. However, these naïve methods fail to address the compositional nature of
microbiome data and can be unreliable due to the identification of spurious correlations. Alternatively, compositionally-robust methods including SparCC and SPIEC-EASI
have been introduced, both of which make a strong assumption of a sparse correlation network. SparCC uses a log-ratio transformation and performs iterations to
identify taxa pairs that are outliers to background correlations. Meanwhile, SPIEC-EASI uses graphical network models to infer the entire correlation network at once.
Both methods are computationally intensive, though an efficient implementation of the SparCC algorithm named FastSpar was recently published. Due to its significantly
enhanced performance, MicrobiomeAnalyst implements FastSpar as well as Pearson’s, Spearman’s and Kendall’s correlation for network creation.
MetagenomeSeq is designed to assess differential abundance in sparse high-throughput microbial marker-gene survey data. This approach uses a
combination of Cumulative Sum Scaling (CSS) normalization with zero-inflated Gaussian distribution mixture or zero-inflated Log-Normal mixture model
that accounts for undersampling in large-scale marker-gene studies. For more details about its implementation please refer to the original paper by
Joseph N. Paulson et al.
There are two statistical models according to which data has been fitted or reshaped in metagenomeSeq to perform differential analysis:
fitFeature: It is based on a zero-inflated Log-Normal mixture model. This approach is recommended by the author.
Note this model currently only supports two-group comparisons.
fitZig: It is based on a zero-inflated Gaussian mixture model. It can be used when multiple groups are present
for differential abundance testing.
EdgeR and DESeq2 are both well-established methods for RNAseq data analysis. They differ in their method of data
normalization and the algorithms used for estimation of dispersion. In general, edgeR is more powerful (detecting more DE features) but
also comes with higher false positives. DESeq2 is more robust in estimating the DE features (i.e. low false positive rates).
DESeq2 is more computationally intensive and may take a long time for large data sets. For more details about their implementation please refer to
DESeq2 and EdgeR
papers.
The following guidelines are based on a recent paper by Weiss et al. on 16S data.
DESeq2 has the highest power to compare groups, especially for fewer than 20 samples per group.
MetagenomeSeq’s fitZIG is a faster alternative to DESeq2 for large samples (over 50 samples per group).
For larger sample sizes (over 50 samples), rarefying paired with a non-parametric test, such as the Mann-Whitney test, can also yield equally high sensitivity.
LDA Effect Size (LEfSe) is a biomarker discovery and explanation method for high-dimensional data. It integrates statistical significance with biological consistency (effect size) estimation.
In particular, it first performs the non-parametric factorial Kruskal-Wallis (KW) sum-rank test to detect features with significant differential abundance with
respect to the class of interest, followed by Linear Discriminant Analysis to estimate the effect size of each differentially abundant feature.
The result consists of a table listing all the features, the logarithmic value of the highest mean among all the classes,
and if the feature is discriminative, the class with the highest mean and the logarithmic LDA score (Effect Size). Please refer to the original paper
by Segata et al. for more detailed explanations.
The original LEfSe implementation, which is available on the Huttenhower Galaxy (https://huttenhower.sph.harvard.edu/galaxy/), considers the entire set of taxa (all taxonomic ranks)
when performing LEfSe. In comparison, the MicrobiomeAnalyst implementation only performs LEfSe at the user’s specified taxonomic level. Additionally, the original LEfSe implementation uses
original p-values when determining significant taxa. Meanwhile, the MicrobiomeAnalyst implementation provides users the option to use either original or FDR adjusted p-value cutoffs to determine significant features.
The Random Forests algorithm uses an ensemble of classification trees (forests), with class prediction based on the
majority vote of the ensemble. It can provide an unbiased estimate of the classification error by aggregating
cross-validation results using bootstrapped samples, as the forest is built. In addition, the algorithm also
provides feature importance measures by calculating the increase of the classification error when a feature is permuted.
A graphical output is generated to summarize its classification performance. More details about Random Forest can be found
here.
The graphical output from Random Forests shows the changes of error rates with regard to the number of trees as the forest is built.
It can indicate when the performance will stabilize after a certain number of trees. Users can also see which groups tend to have
higher error rates, which group is relatively easy to predict. For instance, the example below shows that using ~120 trees, the
algorithm can achieve perfect prediction for each group.
Partial correlation measures the strength of a linear relationship between two variables whilst controlling for the effect
of one or more confounding variables. For example, between serum metabolites and bacterial species whilst controlling
for BMI and age. MicrobiomeAnalystR incorporates the ppcor R package to perform partial
correlations, the details of which can be found here.
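To illustrate what "controlling for" a confounder means, the sketch below computes a first-order partial correlation (a single confounding variable) from three pairwise Pearson correlations. It is not the ppcor implementation, which handles several covariates at once; the function names and the toy values for a metabolite, a bacterial species and BMI are invented for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical measurements for five subjects
bmi = [21, 23, 25, 27, 29]
metabolite = [5, 3, 3, 5, 9]
species = [7, 11, 10, 9, 13]

raw = pearson(metabolite, species)                 # about 0.46
adjusted = partial_corr(metabolite, species, bmi)  # about 0
```

The toy data were constructed so that the metabolite and the species are related only through BMI: the raw correlation is moderate, but it vanishes once BMI is controlled for.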
The 16S rRNA sequencing data can be used to predict the metabolic potentials of the corresponding microbial
species based on their phylogenetic distances or sequence similarities to those species whose whole genomes have been
sequenced and annotated. In particular, PICRUSt is used for Greengenes annotated data and Tax4Fun is used for
SILVA annotated data. The result is a table containing relative KO abundance levels. The underlying abundance table
is available for download. Once downloaded, the results can be further explored using metabolic network visualization
through Shotgun Data Profiling (SDP) module.
PICRUSt is an evolutionary modeling algorithm which estimates the properties of ancestral organisms from living relatives
(ancestral state reconstruction or ASR). The underlying assumption is that taxonomically similar organisms will have similar functional capabilities.
Essentially it consists of two steps:
Gene content Inference: It estimates the gene content of microorganisms for which no genome sequence
is available, by using their sequenced relatives as a reference.
Metagenome Inference: Since the number of gene copies for each gene family per organism has already been
estimated in the precalculated files, producing a metagenome prediction is handled by simply multiplying the
vector of gene counts for each OTU by the abundance of that OTU in each sample, and summing across all OTUs.
For more details about the methodology, please visit the
PICRUSt page.
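The metagenome inference step described above is essentially a multiply-and-sum operation, which the following sketch illustrates. It covers only that step as described here, with no other processing; the function name, the placeholder OTU and gene family labels and the copy numbers are all hypothetical and are not taken from the PICRUSt precalculated files.

```python
def predict_metagenome(gene_copies, otu_abundance):
    """Multiply each OTU's gene copy numbers by its abundance in every
    sample and sum the products across OTUs.

    gene_copies:   OTU -> {gene family: copy number}
    otu_abundance: OTU -> list of abundances, one per sample
    returns:       gene family -> list of predicted counts, one per sample
    """
    n_samples = len(next(iter(otu_abundance.values())))
    predicted = {}
    for otu, abundances in otu_abundance.items():
        for gene, copies in gene_copies.get(otu, {}).items():
            row = predicted.setdefault(gene, [0.0] * n_samples)
            for j, abundance in enumerate(abundances):
                row[j] += copies * abundance
    return predicted

# Hypothetical inputs: two OTUs, two gene families, two samples
gene_copies = {
    "OTU_1": {"gene_A": 2, "gene_B": 1},
    "OTU_2": {"gene_A": 1},
}
otu_abundance = {
    "OTU_1": [10, 0],
    "OTU_2": [5, 20],
}
metagenome = predict_metagenome(gene_copies, otu_abundance)
```

For the first sample, gene_A receives 2 x 10 from OTU_1 plus 1 x 5 from OTU_2, giving 25; in the second sample only OTU_2 contributes, giving 20.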
Tax4Fun is an open source R package that predicts the functional capabilities of microbial communities based on 16S rRNA data.
It can be applied to the output of 16S rRNA analysis pipelines that can perform a mapping of 16S rRNA gene reads to SILVA database.
Prediction of functional profiles using Tax4Fun includes three steps:
i) SILVA-based 16S rRNA profile is transformed to a taxonomic profile of the prokaryotic KEGG organisms. This linear transformation is
realized by a precomputed association matrix.
ii) The estimated abundances of KEGG organisms are normalized by the 16S rRNA copy number obtained from the NCBI genome annotations.
iii) The normalized taxonomic abundances are used to linearly combine the precomputed functional profiles of the KEGG organisms for the
prediction of the functional profile of the microbial community.
For complete details about the methodology, please visit the
Tax4Fun page.
Tax4Fun can only be used on data with SILVA taxonomy labels. If the user's data use SILVA taxonomy but an error still occurs,
it is likely because of how the taxonomy table is formatted. Users must use the raw taxonomy file; other files will throw an error.
For further details, please visit the StackOverFlow page.
PICRUSt predictions are based on the nearest neighbours using the topology of the phylogenetic tree and the distance to the next sequenced organism
(it will link all OTUs even if the distances are large), while Tax4Fun is based on nearest neighbours based on minimum rRNA sequence similarity.
Secondly, PICRUSt can only be used with Greengenes annotated OTUs while Tax4Fun depends upon SILVA labelled OTUs for its predictions.
In general, KO predictions are more complete with PICRUSt, but show better correspondence with Tax4Fun.
A detailed comparison of Tax4Fun with PICRUSt can be found here.
KEGG and COG are two high-quality databases that offer organism-independent annotation for genes from shotgun microbiome studies.
In MicrobiomeAnalyst, the KO annotation and visualization is mainly based on the global metabolic maps (ko01100). It contains
161 KEGG pathways and 236 KEGG modules. Users can perform mapping and visual exploration as stacked area plots or metabolic networks.
Taken together, KEGG KO annotations offer a more detailed view on the metabolic capacity of the microbial community, with potential mechanistic interpretation.
The COG annotation provides 25 functional categories related to METABOLISM, CELLULAR PROCESSES AND SIGNALING, INFORMATION STORAGE AND PROCESSING.
They may provide a broader classification or more functional coverage as compared to KEGG. COG groups together functions
that are "related" according to the common biochemical knowledge, although no evidence of this relatedness is recorded. In addition,
many COG Categories contain COGs that are rather loosely associated with certain biological activity. Users should be cautious
to use the presence or absence from the COG pathway profile as an indication of the presence or absence of a certain metabolic pathway.
MicrobiomeAnalyst offers different approaches for computing the abundance of higher functional categories in order to deal with this kind of issue, including:
Total hits: simply sum all the hits belonging to each category. For KOs belonging to multiple groups, they will be
counted multiple times;
Total hits normalized by category size: same as above, with each sum further normalized by the size of the category.
Sum of the weighted hits: the weight of a unique group member is 1; for those belonging to n groups,
their weight will be 1/n (i.e. distributing hits equally to those groups).
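The three counting schemes above differ only in how hits from KOs with multiple memberships are handled, as the sketch below shows. The function name, the placeholder KO and category labels, and the interpretation of "category size" as the number of KOs mapped to the category are assumptions of this example.

```python
def category_abundance(ko_hits, ko_to_categories):
    """Return (total, normalized, weighted) abundance per category.

    total:      sum of hits; a KO in several categories counts in each
    normalized: total divided by the number of KOs in the category
    weighted:   a KO in n categories contributes hits / n to each
    """
    size = {}
    for categories in ko_to_categories.values():
        for cat in categories:
            size[cat] = size.get(cat, 0) + 1

    total, weighted = {}, {}
    for ko, hits in ko_hits.items():
        categories = ko_to_categories.get(ko, [])
        for cat in categories:
            total[cat] = total.get(cat, 0) + hits
            weighted[cat] = weighted.get(cat, 0) + hits / len(categories)

    normalized = {cat: total[cat] / size[cat] for cat in total}
    return total, normalized, weighted

# Hypothetical hits and mappings: KO_2 belongs to both categories
ko_hits = {"KO_1": 10, "KO_2": 6, "KO_3": 4}
ko_to_categories = {
    "KO_1": ["category_X"],
    "KO_2": ["category_X", "category_Y"],
    "KO_3": ["category_Y"],
}
total, normalized, weighted = category_abundance(ko_hits, ko_to_categories)
```

With total hits, the 6 hits of KO_2 are counted twice (16 and 10, summing to 26), whereas with weighted hits they are split 3 and 3, so the category sums (13 and 7) add up to the original 20 hits.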
The network framework has been developed based on the KEGG global metabolic network using KEGGscape, followed by
manual curation. The network is rendered using a high-performance JavaScript library
sigma.js. The metabolic network is
displayed in the central area of the screen, with nodes and edges representing metabolites
and enzymatic reactions, respectively. Some of the key features of this visualization are:
Users can use the mouse scroll to zoom in and out of the network;
Users can view the reaction information (KO and compounds) by double clicking on an edge;
Users can also change the background color, switch the view style, specify a highlighting color or download the current network view as a PNG or SVG image.
Note, there are other similar tools such as the iPath2.0 (flash-based) and the native
KEGG (SVG based).
In the Shotgun Data Profiling module, KO enrichment analysis is performed by first mapping
all uploaded genes to the in-house MicrobiomeAnalyst gene database. Only genes with KO matches
will be kept and used for analysis. Duplicate KO terms will be discarded.
If the uploaded data is a list of genes, KO enrichment is calculated
via over-representation analysis (ORA), using the hypergeometric test. ORA tests
if a particular group of KOs are represented more than expected by chance within
the uploaded list of genes.
If the uploaded data is an abundance table of genes, KO enrichment is calculated
via the global test algorithm.
Global test evaluates whether a set of genes (i.e. KEGG pathways) is significantly associated with a variable of interest.
Compared to ORA, which uses only the total number of KO hits in a pathway,
global test considers the gene abundance values and is considered to be more sensitive than ORA.
It assumes that if a gene set can be used to predict an outcome of interest, the gene expression patterns per outcome
must be different. Because of this, it is more likely to detect "subtle yet consistent" changes amongst genes in the same pathway.
The global test algorithm is implemented in MicrobiomeAnalyst using the globaltest R package.
P-values for both methods are corrected for multiple testing using the Benjamini-Hochberg False Discovery Rate (FDR) procedure.
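As a minimal sketch of the ORA calculation and the FDR correction described above, the code below computes a one-sided hypergeometric p-value and applies the Benjamini-Hochberg adjustment using only the Python standard library. The function names and all numbers are hypothetical, and the global test is not covered here.

```python
from math import comb

def hypergeom_pvalue(hits, list_size, set_size, universe_size):
    """Probability of seeing at least `hits` members of a KO set when
    `list_size` KOs are drawn without replacement from a universe of
    `universe_size` KOs that contains `set_size` members of the set."""
    upper = min(list_size, set_size)
    favourable = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(universe_size, list_size)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 2 of 4 uploaded KOs fall in a set of 3,
# out of a universe of 10 KOs
p = hypergeom_pvalue(hits=2, list_size=4, set_size=3, universe_size=10)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In this toy case the p-value is (3 x 21 + 1 x 7) / 210 = 1/3, so an overlap of 2 would not be considered enriched.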
MicrobiomeAnalyst supports three kinds of taxa sets based on the taxonomic resolution of the organisms or microbes present:
Strain-level taxa sets: These sets mostly consist of phenotypes, ecological niches and disease associated traits for 2270 microbial reference
genomes derived from the HMP (Human Microbiome Project).
It has been collected from comprehensive databases such as GOLD (Genomes Online Database)
and PATRIC (Pathosystems Resource Integration Center).
Species-level taxa sets: These sets are manually collected from over 60 literature publications, organized based on their associations with various host
physiological (age, weight, etc.) and biochemical measures (granulocytes, leucocytes, insulin, etc.), or
disease states (heart attack, Crohn's disease, etc.), or life style factors (fruits, vegetables, alcohol, etc.).
Higher-level taxa sets: These sets include disease associated traits derived from the website MicroPattern.
You can download these sets from here.
Taxon Set Enrichment Analysis (TSEA) aims to identify biologically or ecologically meaningful
patterns in taxonomy composition changes in microbiome studies. TSEA directly investigates if a
group of taxa (created by common traits or functional associations) are significantly
enriched. Essentially, TSEA is a taxonomic version of Gene Set Enrichment Analysis
approach, with its own collection of taxon set libraries. MicrobiomeAnalyst uses the hypergeometric
test to evaluate if certain taxon sets are represented more often than expected by chance within a user-uploaded
list of taxa.
Human reference studies data have been downloaded from the publicly accessible
QIITA. In these datasets, taxon names are labelled
with a Greengenes identifier. Reference datasets have been separated based on the sampling body sites and
the sequencing platform used to generate data. Once the user uploads their human microbiome data, MicrobiomeAnalyst tries to merge it with a selected
reference dataset based on common taxon names. At least 20 percent of OTUs between user and reference data have to match in order to proceed further.
A common data filtering and normalization method is applied on merged data to perform PCoA analysis using various distance measures.
For details about the reference studies and their methodologies, please refer to the paper by
Lozupone CA et al.
Beta diversity or PCoA analysis is mainly affected by those abundant taxa that are shared across samples. Therefore, most of the clustering
patterns used in PCoA are captured by only these abundant taxa or features. Also most of the reference studies in PPD have a large number
of samples, so many of the distance measures such as Unweighted UniFrac will require a lot of computational time for dissimilarity or
distance calculations between samples. In order to avoid this issue, MicrobiomeAnalyst provides an option of selecting just the top
20% most abundant features for computation.
At this time, most data sets are from human studies, with a few from mouse and cow studies. The resources will be updated once high-quality and well-annotated
public data sets become available.