Frequently Asked Questions (FAQs)
  1. When should I use MicrobiomeAnalyst?
  2. What types of input does MicrobiomeAnalyst accept?
  3. How can I format my data into an acceptable input for MicrobiomeAnalyst?
  4. Why should I use the data filtering option?
  5. Which option should I choose to perform data filtering?
  6. Why should I normalize my data?
  7. What are the various normalization methods and which method should I choose?
  8. When should I opt to rarefy my data?
  9. How can I detect and deal with outliers?
  1. When should I use MicrobiomeAnalyst?

    MicrobiomeAnalyst is a user-friendly, comprehensive web-based tool for analyzing data sets generated from microbiome studies (16S rRNA, metagenomics or metatranscriptomics data). The focus of this tool is to perform statistical analysis, visual exploration, and data integration. It is designed to support four kinds of analysis:

    • Compositional profiling: (16S rRNA marker gene data) provides various common statistical methods for alpha and beta diversity analysis, coupled with different types of visualization support (bar charts, stacked area plots and pie-charts) to help users obtain a comprehensive view of the composition of their uploaded microbial communities.
    • Functional profiling: (16S rRNA data) supports the prediction of metabolic potentials; (shotgun metagenomics or metatranscriptomics data) supports direct functional profiling at various levels: metabolic modules, metabolic pathways or higher functional (COG or EC) levels.
    • Comparative analysis: supports multiple data filtering and normalization techniques coupled with differential analysis methods (LEfSe, metagenomeSeq, edgeR, DESeq2, etc.) to identify features that are significantly different between the conditions under study.
    • Meta-analysis: allows users to directly compare their data sets with published human microbiome data in the Projection with Public Data (PPD) module, or compare important features with those defined or identified from the literature or public resources in the Taxon Set Enrichment Analysis (TSEA) module.

  2. What types of input does MicrobiomeAnalyst accept?

    MicrobiomeAnalyst supports common data formats generated in microbiome studies including OTU tables, BIOM files, as well as outputs from both QIIME and mothur, together with associated meta-data files. Detailed descriptions, screenshots, and example datasets can be found on the Data Format page.

  3. How can I format my data into an acceptable input for MicrobiomeAnalyst?

    For 16S rRNA data, there are several pipelines for processing the raw sequence data into an OTU table. For instance,

    • QIIME: its OTU-picking scripts produce an OTU table in BIOM format.
    • Mothur: commands such as make.shared and make.biom are used to create OTU tables in shared and BIOM formats, respectively.
    • UPARSE: an OTU table can be generated using the usearch_global command in BIOM (QIIME) or shared (mothur) format.

    For shotgun sequencing data, HUMAnN and MG-RAST are the two main platforms that can be used for data annotation to generate a KO count table. More recently, QIIME has also added support for shotgun metagenomics data through its established pipelines.

  4. Why should I use the data filtering option?

    The purpose of data filtering is to allow users to remove low quality and/or uninformative features to improve downstream statistical analysis. MicrobiomeAnalyst contains three data filtering procedures:

    1. Minimal data filtering (applied to all analyses) - the procedure will remove features containing all zeros, or appearing in only one sample. These extremely rare features should be removed from consideration;
    2. Low abundance features - they may be due to sequencing errors or low-level contamination;
    3. Low variance features - they are unlikely to be associated with the conditions under study;
    Note: the last two options are not used for within-sample (alpha diversity) profiling, but are strongly recommended for comparative analysis.
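
    As an illustration, the minimal and low-count filters above can be sketched in a few lines of NumPy. The function names, toy counts, and thresholds here are hypothetical, not MicrobiomeAnalyst's actual implementation:

```python
import numpy as np

# Toy OTU table: rows = features, columns = samples (hypothetical data).
counts = np.array([
    [0, 0, 0, 0],        # all zeros -> removed by the minimal filter
    [5, 0, 0, 0],        # present in only one sample -> removed
    [1, 2, 0, 1],        # low abundance but prevalent -> survives minimal filter
    [120, 80, 95, 110],  # well-observed feature
])

def minimal_filter(table):
    """Drop features that are all zero or appear in only one sample."""
    prevalence = (table > 0).sum(axis=1)
    return table[prevalence >= 2]

def low_count_filter(table, min_count=4, min_prevalence=0.2):
    """Drop features whose count is below min_count in too many samples
    (an illustrative low-abundance / low-prevalence cutoff)."""
    n_samples = table.shape[1]
    keep = (table >= min_count).sum(axis=1) >= min_prevalence * n_samples
    return table[keep]

filtered = minimal_filter(counts)
print(filtered.shape[0])  # 2 features survive the minimal filter
```

    Applying low_count_filter on top of the minimal filter would additionally remove the low-abundance feature, leaving only the well-observed one.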

  5. Which option should I choose to perform data filtering?

    Users can choose from the low count filter options to remove features with very small counts in very few samples (i.e. low prevalence). If the primary purpose is comparative analysis, users should also filter features that exhibit low variance based on the interquartile range, standard deviation, or coefficient of variation, as such features are very unlikely to be significant in comparative analysis.
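
    A low-variance filter based on the interquartile range can be sketched as follows. The toy data and the percentile cutoff are illustrative assumptions only:

```python
import numpy as np

# Toy normalized abundance table: rows = features, columns = samples.
abund = np.array([
    [10.0, 10.1, 9.9, 10.0],   # nearly constant -> low variance, dropped
    [1.0, 30.0, 5.0, 22.0],    # varies across samples -> kept
    [0.5, 0.5, 0.5, 0.6],      # nearly constant -> dropped
])

def variance_filter(table, percentile=50):
    """Keep features whose interquartile range (IQR) exceeds the given
    percentile of all feature IQRs (an illustrative low-variance cutoff)."""
    iqr = np.percentile(table, 75, axis=1) - np.percentile(table, 25, axis=1)
    cutoff = np.percentile(iqr, percentile)
    return table[iqr > cutoff]

kept = variance_filter(abund)
print(kept.shape[0])  # 1 variable feature remains
```

    The same pattern applies with standard deviation or coefficient of variation in place of the IQR.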

  6. Why should I normalize my data?

    Metagenomic data possess some unique characteristics such as vast differences in sequencing depth, sparsity (containing many zeros), and large variance in distributions (overdispersion). These attributes make it inappropriate to directly apply methods developed in other omics fields to perform differential analysis on metagenomic data. To account for these issues, MicrobiomeAnalyst includes various normalization methods:

    • Rarefaction and scaling methods: these methods deal with uneven sequencing depths by bringing samples to the same scale for comparison.
    • Transformation methods: these deal with sparsity, compositionality, and large variations within the data.

  7. What are the various normalization methods and which method should I choose?

    MicrobiomeAnalyst supports a variety of methods for data normalization. A brief summary is provided below:

    • Total Sum Scaling (TSS) normalization: this method removes technical bias related to different sequencing depths by simply dividing each feature count by the total library size, yielding the relative proportion of counts for that feature. For easier interpretation, the proportion can be multiplied by 1,000,000 to obtain the number of reads corresponding to that feature per million reads. LEfSe utilizes this kind of approach.
    • Relative log expression (RLE) normalization: this is the scaling factor method proposed by Anders and Huber (2010). It builds a pseudo-reference sample from the per-feature geometric means across all samples; the median ratio of each sample's counts to this reference is taken as the scaling factor.
    • Trimmed mean of M-values (TMM) normalization: this is the weighted trimmed mean of M-values proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data.
    • Upper Quantile normalization: this is the upper-quartile normalization method of Bullard et al. (2010), in which the scale factors are calculated from the 75th percentile of the counts for each library, after removing features that are zero in all libraries. The idea is generalized here to allow scaling by any quantile of the distributions.
    • Cumulative Sum Scaling (CSS) normalization: this method scales counts by the cumulative sum of counts up to a data-derived quantile, under the assumption that, up to this quantile, counts across samples are derived from a common distribution and are roughly equivalent and independent. By default, metagenomeSeq utilizes this approach for differential analysis.
    • Centered Log-Ratio (CLR) transformation: this method is specifically designed to normalize compositional data. It converts the values for each part into ratios against the geometric mean of all parts in that sample, followed by a log transform. The method is not robust to sparse data, because the geometric mean cannot be computed if any of the feature counts are zero; a pseudocount is typically added first.
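
    Several of the methods above can be sketched in a few lines of NumPy. This is an illustrative simplification on a toy table; the production implementations (e.g. in DESeq2 or metagenomeSeq) handle edge cases such as zeros and ties that this sketch ignores:

```python
import numpy as np

# Toy count table: rows = samples, columns = features. The second sample
# has the same composition as the first but twice the sequencing depth.
counts = np.array([
    [10, 30, 60],
    [20, 60, 120],
], dtype=float)

def tss_cpm(table):
    """Total Sum Scaling: divide by library size, scale to counts per million."""
    return table / table.sum(axis=1, keepdims=True) * 1_000_000

def rle_size_factors(table):
    """RLE (Anders & Huber 2010): median ratio of each sample to a
    pseudo-reference built from per-feature geometric means.
    Assumes no zero counts, since log(0) is undefined."""
    log_counts = np.log(table)
    log_ref = log_counts.mean(axis=0)  # log geometric mean per feature
    return np.exp(np.median(log_counts - log_ref, axis=1))

def clr_transform(table, pseudocount=1.0):
    """Centered log-ratio: log of each value over the geometric mean of
    its sample; a pseudocount guards against zeros (the sparsity caveat)."""
    log_vals = np.log(table + pseudocount)
    return log_vals - log_vals.mean(axis=1, keepdims=True)

cpm = tss_cpm(counts)          # both rows become identical profiles
factors = rle_size_factors(counts)  # second factor is 2x the first
clr = clr_transform(counts)    # each CLR row sums to zero by construction
```

    On this toy table, TSS makes the two samples indistinguishable and the RLE size factors recover the 2:1 depth difference, which is exactly the behavior these scaling methods are designed to achieve.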

    Please note, data normalization is mainly used for visual data exploration such as beta diversity and clustering analysis. It is also used for comparative analysis with statistical methods that have no prescribed normalization procedure (univariate statistics and LEfSe). Other comparative analyses use their own specific normalization methods: for example, cumulative sum scaling (CSS) normalization is used for metagenomeSeq, and trimmed mean of M-values (TMM) is applied for edgeR.

    At the moment, there is no consensus guideline with regard to which normalization should be used. Users are advised to explore different approaches and then visually examine the separation patterns (e.g. in a PCoA plot) to assess the effects of different normalization procedures with regard to experimental conditions or other meta-data of interest. For a detailed discussion of these methods, users are referred to two recent papers by Paul J. McMurdie et al. and Jonathan Thorsen et al.

  8. When should I opt to rarefy my data?

    Whenever the sequencing depths of your samples differ too much (e.g. >10X), it is recommended to perform rarefaction before normalizing your data. Note, users should also consider removing shallowly sequenced samples, as such gross differences could be due to experimental failure. For more details, please refer to the paper by Weiss, S. et al. In MicrobiomeAnalyst, users can directly visualize the sample sizes (available on the Data Inspection page) to assess whether rarefaction is needed for their data.
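
    Conceptually, rarefaction subsamples each sample's reads without replacement down to a common depth. A minimal sketch (hypothetical helper function, not MicrobiomeAnalyst's implementation):

```python
import numpy as np

def rarefy(sample_counts, depth, seed=None):
    """Subsample one sample's feature counts to a fixed depth without
    replacement (rarefaction). sample_counts is a 1-D integer array."""
    rng = np.random.default_rng(seed)
    # Expand counts into a pool of individual reads labelled by feature index.
    pool = np.repeat(np.arange(len(sample_counts)), sample_counts)
    picked = rng.choice(pool, size=depth, replace=False)
    return np.bincount(picked, minlength=len(sample_counts))

deep = np.array([500, 300, 200])          # deeply sequenced sample: 1,000 reads
rarefied = rarefy(deep, depth=100, seed=0)
print(rarefied.sum())  # 100
```

    In practice the target depth is usually the library size of the shallowest sample retained after discarding failed samples.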

  9. How can I detect and deal with outliers?

    Outliers refer to those samples that are significantly different from the "majority". A potential outlier distinguishes itself as a sample located far away from the major clusters or trends formed by the remaining data. To discover potential outliers, users can use a variety of summary plots to visualize their data. For instance, a sample with extreme diversity (alpha or beta) or very low sequencing depth (rarefaction curve analysis or sample size viewer) may be considered a potential outlier. Outliers may arise for either biological or technical reasons. To deal with outliers, the first step is to check whether those samples were measured properly. In many cases, outliers are the result of operational errors during the analytical process. If those values cannot be corrected (e.g. by normalization procedures), they should be removed from your analysis via the Sample Editor.
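
    For example, a simple robust screen for sequencing-depth outliers can use the median absolute deviation. The toy library sizes and the threshold are illustrative assumptions:

```python
import numpy as np

# Toy library sizes for six samples; one is sequenced far more shallowly.
lib_sizes = np.array([52000, 48000, 50500, 49800, 51200, 3000])

def flag_depth_outliers(sizes, n_mads=3.5):
    """Flag samples whose library size deviates from the median by more
    than n_mads median absolute deviations (a robust outlier screen)."""
    median = np.median(sizes)
    mad = np.median(np.abs(sizes - median))
    return np.abs(sizes - median) > n_mads * mad

flags = flag_depth_outliers(lib_sizes)
print(np.flatnonzero(flags))  # [5]
```

    A median-based screen like this is preferable to mean/standard-deviation rules precisely because the outlier itself would inflate the mean and standard deviation.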
