MicrobiomeAnalyst

Data Format Overview

Two main types of files are required: a taxonomic profile (including abundance and taxonomy information) and a metadata file.

Taxonomic profile formats

Taxonomic profiles derived from both amplified 16S rRNA census data and whole-genome shotgun metagenomic data can be uploaded. The following formats are accepted (accompanying example metadata files are also given below):

mothur: This format has both a consensus taxonomy file (download) and a .shared file (download). Also available for download is an example of the accompanying metadata file (download).
BIOM format (from QIIME v1.5.0+: rich-format): This format consists of an abundance profile in .biom format (download). Here, the metadata file can be provided separately, if not already present within the .biom file.
Tab-separated (.txt) files: This format consists of either an abundance file (download) with a separate file of corresponding taxonomy mapping information (download), or an abundance file directly including taxonomic information (download). Note that metadata information is not included in such files and must be provided in a separate metadata file.

mothur

Two files are needed for a mothur taxonomic profile: a consensus taxonomy file (download) and a .shared file (download). The consensus taxonomy file can be created with mothur's classify.otu command. The .shared format can be created using mothur's make.shared command.

The accompanying example metadata file can be downloaded here.

BIOM format

As of QIIME v1.5.0, QIIME uses the BIOM format for its OTU tables. The BIOM file can be generated using QIIME's make_otu_table.py script.

An example biom file can be downloaded from here.

Tab-separated (.txt) files

The tab-separated (.txt) format can also be used for taxonomic profiles. It consists of a data table of abundance values (raw counts from 16S data) saved as a tab-delimited text file (.txt), with rows for features (OTUs) and columns for samples. The tab-delimited file can be generated from any spreadsheet program. The file must follow the specific format described below:

The first line should contain sample names and start with "#NAME".
The first column should contain taxon names. Taxon names are any valid taxonomic identifiers annotated against the Greengenes or SILVA databases. Such labels can contain information from domain to species, separated by semicolons (;). In a taxonomy mapping file, the header line begins with "#TAXONOMY" instead.
Non-specific taxon names (e.g. Otu0001) can also be present in the first column of the file. In that case, a tab-delimited (.txt) taxonomy mapping file can also be uploaded, containing the lineage from domain to species for each taxon name. Examples are provided below.

Unrecognizable terms (e.g. "uncultured" or strain identifiers) can also be included in the taxonomic profile without causing any issues. There is no requirement to include information for multiple taxonomic rank levels, and there is no minimum or maximum taxonomic rank that must be included. Data cells can indicate the read count (preferable) or proportions or percentages of taxa in each sample.
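
As a rough illustration of how a semicolon-separated label maps onto taxonomic ranks, the Python sketch below splits a label and fills unresolved ranks with "NA". The function name split_lineage and the RANKS list (borrowed from the header of the taxonomy mapping example) are assumptions made for illustration; they are not part of MicrobiomeAnalyst itself.

```python
# Rank names taken from the header of the taxonomy mapping example (an assumption).
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_lineage(label):
    """Split a label such as 'Bacteria;Acidobacteria;' into a rank -> name dict."""
    # Drop the empty pieces left behind by trailing semicolons.
    names = [part.strip() for part in label.split(";") if part.strip()]
    if len(names) > len(RANKS):
        raise ValueError(f"more than {len(RANKS)} ranks in label: {label!r}")
    # Ranks that were not resolved are filled with "NA".
    padded = names + ["NA"] * (len(RANKS) - len(names))
    return dict(zip(RANKS, padded))


if __name__ == "__main__":
    print(split_lineage("Archaea;Crenarchaeota;Thermoprotei;"))
```

A label that stops at a higher rank (for example "Archaea;") simply yields "NA" for every rank below it, which mirrors the rule that no minimum rank is required.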

Notes when formatting your data:

Both sample and feature names must be unique and consist only of common English letters, numbers and underscores. Latin/Greek letters are not supported.
Sample and feature names must be consistent across all files (i.e. abundance table, taxonomy table, meta-data file, and phylogenetic tree file).
Data values (read counts or proportions) should contain only numeric, non-negative values. Empty cells or cells with NA values will be replaced with zero.
Blank cells or "NA" (without quotes) are allowed for missing values in the taxonomy table. These values are considered to represent the maximum taxonomic resolution obtained for each OTU or feature.
Metadata is not permitted to be included in the abundance tables (i.e. do not add another row following #NAME with #CLASS).
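
The rules above can be checked before upload with a short script. The following is a minimal Python sketch, not the validation that MicrobiomeAnalyst itself performs; the function name check_abundance_table is hypothetical, and the name pattern is applied only to sample names here because taxonomy labels legitimately contain semicolons.

```python
import csv
import io
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # letters, numbers, underscores


def check_abundance_table(text):
    """Return (problems, table); table maps feature name -> list of floats."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    problems, table = [], {}
    if not rows:
        return ["file is empty"], table
    header = rows[0]
    if header[0] != "#NAME":
        problems.append('first line must start with "#NAME"')
    samples = header[1:]
    if len(set(samples)) != len(samples):
        problems.append("sample names are not unique")
    problems += [f"unsupported sample name: {s}" for s in samples
                 if not NAME_PATTERN.match(s)]
    for row in rows[1:]:
        feature, cells = row[0], row[1:]
        if feature == "#CLASS":
            problems.append("metadata row (#CLASS) is not permitted")
            continue
        if feature in table:
            problems.append(f"duplicate feature name: {feature}")
        if len(cells) != len(samples):
            problems.append(f"wrong number of cells in {feature}")
        values = []
        for cell in cells:
            cell = cell.strip()
            if cell in ("", "NA"):
                values.append(0.0)  # empty / NA cells are treated as zero
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"non-numeric value {cell!r} in {feature}")
                continue
            if value < 0:
                problems.append(f"negative value {cell!r} in {feature}")
            values.append(value)
        table[feature] = values
    return problems, table


if __name__ == "__main__":
    demo = "#NAME\tSample1\tSample2\nOTU1\t219\t49\nOTU2\tNA\t0\n"
    print(check_abundance_table(demo))
```

An empty problem list only means that the file passes these basic checks; the server may still apply further filtering.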

Examples

Taxonomic profile with valid taxonomy identifier labeled names: download example

#NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
Archaea;	219	49	42	50	6	17	22	21
Archaea;Crenarchaeota;Thermoprotei;	424	0	191	0	0	0	0	0
Bacteria;Acidobacteria;	32	4	4	22	76	16	1	0
Bacteria;Actinobacteria;	47	0	0	4	0	0	0	0

Taxonomic profiles with non-specific taxon names: download example

#NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
OTU1	219	49	42	50	6	17	22	21
OTU2	424	0	191	0	0	0	0	0
OTU3	32	4	4	22	76	16	1	0
OTU5	47	0	0	4	0	0	0	0

Taxonomic mapping file: download example

#TAXONOMY	Kingdom	Phylum	Class	Order	Family	Genus	Species
Otu00001	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Prevotellaceae	
Otu00002	Bacteria	Proteobacteria	Epsilonproteobacteria	Campylobacterales	Helicobacteraceae	Helicobacter	
Otu00003	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Alloprevotella	
Otu00004	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides

Metadata file format (download)

The tab-delimited (.txt) format is also used for metadata files. Sample names/IDs are in the first column, which begins with "#NAME" in the first row.

For metadata, sample names are present in rows and metadata types (e.g. depth, temperature) in columns. Data values should be discrete, qualitative labels (e.g. HIGH, MED, LOW). Please make sure that the file does not contain empty cells or NA values. Use the same sample names/IDs as in your input taxonomic profile file. Make sure that neither your metadata type names nor your metadata labels include tabs, since tabs are used to delimit separate items.

MicrobiomeAnalyst is primarily designed for group comparisons. Please make sure that at least one column contains such an experimental design (i.e. the primary metadata), with each group containing at least 3 replicates. No unique values are allowed in the primary metadata column; that is, every group label must be shared by more than one sample.

Notes when formatting your data:

Empty or NA cells are not permitted.
Sample names must be consistent across all files (i.e. abundance table, meta-data file, and phylogenetic tree file).
Group names should not include spaces or punctuation marks such as dashes or slashes. If you need a separator, use an underscore (e.g. clinical-group should be clinical_group).
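
The metadata requirements can be checked in a similar way. The Python sketch below is only an illustration under stated assumptions: the function name check_metadata and its parameters (primary, min_replicates) are hypothetical, and the checks simply restate the rules listed above rather than reproduce MicrobiomeAnalyst's own validation.

```python
import csv
import io
import re
from collections import Counter

LABEL_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # letters, numbers, underscores


def check_metadata(text, primary, min_replicates=3):
    """Return a list of problems found in a tab-delimited metadata table."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows or rows[0][0] != "#NAME":
        return ['first row must start with "#NAME"']
    header = [h.strip() for h in rows[0]]
    if primary not in header[1:]:
        return [f"primary metadata column not found: {primary}"]
    col = header.index(primary)
    problems, groups = [], []
    for row in rows[1:]:
        cells = [c.strip() for c in row]
        # Pad short rows so that missing trailing cells count as empty.
        cells += [""] * (len(header) - len(cells))
        if any(c in ("", "NA") for c in cells):
            problems.append(f"empty or NA cell in row: {cells[0]}")
            continue
        groups.append(cells[col])
    for group, n in Counter(groups).items():
        if not LABEL_PATTERN.match(group):
            problems.append(f"unsupported group name: {group}")
        if n < min_replicates:
            problems.append(f"group {group} has only {n} sample(s)")
    return problems


if __name__ == "__main__":
    demo = "#NAME\tSampleType\nS1\tskin\nS2\tgut\nS3\tskin\nS4\tgut\nS5\tgut\nS6\tskin\n"
    print(check_metadata(demo, "SampleType"))
```

The replicate threshold is exposed as a parameter so that the same check can be reused with a stricter design if needed.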

Example

#NAME	SampleType
Sample1	skin
Sample2	gut
Sample3	skin
Sample4	gut
Sample5	gut
Sample6	gut
Sample7	skin
Sample8	skin

Tree file format (optional) (download)

The tree file (.tre) must be in either Newick format (example data below) or Nexus format. You can generate the tree file with QIIME or other software using representative sequences.

A phylogenetic tree is a diagram which represents evolutionary relationships among species. It reflects how they have evolved from common ancestors. If two organisms share a more recent common ancestor, they are more closely related. Therefore, the phylogenetic tree can be used to represent distances between organisms, which will then be used for UniFrac distance-based analysis, as implemented in phylogenetic beta diversity.

Example

(((((589277:0.00067,580629:0.00014)0.459:0.09069,535375:0.02088)0.766:0.02036,589071:0.03177)0.870:0.56187,(968675:0.23223,(1060621:0.00076,355750:0.00014)0.726:0.21268)0.845:0.15546)0.772:0.05313,
((1078207:0.24993,1097208:0.13062)0.918:0.16344,938948:0.34514)0.890:0.02316)0.862;
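
Because the tip labels of the tree must match the feature names in the abundance table, it can help to list them before upload. The Python sketch below extracts leaf labels from a Newick string with a regular expression. It assumes simple, unquoted labels, as in the example above, and is not a full Newick parser; the function name newick_leaf_names is hypothetical.

```python
import re

# The example tree shown above, joined from its two lines.
EXAMPLE_TREE = (
    "(((((589277:0.00067,580629:0.00014)0.459:0.09069,535375:0.02088)"
    "0.766:0.02036,589071:0.03177)0.870:0.56187,(968675:0.23223,"
    "(1060621:0.00076,355750:0.00014)0.726:0.21268)0.845:0.15546)0.772:0.05313,"
    "((1078207:0.24993,1097208:0.13062)0.918:0.16344,938948:0.34514)"
    "0.890:0.02316)0.862;"
)


def newick_leaf_names(newick):
    """Return leaf labels from a Newick string (simple, unquoted labels only)."""
    text = "".join(newick.split())  # remove line breaks and spaces
    if text.count("(") != text.count(")"):
        raise ValueError("unbalanced parentheses")
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    # A leaf label follows "(" or ","; internal node labels follow ")".
    return re.findall(r"[(,]([^():,;]+)", text)


if __name__ == "__main__":
    leaves = newick_leaf_names(EXAMPLE_TREE)
    features = {"589277", "580629", "999999"}  # hypothetical feature names
    print(sorted(features - set(leaves)))      # features missing from the tree
```

For trees with quoted labels or comments, a dedicated phylogenetics library would be a safer choice than this regular expression.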

Raw sequence data formats

Raw 16S sequence data (single- or paired-end) should be uploaded as demultiplexed, per-sample, compressed files together with a metadata file.

Sequencing data are uploaded as individual zip/fastq.gz files, one zip per sample [max: 100 files].
Metadata is uploaded as a plain text (.txt) file containing multiple columns: file names, group labels and other experimental factors.
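
A planned upload can be compared against its metadata file with a few lines of Python. This is a sketch under loud assumptions: the metadata file is taken to be tab-delimited with one header line and the file names in its first column, and the function name check_sequence_upload is hypothetical. Only the 100-file limit and the zip/fastq.gz file types come from the description above.

```python
MAX_FILES = 100  # upload limit stated above
ALLOWED_SUFFIXES = (".zip", ".fastq.gz")


def check_sequence_upload(file_names, metadata_text):
    """Return problems for a planned upload.

    Assumes the first metadata column holds file names and the first
    line is a header; both are assumptions, not documented behaviour.
    """
    problems = []
    if len(file_names) > MAX_FILES:
        problems.append(f"{len(file_names)} files exceed the limit of {MAX_FILES}")
    for name in file_names:
        if not name.endswith(ALLOWED_SUFFIXES):
            problems.append(f"unexpected file type: {name}")
    lines = [line for line in metadata_text.splitlines() if line.strip()]
    listed = {line.split("\t")[0].strip() for line in lines[1:]}  # skip header
    for name in sorted(set(file_names) - listed):
        problems.append(f"file not listed in metadata: {name}")
    for name in sorted(listed - set(file_names)):
        problems.append(f"metadata entry without a file: {name}")
    return problems


if __name__ == "__main__":
    meta = "#NAME\tGroup\nA.fastq.gz\tgut\nB.fastq.gz\tskin\n"
    print(check_sequence_upload(["A.fastq.gz", "B.fastq.gz"], meta))
```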

Example

A demo example data set containing 10 fastq files can be downloaded here.