Data Format Overview

Two main types of files are required: a taxonomic profile (including abundance and taxonomy information) and a metadata file.

Taxonomic profile formats

Taxonomic profiles derived from both amplified 16S rRNA census data and whole-genome shotgun metagenomic data can be uploaded. The following formats are accepted (accompanying example metadata files are also given below):

  • mothur: This format has both a consensus taxonomy file (download) and a .shared file (download). Also available for download is an example of the accompanying metadata file (download).
  • BIOM format (from QIIME v1.5.0+: rich-format): This format consists of an abundance profile in .biom format (download). Here, the metadata file can be provided separately, if not already present within the .biom file.
  • Tab-separated (.txt) files: This format consists of either an abundance file (download) with a separate file of corresponding taxonomy mapping information (download), or an abundance file directly including taxonomic information (download). Note that metadata information is not included in such files.

mothur

Two files are needed for a mothur taxonomic profile: a consensus taxonomy file (download) and a .shared file (download). The consensus taxonomy file can be created with mothur's classify.otu command. The .shared format can be created using mothur's make.shared command.

The accompanying example metadata file can be downloaded here.

BIOM format

Since v1.5.0, QIIME has used the BIOM format for its OTU tables. The BIOM file can be generated using QIIME's make_otu_table.py script.

An example biom file can be downloaded from here.
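
To make the layout of a BIOM file more concrete, the following Python sketch reads a BIOM 1.0 document (the JSON-based flavour) and flattens it into a features-by-samples table with a "#NAME" header line. The function names biom_json_to_table and table_to_tsv are hypothetical and are not part of QIIME or of this tool. The sketch assumes the standard BIOM 1.0 keys ("rows", "columns", "data", "matrix_type") and does not handle the HDF5-based BIOM 2.x files, for which the dedicated biom-format package is the usual choice.

```python
import json


def biom_json_to_table(biom_text):
    """Convert BIOM 1.0 JSON text into (feature_ids, sample_ids, dense_matrix)."""
    doc = json.loads(biom_text)
    feature_ids = [row["id"] for row in doc["rows"]]      # observations (OTUs)
    sample_ids = [col["id"] for col in doc["columns"]]    # samples

    if doc.get("matrix_type") == "sparse":
        # Sparse data is a list of [row_index, column_index, value] triples.
        matrix = [[0] * len(sample_ids) for _ in feature_ids]
        for row_idx, col_idx, value in doc["data"]:
            matrix[row_idx][col_idx] = value
    else:
        # Dense data is already a list of rows.
        matrix = [list(row) for row in doc["data"]]
    return feature_ids, sample_ids, matrix


def table_to_tsv(feature_ids, sample_ids, matrix):
    """Render the table as tab-separated text with a "#NAME" header line."""
    lines = ["\t".join(["#NAME"] + sample_ids)]
    for feature_id, row in zip(feature_ids, matrix):
        lines.append("\t".join([feature_id] + [str(value) for value in row]))
    return "\n".join(lines) + "\n"
```

Calling table_to_tsv(*biom_json_to_table(text)) on the contents of a JSON .biom file yields text in the tab-separated layout described in the next section. Any taxonomy stored in the row metadata is ignored by this sketch.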

Tab-separated (.txt) files

The tab-separated (.txt) format is used for taxonomic profiles. Essentially, it consists of a data table of abundance values (raw counts from 16S data), saved as a tab-delimited text file (.txt) with rows for features (OTUs) and columns for samples. The tab-delimited file can be exported from any spreadsheet program. The file must follow the format described below:

  • The first line should contain sample names and start with "#NAME".
  • The first column should contain taxon names; in a taxonomy mapping file, the header line begins with "#TAXONOMY". Taxon names are any valid taxonomic identifiers that have been annotated via the Greengenes or SILVA databases. Such labels list ranks in order from domain to species, separated by semicolons (;).
  • Non-specific taxon names (e.g. Otu0001) can also be present in the first column of the file. In such a case, a tab-delimited (.txt) taxonomy mapping file can also be uploaded which contains information from domain to species for each taxon name. Examples are provided below.

Unrecognizable terms (e.g. "uncultured" or strain identifiers) can also be included in the taxonomic profile without causing any issues. There is no requirement to include information for multiple taxonomic rank levels, and there is no minimum or maximum taxonomic rank that must be included. Data cells can indicate the read count (preferable) or proportions or percentages of taxa in each sample.

Notes when formatting your data:

  • Both sample and feature names must be unique and consist only of common English letters, underscores and numbers. Latin/Greek letters are not supported.
  • Sample and feature names must be consistent across all files (i.e. abundance table, taxonomy table, metadata file, and phylogenetic tree file).
  • Data values (read counts or proportions) must be numeric and non-negative. Empty cells or cells with NA values will be replaced with zero.
  • Blank cells or "NA" (without quotes) are allowed for missing values in the taxonomy table. These values are considered to represent the maximum taxonomic resolution obtained for each OTU or feature.
  • Metadata is not permitted to be included in the abundance tables (i.e. do not add another row following #NAME with #CLASS).
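
The notes above can be turned into a quick pre-upload check. The following Python sketch is only an illustration of those rules, not the validation the tool itself performs; the function name check_abundance_table is hypothetical. It assumes the file content has already been read into a string, and it applies the character check to sample names only, since feature names may be semicolon-separated taxonomy labels.

```python
import csv
import io
import math
import re

# Sample names: common English letters, digits and underscores only.
SAMPLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_abundance_table(text):
    """Return (samples, table, problems) for a tab-separated abundance table.

    `table` maps each feature name to a list of floats, one per sample.
    Blank or NA cells become 0, mirroring the notes above.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return [], {}, ["file is empty"]

    problems = []
    header = [cell.strip() for cell in rows[0]]
    if header[0] != "#NAME":
        problems.append('first line must start with "#NAME"')
    samples = header[1:]
    if len(set(samples)) != len(samples):
        problems.append("sample names are not unique")
    for name in samples:
        if not SAMPLE_NAME.fullmatch(name):
            problems.append(f"unsupported characters in sample name {name!r}")

    table = {}
    for row in rows[1:]:
        feature = row[0].strip()
        if feature == "#CLASS":
            problems.append("#CLASS metadata row found in abundance table")
            continue
        if feature in table:
            problems.append(f"duplicate feature name {feature!r}")
        values = []
        for cell in row[1:]:
            cell = cell.strip()
            if cell in ("", "NA"):
                values.append(0.0)  # missing value, treated as zero
                continue
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"non-numeric value {cell!r} for {feature!r}")
                values.append(0.0)  # placeholder so the row keeps its width
                continue
            if not math.isfinite(value) or value < 0:
                problems.append(f"negative or non-finite value {cell!r} for {feature!r}")
            values.append(value)
        if len(values) != len(samples):
            problems.append(f"{feature!r} has {len(values)} values, expected {len(samples)}")
        table[feature] = values
    return samples, table, problems
```

An empty problems list means the table passed these checks; it does not guarantee that the names also match those in the metadata or taxonomy files, which would need a separate comparison.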

Examples

  • Taxonomic profile with valid taxonomy identifier labeled names: download example
    #NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
    Archaea;	219	49	42	50	6	17	22	21
    Archaea;Crenarchaeota;Thermoprotei;	424	0	191	0	0	0	0	0
    Bacteria;Acidobacteria;	32	4	4	22	76	16	1	0
    Bacteria;Actinobacteria;	47	0	0	4	0	0	0	0
                                
  • Taxonomic profiles with non-specific taxon names: download example
    #NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
    OTU1	219	49	42	50	6	17	22	21
    OTU2	424	0	191	0	0	0	0	0
    OTU3	32	4	4	22	76	16	1	0
    OTU5	47	0	0	4	0	0	0	0
                                
  • Taxonomic mapping file: download example
    #TAXONOMY	Kingdom	Phylum	Class	Order	Family	Genus	Species
    Otu00001	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Prevotellaceae	
    Otu00002	Bacteria	Proteobacteria	Epsilonproteobacteria	Campylobacterales	Helicobacteraceae	Helicobacter	
    Otu00003	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Alloprevotella	
    Otu00004	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides
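
The two ways of supplying taxonomy (semicolon-separated labels in the abundance file, or a separate mapping file) carry the same information. As an illustration, the Python sketch below splits semicolon-separated labels into the rank columns used in the mapping example above. The names RANKS, label_to_ranks and build_taxonomy_mapping are hypothetical, and truncating labels that contain more than seven levels is an assumption of this sketch rather than documented behaviour of the tool.

```python
# Rank columns as they appear in the mapping file example.
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def label_to_ranks(label):
    """Split 'Bacteria;Acidobacteria;' into one entry per rank, padded with ''."""
    parts = [part.strip() for part in label.split(";")]
    while parts and parts[-1] == "":
        parts.pop()                 # drop empty pieces left by trailing semicolons
    parts = parts[:len(RANKS)]      # assumption: ignore anything past Species
    return parts + [""] * (len(RANKS) - len(parts))


def build_taxonomy_mapping(labels_by_feature):
    """Build mapping-file text from a {feature_name: taxonomy_label} dict."""
    lines = ["\t".join(["#TAXONOMY"] + RANKS)]
    for feature, label in labels_by_feature.items():
        lines.append("\t".join([feature] + label_to_ranks(label)))
    return "\n".join(lines) + "\n"
```

Ranks that are missing from a label are left blank, which matches the note that blank cells mark the maximum taxonomic resolution obtained for a feature.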