Data Format Overview

Two main types of files are required: a taxonomic profile (including abundance and taxonomy information) and a metadata file.

Taxonomic profile formats

Taxonomic profiles derived from both amplified 16S rRNA census data and whole-genome shotgun metagenomic data can be uploaded. The following formats are accepted (accompanying example metadata files are also given below):

  • mothur: This format has both a consensus taxonomy file (download) and a .shared file (download). Also available for download is an example of the accompanying metadata file (download).
  • BIOM format (from QIIME v1.5.0+: rich-format): This format consists of an abundance profile in .biom format (download). Here, the metadata file can be provided separately, if not already present within the .biom file.
  • Tab-separated (.txt) files: This format consists of either an abundance file (download) with a separate file of corresponding taxonomy mapping information (download), or an abundance file that directly includes taxonomic information (download). Note that metadata information is included in such files.

mothur

Two files are needed for a mothur taxonomic profile: a consensus taxonomy file (download) and a .shared file (download). The consensus taxonomy file can be created with mothur's classify.otu command. The .shared format can be created using mothur's make.shared command.

The accompanying example metadata file can be downloaded here.

BIOM format

As of QIIME v1.5.0, QIIME uses the BIOM format for its OTU tables. The BIOM file can be generated using QIIME's make_otu_table.py script.

An example biom file can be downloaded from here.
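
To illustrate how a BIOM abundance profile relates to the tab-separated layout described on this page, here is a minimal sketch that flattens a BIOM file into a "#NAME" table. It assumes the JSON-based BIOM 1.0 layout (with "rows", "columns", "matrix_type" and "data" fields); HDF5-based BIOM 2.x files are not handled. The function name biom_json_to_table is hypothetical and is not part of QIIME or of this tool.

```python
import json


def biom_json_to_table(biom_text):
    """Convert BIOM 1.0 JSON text into a tab-separated abundance table.

    Assumption: the file follows the JSON BIOM 1.0 layout. HDF5-based
    BIOM 2.x files need a dedicated library and are not handled here.
    """
    doc = json.loads(biom_text)
    features = [row["id"] for row in doc["rows"]]
    samples = [col["id"] for col in doc["columns"]]

    if doc["matrix_type"] == "dense":
        # Dense data is already a list of rows, one per feature.
        matrix = [list(row) for row in doc["data"]]
    else:
        # Sparse data is a list of [row_index, column_index, value] triples.
        matrix = [[0] * len(samples) for _ in features]
        for row_index, col_index, value in doc["data"]:
            matrix[row_index][col_index] = value

    lines = ["\t".join(["#NAME"] + [str(s) for s in samples])]
    for feature, values in zip(features, matrix):
        lines.append("\t".join([str(feature)] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"
```

Because .biom files can be uploaded directly, such a conversion is only useful for inspecting the table by eye or for comparing it with a tab-separated profile.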

Tab-separated (.txt) files

The tab-separated (.txt) format is used for taxonomic profiles. Essentially, it consists of a data table of abundance values (raw counts from 16S data), saved as a tab-delimited text file (.txt) with rows for features (OTUs) and columns for samples. The tab-delimited file can be generated from any spreadsheet program. Such a file must follow the specific format described below:

  • The first line should contain sample names and start with "#NAME".
  • The first column should contain taxon names (in a taxonomy mapping file, the first line begins with "#TAXONOMY" instead). Taxon names are any valid taxonomic identifiers that have been annotated via the Greengenes or SILVA databases. Such labels list ranks in order from domain to species, separated by semicolons (;).
  • Non-specific taxon names (e.g. Otu0001) can also be present in the first column of the file. In such a case, a tab-delimited (.txt) taxonomy mapping file can also be uploaded, which contains information from domain to species for each taxon name. Examples are provided below.

Unrecognizable terms (e.g. "uncultured" or strain identifiers) can also be included in the taxonomic profile without causing any issues. There is no requirement to include information for multiple taxonomic rank levels, and there is no minimum or maximum taxonomic rank that must be included. Data cells can indicate the read count (preferable) or proportions or percentages of taxa in each sample.

Notes when formatting your data:

  • Both sample and feature names must be unique and consist only of common English letters, underscores and numbers. Latin/Greek letters are not supported.
  • Data values (read counts or proportions) should contain only numeric, non-negative values. Empty cells or cells with NA values will be replaced with zero.
  • Empty cells or "NA" (without quotes) are allowed for missing values in the taxonomy table. These values are considered to represent the maximum taxonomic resolution obtained for each OTU or feature.
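
The formatting rules above can be checked before upload. The following is a minimal sketch in Python, not the tool's own validation code; the function name load_abundance_table and the exact error messages are hypothetical. It applies the character check to sample names only, on the assumption that taxonomy-labeled feature names legitimately contain semicolons.

```python
import csv
import io
import re

# Letters, digits and underscores only (assumed reading of the naming rule).
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def load_abundance_table(text):
    """Parse a tab-separated abundance table and return (samples, counts).

    counts maps each feature name to a list of floats, one per sample.
    Empty and "NA" cells are replaced with zero, as described above.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0] or rows[0][0] != "#NAME":
        raise ValueError('first line must start with "#NAME"')
    samples = rows[0][1:]
    if len(set(samples)) != len(samples):
        raise ValueError("sample names must be unique")
    for name in samples:
        if not NAME_PATTERN.match(name):
            raise ValueError(f"unsupported characters in sample name: {name!r}")

    counts = {}
    for row in rows[1:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        feature, cells = row[0].strip(), row[1:]
        if feature in counts:
            raise ValueError(f"duplicate feature name: {feature!r}")
        if len(cells) > len(samples):
            raise ValueError(f"too many values for feature: {feature!r}")
        # Treat missing trailing cells as empty cells.
        cells = cells + [""] * (len(samples) - len(cells))
        values = []
        for cell in cells:
            cell = cell.strip()
            value = 0.0 if cell in ("", "NA") else float(cell)
            if value < 0:
                raise ValueError(f"negative value for feature: {feature!r}")
            values.append(value)
        counts[feature] = values
    return samples, counts
```

A typical use is to read the file contents into a string, call load_abundance_table on it, and correct whatever the first ValueError reports before uploading.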

Examples

  • Taxonomic profile with valid taxonomy identifier labeled names: download example
    #NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
    Archaea;	219	49	42	50	6	17	22	21
    Archaea;Crenarchaeota;Thermoprotei;	424	0	191	0	0	0	0	0
    Bacteria;Acidobacteria;	32	4	4	22	76	16	1	0
    Bacteria;Actinobacteria;	47	0	0	4	0	0	0	0

  • Taxonomic profiles with non-specific taxon names: download example
    #NAME	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
    OTU1	219	49	42	50	6	17	22	21
    OTU2	424	0	191	0	0	0	0	0
    OTU3	32	4	4	22	76	16	1	0
    OTU5	47	0	0	4	0	0	0	0

  • Taxonomic mapping file: download example
    #TAXONOMY	Kingdom	Phylum	Class	Order	Family	Genus	Species
    Otu00001	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Prevotellaceae	
    Otu00002	Bacteria	Proteobacteria	Epsilonproteobacteria	Campylobacterales	Helicobacteraceae	Helicobacter	
    Otu00003	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Prevotellaceae	Alloprevotella	
    Otu00004	Bacteria	Bacteroidetes	Bacteroidia	Bacteroidales	Bacteroidaceae	Bacteroides
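
To show how a taxonomy mapping file relates to the semicolon-separated labels used in the first example, here is a minimal sketch that turns each mapping row into a lineage string. The function name read_lineages is hypothetical. Cutting the lineage at the first empty or "NA" rank is an assumed reading of the "maximum taxonomic resolution" rule, not a documented behavior of the tool.

```python
import csv
import io


def read_lineages(mapping_text):
    """Return a dict mapping each taxon name to a ';'-joined lineage.

    Assumption: the lineage stops at the first empty or "NA" rank, which
    is taken as the maximum taxonomic resolution for that feature.
    """
    rows = list(csv.reader(io.StringIO(mapping_text), delimiter="\t"))
    if not rows or not rows[0] or rows[0][0] != "#TAXONOMY":
        raise ValueError('first line must start with "#TAXONOMY"')
    n_ranks = len(rows[0]) - 1  # number of rank columns in the header

    lineages = {}
    for row in rows[1:]:
        if not row or not row[0].strip():
            continue  # skip blank lines
        ranks = []
        for cell in row[1:1 + n_ranks]:
            cell = cell.strip()
            if cell in ("", "NA"):
                break  # no finer resolution available
            ranks.append(cell)
        lineages[row[0].strip()] = ";".join(ranks)
    return lineages
```

For the Otu00002 row above, this yields a lineage that ends at the genus Helicobacter, because the Species column is empty.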
                        

Metadata file format (download)

The tab-delimited (.txt) format is also used for metadata files. Sample names/IDs are in the first column, which begins with "#NAME" in the first row.

For metadata, sample names are present in rows and metadata types (e.g. depth, temperature) in columns. Data values should be discrete, qualitative labels (e.g. HIGH, MED, LOW). Please make sure that the file does not contain empty cells or cells with NA values. Use the same sample names/IDs as in your input taxonomic profile file. Also make sure that neither your metadata type names nor your metadata labels include tab characters, since tabs are used to delimit separate items.

Example

#NAME	SampleType
Sample1	skin
Sample2	gut
Sample3	skin
Sample4	gut
Sample5	gut
Sample6	gut
Sample7	skin
Sample8	skin
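
The metadata rules can be checked in the same spirit. The sketch below is illustrative only; the function name check_metadata is hypothetical, and requiring every profile sample to appear in the metadata is an assumed reading of "use the same sample names/IDs".

```python
import csv
import io


def check_metadata(metadata_text, profile_samples):
    """Validate a tab-separated metadata table against profile samples.

    Returns a dict mapping each sample name to {metadata type: label}.
    """
    rows = [
        [cell.strip() for cell in row]
        for row in csv.reader(io.StringIO(metadata_text), delimiter="\t")
        if any(cell.strip() for cell in row)  # ignore blank lines
    ]
    if not rows or rows[0][0] != "#NAME":
        raise ValueError('first row must start with "#NAME"')
    header = rows[0]

    labels = {}
    for row in rows[1:]:
        # Every row needs one non-empty, non-NA value per header column.
        if len(row) != len(header) or any(c in ("", "NA") for c in row):
            raise ValueError(f"empty or NA cell in metadata row: {row!r}")
        if row[0] in labels:
            raise ValueError(f"duplicate sample name: {row[0]!r}")
        labels[row[0]] = dict(zip(header[1:], row[1:]))

    missing = sorted(set(profile_samples) - set(labels))
    if missing:
        raise ValueError(f"samples missing from metadata: {missing}")
    return labels
```

Here profile_samples would be the sample names taken from the "#NAME" line of the taxonomic profile.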