Summary of gene data

Gene count (total) 45,666,334
Gene count (oral) 23,466,696
Gene count (gut) 21,213,616
Gene count (shared) 493,011
Unique gene names 9,347
Unique EcIDs 2,148
Unique gene annotations 222,308
Unique NCBI taxon IDs 15,746
Mean cluster size 7
Mean gene length (nucleotides) 600

Database column descriptions

Gene ID Unique identifier assigned to each gene during gene prediction
Annotation Gene annotations, derived from UniProt, Pfam, TIGRFAMs, and NCBI's RefSeq
EcID Additional gene functional annotation, if applicable
Gene name Annotation-associated gene name (if applicable) or designation as hypothetical protein.
NCBI taxon ID NCBI-derived taxonomic identifier for a given gene. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245000/)
Body site Body site in which the gene was found (currently oral, gut, or both; genes found in both are annotated as oral | gut)
Number of genes in cluster The total number of homologous genes in a cluster identified by sequence-based clustering with CD-HIT. Clusters are currently defined at 95% sequence identity.
Gene length Number of nucleotides in consensus gene sequence.
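
The column descriptions above can be read as a record layout. The following is a minimal sketch of how one row might be parsed, not the database's actual export format: it assumes a tab-separated line with the eight columns in the order listed above, and the names GeneRecord, parse_body_site, parse_row, and is_shared are hypothetical. The example row in the usage note contains made-up values.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class GeneRecord:
    """One catalog row; field order mirrors the column descriptions (assumed)."""
    gene_id: str
    annotation: str
    ec_id: Optional[str]           # None when no EcID applies
    gene_name: str                 # may be "hypothetical protein"
    ncbi_taxon_id: Optional[int]   # None when the field is empty
    body_sites: FrozenSet[str]     # {"oral"}, {"gut"}, or {"oral", "gut"}
    cluster_size: int              # number of genes in the CD-HIT cluster
    gene_length: int               # nucleotides


def parse_body_site(field: str) -> FrozenSet[str]:
    """Split a body-site field such as 'oral | gut' into a set of site names."""
    return frozenset(part.strip() for part in field.split("|") if part.strip())


def parse_row(line: str) -> GeneRecord:
    """Parse one tab-separated row (assumed layout, not a documented format)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 8:
        raise ValueError(f"expected 8 columns, got {len(cols)}")
    gene_id, annotation, ec_id, gene_name, taxon, site, n_cluster, length = cols
    return GeneRecord(
        gene_id=gene_id,
        annotation=annotation,
        ec_id=ec_id or None,
        gene_name=gene_name,
        ncbi_taxon_id=int(taxon) if taxon else None,
        body_sites=parse_body_site(site),
        cluster_size=int(n_cluster),
        gene_length=int(length),
    )


def is_shared(record: GeneRecord) -> bool:
    """True when the gene was found in both the oral and gut body sites."""
    return record.body_sites == {"oral", "gut"}
```

For example, parse_row("GENE_000001\thypothetical protein\t\thypothetical protein\t562\toral | gut\t7\t600") yields a record whose ec_id is None and for which is_shared returns True. Storing the body site as a set avoids string comparisons against "oral | gut" and is insensitive to spacing around the separator.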

Pipeline tools

Step Software name Link
Assembly MEGAHIT https://github.com/voutcn/megahit
ORF-calling and annotation PROKKA https://github.com/tseemann/prokka
Gene catalog construction (sequence-based clustering) CD-HIT https://github.com/weizhongli/cdhit
Taxonomic annotation NCBI NR/taxonomy databases, Diamond
  1. NCBI Non-Redundant proteins database: ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
  2. NCBI taxon files: ftp://ftp.ncbi.nih.gov/pub/taxonomy/
  3. Diamond: https://github.com/bbuchfink/diamond
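
The order of the pipeline steps above can be sketched as a sequence of command lines. This is an illustration under stated assumptions, not the parameters actually used to build the catalog: all file names and function names are hypothetical, clustering is assumed to run on nucleotide gene sequences with cd-hit-est at the 95% identity reported above, and the taxonomic step is assumed to be a Diamond blastp search of predicted proteins against NR. Mapping hits to NCBI taxon IDs with the taxon files is not shown. The script only builds and prints the commands; it does not execute them.

```python
import shlex
from typing import List


def megahit_cmd(reads_1: str, reads_2: str, out_dir: str) -> List[str]:
    # Paired-end assembly; MEGAHIT writes contigs to <out_dir>/final.contigs.fa.
    return ["megahit", "-1", reads_1, "-2", reads_2, "-o", out_dir]


def prokka_cmd(contigs: str, out_dir: str, prefix: str) -> List[str]:
    # ORF calling and annotation; Prokka writes <prefix>.ffn (gene nucleotide
    # sequences) and <prefix>.faa (protein sequences) into out_dir.
    return ["prokka", "--outdir", out_dir, "--prefix", prefix, contigs]


def cdhit_cmd(genes_fasta: str, out_fasta: str, identity: float = 0.95) -> List[str]:
    # Sequence-based clustering; -c sets the identity threshold (0.95 = 95%).
    # Using cd-hit-est (nucleotide input) is an assumption.
    return ["cd-hit-est", "-i", genes_fasta, "-o", out_fasta, "-c", str(identity)]


def diamond_makedb_cmd(nr_fasta: str, db_prefix: str) -> List[str]:
    # Build a Diamond database from the NCBI NR protein FASTA.
    return ["diamond", "makedb", "--in", nr_fasta, "--db", db_prefix]


def diamond_blastp_cmd(db_prefix: str, proteins: str, out_tsv: str) -> List[str]:
    # Protein search against NR with tabular (BLAST format 6) output.
    return ["diamond", "blastp", "--db", db_prefix, "--query", proteins,
            "--out", out_tsv, "--outfmt", "6"]


def build_pipeline(sample: str) -> List[List[str]]:
    """Return the commands for one sample in pipeline order (hypothetical paths)."""
    asm_dir = f"{sample}_megahit"
    ann_dir = f"{sample}_prokka"
    return [
        megahit_cmd(f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz", asm_dir),
        prokka_cmd(f"{asm_dir}/final.contigs.fa", ann_dir, sample),
        cdhit_cmd(f"{ann_dir}/{sample}.ffn", f"{sample}_nr95.ffn"),
        diamond_makedb_cmd("nr.gz", "nr"),
        diamond_blastp_cmd("nr", f"{ann_dir}/{sample}.faa", f"{sample}_nr_hits.tsv"),
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("sampleA"):
        print(shlex.join(cmd))
```

Building each command as an argument list rather than a shell string keeps paths with spaces safe if the commands are later passed to subprocess.run. In a real catalog, clustering would typically run once over genes pooled from all samples rather than per sample as shown here.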