We study the heterogeneity of microbial genetics, with a particular emphasis on how microbes and their genes influence their environment, and through that, human health. With this resource, our goal is to expand the current understanding of variation and function in microbial sequence diversity around the world. Scroll down to see how we went about building out this database, as well as some of our key observations about microbial genetic diversity.

De novo assembly and gene identification

We started by building a “de novo assembly” pipeline. Assembly takes raw shotgun sequence data – short sequences of DNA – generated, in this case, from the pooled microbial DNA in a given environment (e.g. human stool or saliva). Akin to putting together a giant puzzle, de novo assembly finds overlaps between these sequences and joins them into long, contiguous stretches of DNA (contigs).
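To make the puzzle analogy concrete, here is a minimal sketch of overlap-based merging in Python. It is a toy greedy approach on made-up reads, not the method our pipeline uses. The function names, the `min_len` parameter, and the example reads are all hypothetical and chosen purely for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns the resulting list of contigs.
    """
    reads = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return reads
        # Join the two reads, keeping the shared bases only once.
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)


# Hypothetical toy reads that tile a single 14-base sequence.
toy_reads = ["ATGGCGT", "GCGTACC", "ACCTTGA"]
contigs = greedy_assemble(toy_reads)
```

Real metagenomic assemblers have to cope with sequencing errors, repeats, and mixtures of many genomes, so this sketch only conveys the basic idea of joining reads by their overlaps.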

We can then algorithmically identify open reading frames (ORFs, probable microbial genes) on the contigs, and afterwards find the “non-redundant” set of genes in our dataset by sequence-based clustering (i.e. treating two genes as the same gene if they are more than 95% identical). As noted here, we were particularly interested in the overall rarity of genes – that is to say, how many different samples they occurred in. We refer to genes that appear in only one sample as “singletons,” and their counterparts as “non-singletons.”
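As a rough illustration of the singleton definition, the sketch below labels gene clusters by the number of distinct samples they appear in. It assumes clustering has already been done and that each ORF comes with a cluster ID and a sample ID. The record layout, function name, and example values are hypothetical.

```python
from collections import defaultdict


def classify_clusters(orf_records):
    """Split gene clusters into singletons and non-singletons.

    `orf_records` is an iterable of (cluster_id, sample_id) pairs, one per ORF.
    Assumption: a cluster counts as a singleton if all of its ORFs come from
    exactly one sample.
    """
    samples_per_cluster = defaultdict(set)
    for cluster_id, sample_id in orf_records:
        samples_per_cluster[cluster_id].add(sample_id)

    singletons = {c for c, s in samples_per_cluster.items() if len(s) == 1}
    non_singletons = set(samples_per_cluster) - singletons
    return singletons, non_singletons


# Hypothetical records: geneA is seen in two samples, geneB and geneC in one.
records = [
    ("geneA", "sample1"),
    ("geneA", "sample2"),
    ("geneB", "sample1"),
    ("geneC", "sample3"),
    ("geneC", "sample3"),
]
singletons, non_singletons = classify_clusters(records)
```

Note that under this assumption `geneC` is still a singleton even though it has two ORFs, because both come from the same sample.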


Our database

Using our pipeline, we were able to identify a total of 157,241,550 ORFs from our assembled oral microbiome data, compared to 136,672,846 from our gut microbiome data. Clustering at the 95% identity threshold, our initial oral and gut catalogs contained 23,961,508 and 22,254,436 consensus genes, respectively. When our oral and gut catalogs were clustered together at 95% identity, the resultant catalog had 45,666,334 genes.

To make searching this data a simple process, we built this resource, which is outlined in the figure on the left. We constructed a PostgreSQL database housing our gene sequence information with metadata regarding predicted gene taxonomic annotation, gene function, gene cluster size (i.e. singleton status), gene length, and origin body site.
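For a sense of what such a gene table might look like, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in so the example runs without a server; it is not our PostgreSQL schema. The table name, column names, and rows are hypothetical, and the sketch assumes a cluster size of 1 marks a singleton.

```python
import sqlite3

# In-memory SQLite database standing in for the real PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genes (
        gene_id       TEXT PRIMARY KEY,
        sequence      TEXT NOT NULL,
        taxonomy      TEXT,              -- predicted taxonomic annotation
        gene_function TEXT,              -- predicted function
        cluster_size  INTEGER NOT NULL,  -- assumed: 1 means singleton
        gene_length   INTEGER NOT NULL,
        body_site     TEXT NOT NULL      -- 'oral' or 'gut'
    )
    """
)

# Hypothetical example rows, not real catalog entries.
rows = [
    ("g1", "ATGAAATAA", "Bacteroides", "hypothetical protein", 1, 9, "gut"),
    ("g2", "ATGCCCGGGTAA", "Streptococcus", "transporter", 12, 12, "oral"),
    ("g3", "ATGTTTTAG", None, None, 1, 9, "oral"),
]
conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Example search: all singleton genes found in the oral microbiome.
oral_singletons = conn.execute(
    "SELECT gene_id FROM genes "
    "WHERE body_site = ? AND cluster_size = 1 ORDER BY gene_id",
    ("oral",),
).fetchall()
```

Keeping the metadata as plain columns means searches by body site, singleton status, or gene length reduce to simple `WHERE` clauses.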

Gene cluster sizes, 95% identity

We found this result from our gene frequency analysis particularly striking – when we defined a unique gene at the 95% identity level (again, no more than 95% identical to any other gene in our dataset), we found that nearly half of the open reading frames we identified were singletons, occurring in only one sample. In this plot, the singletons in both the oral and gut microbiomes form the peak on the far left.


Singletons at different percent identities

The first thing we did was to validate that singletons were not artifacts of our percent identity cutoff. We were concerned that by requiring more than 95% identity before treating two genes as the same, we were being too strict – perhaps the optimal cutoff was 90%, or even lower. If that were the case, then maybe all the singletons would cluster with other genes once we dropped below a certain point. With that in mind, we progressively relaxed our threshold, iteratively re-clustering our gene catalog until we hit 50% identity. To our surprise, the ratio of singletons to non-singletons stayed about the same regardless of the identity cutoff.
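The mechanics of this check can be sketched as follows: re-cluster the same genes at progressively lower identity thresholds and recompute the fraction of clusters that are singletons each time. This is a toy version, assuming a simple position-by-position identity measure and greedy clustering against the first member of each cluster. It does not represent the clustering software or parameters we used, and all names and sequences are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions, relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(genes, threshold):
    """Greedily assign each (sequence, sample_id) gene to the first cluster
    whose representative (its first member) meets the identity threshold."""
    clusters = []
    for gene in genes:
        for cluster in clusters:
            if identity(gene[0], cluster[0][0]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no match: start a new cluster
    return clusters


def singleton_fraction(clusters):
    """Fraction of clusters whose members all come from a single sample."""
    single = sum(1 for c in clusters if len({sample for _, sample in c}) == 1)
    return single / len(clusters)


# Hypothetical genes, each from a different sample.
genes = [
    ("AAAAAAAAAA", "s1"),
    ("AAAAAAAAAT", "s2"),  # 90% identical to the first gene
    ("CCCCCCCCCC", "s3"),
    ("GGGGGGGGGG", "s4"),
    ("GGGGGTTTTT", "s5"),  # 50% identical to the fourth gene
]

# Map each threshold to (number of clusters, singleton fraction).
results = {}
for threshold in (0.95, 0.90, 0.50):
    clusters = greedy_cluster(genes, threshold)
    results[threshold] = (len(clusters), singleton_fraction(clusters))
```

The toy sequences are built so that clusters merge as the threshold drops, which shows the procedure only. They are not meant to reproduce the stable singleton ratio we observed in the real catalogs.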

Gene cluster sizes, 50% identity

Here we can see that, while the shape of the distributions changed slightly from 95%, our gene catalogs in the oral and gut are still about 50% singletons at 50% identity. We wrote a paper that further analyzes where these genes are coming from and what they’re doing. If you’re interested in learning more, we encourage you to take a look at the “Contact” page of this site, where you can find the abstract and a link to the text!