De novo assembly and gene identification
We started by building a “de novo assembly” pipeline. Assembly takes raw shotgun sequence data – short sequences of DNA – which in this case were generated from all of the microbial DNA pooled from a given environment (i.e. human stool or saliva). Akin to putting together a giant puzzle, de novo assembly finds overlaps between these short sequences and joins them into long, contiguous stretches of DNA (contigs).
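To make the puzzle analogy concrete, here is a minimal sketch of overlap-based assembly in Python. It is a toy illustration only and not the pipeline we ran: the function names (`overlap`, `greedy_assemble`), the `min_len` parameter, and the example reads are all hypothetical, and it ignores sequencing errors, reverse complements, and repeats. Production metagenomic assemblers typically rely on de Bruijn graph methods rather than this pairwise greedy merge.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns the resulting list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads that tile a single 13-base stretch of DNA.
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap are left as separate contigs, which is why real assemblies of complex communities end up with many contigs rather than one sequence per genome.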
We can then algorithmically identify open reading frames (ORFs, probable microbial genes) on the contigs, and afterwards find the “non-redundant” set of genes in our dataset by sequence-based clustering (i.e. treating two genes as the same gene if their sequences are more than 95% identical). As noted here, we were particularly interested in the overall rarity of genes – that is to say, how many different samples each gene occurred in. We refer to genes that appear in only one sample as “singletons,” and their counterparts as “non-singletons.”
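The singleton definition can be sketched in a few lines of Python. This is a minimal illustration, assuming that clustering has already assigned each gene a non-redundant cluster ID and that a singleton is a gene observed in exactly one sample; the function name `classify_genes` and the sample and gene IDs are hypothetical, not taken from our data.

```python
from collections import defaultdict


def classify_genes(observations):
    """Split gene clusters into singletons and non-singletons.

    `observations` is an iterable of (sample_id, gene_cluster_id) pairs,
    one pair per gene call. A gene is treated as a singleton if it was
    seen in exactly one distinct sample.
    """
    samples_per_gene = defaultdict(set)
    for sample_id, gene_id in observations:
        samples_per_gene[gene_id].add(sample_id)

    singletons = {g for g, s in samples_per_gene.items() if len(s) == 1}
    non_singletons = set(samples_per_gene) - singletons
    return singletons, non_singletons


# Hypothetical gene calls across three samples.
observations = [
    ("stool_1", "geneA"),
    ("stool_2", "geneA"),   # geneA is seen in two samples
    ("saliva_1", "geneB"),  # geneB is seen in one sample
    ("stool_1", "geneC"),
    ("stool_1", "geneC"),   # repeated within one sample, still one sample
]
singletons, non_singletons = classify_genes(observations)
print(sorted(singletons), sorted(non_singletons))
```

Counting distinct samples rather than raw occurrences means a gene found several times within one sample still counts as a singleton under this assumption.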