How deeply should I sequence?

One Codex Team on April 12, 2016

We often get this question from people jumping into metagenomic sequencing for the first time: “How many reads do I need per sample?”

The way that I like to break this up is by thinking about what you’re hoping to get from an experiment:

Pathogen detection

If you’re looking for a low-abundance organism, increasing the depth of sequencing will linearly improve (lower) the limit of detection (LoD). A good rule of thumb is that you need 100-1000 reads to confidently identify an organism (this varies widely by the organism you’re looking for, but it’s a reasonable range). Dividing that read count by the total number of reads gives the LoD: sequencing 1 million reads gives an LoD of 0.01-0.1%, 10 million reads gives an LoD of 0.001-0.01%, and so on. So plug in your desired LoD and you’re set.
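The mental math above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb: the function name `limit_of_detection` is made up for this example, and the 100 and 1000 read thresholds are the rough range quoted above, not validated cutoffs.

```python
def limit_of_detection(total_reads, reads_needed):
    """Return the lowest detectable relative abundance as a fraction of all reads."""
    if total_reads <= 0 or reads_needed <= 0:
        raise ValueError("total_reads and reads_needed must be positive")
    # Assumption: reads are sampled in proportion to each organism's abundance.
    return reads_needed / total_reads


# Rule-of-thumb range from the text: 100 to 1000 reads for a confident call.
for depth in (1_000_000, 10_000_000):
    best = limit_of_detection(depth, reads_needed=100)
    worst = limit_of_detection(depth, reads_needed=1000)
    print(f"{depth:>12,} reads: LoD between {best:.4%} and {worst:.4%}")
```

Because the relationship is a simple ratio, every tenfold increase in depth lowers the LoD tenfold.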

Functional characterization

If your goal is to detect a set of genes in order to predict antimicrobial resistance, MLST, virulence, etc., you need to capture enough data to cover the whole microbial genome. Bacterial genomes are on the order of 5 million bases long (viral genomes are on the order of 10 thousand bases). Covering every gene in the genome requires ~5X coverage, and an average read is 150bp long, so sequencing a bacterium requires roughly 170,000 reads (5Mb / 150bp * 5X).

If you’re sequencing a mixture, just divide by the proportional abundance of your target organism. If you’re looking for a bacterium that is 1% of the total sample, then you will need roughly 17M (170,000 / 1%) reads.
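Here is a minimal sketch of the same coverage arithmetic, in case you want to plug in your own genome size or abundance. The function name `reads_for_coverage` and its parameters are hypothetical, and it assumes reads land evenly across the genome and in proportion to abundance, which real samples only approximate.

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp=150, abundance=1.0):
    """Estimate the total reads needed to cover a target genome at a given depth.

    abundance is the target's share of the sample as a fraction
    (1.0 means a pure isolate, 0.01 means 1% of the sample).
    """
    if not 0 < abundance <= 1:
        raise ValueError("abundance must be greater than 0 and at most 1")
    # Reads that must land on the target genome itself.
    reads_on_target = genome_size_bp * coverage / read_length_bp
    # In a mixture, only a fraction of all reads come from the target.
    return reads_on_target / abundance


isolate = reads_for_coverage(5_000_000, coverage=5)
mixture = reads_for_coverage(5_000_000, coverage=5, abundance=0.01)
print(f"pure isolate: ~{isolate:,.0f} reads")
print(f"1% of sample: ~{mixture:,.0f} reads")
```

For a 5Mb genome at 5X with 150bp reads, the unrounded result is about 167,000 reads, close to the rounded figure quoted above.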

Community profiling

If what you care about is the total microbial community, such as the human microbiome, then we’ll need to make some assumptions. Let’s say we want to detect the organisms that are at least 0.1% of the total community, and let’s say that on average we need 100 reads to perform species-level identification. The total number of reads required for that task is actually pretty modest – 100,000.
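The same estimate can be turned around to ask how depth scales as you push the abundance threshold lower. This is a rough sketch under the assumptions stated above: the function name `reads_for_community_profile` is invented for this example, and the default of 100 reads per species-level call is the working assumption from the paragraph, not a fixed requirement.

```python
def reads_for_community_profile(min_abundance, reads_per_species=100):
    """Estimate total reads needed to identify species at or above min_abundance.

    min_abundance is a fraction of the community (0.001 means 0.1%).
    """
    if not 0 < min_abundance <= 1:
        raise ValueError("min_abundance must be greater than 0 and at most 1")
    # Assumption: reads are sampled in proportion to each species' abundance.
    return reads_per_species / min_abundance


for pct in (1.0, 0.1, 0.01):
    total = reads_for_community_profile(pct / 100)
    print(f"species at >= {pct}% of the community: ~{total:,.0f} reads")
```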

Sequencing to a higher depth just adds to the resolution you get, catches more viruses with small genomes, and does a better job of picking up low-abundance species. Of course, samples with a large number of novel organisms may have a lower level of homology to reference databases, requiring more sequencing to perform specific classification.