CTA: What about the role of pseudogenes, or repetitive noncoding regions?

Dr Gerstein: Great question. Pseudogenes are basically duplicated copies of genes that have been disabled. They’re not working as genes but there are documented cases where pseudogenes take on a new life as a noncoding RNA that potentially regulates a current gene. And of course, we don’t map to pseudogenes in exome-only sequencing.

Pseudogenes are a major confounder for protein-coding gene sequencing. If you don’t properly account for the pseudogenes in your WES, they often still bind to the capture reagent. Because they’re disabled, they might have sequence differences relative to the gene. Those sequence differences can be mapped onto the gene [in WES], resulting in erroneous variants being identified for that gene.

That’s one of the main reasons that WGS provides more accurate variant calls: they’re less confounded by the mismapping of noncoding regions onto genes. Capture reagents are not completely specific to the protein-coding regions, so they pull in pseudogenes in a messy way. You don’t get as clean a signal because you’ve imposed the additional step [of a capturing reagent].

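Editor's illustration: the mismapping described above can be sketched with a toy example. This is a minimal sketch, not a real aligner or variant caller. The sequences, the helper names (`substitute`, `best_hit`, `apparent_variants`) and the ungapped lowest-mismatch placement rule are all assumptions made up for illustration.

```python
# Toy illustration (hypothetical sequences and helpers, not a real aligner).

def substitute(seq, changes):
    """Return seq with the given {position: base} substitutions applied."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos] = base
    return "".join(bases)

def count_mismatches(read, ref, offset):
    """Count mismatches for an ungapped placement of read at ref[offset:]."""
    window = ref[offset:offset + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)

def best_hit(read, references):
    """Return (name, offset, mismatches) for the lowest-mismatch placement."""
    best = None
    for name, seq in references.items():
        for offset in range(len(seq) - len(read) + 1):
            mm = count_mismatches(read, seq, offset)
            if best is None or mm < best[2]:
                best = (name, offset, mm)
    return best

def apparent_variants(read, ref, offset):
    """List (ref_position, ref_base, read_base) where the read disagrees."""
    return [(offset + i, ref[offset + i], base)
            for i, base in enumerate(read) if ref[offset + i] != base]

# Made-up "gene" and a disabled copy that differs at two positions.
GENE = "ATGGCCAAGTTCGACCTGAAGGTCATCGGA"
PSEUDOGENE = substitute(GENE, {12: "T", 17: "A"})

# A read that was captured from the pseudogene, not from the gene.
read = PSEUDOGENE[8:22]

# Reference contains only the gene: the read is forced onto the gene.
name, offset, mm = best_hit(read, {"gene": GENE})
print(name, offset, mm)                       # gene 8 2
print(apparent_variants(read, GENE, offset))  # [(12, 'G', 'T'), (17, 'G', 'A')]

# Reference also contains the pseudogene: the read goes to its true origin.
print(best_hit(read, {"gene": GENE, "pseudogene": PSEUDOGENE}))
# ('pseudogene', 8, 0)
```

In the first case the two pseudogene-specific bases show up as apparent variants in the gene. In the second case the read matches the pseudogene exactly and no variant is attributed to the gene.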

CTA: So, what are the key advantages of WGS?

Dr Gerstein: The nice thing about WGS is we don’t yet know what all of the regions of the genome do. For a lot of datasets, particularly rare diseases, the data have tremendous archival value. Today we may see no point in sequencing a particular intron far from a gene, but maybe 20 years from now, we will. If you have the sample, you have a much better archival sample with WGS.

CTA: How much does the particular sequencing machine used matter? Do particular machines offer more robust sequencing?

Dr Gerstein: I don’t think it is that meaningful in exome sequencing. If your read gets very long, say 4k, you can’t meaningfully sequence exomes at that point. WES is more meaningful for very short reads, and reads are getting longer and longer. So, the concept of “the exome” becomes fuzzier and fuzzier.

CTA: Some researchers predict that transcriptomics will be clinically available in the very near future, in the coming year or so. Do you agree?

Dr Gerstein: Sure. But there are a lot of challenges with RNA sequencing that aren’t an issue with DNA sequencing. RNA degrades much more quickly than DNA and it’s harder to get a sample of RNA. There’s a whole set of preparation issues with RNA, which can be tricky with degrading samples. But RNA sequencing is very useful for identifying biomarkers. It’s used in research using patient samples, but I’m not sure it’s yet used in treatment of the patient.

CTA: So, the cost of WGS is dropping. What about the storage costs for the resulting massive amounts of data?

Dr Gerstein: So, with whole-genome, there are hundreds of times more data to store. That’s fair, but the point is that storage costs are dropping quickly, too. Just like sequencing costs, the storage of data is becoming exponentially less expensive. I can get a laptop with a couple of terabytes and it’s no big deal. It’s still hard to move that volume of data around, I agree. But storage is getting cheaper.


  1. Adams DR, Eng CM. Next-generation sequencing to diagnose suspected genetic disorders. N Engl J Med. 2018;379(14):1353-1362.
  2. Li S, Gerstein MB. Next-generation sequencing to diagnose suspected genetic disorders. N Engl J Med. 2019;380(2):200.
  3. Zhang Y, Li S, Abyzov A, Gerstein MB. Landscape and variation of novel retroduplications in 26 human populations. PLoS Comput Biol. 2017;13(6):e1005567.