Researchers are seeking to clarify how to properly identify small germline variants, regardless of the type of technology that is used to sequence the genomes of patients and the panels that are used to evaluate the presence of a variant.1
In a paper published in Nature Biotechnology, investigators associated with the Global Alliance for Genomics and Health (GA4GH) explained how methods for calling variants within “truth sets,” or high-confidence regions, can be standardized. The best-practice tools were used in PrecisionFDA, the US Food and Drug Administration’s pilot program on best practices in variant calling.
But outside of these “easier” variants and known areas of the genome, concordance in variant calling drops to approximately 60%. Even so, recall estimates for difficult variants may improve once the team starts incorporating “long, single-molecule-sequencing reads,” they noted. Tools are also in the works so that the team can suggest benchmark sets for detecting variants in linked reads (10x Genomics) and long reads (PacBio). At the same time, the Genome in a Bottle (GIAB) consortium is working on other best-practice suggestions for evaluating structural variant call sets and somatic variants.2
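To make the idea of benchmarking against a truth set concrete, the following is a minimal sketch, not the GA4GH tooling. It assumes variants can be compared by exact match on chromosome, position, reference allele, and alternate allele, and that confident regions are given as BED-style intervals; real benchmarking tools also reconcile different representations of the same variant, which this sketch ignores. The names `Variant`, `Region`, and `benchmark` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position, as in VCF
    ref: str
    alt: str


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)


def in_regions(v, regions):
    """True if the 1-based variant position falls in any BED-style region."""
    return any(r.chrom == v.chrom and r.start < v.pos <= r.end for r in regions)


def benchmark(truth, query, confident):
    """Count matches and compute precision/recall inside confident regions only."""
    truth_in = {v for v in truth if in_regions(v, confident)}
    query_in = {v for v in query if in_regions(v, confident)}
    tp = len(truth_in & query_in)   # called and in the truth set
    fp = len(query_in - truth_in)   # called but not in the truth set
    fn = len(truth_in - query_in)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy data (made up for illustration only).
confident = [Region("chr1", 0, 1000)]
truth = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr1", 5000, "G", "A"),   # outside confident regions, ignored
]
query = [
    Variant("chr1", 100, "A", "G"),    # true positive
    Variant("chr1", 300, "T", "C"),    # false positive
    Variant("chr1", 5000, "G", "C"),   # outside confident regions, ignored
]
result = benchmark(truth, query, confident)
print(result)
```

Restricting both call sets to the confident regions before counting is what keeps a caller from being penalized, or rewarded, in parts of the genome where the truth set itself is uncertain.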
Although investigators’ variant-calling abilities will likely continue to improve in tandem with improvements to benchmark data and sequencing technologies, the new suggestions for standardization may give some much-needed genomic context to the reams of genetic data pouring in, especially as it relates to sequencing in the clinic. Still, “groups will also need to modify benchmarking strategies to address changes in the way the human genome itself is represented,” wrote the study authors.
Cancer Therapy Advisor spoke with Albert Vilella, PhD, head of precision bioinformatics at Cambridge Epigenetix in the United Kingdom, and Jonas Korlach, PhD, chief scientific officer of Pacific Biosciences (PacBio) in Menlo Park, California, to discuss how well the GA4GH team did in its quest to make variants easier to spot.
Cancer Therapy Advisor: What is your gut reaction to this study and its conclusions?
Dr Vilella: There are different scenarios possible when trying to evaluate which combination of assays (DNA sequencing of short reads, long reads, different coverages, etc) and analysis software produce the best results [for calling variants].
One scenario is where high-confidence data can be produced by performing expensive experiments once that can inform the correctness of [a] cheaper recurrent assay plus its companion software analysis. The GA4GH alliance has done a great job at framing this in a way that can be shared collectively across academia and industry.
The Illumina Platinum Genomes project is, in my opinion, an example of [this type of] ‘expensive experiment that needs to be done once,’ where they sequenced at relatively high coverage the entire pedigree of a 3-generation family for a total of 17 individuals. This was then combined with some clever software implementation designed by Michael Eberle, PhD, to be able to assess single-sample DNA sequencing and variant-calling platforms.
The GIAB applied similar approaches to produce reference data that produces highly confident calls in some areas of the genome (long reads, indels, copy number) and can be contrasted against different DNA-sequencing technologies.
There is a final scenario where multiple people work semi-independently on a problem, but nobody can produce significantly better data or results than the other groups, and people produce instances of software that have different good/bad corner case behaviors when compared collectively. This is the least preferable situation to be in, but it allows consortia to produce final higher-quality results in the aggregate compared [with] any individual tool.
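As an editorial illustration of how results from several tools might be aggregated, here is a minimal sketch that keeps only the variants reported by at least a chosen number of callers. Majority voting is an assumption made for this example; the interview does not say which aggregation method consortia actually use, and the function name `consensus_calls` is hypothetical.

```python
from collections import Counter


def consensus_calls(call_sets, min_support):
    """Return variants reported by at least `min_support` of the call sets."""
    # set(calls) ensures each caller votes at most once per variant.
    counts = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}


# Toy call sets (made up): tuples of (chrom, pos, ref, alt).
caller_a = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")]
caller_b = [("chr1", 100, "A", "G"), ("chr1", 300, "T", "C")]
caller_c = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "G", "A")]

agreed = consensus_calls([caller_a, caller_b, caller_c], min_support=2)
print(sorted(agreed))
```

Raising `min_support` trades recall for precision: calls unique to one tool, which are more likely to reflect that tool's corner-case behavior, are dropped.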
Reading the section of the paper on stratification of regions, I wonder if the communities working on this aspect are doing enough in influencing the considerations for improving the software that is applied at the level of the DNA sequencer instrument: at the base-calling level. If the software that performs the base calling is not being evaluated with regard to the quality of the base calling in, for example, 52-200 bp AT dinucleotide tandem repeats, then later in the process we find out that certain DNA sequencers struggle to correctly base call reads that truly reflect the complexity of these regions in the human genome.
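As an editorial illustration of stratification, the sketch below reports recall separately for each category of genomic region, so that weak performance in a hard category is not hidden by good performance elsewhere. It is a simplified example under stated assumptions: variants are matched exactly, regions are BED-style intervals, and the stratum names and the function `recall_by_stratum` are hypothetical rather than taken from the paper.

```python
def recall_by_stratum(truth, query, strata):
    """Compute recall of `query` against `truth` within each named stratum.

    Variants are (chrom, pos, ref, alt) tuples with 1-based positions.
    Each stratum maps a name to a list of (chrom, start, end) BED-style
    intervals (0-based start, exclusive end).
    """
    query_set = set(query)
    report = {}
    for name, regions in strata.items():
        truth_in = [
            v for v in truth
            if any(c == v[0] and s < v[1] <= e for c, s, e in regions)
        ]
        found = sum(1 for v in truth_in if v in query_set)
        # None signals that the stratum contains no truth variants.
        report[name] = found / len(truth_in) if truth_in else None
    return report


# Toy data (made up for illustration only).
strata = {
    "at_dinucleotide_repeat": [("chr1", 100, 200)],
    "other": [("chr1", 800, 1000)],
}
truth = [
    ("chr1", 150, "A", "T"),
    ("chr1", 180, "T", "A"),
    ("chr1", 900, "C", "G"),
    ("chr1", 950, "G", "C"),
]
query = [
    ("chr1", 150, "A", "T"),
    ("chr1", 900, "C", "G"),
    ("chr1", 950, "G", "C"),
]
report = recall_by_stratum(truth, query, strata)
print(report)
```

In this toy example the overall recall is 3 of 4, but the per-stratum view shows that the single missed variant sits in the repeat category.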
Overall, I hope the different communities performing these benchmarking exercises continue being actively sponsored, and that their results are communicated to all parties involved so that the science that can be performed on this technology continues to improve.
Dr Korlach: The work described, and the reference materials and call sets that the work is based on, have been critically important for validating existing genomic analysis tools and developing new ones. It highlights a broader need for variant benchmarking and best-practice methods using currently available genomic standards.
The GIAB benchmark is extremely valuable to ensuring quality and consistency in genome sequencing, and thus, groups like GA4GH and the GIAB project will play essential roles in developing robust methods and standards for precise measurements of variants in human genomes.
Precise genome standards, and variant detection benchmarking tools, will ultimately serve as the foundation for precision medicine and should especially benefit the cancer diagnostic community once broadly adopted.
Internally, [PacBio has] made a lot of progress on short-variant calling in the past 8 months using [PacBio’s] new long and accurate HiFi read paradigm based on [the company’s] CCS [circular consensus sequencing] approach, and this would have been much more difficult without the work of GIAB.
One of the current shortcomings (particularly important for the field of oncology) is that the present benchmark only includes small variants; however, a structural variant benchmark is already in development.
1. Krusche P, Trigg L, Boutros PC, et al. Best practices for benchmarking germline small-variant calls in human genomes [published online March 11, 2019]. Nat Biotechnol. doi: 10.1038/s41587-019-0054-x
2. PacBio. GIAB expands call sets with SMRT sequencing results. Published March 6, 2019. Accessed March 11, 2019.