The Genome that Won A Nobel Prize

April 18, 2008

My group met recently to discuss the in-press-at-Nature publication of Jim Watson’s genome – the first diploid human genome to be sequenced with next-generation technology. I’ve been waiting for this since 454 announced the project’s completion at the HGM2007 meeting last year in Montreal. It’s a landmark publication in terms of human genetic variation, and of particular interest to me since I work on our center’s 454 analysis pipeline.

Watson and Crick

In two months, Roche/454 generated ~106.5 million genomic reads from Watson’s DNA in 234 runs. Using BLAT, they mapped 93.2 million reads (87.5%) to NCBI build 36 (hg18), yielding an average coverage of about 7.4x. No doubt the expense of this effort was substantial, though the authors claim it was 1/100th of what capillary sequencing would have cost. It probably also hurt to throw away 2.5 million “unmapped” reads, though they did some post-processing of these with interesting results.
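For anyone who wants to check the arithmetic, here is a rough sketch of how the mapping rate falls out of the read counts above, and what mean mapped read length the 7.4x figure would imply. The genome size of 3.1 billion bases is my own round-number assumption, not a value from the paper, so treat the read length as a ballpark only.

```python
# Figures quoted in the post
total_reads = 106.5e6   # genomic reads generated
mapped_reads = 93.2e6   # reads mapped with BLAT
coverage = 7.4          # reported average coverage

# Assumption (not from the paper): approximate haploid reference size in bases
ASSUMED_GENOME_SIZE = 3.1e9

# Fraction of reads that mapped to the reference
mapping_rate = mapped_reads / total_reads

# Coverage = mapped bases / genome size, so the implied mean read length is:
implied_read_length = coverage * ASSUMED_GENOME_SIZE / mapped_reads

print(f"mapping rate: {mapping_rate:.1%}")
print(f"implied mean mapped read length: {implied_read_length:.0f} bp")
```

A different assumed genome size shifts the read length proportionally, but the mapping rate depends only on the two read counts.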

After a few filters were applied, the authors produced a set of 3.32 million SNPs in Watson’s genome, a number deliciously comparable to Craig Venter’s 3.47 million SNPs. In both men, >80% of the SNPs are already known (to dbSNP). The most recent build of dbSNP (build 128), which doesn’t yet include novel Watson/Venter SNPs, has 9.89 million SNPs. The authors didn’t say, but I estimate that the men share about 300,000 novel SNPs. Together they’ll add about 10% to the set of known SNPs, and only 1-2% to the set of known nonsynonymous SNPs (nsSNPs). I hate to break it to you, but the sun is setting for nsSNPs. We know about 95% of them already, and in Jim Watson only 7% are likely to be deleterious.
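The "about 10%" figure can be reproduced with a back-of-the-envelope calculation. This sketch treats ">80% known" as exactly 80% known, which makes the result an upper bound, and it relies on my estimate of 300,000 shared novel SNPs, which is a guess rather than a published number.

```python
# Figures quoted in the post (in millions of SNPs)
watson_snps = 3.32
venter_snps = 3.47
dbsnp_128 = 9.89

# Assumption: exactly 80% known, so at most 20% of each set is novel
NOVEL_FRACTION = 0.20

# Assumption: the blog author's own estimate of novel SNPs shared by both men
shared_novel = 0.30

watson_novel = watson_snps * NOVEL_FRACTION
venter_novel = venter_snps * NOVEL_FRACTION

# Count shared novel SNPs once, not twice
combined_novel = watson_novel + venter_novel - shared_novel

# Relative growth of dbSNP if all of them were added
dbsnp_increase = combined_novel / dbsnp_128

print(f"combined novel SNPs: {combined_novel:.2f} million")
print(f"increase over dbSNP build 128: {dbsnp_increase:.1%}")
```

Because the known fraction is really somewhat above 80%, the true contribution would be a bit smaller than this bound.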

Also, over at GeneticFuture, Daniel MacArthur discusses how the Watson genome may be gloomy news for the field of personal genomics. He points out that we’re perhaps five years away from affordable whole-genome sequencing, and by then we will no doubt have a much better understanding of how functional variation affects human phenotypes.

Indels are why I love 454 technology. In Watson’s genome they identified >200,000 indels of at least 2 bp. Insertion detection is limited by read length, and so most were <200 bp. The largest deletion, however, was nearly 40 kbp. Only a fraction of the indels (~350) affected coding sequence. They saw a validation rate of 70% for a sampling of coding indels between 2 and 50 bp, which is pretty good. Single-base indels were treated with extreme caution, as over 80% of these were associated with homopolymers, the Achilles heel of 454 sequencing.
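To make the homopolymer issue concrete, here is a minimal sketch of how one might flag an indel position that falls inside a homopolymer run. This is not the authors' filter: the function names and the minimum run length of 3 are my own assumptions for illustration.

```python
def homopolymer_runs(seq, min_len=3):
    """Return (start, end, base) for each run of a single base that is at
    least min_len long. Coordinates are 0-based, end exclusive.
    min_len=3 is an illustrative threshold, not one taken from the paper."""
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # Close the current run at the end of the sequence or on a base change
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                runs.append((start, i, seq[start]))
            start = i
    return runs


def in_homopolymer(seq, pos, min_len=3):
    """True if 0-based position pos lies inside a homopolymer run."""
    return any(s <= pos < e for s, e, _ in homopolymer_runs(seq, min_len))


# Hypothetical reference fragment with a T run and an A run
example = "ACGTTTTTGCAAAC"
print(homopolymer_runs(example))
print(in_homopolymer(example, 5), in_homopolymer(example, 9))
```

A single-base indel call at a flagged position would deserve the extra scrutiny described above, since 454 chemistry tends to miscount the length of such runs.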

This paper was worth the wait. Not only was it an impressive demonstration of the power of 454 sequencing for whole-genome sequencing, but it openly addressed many of the informatics challenges therein and answered some interesting questions along the way. We can now confidently say that an individual carries ~3.7 million SNPs relative to the reference sequence, of which perhaps 10,000 are protein-altering. Ten of Watson’s nsSNPs were Mendelian-recessive, highly penetrant, disease-causing alleles according to HGMD, suggesting that each of us carries many more deleterious alleles than was previously believed. Yet analysis of the unplaced 454 reads suggests that as many as 100 protein-coding genes are still absent from the reference sequence. It seems like the work on the human genome is never done. I certainly know the feeling.


Genome-Wide Association Failures

April 1, 2008

There was an interesting post over at GeneticFuture on why genome-wide association studies fail. It’s a good discussion of the many challenges that still face GWAS even in the era of high-throughput SNP genotyping.

It should be noted that there have been many successful genome-wide association studies, especially since the completion of the International HapMap Project (phases I/II). Last year saw high-profile publication of GWAS for coronary heart disease, breast cancer, celiac disease, type 1 diabetes and Crohn’s disease, just to name a few. deCODE Genetics performed a large-scale study on the genetics underlying exfoliation glaucoma, and found that individuals with two particular SNPs in the first exon of LOXL1 had a 100-fold greater chance of getting the disease.

Last June the Wellcome Trust Case Control Consortium published the largest study ever of the genetics behind common diseases. In a massive cohort of 17,000 samples, the researchers performed GWAS for diabetes, rheumatoid arthritis, cardiac disease, and other common, complex phenotypes. Perhaps the most exciting result of this study was the association of several genes that had never before been implicated in human disease.

Yet, as the GeneticFuture post pointed out, we rarely hear about the genome-wide association studies that fail to turn up such interesting discoveries. Small allelic effects, population structure, rare variants, and copy number variation may each explain why a study comes up empty. As for epigenetic factors and disease heterogeneity, well, these issues are out of our hands for the time being.

As far as SNPs go, I believe we’re getting very close to a complete catalog of variation that’s common in human populations. Genome-wide sequencing of two individual human genomes each found ~600,000 SNPs that are not already in dbSNP. At best they’d increase the number of known SNPs by ~10%. At ~10-11 million SNPs, dbSNP is mostly complete in my opinion. We still have a long way to go, though, in cataloging copy number variation.

Another challenge not mentioned in the GeneticFuture post, but nevertheless important, is the fact that a substantial fraction of the genetic variation underlying complex disease occurs outside the coding regions of known genes. It’s time to look beyond nonsynonymous coding SNPs, people. But that’s a post for another day.