Happy DNA Day!

April 25, 2008

NHGRI

April 25th is National DNA Day in the U.S., an occasion that commemorates the discovery of DNA’s structure by Watson and Crick (published in Nature on April 25th, 1953) and the completion of the Human Genome Project fifty years later in April 2003. NHGRI has set up a web site for DNA Day featuring a welcome video from Francis Collins and a live chatroom where anyone can chat with “leading researchers in the field of genetics.” Strangely, I haven’t yet gotten the call.

The WashU Genome Center outreach office is also sponsoring several activities. A number of DNA Day Ambassadors are visiting local high schools in St. Louis. A symposium on the medical campus will have student poster sessions and a seminar, “The Human Genome Sequence: A Foundation for Biological Inquiry”, given by our co-director Elaine Mardis.

By chance I happened to Google the King and Wilson 1975 paper the other day, and came across a very interesting site of Landmarks in the History of Genetics. While not up to date, it tells a nice story of key events in DNA’s history (and their implications) since 1745. Of course there’s Darwin’s publication of The Origin of Species in 1859, and Mendel’s Experiments in Plant Hybridisation just six years later. I recognize the name of Erwin Chargaff, whose insight (1950) was that the relative incidence of A, C, G, and T nucleotides was not random, but perhaps a kind of code. Two years later came the Hershey and Chase experiments, which showed that viruses infect host cells by injecting their DNA, while the proteins generally remain outside the cell. Of course we know Watson and Crick (1953), as well as King and Wilson (1975). How about Barbara McClintock, whose discovery of transposable elements in maize in the 1940s was not fully recognized for decades?

That rather sounds like Mendel, doesn’t it? I wonder how many other important discoveries in biology have already been made, but not yet appreciated. It’s something to think about. Happy DNA Day everyone!


Cis-regs and Functional Noncoding Variation

April 24, 2008

On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.

Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:

  1. The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson [1975] found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that regulation of gene expression, not differences in DNA sequence, drove phenotypic divergence.
  2. The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
  3. The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.

Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and thus they should be conserved between genomes that share common ancestry. Comparative genomics can therefore guide us to functional variants, as SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.

Because deep sequence conservation approaches might not work for noncoding SNPs, they focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBS’s). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate, by their approach about 3.8% of the SNPs between closely related yeast species are likely to be functional.
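My extrapolation is nothing more than the product of the two reported fractions. Here is a quick sketch of the arithmetic in Python, using only the figures quoted above (the variable names are mine, not Scott's):

```python
# Figures quoted from the talk summary above.
tfbs_variants = 2106     # variant positions inside conserved TFBSs
tfbs_fraction = 0.13     # those positions as a share of all variants
deleterious = 615        # TFBS variants called deleterious by the model

# Share of TFBS variants that look deleterious.
deleterious_share = deleterious / tfbs_variants

# Extrapolated share of all SNPs that are likely functional.
functional_share = tfbs_fraction * deleterious_share

print(f"deleterious among TFBS variants: {deleterious_share:.1%}")
print(f"likely functional among all SNPs: {functional_share:.1%}")
```

This prints 29.2% and 3.8%, matching the numbers in the paragraph above.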

The Model-Free Approach: PhyloNet-SNP

All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBS’s in your genome of interest. Because you can’t always count on that, they developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that uses each SNP plus 20 bp of flanking sequence in each direction as a query sequence, in order to identify SNPs that fall within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.
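To make the query construction concrete, here is a minimal sketch of pulling a SNP and its 20 bp flanks out of a sequence. This covers only the windowing step, not the PhyloNet-SNP algorithm itself, which I haven't seen. The function name, the 0-based coordinates, and the choice to clip windows at sequence ends are all my own assumptions.

```python
def snp_query_window(sequence, snp_index, flank=20):
    """Return the SNP base plus up to `flank` bases on each side.

    `sequence` is a chromosome or contig as a plain string and `snp_index`
    is the 0-based position of the SNP. Windows are clipped at the ends
    of the sequence rather than padded (an assumption for this sketch).
    """
    if not 0 <= snp_index < len(sequence):
        raise IndexError("SNP position lies outside the sequence")
    start = max(0, snp_index - flank)
    end = min(len(sequence), snp_index + flank + 1)
    return sequence[start:end]


# Toy example: a made-up 60 bp sequence with a SNP at position 30.
toy = "ACGT" * 15
window = snp_query_window(toy, 30)
print(len(window))  # 20 bp + SNP + 20 bp = 41
```

Each such 41 bp window would then be the query searched against the genome for conserved, multi-copy matches.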

The Experimental Backup: Allele-specific Expression

The brief wet-lab portion of the thesis work was an allele-specific expression experiment in which the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly chosen SNPs, about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a roughly three-fold enrichment of SNPs that affect gene expression.

At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.


The Genome that Won A Nobel Prize

April 18, 2008

My group met recently to discuss the in-press-at-Nature publication of Jim Watson’s genome – the first diploid human genome to be sequenced with next-generation technology. I’ve been waiting for this since 454 announced the project’s completion at the HGM2007 meeting last year in Montreal. It’s a landmark publication in terms of human genetic variation, and of particular interest to me since I work on our center’s 454 analysis pipeline.

[Image: Watson and Crick]

In two months Roche/454 generated ~106.5 million genomic reads from Watson’s DNA in 234 runs. Using BLAT they mapped 93.2 million reads (87.5%) to NCBI build 36 (hg18), yielding an average coverage of about 7.4x. No doubt the expense of this effort was substantial, though the authors claim it was 1/100th of what capillary sequencing would have cost. It probably also hurt to throw away 2.5 million “unmapped” reads, though they did some post-processing of these with interesting results.

After a few filters were applied, the authors produced a set of 3.32 million SNPs in Watson’s genome, a number deliciously comparable to Craig Venter’s 3.47 million SNPs. In both men >80% of the SNPs are already known (to dbSNP). The most recent build of dbSNP (build 128), which doesn’t yet include novel Watson/Venter SNPs, has 9.89 million SNPs. The authors didn’t say but I estimate that the men share about 300,000 novel SNPs. Together they’ll add about 10% to the set of known SNPs, and only 1-2% of nonsynonymous SNPs. I hate to break it to you, but the sun is setting for nsSNPs. We know about 95% of them already and in Jim Watson only 7% are likely to be deleterious.
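For anyone who wants to check my back-of-the-envelope numbers, here is the arithmetic as a short Python sketch. The ~300,000 shared novel SNPs is my own estimate from above, and because the paper says ">80%" are known, the novel counts here are upper bounds.

```python
watson_snps = 3.32e6
venter_snps = 3.47e6
known_fraction = 0.80      # ">80%" in the text, so novel counts are upper bounds
shared_novel = 300_000     # my own estimate, not a figure from the paper
dbsnp_size = 9.89e6        # dbSNP build 128

# Novel SNPs per genome (not yet in dbSNP).
watson_novel = watson_snps * (1 - known_fraction)
venter_novel = venter_snps * (1 - known_fraction)

# Count the shared novel SNPs only once.
combined_novel = watson_novel + venter_novel - shared_novel
increase = combined_novel / dbsnp_size

print(f"combined novel SNPs: {combined_novel:,.0f}")
print(f"increase over dbSNP: {increase:.1%}")
```

That works out to roughly 1.06 million novel SNPs, or just under 11% of dbSNP at most, which is where my "about 10%" comes from.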

Also, over at GeneticFuture, Daniel MacArthur discusses how the Watson genome may be gloomy news for the field of personal genomics. He points out that we’re perhaps five years away from affordable whole-genome sequencing, and by then we will no doubt have a much better understanding of how functional variation affects human phenotypes.

Indels are why I love 454 technology. In Watson’s genome they identified >200,000 indels of at least 2bp. Insertion detection is limited by read length, and so most were <200 bp. The largest deletion, however, was nearly 40 kbp. Only a fraction of the indels (~350) affected coding sequence. They saw a validation rate of 70% for a sampling of coding indels between 2 and 50 bp, which is pretty good. Single-base indels were treated with extreme caution, as over 80% of these were associated with homopolymers, the Achilles heel of 454 sequencing.
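Flagging homopolymer-associated indels is easy to do mechanically. Below is a minimal sketch that measures the run of identical bases around a given position. The function name and the toy sequence are mine, and I don't know what criteria the authors actually used to call an indel "associated with" a homopolymer.

```python
def homopolymer_run_length(sequence, index):
    """Length of the run of identical bases that contains `index` (0-based)."""
    base = sequence[index]
    left = index
    while left > 0 and sequence[left - 1] == base:
        left -= 1
    right = index
    while right + 1 < len(sequence) and sequence[right + 1] == base:
        right += 1
    return right - left + 1


seq = "ACGTTTTTTGCA"
print(homopolymer_run_length(seq, 5))  # position 5 sits in the run of six Ts
print(homopolymer_run_length(seq, 1))  # an isolated C, run length 1
```

A single-base indel call at a position with a long run (say, four or more, a threshold I'm picking arbitrarily) would be the kind to treat with extra suspicion.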

This paper was worth the wait. Not only was it an impressive demonstration of the power of 454 sequencing for whole-genome sequencing, but it openly addressed many of the informatics challenges therein and answered some interesting questions along the way. We can now confidently say that an individual carries ~3.7 million SNPs relative to the reference sequence, of which perhaps 10,000 are protein-altering. Ten of Watson’s nsSNPs were Mendelian-recessive, highly penetrant, disease-causing alleles according to HGMD, suggesting that each of us carries many more deleterious alleles than was previously believed. Yet analysis of the unplaced 454 reads suggests that as many as 100 protein-coding genes are still absent from the reference sequence. It seems like the work on the human genome is never done. I certainly know the feeling.


Drowning in the Flood of Next-Gen Data

April 18, 2008

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity - servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did You?

I submit about 1,000 small jobs and notice they’re all pending:

[Image: The 62,000 job backlog]

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core across our 1,600 cores, if every job takes the full 15 minutes, it looks like my jobs will start in about 9 hours. Oy.
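Here is that estimate as a quick Python sketch. It assumes the whole cluster is working through the backlog at one job per core and that every job runs for the full 15 minutes, so it is a worst case for the “short” queue rather than a prediction.

```python
jobs_ahead = 61_000      # jobs already queued in front of mine
cores = 1_600            # one job per core
minutes_per_job = 15     # upper limit for the "short" queue

# Total work ahead of me, spread evenly over all cores.
wait_minutes = jobs_ahead / cores * minutes_per_job
wait_hours = wait_minutes / 60

print(f"estimated wait: about {wait_hours:.1f} hours")
```

This prints about 9.5 hours. In practice other queues share the same cores and most short jobs finish early, so the real wait could swing either way.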

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.


Lung Cancer: The Big Picture

April 10, 2008

Yesterday was our GSC-wide lab meeting, a quarterly event that crams 400+ people into an undersized auditorium. The guest speaker was Ramaswamy Govindan (MD), a medical oncologist from the Siteman Cancer Center. He gave a great 20-minute talk about lung cancer. He could easily have spoken to us for two hours on this topic, but alas, time is short. One topic he discussed that’s very germane to our work is the EGFR cancer pathway, which may account for 10% or more of lung cancers. Interestingly, Asians, women, and NON-smokers are far more likely to have EGFR mutations (there was a fourth risk factor that I didn’t have time to write down).

From what I understand, EGFR encodes the epidermal growth factor receptor and is expressed in normal tissues only during development. There is a drug called gefitinib that inhibits EGF receptors - a simple tablet that’s taken orally. Dr. Govindan showed some superior-view images of the chest cavity of a lung cancer patient before and after gefitinib therapy. The difference was amazing. Before treatment one lung was completely cancerous, and 2 years later (after treatment) the cancer was totally gone. It was a compelling example of what the future of cancer therapy might look like.

Speaking of which, the speaker went on to talk a bit about pharmacogenetics - the study of how genetic differences affect response to treatment among individuals. Evidently, classifying lung cancer patients by EGFR mutation status is extremely effective in predicting the outcome of chemotherapy. Patients with EGFR mutations respond well, while other patients see little or no benefit. Worse, some patients might have a toxic response to the drug. The ability to identify responders, non-responders, and toxic-responders by genotyping or gene expression profiling is perhaps one of the most important goals of cancer genetics.


Still Waiting for that ABI SOLiD Genome

April 8, 2008

One of the big announcements at this year’s AGBT was ABI’s sequencing of a complete human genome using the SOLiD system. It wasn’t just any genome, either - it was the genome of an African male of the Yoruba tribe in Nigeria (one of the HapMap samples). Perhaps I should be unsurprised that the press releases flew months ago but we’ve yet to see the peer-reviewed publication. Yet I’m eager to read the results of their project, as it will be the first complete genome sequencing of an individual from the African continent. Many studies have seen higher incidence and allele frequencies of SNPs in African samples, consistent with population bottlenecks during out-of-Africa expansions. In fact, a recent genome-wide survey of genetic variation in 51 populations showed that humans formed a chain of colonies as they migrated out of Africa some 10,000 years ago. That article’s a very interesting read.

But back to ABI. Perusing the SOLiD web site, I did find a poster on the genome-wide variation detected from their not-yet-completed SOLiD sequencing. From it I took these key pieces of information. They sequenced both fragment and mate-pair libraries to a coverage of about 4.9X. The mate-pair libraries allowed them to detect ~22,000 insertions and ~45,000 deletions, nearly all of which were heterozygous. At ~4X coverage on chromosome 7, some 75% of the SNPs detected were already in dbSNP. In the ENCODE regions (which have been extensively characterized), 91% of the SNPs detected were in dbSNP. To me, the fraction of novel SNPs seems low, but if it remains constant, this study will almost certainly add more SNPs to public databases than the Watson and Venter efforts.


Helicos Resequences M13 Virus Genome

April 7, 2008

The April 4th issue of Science had an article by Helicos BioSciences in which they described the single-molecule DNA sequencing of a viral genome. I knew about Helicos because they came and gave a talk to our Genetics department describing their planned strategy to develop a method for single-molecule sequencing. As I recall, the talk was entirely theoretical as they didn’t have much experimental data to show. Clearly things have gone well for Helicos, since their article convincingly demonstrates the potential of single-molecule sequencing for high-throughput, low-cost sequencing.

Introduction: The Problems with PCR

Why bother with single molecule sequencing? The introduction briefly discussed three problems associated with PCR-based sequencing.

  1. Bias in template representation. Due to thermodynamics and other factors I don’t well understand, PCR efficiency is directly affected by characteristics of the template. Shorter products, for example, are more efficient to amplify than longer products.
  2. Library preparation complications. PCR-based sequencing methods require a lot of templates, and preparation of the libraries can be “onerous and expensive in terms of DNA manipulation,” according to the article. I don’t do library prep myself, but this sounds reasonable.
  3. Error incorporation. Here is something that I do know about. Any time you use PCR, there’s a chance that mis-incorporation at an early cycle will introduce (and then amplify) errors in the sequence. We’ve seen some problems with 454 and Solexa sequencing that may be attributed to this. The idea of taking PCR-induced errors out of sequence reads appeals to me very much.

Results: Sequencing-by-synthesis of the M13 Viral Genome

The authors report sequencing the ~7 kbp M13 genome with 100% coverage and at an average depth of 150X. The read lengths averaged 23-27 bp, depending on the run and some post-processing; the authors claim to have performed runs with average read lengths of over 30 bp. According to alignment statistics in Table 1, there were 32,473 forward-orientation reads (relative to the reference) for an average coverage of 96X, and 34,109 reverse-orientation reads for an average coverage of 105X. Coverage in both orientations becomes important during their mutation-detection simulations.

Simulations of Mutation Detection

Because they sequenced the canonical strain of M13, there should be no sequence polymorphisms. So, to test the ability of this sequencing method to pick up mutations, the authors created “synthetic mutations” in the reference sequence and re-performed alignments. The synthetically introduced mutations were picked up with an average sensitivity of ~98%. To me, this was the weaker part of the paper - mutations created in silico won’t accurately represent real variation, but at least it let the authors discuss analysis and refinement steps that led to improved mutation detection.
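For readers who haven't seen this kind of in silico test, here is a minimal Python sketch of the general idea: mutate a reference at known positions, then score how many of those positions a caller recovers. It handles substitutions only, the function names are mine, and the “caller” at the end is faked for illustration. I don't know the details of the procedure Helicos actually used.

```python
import random


def introduce_substitutions(reference, n_mutations, seed=0):
    """Return a mutated copy of `reference` and the list of changes made.

    Each change is (position, original_base, new_base), with positions
    chosen at random without replacement.
    """
    rng = random.Random(seed)
    bases = list(reference)
    changes = []
    for pos in sorted(rng.sample(range(len(bases)), n_mutations)):
        original = bases[pos]
        new_base = rng.choice([b for b in "ACGT" if b != original])
        bases[pos] = new_base
        changes.append((pos, original, new_base))
    return "".join(bases), changes


def sensitivity(true_positions, called_positions):
    """Fraction of introduced mutations that were recovered."""
    true_positions = set(true_positions)
    if not true_positions:
        raise ValueError("no mutations were introduced")
    return len(true_positions & set(called_positions)) / len(true_positions)


# Toy run on a random 1 kb "reference"; a real test would use the M13 sequence.
rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(1000))
mutated, changes = introduce_substitutions(reference, 50)

# Pretend a variant caller found all but one of the introduced mutations.
truth = [pos for pos, _, _ in changes]
called = truth[:-1]
print(f"sensitivity: {sensitivity(truth, called):.0%}")
```

With 49 of 50 synthetic mutations "called", this prints a sensitivity of 98%. The weakness I mentioned applies here too: uniformly random substitutions are a poor stand-in for real variation.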

Discussion: Caveats and Future Directions

I don’t think Helicos is yet a threat to established next-generation platforms like Roche/454 and Illumina/Solexa. At 25 bp, the reads are too short to be useful in eukaryotes. Like 454, the Helicos platform has some difficulties with homopolymers, especially runs of cytosine residues. The authors readily admit that “large genomes, heterogeneous samples, and genomic structural variations will likely require longer reads, reduced homopolymer run through, and enhanced alignment tools.”

Yet this publication is an important proof-of-principle for the Helicos method. As far as single-molecule DNA sequencing goes, it looks like Helicos BioSciences is the one to beat.


Genome-Wide Association Failures

April 1, 2008

There was an interesting post over at GeneticFuture on why genome-wide association studies fail. It’s a good discussion of the many challenges that still face GWAS even in the era of high-throughput SNP genotyping.

It should be noted that there have been many successful genome-wide association studies, especially since the completion of the International HapMap Project (phases I/II). Last year saw high-profile publication of GWAS’s for coronary heart disease, breast cancer, celiac disease, type 1 diabetes and Crohn’s disease, just to name a few. deCODE Genetics performed a large-scale study on the genetics underlying exfoliation glaucoma, and found that individuals with two particular SNPs in the first exon of LOXL1 had a 100X greater chance of getting the disease.

Last June the Wellcome Trust Case Control Consortium published the largest study ever of genetics behind common diseases. In a massive cohort of 17,000 samples, the researchers performed GWAS’s for diabetes, rheumatoid arthritis, cardiac disease, and other common, complex phenotypes. Perhaps the most exciting result of this study was the association of several genes that had never before been implicated in human disease.

Yet, as the GeneticFuture post pointed out, we rarely hear about the failure of genome-wide association studies to turn up such interesting discoveries. The complexities of small allelic effects, population structure, rare variants, and copy number variation may explain how such failures manifest in the realm of genetics. As for epigenetic factors and disease heterogeneity, well, these issues are out of our hands for the time being.

As far as SNPs go, I believe we’re getting very close to a complete catalog of variation that’s common in human populations. Genome-wide sequencing of two individual human genomes each found ~600,000 SNPs that are not already in dbSNP. At best they’d increase the number of known SNPs by ~10%. At ~10-11 million SNPs, dbSNP is mostly complete in my opinion. We still have a long way to go, though, in cataloging copy number variation.

Another challenge not mentioned in the GeneticFuture post, but nevertheless important, is the fact that a substantial fraction of the genetic variation underlying complex disease occurs outside the coding regions of known genes. It’s time to look beyond nonsynonymous coding SNPs, people. But that’s a post for another day.