On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.
Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:
- The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that changes in the regulation of gene expression, rather than differences in protein-coding sequence, drove phenotypic divergence.
- The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
- The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.
Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and so they should be conserved between genomes that share common ancestry. Comparative genomics can therefore guide us to functional variants, since SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and, sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.
Because deep sequence conservation approaches might not work for noncoding SNPs, Scott and his colleagues focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBSs). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate (29% of 13%), their approach suggests that about 3.8% of the SNPs between closely related yeast species are likely to be functional.
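As a back-of-the-envelope check on that extrapolation, here is a minimal Python sketch. The three input numbers are the ones quoted above; the implied total number of variant positions is my own inference from the 13% figure, not something reported in the talk, and the function name is purely illustrative.

```python
# Figures quoted from the talk.
TFBS_VARIANTS = 2106   # variant positions inside conserved TFBSs
TFBS_SHARE = 0.13      # those positions as a share of all variant positions
DELETERIOUS = 615      # TFBS variants called deleterious by the model


def extrapolate(tfbs_variants, tfbs_share, deleterious):
    """Derive the implied totals and fractions from the quoted figures."""
    # Assumption: the total is inferred from the rounded 13% figure.
    total_variants = tfbs_variants / tfbs_share
    return {
        "total_variants": total_variants,
        "deleterious_within_tfbs": deleterious / tfbs_variants,
        "deleterious_overall": deleterious / total_variants,
    }


result = extrapolate(TFBS_VARIANTS, TFBS_SHARE, DELETERIOUS)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

The overall fraction comes out just under 4%, which matches the estimate above to within the rounding of the quoted percentages.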
The Model-Free Approach: PhyloNet-SNP
All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBSs in your genome of interest. Because you can’t always count on that, the lab developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that takes each SNP plus 20 bp of flanking sequence on either side as a query sequence, and uses it to identify SNPs that fall within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.
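To make the query construction concrete, here is a small Python sketch of just that first step: cutting out a SNP and its 20 bp of flanking sequence on each side. This is not the PhyloNet-SNP implementation, and the search for multi-copy conserved elements is not shown. The function name, the 0-based coordinate convention, and the choice to truncate windows at the ends of the sequence are all my own assumptions.

```python
def snp_query_window(sequence: str, snp_pos: int, flank: int = 20) -> str:
    """Return the SNP base plus up to `flank` bases on each side.

    `snp_pos` is a 0-based index into `sequence` (an assumption of this
    sketch). Windows near the sequence ends are truncated, not padded.
    """
    if not 0 <= snp_pos < len(sequence):
        raise ValueError("snp_pos is outside the sequence")
    start = max(0, snp_pos - flank)
    end = min(len(sequence), snp_pos + flank + 1)  # +1 to include the SNP itself
    return sequence[start:end]


# Toy 60 bp sequence, made up for illustration only.
toy_sequence = "ACGT" * 15
window = snp_query_window(toy_sequence, snp_pos=30)
print(len(window), window)
```

With 20 bp on either side, a full query is 41 bp long; each such window would then be handed to whatever conserved-element search is being used.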
The Experimental Backup: Allele-specific Expression
The brief wet-lab portion of the thesis work was an allele-specific expression experiment in which the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly chosen SNPs, about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a roughly three-fold enrichment of SNPs that affect gene expression.
At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.