MassGenomics is Moving!

May 21, 2008

Hello all,

MassGenomics has grown enough to become its own site.  Please update your bookmarks and RSS feeds:


Entries RSS:

Comments RSS:

Short Read Aligners: Maq, Eland, and Others

May 14, 2008

This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal second-hand impressions of these tools, and should not be considered an endorsement or criticism by the WashU GC.

Short-Read Data Sets at the WashU Genome Center

One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:

Metric                                   Maq 0.6.3    Maq 0.6.5
Average alignment time for normal runs   17.7 hours   9.1 hours
Max alignment time for a normal run      240 hours    28.8 hours
Total number of jobs                     2168         1467
Jobs that took longer than 1 day         443          3

The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.

Heng Li’s Comparison of Short-Read Aligners

For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:

  • Eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.
  • RMAP: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq -se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.
  • SOAP: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.
  • SHRiMP: Actually I was not expecting that a program using seeding +Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP’s normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to “nan” normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.
  • Maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq’s huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.

More Comments from Heng Li

Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to Heng’s evaluation, the current version of the recently-published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength; the alignment error rate from Maq for paired-end data is significantly reduced. Heng also mentioned that the upcoming new release of Eland will support longer read lengths (>32 bp) and will also calculate mapping quality scores.

Unbiased Comparisons of Short-Read Aligners

In summary, there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison on tools like these because usually, it’s the developers of one of the tools that publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read alignment contest? They generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then, the developers of each aligner are invited to throw their best efforts at it. Every group submits the results to a DCC, which analyzes the results in a simple, unbiased way: the number of reads placed correctly or incorrectly, and the number of SNPs/indels detected, missed, or falsely called. The results are published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Until that happens, however, I think Heng’s analysis is about as unbiased as can be.


WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.

Landmark Study in Structural Variation

May 1, 2008

At last, some results from Evan Eichler’s SV project! The results of the first phase of the “Human Genome Structural Variation Project” were presented in today’s issue of Nature. I’ve been cognizant of this project for a couple of years and eager for the results, as it is really the first large-scale, sequence-based study of copy number and structural variants. As it happens, our Genome Center played a big role in the sequencing, and two of our researchers (Tina Graves and Rick Wilson) are among the authors.

In fairness, however, I should disclose another thing about the Evan Eichler project.

In 2006, just after three simultaneous papers in Nature Genetics brought structural variation to the forefront, my former lab began working on a grant proposal. In it, we proposed to mine existing trace data (from NCBI’s Trace Archive) for putative structural variants in the human genome. We were developing a sequence-based approach to identify reads spanning insertions, deletions, duplications, inversions, and translocations. It was an ambitious project and the timing was perfect, but, unfortunately, NIH sent our proposal back twice. Unscored. Later, I learned that a group headed by Evan Eichler pretty much locked up U.S. funding for this research in the form of a $40 million grant.  With all of the NIH eggs in a single basket, I thought to myself, they had better deliver.

It looks like I won’t be disappointed.  Dr. Eichler and colleagues constructed whole-genome libraries of ~1 million fosmids for each of 8 individuals whose samples were used in the HapMap Project.  Four were African, two were CEPH European, one was Japanese, and one Han Chinese.  They used the fosmid end sequencing approach (described by Tuzun et al, 2004) in which you sequence the ends of each fosmid and map the sequences to the reference genome.  Altogether, about 6.1 million end-sequence pairs (ESPs) were uniquely mapped to the genome.  Of these, some 76,767 (~1.26%) were discordant by alignment distance or orientation, indicating a possible underlying structural variant.  It’s a big paper (the Supplemental File alone was 57 pages) so I’ll hit you with the take-homes:

  1. The human genome is still incomplete.  The fosmid approach yielded a number of novel sequences not present in the current human genome assembly.  The sequences range in size from 2 to 130 kbp, are randomly distributed among genic/nongenic regions, and are often (40% of the time) copy-number variable.  There’s still more human genome out there, folks.
  2. Current CNV databases are inflated.  The higher resolution of sequence-based approaches served to refine the locations of many previously-reported CNVs.  Generally speaking, CNVs on a haplotype are smaller (in bp) than copy-number-variable regions as a whole, meaning that fewer genes are affected.  Array-CGH platforms, which to date are among the most widely-used approaches to assess CNV, greatly exaggerate the size of these variants.  And fewer than half of the SVs in this study can be adequately genotyped with existing platforms.
  3. NAHR is the driving force behind most structural variation.  About half of the SVs detected in this study were flanked by segmental duplications.  This is nothing new, but it somewhat contradicts the study by Korbel and colleagues last year, which used 454 paired-end sequencing and reported that NAHR-mediated events were rare.  Eichler et al. suggest that the association of NAHR with segmental duplications, together with the difficulty short sequence reads have in resolving such regions, might mean that next-generation ESP approaches are missing a lot of variation.  I’m not sure I agree, but that’s what they said.

Overall it seems like an impressive publication.  They planned and executed a very careful study, and as a result, we learned quite a bit about the landscape of structural variation in the human genome.  Not bad, Dr. Eichler.

Happy DNA Day!

April 25, 2008


April 25th is National DNA Day in the U.S., an occasion that commemorates the discovery of DNA’s structure by Watson and Crick (published in Nature on April 25th, 1953) and the completion of the Human Genome Project fifty years later, in April 2003. NHGRI has set up a web site for DNA day featuring a welcome video from Francis Collins and a live chatroom where anyone can chat with “leading researchers in the field of genetics.” Strangely I haven’t yet gotten the call.

The WashU Genome Center outreach office is also sponsoring a number of activities. A number of DNA Day Ambassadors are visiting local high schools in St. Louis. A symposium on the medical campus will have student poster sessions and a seminar, “The Human Genome Sequence: A Foundation for Biological Inquiry”, given by our co-director Elaine Mardis.

By chance I happened to Google the King and Wilson 1975 paper the other day, and came across a very interesting site of Landmarks in the History of Genetics. While not up-to-date, it’s a nice story of key events in DNA’s history (and their implications) since 1745. Of course there’s Darwin’s publication of The Origin of Species in 1859, and Mendel’s Experiments in Plant Hybridisation just six years later. I recognize the name of Erwin Chargaff, whose insight (1950) was that the relative incidence of A, C, G, and T nucleotides is not random, but perhaps a kind of code. Two years later came the Hershey and Chase experiments, which showed that viruses infect host cells by injecting their DNA, while the proteins generally remain outside the cell. Of course we know Watson and Crick (1953), as well as King and Wilson (1975). How about Barbara McClintock, whose discovery of transposable elements in maize in the 1940’s was not fully recognized for decades?

That rather sounds like Mendel, doesn’t it? I wonder how many other important discoveries in biology have already been made but not yet appreciated. It’s something to think about. Happy DNA Day everyone!

Cis-regs and Functional Noncoding Variation

April 24, 2008

On Tuesday I attended a very interesting thesis defense by Scott Doniger, a student in Justin Fay’s lab. I admit, I was lured in by the thesis title, “Comparing and Contrasting Cis-regulatory Sequences to Identify Functional Noncoding Sequence Variation.” While I do not know Scott personally, I’m certainly familiar with Justin Fay’s work on positive and negative selection in the human genome. His paper, in fact, is the foundation of my work on signatures of natural selection and the SNPseek project.

Scott proved a confident and articulate speaker, and laid the groundwork for his thesis by presenting three convincing motivations for this work:

  1. The regulatory hypothesis of evolution. Despite the obvious phenotypic diversity of species on this planet, the DNA sequence diversity is surprisingly limited. More than twenty-five years before the completion of the human genome sequence, King and Wilson [1975] found that the chimpanzee and human genomes diverged by only 1.6%. From this seminal paper came the idea that regulation of gene expression, not differences in DNA sequence, drove phenotypic divergence.
  2. The functional relevance of noncoding sequences. Despite the traditional view that functional variants in humans alter protein-coding sequence, it is becoming clear that the genetics underlying many traits extend into noncoding DNA, particularly for complex phenotypes like disease susceptibility and drug response.
  3. The availability of numerous genome sequences. Draft genome sequences for at least 27 vertebrate species have been completed to date, and their availability has spurred wide interest in the field of comparative genomics.

Scott’s work is based on the reasonable premise that functional noncoding sequences are subject to purifying selection (fewer changes tolerated over time), and thus they should be conserved between genomes that share common ancestry. Thus, comparative genomics serves to guide us to functional variants, as SNPs in constrained positions are more likely to be deleterious. This works well for coding sequences in both humans and yeast (the Fay lab model organism). Scott looked at the 9 known quantitative trait nucleotides (QTNs) in yeast and sure enough, 8 of them were SNPs in highly conserved amino acid positions. Gravy.

Because deep sequence conservation approaches might not work for noncoding SNPs, they focused on a few closely related species of yeast, identifying 2,106 variant positions (13% of the total) that fell within conserved transcription factor binding sites (TFBS’s). Of those, 615 (29%) appear to be deleterious based on their conserved-nucleotide model. If I can extrapolate, by their approach about 3.8% of the SNPs between closely related yeast species are likely to be functional.
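As a sanity check on that extrapolation (the arithmetic here is mine, not Scott’s), you just multiply the fraction of SNPs falling in conserved TFBS’s by the fraction of those called deleterious:

```python
# Back-of-the-envelope check of the ~3.8% extrapolation.
# Input numbers are from the talk; the calculation is mine.
variants_in_tfbs = 2106        # variant positions inside conserved TFBS's
frac_in_tfbs = 0.13            # 13% of all SNPs fall in conserved TFBS's
deleterious_in_tfbs = 615      # called deleterious by the conserved-nucleotide model

frac_deleterious = deleterious_in_tfbs / variants_in_tfbs   # ~29%
frac_functional_overall = frac_in_tfbs * frac_deleterious
print(f"{frac_functional_overall:.1%}")  # prints "3.8%"
```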

The Model-Free Approach: PhyloNet-SNP

All of Scott’s work to this point relies on having good annotations of cis-regulatory TFBS’s in your genome of interest. Because you can’t always count on that, they developed a “model-free” approach to evaluating SNPs. With some help from Gary Stormo’s group, they devised an algorithm (PhyloNet-SNP) that uses each SNP plus 20 bp of flanking sequence on each side as a query sequence to identify those within multi-copy conserved elements of a genome. By this approach, ~15% of the SNPs in their model system were called as functional.
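To make the query construction concrete, here is a minimal sketch of the windowing step; the function name and details are my own illustration, not the actual PhyloNet-SNP code:

```python
def snp_query(chrom_seq: str, snp_pos: int, flank: int = 20) -> str:
    """Return the SNP position plus `flank` bp on each side (41 bp total
    for the default flank), clipped at the chromosome ends."""
    start = max(0, snp_pos - flank)
    end = min(len(chrom_seq), snp_pos + flank + 1)
    return chrom_seq[start:end]

# Toy example: a SNP at position 25 of a 60-bp sequence.
seq = "ACGT" * 15
query = snp_query(seq, 25)
print(len(query))  # prints 41
```

Each such query is then searched against the genome to see whether it falls within a multi-copy conserved element.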

The Experimental Backup: Allele-specific Expression

The brief wet-lab portion of the thesis work was an allele-specific expression experiment where the ability of SNPs to alter gene expression levels was evaluated in vivo. Among randomly-chosen SNPs about 8% had a regulatory effect. However, using sequence conservation and/or PhyloNet-SNP to select SNPs brought this up to 25%, suggesting that the conservation approach yields a three-fold enrichment of SNPs that affect gene expression.

At the conclusion, Scott admitted that while comparative genomics does help identify functional sequences and variation, it doesn’t explain everything. Indeed, recent findings from the ENCODE project cast doubt on whether many conserved noncoding sequences are important at all. Yet until we have a better understanding of the dark matter of the human genome, using sequence conservation to identify SNPs of interest seems like a good way to go.

The Genome that Won A Nobel Prize

April 18, 2008

My group met recently to discuss the in-press-at-Nature publication of Jim Watson’s genome – the first diploid human genome to be sequenced with next-generation technology. I’ve been waiting for this since 454 announced the project’s completion at the HGM2007 meeting last year in Montreal. It’s a landmark publication in terms of human genetic variation, and of particular interest to me since I work on our center’s 454 analysis pipeline.


In two months Roche/454 generated ~106.5 million genomic reads from Watson’s DNA in 234 runs. Using BLAT they mapped 93.2 million reads (87.5%) to hg36, yielding an average coverage of about 7.4x. No doubt the expense of this effort was substantial, though the authors claim it was 1/100th of what capillary sequencing would have cost. It probably also hurt to throw away 2.5 million “unmapped” reads, though they did some post-processing of these with interesting results.
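The ~7.4x figure is easy to sanity-check. Assuming an average 454 read length of roughly 250 bp and a ~3.1 Gbp genome (both of these are my assumptions, not numbers from the paper):

```python
mapped_reads = 93.2e6   # reads mapped to the reference
read_length = 250       # assumed average 454 read length (bp)
genome_size = 3.1e9     # assumed haploid human genome size (bp)

# Coverage = total mapped bases / genome size
coverage = mapped_reads * read_length / genome_size
print(f"{coverage:.1f}x")  # prints "7.5x", consistent with the reported ~7.4x
```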

After a few filters were applied, the authors produced a set of 3.32 million SNPs in Watson’s genome, a number deliciously comparable to Craig Venter’s 3.47 million SNPs. In both men, >80% of the SNPs are already known (to dbSNP). The most recent build of dbSNP (build 128), which doesn’t yet include novel Watson/Venter SNPs, has 9.89 million SNPs. The authors didn’t say, but I estimate that the men share about 300,000 novel SNPs. Together they’ll add about 10% to the set of known SNPs, but only 1-2% to the set of known nonsynonymous SNPs. I hate to break it to you, but the sun is setting for nsSNPs. We know about 95% of them already, and in Jim Watson only 7% are likely to be deleterious.

Also, over at Genetic Future, Daniel MacArthur discusses how the Watson Genome may be gloomy news for the field of personal genomics.  He points out that we’re perhaps five years away from affordable whole-genome sequencing, and by then we will no doubt have a much better understanding of how functional variation affects human phenotypes.

Indels are why I love 454 technology. In Watson’s genome they identified >200,000 indels of at least 2bp. Insertion detection is limited by read length, and so most were <200 bp. The largest deletion, however, was nearly 40 kbp. Only a fraction of the indels (~350) affected coding sequence. They saw a validation rate of 70% for a sampling of coding indels between 2 and 50 bp, which is pretty good. Single-base indels were treated with extreme caution, as over 80% of these were associated with homopolymers, the Achilles heel of 454 sequencing.

This paper was worth the wait. Not only was it an impressive demonstration of the power of 454 sequencing for whole-genome sequencing, but it openly addressed many of the informatics challenges therein and answered some interesting questions along the way. We can now confidently say that an individual carries ~3.7 million SNPs relative to the reference sequence, of which perhaps 10,000 are protein-altering. Ten of Watson’s nsSNPs were Mendelian-recessive, highly penetrant, disease-causing alleles according to HGMD, suggesting that each of us carries many more deleterious alleles than was previously believed. Yet analysis of the unplaced 454 reads suggests that as many as 100 protein-coding genes are still absent from the reference sequence. It seems like the work on the human genome is never done. I certainly know the feeling.

Drowning in the Flood of Next-Gen Data

April 18, 2008

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity – servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did you?

I submit about 1,000 small jobs and notice they’re all pending:

The 62,000 job backlog

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core, if every job finishes in 15 minutes, it looks like my jobs will start in about 9 hours. Oy.

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.

Lung Cancer: The Big Picture

April 10, 2008

Yesterday was our GSC-wide lab meeting, a quarterly event that crams 400+ people into an undersized auditorium. The guest speaker was Ramaswamy Govindan (MD), a medical oncologist from the Siteman Cancer Center. He gave a great 20-minute talk about lung cancer. He could easily have spoken to us for two hours on this topic, but alas, time is short. One topic he discussed that’s very germane to our work is the EGFR cancer pathway, which may account for 10% or more of lung cancers. Interestingly Asians, women, and NON-smokers are far more likely to have EGFR mutations (there was a fourth risk factor that I didn’t have time to write down).

From what I understand, EGFR encodes an epidermal growth factor receptor that is expressed in normal tissues only during development. There is a drug called gefitinib that inhibits EGF receptors – a simple tablet that’s taken orally. Dr. Govindan showed some superior-view images of the chest cavity of a lung cancer patient before and after gefitinib therapy. The difference was amazing. Before treatment one lung was completely cancerous, and 2 years later (after treatment) the cancer was totally gone. It was a compelling example of what the future of cancer therapy might look like.

Speaking of which, the speaker went on to talk a bit about pharmacogenetics – the study of how genetics affect differential response to treatment among individuals. Evidently, classifying lung cancer patients by EGFR mutation status is extremely effective in predicting the outcome of chemotherapy. Patients with EGFR mutations respond well, while other patients see little or no benefit. Worse, some patients might have a toxic response to the drug. The ability to identify responders, non-responders, and toxic-responders by genotyping or gene expression profiling is perhaps one of the most important goals of cancer genetics.

Still Waiting for that ABI SOLiD Genome

April 8, 2008

One of the big announcements at this year’s AGBT was ABI’s sequencing of a complete human genome using the SOLiD system. It wasn’t just any genome, either – it was the genome of an African male of the Yoruba tribe in Nigeria (one of the HapMap samples). Perhaps I should be unsurprised that the press releases flew months ago but we’ve yet to see the peer-reviewed publication. Yet I’m eager to read the results of their project, as it will be the first complete genome sequencing of an individual from the African continent. Many studies have seen higher incidence and allele frequencies of SNPs in African samples, consistent with population bottlenecks during out-of-Africa expansions. In fact, a recent genome-wide survey of genetic variation in 51 populations showed that humans formed a chain of colonies as they migrated out of Africa tens of thousands of years ago. That article’s a very interesting read.

But back to ABI. Perusing the SOLiD web site, I did find a poster on the genome-wide variation detected from their not-yet-completed SOLiD sequencing. From it I took these key pieces of information. They sequenced both fragment and mate-pair libraries to a coverage of about 4.9X. The mate-pair libraries allowed them to detect ~22,000 insertions and ~45,000 deletions, nearly all of which were heterozygous. At ~4X coverage on chromosome 7, some 75% of the SNPs detected were already in dbSNP. In the ENCODE regions (which have been extensively characterized), 91% of the SNPs detected were in dbSNP. To me, the fraction of novel SNPs seems low, but if it remains constant, this study will almost certainly add more SNPs to public databases than the Watson and Venter efforts.

Helicos Resequences M13 Virus Genome

April 7, 2008

The April 4th issue of Science had an article by Helicos BioSciences in which they described the single-molecule DNA sequencing of a viral genome. I knew about Helicos because they came and gave a talk to our Genetics department describing their planned strategy to develop a method for single-molecule sequencing. As I recall, the talk was entirely theoretical as they didn’t have much experimental data to show. Clearly things have gone well for Helicos, since their article convincingly demonstrates the potential of single-molecule sequencing for high-throughput, low-cost sequencing.

Introduction: The Problems with PCR

Why bother with single molecule sequencing? The introduction briefly discussed three problems associated with PCR-based sequencing.

  1. Bias in template representation. Due to thermodynamics and other factors I don’t well understand, PCR efficiency is directly affected by characteristics of the template. Shorter products, for example, are more efficient to amplify than longer products.
  2. Library preparation complications. PCR-based sequencing methods require a lot of templates, and preparation of the libraries can be “onerous and expensive in terms of DNA manipulation,” according to the article. I don’t do library prep myself, but this sounds reasonable.
  3. Error incorporation. Here is something that I do know about. Any time you use PCR, there’s a chance that mis-incorporation at an early cycle will introduce (and then amplify) errors in the sequence. We’ve seen some problems with 454 and Solexa sequencing that may be attributed to this. The idea of taking PCR-induced errors out of sequence reads appeals to me very much.

Results: Sequencing-by-synthesis of the M13 Viral Genome

The authors report sequencing the ~7 kbp M13 genome with 100% coverage at an average depth of 150X. The read lengths averaged 23-27 bp, depending on the run and some post-processing; the authors claim to have performed runs with average read lengths of over 30 bp. According to alignment statistics in Table 1, there were 32,473 forward-orientation reads (relative to the reference) for an average coverage of 96X, and 34,109 reverse-orientation reads for an average coverage of 105X. Coverage in both orientations becomes important during their mutation-detection simulations.

Simulations of Mutation Detection

Because they sequenced the canonical strain of M13, there should be no sequence polymorphisms. So, to test the ability of this sequencing method to pick up mutations, the authors created “synthetic mutations” in the reference sequence and re-performed alignments. The synthetically-introduced mutations are picked up with an average sensitivity of ~98%. To me, this was the weaker part of the paper – mutations created in silico won’t accurately represent real variation, but at least it let the authors discuss analysis and refinement steps that led to improved mutation detection.
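For the curious, the in silico mutation idea can be sketched in a few lines; this is my own illustration of the approach, not the authors’ actual code:

```python
import random

def introduce_mutations(ref, n, seed=0):
    """Plant n random substitutions in a reference sequence and return
    the mutated sequence plus a {position: (old_base, new_base)} truth set."""
    rng = random.Random(seed)
    seq = list(ref)
    truth = {}
    for pos in rng.sample(range(len(seq)), n):  # n distinct positions
        old = seq[pos]
        new = rng.choice([b for b in "ACGT" if b != old])  # always differs
        seq[pos] = new
        truth[pos] = (old, new)
    return "".join(seq), truth

# Plant 10 synthetic mutations in a toy 1-kbp "reference".
mutated, truth = introduce_mutations("ACGT" * 250, 10)
print(len(truth))  # prints 10
```

Reads are then realigned against the mutated reference, and sensitivity is simply the fraction of the truth set recovered by the variant caller.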

Discussion: Caveats and Future Directions

I don’t think Helicos is yet a threat to established next-generation platforms like Roche/454 and Illumina/Solexa. At 25 bp, the reads are too short to be useful in eukaryotes. Like 454, the Helicos platform has some difficulties with homopolymers, especially runs of cytosine residues. The authors readily admit that “large genomes, heterogeneous samples, and genomic structural variations will likely require longer reads, reduced homopolymer run through, and enhanced alignment tools.”

Yet this publication is an important proof-of-principle for the Helicos method. As far as single-molecule DNA sequencing goes, it looks like Helicos Biosciences is the one to beat.