Short Read Aligners: Maq, Eland, and Others

May 14, 2008

This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. Also, this entry reflects my personal second-hand impressions of the different alignment tools, and should not be considered an endorsement or criticism of the different alignment tools by the WashU GC.

Short-Read Data Sets at the WashU Genome Center

One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:

Metric Maq 0.6.3 Maq 0.6.5
Average alignment time for normal runs 17.7 hours 9.1 hours
Max alignment time for a normal run 240 hours 28.8 hours
Total number of jobs 2168 1467
Jobs that took longer than 1 day 443 3

The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.

Heng Li’s Comparison of Short-Read Aligners

For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:

  • Eland: eland is definitely the fastest, much faster than all the competitors. What is more important, eland gives the number of alternative places, which makes it possible for you to get further information about the repetitive structures of the genome and to select reads that can be really confidently mapped. In addition, with the help of additional scripts, Eland IS able to map reads longer than 32bp. Eland is one of the best software I ever used. It would be even superior if Tony could make it easier to use for a user, like me, who wants to run eland independently of the GAPipeline.
  • RMAP: the strength of rmap is to use base qualities to improve the alignment accuracy. I believe it can produce better alignment than maq -se because maq trades accuracy for speed at this point (technically it is a bit hard to explain here). Nonetheless, I think rmap would be more popular if its authors could add support for fastq-like quality string which is now the standard in both Illumina and the Sanger Institute (although maybe not elsewhere). rmap supports longer reads, which is also a gain. Furthermore, I did learn a lot from its way to count the number of mismatches.
  • SOAP: soap is a versatile program. It supports iterative-trimmed alignment, long reads, gapped alignment, TAG alignment and PE mode. Its PE mode is easier to use than eland. In principle, soap and eland should give almost the same number of wrong alignments. However, soap gives 442 more wrong alignments. Further investigation shows that most of these 442 wrong ones are flagged as R? (repeat) by eland.
  • SHRiMP: Actually I was not expecting that a program using seeding +Smith-Waterman could be that fast. So far as I know, all the other software in the list do not do Smith-Waterman (maq does for PE data only), which is why they are fast. SHRiMP’s normodds score has similar meaning to mapping quality. Such score helps to determine whether an alignment is reliable. The most obvious advantage is SHRiMP can map long reads (454/capillary) with the standard gapped alignment. If you only work with small genomes, SHRiMP is a worthy choice. I think SHRiMP would be better if it could make use of paired end information; it would be even better if it could calculate mapping quality. The current normodds score helps but is not exactly the mapping quality. In addition, I also modified probcalc codes because in 1.04 underflow may occur to long reads and leads to “nan” normodds. However, although my revision fixes the underflow, it may lead to some inaccurate normodds.
  • Maq: at the moment maq is easier to use than eland. Supporting SNP calling is maq’s huge gain. Its paired end mode is also highly helpful to recover some repetitive regions. Maq’s random mapping, which is frequently misused by users who have not noticed mapping qualities, may be useful to some people, too, and at least it helps to call SNPs at the verge of repeats.

What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.

More Comments from Heng Li

Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to his evaluation, the current version of recently-published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength; the alignment error rate from Maq for paired-end data is significantly reduced. Heng also mentioned that the upcoming new release of Eland will support longer read lengths (>32 bp) and will also calculate mapping quality scores.

Unbiased Comparisons of Short-Read Aligners

In summary, there are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison on tools like these because usually, it’s the developers of one of the tools that publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read-aligning contest? They generate a few short-read data sets: real, simulated, with/without errors, with/without SNPs and indels, etc. Then, the developers of each aligner are invited to throw their best efforts at it. Every group submits the results to a DCC, which analyzes the results in a simple, unbiased way: # of reads placed correctly/incorrectly. # of SNPs/indels detected, missed, or false-positives. The results are published on a web site or in the literature for all to see. Yeah, I know, there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Unless that happens, however, I think Heng’s analysis is about as unbiased as can be.

Credits

WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.


Landmark Study in Structural Variation

May 1, 2008

At last, some results from Evan Eichler’s SV project! The results of the first phase of the “Human Genome Structural Variation Project” were presented in today’s issue of Nature. I’ve been cognizant of this project for a couple of years and eager for the results, as it is really the first large-scale, sequence-based study of copy number and structural variants. As it happens, our Genome Center played a big role in the sequencing, and two of our researchers (Tina Graves and Rick Wilson) are among the authors.

In fairness, however, I should disclose another thing about the Evan Eichler project.

In 2006, just after three simultaneous papers in Nature Genetics brought structural variation to the forefront, my former lab began working on a grant proposal. In it, we proposed to mine existing trace data (from NCBI’s Trace Archive) for putative structural variants in the human genome. We were developing a sequence-based approach to identify reads spanning insertions, deletions, duplications, inversions, and translocations. It was an ambitious project and the timing was perfect, but, unfortunately, NIH sent our proposal back twice. Unscored. Later, I learned that a group headed by Evan Eichler pretty much locked up U.S. funding for this research in the form of a $40 million grant.  With all of the NIH eggs in a single basket, I thought to myself, they had better deliver.

It looks like I won’t be disappointed.  Dr. Eichler and colleagues constructed whole-genome libraries of ~1 million fosmids for each of 8 individuals whose samples were used in the HapMap Project.  Four were African, two were CEPH European, one was Japanese, and one Han Chinese.  They used the fosmid end sequencing approach (described by Tuzun et al, 2004) in which you sequence the ends of each fosmid and map the sequences to the reference genome.  Altogether, about 6.1 million end-sequence pairs (ESPs) were uniquely mapped to the genome.  Of these some 76,767 (~1.26%) were discordant by alignment distance or orientation, indicating a possible underlying structural variant.  It’s a big paper (the Supplemental File alone was 57 pages) so I’ll hit you with the take-homes:

  1. The human genome is still incomplete.  The fosmid approach yielded a number of novel sequences not present in the current human genome assembly.  The sequences range in size from 2-130 kbp, were randomly distributed among genic/nongenic regions, and were often (40% of the time) copy-number variable.  There’s still more human genome out there, folks.
  2. Current CNV databases are inflated. The higher resolution of sequence-based approaches served to refine the location of many previously-reported CNVs.  Generally speaking, CNVs on a haplotype are smaller (in bp) than copy-number-variable regions as a whole, meaning that fewer genes are affected.  Array-cGH platforms, which to-date are among the most widely-used approaches to assess CNV, greatly exaggerate the size of these variants.  And less than half of the SVs in this study cannot adequately be genotyped with existing platforms.
  3. NAHR is the driving force behind most structural variation.  About half of the SVs detected in this study were flanked by segmental duplications.  This is nothing new, but it somewhat contradicts the study by Korbel and colleagues last year, which used 454 paired-end sequencing and reported that NAHR-mediated events were rare.  Eichler et. al.  suggest that the association of NAHR with segmental duplications and the difficulties of short sequence reads to resolve such regions might together mean that next-generation ESP approaches are missing a lot of variation.  I’m not sure I agree, but that’s what they said.

Overall it seems like an impressive publication.  They planned and executed a very careful study, and as a result, we learned quite a bit about the landscape of structural variation in the human genome.  Not bad, Dr. Eichler.


Happy DNA Day!

April 25, 2008

NHGRI

April 25th is National DNA Day in the U.S., an occasion that commemorates discovery of DNA’s structure by Watson and Crick (published in Nature on April 25th, 1953) and the completion of the Human Genome Project fifty years later in April 2003. NHGRI has set up a web site for DNA day featuring a welcome video from Francis Collins and a live chatroom where anyone can chat with “leading researchers in the field of genetics.” Strangely I haven’t yet gotten the call.

The WashU Genome Center outreach office is also sponsoring a number of activities. A number of DNA Day Ambassadors are visiting local high schools in St. Louis. A symposium on the medical campus will have student poster sessions and a seminar, “The Human Genome Sequence: A Foundation for Biological Inquiry”, given by our co-director Elaine Mardis.

By chance I happened to Google the King and Wilson 1975 paper the other day, and came across a very interesting site of Landmarks in the History of Genetics. While not up-to-date, it’s a nice story of key events in DNA’s history (and their implications) since 1745. Of course there’s Darwin’s publication of The Origin of Species in 1859, and Mendel’s Experiments in Plant Hybridisation just six years later. I recognize the name of Erwin Chargaff, whose insights (1950) into the relative incidence of A, C, G, and T nucleotides was not random, but perhaps a kind of code. Two years later came the Hershey and Chase experiments, which showed that viruses infect host cells by injecting their DNA, while the proteins generally remain outside the cell. Of course we know Watson and Crick (1953), as well as King and Wilson (1975). How about Barbara McClintock, whose discovery of transposable elements in maize in the 1940’s was not fully recognized for decades?

That rather sounds like Mendel, doesn’t it? I wonder, how many other important discoveries in biology have already been made, but not yet appreciated. It’s something to think about. Happy DNA Day everyone!


Drowning in the Flood of Next-Gen Data

April 18, 2008

Working at the WashU Genome Center, I expect to encounter datasets that are large even by bioinformatician standards. But as we transition from traditional 3730-based sequencing to next-generation platforms, I’m beginning to appreciate just how much additional infrastructure we’ll need to handle the data flow. In the Medical Genomics group we’re constantly pushing up against capacity - servers, disk space, and man hours. None of these are in adequate supply for what’s ahead.

This is not to say that we’re without resources. In fact, the infrastructure already in place is considerable. We have about 500 computational servers (1600 cores) and nearly a petabyte (1,000 terabytes) of disk space. There’s an LSF system through which we submit and monitor jobs on The Blades.

You Didn’t Need That Done TODAY, Did you?

I submit about 1,000 small jobs and notice they’re all pending:

The 62,000 job backlog

No doubt that’s because there are 61,000 jobs in front of me. We have a few different “queues” into which jobs can be submitted. The “short” queue is for jobs that execute in less than 15 minutes. At one job per core if every job finishes in 15 minutes, it looks like my jobs will start in about 9 hours. Oy.

The powers that be around here are rushing to build up our resources. As I’m not part of management, I can’t say for sure how long it will take to get the disk space and hardware we need. One thing I do know: we need a lot, and we need it soon.


Still Waiting for that ABI SOLiD Genome

April 8, 2008

One of the big announcements at this year’s AGBT was ABI’s sequencing of a complete human genome using the SOLiD system. It wasn’t just any genome, either - it was the genome of an African male of the Yoruba tribe in Nigeria (one of the HapMap samples). Perhaps I should be unsurprised that the press releases flew months ago but we’ve yet to see the peer-reviewed publication. Yet I’m eager to read the results of their project, as it will be the first complete genome sequencing of an individual from the African continent. Many studies have seen higher incidence and allele frequencies of SNPs in African samples, consistent with population bottlenecks during out-of-Africa expansions. In fact, a recent genome-wide survey of genetic variation in 51 populations showed that humans formed a chain of colonies as they migrated out of Africa some 10,000 years ago. That article’s a very interesting read.

But back to ABI. Perusing the SOLiD web site, I did find a poster on the genome-wide variation detected from their not-yet-completed SOLiD sequencing. From it I took these key pieces of information. They sequenced both fragment and mate-pair libraries to a coverage of about 4.9X. The mate-pair libraries allowed them to detect ~22,000 insertions and ~45,000 deletions, nearly all of which were heterozygous. At ~4X coverage on chromosome 7, some 75% of the SNPs detected were already in dbSNP. In the ENCODE regions (which have been extensively characterized), 91% of the SNPs detected were in dbSNP. To me, the fraction of novel SNPs seems low, but if it remains constant, this study will almost certainly add more SNPs to public databases than the Watson and Venter efforts.


Helicos Resequences M13 Virus Genome

April 7, 2008

The April 4th issue of Science had an article by Helicos BioSciences in which they described the single-molecule DNA sequencing of a viral genome. I knew about Helicos because they came and gave a talk to our Genetics department describing their planned strategy to develop a method for single-molecule sequencing. As I recall, the talk was entirely theoretical as they didn’t have much experimental data to show. Clearly things have gone well for Helicos, since their article convincingly demonstrates the potential of single-molecule sequencing for high-throughput, low-cost sequencing.

Introduction: The Problems with PCR

Why bother with single molecule sequencing? The introduction briefly discussed three problems associated with PCR-based sequencing.

  1. Bias in template representation. Due to thermodynamics and other factors I don’t well understand, PCR efficiency is directly affected by characteristics of the template. Shorter products, for example, are more efficient to amplify than longer products.
  2. Library preparation complications. PCR-based sequencing methods require a lot of templates, and preparation of the libraries can be “onerous and expensive in terms of DNA manipulation,” according to the article. I don’t do library prep myself, but this sounds reasonable.
  3. Error incorporation. Here is something that I do know about. Any time you use PCR, there’s a chance that mis-incorporation at an early cycle will introduce (and then amplify) errors in the sequence. We’ve seen some problems with 454 and Solexa sequencing that may be attributed to this. The idea of taking PCR-induced errors out of sequence reads appeals to me very much.

Results: Sequencing-by-synthesis of the M13 Viral Genome

The authors report sequencing the ~7 kbp M13 genome with 100% coverage and at an average depth of 150X. The read lengths averaged 23-27 bp, depending on the run and some post-processing; the authors claim to have performed runs with average read lengths of over 30 bp. According to alignment statistics in Table 1, there were 32,473 forward-orientation reads (relative to the reference) for an average coverage of 96X, and 34,109 reverse-orientation readds for an average coverage of 105X. Coverage in both orientations becomes important during their mutation-detection simulations.

Simulations of Mutation Detection

Because they sequenced the canonical strain of M13, there should be no sequence polymorphisms. So, to test the ability of this sequencing method to pick up mutations, the authors created “synthetic mutations” in the reference sequence and re-performed alignments. The synthetically-introduced mutations are picked up with an average sensitivity of ~98%. To me, this was the weaker part of the paper - mutations created in silico won’t accurately represent real variation, but at least it let the authors discuss analysis and refinement steps that led to improved mutation detection.

Discussion: Caveats and Future Directions

I don’t think Helicos is yet a threat to established next-generation platforms like Roche/454 and Illumina/Solexa. At 25 bp, the reads are too short to be useful in eukaryotes. Like 454, the Helicos platform has some difficulties with homopolymers , especially runs of cytosine residues. The authors readily admit that “large genomes, heterogeneous samples, and genomic structural variations will likely require longer reads, reduced homopolymer run through, and enhanced alignment tools.”

Yet this publication is an important proof-of-principle for the Helicos method. As far as single-molecule DNA sequencing goes, it looks like Helicos Biosciences is the one to beat.