This month I’ve come across some interesting statistics on the performance of Maq, Eland, and other short-read alignment tools as applied to Illumina/Solexa data. I took note because these programs are finally being evaluated against appropriate data sets, as opposed to simulated reads or tiny genomes. First, the disclaimers: all of these numbers came from people other than myself (see Credits, below), so please forgive any inaccuracies. This entry also reflects my personal, second-hand impressions of the tools, and should not be taken as an endorsement or criticism of any of them by the WashU GC.
Short-Read Data Sets at the WashU Genome Center
One of our data sets includes 100+ Solexa runs (non-paired) from the genomic DNA of a single individual. We’ve applied a number of alignment tools to these data: Eland (part of the Illumina suite), Maq (free/open source), SX Oligo Search (proprietary), SlimSearch (proprietary), and even BLAT. Our group (Medical Genomics) is currently leaning toward Maq for read mapping and SNP discovery purposes. There’s recently been a new release of Maq (0.6.5) which seems to run substantially faster:
| Metric | Maq 0.6.3 | Maq 0.6.5 |
|---|---|---|
| Average alignment time for normal runs | 17.7 hours | 9.1 hours |
| Max alignment time for a normal run | 240 hours | 28.8 hours |
| Total number of jobs | 2168 | 1467 |
| Jobs that took longer than 1 day | 443 | 3 |
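For readers who haven’t run Maq, here is roughly what one of those timed jobs involves. This is a minimal sketch of a single-end run (our data set above is non-paired) driven from Python; the subcommand names reflect my reading of the Maq 0.6.x documentation, and the file names are placeholders rather than anything from our actual pipeline.

```python
# Minimal sketch of a single-end Maq run, driven from Python via subprocess.
# Subcommand names follow my reading of the Maq 0.6.x documentation; the file
# names (ref.fa, run1.fastq, etc.) are hypothetical placeholders.
import subprocess

def run(cmd):
    """Run a command and stop if it fails."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Convert the reference and reads into Maq's binary formats.
run(["maq", "fasta2bfa", "ref.fa", "ref.bfa"])
run(["maq", "fastq2bfq", "run1.fastq", "run1.bfq"])

# 2. Align the reads (this is the step timed in the table above).
run(["maq", "map", "run1.map", "ref.bfa", "run1.bfq"])

# 3. Build the consensus and extract candidate SNPs.
run(["maq", "assemble", "run1.cns", "ref.bfa", "run1.map"])
with open("run1.snp", "w") as out:
    subprocess.run(["maq", "cns2snp", "run1.cns"], stdout=out, check=True)
```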
The developer of Maq, Heng Li, presented a poster describing the Maq algorithm at CSHL last week and also gave a small workshop talk on issues in short read mapping. He sent these links out to the Maq user list along with a benchmarking comparison of various read mapping tools.
Heng Li’s Comparison of Short-Read Aligners
For the comparison, Heng generated 1 million simulated read-pairs from chromosome X. The numbers themselves are a bit mind-boggling, but fortunately he summarized the results with these notes:
What a nice guy! Here he is, comparing his own tool against several competitors and he manages to praise the strengths of each one. That takes humility.
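For context, “simulated read-pairs” here means reads drawn from known positions in a reference sequence, so the truth is available when scoring the alignments. I don’t have Heng’s simulation code, but conceptually the procedure looks something like the sketch below; the read length, insert size, and error rate are illustrative values, not the ones he used.

```python
# Illustrative read-pair simulator (not Heng Li's actual code).
# Read length, insert size, and error rate are made-up example parameters.
import random

BASES = "ACGT"

def mutate(base, error_rate):
    """Flip a base to a random different base with probability error_rate."""
    if random.random() < error_rate:
        return random.choice([b for b in BASES if b != base])
    return base

def revcomp(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def simulate_pairs(ref, n_pairs, read_len=35, insert=200, error_rate=0.01):
    """Yield (true_position, read1, read2) tuples drawn from a reference string."""
    for _ in range(n_pairs):
        start = random.randint(0, len(ref) - insert)
        fragment = ref[start:start + insert]
        read1 = "".join(mutate(b, error_rate) for b in fragment[:read_len])
        # The mate comes from the other end of the fragment, reverse-complemented.
        read2 = "".join(mutate(b, error_rate) for b in revcomp(fragment[-read_len:]))
        yield start, read1, read2

if __name__ == "__main__":
    ref = "".join(random.choice(BASES) for _ in range(100_000))  # stand-in for chrX
    for pos, r1, r2 in simulate_pairs(ref, 5):
        print(pos, r1, r2)
```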
More Comments from Heng Li
Ken Chen, a colleague of mine, happened to discuss the benchmarking with Heng at Cold Spring Harbor. According to Heng’s evaluation, the current version of the recently published SOAP may be somewhat buggy (it had more mapping errors and crashed on paired-end alignment), but it is nevertheless promising because it supports gapped alignment and longer reads. Paired-end alignment is perhaps Maq’s greatest strength: its alignment error rate on paired-end data is significantly reduced. Heng also mentioned that the upcoming release of Eland will support longer read lengths (>32 bp) and will calculate mapping quality scores.
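The paired-end advantage is easy to see in a toy example: a read that matches several repeat copies equally well can often be placed confidently once its mate is required to land nearby, within the expected insert-size range. The sketch below illustrates that general principle; it is not Maq’s actual pairing algorithm.

```python
# Toy illustration of paired-end "rescue": an ambiguous read is resolved by
# requiring its mate to map within the expected insert-size window.
# General principle only, not Maq's actual pairing algorithm.

def rescue_with_mate(read_hits, mate_hits, min_insert=100, max_insert=400):
    """Return (read_pos, mate_pos) pairs whose separation fits the library insert size."""
    consistent = []
    for r in read_hits:
        for m in mate_hits:
            if min_insert <= abs(m - r) <= max_insert:
                consistent.append((r, m))
    return consistent

# The read hits three repeat copies; its mate maps uniquely.
read_hits = [10_000, 55_000, 90_200]
mate_hits = [90_450]

pairs = rescue_with_mate(read_hits, mate_hits)
print(pairs)            # [(90200, 90450)]
print(len(pairs) == 1)  # a single consistent placement -> confident mapping
```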
Unbiased Comparisons of Short-Read Aligners
In summary, there are a number of competing tools for short-read alignment, each with its own strengths, weaknesses, and caveats. It’s hard to trust any benchmarking comparison of tools like these, because it’s usually the developers of one of the tools who publish them. Here’s an idea: what if NHGRI, Illumina, or another group put together a short-read alignment contest? They would generate a few short-read data sets: real and simulated, with and without errors, with and without SNPs and indels, and so on. The developers of each aligner would then be invited to throw their best efforts at the data. Every group would submit its results to a DCC, which would analyze them in a simple, unbiased way: the number of reads placed correctly or incorrectly, and the number of SNPs/indels detected, missed, or falsely called. The results would be published on a web site or in the literature for all to see. Yes, I know there are hurdles, like the fact that most proprietary tool developers would probably chicken out of an unbiased head-to-head comparison, given the stakes. But wouldn’t it be nice to know the results? Until that happens, I think Heng’s analysis is about as unbiased as we’re going to get.
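The scoring for such a contest really could be that simple. Given a truth set (where each read actually came from, plus the true SNP positions) and a group’s submission, the DCC would only need something like the following; the data structures and tolerance here are invented for illustration.

```python
# Sketch of the simple, unbiased scoring a DCC could apply to contest submissions.
# Inputs are plain dicts/sets here; a real contest would define file formats
# (an invented detail, not part of any existing effort).

def score_alignments(truth_pos, submitted_pos, tolerance=0):
    """Count reads placed correctly, placed incorrectly, or not placed at all."""
    correct = incorrect = unplaced = 0
    for read_id, true_pos in truth_pos.items():
        reported = submitted_pos.get(read_id)
        if reported is None:
            unplaced += 1
        elif abs(reported - true_pos) <= tolerance:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, unplaced

def score_snps(true_snps, called_snps):
    """Return (detected, missed, false_positives) for SNP calls."""
    detected = len(true_snps & called_snps)
    missed = len(true_snps - called_snps)
    false_pos = len(called_snps - true_snps)
    return detected, missed, false_pos

# Tiny worked example.
truth = {"read1": 1000, "read2": 2000, "read3": 3000}
submission = {"read1": 1000, "read2": 2500}
print(score_alignments(truth, submission))      # (1, 1, 1)
print(score_snps({100, 200, 300}, {100, 400}))  # (1, 2, 1)
```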
Credits
WashU GC Maq version comparisons were sent out by Jim Eldred on 5/01/2008. Heng Li’s benchmarking comparison was sent to the Maq user list on 5/12/2008. Additional comments from Heng Li were reported by Ken Chen on 5/12/2008.
At last, some results from Evan Eichler’s SV project! The results of the first phase of the “Human Genome Structural Variation Project” were presented in today’s issue of Nature. I’ve been cognizant of this project for a couple of years and eager for the results, as it is really the first large-scale, sequence-based study of copy number and structural variants. As it happens, our Genome Center played a big role in the sequencing, and two of our researchers (Tina Graves and Rick Wilson) are among the authors.
In fairness, however, I should disclose another thing about the Evan Eichler project.
In 2006, just after three simultaneous papers in Nature Genetics brought structural variation to the forefront, my former lab began working on a grant proposal. In it, we proposed to mine existing trace data (from NCBI’s Trace Archive) for putative structural variants in the human genome. We were developing a sequence-based approach to identify reads spanning insertions, deletions, duplications, inversions, and translocations. It was an ambitious project and the timing was perfect, but, unfortunately, NIH sent our proposal back twice. Unscored. Later, I learned that a group headed by Evan Eichler pretty much locked up U.S. funding for this research in the form of a $40 million grant. With all of the NIH eggs in a single basket, I thought to myself, they had better deliver.
It looks like I won’t be disappointed. Dr. Eichler and colleagues constructed whole-genome libraries of ~1 million fosmids for each of 8 individuals whose samples were used in the HapMap Project: four were African, two were CEPH European, one was Japanese, and one was Han Chinese. They used the fosmid end-sequencing approach (described by Tuzun et al., 2005), in which you sequence the ends of each fosmid and map those sequences to the reference genome. Altogether, about 6.1 million end-sequence pairs (ESPs) were uniquely mapped to the genome. Of these, some 76,767 (~1.26%) were discordant by alignment distance or orientation, indicating a possible underlying structural variant. It’s a big paper (the Supplemental File alone was 57 pages), so I’ll hit you with the take-homes:
Overall it seems like an impressive publication. They planned and executed a very careful study, and as a result, we learned quite a bit about the landscape of structural variation in the human genome. Not bad, Dr. Eichler.
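As a technical footnote, the discordance test at the heart of the ESP approach is easy to sketch: end pairs that map too far apart, too close together, or in an unexpected orientation flag a putative variant. The thresholds below are placeholders; in a real study they would come from the fosmid library’s empirical insert-size distribution (fosmids are roughly 40 kb).

```python
# Schematic classification of fosmid end-sequence pairs (ESPs) by mapped
# distance and orientation. Thresholds are illustrative placeholders, not
# the cutoffs used in the paper.

def classify_esp(start1, strand1, start2, strand2,
                 min_insert=32_000, max_insert=48_000):
    """Label an ESP as concordant or as a deletion/insertion/inversion signature."""
    span = abs(start2 - start1)
    # Simplified orientation check: a proper pair has its ends on opposite strands.
    if strand1 == strand2:
        return "inversion_signature"
    if span > max_insert:
        return "deletion_signature"   # reference has sequence the sample lacks
    if span < min_insert:
        return "insertion_signature"  # sample has sequence the reference lacks
    return "concordant"

esps = [
    (1_000_000, "+", 1_040_000, "-"),  # ~40 kb apart -> concordant
    (2_000_000, "+", 2_090_000, "-"),  # too far apart -> deletion in the sample
    (3_000_000, "+", 3_010_000, "-"),  # too close -> insertion in the sample
    (4_000_000, "+", 4_040_000, "+"),  # same strand -> inversion breakpoint
]
for esp in esps:
    print(classify_esp(*esp))
```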