April 8, 2008
One of the big announcements at this year’s AGBT was ABI’s sequencing of a complete human genome using the SOLiD system. It wasn’t just any genome, either: it was the genome of an African male of the Yoruba tribe in Nigeria (one of the HapMap samples). Perhaps I shouldn’t be surprised that the press releases flew months ago while the peer-reviewed publication has yet to appear. Still, I’m eager to read the results of their project, as it will be the first complete genome sequence of an individual from the African continent. Many studies have seen a higher incidence of SNPs, and higher allele frequencies, in African samples, consistent with population bottlenecks during out-of-Africa expansions. In fact, a recent genome-wide survey of genetic variation in 51 populations showed that humans formed a chain of colonies as they migrated out of Africa tens of thousands of years ago. That article’s a very interesting read.
But back to ABI. Perusing the SOLiD web site, I did find a poster on the genome-wide variation detected from their not-yet-completed SOLiD sequencing. From it I took these key pieces of information. They sequenced both fragment and mate-pair libraries to a coverage of about 4.9X. The mate-pair libraries allowed them to detect ~22,000 insertions and ~45,000 deletions, nearly all of which were heterozygous. At ~4X coverage on chromosome 7, some 75% of the SNPs detected were already in dbSNP. In the ENCODE regions (which have been extensively characterized), 91% of the SNPs detected were in dbSNP. To me, the fraction of novel SNPs seems low, but if it remains constant, this study will almost certainly add more SNPs to public databases than the Watson and Venter efforts.
March 29, 2008
We’re working on a project with ~2.2 million 454 reads from two cDNA libraries, and my job is to find and classify the insertion/deletion variants (indels). As you might guess, since these are reads of transcribed sequence, there’s a lot of noise due to mRNA processing. Spliced-out introns look like deletions. Partially processed transcripts might look like they contain insertions. So, once I had made indel predictions by aligning the 454 data to the NCBI Build 36 (hg18) reference sequence, the next priority was to remove the noise.
Fortunately, two colleagues in my group, Ken Chen (the developer of PolyScan) and Brian Dunford-Shore (our resident physicist), have built a “transcriptome” based on all of the known transcripts in the CCDS, Ensembl, and Vega databases. One of the files generated with the transcriptome is the refseq “footprint”, which contains all of the UTRs and exons of all transcripts. It seems to me this file offers the most comprehensive source for annotating the indels from cDNA data.
So, I wrote a script, annotate_with_footprint.pl, which cross-references a set of indels with the footprint file. Insertions are classified as within-CDS-exon, within-UTR-exon, or noncoding. Deletions are a bit more complicated. Like insertions, they can be within-CDS-exon, within-UTR-exon, or noncoding, but they can also span multiple CDS or UTR exons, span intron-exon junctions, etc.
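To make the classification concrete, here is a minimal Python sketch of the idea. It is not the actual annotate_with_footprint.pl. The footprint is assumed to be a list of (chromosome, start, end, kind) records with 1-based inclusive coordinates, and the `Feature` type, the function names, and the `spans-...` labels are all hypothetical names made up for this illustration.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """One footprint record (assumed layout, 1-based inclusive coordinates)."""
    chrom: str
    start: int
    end: int
    kind: str  # "CDS" or "UTR"


def _label(prefix: str, feats: List[Feature]) -> str:
    # CDS takes priority over UTR when both kinds are hit.
    kind = "CDS" if any(f.kind == "CDS" for f in feats) else "UTR"
    return f"{prefix}-{kind}-exon"


def classify_insertion(chrom: str, pos: int, footprint: List[Feature]) -> str:
    """Classify an insertion by the footprint features covering its position."""
    hits = [f for f in footprint
            if f.chrom == chrom and f.start <= pos <= f.end]
    return _label("within", hits) if hits else "noncoding"


def classify_deletion(chrom: str, start: int, end: int,
                      footprint: List[Feature]) -> str:
    """Classify a deletion covering start..end (inclusive)."""
    # Features that overlap the deleted interval at all.
    hits = [f for f in footprint
            if f.chrom == chrom and f.start <= end and start <= f.end]
    if not hits:
        return "noncoding"
    # Deletion lies entirely inside a single feature.
    inside = [f for f in hits if f.start <= start and end <= f.end]
    if inside:
        return _label("within", inside)
    # Deletion swallows one or more whole features (possible exon skipping).
    spanned = [f for f in hits if start <= f.start and f.end <= end]
    if spanned:
        return _label("spans", spanned)
    # Otherwise it clips the edge of a feature.
    return "spans-intron-exon-junction"
```

A linear scan like this is fine for a toy example. For millions of reads you would want the footprint sorted or held in an interval index per chromosome.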
As it turned out, only about 12% of the insertions and 1% of the deletions were in coding exons; the vast majority were in UTR/intron regions or were intron-exon splice artifacts. Another 4% of the deletions appeared to span one or more CDS exons, but many of these may be exon-skipping events, not true deletions.
Even with strong 454 cDNA support, I won’t be confident that these are real coding mutations until we validate them in genomic DNA.