11:00 AM-12:00 PM
Room: Gold Rush A
We report on the design of a whole genome shotgun assembler and its application to the sequencing of the Drosophila genome. Celera's whole genome strategy consists of randomly sampling pairs of sequence reads, each 500-600 bases long, that lie at approximately known distances from each other: short pairs at a distance of 2 kbp, long pairs at 10 kbp, and BAC-end pairs at 150 kbp. For Drosophila, we collected 1.6 million pairs, so that the sum of the lengths of the reads is roughly 12 times the length of the genome (~130 million bases), a so-called 12X shotgun data set. The reads were collected so that there are two short read pairs to every long read pair, with a sprinkling of roughly 12,000 BAC-end pairs. The experimental accuracy of the read sequences is roughly 98%. Given this data set, the problem is to determine the sequence of Drosophila's 4 chromosomes, which are estimated to be 10-15% repetitive sequence.
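The coverage figure quoted above can be checked with simple arithmetic. The sketch below is only an illustration: the function name `shotgun_coverage` is hypothetical, and it assumes every read has the same length, which the real data set does not.

```python
def shotgun_coverage(num_pairs: int, read_len: float, genome_len: float) -> float:
    """Return fold coverage: total sequenced bases divided by genome length.

    Assumes two reads per pair and a single uniform read length
    (a simplification; the abstract gives a 500-600 base range).
    """
    total_bases = 2 * num_pairs * read_len
    return total_bases / genome_len


# Figures from the abstract: 1.6 million pairs, ~130 million base genome.
low = shotgun_coverage(1_600_000, 500, 130_000_000)
high = shotgun_coverage(1_600_000, 600, 130_000_000)
print(f"coverage between {low:.1f}X and {high:.1f}X")
```

With reads at the lower end of the stated length range, the result is close to the 12X quoted; longer reads push the estimate somewhat higher.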
The assembler computes all overlaps between the reads in under 13 hours on a 4-processor Compaq platform, and completes the entire assembly process in under 36 hours. We layer the ideas of uncontested interval graph collapsing, confirmed read pairs, and mutually confirming paths to yield a strategy that makes remarkably few errors. The assembler correctly identifies all unique stretches of the genome, correctly building contigs for each and ordering them into scaffolds spanning each of the chromosomes. Thus all useful proteomic information has been assembled as of this writing. We will be reporting on the extent to which the ubiquitous repeats that lie between these contigs are resolved. Preliminary trials suggest that 99.97% or more of the genome will be assembled, far exceeding the 95% standard set for human chromosome 22. The design of the assembler provides a complete audit trail of the moves it makes in assembling a data set, and it is capable of incorporating external data such as independently sequenced segments of the given genome, lightly shotgunned (3-5X) segments, and smaller marker sequences located on the genome (STSs). By arranging assembly to be concurrent with data collection, this assembler should achieve a comparable result on the human genome with a 3-month computation on a 10-processor platform.
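As a rough illustration of what it could mean for a read pair to be "confirmed", the sketch below checks whether the two reads of a pair, once placed on a contig or scaffold, lie at a distance consistent with their library. This is not Celera's algorithm: the `PlacedPair` type, the `is_consistent` function, and the 25% tolerance are all assumptions made for the example; only the nominal library distances come from the abstract.

```python
from dataclasses import dataclass

# Nominal pair distances from the abstract, in bases.
LIBRARY_DISTANCE = {"short": 2_000, "long": 10_000, "bac_end": 150_000}


@dataclass
class PlacedPair:
    """A read pair with hypothetical coordinates on one contig or scaffold."""
    library: str      # "short", "long", or "bac_end"
    left_start: int   # coordinate where the first read begins
    right_end: int    # coordinate where its mate ends


def is_consistent(pair: PlacedPair, tolerance: float = 0.25) -> bool:
    """Return True if the observed span is within `tolerance` (a fraction,
    chosen arbitrarily here) of the library's nominal distance."""
    expected = LIBRARY_DISTANCE[pair.library]
    observed = pair.right_end - pair.left_start
    return abs(observed - expected) <= tolerance * expected


# A short pair spanning 2,100 bases agrees with the 2 kbp library;
# a long pair spanning only 2,000 bases does not agree with 10 kbp.
good = is_consistent(PlacedPair("short", 1_000, 3_100))
bad = is_consistent(PlacedPair("long", 0, 2_000))
```

A real assembler would also have to account for read orientation and the measured spread of each library, which this sketch leaves out.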