Sunday, March 06, 2011

Assembling haplotypes in diploid sequencing projects

Shotgun sequencing has been the dominant mode of genome sequencing since the beginning of genomics. However, assembling a complete genome becomes complicated when the two haploid genomes within the sequenced individual are highly divergent: the two haplotypes must then be assembled separately, and the effective coverage of each is half the total. Problems also arise when the two haploid genomes diverge over only part of their length. For example, Barrière et al. (2009) found that, despite inbreeding designed to generate a fully homozygous sample, "approximately 10% and 30% of the Caenorhabditis remanei and C. brenneri genomes, respectively, are represented by two alleles in the assemblies."
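To put a rough number on the coverage penalty, here is a minimal sketch that assumes reads land uniformly at random, so that per-base depth is approximately Poisson (the usual Lander-Waterman style idealization, which real libraries only approximate). The function names are my own and purely illustrative.

```python
import math


def per_haplotype_coverage(total_coverage, divergent=True):
    """Depth available to each haploid genome.

    If the two haplotypes are too divergent to co-assemble, the reads
    split between them and each haplotype sees half the total depth.
    """
    return total_coverage / 2 if divergent else total_coverage


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    total = 8.0  # hypothetical sequencing depth, chosen only for illustration
    for divergent in (False, True):
        depth = per_haplotype_coverage(total, divergent)
        frac = expected_fraction_covered(depth)
        print(f"divergent={divergent}: depth {depth:.1f}x, "
              f"expected fraction covered {frac:.4f}")
```

Under these assumptions, halving the depth takes the expected uncovered fraction from e^-c to e^-(c/2), which is why a project planned for a homozygous sample can end up badly under-sampled when the haplotypes turn out to be divergent.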

A similar problem arises when attempting to resolve the haplotypes within an individual. In the January issue of Nature Biotechnology, Kitzman et al. describe the "Haplotype-resolved genome sequencing of a Gujarati Indian individual." Because each large-insert clone derives from a single haplotype, sequencing pools of such clones provides information about individual haplotypes across most of the genome. Combining "the throughput of massively parallel sequencing with the contiguity information provided by large-insert cloning" allows distinct sequences from large-insert clones to be assembled in parallel, revealing genome structure that might otherwise be very difficult to tease out of a mixed assembly.
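The logic of clone-based phasing can be sketched in a few lines. This is not the pipeline of Kitzman et al., only a toy illustration of the underlying idea: a clone comes from a single haplotype, so the alleles it carries at any two heterozygous sites are in phase with each other. The input format (one dictionary per clone, mapping a site position to allele 0 or 1) and the function name are assumptions made for the example. It links sites with a union-find structure that also tracks whether two sites carry the same or opposite alleles on a haplotype.

```python
from collections import defaultdict


def phase_from_clones(clones):
    """Group heterozygous sites into phased blocks.

    clones: list of dicts {site_position: allele}, allele in {0, 1},
            one dict per clone (hypothetical input format).
    Returns (blocks, conflicts). Each block maps site -> allele on one
    haplotype, oriented so the leftmost site carries allele 0; the other
    haplotype is the complement. conflicts lists site pairs whose
    evidence contradicts earlier clones.
    """
    parent, parity = {}, {}  # parity[s]: allele of s XOR allele of parent[s]
    conflicts = []

    def find(site):
        # Return (root, parity of site relative to root), compressing paths.
        if parent[site] == site:
            return site, 0
        root, p = find(parent[site])
        parent[site] = root
        parity[site] ^= p
        return root, parity[site]

    for clone in clones:
        sites = sorted(clone)
        for s in sites:
            parent.setdefault(s, s)
            parity.setdefault(s, 0)
        # Adjacent sites on one clone fix their relative phase.
        for s, t in zip(sites, sites[1:]):
            relation = clone[s] ^ clone[t]  # 0 = same allele, 1 = opposite
            rs, ps = find(s)
            rt, pt = find(t)
            if rs == rt:
                if (ps ^ pt) != relation:
                    conflicts.append((s, t))
            else:
                parent[rt] = rs
                parity[rt] = ps ^ pt ^ relation

    grouped = defaultdict(dict)
    for s in parent:
        root, p = find(s)
        grouped[root][s] = p

    blocks = []
    for members in grouped.values():
        flip = members[min(members)]  # orient block on its leftmost site
        blocks.append({s: a ^ flip for s, a in sorted(members.items())})
    blocks.sort(key=min)
    return blocks, conflicts


if __name__ == "__main__":
    clones = [
        {100: 0, 200: 1},   # one haplotype
        {200: 0, 300: 0},   # overlaps at site 200, extends the block
        {100: 1, 200: 0},   # the other haplotype, consistent with the first
        {900: 1, 950: 1},   # no overlap with the others: a separate block
    ]
    blocks, conflicts = phase_from_clones(clones)
    print(blocks)
    print(conflicts)
```

Sites that are never linked by overlapping clones stay in separate blocks, which is why clone length and pool depth determine how much of the genome ends up phased. A real analysis would also have to weigh sequencing errors and clones from both haplotypes landing in the same pool, which this sketch only records as conflicts.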

What interests me about the method of Kitzman et al. is that it can be applied directly to cases of widespread structural polymorphism, and I expect to see it used for a variety of problems in the coming years. With this or similar approaches, intractably complex genomes (e.g., Drosophila subobscura; see Sánchez-Gracia and Rozas 2011), asexual species and even metagenomic samples will yield their secrets.