Prof. Dr. Anand Srivastav, Christian-Albrechts-Universität zu Kiel
Genome assembly is a central topic in modern life science. Increasing data is a major limitation for the applicability of presently available genome assembly software. Several approaches have been proposed to model the problem of de novo genome assembly as a combinatorial optimization problem in order to transfer the field's vast knowledge, but the literature does not reveal any major advantages of one approach over the other. Our focus will be on the development of holistic approaches, which try to encode all available information in one optimization problem, perhaps with parameters controlling certain aspects. Our investigation will be guided by the following questions: What kind of data is available or could be make available by modifications of the sequencing technique? How can we incorporate all data into an optimization problem? How can such an optimization problem be solved? How do solutions to the optimization problem look in practice (benchmarking, statistical tests)?
In common with the whole genome research area, massive amount of reads is also produced at Kiel University at the institutions of the proponents P.Rosenstiel (CAU Kiel) and T. Reusch (GEOMAR) where state-of-the-artassemblers failed. This major bottleneck may require new approaches to genome assembly. We propose highly-space efficient streaming algorithms where a family of reads comes as a stream to which only space-limited access called only a few times is possible.Another approach is maximum-likelihood algorithms and convexprogramming. Existing work assumes several simplifications, concluding that "the development of a comprehensive assembler based on the maximumlikelihoodformulation is still an open issue". We will aim to close this gap by a thorough application of techniques from algorithm engineering. Streaming may also help here to reduce space bottlenecks when applied to convex programming. A third approach is probabilistic data structures, which represent information with fewer bits than an exact representation would require, at the expense of the possibility of errors. Bloom filters have been successfully applied to de Bruijn graphs, one of the graph models for genome assembly. A theoretical foundation is lacking, and other probabilistic data structures should be explored.Finally, we wish to solve the genome assembly and related problems in Kiel: Case studies will deal with large eukaryotic genomes of non-modelorganisms form the oceans with a range of genome sizes and complexities.In the group of T. Reusch, one emphasis are eukaryotic plankton organisms of the haplotypes with genome sizes of 250-1000 M base pairs and a low to moderate degree of repetitive elements. Case studies in the group of P.Rosenstiel will cover the detection of gene variations by mapping reads from e.g. Crohn/Colitis cells onto a reference genome.