Prof. Dr. Anand Srivastav, Christian-Albrechts-Universität zu Kiel
Prof. Dr. Thorsten Reusch, GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Prof. Dr. Philip Caspar Rosenstiel, Universitätsklinikum Schleswig Holstein


Genome assembly is a central topic in modern life science. Increasing data is a major limitation for the applicability of presently available genome assembly software. Several approaches have been proposed to model the problem of de novo genome assembly as a combinatorial optimization problem in order to transfer the field's vast knowledge, but the literature does not reveal any major advantages of one approach over the other. Our focus will be on the development of holistic approaches, which try to encode all available information in one optimization problem, perhaps with parameters controlling certain aspects. Our investigation will be guided by the following questions: What kind of data is available or could be make available by modifications of the sequencing technique? How can we incorporate all data into an optimization problem? How can such an optimization problem be solved? How do solutions to the optimization problem look in practice (benchmarking, statistical tests)?

In common with the whole genome research area, massive amount of reads is also produced  at Kiel University  at the institutions of the proponents  P.Rosenstiel (CAU Kiel) and T. Reusch (GEOMAR) where state-of-the-artassemblers failed. This major bottleneck may require new approaches to genome assembly. We propose highly-space efficient streaming algorithms where a family of reads comes as a stream to which only space-limited access called  only a few  times is possible.Another  approach  is maximum-likelihood  algorithms and convexprogramming. Existing work assumes several  simplifications, concluding that "the development  of a comprehensive  assembler based on the maximumlikelihoodformulation  is still an open issue". We will aim to close this gap by a thorough application of techniques from algorithm engineering.  Streaming may also help here to reduce space bottlenecks when  applied  to convex programming. A third approach  is probabilistic data structures, which represent  information with fewer  bits than an exact representation  would  require,  at the expense of the possibility of errors. Bloom filters have  been successfully applied to de Bruijn graphs, one of the graph models for genome assembly. A theoretical foundation  is lacking, and other probabilistic  data structures should be explored.Finally, we wish to solve  the genome  assembly and related problems  in Kiel: Case studies will deal with large  eukaryotic genomes  of non-modelorganisms form the oceans with a range of genome sizes and complexities.In the group  of T. Reusch, one emphasis are eukaryotic plankton organisms of the haplotypes with genome sizes of  250-1000  M base pairs and a low to moderate degree of repetitive elements. Case studies in the group of P.Rosenstiel will cover  the detection of gene variations by mapping reads from e.g. Crohn/Colitis  cells onto a reference genome.