Algorithm Engineering for Efficient Genome Comparison
Prof. Dr. Knut Reinert; Freie Universität Berlin
In the last two decades, the field of genomics has seen large changes, fueled by technological advances in sequencing technology. Nowadays, important large mammalian genomes (e.g. human, rat, mouse) are completely available. Maybe even more impressive, the time/cost to read genomic data has sunk by at four/five orders of magnitude. In the next years, the rate of generated data will continue to increase as the cost will continue to decrease.
Clearly, this huge amount of data calls for the usage of computers and algorithms and both have successfully been applied to support biologists in their work. Without the contribution of computer science and bioinformatics, the biological and medical advances driven by genomics would have been impossible.
The pioneering work in Bioinformatics is done very closely to the sequencing labs and biologists. There, it is most important for a computational result to be available quickly. As long as the result is not obviously lacking, the methodology of designing, implementing and evaluating the programs, respectively algorithms is not questioned. Often, sequence analysis programs are written with ad hoc heuristic, tailored to the one use case at hand. Besides some simple metrics like "number of aligned bases/genes", experimental evaluation is often limited to anecdotal considerations. One reason for this is the lack of established metrics.
Benchmarks for Read Mapping
Read Mapping is one key component for modern genomics. In the last decade, tens of papers have appeared on this topic, due to the large advances that have been made in sequencing technology. Clearly, it is important for Biologists and Bioinformaticians to select a good read mapper objectively. However, most published read mappers solve slightly different problems.
Beginning from a precise definition of the problem of read mapping we are developing a precise formulation of a solution to this problem. Using these precise formulations, we are then designing algorithms and writing programs to evaluate read mapping programs in a fair and objective manner. These are then tied together by wrapping scripts to provide a comprehensive read mapping benchmark.
Whole Genome Comparison
Complex models that capture each and every biological detail quickly become computationally intractable. Simple models, however, are vulnerable to an excess of abstraction despite their ease of computability. These models can ultimately lead to efficient algorithms producing useless results. We plan to review approaches from the literature, create precise problem formulations from the whole genome alignment and comparison from these, introduce metrics for whole genome alignments and design/implement/evaluate (i.e. engineer) algorithms that solve these problems.
- M. Holtgrewe, A.-K. Emde, D. Weese and K. Reinert. "A Novel And Well-Defined Benchmarking Method For Second Generation Read Mapping", BMC Bioinformatics, Vol. 12, May, 2011. [More]
- M. Holtgrewe. "Mason - a read simulator for second generation sequencing data". Fachbereich für Mathematik und Informatik, Freie Universität Berlin. Tech. Rep. TR-B-10-06. 2010. [More]