Algorithm Engineering for High-Throughput Sequencing Data

During the last five years, modern sequencing technologies have brought a super-exponential growth of sequencing capacities. This project aims to respond to this increase in genomic sequence data with algorithmic approaches that exploit redundancies across multiple datasets. More specifically, we aim at:
1) Developing a data structure that represents one or more genomic sequences by storing only their differences to a similar reference sequence while maintaining the ability to navigate quickly in all sequences. We then use this data structure to develop algorithms that transform the substring index of a reference into the substring index of a new genome without rebuilding it from scratch, storing only the differences to the reference index.
2) Developing algorithms that efficiently process multiple genomes in parallel based on the representation developed in 1).
3) Bridging the gap between algorithm theory and practical implementations by extending SeqAn as a library providing the core algorithmic components required to analyze large-scale genomic data and as an experimental platform to design, analyze, and implement state-of-the-art bioinformatics algorithms.
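
The difference-based representation described in 1) can be illustrated with a minimal Python sketch. It is not SeqAn's API or the project's actual design: the class name DeltaSequence, the edit format (reference position, deleted length, inserted text), and the segment table are all assumptions made for illustration. The sketch stores one shared reference plus a short list of edits per genome, and uses binary search over segment start positions so that any position of the derived sequence can be read without materializing it.

```python
from bisect import bisect_right


class DeltaSequence:
    """Hypothetical sketch: a sequence stored as edits against a shared reference."""

    def __init__(self, reference, edits):
        # edits: iterable of (ref_pos, deleted_length, inserted_text),
        # which must not overlap on the reference.
        self.reference = reference
        self.starts = []    # start of each segment in the derived sequence
        self.segments = []  # (source, offset); source None means the reference
        self.length = 0
        ref_cursor = 0
        for ref_pos, del_len, inserted in sorted(edits):
            if ref_pos < ref_cursor or ref_pos + del_len > len(reference):
                raise ValueError("edits overlap or exceed the reference")
            # Unchanged stretch of the reference before this edit.
            self._add(None, ref_cursor, ref_pos - ref_cursor)
            # Text that exists only in the derived sequence.
            self._add(inserted, 0, len(inserted))
            # Skip the deleted part of the reference.
            ref_cursor = ref_pos + del_len
        # Unchanged tail of the reference after the last edit.
        self._add(None, ref_cursor, len(reference) - ref_cursor)

    def _add(self, source, offset, length):
        if length > 0:
            self.starts.append(self.length)
            self.segments.append((source, offset))
            self.length += length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Binary search for the segment that contains position i.
        k = bisect_right(self.starts, i) - 1
        source, offset = self.segments[k]
        text = self.reference if source is None else source
        return text[offset + i - self.starts[k]]

    def to_string(self):
        return "".join(self[i] for i in range(len(self)))


reference = "ACGTACGT"
# Substitute G->T at 2, insert "GG" before 4, delete the last two bases.
sample = DeltaSequence(reference, [(2, 1, "T"), (4, 0, "GG"), (6, 2, "")])
print(sample.to_string())
```

In this sketch the memory used per additional genome grows with the number of edits rather than with the genome length, and random access costs one binary search over the segment table. Several DeltaSequence objects can share the same reference string, which is the redundancy the aims above intend to exploit.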