Applications of Compressed Data Structures on Sequences and Structured Data

Event type: 
Doctoral dissertation
Doctoral dissertation
Respondent: 
Niko Välimäki
Opponent: 
Research director Eric Rivals, LIRMM, CNRS
Custos: 
Professor Veli Mäkinen
Event time: 
2012-08-21 12:00 to 16:00
Place: 
University of Helsinki Main Building, Auditorium XIV, Unioninkatu 34
Description: 

Recent advancements in the field of compressed data structures create interesting opportunities for interdisciplinary research and applications. Compressed data structures provide essentially a time--space tradeoff for solving a problem; while traditional data structures use extra space in addition to the input, compressed data structures replace the input and require space proportional to the compressed size of the input. The amount of available memory is often fixed, thus, the user might be willing to spend more time if it allows the use of larger inputs. However, despite the potential behind compressed data structures, they have not quite reached the audience of other disciplines. We study how to take advantage of compressed data structures in the fields of bioinformatics, data analysis and information retrieval.

We present several novel applications for compressed data structures and include an experimental evaluation of the time-space tradeoffs achieved. More precisely, we propose (i) a space-efficient string mining algorithm to recognise substrings that admit the given frequency constraints, (ii) both theoretical and practical methods for computing approximate overlaps between all string pairs, (iii) a practical path-based graph kernel for predicting the function of unknown enzymatic reactions, and (iv) a compressed XML index that supports efficient XPath queries on both the tree-structure and textual content of XML documents. Problem (i) is motivated by knowledge discovery in databases, where the goal is to extract emerging substrings that discriminate two (or more) databases. Problem (ii) is one of the first phases in a sequence assembly pipeline and requires efficient algorithms due to the new high-throughput sequencing systems. Problem (iii) is motivated by machine learning, where kernels are used to measure the similarity of complex objects. Problem (iv) has its background in information retrieval.

The proposed methods achieve theoretical and practical improvements over the earlier state of the art. To raise the overall awareness of compressed data structures, our results have been published in interdisciplinary forums, including conferences and journals from the fields of bioinformatics, data engineering and data mining.


Last updated on 29 Oct 2012 by Pirjo Moen - Page created on 7 Aug 2012 by Pirjo Moen