Scalable Computing Methods for Processing Next Generation Sequencing Data in the Cloud

Lecturer:
Keijo Heljanko
Event type: 
HIIT seminar
Event time: 
2012-05-14 13:15
Place: 
Lecture Hall T2, ICS department

Abstract:

One of the ongoing changes in the field of computing is the emergence of large-scale cloud computing datacenters, containing tens or even hundreds of thousands of computers, as the platform of choice for implementing software-based services at Internet scale. These datacenters have been called "warehouse-scale computers" (WSC), a term coined by Hoelzle and Barroso of Google, when describing the in-house computing infrastructure of companies such as Google, Facebook, and Yahoo!. They follow a new blueprint for the network and computing infrastructure, based on the use of commercial off-the-shelf (COTS) components at massive scale. At this scale component failures are the norm, and some part of the infrastructure is always unavailable due to failed components. This calls for a radically different software infrastructure, one that is immune to the failure of any single component, and the massive scale of computing likewise requires a radical change in the programming paradigms employed on the networking and computing infrastructure.

We discuss the use of scalable cloud computing technologies, such as the MapReduce programming framework and its open-source Hadoop implementation, for manipulating the large datasets arising from next-generation sequencing. Our main tool, Hadoop-BAM, is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and the BAM files being processed with Hadoop. We demonstrate the use of Hadoop-BAM by building a coverage-summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and that one should avoid moving data in and out of Hadoop between analysis steps.
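To make the Hadoop-BAM programming model concrete, below is a minimal sketch of a coverage-summarizing MapReduce job. It assumes Hadoop-BAM's BAMInputFormat and SAMRecordWritable classes (package fi.tkk.ics.hadoop.bam) together with the Picard SAMRecord API; the fixed-size binning scheme and the job wiring are illustrative and not necessarily how the actual Chipster tool is implemented.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import fi.tkk.ics.hadoop.bam.BAMInputFormat;
    import fi.tkk.ics.hadoop.bam.SAMRecordWritable;
    import net.sf.samtools.SAMRecord;

    public class CoverageSummary {
        // Summary resolution in bases; purely illustrative.
        private static final int BIN = 1000;

        // Mapper: one input record per read in the BAM file; emit a
        // count of 1 for every BIN-sized window the alignment overlaps.
        public static class CoverageMapper
                extends Mapper<LongWritable, SAMRecordWritable, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text bin = new Text();

            @Override
            protected void map(LongWritable key, SAMRecordWritable value, Context ctx)
                    throws IOException, InterruptedException {
                SAMRecord read = value.get();
                if (read.getReadUnmappedFlag())
                    return; // skip unaligned reads
                int first = read.getAlignmentStart() / BIN;
                int last  = read.getAlignmentEnd()   / BIN;
                for (int b = first; b <= last; ++b) {
                    bin.set(read.getReferenceName() + ":" + b * BIN);
                    ctx.write(bin, ONE);
                }
            }
        }

        // Reducer (also used as combiner): sum per-bin counts into the
        // coverage summary.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values)
                    sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "coverage-summary");
            job.setJarByClass(CoverageSummary.class);
            job.setInputFormatClass(BAMInputFormat.class); // Hadoop-BAM splits the BAM input
            job.setMapperClass(CoverageMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the combiner pre-aggregates counts on the map side, the amount of data shuffled between nodes stays proportional to the number of occupied bins rather than the number of reads, which is what makes this kind of summarization scale well on Hadoop.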

This is joint work with Matti Niemenmaa and André Schumacher from Aalto as well as Aleksi Kallio, Petri Klemelä, and Eija Korpelainen from CSC.


Bio:

Keijo Heljanko is a Professor at the Department of Information and Computer Science, Aalto University. He obtained his doctoral degree from Helsinki University of Technology in 2002 and worked as a postdoctoral researcher at the University of Stuttgart and as an Academy Research Fellow before being appointed to his current position in 2008.

