Mining Big Data Streams: Better Algorithms or Faster Systems?

Lecturer: 
Gianmarco De Francisci Morales
Event type: 
Guest lecture
Event time: 
2015-03-30 13:15 to 14:00
Place: 
Lecture hall T2, Computer Science building, Konemiehentie 2, 02150, Espoo
Description: 

Abstract:

The rate at which the world produces data is growing steadily, creating ever larger streams of continuously evolving data. However, the current de facto standard solutions for big data analysis are not designed to mine evolving streams. So, should we find better algorithms to mine data streams, or should we focus on building faster systems?
 
In this talk, we debunk this false dichotomy between algorithms and systems, and we argue that the data mining and distributed systems communities need to work together to bring about the next revolution in data analysis. In doing so, we introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza.
 
As a case study, we present one of SAMOA's main algorithms for classification, the Vertical Hoeffding Tree (VHT). Then, we analyze the algorithm from a distributed systems perspective, highlight the issue of load balancing, and describe a generalizable solution to it. Finally, we conclude by envisioning system-algorithm co-design as a promising direction for the future of big data analytics.
 
 
Bio:
Gianmarco De Francisci Morales is a Visiting Scientist at Aalto University. Previously, he worked as a Research Scientist at Yahoo Labs Barcelona. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on scalable data mining, with an emphasis on Web mining and data-intensive scalable computing systems. He is an active member of the open-source community of the Apache Software Foundation, working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is one of the lead developers of Apache SAMOA, an open-source platform for mining big data streams. He co-organizes the workshop series on Social News on the Web (SNOW), co-located with the WWW conference.

Last updated on 16 Mar 2015 by Yi Chen - Page created on 16 Mar 2015 by Yi Chen