T5: Mining Application Traces at Scale with Apache Spark

Vincent Leroy

University of Grenoble
Grenoble, France


July 25 (Tuesday)


09:00 - 12:00 (Half-day)


Application traces contain valuable information for developers. They can be analyzed to extract workflows, discover usage patterns, or characterize anomalies. A data scientist’s typical toolbox for this analysis comprises clustering, pattern mining, and classification algorithms. Extracting more fine-grained and reliable information often requires processing increasingly large amounts of data, to the point that many existing implementations struggle to produce results. In this tutorial, we will see how Apache Spark, an open-source platform for big data processing, can be used to alleviate this issue. Developing an application with Spark allows the analyst to perform small-scale analysis on a laptop and then deploy the same code on clusters or clouds to benefit from more processing power when dealing with a large-scale dataset. We will start by presenting Spark’s architecture and computing paradigm. Then, we will describe several analysis scenarios on real application traces and walk through the code of the applications used to mine these traces.

About the Speaker

Vincent Leroy is an associate professor at the University of Grenoble, where he holds a research chair from CNRS. He is a permanent member of the Scalable Information Discovery and Exploitation (SLIDE) research group. He earned a PhD on large-scale distributed systems for social applications from Inria Rennes, France, in 2010. From 2010 to 2012, he worked on distributed search engines at Yahoo! Research Labs in Barcelona, Spain. Vincent’s research interests lie at the intersection of distributed systems and large-scale data management. Currently, he is working on the design of algorithms and architectures to efficiently perform large-scale data mining.