Daniela Grigori has been a Professor at University Paris-Dauphine since September 2011. Previously, she was an associate professor at the University of Versailles (2002-2011).
Her current research interests include web services, process management, business process intelligence, and graph databases. She has published numerous papers in international conferences and journals and has served as an organizer and program committee member for many conferences. She is co-author of a book on process analytics (Springer, 2016).
Business process analytics
Business processes are inseparable from data: process execution data, process documentation and descriptions, process models and versions, and artifacts and data generated or exchanged during process execution. These data can take various forms: structured, semi-structured, or unstructured. The variety of tools for capturing and collecting data, together with the implementation of processes in different systems, has amplified the amount of data available around processes.
To improve the quality of the services they offer and stay competitive, a central problem for companies is identifying, measuring, analyzing, and improving their processes. This tutorial is an introduction to the concepts, methods, and techniques for analyzing process data.
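To make the subject concrete, here is a minimal sketch (an illustration, not material from the tutorial) of one basic process-analytics primitive: extracting the directly-follows relation from an event log. The toy log is an assumption for illustration.

```python
from collections import Counter

# Toy event log as (case_id, activity) records, in chronological order per
# case; real logs come from workflow and BPM systems.
log = [
    ("case1", "receive order"), ("case1", "check stock"), ("case1", "ship"),
    ("case2", "receive order"), ("case2", "check stock"),
    ("case2", "reorder stock"), ("case2", "ship"),
]

# Group events by case, preserving order of appearance.
traces = {}
for case, activity in log:
    traces.setdefault(case, []).append(activity)

# Count how often activity a is directly followed by activity b.
directly_follows = Counter(
    (trace[i], trace[i + 1])
    for trace in traces.values()
    for i in range(len(trace) - 1)
)

for (a, b), n in directly_follows.most_common():
    print(f"{a} -> {b}: {n}")
```

Counts like these are the starting point of many process discovery algorithms, which turn the directly-follows relation into a process model.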
Fast data analytics for time series and other ordered data
The relational model is based on a single data type and a few operations: unordered tables, which can be selected, projected, joined, and aggregated. This simplicity is in fact unnecessary and needlessly limits expressive power, making it difficult to express queries on ordered data such as time series and other sequence data.
This talk presents a language for expressing ordered queries, along with optimization techniques and performance results. The talk goes on to present experiments comparing the system against other popular data analytics systems, including Sybase IQ, Python's popular Pandas library, and MonetDB, using a variety of benchmarks, including the ones those systems use themselves. On the same hardware, our system is faster.
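The talk's language is not reproduced here, but for a sense of what an ordered query looks like, here is a moving average over a time series expressed in Pandas, one of the baseline systems mentioned above; the data is made up for illustration.

```python
import pandas as pd

# An inherently ordered query: a 3-point moving average over a time series.
# In the unordered relational model this requires awkward self-joins or
# window extensions; in an order-aware system it is direct.
ts = pd.DataFrame(
    {"price": [10.0, 10.5, 10.2, 10.8, 11.1, 10.9]},
    index=pd.date_range("2016-01-01", periods=6, freq="D"),
)
ts["moving_avg"] = ts["price"].rolling(window=3).mean()
print(ts)
```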
Approximation and randomized algorithms
Numerous data management tasks are intractable: it is NP-hard, or worse, to compute their exact answer. This need not be the end of the story: in many cases, it is actually possible to obtain an
approximation of the answer, with certain guarantees. Deterministic
approximation algorithms provide a way to efficiently approximate, within
a certain factor, the answer to some intractable problems. Randomized
approximation algorithms provide a probabilistic approximation guarantee.
In this lecture, we will review some basic approximation algorithms, see
cases where hardness of approximation itself can be shown, and illustrate
how randomized approximation algorithms (from naive Monte Carlo sampling
to more elaborate polynomial-time approximation schemes) can be used in
data management applications.
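As a hint of what naive Monte Carlo sampling looks like in this setting, here is a minimal sketch (an illustrative assumption, not the lecture's material) that estimates the fraction of satisfying assignments of a small Boolean formula, a quantity that is #P-hard to compute exactly in general.

```python
import random

# Small made-up formula; its true satisfying fraction is 4/8 = 0.5.
def phi(x1, x2, x3):
    return (x1 or x2) and (not x2 or x3)

def monte_carlo_estimate(n_samples=100_000):
    # Sample uniform random assignments and count how many satisfy phi.
    hits = sum(
        phi(*(random.random() < 0.5 for _ in range(3)))
        for _ in range(n_samples)
    )
    return hits / n_samples

print(monte_carlo_estimate())  # converges to 0.5 as n_samples grows
```

By Chernoff-style bounds, the number of samples needed for a given additive error and confidence is independent of the formula's size, which is what makes such estimators attractive.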
Pierre Senellart
Pierre Senellart is a Professor in the DBWeb team at Télécom ParisTech,
and a Senior Research Fellow in the Department of Computer Science of the
National University of Singapore, within the IPAL laboratory. He is an
alumnus of the École normale supérieure and obtained his M.Sc. (2003) and
his Ph.D. (2007) in computer science from Université Paris-Sud, studying
under the supervision of Serge Abiteboul. He was awarded an Habilitation
à diriger les recherches in 2012 from Université Pierre et Marie Curie.
Pierre Senellart has published numerous articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has been a member of the program committees and participated in the organization of various international conferences and workshops (including PODS, WWW, VLDB, SIGMOD, ICDE). His research interests revolve around practical and theoretical aspects of Web data management, including Web crawling and archiving, Web information extraction, uncertainty management, Web mining, and querying under access limitations.
Nicolas Anciaux
Nicolas Anciaux is a researcher at INRIA, in the SMIS project, which focuses on secured and mobile information systems. He received his PhD in 2004 and his French Habilitation in 2014, both from the University of Versailles. Before joining Inria, Nicolas was a post-doc researcher at the University of Twente, where he studied database implementations of the "right-to-be-forgotten" in ambient intelligent environments. Since he joined INRIA in 2006, his research interests have been in the area of data management on specific hardware architectures, and more precisely on secure chips and embedded systems. He proposes architectures using secure hardware, along with data structures and algorithms to manage personal data with strong privacy guarantees. Nicolas has co-authored around 30 research articles. He is a co-designer of PlugDB (https://project.inria.fr/plugdb/), a secure and personal database device. Since 2015, Nicolas has co-led the research activities of the privacy cluster of the Digital Society Institute, which brings together economists, jurists, and computer scientists.
This tutorial will discuss existing cloud-based architectures for personal data management and will propose some alternatives, based on decentralization and secure devices.
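As a rough illustration of the decentralized idea (not the tutorial's actual architecture), the sketch below encrypts a personal record on the client side, so that a cloud store only ever sees ciphertext; it assumes the third-party Python cryptography package.

```python
from cryptography.fernet import Fernet

# The key stays on the user's secure device; only ciphertext reaches the cloud.
key = Fernet.generate_key()          # would live inside the secure hardware
cipher = Fernet(key)

record = b"name=Alice; diagnosis=..."
ciphertext = cipher.encrypt(record)  # what the cloud provider stores

# Only the device holding the key can decrypt.
assert cipher.decrypt(ciphertext) == record
```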
Parallel pattern mining
Pattern mining is a data mining task focused on extracting regularities from data. It is extremely computationally intensive, making it a good candidate for exploiting large parallel platforms. However, the computational structure of pattern mining algorithms is mostly irregular, so parallelizing them is non-trivial. We will present several successful approaches for parallelizing pattern mining algorithms so that they benefit from parallel platforms, whether multicore processors or distributed platforms. We will focus on flexible pattern mining algorithms that allow users to tailor the definition of patterns to their needs.
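As a minimal illustration (not one of the lecture's algorithms), the sketch below parallelizes the support-counting step of frequent itemset mining by partitioning the transaction database across worker processes and merging partial counts; the toy database and threshold are assumptions.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def count_pairs(transactions):
    # Count the support of every size-2 itemset in one database partition.
    c = Counter()
    for t in transactions:
        c.update(combinations(sorted(t), 2))
    return c

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
    chunks = [db[0::2], db[1::2]]          # partition the database
    with Pool(2) as pool:
        partials = pool.map(count_pairs, chunks)
    support = sum(partials, Counter())     # merge partial supports
    min_support = 2
    frequent = {p, n for p, n in support.items() if n >= min_support} if False else \
               {p: n for p, n in support.items() if n >= min_support}
    print(frequent)                        # e.g. {('a', 'c'): 3, ('b', 'c'): 3, ('a', 'b'): 2}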
Alexandre Termier
Alexandre Termier has been Professor of Computer Science at the University of Rennes 1 since 2014. Before that, he was at the University of Grenoble-Alpes (2007-2014). He is also the head of the LACODAM group at IRISA / INRIA.
His research focuses on pattern mining, especially on defining new, more useful, and/or more efficient pattern mining methods.
His work is validated through industrial collaborations in various domains: embedded systems, retail, and energy consumption.
Massive data analysis
The statistical analysis of massive data should make it possible to understand complex phenomena and to make justified decisions.
This tutorial aims to provide an overview of descriptive and decision-oriented modeling from data.
The main part will be devoted to the different families of modeling problems and to some commonly used methods.
We will then examine how these methods scale, in particular their execution on distributed platforms.
In the afternoon, a hands-on session will involve carrying out a simple analysis of a dataset using local installations of Apache Spark.
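As a preview of the hands-on session, here is a minimal sketch of a simple analysis on a local Apache Spark installation; the file name and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session using all available cores.
spark = SparkSession.builder.master("local[*]").appName("tp-bda").getOrCreate()

# Load a (hypothetical) CSV file and compute a simple grouped aggregate.
df = spark.read.csv("measurements.csv", header=True, inferSchema=True)
(df.groupBy("sensor")
   .agg(F.avg("value").alias("avg_value"), F.count("*").alias("n"))
   .orderBy(F.desc("n"))
   .show())

spark.stop()
```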
Michel Crucianu
Michel Crucianu has been a professor of computer science at the Conservatoire National des Arts et Métiers (Paris) since 2005. His research work notably concerns the mining of large multimedia databases. In this context, he is interested in semi-supervised and active learning, the construction of representations, and multidimensional and metric indexes.
Practical session on massive data analysis
- Simple execution mechanisms (MapReduce-style) on distributed architectures
- Spark
- Data analysis
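As a minimal illustration of a MapReduce-style execution mechanism expressed on Spark, here is the classic word count on RDDs; the input file name is a hypothetical placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())   # map: emit words
            .map(lambda word: (word, 1))          # map: key-value pairs
            .reduceByKey(lambda a, b: a + b))     # reduce: sum per key

for word, n in counts.take(10):
    print(word, n)
sc.stop()
```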
Luc Bouganim
Luc Bouganim is a Director of Research at Inria Saclay-Île-de-France and the vice-head of the SMIS (Secured and Mobile Information Systems) research team. He obtained his PhD (1996) and his Habilitation à Diriger des Recherches (2006), both from the University of Versailles. He worked as an assistant professor from 1997 to 2002, when he joined Inria. Since 2000, Luc has been strongly engaged in research on database management on chip and on the protection of data confidentiality using cryptographic techniques. More recently, he has focused on flash memory, and more precisely on its impact on DBMSs.
Luc has co-authored more than 90 conference and journal papers and an international patent, and he has received 5 international awards.
Nicolas Terpolilli
Passionate about Open Data for five years now, I joined OpenDataSoft as Chief Data Officer.
OpenDataSoft offers a tool that allows a growing number of clients to leverage their data simply and effectively. My role is to ensure that more and more data is available on the platform, that it circulates more and more easily, and, above all, that it is increasingly reused.
I am a graduate of the Ecole Centrale de Lille. After working on Open Data in Manchester, I moved back and forth between entrepreneurial and freelance work before joining OpenDataSoft in the summer of 2015.
Open Data initiatives and technological challenges
In a world where value creation is increasingly distributed and where intermediation is the main threat to most established organizations, Open Data stands out as a technically and economically effective approach.
The purpose of this course is therefore to put the data economy and its transformations since the advent of digital technology into a fairly general perspective; to present how Open Data fits into this dynamic; and then to explain, very concretely, OpenDataSoft's business model as a practical example. Finally, it will be an opportunity to discuss technical issues as well as the more legal questions of licensing.
Tristan Allard
Tristan Allard currently holds an assistant professor position (MCF) at the University of Rennes 1. He received his PhD from the University of Versailles, where he worked on the sanitization of personal data distributed over secure personal data servers. Before joining the University of Rennes 1, he was a postdoc at Inria Montpellier, where he focused on the privacy-preserving clustering of personal time series decentralized over personal devices. At Rennes 1, he continues to design privacy-preserving techniques for personal data management, in a variety of settings (crowdsourcing, cloud, peer-to-peer), by intertwining encryption and sanitization schemes.
Privacy-Preserving Data Publishing: Where Are We Now?
The massive personal datasets collected by today's companies and institutions are valuable resources, both for the entities that hold them and for society at large. Privacy-preserving data publishing aims at opening personal datasets to large-scale analysis without jeopardizing individuals' privacy. The problem is hard, ranging from the definition of an adequate privacy criterion to the design of efficient and useful privacy algorithms. Ten years after the publication of the two seminal l-Diversity and Differential Privacy works, this lecture is a guided tour of the main privacy-preserving data publishing models and algorithms. We will synthesize the partition-based and differential privacy families of models and algorithms, analyze their strengths and weaknesses, and try to extract strong tendencies from the past decade.
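As a small illustration of the differential privacy family (a sketch under illustrative assumptions, not the lecture's material), here is the Laplace mechanism applied to a counting query: noise calibrated to the query's sensitivity and the privacy budget epsilon.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    # Releasing a count with Laplace noise: adding or removing one record
    # changes a counting query by at most 1, hence sensitivity = 1.
    true_count = sum(1 for x in data if predicate(x))
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Illustrative dataset and privacy budget.
ages = [23, 35, 41, 29, 52, 67, 31]
print(laplace_count(ages, lambda a: a >= 40, epsilon=0.5))
```

A smaller epsilon gives stronger privacy but noisier answers; this accuracy/privacy trade-off is at the heart of the differential privacy literature.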
Christophe Pradal
Christophe Pradal is a researcher at CIRAD and at INRIA, in the VirtualPlants project in Montpellier. He is the project leader of OpenAlea, an international open-source scientific workflow system. Prior to that, he spent four years in industry at Dassault Systèmes (1998-2002), during which he designed topological and geometrical operators used in the automotive industry, aeronautical design, and shipbuilding. His research interests focus on scientific workflows, reproducibility, computational modeling of plant morphogenesis, and plant phenotyping.
Scientific Workflows
Analysing scientific data may involve very complex and interlinked steps in which several tools are combined. Scientific workflow systems have reached a level of maturity that makes them able to support the design and execution of such in-silico experiments. They provide a systematic way of describing scientific and data methods, and of executing complex analyses on a variety of distributed resources.
In this lecture, we will review the main features of scientific workflows (representation, composition, model of computation, execution, mapping), present different workflow systems, and illustrate how algebraic scientific workflows and provenance can enhance reproducibility in the analysis and simulation of complex systems in biology.
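As a toy illustration of these notions (not OpenAlea's actual model), the sketch below represents a workflow as a DAG of tasks, executes it in dependency order, and records simple provenance; all task names and data are assumptions.

```python
# Each task maps to (function, list of dependency task names).
workflow = {
    "load":      (lambda: [1.0, 2.0, 3.0], []),
    "normalize": (lambda xs: [x / max(xs) for x in xs], ["load"]),
    "stats":     (lambda xs: {"mean": sum(xs) / len(xs)}, ["normalize"]),
}

results, provenance = {}, []

def run(task):
    if task in results:               # memoize: each task runs once
        return results[task]
    func, deps = workflow[task]
    inputs = [run(d) for d in deps]   # run dependencies first
    results[task] = func(*inputs)
    provenance.append((task, deps))   # record what produced each result
    return results[task]

print(run("stats"))   # {'mean': 0.666...}
print(provenance)     # [('load', []), ('normalize', ['load']), ('stats', ['normalize'])]
```

The provenance trace is what lets a workflow system replay or audit an in-silico experiment, which is the basis of the reproducibility claims discussed in the lecture.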