The workshop BOSS'16 will be held in conjunction with the
42nd International Conference on
Very Large Data Bases
New Delhi, India - September 5 - 9, 2016
Following the great success of the first Workshop on Big Data Open Source Systems (BOSS'15), the second Workshop on Big Data Open Source Systems (BOSS'16) will again give a deep-dive introduction into several active, publicly available, open-source systems. Each system will be presented in a tutorial given by experts in that system, covering installation as well as non-trivial example usage. This year's workshop will also feature a special session on "Polyglot Data Management".
In contrast to the first edition of the workshop, the tutorials were selected through an open call and a public vote on the proposals, to ensure good participation in all tutorials.
The workshop follows a novel, massively parallel format. After a joint introduction (Introduction Slides), multiple parallel tutorial sessions are held.
Each tutorial is 1.5 hours in length and will be repeated so that participants can attend two tutorials.
In the tutorial sessions the following systems, which received the most votes, will be presented:
The traditional labels for databases have become obsolete, as new data management technologies emerge that combine traditional and new capabilities and are transforming the database world.
Many open research challenges arise in settings where many different data stores coexist, including ecosystems and architectures built on multiple data stores, such as the lambda architecture.
The new technologies have brought new data models and query languages better suited to certain kinds of problems, such as key-value stores, document stores, and graph databases. In this context,
developers must deal with this data diversity in order to extract insights and knowledge from the different data stores. Moreover,
guaranteeing consistency across data stores has become a major issue due to the polyglot persistence trend.
This session will present several systems in this evolving arena of data management technologies that combine multiple capabilities.
Polyglot Databases Session@BOSS-VLDB
9:00 - 9:15 | Introduction |
9:15 - 9:45 | Flash Session |
9:45 - 10:30 | Tutorial Session 1 (Part 1) |
10:30 - 11:00 | Break |
11:00 - 12:00 | Tutorial Session 1 (Part 2) |
12:00 - 12:30 | Polyglot Session (Part 1) |
12:30 - 14:00 | Lunch Break |
14:00 - 15:30 | Polyglot Session (Part 2) |
15:30 - 16:00 | Break |
16:00 - 17:30 | Tutorial Session 2 |
Real-time data stream processing has become a critical building block for many modern
businesses. In stream processing, the data to be processed is viewed as a continuously
produced series of events over time. This departs from the
well-studied batch processing paradigm, where data is viewed as static. In fact,
batch processing problems can be viewed as a subclass of stream processing ones,
in which streams are finite.
Building on this principle, Apache Flink has emerged as one of the most
successful open-source stream processors, with an active community of 190 contributors
from both industry and academia and several companies deploying it in production.
At its core, Flink has a stream processing engine that allows for robust, high-throughput,
low-latency processing of out-of-order streams, while providing exactly-once semantics
for stateful computations.
In this tutorial, attendees will have the opportunity to learn about the most important
notions introduced by the stream processing paradigm, get introduced to the architecture
and the main building blocks of Flink, and explore different features of Apache Flink
through representative applications.
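As a minimal sketch of the programming model the tutorial covers, here is the canonical windowed word count over a socket source, written against Flink's Java DataStream API; the host, port, and window size are illustrative choices and not material from the tutorial itself.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {
    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read a continuous stream of lines from a local socket (e.g. started with `nc -lk 9999`).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Split lines into (word, 1) pairs, group by word, and count per 5-second window.
        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)
            .timeWindow(Time.seconds(5))
            .sum(1);

        counts.print();
        env.execute("Socket Window WordCount");
    }
}
```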
Apache SystemML aims at declarative, scalable machine learning (ML). Data scientists express their custom analytics in a high-level language,
including linear algebra and statistical functions, which simplifies both the implementation of new ML algorithms and the customization of existing ones.
These algorithms can be run without modification on a single node (scale-up), or on a cluster using MapReduce or Spark (scale-out).
The SystemML language is expressive enough to cover a broad class of ML algorithms including descriptive statistics, classification, clustering,
regression, matrix factorization, dimension reduction, and deep learning for training and scoring. SystemML's cost-based compiler
then automatically generates hybrid runtime execution plans that are composed of single-node and distributed operations depending on data
characteristics such as data size/shape (e.g., tall and skinny or wide matrices), data sparsity (e.g., dense or sparse), and cluster characteristics
(e.g., cluster size, and memory configurations). The benefits of declarative ML include (1) increased productivity of data scientists due to a data-scientist-centric
algorithm specification, (2) algorithm reusability and simplified deployment for varying data characteristics and runtime environments, and (3) good runtime
performance due to automatic optimization of execution plans given data and cluster characteristics.
In this tutorial, we will discuss SystemML use cases, example ML algorithms,
and various APIs for different deployments. Alongside this user-level material, we will primarily focus on a deep dive into
SystemML's architecture, its optimizer (including rewrites and the cost model), and its internals, in order to help understand performance characteristics
and limitations, and to enable the database community to extend SystemML for further research.
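To give a flavor of declarative ML before the deep dive, below is a rough sketch of running a tiny DML script (least squares via the normal equations on random data) through SystemML's Spark-based MLContext API. The package and method names (org.apache.sysml.api.mlcontext.MLContext, ScriptFactory.dml, execute) reflect the API as we recall it from that era and should be treated as assumptions, since the API has changed across releases.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.sysml.api.mlcontext.MLContext;
import org.apache.sysml.api.mlcontext.Script;
import org.apache.sysml.api.mlcontext.ScriptFactory;

public class SystemMLSketch {
    public static void main(String[] args) {
        // Spark context that SystemML can use for distributed operations when beneficial.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SystemML sketch").setMaster("local[*]"));
        MLContext ml = new MLContext(sc);

        // A tiny DML script: generate random data and solve the normal equations.
        String dml =
              "X = rand(rows=1000, cols=10)\n"
            + "y = X %*% rand(rows=10, cols=1)\n"
            + "w = solve(t(X) %*% X, t(X) %*% y)\n"
            + "print(sum(w))\n";

        Script script = ScriptFactory.dml(dml);
        ml.execute(script);  // SystemML compiles a hybrid single-node/Spark plan automatically.

        sc.stop();
    }
}
```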
HopsFS is a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS' single-node in-memory metadata
service with a distributed metadata service built on a NewSQL database (MySQL Cluster). By removing the metadata bottleneck, HopsFS improves capacity and throughput
compared to HDFS. More interestingly, HDFS' metadata now resides in a distributed database that supports mechanisms for the efficient export of metadata to other systems.
Both HDFS and HopsFS, like other POSIX-compliant filesystems, lack support for efficient free-text search of the filesystem namespace. HopsFS stores
the metadata normalized in relational tables in MySQL Cluster and, as such, the database cannot efficiently handle queries such as "what are the largest files in the system" or "list
all files with the prefix 'hopsfs'". Moreover, HopsFS can extend metadata for inodes (files or directories) by adding new table(s) with a foreign key to the inode to ensure the
integrity of the extended metadata. We have used extended metadata to implement features such as erasure-coded replication. We solve the problem of free-text search for
both namespace metadata and extended metadata with an efficient, highly available, eventually consistent replication protocol from MySQL Cluster to Elasticsearch. Our system,
called ePipe, leverages the MySQL Cluster event API, enabling us to process batches of transactions, filtered by table name, from the single source of truth, MySQL Cluster.
These transactions contain mutations to metadata and extended metadata that will be indexed by Elasticsearch. ePipe maintains the consistency of the system by making sure that the events
coming from the database are processed in order. As ePipe only processes mutations, it has to read the actual data from the database, which it does using only efficient primary-key or
partition-pruned index scans. We also added a new RPC to HopsFS (not found in HDFS) to turn metadata on or off for inodes or inode subtrees. Our Hadoop platform has been presented
at Hadoop Strata (Startup Showcase in San Jose as well as London), and the Data Science Summit (both Israel and San Francisco).
The tutorial will guide users through the deployment and setup of Hops Hadoop on AWS, including Elasticsearch, using an automated installation platform based on Chef. After installation, users will be introduced to the Hops concepts of projects and datasets using our Hadoop-as-a-Service platform, www.hops.site. Users will create a project, import some public datasets, share datasets with one another, and learn how to request access to a dataset. They will use Elasticsearch free-text search to search the system. We will also show users how they can make their datasets globally visible and downloadable by any other registered Hops cluster using peer-to-peer technology.
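Since HopsFS is a distribution of HDFS, we assume here that standard Hadoop FileSystem client code works against a HopsFS namenode unchanged; the sketch below uses only the stock Hadoop client API, with a placeholder namenode address and paths.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HopsFsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the standard HDFS client at a HopsFS namenode (placeholder host/port).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hopsfs-namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Create a file and write a small payload.
        Path file = new Path("/Projects/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello from a plain HDFS client");
        }

        // List the parent directory; in HopsFS these metadata operations are served by MySQL Cluster.
        for (FileStatus status : fs.listStatus(file.getParent())) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```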
Modern web-scale companies have outgrown traditional analytical products due to
challenges in analyzing massive-scale, fast-moving datasets at real-time latencies.
As a result, the traditional integrated analytics database has been unbundled into its
component architectural pieces that span data ingestion, data processing and analytics
query serving. This unbundling has spawned dozens of open source and commercial products
that specialize in solving each individual phase (ingest, process, serve) of this data lifecycle.
This in turn has created new challenges in adopting and integrating each of these systems to create
a high-functioning real-time analytics stack. In this tutorial, we showcase the open-source LinkedIn
Big Data Analytics Stack composed of Kafka, Gobblin, Hadoop and Pinot. Kafka and Hadoop are
already well-understood parts of this puzzle with respect to streaming and batch data storage and compute.
However, ingesting and managing this data and serving it efficiently at scale for analytics use cases is less well covered;
the tutorial therefore focuses on data ingestion at scale with Gobblin and high-performance real-time distributed OLAP serving with Pinot.
Gobblin is an open source platform for large-scale data management that is under active development at LinkedIn. It is designed with data diversity
in mind and, as such, can move data between a large variety of data sources and sinks, including HDFS, Kafka, RDBMSs, REST sources, and FTP/SFTP servers. It abstracts away logic common to all data ingestion pipelines, such as data discovery, error handling, quality checking, and state management. Gobblin is the primary Hadoop data ingestion, replication, and data management tool at LinkedIn, processing well over 50 TB of data daily. Standardizing on Gobblin has enabled LinkedIn to add data sources quickly (data relevant for analysis is almost never known a priori), comply with regulatory requirements related to data handling, and provide data quality guarantees for data consumers; features that are absolutely essential for any data ecosystem.
Pinot is a real-time distributed OLAP data store, built at LinkedIn to deliver scalable real-time analytics with low latency. It implements
the Lambda architecture and can ingest and merge data from batch data sources (such as Hadoop and flat files) as well as streaming sources (such as Kafka). Pinot is optimized for analytical use cases on immutable append-only data and offers data freshness on the order of a few seconds. In production, Pinot scales horizontally to hundreds of tables, thousands of events per second, hundreds of machines and thousands of queries per second while responding to complex queries within tens to hundreds of milliseconds. At LinkedIn, Pinot powers more than 50 site facing and internal applications, including flagship applications like "Who Viewed My Profile" and "Company and Job Analytics".
In the tutorial we will set up a Kafka ingestion pipeline with Gobblin, consuming data from a local Kafka server, performing format
transformation, identifying bad records, performing a quality check on the ratio of bad records, and writing the data to time-partitioned files in HDFS. If time permits, we will show how the same Gobblin flow can also ingest data from a completely different source: MySQL. We will then load this data (both real-time and batch) into a local Pinot instance where the data can be queried interactively.
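For experimenting with such a pipeline locally, a small producer that pushes test events (including some deliberately malformed ones for the quality check) into the local Kafka server can be handy. The sketch below uses the standard Kafka Java producer API; the topic name, broker address, and record format are illustrative placeholders, not part of the tutorial material.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestEventProducer {
    public static void main(String[] args) {
        // Connect to the local Kafka broker (placeholder address).
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Emit a few JSON-ish test events, with every tenth record deliberately malformed
            // so that a downstream quality check has something to reject.
            for (int i = 0; i < 100; i++) {
                String value = (i % 10 == 9)
                        ? "not-a-valid-record"
                        : "{\"eventId\": " + i + ", \"ts\": " + System.currentTimeMillis() + "}";
                producer.send(new ProducerRecord<>("test_events", Integer.toString(i), value));
            }
        }
    }
}
```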
Attendees will learn how to:
Massive multi-dimensional arrays appear as spatio-temporal sensor, image, simulation-output, or statistics "datacubes" in a variety of domains in the Earth, Space, and Life sciences, but also in the social sciences and in business. Understandably, arrays as a datatype come with a distinct set of challenges in data modelling, analytics, optimization, and maintenance. Array Databases have set out to address these facets and provide modern, fast, and scalable services on such array data. rasdaman ("raster data manager") is the pioneer Array Database, bundling novel processing techniques in a user-friendly package with standards-compliant interfaces. As of today, rasdaman is a mature system in operational use on hundreds of terabytes of timeseries datacubes; installations can easily be mashed up securely, thereby enabling large-scale, location-transparent query processing in federations of any combination of typical desktop machines, cloud nodes, datacenter supercomputers, and even cubesats. Its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL, many of which are by now supported by both open-source and commercial systems. In the geo field, rasdaman is a reference implementation for the Open Geospatial Consortium (OGC) Big Data standard, WCS, now also under adoption by ISO.
In this tutorial we present the open-source rasdaman Array DBMS.
Participants will need to follow one of the options below:
Bored of constantly moving your app to the newest data processing platform to achieve high performance? Tired of dealing with a zoo of processing platforms to get the best performance for your analytic tasks? Then this tutorial is for you!
Indeed, we have witnessed a plethora of innovative data processing platforms emerge in the last few years. While this is generally great, leveraging these new technologies in practice bears quite some challenges. To name a few, developers must: (i) find, among the plethora of processing platforms, the best one for their applications; (ii) migrate their applications to newer and faster platforms every now and then; and (iii) orchestrate different platforms so that applications leverage their individual benefits.
We address these issues with RHEEM, a system that enables big data analytics over multiple data processing platforms in a seamless manner. It provides a three-layer data processing abstraction whereby applications can achieve both platform independence and interoperability across multiple platforms. With RHEEM, dRHEEMers (RHEEM developers) can focus on the logic of their applications. RHEEM, in turn, takes care of efficiently executing applications by choosing either a single or multiple processing platforms. To achieve this, it comes with a cross-platform optimizer that can make a single task run faster over multiple platforms than on any single platform.
"...You may say I'm a dRHEEMer. But I'm not the only one. I hope someday you'll join us..."
- John Lennon -
We plan a tutorial where participants are the main players and can interact with the speakers at any time. Tentatively, our tutorial will be as follows.
In the first part of the tutorial, we will introduce the main concepts and design rationale behind RHEEM. This will allow participants to get familiar with the RHEEM paradigm.
After introducing participants to RHEEM, we will guide them through the installation and setup of RHEEM so that they are ready to code their first application.
At this stage of the tutorial, we will show users how one can easily develop and run an application on top of RHEEM. We will recap the main concepts in a hands-on exercise. In particular, we will see how to abstract and implement an application, how to support new functionalities by adding RHEEM operators, how to achieve higher performance by adding execution operators, and how to make the optimizer aware of the added operators. A rough preview of this style of coding is sketched below.
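The sketch outlines a word-count plan in the spirit of the project's published Java word-count example. It is not tutorial material: class and method names (RheemContext, JavaPlanBuilder, the Java and Spark plugin registration, the Tuple2 accessors) are assumptions recalled from that example and may differ across RHEEM versions.

```java
import java.util.Arrays;
import java.util.Collection;

import org.qcri.rheem.api.JavaPlanBuilder;
import org.qcri.rheem.basic.data.Tuple2;
import org.qcri.rheem.core.api.Configuration;
import org.qcri.rheem.core.api.RheemContext;
import org.qcri.rheem.java.Java;
import org.qcri.rheem.spark.Spark;

public class RheemWordCountSketch {
    public static void main(String[] args) {
        // Register two execution platforms; RHEEM's cross-platform optimizer
        // decides which one (or which combination) actually runs the plan.
        RheemContext rheemContext = new RheemContext(new Configuration())
                .withPlugin(Java.basicPlugin())
                .withPlugin(Spark.basicPlugin());

        // Build a platform-agnostic plan out of RHEEM operators and collect the result.
        Collection<Tuple2<String, Integer>> wordCounts = new JavaPlanBuilder(rheemContext)
                .readTextFile("file:/tmp/input.txt").withName("Load file")
                .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .withName("Split words")
                .filter(token -> !token.isEmpty()).withName("Drop empty tokens")
                .map(word -> new Tuple2<>(word, 1)).withName("Attach counter")
                .reduceByKey(Tuple2::getField0,
                        (a, b) -> new Tuple2<>(a.getField0(), a.getField1() + b.getField1()))
                .withName("Sum counters")
                .collect();

        wordCounts.forEach(System.out::println);
    }
}
```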
We will conclude the tutorial by summarizing all the main concepts using ML4all, our machine learning system built on top of RHEEM and used by a large airline company from the Middle East. We will also leverage this application to walk participants through advanced features of RHEEM.
Workshop Chair:
Polyglot Chair:
Advisory Committee: