About BOSS

The workshop BOSS'17 will be held in conjunction with the 43rd International Conference on Very Large Data Bases (VLDB 2017), Munich, Germany, 28 August - 1 September 2017.

Workshop Date
  September 1st, 2017

The Third Workshop on Big Data Open Source Systems (BOSS'17) at this year's VLDB will provide deep-dive introductions to several active, publicly available, open-source systems. This year, there are tutorials on Apache AsterixDB, Apache Flink, Apache Impala, Apache Spark, and TensorFlow. The systems will be presented in hands-on tutorials by experts and developers. The tutorials will cover installation, loading data, running specific workloads, and non-trivial example usages, and participants will get hands-on experience performing all of these operations themselves. Participants are required to bring their own laptops to the tutorials.

There has been an open call for tutorials and a public vote on the proposed tutorials.

Program

[Figure: Tutorial flow diagram of the BOSS'17 program]
The workshop follows a bulk synchronous parallel format. After the opening, three parallel tutorial sessions are held. Each tutorial is two hours long, and most will be repeated in the afternoon so that participants can attend two of the parallel tutorials. There is also a plenary tutorial on TensorFlow that all participants can attend.

For details of the systems, see the descriptions below.


Program Outline:

Time            Session                          Room(s)
08:30 - 09:00   Workshop Opening                 1601
09:00 - 10:00   Parallel Tutorials I (Part I)    1601: Apache Spark, 2601: Apache Flink, 2605: Apache Impala
10:00 - 10:30   Coffee Break
10:30 - 11:30   Parallel Tutorials I (Part II)   1601: Apache Spark, 2601: Apache Flink, 2605: Apache Impala
11:30 - 12:00   Plenary Tutorial (Part I)        1601: TensorFlow
12:00 - 13:30   Lunch Break
13:30 - 15:00   Plenary Tutorial (Part II)       1601: TensorFlow
15:00 - 15:30   Coffee Break
15:30 - 17:30   Parallel Tutorials II            1601: Apache Flink, 2601: Apache Spark, 2605: Apache AsterixDB

Details of the Presented Systems

Apache AsterixDB

In this tutorial, we will present Apache AsterixDB and show how it can be used to store, index, query, analyze, and visualize large amounts of real-time social media data. First, we will set up an AsterixDB cluster and use its feed feature to ingest tweets. Using the resulting data, we will issue various kinds of queries against the cluster using the system's SQL++ query language. We will also show how to integrate external machine learning models and NLP techniques into the system as user-defined functions in order to add data annotations such as results from sentiment analysis. Finally, we will introduce the attendees to an open-source AsterixDB-based middleware platform called Cloudberry that supports interactive analytics and visualization over social media data.
A brief description of Apache AsterixDB:
Apache AsterixDB is a Big Data Management System (BDMS) with rich features that set it apart from other Big Data platforms such as Big Data Analytics Systems and NoSQL stores. Its features make it well-suited to applications including Web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a declarative query language that supports a wide range of queries; a scalable runtime system; partitioned, LSM-based data storage and indexing including B+ tree, R tree, and text indexes; support for external as well as native data; a rich set of built-in types, including spatial, temporal, and textual types; support for fuzzy, spatial, and temporal queries; a built-in notion of data feeds for ingestion of data; and, last but not least, transaction support akin to that of a NoSQL store. Co-developed by contributors at UC Irvine, UC Riverside, Couchbase, and other organizations, AsterixDB recently graduated from the incubator of the Apache Software Foundation (ASF) and it is now a full top-level project. Apache AsterixDB has also completely revamped its user-facing query support to feature SQL++, an adaptation of SQL aimed at semi-structured (schema-optional and/or nested) data, and its data ingestion and user-defined function support have recently been improved (made more extensible and hardened) as well.
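
As a taste of what interacting with the system looks like, here is a minimal sketch (not part of the official tutorial material) that sends a SQL++ query to a locally running AsterixDB instance through its HTTP query service from Python. The dataverse and dataset names (TwitterDataverse, Tweets), the field names, and the default query-service port 19002 are assumptions for illustration only.

    # Minimal sketch (illustration only): send a SQL++ query to AsterixDB's
    # HTTP query service. A local instance on the default query port (19002)
    # and a hypothetical dataverse/dataset TwitterDataverse/Tweets are assumed.
    import json
    import requests

    ASTERIXDB_QUERY_URL = "http://localhost:19002/query/service"

    sqlpp = """
        USE TwitterDataverse;
        SELECT t.user.screen_name AS user_name, COUNT(*) AS num_tweets
        FROM Tweets t
        WHERE t.lang = "en"
        GROUP BY t.user.screen_name
        ORDER BY num_tweets DESC
        LIMIT 10;
    """

    # The query service accepts the statement as a form field and returns JSON.
    response = requests.post(ASTERIXDB_QUERY_URL, data={"statement": sqlpp})
    response.raise_for_status()
    for row in response.json()["results"]:
        print(json.dumps(row))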

Tutorial Outline
The following topics are covered:
  • Setting up an AsterixDB cluster;
  • Defining, adding, and querying data sets using SQL++;
  • Using data feeds to ingest real-time social media data (e.g., from Twitter);
  • Advanced querying: spatial, temporal, full-text, and similarity queries using SQL++;
  • Using external libraries to do social media data analysis (e.g., open source NLP libraries);
  • Using a middleware software tool called Cloudberry to show interactive analytics and visualization on social media data (tweets).

Technologies Used

  • The Java runtime environment (JRE) is required for setting up an AsterixDB cluster.
  • A Web browser is required for querying an AsterixDB cluster.
  • An external library (an open-source NLP library package) will also be used in the tutorial.

The Tutorial will be available here:

AsterixDB Tutorial

Presenters:
  • Taewoo Kim (UC Irvine)
  • Steven Jacobs (UC Riverside)
  • Michael Carey (UC Irvine)
  • Chen Li (UC Irvine)
  • Till Westmann (Couchbase)
  • Ian Maxon (UC Irvine)
  • Vassilis Tsotras (UC Riverside)


Apache Flink

Flink is a popular, top-level Apache project with more than 300 contributors that offers a full software stack for programming, compiling, and deploying reliable, continuous, distributed data processing pipelines. A major distinctive characteristic of Flink compared to other solutions is its capability to declare persistent application state within continuous user-defined transformations using managed data collections. Flink couples state management tightly and efficiently with computation, using a lightweight mechanism to acquire consistent snapshots of any managed state, and it can orchestrate failure recovery and scale-out/in by using these snapshots as consistent points of reconfiguration. Snapshots are globally coordinated yet locally executed, support incremental state changes, and are backed externally by durable, distributed file systems. Furthermore, programmers can make use of snapshots for application versioning, debugging, and other fundamental operational needs. State in Flink can encapsulate any stream summary, from simple rolling aggregates to complex windows, and it additionally allows in-place reads by external user queries. Several powerful Flink libraries, such as CEP (complex event processing) and SQL, exploit Flink's managed state in order to lift all these unique capabilities to newly formed domains of computation in stream processing today.

Tutorial Outline

In this tutorial, we present the open-source system Apache Flink.

The following topics are covered:
  • Introduction: In this part, we will guide you through Flink's architecture with a focus on its state management components. In more detail, we will talk about how state is partitioned, acquired, and stored in files, from its conceptual model down to its physical persistence and maintenance. We will also discuss the mechanisms and benefits of incremental snapshots, which reduce persistence overhead even further.
  • Data modeling and concepts: What kinds of data are supported? Can they mix with the standard relational model?
  • Demo: A quick and intuitive demo that showcases the latency benefits of taking incremental snapshots asynchronously with locally managed, embedded database backends.
  • Hands-on tutorial: Different usages of elastic managed state in different domains of computation, such as complex event processing and standing SQL queries on dynamically partitioned tables.

Apache Flink Tutorial

Technologies Used

We will use the latest Flink release (JDK7 or higher required) for the hands-on tutorials.

Presenters:
  • Paris Carbone (KTH Stockholm)
  • Gyula Fora (King)
  • Stefan Richter (dataArtisans)



Apache Impala

Apache Impala (incubating) is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides the low latency and high concurrency for BI/analytic, read-mostly queries on Hadoop that batch frameworks such as Hive or Spark do not deliver. Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g., Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or from reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times, and it uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g., AWS S3), Apache Kudu, and HBase. In this tutorial, we present Impala's architecture in detail and discuss its integration with different storage engines and the cloud. In the hands-on tutorial, we demonstrate how to launch an Impala cluster and run SQL queries that combine data from multiple storage engines.
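
As a quick, illustrative preview of how queries are submitted in the hands-on part (a sketch under assumed names, not the actual tutorial material), the snippet below connects to an Impala daemon from Python using the open-source impyla client and runs an analytic query; the host name, the table web_logs, and its columns are hypothetical.

    # Illustrative sketch: run a SQL query against Impala from Python using impyla.
    # Assumes an Impala daemon reachable on the default HiveServer2-compatible
    # port (21050) and a hypothetical table web_logs already loaded into HDFS.
    from impala.dbapi import connect

    conn = connect(host="impala-daemon.example.com", port=21050)
    cursor = conn.cursor()

    # A typical read-mostly analytic query; partition pruning and code
    # generation happen inside Impala, transparently to the client.
    cursor.execute("""
        SELECT url, COUNT(*) AS hits
        FROM web_logs
        WHERE year = 2017 AND month = 8
        GROUP BY url
        ORDER BY hits DESC
        LIMIT 10
    """)

    for url, hits in cursor.fetchall():
        print(url, hits)

    cursor.close()
    conn.close()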

Tutorial Outline
The following topics are covered:
  • Launch a Hadoop cluster that runs Impala, HDFS, Hive and Kudu.
  • Load data into HDFS and run SQL queries using Impala.
  • Analyze the performance of Impala queries using the query plans and runtime profiles.
  • Ingest data into Kudu using Impala and run SQL queries that access both Kudu and HDFS data.
  • Run CRUD (create, read, update, delete) SQL statements against Kudu tables from Impala.

Technologies Used

Apache Impala (Incubating), Apache Kudu, Hadoop (HDFS)

Presenters:
  • Dimitris Tsirogiannis (Cloudera Inc.)



Apache Spark

Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open-source projects for big data analytics. Over 1,000 volunteers have contributed code to the project, it is supported by virtually every commercial vendor, and many universities now offer courses on Spark.
Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structured with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems, such as whole-stage code generation. In addition, the new Structured Streaming engine allows developers to perform computations in real time using the same APIs they would use for batch computation.
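
To make the point about the structured APIs concrete, here is a small PySpark sketch (illustrative only; the input file tweets.json and its fields are assumptions) that expresses the same aggregation once through the DataFrame API and once through SQL, both of which are planned by the Catalyst optimizer.

    # Sketch of the Spark 2.x structured APIs from Python (PySpark).
    # The JSON input file and its fields (user, lang) are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("boss17-spark-demo").getOrCreate()

    # DataFrames carry a schema, so Catalyst can optimize the whole query plan.
    tweets = spark.read.json("tweets.json")

    top_users = (tweets
                 .filter(tweets.lang == "en")
                 .groupBy("user")
                 .count()
                 .orderBy("count", ascending=False)
                 .limit(10))
    top_users.show()

    # The same computation expressed declaratively through the SQL interface.
    tweets.createOrReplaceTempView("tweets")
    spark.sql("""
        SELECT user, COUNT(*) AS cnt
        FROM tweets
        WHERE lang = 'en'
        GROUP BY user
        ORDER BY cnt DESC
        LIMIT 10
    """).show()

    spark.stop()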

Tutorial Outline

This tutorial covers the core APIs for using Spark 2.2, including DataFrames, Datasets, and SQL. We talk about how the system has evolved to meet the demands of an ever-growing user base, including details on the Catalyst query optimizer, whole-stage code generation, and incremental query planning for streaming computation. The tutorial will be accompanied by hands-on exercises.
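
For the streaming side, the following sketch (based on the standard socket word-count example; host and port are placeholders) shows how a Structured Streaming query reuses the familiar DataFrame operations and is planned incrementally by the engine.

    # Sketch of Structured Streaming: the same DataFrame operations, applied
    # to a stream. Uses the built-in socket source purely for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("boss17-streaming-demo").getOrCreate()

    # An unbounded DataFrame of text lines arriving on a socket.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # The aggregation is written exactly as it would be on a static DataFrame;
    # Spark plans it incrementally behind the scenes.
    counts = lines.groupBy("value").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()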
Apache Spark Tutorial

Presenters:
  • Michael Armbrust (Databricks)
  • Reynold Xin (Databricks)



TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Tutorial Outline

In the first part of the tutorial, I will introduce the basic concepts of TensorFlow. What are TensorFlow data flow graphs and why do we need them? I will show how you can build your own computation graph and implement a linear classifier with low-level TensorFlow primitives. I will also motivate higher-level TensorFlow abstractions that will increase your productivity, especially for machine learning.
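
For orientation before the session, here is a minimal sketch, assuming the TensorFlow 1.x graph API and synthetic data in place of real digit images, of a linear classifier built from low-level primitives.

    # Sketch of a linear classifier with low-level TensorFlow 1.x graph primitives.
    # Synthetic random data stands in for MNIST; shapes mirror 28x28 digit images.
    import numpy as np
    import tensorflow as tf

    # Placeholders are fed at session run time; the graph itself is symbolic.
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.int64, [None])

    # Model parameters: one weight matrix and one bias vector.
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(x, W) + b

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), y), tf.float32))

    # Synthetic stand-in data (the tutorial uses real handwritten digits).
    features = np.random.rand(1000, 784).astype(np.float32)
    labels = np.random.randint(0, 10, size=1000)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):
            sess.run(train_op, feed_dict={x: features, y: labels})
        print("train accuracy:",
              sess.run(accuracy, feed_dict={x: features, y: labels}))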
In the second part, there will be a hands-on tutorial in which you will implement your own linear classifier, as well as a neural network classifier, for recognizing handwritten digits using the higher-level TensorFlow Estimator API.

The goals of this tutorial are that you will:
  • Understand the basic concepts of TensorFlow computation graphs, such as tensors, operations, and sessions.
  • Be able to judge whether TensorFlow is the right tool for your problem.
  • Gain an overview of higher-level TensorFlow APIs that can increase your productivity.
  • Get hands-on experience in deep learning by implementing a neural network classifier using the TensorFlow Estimator API (see the sketch below).
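
For the hands-on part, the higher-level Estimator API condenses the same workflow considerably; the following sketch (again with synthetic stand-in data, and assuming TensorFlow 1.3 or later) trains a linear classifier with tf.estimator.

    # Sketch of the same task with the higher-level tf.estimator API.
    # Synthetic data stands in for the handwritten-digit dataset used in the tutorial.
    import numpy as np
    import tensorflow as tf

    features = np.random.rand(1000, 784).astype(np.float32)
    labels = np.random.randint(0, 10, size=1000)

    feature_columns = [tf.feature_column.numeric_column("pixels", shape=[784])]
    classifier = tf.estimator.LinearClassifier(feature_columns=feature_columns,
                                               n_classes=10)

    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"pixels": features}, y=labels,
        batch_size=100, num_epochs=None, shuffle=True)

    # The Estimator owns the graph, session, and checkpointing behind the scenes.
    classifier.train(input_fn=train_input_fn, steps=500)

    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"pixels": features}, y=labels, num_epochs=1, shuffle=False)
    print(classifier.evaluate(input_fn=eval_input_fn))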

Since the time for the hands-on part is limited, and so that you get the most out of the session, we ask you to install TensorFlow on your laptop in advance by following the instructions in this link:
How-to Install TensorFlow using Docker on your laptop

Presenters:
  • Stephan Wolf (Google Research Europe)



Workshop Organization

Workshop Chairs:

  • Tyson Condie, UCLA, tcondie@cs.ucla.edu
  • Tilmann Rabl, TU Berlin, rabl@tu-berlin.de

Advisory Committee:

  • Michael Carey, UC Irvine
  • Volker Markl, TU Berlin



Call for tutorials
  • Important Date: Open until May 15.

    In order to propose a tutorial, please email
    • a short abstract with a brief description of the system,
    • an outline of the planned tutorial,
    • the technology used for the hands-on tutorial,
    • a list of presenters involved,
    • and a link to the website of your system

    to boss@dima.tu-berlin.de
  • Public vote: will be open for two weeks until June 18.

  • Accepted presenters: will be notified by June 30.
Selection Process for Tutorials
After a sanity check on the submissions to make sure that they are complete and relevant (i.e., the systems are open source, publicly available, and related to Big Data), the workshop will use an open voting process similar to that of the Hadoop Summit. This ensures that the tutorials that are most interesting to a broad audience are chosen and that they will be well attended.

Previous Editions

BOSS'15 in conjunction with VLDB 2015 on September 4, 2015.

BOSS'16 in conjunction with VLDB 2016 on September 9, 2016.