Boss Workshop 2017

Details of the Presented Systems

Apache AsterixDB

In this tutorial, we will present Apache AsterixDB and show how it can be used to to store, index, query, analyze, and visualize large amounts of real-time social media data. First, we will set up an AsterixDB cluster and use its feed feature to ingest tweets. Using the resulting data, we will issue various kinds of queries against the cluster using the system's SQL++ query language. We will also show how to integrate external machine learning models and NLP techniques into the system as user-defined functions in order to add data annotations such as results from sentiment analysis. Finally, we will introduce the attendees to an open source AsterixDB-based middleware platform called Cloudberry that supports interactive analytics and visualization over social media data.
A brief description of the system Apache AsterixDB
Apache AsterixDB is a Big Data Management System (BDMS) with rich features that set it apart from other Big Data platforms such as Big Data Analytics Systems and NoSQL stores. Its features make it well-suited to applications including Web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a declarative query language that supports a wide range of queries; a scalable runtime system; partitioned, LSM-based data storage and indexing including B+ tree, R tree, and text indexes; support for external as well as native data; a rich set of built-in types, including spatial, temporal, and textual types; support for fuzzy, spatial, and temporal queries; a built-in notion of data feeds for ingestion of data; and, last but not least, transaction support akin to that of a NoSQL store. Co-developed by contributors at UC Irvine, UC Riverside, Couchbase, and other organizations, AsterixDB recently graduated from the incubator of the Apache Software Foundation (ASF) and it is now a full top-level project. Apache AsterixDB has also completely revamped its user-facing query support to feature SQL++, an adaptation of SQL aimed at semi-structured (schema-optional and/or nested) data, and its data ingestion and user-defined function support have recently been improved (made more extensible and hardened) as well.

Tutorial Outline

The following topics are covered:

Setting up an AsterixDB cluster;
Defining, adding, and querying data sets using SQL++;
Using data feeds to ingest real-time social media data (e.g., from Twitter);
Advanced querying: spatial, temporal, full-text, and similarity queries using SQL++;
Using external libraries to do social media data analysis (e.g., open source NLP libraries);
Using a middleware software tool called Cloudberry to show interactive analytics and visualization on social media data (tweets).

The technology used for the hands on the tutorial

The Java runtime environment (JRE) is required for setting up an AsterixDB cluster.
A Web browser is required for querying an AsterixDB cluster.
An external library (an open source NLP library package) will also be utilized in the proposed demo

The Tutorial will be available here:

AsterixDB Tutorial

Presenters:

Taewoo Kim (UC Irvine)
Steven Jacobs (UC Riverside)
Michael Carey (UC Irvine)
Chen Li (UC Irvine)
Till Westmann (Couchbase)
Ian Maxon (UC Irvine)
Vassilis Tsotras (UC Riverside)

Resources:

Apache AsterixDB
Cloudberry

Apache Flink

Flink is a popular, top-level Apache project with more than 300 contributors which offers a full software stack for programming, compiling and deploying reliably continuous distributed data processing pipelines. A major distinctive characteristic of Flink compared to other solutions is its capability to declare persistent application state within continuous user-defined transformations using managed data collections. Flink couples state management with computation tightly and efficiently using a lightweight mechanism to acquire consistent snapshots for any managed state and can orchestrate failure recovery and scale-out/in by using them as consistent points of reconfiguration. Snapshots are globally coordinated, yet, locally executed, also supporting incremental state changes, and backed by durable, distributed file systems externally. Furthermore, programmers can make use of snapshots for application versioning, debugging and other fundamental operational needs. State in Flink can encapsulate any stream summary from simple rolling aggregates to complex windows and it further allows in-place reads from external user queries. Several powerful Flink libraries such as CEP (Complex Event Processing) and SQL exploit Flink's managed state in order to lift all these unique capabilities to newly formed domains of computation in stream processing today.

Tutorial Outline

In this tutorial we present the open-source system Apache Flink

The following topics are covered:

Introduction: In this part we will guide you through Flink's architecture with a focus on its state management components. In more detail, we are going to talk about how state is partitioned, acquired and stored into files, starting from its conceptual model up to the physical persistence and maintenance. We will further discuss the mechanisms and benefits of incremental snapshots which reduce persistence overhead even further.
Data modelling and concepts: What kind of data is supported? Can it mix with the standard relational model?
We will present a quick and intuitive demo that showcases latency benefits using locally managed embedded database backends asynchronously for incremental snapshotting.
There will also be a hands-on tutorial demonstrating different usages of elastic managed state in different domains of computation such as complex event processing and standing SQL queries on dynamically partitioned tables.

Apache Flink Tutorial

Technologies Used

We will use the latest Flink release (JDK7 or higher required) for the hands-on tutorials.

Presenters:

Paris Carbone (KTH Stockholm)
Gyula Fora (King)
Stefan Richter (dataArtisans)

Resources:

Apache Flink

Apache Impala

Apache Impala (incubating) is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this tutorial, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud. In the hands-on tutorial, we demonstrate how to launch an Impala cluster and run SQL queries that combine data from multiple storage engines.

Tutorial Outline

The following topics are covered:

Launch a Hadoop cluster that runs Impala, HDFS, Hive and Kudu.
Load data into HDFS and run SQL queries using Impala.
Analyze the performance of Impala queries using the query plans and runtime profiles.
Ingest data into Kudu using Impala and run SQL queries that access both Kudu and HDFS data.
Run CRUD (CREATE, UPDATE, DELETE) SQL statements against Kudu tables from Impala.

Technologies Used

Apache Impala (Incubating), Apache Kudu, Hadoop (HDFS)

Prerequisites

Install Oracle VirtualBox https://www.virtualbox.org
Install XQuartz (Mac) https://www.xquartz.org
Make sure VBoxManage is in your $PATH
Install the Impala/Kudu quickstart VM
curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash

Presenters:

Dimitris Tsirogiannis (Cloudera Inc.)

Resources:

Apache Impala
Apache Kudu
Apache hadoop

Apache Spark

Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark.
Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. In addition, the new Structured Streaming engine allows developers perform computations in real time using the same APIs as they would use for batch computation.

Tutorial Outline

This tutorial covers the core APIs for using Spark 2.2, including DataFrames, Datasets, SQL. We talk about how the system has evolved to meet the demands of an every growing user base, including details on the: the Catalyst query optimizer, whole-stage code-generation as well as incremental query planning for streaming computation. The tutorial will be accompanied by hands on exercises.
Apache Spark Tutorial

Presenter:

Michael Armbrust (Databricks)
Reynold Xin (Databricks)

Resources:

Apache Spark

TensorFlow

TensorFlow^TM is an open source software library for numerical computation using data flow graphs. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Tutorial Outline

In the first part of the tutorial, I will introduce the basic concepts of TensorFlow. What are TensorFlow data flow graphs and why do we need them? I will show how you can build your own computation graph and implement a linear classifier with low-level TensorFlow primitives. I will also motivate higher-level TensorFlow abstractions, that will increase your productivity, especially for Machine Learning.
In the second part, there will be a hands-on tutorial, in which you will be able to implement your own linear, as well as neural network classifier for recognizing handwritten digits using the higher-level TensorFlow Estimator API.

The goal of this tutorial is that you will:

Understand the basic concepts of TensorFlow computation graphs, such as tensors, operations, sessions.
Be able to judge if TensorFlow is the right tool for your problem.
Attain an overview of higher-level TensorFlow APIs that can increase your productivity.
Gain hands-on experience in deep learning by implementing a neural network classifier using TensorFlow estimator API.

Since the time for the hands-on part is limited and to get the most out of the session, we ask you to install TensorFlow on your local computer by following the instruction in the link:
How-to Install TensorFlow using Docker on your laptop

Presenters:

Stephan Wolf (Google Research Europe)

Resources:

TensorFlow

Time		Room
08:30 - 09:00	Workshop Opening	1601
09:00 - 10:00	Parallel Tutorials I (Part I)	1601	2601	2605
09:00 - 10:00	Parallel Tutorials I (Part I)	Apache Spark	Apache Flink	Apache Impala
10:00 - 10:30	Coffee Break
10:30 - 11:30	Parallel Tutorials I (Part II)	1601	2601	2605
10:30 - 11:30	Parallel Tutorials I (Part II)	Apache Spark	Apache Flink	Apache Impala
11:30 - 12:00	Plenary Tutorial (Part I)	1601
11:30 - 12:00	TensorFlow
12:00 - 13:30	Lunch Break
13:30 - 15:00	Plenary Tutorial (Part II)	1601
13:30 - 15:00	Plenary Tutorial (Part II)	TensorFlow
15:00 - 15:30	Coffee Break
15:30 - 17:30	Parallel Tutorials II	1601	2601	2605
15:30 - 17:30	Parallel Tutorials II	Apache Flink	Apache Spark	Apache AsterixDB

About BOSS

Workshop Date

Program

Program Outline:

Details of the Presented Systems

Apache AsterixDB

Tutorial Outline

The following topics are covered:

The technology used for the hands on the tutorial

Presenters:

Resources:

Apache Flink

Tutorial Outline

The following topics are covered:

Technologies Used

Presenters:

Resources:

Apache Impala

Tutorial Outline

The following topics are covered:

Technologies Used

Prerequisites

Presenters:

Resources:

Apache Spark

Tutorial Outline

Presenter:

Resources:

TensorFlow

Tutorial Outline

The goal of this tutorial is that you will:

Presenters:

Resources:

Workshop Organization

Call for tutorials

Selection Process for Tutorials

Previous Editions