The workshop BOSS'17 will be held in conjunction with the
43rd International Conference on
Very Large Data Bases
Munich, Germany • 28 August - 1 September 2017
The Third Workshop on Big Data Open Source Systems (BOSS'17) at this year's VLDB will give a deep-dive introduction into several active, publicly available, open-source systems. This year, there are tutorials on Apache AsterixDB, Apache Flink, Apache Impala, Apache Spark, and TensorFlow. The systems will be presented in hands-on tutorials by experts and developers. The tutorials will give details on installation, loading data, running specific workloads, and non-trivial example usages. To this end, participants will get hands-on experience by performing all the operations themselves. Participants are required to bring their own laptops to the tutorials.
There has been an open call for tutorials and a public vote on the proposed tutorials.
The workshop follows a bulk synchronous parallel format. After the opening, three parallel tutorial sessions are held. Each tutorial is 2 hours long, and most will be repeated in the afternoon so that participants can attend two of the parallel tutorials. There is a plenary tutorial on TensorFlow that all participants can attend.
|08:30 - 09:00||Workshop Opening||Lecture Hall|
|09:00 - 10:00||Parallel Tutorials I (Part I)||All Rooms|
|10:00 - 10:30||Coffee Break|
|10:30 - 11:30||Parallel Tutorials I (Part II)||All Rooms|
|11:30 - 12:00||Plenary Tutorial (Part I)||Lecture Hall|
|12:00 - 13:30||Lunch Break|
|13:30 - 15:00||Plenary Tutorial (Part II)||Lecture Hall|
|15:00 - 15:30||Coffee Break|
|15:30 - 17:30||Parallel Tutorials II||All Rooms|
In this tutorial, we will present Apache AsterixDB and show how it can be used to store,
index, query, analyze, and visualize large amounts of real-time social media data. First, we will
set up an AsterixDB cluster and use its feed feature to ingest tweets. Using the resulting data,
we will issue various kinds of queries against the cluster using the system's SQL++ query
language. We will also show how to integrate external machine learning models and NLP
techniques into the system as user-defined functions in order to add data annotations such as
results from sentiment analysis. Finally, we will introduce the attendees to an open source
AsterixDB-based middleware platform called Cloudberry that supports interactive analytics and
visualization over social media data.
A brief description of the system Apache AsterixDB
Apache AsterixDB is a Big Data Management System (BDMS) with rich features that set it apart from other Big Data platforms such as Big Data Analytics Systems and NoSQL stores. Its features make it well-suited to applications including Web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a declarative query language that supports a wide range of queries; a scalable runtime system; partitioned, LSM-based data storage and indexing including B+ tree, R tree, and text indexes; support for external as well as native data; a rich set of built-in types, including spatial, temporal, and textual types; support for fuzzy, spatial, and temporal queries; a built-in notion of data feeds for ingestion of data; and, last but not least, transaction support akin to that of a NoSQL store. Co-developed by contributors at UC Irvine, UC Riverside, Couchbase, and other organizations, AsterixDB recently graduated from the incubator of the Apache Software Foundation (ASF) and it is now a full top-level project. Apache AsterixDB has also completely revamped its user-facing query support to feature SQL++, an adaptation of SQL aimed at semi-structured (schema-optional and/or nested) data, and its data ingestion and user-defined function support have recently been improved (made more extensible and hardened) as well.
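To make the idea of LSM-based storage mentioned above more concrete, here is a minimal sketch in plain Python. It is not AsterixDB code and does not reflect its actual implementation; the class `TinyLSM`, the `memtable_limit` parameter, and the in-memory "runs" are hypothetical names chosen for illustration. The sketch only shows the general pattern: writes go to an in-memory table, full tables are flushed as immutable sorted runs, and reads check the newest data first.

```python
import bisect


class TinyLSM:
    """Toy LSM index: an in-memory table plus immutable sorted runs.

    Illustrative only; names and thresholds are assumptions, not AsterixDB APIs.
    """

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # mutable, absorbs all writes
        self.runs = []                        # sorted (key, value) lists, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Turn the in-memory table into an immutable sorted run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


lsm = TinyLSM(memtable_limit=2)
lsm.put("a", 1)
lsm.put("b", 2)  # reaches the limit, so the memtable is flushed to a run
lsm.put("a", 3)  # newer value for "a" lives in the memtable
print(lsm.get("a"), lsm.get("b"), lsm.get("z"))
```

A real system would also merge runs in the background and persist them to disk; both are omitted here to keep the sketch short.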
Flink is a popular, top-level Apache project with more than 300 contributors, which offers a full software stack for programming, compiling, and reliably deploying continuous distributed data processing pipelines. A major distinctive characteristic of Flink compared to other solutions is its capability to declare persistent application state within continuous user-defined transformations using managed data collections. Flink couples state management with computation tightly and efficiently, using a lightweight mechanism to acquire consistent snapshots for any managed state, and can orchestrate failure recovery and scale-out/in by using them as consistent points of reconfiguration. Snapshots are globally coordinated yet locally executed, support incremental state changes, and are backed externally by durable, distributed file systems. Furthermore, programmers can make use of snapshots for application versioning, debugging, and other fundamental operational needs. State in Flink can encapsulate any stream summary, from simple rolling aggregates to complex windows, and it further allows in-place reads from external user queries. Several powerful Flink libraries such as CEP (Complex Event Processing) and SQL exploit Flink's managed state in order to lift all these unique capabilities to newly formed domains of computation in stream processing today.
In this tutorial, we present the open-source system Apache Flink.
We will use the latest Flink release (JDK7 or higher required) for the hands-on tutorials.
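The snapshot-based recovery described above can be illustrated with a small, single-process sketch in plain Python. This is not the Flink API and it leaves out the distributed coordination that Flink performs; the class `RollingSum`, its `snapshot`/`restore` methods, and the sample events are hypothetical. It only shows the core idea: operator state (here a rolling sum per key) is captured together with the input position, so that processing can resume from a consistent point after a failure.

```python
import copy


class RollingSum:
    """Toy stateful operator: keeps a rolling sum per key (illustrative only)."""

    def __init__(self):
        self.state = {}   # managed state: key -> running sum
        self.offset = 0   # number of input records processed so far

    def process(self, record):
        key, value = record
        self.state[key] = self.state.get(key, 0) + value
        self.offset += 1

    def snapshot(self):
        # A consistent snapshot pairs the state with the input position.
        return {"state": copy.deepcopy(self.state), "offset": self.offset}

    def restore(self, snap):
        self.state = copy.deepcopy(snap["state"])
        self.offset = snap["offset"]


events = [("a", 1), ("b", 5), ("a", 2), ("b", 1), ("a", 4)]

op = RollingSum()
for record in events[:3]:
    op.process(record)
checkpoint = op.snapshot()

# Simulate a failure after the checkpoint: a fresh operator restores the
# snapshot and replays only the records that came after it.
recovered = RollingSum()
recovered.restore(checkpoint)
for record in events[recovered.offset:]:
    recovered.process(record)

print(recovered.state)
```

Because the state and the input offset are captured together, replaying from the offset neither drops nor double-counts records in this toy setting.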
Apache Impala (incubating) is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or Spark. Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu, and HBase. In this tutorial, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud. In the hands-on tutorial, we demonstrate how to launch an Impala cluster and run SQL queries that combine data from multiple storage engines.
Apache Impala (Incubating), Apache Kudu, Hadoop (HDFS)
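As a rough illustration of the static partition pruning mentioned in the Impala abstract, the following plain-Python sketch skips whole partitions whose partition key cannot satisfy the query predicate. It is not Impala code: the `partitions` dictionary, the `scan` function, and the sample rows are hypothetical, and a real engine would prune based on table metadata rather than in-memory dictionaries.

```python
# Hypothetical table partitioned by year: partition key -> rows in that partition.
partitions = {
    2015: [{"year": 2015, "amount": 10}, {"year": 2015, "amount": 20}],
    2016: [{"year": 2016, "amount": 5}],
    2017: [{"year": 2017, "amount": 7}, {"year": 2017, "amount": 8}],
}


def scan(partitions, partition_pred, row_pred):
    """Return (matching rows, number of partitions actually read).

    partition_pred is evaluated on the partition key alone, so partitions
    that cannot contain matches are never read.
    """
    selected = [key for key in sorted(partitions) if partition_pred(key)]
    rows = [row for key in selected for row in partitions[key] if row_pred(row)]
    return rows, len(selected)


# Roughly: SELECT * FROM t WHERE year >= 2017 AND amount > 7
rows, partitions_read = scan(
    partitions,
    partition_pred=lambda year: year >= 2017,
    row_pred=lambda row: row["amount"] > 7,
)
print(rows, partitions_read)
```

Only one of the three partitions is read in this example; the row-level predicate is applied to the surviving partition only.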
Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics.
Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark.
Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. In addition, the new Structured Streaming engine allows developers to perform computations in real time using the same APIs as they would use for batch computation.
This tutorial covers the core APIs for using Spark 2.2, including DataFrames, Datasets, and SQL. We talk about how the system has evolved to meet the demands of an ever-growing user base, including details on the Catalyst query optimizer, whole-stage code generation, and incremental query planning for streaming computation. The tutorial will be accompanied by hands-on exercises.
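The idea behind whole-stage code generation can be sketched in a few lines of plain Python. This is not Spark code and does not use any Spark API; the functions `interpreted` and `generate_fused` and their parameters are hypothetical. The sketch contrasts an operator-at-a-time pipeline (a filter, then a projection, then an aggregate) with a single generated function that fuses all three steps into one loop, which is the general principle that code-generating engines apply.

```python
def interpreted(rows, threshold, factor):
    """Operator-at-a-time style: each step is a separate generator."""
    filtered = (r for r in rows if r > threshold)
    projected = (r * factor for r in filtered)
    return sum(projected)


def generate_fused(threshold, factor):
    """Generate and compile one fused loop for filter + project + sum.

    Returns the compiled function and its source text. Illustrative only.
    """
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if r > {threshold!r}:\n"
        f"            total += r * {factor!r}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source at run time
    return namespace["fused"], source


fused, source = generate_fused(threshold=2, factor=10)
data = [1, 2, 3, 4]
print(source)
print(interpreted(data, 2, 10), fused(data))
```

Both versions compute the same result; the fused version bakes the query constants into the generated source and avoids passing rows between separate operators.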
TensorFlow™ is an open source software library for numerical computation using data flow graphs. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
In the first part of the tutorial, I will introduce the basic concepts of TensorFlow. What are TensorFlow data flow graphs and why do we need them?
I will show how you can build your own computation graph and implement a linear classifier with low-level TensorFlow primitives. I will also motivate higher-level
TensorFlow abstractions that will increase your productivity, especially for Machine Learning.
In the second part, there will be a hands-on tutorial in which you will be able to implement your own linear and neural network classifiers for recognizing handwritten digits using the higher-level TensorFlow Estimator API.
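To give a feel for what a data flow graph is, here is a minimal sketch in plain Python. It deliberately does not use TensorFlow, and none of the names below (`Node`, `constant`, `placeholder`, `run`, `classify`) are TensorFlow APIs. The weights are fixed, hypothetical values rather than trained ones. The point is only the separation of two phases: first a graph describing the computation is built, then it is evaluated with concrete inputs.

```python
class Node:
    """One vertex of a toy data flow graph (illustrative only)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "placeholder", "add", or "mul"
        self.inputs = inputs  # upstream nodes
        self.value = value    # constant value or placeholder name


def constant(v):
    return Node("const", value=v)


def placeholder(name):
    return Node("placeholder", value=name)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, feed):
    """Evaluate the graph rooted at node, reading placeholders from feed."""
    if node.op == "const":
        return node.value
    if node.op == "placeholder":
        return feed[node.value]
    args = [run(i, feed) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError(f"unknown op: {node.op}")


# Phase 1: build the graph for score = 2.0 * x1 + (-1.0) * x2 + 0.5.
# Nothing is computed yet; the weights are hypothetical fixed values.
x1, x2 = placeholder("x1"), placeholder("x2")
score = add(add(mul(constant(2.0), x1), mul(constant(-1.0), x2)), constant(0.5))


def classify(features):
    # Phase 2: evaluate the graph with concrete inputs and threshold the score.
    return 1 if run(score, features) > 0 else 0


print(classify({"x1": 1.0, "x2": 1.0}), classify({"x1": 0.0, "x2": 2.0}))
```

Building the graph first is what lets a framework analyze, optimize, or distribute the computation before any data flows through it.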