First Workshop on Big Data Open Source Systems
September 4, 2015, Hilton Waikoloa Village, Waikoloa, Hawaii
The first workshop on Big Data Open Source Systems (BOSS'15) will be held in conjunction with the
The workshop gave a deep-dive introduction into several active, publicly available, open-source systems. For the first instance of the BOSS workshop, 8 diverse systems were chosen. The tutorials will gave details on installation and non-trivial examples usage of the presented system. All presentations were held by experts in the respective systems. The workshop followed a novel massively parallel format. After a joint introduction, 8 parallel tutorial sessions were held. Each tutorial was 2 hours in length and was repeated. Between the two tutorial sessions a panel on "Big Data and Exascale" was held.
The program consisted of 8 parallel tutorial sessions, which were repeated, and a panel between the repetitions. The 8 presented systems were: Apache AsterixDB, Apache Flink, Apache Reef, Apache Singa, Apache Spark, Padres, rasdaman, and SciDB. The panel was on the topic "Exascale and Big Data".
|9:00 - 9:15||Introduction (slides)|
|9:15 - 10:00||Flash Session|
|10:00 - 10:30||Tutorial Session 1 (Part 1)|
|10:30 - 11:00||Break|
|11:00 - 12:30||Tutorial Session 1 (Part 2)|
|12:30 - 14:00||Lunch break|
|14:00 - 15:00||Panel discussion|
|15:00 - 15:30||Tutorial Session 2 (Part 1)|
|15:30 - 16:00||Break|
|16:00 - 17:30||Tutorial Session 2 (Part 2)|
The emergence of large-scale, data intensive science on one hand, and the increasing requirement for sophisticated deep learning models and real-time analytics in business, on the other, provides the opportunity and necessity to examine convergence of capacity, functionality, and capabilities of traditional high performance computing (HPC) with the distributed, resilient, and ease of programming models of Big Data. < /br> The panel discussed application use cases; use case "patterns"; software-hardware codesign and integration issues; and, the role and impact of disruptive hardware and software technologies, all of which may drive the convergence of HPC and Big Data. The panel will also consider and discuss implications for medium and long-term research. A companion panel on the same topic has been accepted at the Supercomputing 2015 Conference (Nov 15-20, 2016, Austin, TX).Panel Chair:
AsterixDB is a BDMS (Big Data Management System) with a rich feature set that sets it apart from other Big Data platforms such as Big Data Analytics Systems and NoSQL stores. Its features make it well-suited to applications including web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a declarative query language that supports a wide range of queries; a scalable runtime system; partitioned, LSM-based data storage and indexing (including B+ tree, R tree, and text indexes); support for external as well as native data; a rich set of built-in types, including spatial, temporal, and textual types; support for fuzzy, spatial, and temporal queries; a built-in notion of data feeds for ingestion of data; and, last but not least, transaction support akin to that of a NoSQL store. Co-developed by researchers at UC Irvine and UC Riverside, Apache AsterixDB is now undergoing incubation at The Apache Software Foundation (ASF) and has contributors and early adopters from multiple institutions both within and beyond the UC System.
Apache Flink is an open source system for expressive, declarative, fast,and efficient data analysis on both historical (batch) and real-time (streaming) data. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with The efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. At its core, Flink builds on a distributed dataflow runtime that unifies batch and incremental computations over a true-streaming pipelined execution. Its programming model allows for stateful, fault toleranT computations, flexible user-defined windowing semantics for streaming and unique support For iterations. Flink is converging into a use-case complete system for parallel data processing with a wide range of top level libraries ranging from machine learning and graph processing.Apache Flink originates from the Stratosphere project led by TU Berlin and has led to various scientific papers (VLDBJ, SIGMOD, (P)VLDB, ICDE, HPDC, etc.).
REEF is a meta-framework that provides a control-plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource re-use for data caching, and state management abstractions that greatly ease the development of elastic data processing workflows. REEF is being used to develop several commercial offerings such as the Azure Stream Analytics service and experimental prototypes for machine learning algorithms and a port of the CORFU system. REEF supports both Java and C# languages. REEF is currently an Apache Incubator project that has attracted contributors from several institutions
It has shown that big data and deep structures greatly improve the accuracy of deep learning models in various machine learning
tasks. However, the training usually takes a long time for large models with massive data. Distributed training makes it possible
to scale, thus has gained traction in both research and industry.
In this talk, we are going to present Apache SINGA --- a general distributed deep learning system. SINGA provides a flexible training architecture that can be easily customized to be scalable for specific training tasks. A unified representation is proposed for different model structures, based on which a simple programming model is designed. We will introduce the programming model and architecture of SINGA together with hands-on training demos for users with different levels of deep learning background.
Apache Spark is a general-purpose computation engine used for many types of data processing. Spark comes packaged with support for ETL, interactive queries (SQL), advanced analytics (e.g. machine learning)
and streaming over large datasets. Spark has seen rapid adoption at more than 1000 organizations. Spark is supported by a wide array of vendors, including Databricks, IBM, SAP, Cloudera, Hortonworks, MapR, Datastax, with deployment size from single laptop to a cluster of 8000 nodes, running against data ranging from
kilobytes to petabytes. Originally developed at the UC Berkeley AMPLab, the Spark project has led to many academic publications and awards:
- 2014 ACM Best Dissertation Award (Matei Zaharia)
- Best Paper Award at NSDI 2012
- Best Demo Award at SIGMOD 2012
- OSDI 2014
- SIGMOD (2013 and 2015) Spark is also the largest open source project for data processing, with over 700 contributors from over 200 organizations.
PADRES (Publish/Subscribe Applied to Distributed Resource Scheduling) is an enterprise-grade event management infrastructure that is designed for large-scale event management applications. Ongoing research seeks to add and improve enterprise-grade qualities of the middleware. The PADRES system is a distributed content-based publish/subscribe middleware with features built with enterprise applications in mind. These features include - Intelligent and scalable rule-based routing protocol and matching algorithm - Powerful correlation of future and historic event - Failure detection, recovery and dynamic load balancing system administration and monitoring As well, the PADRES project studies application concerns above the infrastructure layer, such as - Distributed transformation, deployment and execution - Distributed monitoring and control - Goal-oriented resource discovery and scheduling - Secure, decentralized choreography and orchestration A publish/subscribe middleware provides many benefits to enterprise applications. Content-based interaction simplifies the IT development and maintenance by decoupling enterprise components. As well, the expressive PADRES subscription language supports sophisticated interactions among components, and allows fine-grained queries and event management functions. Furthermore, scalability is achieved with in-network filtering and processing capabilities.
rasdaman ("raster data manager") has pioneered Array Databases by adding massive multi-dimensional gridded data, an information category long missing in
databases, to scalable data management and analysis. Its declarative query language, rasql, extends SQL with array operators which are optimized and
parallelized on server side. Installations can easily be mashed up securely, thereby enabling large-scale location-transparent query processing in
federations. Domain experts value the integration with their commonly used tools leading to a quick learning curve.
Earth, Space, and Life sciences, but also Social sciences as well as business have massive amounts of data and complex analysis challenges that are answered by rasdaman. As of today, rasdaman is mature and in operational use on hundreds of Terabytes of timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Additionally, its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL, many of which are supported by both open-source and commercial systems meantime. In the geo field, rasdaman is reference implementation for the Open Geospatial Consortium (OGC) Big Data standard, WCS, now also under adoption by ISO.
In this tutorial we present the concepts and design rationales, and then get hands-on with real-time examples. We will run nontrivial queries both locally and via Internet, with an opportunity for participants to recap and run their own queries. Finally, we will inspect operational services.
SciDB combines the features of a scalable computational engine with a reliable, massively parallel, data store. SciDB's high level goals and detailed feature list were the results of a survey of scientific users of data management technology--astronomers, high energy physicists, earth scientists and biologists--many of whom have spent their careers working in an environment characterized by an (over)abundance of machine (sensor) generated data. Often, resolving scientific questions involved integrating data from multiple sources, and by encouraging collaboration and data sharing. We now see the commercial world, from Life Sciences to Quant Finance, to IoT facing the same challenges and adopting and deploying SciDB. In this tutorial, we will go over the design and implementation of SciDB and review a number of the applications – both commercial, and scientific – to which it is being applied.