4. Apache Drill
Open source distributed SQL query engine for non-relational datastores
- JSON document model
- Columnar
Key Advantages
- Columnar
- Schema on the fly
- Integrates with any non-relational datastore
- Elastic scalability
- Data can be treated like SQL tables
- SQL-like query syntax
- No overhead from creating and maintaining schemas, ETL processes, etc.
- Vectorization (SIMD instructions)
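To make "schema on the fly" concrete, here is a minimal Python sketch of the idea: column names are discovered while the JSON records are read, rather than declared up front. This only illustrates the concept and is not how Drill implements it; the function name `infer_columns` and the sample records are assumptions made for the example.

```python
import json


def infer_columns(json_lines):
    """Discover column names while reading records, in first-seen order."""
    columns = []
    rows = []
    for line in json_lines:
        record = json.loads(line)
        for key in record:
            if key not in columns:
                columns.append(key)
        rows.append(record)
    # Pad missing fields with None so every row has the same shape.
    table = [[row.get(col) for col in columns] for row in rows]
    return columns, table


# Hypothetical sample records with differing fields.
sample = [
    '{"repo": "a/b", "commits": 10}',
    '{"repo": "c/d", "language": "Python"}',
]
columns, table = infer_columns(sample)
```

No schema was declared for `sample`; the set of columns is simply the union of the fields seen, which is what lets JSON files be treated like SQL tables without an ETL step.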
6. Test environment - SW basic configuration
Architecture: Gigabyte Marvell® ThunderX2® "Saber" 3-node cluster
OS platform: Debian GNU/Linux 9.9 (stretch)
Linux kernel version: Debian 4.16.13.linaro.290-1
GCC version: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
glibc version: Debian GLIBC 2.24-11+deb9u4
Java version: openjdk version "1.8.0_191"
  OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_191-b12)
  OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.191-b12, mixed mode)
Hadoop version: Hadoop 2.8.5
  Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
  Compiled by jdu on 2018-09-10T03:32Z
  Compiled with protoc 2.5.0
Using upstream release packages from apache.org.
Running on a commercially available Arm server based on Marvell ThunderX2.
7. Test environment - SW basic configuration
ZooKeeper and libzookeeper-java version: 3.4.9-3+deb9u2
Apache Drill version: v1.16.0
Jupyter Notebook version:
  jupyter core 4.5.0
  jupyter-notebook 6.0.1
  qtconsole 4.5.4
  ipython 7.7.0
  ipykernel 5.1.2
  jupyter client 5.3.1
  jupyter lab 1.0.9
  nbconvert 5.6.0
  ipywidgets 7.5.1
  nbformat 4.4.0
  traitlets 4.3.2
Dataset: 3 TB+ GitHub activity dataset containing a full snapshot of the content of more than 2.8 million open source GitHub repositories, including more than 145 million unique commits.
This demo can be replicated using upstream release packages and an open source data set.
16. Top Python repositories by their commits count
SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20
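A query like this one can also be submitted from a Jupyter notebook through Drill's REST interface, which accepts a JSON body with `queryType` and `query` fields posted to `/query.json`. The sketch below is one possible way to do that using only the Python standard library. The host and port (`localhost:8047`) are assumptions about the deployment, and the helper names `build_request` and `run_query` are invented for the example.

```python
import json
import urllib.request

# Assumption: a Drillbit is reachable locally on the default web port.
DRILL_URL = "http://localhost:8047/query.json"

QUERY = "SELECT * FROM dfs.`/usersummary/*.json` LIMIT 20"


def build_request(sql, url=DRILL_URL):
    """Build the POST request that carries the SQL text to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows as a list of dicts."""
    with urllib.request.urlopen(build_request(sql, url)) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body.get("rows", [])
```

In a notebook cell, `rows = run_query(QUERY)` would fetch the first 20 records, provided a Drill cluster is running at the assumed address and the `dfs` storage plugin can see the `/usersummary` directory.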
17. Top Apache Projects by contribution
(Apache Drill query snapshot not included)
18. Who Are We Anyway?
We are Linaro: Leading collaboration in the Arm ecosystem
19. Linaro: Open Source
Delivering high-value collaboration
Top 5 company contributor to the Linux kernel
Contributor to >70 open source projects; many maintained by Linaro engineers
#  Company  Changesets (4.8-4.13)  %
1  Intel    10,833                 13.1%
2  Red Hat  5,965                  7.2%
3  Linaro   4,636                  5.6%
Source: Linux Kernel Development Report, Linux Foundation
Selected projects Linaro contributes to
20. Linaro: BigData Objective
● Ensure that Arm is a first-class platform for Hadoop and Spark.
● Profile Hadoop and Spark for real-world workloads on 64-bit Arm server systems.
● Ensure that OpenJDK runs optimally for Hadoop and Spark workloads.
21. Arm
❏ Founded in November 1990
❏ Designs RISC processor cores
❏ Licenses Arm core designs to partners, who fabricate and sell them to their customers
22. Arm Ecosystem momentum continues to accelerate
www.arm.com
Workloads
Networking
Virtualization & Containers
Language & Library
Operating system
26. Marvell-University of Michigan Partnership
Built on Cavium/Marvell-Michigan relationship
Deploy ThunderX for Big Data
● 4800 Cores
● 25 TB Memory
● 40 & 100 Gbps networking
● 3 PB Hadoop File System
Accelerating the software ecosystem for data science for Arm.
Directly consuming Linaro Big Data software builds
We bring an advanced user base in the data science domain
27. Questions?
Contact Us:
Ganesh Raju
ganesh.raju@linaro.org
Naresh Bhat
nareshb@marvell.com
naresh.bhat@linaro.org
Blogpost
https://nbhatlinaro.blogspot.com/2019/04/apache-drill-on-arm64.html
Thanks to Linaro Team:
Yuqi Gu
Jun He
Guodong Xu
Inspiration from Felipe Hoffa’s talks on Google BigQuery
https://s3.amazonaws.com/connect.linaro.org/bkk19/presentations/bkk19-300k1.pdf