We live in a data-centric era: an enormous variety of services now runs on data, and behind those services sit architectures that support the flow and processing of that data. BinaryEdge.io is no exception. Our platform sits on top of a data architecture that processes thousands of events per second, built and maintained by us. In this talk we review the parts that compose a data architecture and discuss which tools can be used at each step to arrive at a functional one. The insights given are based not on theoretical documents or decades of experience, but on our own experience of building and maintaining a large-scale data infrastructure and architecture.
Pixels Camp 2017 - Stories from the trenches of building a data architecture
1. BinaryEdge.io
Be Ready. Be Safe. Be Secure.
Stories from the Trenches of Building a Data Architecture
Florentino Bexiga, Data Engineer / Platform Developer
fb@binaryedge.io
2. WHO WE ARE AND WHAT WE DO
[Diagram: DATA POINTS collected per target. IP/URL addresses: portscan, DNS, screenshots, web services (HTTP/HTTPS, webserver, framework, headers, cookies), banners, image classifier, OCR, vulnerabilities, SMB, VNC, RDP, torrents (peers, torrent name, category, source, hashes of files). Domains: AXFR, MX records, whois. Certificates: configuration, authorities, entities. BGP/AS: AS membership, AS peers, list of IPs, shared infrastructure, co-hosted sites. Companies: registration (internal/external), contact (phone, email, linked URLs), geolocation, office locations, social networks. People: social, photos, family & friends, behaviour, likes, topics, search, news, forums, sub-reddits. Files: metadata, users, apps.]
3. AGENDA
01 THE NEED OF A DATA ARCHITECTURE
02 SIMPLE ARCHITECTURE OVERVIEW
03 THE BASIC SURVIVAL KIT
04 MESSAGE QUEUE
05 STREAM PROCESSING
06 BATCH PROCESSING
07 DATABASES
08 BONUS ROUND: MANAGEMENT
09 ARCHITECTURE REVISITED
10 CLOUD-BASED ARCHITECTURES
4. THE NEED OF A DATA ARCHITECTURE
Rules before building a data architecture:
01 Think about what you need to do with the data
02 There are no more rules
Typical list of needs:
Gather a lot of data coming from different places
Process that data in (close to) real time
Make the data available in multiple formats
Provide ways to easily process that data
6. THE BASIC SURVIVAL KIT
Apache Hadoop
MapReduce
HDFS
YARN
Why Apache Hadoop?
Interoperability with many other tools
Great community
Gets the job done
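The MapReduce model Hadoop is built around can be illustrated with a toy, in-process word count in Python. This is a sketch of the programming model only; a real job would run over HDFS through the Hadoop APIs or Hadoop Streaming:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network, but the contract is the same.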
7. THE BASIC SURVIVAL KIT
Points of attention
YARN
Available resources per node for processing
Timeouts
Heap sizes, heap sizes...
HDFS
Same as above
Primary/secondary nodes for high availability
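The per-node resource and heap pain points above mostly come down to a handful of properties in yarn-site.xml; an illustrative fragment (the values are placeholders, tune them to your hardware):

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value> <!-- RAM this node makes available to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest container a single request may be granted -->
</property>
```

Getting these wrong in either direction shows up as the timeouts and heap errors mentioned above.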
8. MESSAGE QUEUE
Apache Kafka
Originally developed by LinkedIn
Massively scalable publish/subscribe message queue
High throughput
Low latency
Concepts
Topics
Consumers
Consumer groups
Partitions
Replicas
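How topics, partitions and consumer groups fit together can be sketched in a few lines of plain Python. This is a toy model of the concepts, not the Kafka client API, and helper names like `assign_partitions` are made up for illustration:

```python
import zlib

NUM_PARTITIONS = 6  # partitions of one topic

def partition_for(key: str) -> int:
    """Messages with the same key always land in the same partition,
    which is what gives per-key ordering."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Within one consumer group, each partition is owned by exactly one
    consumer (round-robin here; Kafka's assignors are pluggable)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Same key -> same partition, so events for one host stay ordered:
assert partition_for("192.0.2.1") == partition_for("192.0.2.1")

print(assign_partitions(["consumer-a", "consumer-b"]))
# each consumer in the group owns half of the six partitions
```

Adding a consumer to the group triggers a rebalance: partitions are redistributed, which is how Kafka scales consumption horizontally.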
9. MESSAGE QUEUE
Points of attention
Timeouts
Message sizes
Log retention vs. cleanup interval (!)
Also, do not, for the love of god, simply delete all the subdirectories in your "kafka-logs" directory; you will cry.
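The retention-vs-cleanup-interval gotcha lives in the broker's server.properties; an illustrative fragment (example values only):

```properties
# how long a log segment is kept before it becomes eligible for deletion
log.retention.hours=168
# how often the broker checks whether any segment is eligible for cleanup
log.retention.check.interval.ms=300000
```

If retention is short but the check interval is long (or disk fills faster than cleanup runs), the mismatch bites; fix it here rather than by touching the log directories by hand.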
11. STREAM PROCESSING
The good parts
Very simple programming model and APIs
Multilanguage support
DataFrame API
ML libraries
Wide community
Wide range of addons
Points of attention
Mini-batch processing, not real streaming
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
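"Mini-batch, not real stream" means the engine chops the input into small time windows and runs a batch job per window. A stdlib-Python sketch of the idea (illustrative only, not the Spark API):

```python
from itertools import groupby

BATCH_INTERVAL = 2  # seconds, analogous to a Spark Streaming batch interval

events = [  # (timestamp_seconds, value), already in arrival order
    (0.2, 1), (0.9, 1), (1.4, 1),   # fall into the first micro-batch
    (2.1, 1), (3.5, 1),             # second micro-batch
    (4.0, 1),                       # third micro-batch
]

def micro_batches(events, interval):
    """Assign each event to a time bucket; each bucket is then
    processed as one small batch job."""
    for bucket, batch in groupby(events, key=lambda e: int(e[0] // interval)):
        yield bucket, sum(v for _, v in batch)

for bucket, total in micro_batches(events, BATCH_INTERVAL):
    print(f"batch {bucket}: {total} events")
```

The consequence: no result is visible until its whole window closes, so end-to-end latency is at least one batch interval.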
12. STREAM PROCESSING
The good parts
Stream processing (record at a time)
Works without much configuration effort
Low resource configuration
Wide community
Lots of connectors and addons
Great performance; like, "The Flash" great
Points of attention
Slightly more complex programming model
Only some support for other languages
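By contrast with mini-batching, a true stream processor updates its state per record as it arrives, so results are available with per-event latency. A toy stdlib-Python sketch of record-at-a-time processing:

```python
from collections import Counter

def process_stream(events):
    """Record-at-a-time processing: state is updated and a result is
    emitted for every single event, with no batch boundary."""
    state = Counter()
    for key in events:
        state[key] += 1
        yield key, state[key]  # running count, available immediately

updates = list(process_stream(["scan", "dns", "scan", "scan"]))
print(updates[-1])  # ('scan', 3)
```

Real engines add the hard parts this sketch omits: distribution, fault-tolerant state, and event-time handling.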
13. STREAM PROCESSING
The good parts
Stream processing
Multilanguage support
Simple API (very similar to Spark)
Dataset API
ML libraries
Good handling of resources
Low configuration/optimisation overhead
Buuuuut...
Does not have a wide community
Does not have that many connectors and addons
14. BATCH PROCESSING
Apache Spark
The good parts
Multilanguage support
Simple API
DataFrame API
ML libraries
Wide community
Wide range of addons
Apache Flink
The good parts
Multilanguage support
Simple API (very similar)
DataSet API
ML libraries
15. BATCH PROCESSING
Apache Spark
Points of attention
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
Apache Flink
Points of attention
Fewer configuration problems
Better handling of resources
Not a big community
Not many addons
16. DATABASES
Before committing to a database:
01 Think about how you need to access the data
02 Read 1 again
03 Seriously, read 1 again
Select a database based on your needs, e.g.:
Hardcore read/write workload and not much advanced querying: HBase
Heavy read/write workload and minimally dynamic querying: Cassandra
Advanced text querying and a less heavy read/write workload: something else
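"Think about how you need to access the data" becomes very concrete in a store like Cassandra, where tables are designed around the query, not the entities. An illustrative CQL sketch (the table and column names are hypothetical):

```sql
-- query we need to serve: "all scan events for a given IP, newest first"
CREATE TABLE events_by_ip (
    ip         text,
    scanned_at timestamp,
    port       int,
    banner     text,
    PRIMARY KEY (ip, scanned_at)   -- partition by ip, cluster by time
) WITH CLUSTERING ORDER BY (scanned_at DESC);

-- fast, single-partition read:
SELECT * FROM events_by_ip WHERE ip = '192.0.2.1' LIMIT 100;
```

A query the table was not designed for (say, "all events on port 443 across all IPs") would need another table or a different database, which is exactly why step 01 comes before choosing one.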
17. BONUS ROUND: MANAGEMENT
Apache Ambari
Provision a Hadoop Cluster
Manage a Hadoop Cluster
Monitor a Hadoop Cluster
Works with Hadoop ecosystem distributions such as:
Hortonworks
Cloudera
19. CLOUD-BASED ARCHITECTURES
Pros
Less configuration overhead
Less maintenance overhead
Easily scalable
Reliable
Returns focus to the data and the product
Cons
$$$$$$$$$$