We live in a data-centric era: an enormous variety of services now runs on data, and behind those services sit architectures that support the flow and processing of that data. BinaryEdge.io is no exception. Our platform sits on top of a data architecture that processes thousands of events per second, built and maintained by us. In this talk we review the parts that compose a data architecture and discuss which tools can be used at each step to arrive at a functional one. The insights given are based not on theoretical documents or decades of experience, but on our own experience of building and maintaining a large-scale data infrastructure and architecture.
Pixels Camp 2017 - Stories from the trenches of building a data architecture
1. BinaryEdge.io
Be Ready. Be Safe. Be Secure.
Stories from the Trenches of Building a Data Architecture
Florentino Bexiga, Data Engineer / Platform Developer
fb@binaryedge.io
2. WHO WE ARE AND WHAT WE DO
[Diagram: DATA POINTS collected per target. IP/URL addresses: portscan, DNS, screenshots, web services (HTTP/HTTPS, webserver, framework, headers, cookies), banners, image classifier, OCR, vulnerabilities, SMB, VNC, RDP, torrents (peers, torrent name, category, source, hashes of files). Domains: AXFR, MX records, whois. Certificates: configuration, authorities, entities. BGP/AS: AS membership, AS peers, list of IPs, shared infrastructure, co-hosted sites. Companies: registration (internal/external), contact (phone, email, linked URLs), geolocation, office locations, social networks. People: social, photos, family & friends, behaviour, likes, topics, search, news, forums, sub-reddits. Files: metadata, users, apps.]
3. AGENDA
01 THE NEED OF A DATA ARCHITECTURE
02 SIMPLE ARCHITECTURE OVERVIEW
03 THE BASIC SURVIVAL KIT
04 MESSAGE QUEUE
05 STREAM PROCESSING
06 BATCH PROCESSING
07 DATABASES
08 BONUS ROUND: MANAGEMENT
09 ARCHITECTURE REVISITED
10 CLOUD-BASED ARCHITECTURES
4. THE NEED OF A DATA ARCHITECTURE
Rules before building a data architecture:
01 Think about what you need to do with the data
02 There are no more rules
Typical list of needs:
Gather a lot of data coming from different places
Process that data in (close to) real time
Make the data available in multiple formats
Provide ways to easily process that data
6. THE BASIC SURVIVAL KIT
Apache Hadoop
MapReduce
HDFS
YARN
Why Apache Hadoop?
Interoperability with many other tools
Great community
Gets the job done
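The MapReduce model Hadoop is built around can be illustrated with a toy, in-process word count in Python. This is a sketch of the programming model only; a real job would run over HDFS through the Hadoop APIs or Hadoop Streaming:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {word: sum(values) for word, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network, but the contract is the same.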
7. THE BASIC SURVIVAL KIT
Points of attention
YARN
Available resources per node for processing
Timeouts
Heap sizes, heap sizes...
HDFS
Same as above
Primary/secondary nodes for high availability
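The per-node resource and heap pain points above mostly come down to a handful of properties in yarn-site.xml; an illustrative fragment (the values are placeholders, tune them to your hardware):

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>49152</value> <!-- RAM this node makes available to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest container a single request may be granted -->
</property>
```

Getting these wrong in either direction shows up as the timeouts and heap errors mentioned above.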
8. MESSAGE QUEUE
Apache Kafka
Originally developed by LinkedIn
Massively scalable publish/subscribe message queue
High throughput
Low latency
Concepts
Topics
Consumers
Consumer groups
Partitions
Replicas
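How topics, partitions and consumer groups fit together can be sketched in a few lines of plain Python. This is a toy model of the concepts, not the Kafka client API, and helper names like `assign_partitions` are made up for illustration:

```python
import zlib

NUM_PARTITIONS = 6  # partitions of one topic

def partition_for(key: str) -> int:
    """Messages with the same key always land in the same partition,
    which is what gives per-key ordering."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Within one consumer group, each partition is owned by exactly one
    consumer (round-robin here; Kafka's assignors are pluggable)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Same key -> same partition, so events for one host stay ordered:
assert partition_for("192.0.2.1") == partition_for("192.0.2.1")

print(assign_partitions(["consumer-a", "consumer-b"]))
# each consumer in the group owns half of the six partitions
```

Adding a consumer to the group triggers a rebalance: partitions are redistributed, which is how Kafka scales consumption horizontally.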
9. MESSAGE QUEUE
Points of attention
Timeouts
Message sizes
Log retention vs. cleanup interval (!)
Also, do not, for the love of god, simply delete all the subdirectories in your "kafka-logs" directory; you will cry.
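The retention-vs-cleanup-interval gotcha lives in the broker's server.properties; an illustrative fragment (example values only):

```properties
# how long a log segment is kept before it becomes eligible for deletion
log.retention.hours=168
# how often the broker checks whether any segment is eligible for cleanup
log.retention.check.interval.ms=300000
```

If retention is short but the check interval is long (or disk fills faster than cleanup runs), the mismatch bites; fix it here rather than by touching the log directories by hand.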
11. STREAM PROCESSING
The good parts
Very simple programming model and APIs
Multilanguage support
DataFrame API
ML libraries
Wide community
Wide range of addons
Points of attention
Mini-batch processing, not real streaming
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
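"Mini-batch, not real stream" means the engine chops the input into small time windows and runs a batch job per window. A stdlib-Python sketch of the idea (illustrative only, not the Spark API):

```python
from itertools import groupby

BATCH_INTERVAL = 2  # seconds, analogous to a Spark Streaming batch interval

events = [  # (timestamp_seconds, value), already in arrival order
    (0.2, 1), (0.9, 1), (1.4, 1),   # fall into the first micro-batch
    (2.1, 1), (3.5, 1),             # second micro-batch
    (4.0, 1),                       # third micro-batch
]

def micro_batches(events, interval):
    """Assign each event to a time bucket; each bucket is then
    processed as one small batch job."""
    for bucket, batch in groupby(events, key=lambda e: int(e[0] // interval)):
        yield bucket, sum(v for _, v in batch)

for bucket, total in micro_batches(events, BATCH_INTERVAL):
    print(f"batch {bucket}: {total} events")
```

The consequence: no result is visible until its whole window closes, so end-to-end latency is at least one batch interval.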
12. STREAM PROCESSING
The good parts
Stream processing (record at a time)
Works without much configuration effort
Low resource configuration
Wide community
Lots of connectors and addons
Great performance; like, "The Flash" great
Points of attention
Slightly more complex programming model
Only some support for other languages
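By contrast with mini-batching, a true stream processor updates its state per record as it arrives, so results are available with per-event latency. A toy stdlib-Python sketch of record-at-a-time processing:

```python
from collections import Counter

def process_stream(events):
    """Record-at-a-time processing: state is updated and a result is
    emitted for every single event, with no batch boundary."""
    state = Counter()
    for key in events:
        state[key] += 1
        yield key, state[key]  # running count, available immediately

updates = list(process_stream(["scan", "dns", "scan", "scan"]))
print(updates[-1])  # ('scan', 3)
```

Real engines add the hard parts this sketch omits: distribution, fault-tolerant state, and event-time handling.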
13. STREAM PROCESSING
The good parts
Stream processing
Multilanguage support
Simple API (very similar to Spark)
Dataset API
ML libraries
Good handling of resources
Low configuration/optimisation overhead
Buuuuut...
Does not have a wide community
Does not have that many connectors and addons
14. BATCH PROCESSING
Apache Spark
The good parts
Multilanguage support
Simple API
DataFrame API
ML libraries
Wide community
Wide range of addons
Apache Flink
The good parts
Multilanguage support
Simple API (very similar)
DataSet API
ML libraries
15. BATCH PROCESSING
Apache Spark
Points of attention
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
Apache Flink
Points of attention
Fewer configuration problems
Better handling of resources
Not a big community
Not many addons
16. DATABASES
Before committing to a database:
01 Think about how you need to access the data
02 Read 1 again
03 Seriously, read 1 again
Select a database based on your needs, e.g.:
Hardcore read/write workload and not much advanced querying: HBase
Heavy read/write workload and minimally dynamic querying: Cassandra
Advanced text querying and a less heavy read/write workload: something else
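"Think about how you need to access the data" becomes very concrete in a store like Cassandra, where tables are designed around the query, not the entities. An illustrative CQL sketch (the table and column names are hypothetical):

```sql
-- query we need to serve: "all scan events for a given IP, newest first"
CREATE TABLE events_by_ip (
    ip         text,
    scanned_at timestamp,
    port       int,
    banner     text,
    PRIMARY KEY (ip, scanned_at)   -- partition by ip, cluster by time
) WITH CLUSTERING ORDER BY (scanned_at DESC);

-- fast, single-partition read:
SELECT * FROM events_by_ip WHERE ip = '192.0.2.1' LIMIT 100;
```

A query the table was not designed for (say, "all events on port 443 across all IPs") would need another table or a different database, which is exactly why step 01 comes before choosing one.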
17. BONUS ROUND: MANAGEMENT
Apache Ambari
Provision a Hadoop Cluster
Manage a Hadoop Cluster
Monitor a Hadoop Cluster
Works with Hadoop ecosystem distributions such as:
Hortonworks
Cloudera
19. CLOUD-BASED ARCHITECTURES
Pros
Less configuration overhead
Less maintenance overhead
Easily scalable
Reliable
Returns focus to the data and the product
Cons
$$$$$$$$$$