Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016
Joe MÉSZÁROS
software engineer
@joemesz
joemeszaros
Who are we?
Our vision is to revolutionize the KPIs and metrics that the online
advertising industry currently uses. With our products,
Antifraud, Brandsafety and Viewability, we provide actionable
data to our customers.
Agenda
● What do we do?
● How do we do it? - enbrite.ly data platform
● Real world antifraud example
● Lessons learned + Spark at scale (pros and cons)
What do we do?
[Diagram labels: DATA COLLECTION, DATA PROCESSING, ANALYZE, ANTI FRAUD, VIEWABILITY, BRAND SAFETY, REPORT + API]
How do we do it? DATA COLLECTION
How do we do it? DATA PROCESSING
Amazon EMR
● Most popular cloud service provider
● Amazon Big Data ecosystem
● Applications: Hadoop, Spark, Hive, …
● Scaling is easy
● Do not trust the BIG guys (API problem)
● Spark applications on EMR run on YARN (the cluster manager)
For more information: https://aws.amazon.com/elasticmapreduce/
Tools we use
https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors
A workflow engine that helps you build
complex data pipelines of batch jobs.
Created by Spotify’s engineering team.
Your friendly plumber that sticks your Hadoop, Spark, … jobs
together with simple dependency definitions and failure management.
import luigi

class SparkMeetupTask(luigi.Task):
    param = luigi.Parameter(default=42)

    def requires(self):
        # SomeOtherTask is a placeholder for an upstream task defined elsewhere.
        return SomeOtherTask(self.param)

    def run(self):
        # Write the result to the target declared in output().
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')

    def output(self):
        return luigi.LocalTarget('/meetup/message')

if __name__ == '__main__':
    luigi.run()
Web interface
Let me tell you a short story...
Tools we created GABO LUIGI
Luigi + enbrite.ly extensions = Gabo Luigi
● Dynamic task configuration + dependencies
● Reshaped web interface
● Define reusable data pipeline templates
● Monitoring for each task
Tools we created GABO LUIGI
We plan to release it into the wild and make it open
source as part of Spotify’s Luigi! If you are
interested, our doors are open :-)
Tools we created GABO MARATHON
Motivation: Testing with large data sets and slow batch jobs is
boring and wasteful!
Graphite
Real world example
You are fighting against robots and want to humanize
the ad tech era. You have a simple idea to detect bot traffic
that will save the world. Let’s implement it!
Real world example
THE IDEA: Analyse events which are too hasty and deviate
from regular, humanlike profiles: too many clicks in a defined
timeframe.
INPUT: Load balancer access log files on S3
OUTPUT: Print invalid sessions
How to solve it?
Step 1: convert access log files to events
Step 2: sessionize events
Step 3: detect too many clicks
The way to the access log
1. Click event attributes (created by JS tracker)
{
  "session_id": "spark_meetup_jsmmmoq",
  "timestamp": 1456080915621,
  "type": "click"
}
2. Base64-encoded event
eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=
3. Access log format
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
Step 1: log to event
Simplification: the log files are on local storage and contain only click events.
SparkConf conf = new SparkConf().setAppName("LogToEvent");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
// 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

// Split on whitespace into at most 4 fields; the 4th field is the quoted request.
JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+", 4)[3]);
// "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

// Extract the value of the "event" query parameter.
JavaRDD<String> eventParameter = rawUrls
    .map(u -> parseUrl(u).get("event"));
// eyJzZXNzaW9uX2lkIj…

// Decode the Base64 payload back into the JSON written by the JS tracker.
JavaRDD<String> base64Decoded = eventParameter
    .map(e -> new String(Base64.getDecoder().decode(e), StandardCharsets.UTF_8));
// {"session_id": "spark_meetup_jsmmmoq",
// "timestamp": 1456080915621, "type": "click"}
IoUtil.saveAsJsonGzipped(base64Decoded);
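The slides call a parseUrl helper without showing it. The following is a minimal sketch of what such a helper might look like in plain Java, without Spark. It assumes the simplified log format shown earlier and an event value that is standard Base64 with no percent-encoding. The names LogLineDecoder and decodeEvent, and the body of parseUrl, are illustrative assumptions, not the original implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class LogLineDecoder {

    // Hypothetical stand-in for parseUrl: returns the query parameters of a
    // request field such as "GET https://host?key=value" (quotes included).
    static Map<String, String> parseUrl(String requestField) {
        Map<String, String> params = new HashMap<>();
        String request = requestField.replace("\"", "").trim();
        String[] parts = request.split("\\s+");
        // Skip the HTTP method when it is present.
        String url = parts.length > 1 ? parts[1] : parts[0];
        int queryStart = url.indexOf('?');
        if (queryStart < 0) {
            return params;
        }
        for (String pair : url.substring(queryStart + 1).split("&")) {
            // Split at the first '=' only, so Base64 padding stays in the value.
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    // Turns one access log line into the JSON text of its event,
    // or returns null when the line carries no event parameter.
    static String decodeEvent(String logLine) {
        String[] fields = logLine.split("\\s+", 4);
        if (fields.length < 4) {
            return null;
        }
        String encoded = parseUrl(fields[3]).get("event");
        if (encoded == null) {
            return null;
        }
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"session_id\":\"spark_meetup_jsmmmoq\","
            + "\"timestamp\":1456080915621,\"type\":\"click\"}";
        String encoded = Base64.getEncoder()
            .encodeToString(json.getBytes(StandardCharsets.UTF_8));
        String line = "2016-02-29T23:50:36.269432Z 178.165.132.37 200 "
            + "\"GET https://api.endpoint?event=" + encoded + "\"";
        System.out.println(decodeEvent(line));
    }
}
```

Keeping the per-line logic in a plain static method like this makes it testable without a cluster; inside the Spark job it would be called from a single map step.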
Step 2: event to session
SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<ClickEvent> clickEvents = jsonEvents
    .map(e -> readJsonObject(e));
// Collect all events that share a session id.
JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
    clickEvents.groupBy(e -> e.getSessionId());
// Build one Session from each group of events.
JavaPairRDD<String, Session> sessions = groupedEvents
    .mapValues(sessionizer);
Step 2: event to session
// Sessionizer
public Session call(Iterable<ClickEvent> clickEvents) {
    // Replay the clicks in chronological order.
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    Session session = new Session();
    for (ClickEvent event : ordered) {
        session.addClick(event);
    }
    return session;
}
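The sessionizer relies on a sortByTimestamp helper that the slides do not show. Below is a minimal sketch of the grouping and ordering logic in plain Java collections, without Spark. The ClickEvent stand-in, the body of sortByTimestamp, and the sessionize method are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SessionizerSketch {

    // Minimal stand-in for ClickEvent with only the two fields used here.
    static class ClickEvent {
        final String sessionId;
        final long timestamp;

        ClickEvent(String sessionId, long timestamp) {
            this.sessionId = sessionId;
            this.timestamp = timestamp;
        }
    }

    // Hypothetical sortByTimestamp: copies the events and sorts them ascending.
    static List<ClickEvent> sortByTimestamp(Iterable<ClickEvent> clickEvents) {
        List<ClickEvent> ordered = new ArrayList<>();
        for (ClickEvent event : clickEvents) {
            ordered.add(event);
        }
        ordered.sort(Comparator.comparingLong((ClickEvent e) -> e.timestamp));
        return ordered;
    }

    // Groups events by session id and lists each session's click
    // timestamps in chronological order.
    static Map<String, List<Long>> sessionize(List<ClickEvent> events) {
        Map<String, List<ClickEvent>> grouped = new TreeMap<>();
        for (ClickEvent event : events) {
            grouped.computeIfAbsent(event.sessionId, id -> new ArrayList<>()).add(event);
        }
        Map<String, List<Long>> sessions = new TreeMap<>();
        for (Map.Entry<String, List<ClickEvent>> entry : grouped.entrySet()) {
            List<Long> timestamps = new ArrayList<>();
            for (ClickEvent event : sortByTimestamp(entry.getValue())) {
                timestamps.add(event.timestamp);
            }
            sessions.put(entry.getKey(), timestamps);
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<ClickEvent> events = Arrays.asList(
            new ClickEvent("a", 30L), new ClickEvent("b", 5L), new ClickEvent("a", 10L));
        System.out.println(sessionize(events));
    }
}
```

Sorting inside each group matters because the grouped events are not guaranteed to arrive in time order.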
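Step 2: event to session
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Serializable so that Spark can ship sessions between executors and the driver.
class Session implements Serializable {
    public boolean isBad = false;
    public List<Long> clickTimestamps = new ArrayList<>();

    public void addClick(ClickEvent e) {
        clickTimestamps.add(e.getTimestamp());
    }

    public void badify() { this.isBad = true; }
}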
Step 3: detect bad sessions
JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
// Mark every session that has more clicks than the threshold.
JavaRDD<Session> markedSessions = sessions.map(s -> {
    if (s.clickTimestamps.size() > THRESHOLD) {
        s.badify();
    }
    return s;
});
JavaRDD<Session> badSessions = markedSessions
    .filter(s -> s.isBad);
badSessions.collect().forEach(System.out::println);
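The check above counts all clicks in a session, while the stated idea is "too many clicks in a defined timeframe". One possible way to express the timeframe is a sliding window over the sorted timestamps. The sketch below is in plain Java; the class name ClickBurstDetector and the parameters windowMillis and maxClicks are assumptions, not values from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ClickBurstDetector {

    // Returns true if more than maxClicks timestamps fall inside
    // any time span of at most windowMillis milliseconds.
    static boolean hasTooManyClicks(List<Long> clickTimestamps, long windowMillis, int maxClicks) {
        List<Long> sorted = new ArrayList<>(clickTimestamps);
        Collections.sort(sorted);
        int start = 0;
        for (int end = 0; end < sorted.size(); end++) {
            // Move the left edge until the window spans at most windowMillis.
            while (sorted.get(end) - sorted.get(start) > windowMillis) {
                start++;
            }
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> clicks = Arrays.asList(0L, 100L, 200L, 300L, 5000L);
        // Four clicks within one second exceed a limit of three.
        System.out.println(hasTooManyClicks(clicks, 1000L, 3));
    }
}
```

In the Spark job, a predicate like this could replace the size comparison inside the map step, with the window and limit tuned against real traffic.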
Congratulations!
MISSION COMPLETED
YOU just saved the world with a
simple idea within ~10 minutes.
Using Spark pros
● Sparking is fun: community, tools
● Easy to get started
● Language support: Python, Scala, Java, R
● Unified stack: batch, streaming, SQL, ML
Using Spark cons
● You need memory, and then more memory
● Distributed application, hard to debug
● Hard to optimize
Lessons learned
● Do not use default config, always optimize!
● Eliminate technical debt + automate
● Failures happen: use monitoring from the very first breath + a fault-tolerant implementation
● Sparking is fun, but it is not a hammer for everything
Data platform future
● Would like to play with Redshift
● Change data format (Avro, Parquet, …)
● Would like to play with streaming
● Would like to play with Spark 2.0
WE ARE HIRING!
working @exPrezi office, K9
check out the company in Forbes :-)
amazing company culture
BUT the real reason ….
WE ARE HIRING!
… is our mood manager, Bigyó :)
Joe MÉSZÁROS
software engineer
joe@enbrite.ly
@joemesz
@enbritely
joemeszaros
enbritely
THANK YOU!
QUESTIONS?

More Related Content

What's hot

Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersDaniel Zivkovic
 
Blockchain for Java Developers - Cloud Conference Day
Blockchain for Java Developers - Cloud Conference DayBlockchain for Java Developers - Cloud Conference Day
Blockchain for Java Developers - Cloud Conference DayJuarez Junior
 
Data ANZ - Using database for ML.NET.pptx
Data ANZ - Using database for ML.NET.pptxData ANZ - Using database for ML.NET.pptx
Data ANZ - Using database for ML.NET.pptxLuis Beltran
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
Bot Revolution lab at Codemotion Milan 2016
Bot Revolution lab at Codemotion Milan 2016Bot Revolution lab at Codemotion Milan 2016
Bot Revolution lab at Codemotion Milan 2016gjuljo
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overviewgjuljo
 
Accelerating Digital Transformation With Microsoft Azure And Cognitive Services
Accelerating Digital Transformation With Microsoft Azure And Cognitive ServicesAccelerating Digital Transformation With Microsoft Azure And Cognitive Services
Accelerating Digital Transformation With Microsoft Azure And Cognitive ServicesThuan Ng
 
Cosmos DB Conf - Cosmos DB + Azure Functions .pptx
Cosmos DB Conf - Cosmos DB + Azure Functions .pptxCosmos DB Conf - Cosmos DB + Azure Functions .pptx
Cosmos DB Conf - Cosmos DB + Azure Functions .pptxLuis Beltran
 
Windows azure mobile services from start to rest
Windows azure mobile services from start to restWindows azure mobile services from start to rest
Windows azure mobile services from start to restAidan Casey
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Daniel Zivkovic
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataLightbend
 
BRK20011: Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...
BRK20011:  Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...BRK20011:  Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...
BRK20011: Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...Tracy Van der Schyff
 
SQLDay 2021 PL AI Enrichment Azure Search.pptx
SQLDay 2021 PL AI Enrichment Azure Search.pptxSQLDay 2021 PL AI Enrichment Azure Search.pptx
SQLDay 2021 PL AI Enrichment Azure Search.pptxLuis Beltran
 
Microsoft Azure News - August 2021
Microsoft Azure News - August 2021Microsoft Azure News - August 2021
Microsoft Azure News - August 2021Daniel Toomey
 
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...confluent
 
Intel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSIntel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSAmazon Web Services
 
If an Event is Published to a Topic and No One is Around to Consume it, Does ...
If an Event is Published to a Topic and No One is Around to Consume it, Does ...If an Event is Published to a Topic and No One is Around to Consume it, Does ...
If an Event is Published to a Topic and No One is Around to Consume it, Does ...confluent
 
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft CloudEuropean Collaboration Summit
 

What's hot (20)

Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
 
Blockchain for Java Developers - Cloud Conference Day
Blockchain for Java Developers - Cloud Conference DayBlockchain for Java Developers - Cloud Conference Day
Blockchain for Java Developers - Cloud Conference Day
 
Data ANZ - Using database for ML.NET.pptx
Data ANZ - Using database for ML.NET.pptxData ANZ - Using database for ML.NET.pptx
Data ANZ - Using database for ML.NET.pptx
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
Bot Revolution lab at Codemotion Milan 2016
Bot Revolution lab at Codemotion Milan 2016Bot Revolution lab at Codemotion Milan 2016
Bot Revolution lab at Codemotion Milan 2016
 
Microsoft Azure Technical Overview
Microsoft Azure Technical OverviewMicrosoft Azure Technical Overview
Microsoft Azure Technical Overview
 
Accelerating Digital Transformation With Microsoft Azure And Cognitive Services
Accelerating Digital Transformation With Microsoft Azure And Cognitive ServicesAccelerating Digital Transformation With Microsoft Azure And Cognitive Services
Accelerating Digital Transformation With Microsoft Azure And Cognitive Services
 
Cosmos DB Conf - Cosmos DB + Azure Functions .pptx
Cosmos DB Conf - Cosmos DB + Azure Functions .pptxCosmos DB Conf - Cosmos DB + Azure Functions .pptx
Cosmos DB Conf - Cosmos DB + Azure Functions .pptx
 
Windows azure mobile services from start to rest
Windows azure mobile services from start to restWindows azure mobile services from start to rest
Windows azure mobile services from start to rest
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
 
Journey to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big DataJourney to the Modern App with Containers, Microservices and Big Data
Journey to the Modern App with Containers, Microservices and Big Data
 
BRK20011: Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...
BRK20011:  Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...BRK20011:  Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...
BRK20011: Put the DEV in Citizen DEVeloper with Microsoft Power Automate and...
 
SQLDay 2021 PL AI Enrichment Azure Search.pptx
SQLDay 2021 PL AI Enrichment Azure Search.pptxSQLDay 2021 PL AI Enrichment Azure Search.pptx
SQLDay 2021 PL AI Enrichment Azure Search.pptx
 
Azure functions
Azure functionsAzure functions
Azure functions
 
Microsoft Azure News - August 2021
Microsoft Azure News - August 2021Microsoft Azure News - August 2021
Microsoft Azure News - August 2021
 
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...
Handling GDPR with Apache Kafka: How to Comply Without Freaking Out? (David J...
 
Intel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWSIntel SoC as a Platform to Connect Sensor Data to AWS
Intel SoC as a Platform to Connect Sensor Data to AWS
 
If an Event is Published to a Topic and No One is Around to Consume it, Does ...
If an Event is Published to a Topic and No One is Around to Consume it, Does ...If an Event is Published to a Topic and No One is Around to Consume it, Does ...
If an Event is Published to a Topic and No One is Around to Consume it, Does ...
 
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud
[Toroman/Kranjac] Red Team vs. Blue Team in Microsoft Cloud
 
Games en
Games enGames en
Games en
 

Viewers also liked

Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging ChallengesAaron Irizarry
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsLinkedIn
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesNed Potter
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerLuminary Labs
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017Drift
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 
TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkVolker Hirsch
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 
BigWeatherGear Group and Corporate Services Brochure 2013
BigWeatherGear Group and Corporate Services Brochure 2013BigWeatherGear Group and Corporate Services Brochure 2013
BigWeatherGear Group and Corporate Services Brochure 2013Kristin Matson
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?Maciej Lasyk
 
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013Cain Ransbottyn
 
Growing great schools: what has the most impact?
Growing great schools: what has the most impact?Growing great schools: what has the most impact?
Growing great schools: what has the most impact?Ofsted
 
Technology Vision 2017 infographic
Technology Vision 2017 infographicTechnology Vision 2017 infographic
Technology Vision 2017 infographicAccenture Technology
 
Agriculture connectée 4.0
Agriculture connectée 4.0Agriculture connectée 4.0
Agriculture connectée 4.0Jérôme Monteil
 
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會Jason Cheng
 

Viewers also liked (20)

Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
Study: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving CarsStudy: The Future of VR, AR and Self-Driving Cars
Study: The Future of VR, AR and Self-Driving Cars
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 
Hype vs. Reality: The AI Explainer
Hype vs. Reality: The AI ExplainerHype vs. Reality: The AI Explainer
Hype vs. Reality: The AI Explainer
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 
TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of Work
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 
BigWeatherGear Group and Corporate Services Brochure 2013
BigWeatherGear Group and Corporate Services Brochure 2013BigWeatherGear Group and Corporate Services Brochure 2013
BigWeatherGear Group and Corporate Services Brochure 2013
 
Shall we play a game?
Shall we play a game?Shall we play a game?
Shall we play a game?
 
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013
Privacy is an Illusion and you’re all losers! - Cryptocow - Infosecurity 2013
 
Projeto gelo
Projeto geloProjeto gelo
Projeto gelo
 
出版学会(活字離れ)資料
出版学会(活字離れ)資料出版学会(活字離れ)資料
出版学会(活字離れ)資料
 
Endocarditis
EndocarditisEndocarditis
Endocarditis
 
Growing great schools: what has the most impact?
Growing great schools: what has the most impact?Growing great schools: what has the most impact?
Growing great schools: what has the most impact?
 
Pénfigo
PénfigoPénfigo
Pénfigo
 
Technology Vision 2017 infographic
Technology Vision 2017 infographicTechnology Vision 2017 infographic
Technology Vision 2017 infographic
 
Agriculture connectée 4.0
Agriculture connectée 4.0Agriculture connectée 4.0
Agriculture connectée 4.0
 
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會
開源 x 節流:企業導入實例分享 (二) [2016/03/31] 文件自由日研討會
 

Similar to Budapest Spark Meetup - Apache Spark @enbrite.ly

Startup Safary | Fight against robots with enbrite.ly data platform
Startup Safary | Fight against robots with enbrite.ly data platformStartup Safary | Fight against robots with enbrite.ly data platform
Startup Safary | Fight against robots with enbrite.ly data platformMészáros József
 
What is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandWhat is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandMaarten Balliauw
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases WSO2
 
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
[@IndeedEng] Logrepo: Enabling Data-Driven Decisionsindeedeng
 
Scaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabScaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabRoman
 
IndexedDB and Push Notifications in Progressive Web Apps
IndexedDB and Push Notifications in Progressive Web AppsIndexedDB and Push Notifications in Progressive Web Apps
IndexedDB and Push Notifications in Progressive Web AppsAdégòkè Obasá
 
E.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformE.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformGregor Jarisch
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Un-broken Logging - Operability.io 2015 - Matthew Skelton
Un-broken Logging - Operability.io 2015 - Matthew SkeltonUn-broken Logging - Operability.io 2015 - Matthew Skelton
Un-broken Logging - Operability.io 2015 - Matthew SkeltonSkelton Thatcher Consulting Ltd
 
Un-broken logging - the foundation of software operability - Operability.io -...
Un-broken logging - the foundation of software operability - Operability.io -...Un-broken logging - the foundation of software operability - Operability.io -...
Un-broken logging - the foundation of software operability - Operability.io -...Matthew Skelton
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics PlatformSrinath Perera
 
Un-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew SkeltonUn-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew SkeltonSkelton Thatcher Consulting Ltd
 
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...ITCamp
 
Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2Matthew McCullough
 
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...GITS Indonesia
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
Large scale data capture and experimentation platform at Grab
Large scale data capture and experimentation platform at GrabLarge scale data capture and experimentation platform at Grab
Large scale data capture and experimentation platform at GrabRoman
 
Flexible Event Tracking (Paul Gebheim)
Flexible Event Tracking (Paul Gebheim)Flexible Event Tracking (Paul Gebheim)
Flexible Event Tracking (Paul Gebheim)MongoSF
 

Similar to Budapest Spark Meetup - Apache Spark @enbrite.ly (20)

Startup Safary | Fight against robots with enbrite.ly data platform
Startup Safary | Fight against robots with enbrite.ly data platformStartup Safary | Fight against robots with enbrite.ly data platform
Startup Safary | Fight against robots with enbrite.ly data platform
 
What is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays FinlandWhat is going on - Application diagnostics on Azure - TechDays Finland
What is going on - Application diagnostics on Azure - TechDays Finland
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases
 
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
 
Scaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at GrabScaling Experimentation & Data Capture at Grab
Scaling Experimentation & Data Capture at Grab
 
IndexedDB and Push Notifications in Progressive Web Apps
IndexedDB and Push Notifications in Progressive Web AppsIndexedDB and Push Notifications in Progressive Web Apps
IndexedDB and Push Notifications in Progressive Web Apps
 
E.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformE.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot Platform
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Un-broken Logging - Operability.io 2015 - Matthew Skelton
Un-broken Logging - Operability.io 2015 - Matthew SkeltonUn-broken Logging - Operability.io 2015 - Matthew Skelton
Un-broken Logging - Operability.io 2015 - Matthew Skelton
 
Un-broken logging - the foundation of software operability - Operability.io -...
Un-broken logging - the foundation of software operability - Operability.io -...Un-broken logging - the foundation of software operability - Operability.io -...
Un-broken logging - the foundation of software operability - Operability.io -...
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
Un-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew SkeltonUn-broken Logging - TechnologyUG - Leeds - Matthew Skelton
Un-broken Logging - TechnologyUG - Leeds - Matthew Skelton
 
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...
The Fine Art of Time Travelling - Implementing Event Sourcing - Andrea Saltar...
 
Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2Google App Engine for Java v0.0.2
Google App Engine for Java v0.0.2
 
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
Gits class #22: [ONLINE] Analyze Your User's Activities Using BigQuery and Da...
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Async
AsyncAsync
Async
 
Large scale data capture and experimentation platform at Grab
Large scale data capture and experimentation platform at GrabLarge scale data capture and experimentation platform at Grab
Large scale data capture and experimentation platform at Grab
 
Flexible Event Tracking (Paul Gebheim)
Flexible Event Tracking (Paul Gebheim)Flexible Event Tracking (Paul Gebheim)
Flexible Event Tracking (Paul Gebheim)
 

Budapest Spark Meetup - Apache Spark @enbrite.ly

  • 1. Apache Spark @enbrite.ly Budapest Spark Meetup March 30, 2016
  • 3. Who are we? Our vision is to revolutionize the KPIs and metrics that the online advertising industry currently uses. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.
  • 4. Agenda ● What do we do? ● How do we do it? - enbrite.ly data platform ● Real world antifraud example ● Lessons learned + Spark at scale +/-
  • 6. How do we do it? DATA COLLECTION
  • 7. How do we do it? DATA PROCESSING
  • 8. Amazon EMR ● Most popular cloud service provider ● Amazon Big Data ecosystem ● Applications: Hadoop, Spark, Hive, …. ● Scaling is easy ● Do not trust the BIG guys (API problem) ● Spark application in EMR runs on YARN (cluster manager) For more information: https://aws.amazon.com/elasticmapreduce/
  • 9. Tools we use https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors A workflow engine that helps you build complex data pipelines of batch jobs. Created by Spotify’s engineering team.
  • 10. Your friendly plumber, which sticks your Hadoop, Spark, … jobs together with simple dependency definition and failure management.
  • 11.
        import luigi

        class SparkMeetupTask(luigi.Task):
            param = luigi.Parameter(default=42)

            def requires(self):
                # SomeOtherTask is assumed to be defined elsewhere.
                return SomeOtherTask(self.param)

            def run(self):
                with self.output().open('w') as f:
                    f.write('Hello Spark meetup!')

            def output(self):
                return luigi.LocalTarget('/meetup/message')

        if __name__ == '__main__':
            luigi.run()
  • 14. Let me tell you a short story...
  • 15. Tools we created GABO LUIGI Luigi + enbrite.ly extensions = Gabo Luigi ● Dynamic task configuration + dependencies ● Reshaped web interface ● Define reusable data pipeline template ● Monitoring for each task
  • 16. Tools we created GABO LUIGI
  • 17. Tools we created GABO LUIGI We plan to release it to the wild and make it open source as part of Spotify’s Luigi! If you are interested, our doors are open :-)
  • 18. Tools we created GABO MARATHON Motivation: Testing with large data sets and slow batch jobs is boring and wasteful!
  • 19. Tools we created GABO MARATHON Graphite
  • 20. Real world example You are fighting against robots and want to humanize the ad tech era. You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!
  • 21. Real world example THE IDEA: Analyse events which are too hasty and deviate from regular, humanlike profiles: too many clicks in a defined timeframe. INPUT: Load balancer access log files on S3 OUTPUT: Print invalid sessions
  • 22. How to solve it? Step 1: convert access log files to events Step 2: sessionize events Step 3: detect too many clicks
  • 23. The way to the access log
        1. Click event attributes (created by JS tracker): { "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click" }
        2. Base64-encoded event: eyJzZXNzaW9uX2lkIjoic3Bhcmtfb​WVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=
        3. Access log format: TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
  • 24. Step 1: log to event. Simplification: the log files are on local storage and contain only click events.
        SparkConf conf = new SparkConf().setAppName("LogToEvent");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
        // 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
  • 25. Step 1: log to event
        // Take the quoted request field; the split limit of 4 keeps "GET <url>" together.
        JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+", 4)[3]);
        // "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
        JavaRDD<String> eventParameter = rawUrls
            .map(u -> parseUrl(u).get("event"));
        // eyJzZXNzaW9uX2lkIj…
        JavaRDD<String> base64Decoded = eventParameter
            .map(e -> new String(Base64.getDecoder().decode(e)));
        // {"session_id": "spark_meetup_jsmmmoq",
        //  "timestamp": 1456080915621, "type": "click"}
        IoUtil.saveAsJsonGzipped(base64Decoded);
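Step 1 relies on a `parseUrl` helper that the slides do not show. The following is a minimal plain-Java sketch (no Spark) of what such a helper and the per-line conversion might look like. `LogToEventSketch`, `logLineToJson` and this `parseUrl` body are hypothetical names and implementations, and the sketch assumes the `event` value is not percent-encoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class LogToEventSketch {

    /** Hypothetical stand-in for the slide's parseUrl: query parameters of the request field. */
    static Map<String, String> parseUrl(String requestField) {
        Map<String, String> params = new HashMap<>();
        int queryStart = requestField.indexOf('?');
        if (queryStart < 0) {
            return params;
        }
        // Drop the closing quote that wraps the request field in the access log.
        String query = requestField.substring(queryStart + 1).replace("\"", "");
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');  // first '=' only, so Base64 padding survives
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Converts one access-log line to the tracker's JSON, or null if no event parameter exists. */
    static String logLineToJson(String line) {
        // Fields: TS CLIENT_IP STATUS "GET <url>"; the limit keeps the request in one piece.
        String request = line.split("\\s+", 4)[3];
        String event = parseUrl(request).get("event");
        if (event == null) {
            return null;
        }
        byte[] decoded = Base64.getDecoder().decode(event);
        return new String(decoded, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) {
        String line = "2016-02-29T23:50:36.269432Z 178.165.132.37 200 "
            + "\"GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIjoic3Bhcmtfb"
            + "WVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2M"
            + "jEsInR5cGUiOiAiY2xpY2sifQo=\"";
        System.out.println(logLineToJson(line));
    }
}
```

In a Spark job the same function could be passed to `map`, with a `filter` afterwards to drop lines that yield null.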
  • 26. Step 2: event to session
        SparkConf conf = new SparkConf().setAppName("EventToSession");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
        JavaRDD<ClickEvent> clickEvents = jsonEvents
            .map(e -> readJsonObject(e));
        JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
            clickEvents.groupBy(e -> e.getSessionId());
        // The sessionizer returns one Session per group, so mapValues is used here.
        JavaPairRDD<String, Session> sessions = groupedEvents
            .mapValues(sessionizer);
  • 27. Step 2: event to session
        // Sessionizer: builds one Session from all click events of a session ID.
        public Session call(Iterable<ClickEvent> clickEvents) {
            List<ClickEvent> ordered = sortByTimestamp(clickEvents);
            Session session = new Session();
            for (ClickEvent event : ordered) {
                session.addClick(event);
            }
            return session;
        }
  • 28. Step 2: event to session
        // Serializable so that Spark can ship Session objects between nodes.
        class Session implements Serializable {
            public boolean isBad = false;
            public List<Long> clickTimestamps = new ArrayList<>();

            public void addClick(ClickEvent e) {
                clickTimestamps.add(e.getTimestamp());
            }

            public void badify() {
                this.isBad = true;
            }
        }
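To try the sessionizing logic of Step 2 without a cluster, the grouping and ordering can be sketched with plain Java collections. This is only an illustration under assumptions: `SessionizeSketch`, its nested `ClickEvent` and `sessionize` are hypothetical names, and a map of sorted timestamps stands in for the slide's `Session` class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SessionizeSketch {

    /** Minimal stand-in for the slide's ClickEvent (hypothetical fields). */
    static final class ClickEvent {
        final String sessionId;
        final long timestamp;

        ClickEvent(String sessionId, long timestamp) {
            this.sessionId = sessionId;
            this.timestamp = timestamp;
        }
    }

    /** Groups events by session ID and sorts each session's click timestamps ascending. */
    static Map<String, List<Long>> sessionize(List<ClickEvent> events) {
        Map<String, List<Long>> sessions = new TreeMap<>();
        for (ClickEvent e : events) {
            // Same role as groupBy(e -> e.getSessionId()) in the Spark version.
            sessions.computeIfAbsent(e.sessionId, id -> new ArrayList<>()).add(e.timestamp);
        }
        for (List<Long> timestamps : sessions.values()) {
            // Same role as sortByTimestamp in the sessionizer.
            Collections.sort(timestamps);
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<ClickEvent> events = new ArrayList<>();
        events.add(new ClickEvent("a", 30L));
        events.add(new ClickEvent("b", 5L));
        events.add(new ClickEvent("a", 10L));
        System.out.println(sessionize(events));
    }
}
```

Unlike `groupBy` in Spark, this keeps everything in one JVM, so it is suitable only for small test inputs.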
  • 29. Step 3: detect bad sessions
        JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
        // Mark sessions with more clicks than THRESHOLD as bad.
        JavaRDD<Session> markedSessions = sessions.map(s -> {
            if (s.clickTimestamps.size() > THRESHOLD) { s.badify(); }
            return s;
        });
        JavaRDD<Session> badSessions = markedSessions
            .filter(s -> s.isBad);
        badSessions.collect().forEach(System.out::println);
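The idea on the earlier slide speaks of too many clicks in a defined timeframe, while Step 3 compares the total click count of a session against THRESHOLD. One possible refinement, sketched below in plain Java, is a sliding-window check over the sorted click timestamps. `ClickBurstSketch`, `hasClickBurst`, `windowMillis` and `maxClicks` are hypothetical names, and this is a variant for illustration, not the implementation from the talk.

```java
import java.util.Arrays;
import java.util.List;

public class ClickBurstSketch {

    /**
     * Returns true if any time window of length windowMillis contains
     * more than maxClicks clicks. Timestamps must be sorted ascending.
     */
    static boolean hasClickBurst(List<Long> sortedTimestamps, long windowMillis, int maxClicks) {
        int start = 0;
        for (int end = 0; end < sortedTimestamps.size(); end++) {
            // Shrink the window from the left until it spans at most windowMillis.
            while (sortedTimestamps.get(end) - sortedTimestamps.get(start) > windowMillis) {
                start++;
            }
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Long> hasty = Arrays.asList(0L, 100L, 200L, 300L);
        List<Long> calm = Arrays.asList(0L, 2000L, 4000L, 6000L);
        System.out.println(hasClickBurst(hasty, 1000L, 3));
        System.out.println(hasClickBurst(calm, 1000L, 3));
    }
}
```

Because the sessionizer already orders events by timestamp, a check like this could run inside the `map` of Step 3 in place of the plain size comparison.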
  • 30. Congratulations! MISSION COMPLETED YOU just saved the world with a simple idea within ~10 minutes.
  • 31. Using Spark pros ● Sparking is fun: community, tools ● Easy to get started ● Language support: Python, Scala, Java, R ● Unified stack: batch, streaming, SQL, ML
  • 32. Using Spark cons ● You need memory, and more memory ● Distributed application, hard to debug ● Hard to optimize
  • 33. Lessons learned ● Do not use the default config, always optimize! ● Eliminate technical debt + automate ● Failures happen, use monitoring from the very first breath + a fault-tolerant implementation ● Sparking is fun, but it is not a hammer for everything
  • 34. Data platform future ● Would like to play with Redshift ● Change data format (Avro, Parquet, …) ● Would like to play with streaming ● Would like to play with Spark 2.0
  • 35. WE ARE HIRING! working @exPrezi office, K9 check out the company in Forbes :-) amazing company culture BUT the real reason ….
  • 36. WE ARE HIRING! … is our mood manager, Bigyó :)