SlideShare a Scribd company logo
1 of 48
Download to read offline
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Zach Musgrave and Steven Moy
Yelp Inc
Cloud Storage Spring
Cleaning: A Treasure Hunt
#UnifiedAnalytics #SparkAISummit
Agenda today
• What is our treasure?
• Where can we find some clues?
• How do we uncover the treasure map?
• What can we find with it?
!3
Motivation
• Insight into cloud storage patterns
• Justification to delete or transition data
• Classification of data via real-world accesses
!4
#UnifiedAnalytics #SparkAISummit
Motivation
• In the beginning, there was infinite storage.
• Yelp said, "Yes, please!"
• Now, we don't have a problem, per se...
!5
#UnifiedAnalytics #SparkAISummit
Motivation
• Now, we don't have a problem, per se...
• We have dozens of PBs worth of problems :/
!6
What is our treasure?
!7
Spark+AI
Summit
ztm & smoy
What is our treasure?
!8
We keep many things in S3, and we want to lifecycle data
into the right cost tier. How do we make the right decision?
For example, should we archive after 365 days? 400 days?
#UnifiedAnalytics #SparkAISummit
Motivation
• Cloud object storage: It's everywhere!
• S3 continues to eat the world's storage workloads
• No more filing tickets to add hard drives
• As long as you pay the monthly bill, it’s easy...
• Put(key, object)
• Get(key)
!9
Eventually, leadership says...
!10
Why are we spending so much money?
Do we really need tens of petabytes stored in the cloud?
Eventually, leadership says...
!11
Why are we spending so much money?
Do we really need tens of petabytes for immediate access at all times?
Eventually, leadership says...
!12
Why are we spending so much money?
Do we really access all data equally all the time?
And then we say...
!13
If only we knew something - anything! - about engineers' access patterns...
We need a map...
!14
What is the treasure?
• Large storage savings, measured in $$$
• You may end up in your org's quarterly slide deck
• Security implications
• Principle of least privilege made easy easier
• Access trails
• Make decisions without rolling a 12-sided dice
!15
Where are the clues?
• We use AWS S3 as our example
• Your storage provider likely has similar features
• S3 Server Side Logging
• Apache-like access logs for every operation
!16
Where are the clues?
• Use S3?
• Get access logs.
• Where do those go?
• ... Inside S3 ...
• If you enable them.
!17
Where are the clues?
!18
analyze the S3 access logs...
How do we do that?
!19
Let's look at the contents of these logs.
S3 Access Logs
• Non-compressed
• Row-based format
• Files can be very small
• Range from KB to MB
• Dozens - to many thousands - of rows
• Require a complex regex to parse
!20
S3 Access Logs: Example
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d521
8e7cd47ef2be awsexamplebucket [06/Feb/2019:00:00:38
+0000] 192.0.2.3
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d521
8e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING -
"GET /awsexamplebucket?versioning HTTP/1.1" 200 -
113 - 7 - "-" "S3Console/0.4" -
s9lzHYrFp76ZVxRcpX9+5cjAnEH2ROuNkd2BHfIa6UkFVdtjf5mK
R3/eTPFvsiP/XV/VLi31234= SigV2 ECDHE-RSA-AES128-GCM-
SHA256 AuthHeader awsexamplebucket.s3.amazonaws.com
!21
Persistence and structure
!22
• Conversion is a bit expensive
• No need to repeat conversion over 10+ years of data
• May as well store a queryable result
• Determinism: Easy to replicate prior results.
Before we go further
!23
All queries, code fragments, snippets, etc. we show you
Are included in our Github repository!
github.com/Yelp/
aws_logs_to_parquet_converter
So, no need to photograph each slide :)
Where are the clues?
!24
!25
SQL Analytics (AWS Athena)
Columnar Files (Parquet)
Batch Conversion (Spark + Jupyter)
S3 Prefix Scan (boto3 + RDD -> dataframe)
Clue #1 - SQL declarative analysis
• Our developers and analysts speak SQL
• We can execute SQL against flat files
• Lots of different ways to do this
• AWS Athena is easiest for us
!26
Clue #1(a) Nomenclature
• Use your naming conventions
• s3://bucketname/logs/2019/03/04/any_file_name_here.tgz
• Extract object's creation date:
!27
date_parse(
array_join(
regexp_extract_all(key, '/(d+)', 1), '-'
),
'%Y-%m-%d'
) AS dt_written ==> date(2019, 03, 04)
!28
Clue #1(a) Nomenclature
Clue #1(b) Days Apart
• We don't care when accesses occur
• We do care how far in the past the access is
!29
date_diff(
'day',
date_parse(array_join(regexp_extract_all(...... ,
date_trunc('day', request_time)
) AS days_apart,
Clue #1(c) ARN Bingo
• Carefully break cardinality? Meaningful results!
!30
Clue #1(d) Group to Understand
• Any large org has billions of accesses per year
• Need to aggregate accesses carefully
!31
SELECT
requester,
log_name,
count(*) AS access_count,
sum(bytes_sent) AS total_bytes
FROM tmp_workspace WHERE
days_apart > 365
GROUP BY 1, 2
ORDER BY access_count DESC
!32
SQL Analytics (AWS Athena)
Columnar Files (Parquet)
Batch Conversion (Spark + Jupyter)
S3 Prefix Scan (boto3 + RDD -> dataframe)
Clue #2 Columnar File Format
• Many small files? List them all? Impossible.
• Our access logs have over a million keys per prefix!
• Queries don't use all possible columns:
• Amazon's schema has 19 columns
• We usually reference 7 of those
!33
Clue #2 Columnar File Format
• Convert row-based access logs
• Make them snappy-compressed Parquet files!
• Write a converter in Spark:
• Good scalability
• Great support for Parquet file format
!34
!35
SQL Analytics (AWS Athena)
Columnar Files (Parquet)
Batch Conversion (Spark + Jupyter)
S3 Prefix Scan (boto3 + RDD -> dataframe)
Clue #3 S3 is not a filesystem
• Spark's S3 integration is sort of like HDFS
• Many entries at a directory? Can't list them all.
• OutOfMemoryException :'(
• Access log keys
• Prefixed with date information
• Use a prefix-based S3 query
• Paginate? Paginate.
!36
Clue #3 S3 is not a filesystem
!37
Clue #3 S3 is not a filesystem
• List all log files with a date-based prefix
• Limit the number of results from S3
• Constrain your memory utilization
• Generate RDD from the previous results
• Convert the RDD to a dataframe?
• Better ecosystem integrations
!38
Two-step analysis?
• Self-contained Spark job
• Process S3 access logs
• Parquet-ify, then write back to S3
• Query Parquet files with AWS Athena
• We can all write SQL against them
• Accessible in the org-wide Data Lake
!39
Org-wide Considerations
!40
• First time at Yelp all data isn't instantly available
• Glacier bulk restores require ~12 hours
• Faster (15 minute) restores possible for small files
• Resulting developer impact
• Very minor! Under ten restore operations so far.
• Processing time for large computations already measured
in days.
What did we find?
!41
30%
20%
50%
Glacier
Infrequent Access
Standard / Intelligent
What did we find?
• ~20% data - 90 to 400 days old - rarely accessed
• S3 Infrequent Access
• Pay per download
• Still immediately available
• Developer impact: None!
!42
What did we find?
• ~50% data - over 400 days old - archivable
• AWS S3 Glacier!
• Pay to archive (once)
• Pay to retrieve (rarely, usually never)
• Need to wait ~12 hours between request and usage
• Developer impact: Need to plan ahead.
!43
What did we find?
• Total savings to the org
• About 25% of our monthly S3 bill. Gone for good!
• Total developer impact
• Very little. Estimated at ~20 hours per year.
• Total pushback when implementing
• None. Because we had the treasure map!
• Please sign off on this... unless you dislike money.
!44
What did we do?
!45
We applied Spark, Parquet, Jupyter, and Athena to S3 access logs.
We wrote SQL to compute basic statistics.
We decreased Yelp's S3 bucket cost by ~25% going forward.
Did we steal the treasure?
!46
Thanks so much!
!47
All queries, code fragments, snippets, etc.
Are included in our Github repository!
github.com/Yelp/
aws_logs_to_parquet_converter
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
 
Piranha vs. mammoth predator appliances that chew up big data
Piranha vs. mammoth   predator appliances that chew up big dataPiranha vs. mammoth   predator appliances that chew up big data
Piranha vs. mammoth predator appliances that chew up big dataJack (Yaakov) Bezalel
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeVasu S
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesData Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesDenodo
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]shuwutong
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookJames Serra
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake PracticeSamanthaSwain7
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 

What's hot (20)

Data Federation
Data FederationData Federation
Data Federation
 
How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
Piranha vs. mammoth predator appliances that chew up big data
Piranha vs. mammoth   predator appliances that chew up big dataPiranha vs. mammoth   predator appliances that chew up big data
Piranha vs. mammoth predator appliances that chew up big data
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data Lake
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data LakesData Ninja Webinar Series: Realizing the Promise of Data Lakes
Data Ninja Webinar Series: Realizing the Promise of Data Lakes
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake Practice
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 

Similar to Cloud Storage Spring Cleaning: A Treasure Hunt

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...OpenSource Connections
 
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...Amazon Web Services
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldOpenSource Connections
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureChristos Charmatzis
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at nightMichael Yarichuk
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 
Intake at AnacondaCon
Intake at AnacondaConIntake at AnacondaCon
Intake at AnacondaConMartin Durant
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage SystemsSATOSHI TAGOMORI
 
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...Lucidworks
 
RedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedis Labs
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesRose Toomey
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesRose Toomey
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
The Long Road to Archivists' Toolkit
The Long Road to Archivists' ToolkitThe Long Road to Archivists' Toolkit
The Long Road to Archivists' ToolkitMegMason
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 

Similar to Cloud Storage Spring Cleaning: A Treasure Hunt (20)

Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
 
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
AWS re:Invent 2016| GAM302 | Sony PlayStation: Breaking the Bandwidth Barrier...
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
Intake at AnacondaCon
Intake at AnacondaConIntake at AnacondaCon
Intake at AnacondaCon
 
Scalable web architecture
Scalable web architectureScalable web architecture
Scalable web architecture
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Ruby and Distributed Storage Systems
Ruby and Distributed Storage SystemsRuby and Distributed Storage Systems
Ruby and Distributed Storage Systems
 
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
PlayStation and Lucene - Indexing 1M documents per second: Presented by Alexa...
 
RedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.js
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark Pipelines
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelines
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
The Long Road to Archivists' Toolkit
The Long Road to Archivists' ToolkitThe Long Road to Archivists' Toolkit
The Long Road to Archivists' Toolkit
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 

Recently uploaded

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 

Recently uploaded (20)

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 

Cloud Storage Spring Cleaning: A Treasure Hunt

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Zach Musgrave and Steven Moy Yelp Inc Cloud Storage Spring Cleaning: A Treasure Hunt #UnifiedAnalytics #SparkAISummit
  • 3. Agenda today • What is our treasure? • Where can we find some clues? • How do we uncover the treasure map? • What can we find with it? !3
  • 4. Motivation • Insight into cloud storage patterns • Justification to delete or transition data • Classification of data via real-world accesses !4
  • 5. #UnifiedAnalytics #SparkAISummit Motivation • In the beginning, there was infinite storage. • Yelp said, "Yes, please!" • Now, we don't have a problem, per se... !5
  • 6. #UnifiedAnalytics #SparkAISummit Motivation • Now, we don't have a problem, per se... • We have dozens of PBs worth of problems :/ !6
  • 7. What is our treasure? !7 Spark+AI Summit ztm & smoy
  • 8. What is our treasure? !8 We keep many things in S3, and we want to lifecycle data into the right cost tier. How do we make the right decision? For example, should we archive after 365 days? 400 days?
  • 9. #UnifiedAnalytics #SparkAISummit Motivation • Cloud object storage: It's everywhere! • S3 continues to eat the world's storage workloads • No more filing tickets to add hard drives • As long as you pay the monthly bill, it’s easy... • Put(key, object) • Get(key) !9
  • 10. Eventually, leadership says... !10 Why are we spending so much money? Do we really need tens of petabytes stored in the cloud?
  • 11. Eventually, leadership says... !11 Why are we spending so much money? Do we really need tens of petabytes for immediate access at all times?
  • 12. Eventually, leadership says... !12 Why are we spending so much money? Do we really access all data equally all the time?
  • 13. And then we say... !13 If only we knew something - anything! - about engineers' access patterns...
  • 14. We need a map... !14
  • 15. What is the treasure? • Large storage savings, measured in $$$ • You may end up in your org's quarterly slide deck • Security implications • Principle of least privilege made easy easier • Access trails • Make decisions without rolling a 12-sided dice !15
  • 16. Where are the clues? • We use AWS S3 as our example • Your storage provider likely has similar features • S3 Server Side Logging • Apache-like access logs for every operation !16
  • 17. Where are the clues? • Use S3? • Get access logs. • Where do those go? • ... Inside S3 ... • If you enable them. !17
  • 18. Where are the clues? !18 analyze the S3 access logs...
  • 19. How do we do that? !19 Let's look at the contents of these logs.
  • 20. S3 Access Logs • Non-compressed • Row-based format • Files can be very small • Range from KB to MB • Dozens - to many thousands - of rows • Require a complex regex to parse !20
  • 21. S3 Access Logs: Example 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d521 8e7cd47ef2be awsexamplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d521 8e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /awsexamplebucket?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" - s9lzHYrFp76ZVxRcpX9+5cjAnEH2ROuNkd2BHfIa6UkFVdtjf5mK R3/eTPFvsiP/XV/VLi31234= SigV2 ECDHE-RSA-AES128-GCM- SHA256 AuthHeader awsexamplebucket.s3.amazonaws.com !21
  • 22. Persistence and structure !22 • Conversion is a bit expensive • No need to repeat conversion over 10+ years of data • May as well store a queryable result • Determinism: Easy to replicate prior results.
  • 23. Before we go further !23 All queries, code fragments, snippets, etc. we show you Are included in our Github repository! github.com/Yelp/ aws_logs_to_parquet_converter So, no need to photograph each slide :)
  • 24. Where are the clues? !24
  • 25. !25 SQL Analytics (AWS Athena) Columnar Files (Parquet) Batch Conversion (Spark + Jupyter) S3 Prefix Scan (boto3 + RDD -> dataframe)
  • 26. Clue #1 - SQL declarative analysis • Our developers and analysts speak SQL • We can execute SQL against flat files • Lots of different ways to do this • AWS Athena is easiest for us !26
  • 27. Clue #1(a) Nomenclature • Use your naming conventions • s3://bucketname/logs/2019/03/04/any_file_name_here.tgz • Extract object's creation date: !27 date_parse( array_join( regexp_extract_all(key, '/(d+)', 1), '-' ), '%Y-%m-%d' ) AS dt_written ==> date(2019, 03, 04)
  • 29. Clue #1(b) Days Apart • We don't care when accesses occur • We do care how far in the past the access is !29 date_diff( 'day', date_parse(array_join(regexp_extract_all(...... , date_trunc('day', request_time) ) AS days_apart,
  • 30. Clue #1(c) ARN Bingo • Carefully break cardinality? Meaningful results! !30
  • 31. Clue #1(d) Group to Understand • Any large org has billions of accesses per year • Need to aggregate accesses carefully !31 SELECT requester, log_name, count(*) AS access_count, sum(bytes_sent) AS total_bytes FROM tmp_workspace WHERE days_apart > 365 GROUP BY 1, 2 ORDER BY access_count DESC
  • 32. !32 SQL Analytics (AWS Athena) Columnar Files (Parquet) Batch Conversion (Spark + Jupyter) S3 Prefix Scan (boto3 + RDD -> dataframe)
  • 33. Clue #2 Columnar File Format • Many small files? List them all? Impossible. • Our access logs have over a million keys per prefix! • Queries don't use all possible columns: • Amazon's schema has 19 columns • We usually reference 7 of those !33
  • 34. Clue #2 Columnar File Format • Convert row-based access logs • Make them snappy-compressed Parquet files! • Write a converter in Spark: • Good scalability • Great support for Parquet file format !34
  • 35. !35 SQL Analytics (AWS Athena) Columnar Files (Parquet) Batch Conversion (Spark + Jupyter) S3 Prefix Scan (boto3 + RDD -> dataframe)
  • 36. Clue #3 S3 is not a filesystem • Spark's S3 integration is sort of like HDFS • Many entries at a directory? Can't list them all. • OutOfMemoryException :'( • Access log keys • Prefixed with date information • Use a prefix-based S3 query • Paginate? Paginate. !36
  • 37. Clue #3 S3 is not a filesystem !37
  • 38. Clue #3 S3 is not a filesystem • List all log files with a date-based prefix • Limit the number of results from S3 • Constrain your memory utilization • Generate RDD from the previous results • Convert the RDD to a dataframe? • Better ecosystem integrations !38
  • 39. Two-step analysis? • Self-contained Spark job • Process S3 access logs • Parquet-ify, then write back to S3 • Query Parquet files with AWS Athena • We can all write SQL against them • Accessible in the org-wide Data Lake !39
  • 40. Org-wide Considerations !40 • First time at Yelp all data isn't instantly available • Glacier bulk restores require ~12 hours • Faster (15 minute) restores possible for small files • Resulting developer impact • Very minor! Under ten restore operations so far. • Processing time for large computations already measured in days.
  • 41. What did we find? !41 30% 20% 50% Glacier Infrequent Access Standard / Intelligent
  • 42. What did we find? • ~20% data - 90 to 400 days old - rarely accessed • S3 Infrequent Access • Pay per download • Still immediately available • Developer impact: None! !42
  • 43. What did we find? • ~50% data - over 400 days old - archivable • AWS S3 Glacier! • Pay to archive (once) • Pay to retrieve (rarely, usually never) • Need to wait ~12 hours between request and usage • Developer impact: Need to plan ahead. !43
  • 44. What did we find? • Total savings to the org • About 25% of our monthly S3 bill. Gone for good! • Total developer impact • Very little. Estimated at ~20 hours per year. • Total pushback when implementing • None. Because we had the treasure map! • Please sign off on this... unless you dislike money. !44
  • 45. What did we do? !45 We applied Spark, Parquet, Jupyter, and Athena to S3 access logs. We wrote SQL to compute basic statistics. We decreased Yelp's S3 bucket cost by ~25% going forward.
  • 46. Did we steal the treasure? !46
  • 47. Thanks so much! !47 All queries, code fragments, snippets, etc. Are included in our Github repository! github.com/Yelp/ aws_logs_to_parquet_converter
  • 48. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT