SlideShare a Scribd company logo
1 of 35
Apache Beam in Production
June 10th, 2020
● USA based monetization platform for mobile game
developers.
● Growing our engineering office in Barcelona with a strong
tech-driven culture:
- Visibility of the impact of your code.
- Not afraid of implementing tech, like Apache beam ; )
- Fostering best practices and a true care for code quality.
+300k
Mobile game
integrations
900M
Unique
Users
200B
Ad request
2011
Founding
year
Who we are
100TBs
Data Scale
Hi, I’m Ferran!
Data Engineer @ Chartboost
+5 years of Big Data experience
Agenda
● What is Apache Beam
● Basic Requirements
● Production use cases:
○ Ingest data into BigQuery or GCS
○ BigQuery to BigTable
○ BigQuery to S3 (Parquet)
● Dealing with Streaming
● Questions
What is Apache Beam
“Apache Beam is an open source,
unified model for defining both batch
and streaming data-parallel processing
pipelines.”
Write once, run anywhere
Basic Requirements
- Reduce cluster provisioning overhead.
- Create a generic and reusable code.
- The architecture must be agnostic of the source (streaming or
batch).
- The architecture must have different configurable sinks.
Text files or Parquet
Ingest data into BQ or GCS
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
outputTableSpec project:dataset.table
batchJobInstance NO_TRANSFORMATION
howToParse {"original_app_id":"original
_app_id","app.id":"app_id",
"src_name":"src_name","id
":"id"}
You have loaded data into BigQuery!
I wanted upper case...
import com.google.api.services.bigquery.model.TableRow;
public class TestTransformation {
public TableRow customTransformation(TableRow tr) {
tr.set("src_name",
tr.get("src_name").toString().toUpperCase());
return tr;
}
}
class Types {
static final String TEST = "test";
}
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
batchJobInstance test
outputTableSpec project:dataset.table
howToParse {"original_app_id":"original
_app_id","app.id":"app_id",
"src_name":"src_name","id
":"id"}
You have loaded custom data into BigQuery!
Core
batchJobInstance
streamingJobInstance
customTransA
customTransB
customTransC
How custom transformations work
How custom transformations work
@ProcessElement
public void processElement(ProcessContext context) {
try {
String element = context.element();
if (element != null) {
JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class);
if (jsonModel != null) {
TableRow tr = customTransformation(JsonParseToBigQuery.getInstance()
.getJsonParse(jsonModel.getDetails(), howToParse));
if (tr != null) {
context.output(tr);
}
}
}
} catch (Exception e) {
LOG.error("Error on parsing", e);
}
}
}
Parameter Default Value
inputPath -
isParquet true
outputTableSpec -
howToParse -
disposition APPEND
isPartitionedTable true
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
writeAsParquet true
Main Batch Jobs Parameters
Parameter Input
inputPath gs://bucket_x/*.parquet
batchJobInstance test
outputDirectory gs://bucket_y/test
outputTableSpec project:dataset.table
writeIntoGCS true
parquetSchema original_app_id:STRING,app_id:LON
G
howToParse {"original_app_id":"original_app_id","
app.id":"app_id","src_name":"src_nam
e","id":"id"}
Parameter Default Value
streamSource pub_sub
inputTopic -
subscription -
kafkaBrokers -
kafkaGroupId -
filePartitionPolicy DAILY
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
windowDuration 5m
Main Streaming Jobs Parameters
Problems / Tips
- Debugging failures was not always easy.
- If you want to create templates, remember, ValueProviders are
only available at Runtime.
- Be careful with non thread safe classes.
- Default GCP instances are okay, but try to use custom ones.
BigQuery to BigTable
Going from SQL to NoSQL
What is BigTable
“Bigtable is a compressed, high
performance, proprietary data storage
system built on Google File System, Chubby
Lock Service, SSTable (log-structured
storage like LevelDB) and a few other
Google technologies”
Key, Value storage
ColumnFamily (info)
RowKey Qualifier (name) Qualifier (email) Qualifier(phone)
ofsehn28u492 Bill Green bgreen@gmail.com 555-958-382
kfgiuiu5937je3 jdoe@gmail.com 555-738-234
iojcou9wujd77 Rick Sanchez
BigTable table example
Parameter Input
sqlQuery SELECT X, Y, Z FROM ...
rowKeyMap -
bqToBtMap cf:qualifier:something,cf2...
bigTableInstanceId chartboost
externalSinkProject project-id-x
bigTableAppProfileId batch
Problems / Tips
● For heavy load jobs always use
BigTable application profiles.
BigQuery to S3 (Parquet)
From BQ Dataflow to S3 Parquet
Parameter Input
AWSRegion us-east-1
inputTableSpec project:dataset.table
outputPath gs://bucket_y/test
successFileName _SUCCESS
schema original_app_id:STRING,app_id:LON
G
writeIntoS3 true
public interface Options extends BigQueryToParquetOptions {
}
----------------------------------------------------------------------------------------------------
BasicAWSCredentials awsCred = new BasicAWSCredentials(options.getAWSAccessKey(),
options.getAWSSecretKey());
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred));
options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion());
What we need to connect to AWS
private static PCollection<TableRow> executeSql(Pipeline p, String sql) {
return
p.apply(BigQueryIO.readTableRows().fromQuery(sql).withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
.usingStandardSql());
}
How to read from BigQuery
Problems / Tips
- Choose the right region in order to reduce latency and cost.
- To avoid extraction quota issues use DIRECT_READ
- FileIO only writes.
- Be careful with complex types (arrays, nested arrays).
Dealing with Streaming
Custom Airflow plugin to trigger Dataflow Jobs
Questions?
Thank you!

More Related Content

Similar to Apache Beam in Production

Apache StreamPipes – Flexible Industrial IoT Management
Apache StreamPipes – Flexible Industrial IoT ManagementApache StreamPipes – Flexible Industrial IoT Management
Apache StreamPipes – Flexible Industrial IoT ManagementApache StreamPipes
 
InterConnect2016: WebApp Architectures with Java and Node.js
InterConnect2016: WebApp Architectures with Java and Node.jsInterConnect2016: WebApp Architectures with Java and Node.js
InterConnect2016: WebApp Architectures with Java and Node.jsChris Bailey
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreMoritz Meister
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with GimelAlluxio, Inc.
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelDeepak Chandramouli
 
Connecting the Dots: Kong for GraphQL Endpoints
Connecting the Dots: Kong for GraphQL EndpointsConnecting the Dots: Kong for GraphQL Endpoints
Connecting the Dots: Kong for GraphQL EndpointsJulien Bataillé
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Codemotion
 
Usable APIs at Scale
Usable APIs at ScaleUsable APIs at Scale
Usable APIs at ScaleTim Burks
 
Spring Data, Jongo & Co.
Spring Data, Jongo & Co.Spring Data, Jongo & Co.
Spring Data, Jongo & Co.Tobias Trelle
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introductionRick Chang
 
apidays LIVE Australia - From micro to macro-coordination through domain-cent...
apidays LIVE Australia - From micro to macro-coordination through domain-cent...apidays LIVE Australia - From micro to macro-coordination through domain-cent...
apidays LIVE Australia - From micro to macro-coordination through domain-cent...apidays
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverFastly
 
How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019Paul Shapiro
 
Semantic Web & TYPO3
Semantic Web & TYPO3Semantic Web & TYPO3
Semantic Web & TYPO3André Wuttig
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Jason Dai
 
Introduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSIntroduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSAmazon Web Services
 
Parse cloud code
Parse cloud codeParse cloud code
Parse cloud code維佋 唐
 

Similar to Apache Beam in Production (20)

Apache StreamPipes – Flexible Industrial IoT Management
Apache StreamPipes – Flexible Industrial IoT ManagementApache StreamPipes – Flexible Industrial IoT Management
Apache StreamPipes – Flexible Industrial IoT Management
 
InterConnect2016: WebApp Architectures with Java and Node.js
InterConnect2016: WebApp Architectures with Java and Node.jsInterConnect2016: WebApp Architectures with Java and Node.js
InterConnect2016: WebApp Architectures with Java and Node.js
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature Store
 
Unified Data Access with Gimel
Unified Data Access with GimelUnified Data Access with Gimel
Unified Data Access with Gimel
 
Data orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | GimelData orchestration | 2020 | Alluxio | Gimel
Data orchestration | 2020 | Alluxio | Gimel
 
Connecting the Dots: Kong for GraphQL Endpoints
Connecting the Dots: Kong for GraphQL EndpointsConnecting the Dots: Kong for GraphQL Endpoints
Connecting the Dots: Kong for GraphQL Endpoints
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
Perchè potresti aver bisogno di un database NoSQL anche se non sei Google o F...
 
Usable APIs at Scale
Usable APIs at ScaleUsable APIs at Scale
Usable APIs at Scale
 
Spring Data, Jongo & Co.
Spring Data, Jongo & Co.Spring Data, Jongo & Co.
Spring Data, Jongo & Co.
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introduction
 
apidays LIVE Australia - From micro to macro-coordination through domain-cent...
apidays LIVE Australia - From micro to macro-coordination through domain-cent...apidays LIVE Australia - From micro to macro-coordination through domain-cent...
apidays LIVE Australia - From micro to macro-coordination through domain-cent...
 
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverAltitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, Ever
 
How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019How to Leverage APIs for SEO #TTTLive2019
How to Leverage APIs for SEO #TTTLive2019
 
Semantic Web & TYPO3
Semantic Web & TYPO3Semantic Web & TYPO3
Semantic Web & TYPO3
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Introduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSIntroduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOS
 
Parse cloud code
Parse cloud codeParse cloud code
Parse cloud code
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...srsj9000
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Recently uploaded (20)

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
Gfe Mayur Vihar Call Girls Service WhatsApp -> 9999965857 Available 24x7 ^ De...
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

Apache Beam in Production

  • 1. Apache Beam in Production June 10th, 2020
  • 2. ● USA based monetization platform for mobile game developers. ● Growing our engineering office in Barcelona with a strong tech-driven culture: - Visibility of the impact of your code. - Not afraid of implementing tech, like Apache beam ; ) - Fostering best practices and a true care for code quality. +300k Mobile game integrations 900M Unique Users 200B Ad request 2011 Founding year Who we are 100TBs Data Scale
  • 3. Hi, I’m Ferran! Data Engineer @ Chartboost +5 years of Big Data experience
  • 4. Agenda ● What is Apache Beam ● Basic Requirements ● Production use cases: ○ Ingest data into BigQuery or GCS ○ BigQuery to BigTable ○ BigQuery to S3 (Parquet) ● Dealing with Streaming ● Questions
  • 5. What is Apache Beam “Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.” Write once, run anywhere
  • 6. Basic Requirements - Reduce cluster provisioning overhead. - Create a generic and reusable code. - The architecture must be agnostic of the source (streaming or batch). - The architecture must have different configurable sinks.
  • 7. Text files or Parquet
  • 8. Ingest data into BQ or GCS
  • 9. Parameter Input inputPath gs://bucket_x/*.parquet isPartitionedTable false disposition TRUNCATE outputTableSpec project:dataset.table batchJobInstance NO_TRANSFORMATION howToParse {"original_app_id":"original _app_id","app.id":"app_id", "src_name":"src_name","id ":"id"}
  • 10. You have loaded data into BigQuery! I wanted upper case...
  • 11. import com.google.api.services.bigquery.model.TableRow; public class TestTransformation { public TableRow customTransformation(TableRow tr) { tr.set("src_name", tr.get("src_name").toString().toUpperCase()); return tr; } } class Types { static final String TEST = "test"; }
  • 12. Parameter Input inputPath gs://bucket_x/*.parquet isPartitionedTable false disposition TRUNCATE batchJobInstance test outputTableSpec project:dataset.table howToParse {"original_app_id":"original _app_id","app.id":"app_id", "src_name":"src_name","id ":"id"}
  • 13. You have loaded custom data into BigQuery!
  • 15. How custom transformations work @ProcessElement public void processElement(ProcessContext context) { try { String element = context.element(); if (element != null) { JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class); if (jsonModel != null) { TableRow tr = customTransformation(JsonParseToBigQuery.getInstance() .getJsonParse(jsonModel.getDetails(), howToParse)); if (tr != null) { context.output(tr); } } } } catch (Exception e) { LOG.error("Error on parsing", e); } } }
  • 16. Parameter Default Value inputPath - isParquet true outputTableSpec - howToParse - disposition APPEND isPartitionedTable true writeIntoBq true writeIntoGCS false outputDirectory - numShards 20 writeAsParquet true Main Batch Jobs Parameters
  • 17. Parameter Input inputPath gs://bucket_x/*.parquet batchJobInstance test outputDirectory gs://bucket_y/test outputTableSpec project:dataset.table writeIntoGCS true parquetSchema original_app_id:STRING,app_id:LON G howToParse {"original_app_id":"original_app_id"," app.id":"app_id","src_name":"src_nam e","id":"id"}
  • 18. Parameter Default Value streamSource pub_sub inputTopic - subscription - kafkaBrokers - kafkaGroupId - filePartitionPolicy DAILY writeIntoBq true writeIntoGCS false outputDirectory - numShards 20 windowDuration 5m Main Streaming Jobs Parameters
  • 19.
  • 20. Problems / Tips - Debugging failures was not always easy. - If you want to create templates, remember, ValueProviders are only available at Runtime. - Be careful with non thread safe classes. - Default GCP instances are okay, but try to use custom ones.
  • 22. Going from SQL to NoSQL
  • 23. What is BigTable “Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB) and a few other Google technologies” Key, Value storage
  • 24. ColumnFamily (info) RowKey Qualifier (name) Qualifier (email) Qualifier(phone) ofsehn28u492 Bill Green bgreen@gmail.com 555-958-382 kfgiuiu5937je3 jdoe@gmail.com 555-738-234 iojcou9wujd77 Rick Sanchez BigTable table example
  • 25. Parameter Input sqlQuery SELECT X, Y, Z FROM ... rowKeyMap - bqToBtMap cf:qualifier:something,cf2... bigTableInstanceId chartboost externalSinkProject project-id-x bigTableAppProfileId batch
  • 26. Problems / Tips ● For heavy load jobs always use BigTable application profiles.
  • 27. BigQuery to S3 (Parquet)
  • 28. From BQ Dataflow to S3 Parquet
  • 29. Parameter Input AWSRegion us-east-1 inputTableSpec project:dataset.table outputPath gs://bucket_y/test successFileName _SUCCESS schema original_app_id:STRING,app_id:LON G writeIntoS3 true
  • 30. public interface Options extends BigQueryToParquetOptions { } ---------------------------------------------------------------------------------------------------- BasicAWSCredentials awsCred = new BasicAWSCredentials(options.getAWSAccessKey(), options.getAWSSecretKey()); options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred)); options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion()); What we need to connect to AWS
  • 31. private static PCollection<TableRow> executeSql(Pipeline p, String sql) { return p.apply(BigQueryIO.readTableRows().fromQuery(sql).withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ) .usingStandardSql()); } How to read from BigQuery
  • 32. Problems / Tips - Choose the right region in order to reduce latency and cost. - To avoid extraction quota issues use DIRECT_READ - FileIO only writes. - Be careful with complex types (arrays, nested arrays).
  • 33. Dealing with Streaming Custom Airflow plugin to trigger Dataflow Jobs