Presenting 3 real-life use cases of Apache Beam in production. Code reusability for bounded and unbounded data, as well as running Apache Beam to write into different cloud providers, are some of the topics covered in this presentation.
2. Who we are
● USA-based monetization platform for mobile game developers.
● Growing our engineering office in Barcelona with a strong tech-driven culture:
- Visibility of the impact of your code.
- Not afraid of implementing new tech, like Apache Beam ; )
- Fostering best practices and a true care for code quality.
● Founding year 2011 | +300k mobile game integrations | 900M unique users | 200B ad requests | 100TBs data scale
4. Agenda
● What is Apache Beam
● Basic Requirements
● Production use cases:
○ Ingest data into BigQuery or GCS
○ BigQuery to BigTable
○ BigQuery to S3 (Parquet)
● Dealing with Streaming
● Questions
5. What is Apache Beam
“Apache Beam is an open source,
unified model for defining both batch
and streaming data-parallel processing
pipelines.”
Write once, run anywhere
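To make "write once, run anywhere" concrete, here is a minimal sketch of a Beam pipeline in Java. The runner (Dataflow, Flink, Spark, Direct...) is chosen at launch time via --runner, so the same code runs on any supported engine; the paths below are placeholders, not anything from the actual Chartboost jobs.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // The execution engine is picked via --runner at launch, not in code.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("Read", TextIO.read().from("gs://my-bucket/input/*.json"))      // placeholder path
     .apply("Uppercase", MapElements
         .into(TypeDescriptors.strings())
         .via((String line) -> line.toUpperCase()))
     .apply("Write", TextIO.write().to("gs://my-bucket/output/result"));    // placeholder path

    p.run();
  }
}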
6. Basic Requirements
- Reduce cluster provisioning overhead.
- Create generic and reusable code.
- The architecture must be agnostic of the source (streaming or batch).
- The architecture must support multiple configurable sinks (see the sketch below).
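A minimal sketch of how these requirements might translate into a single options interface with switchable sinks. The option names mirror the parameter tables later in this deck, but the interface itself is an assumption, not the actual Chartboost code.

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options interface: one generic pipeline, configurable sinks.
public interface GenericPipelineOptions extends PipelineOptions {
  @Description("Input path (GCS for batch) or subscription (streaming)")
  String getInputPath();
  void setInputPath(String value);

  @Description("Write results into BigQuery")
  @Default.Boolean(true)
  Boolean getWriteIntoBq();
  void setWriteIntoBq(Boolean value);

  @Description("Write results into GCS")
  @Default.Boolean(false)
  Boolean getWriteIntoGCS();
  void setWriteIntoGCS(Boolean value);
}

In the pipeline's main method, each sink would then be applied conditionally (e.g. if (options.getWriteIntoBq()) { ... }), so the same transform chain can feed any combination of sinks.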
import com.google.api.services.bigquery.model.TableRow;

public class TestTransformation {
  // Uppercases the src_name field of an incoming BigQuery row.
  public TableRow customTransformation(TableRow tr) {
    Object srcName = tr.get("src_name");
    if (srcName != null) {  // guard against rows without the field
      tr.set("src_name", srcName.toString().toUpperCase());
    }
    return tr;
  }
}

// Job instance name constants (cf. the batchJobInstance parameter on slide 17).
class Types {
  static final String TEST = "test";
}
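The Types constant next to the transformation hints at how a job instance name (the batchJobInstance parameter shown on slide 17) might select a transformation. The following is a purely hypothetical sketch of that wiring; the deck does not show the real resolution mechanism.

import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a job instance name to its transformation.
class TransformationRegistry {
  private static final Map<String, Function<TableRow, TableRow>> REGISTRY = new HashMap<>();
  static {
    REGISTRY.put(Types.TEST, new TestTransformation()::customTransformation);
  }

  // Unknown instances fall back to the identity transformation.
  static Function<TableRow, TableRow> forInstance(String batchJobInstance) {
    return REGISTRY.getOrDefault(batchJobInstance, tr -> tr);
  }
}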
15. How custom transformations work
@ProcessElement
public void processElement(ProcessContext context) {
  try {
    String element = context.element();
    if (element != null) {
      // Parse the raw JSON payload into the model class.
      JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class);
      if (jsonModel != null) {
        // Map the parsed details to a TableRow, then apply the custom transformation.
        TableRow tr = customTransformation(JsonParseToBigQuery.getInstance()
            .getJsonParse(jsonModel.getDetails(), howToParse));
        if (tr != null) {
          context.output(tr);
        }
      }
    }
  } catch (Exception e) {
    // Log and drop: a single bad record should not fail the whole bundle.
    LOG.error("Error on parsing", e);
  }
}
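For context, this is roughly how such a DoFn is attached to a pipeline. ParDo.of is the real Beam API; ParseJsonFn and the howToParse argument are assumed names based on this deck's parameters, not the actual classes.

import org.apache.beam.sdk.transforms.ParDo;

// Hypothetical wiring: raw JSON strings in, BigQuery TableRows out.
PCollection<String> rawEvents = p.apply("Read", TextIO.read().from(options.getInputPath()));
PCollection<TableRow> rows = rawEvents.apply("ParseAndTransform",
    ParDo.of(new ParseJsonFn(howToParse)));  // ParseJsonFn: assumed DoFn wrapping the processElement above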
16. Main Batch Jobs Parameters
Parameter           Default Value
inputPath           -
isParquet           true
outputTableSpec     -
howToParse          -
disposition         APPEND
isPartitionedTable  true
writeIntoBq         true
writeIntoGCS        false
outputDirectory     -
numShards           20
writeAsParquet      true
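A sketch of how several of these defaults might map onto Beam's BigQueryIO. The mapping is an assumption based on the parameter names (disposition → WriteDisposition, outputTableSpec → destination table, isPartitionedTable → time partitioning); rows and options come from the earlier fragments.

import com.google.api.services.bigquery.model.TimePartitioning;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Hypothetical mapping from the batch parameters to a BigQuery sink.
rows.apply("WriteToBq", BigQueryIO.writeTableRows()
    .to(options.getOutputTableSpec())                                      // outputTableSpec
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)  // disposition=APPEND
    .withTimePartitioning(new TimePartitioning().setType("DAY")));         // isPartitionedTable=true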
17. Example inputs for a batch job
Parameter         Input
inputPath         gs://bucket_x/*.parquet
batchJobInstance  test
outputDirectory   gs://bucket_y/test
outputTableSpec   project:dataset.table
writeIntoGCS      true
parquetSchema     original_app_id:STRING,app_id:LONG
howToParse        {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
18. Main Streaming Jobs Parameters
Parameter            Default Value
streamSource         pub_sub
inputTopic           -
subscription         -
kafkaBrokers         -
kafkaGroupId         -
filePartitionPolicy  DAILY
writeIntoBq          true
writeIntoGCS         false
outputDirectory      -
numShards            20
windowDuration       5m
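A sketch of how streamSource and windowDuration plausibly drive the streaming pipeline. PubsubIO, KafkaIO, and the windowing calls are real Beam APIs; the branching logic and the option getters are assumptions based on the parameter table.

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Hypothetical source selection based on streamSource.
PCollection<String> events;
if ("pub_sub".equals(options.getStreamSource())) {
  events = p.apply("ReadPubSub",
      PubsubIO.readStrings().fromSubscription(options.getSubscription()));
} else {
  events = p.apply("ReadKafka", KafkaIO.<String, String>read()
          .withBootstrapServers(options.getKafkaBrokers())
          .withTopic(options.getInputTopic())
          .withKeyDeserializer(StringDeserializer.class)
          .withValueDeserializer(StringDeserializer.class)
          .withoutMetadata())
      .apply(Values.create());
}
// windowDuration=5m: chop the unbounded stream into fixed 5-minute windows before writing files.
events.apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))));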
20. Problems / Tips
- Debugging failures was not always easy.
- If you want to create templates, remember that ValueProviders are only available at runtime (see the sketch below).
- Be careful with non-thread-safe classes.
- Default GCP instance types are okay, but try to use custom ones.
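Because templates only resolve ValueProviders when the job actually runs, templated options need the ValueProvider wrapper. A minimal sketch (the option name is assumed):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

public interface TemplateOptions extends PipelineOptions {
  @Description("Resolved when the template is executed, not when it is built")
  ValueProvider<String> getInputPath();
  void setInputPath(ValueProvider<String> value);
}

// Calling options.getInputPath().get() at graph-construction time fails for templates;
// instead, pass the ValueProvider itself to the IO, e.g. TextIO.read().from(options.getInputPath()).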
23. What is BigTable
“Bigtable is a compressed, high
performance, proprietary data storage
system built on Google File System, Chubby
Lock Service, SSTable (log-structured
storage like LevelDB) and a few other
Google technologies”
Key/value storage
24. BigTable table example
ColumnFamily: info
RowKey            Qualifier (name)   Qualifier (email)    Qualifier (phone)
ofsehn28u492      Bill Green         bgreen@gmail.com     555-958-382
kfgiuiu5937je3    -                  jdoe@gmail.com       555-738-234
iojcou9wujd77     Rick Sanchez       -                    -
25. BigQuery to BigTable job parameters
Parameter             Input
sqlQuery              SELECT X, Y, Z FROM ...
rowKeyMap             -
bqToBtMap             cf:qualifier:something,cf2...
bigTableInstanceId    chartboost
externalSinkProject   project-id-x
bigTableAppProfileId  batch
26. Problems / Tips
● For heavy-load jobs, always use BigTable application profiles (see the write sketch below).
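A hedged sketch of the BigQuery-to-BigTable write path using Beam's BigtableIO. The write() builder calls are real Beam API; TableRowToMutationsFn, the table id, and the use of rowKeyMap/bqToBtMap are assumptions based on the parameter table above.

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

// Convert each TableRow into (rowKey, mutations); in the real job, the field-to-qualifier
// mapping would presumably be driven by the rowKeyMap and bqToBtMap parameters.
PCollection<KV<ByteString, Iterable<Mutation>>> mutations =
    rows.apply("ToMutations", ParDo.of(new TableRowToMutationsFn(rowKeyMap, bqToBtMap)));

mutations.apply("WriteToBigtable", BigtableIO.write()
    .withProjectId(options.getExternalSinkProject())
    .withInstanceId(options.getBigTableInstanceId())   // e.g. chartboost
    .withTableId("my_table"));                         // table id assumed; not shown in the deck

Depending on the Beam version, the application profile (bigTableAppProfileId=batch) can be set through the Bigtable client options, which keeps heavy batch writes isolated from low-latency serving traffic.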
30. What we need to connect to AWS
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import org.apache.beam.sdk.io.aws.options.AwsOptions;

public interface Options extends BigQueryToParquetOptions {
}
----------------------------------------------------------------------------------------------------
// Build static AWS credentials from the pipeline options and hand them to Beam's AwsOptions.
BasicAWSCredentials awsCred = new BasicAWSCredentials(options.getAWSAccessKey(),
    options.getAWSSecretKey());
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred));
options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion());
31. How to read from BigQuery
private static PCollection<TableRow> executeSql(Pipeline p, String sql) {
  // DIRECT_READ uses the BigQuery Storage API and avoids export/extract quotas.
  return p.apply(BigQueryIO.readTableRows().fromQuery(sql)
      .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
      .usingStandardSql());
}
32. Problems / Tips
- Choose the right region in order to reduce latency and cost.
- To avoid extraction quota issues, use DIRECT_READ.
- FileIO only writes in this setup; Parquet goes out through FileIO.write() with a ParquetIO sink (see the sketch below).
- Be careful with complex types (arrays, nested arrays).
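A sketch of the Parquet write to S3 combining FileIO with ParquetIO.sink (both real Beam APIs). The Avro schema, the records collection, and the output bucket are placeholders; the numShards value mirrors the batch parameter table.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;

// records: PCollection<GenericRecord> converted from the BigQuery TableRows;
// avroSchema: org.apache.avro.Schema describing those records.
records
    .setCoder(AvroCoder.of(GenericRecord.class, avroSchema))
    .apply("WriteParquetToS3", FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(avroSchema))   // ParquetIO only provides a sink, hence "FileIO only writes"
        .to("s3://bucket_z/output/")       // placeholder bucket; works once AwsOptions is configured (slide 30)
        .withNumShards(20)
        .withSuffix(".parquet"));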