Presenting 3 real-life use cases of Apache Beam in production. Code reusability for bounded and unbounded data, as well as running Apache Beam to write into different cloud providers, are some of the topics covered in this presentation.
2. Who we are
● USA-based monetization platform for mobile game developers.
● Growing our engineering office in Barcelona with a strong tech-driven culture:
- Visibility of the impact of your code.
- Not afraid of implementing new tech, like Apache Beam ; )
- Fostering best practices and a true care for code quality.
● Founding year 2011 | +300k mobile game integrations | 900M unique users | 200B ad requests | 100TBs data scale
4. Agenda
● What is Apache Beam
● Basic Requirements
● Production use cases:
○ Ingest data into BigQuery or GCS
○ BigQuery to BigTable
○ BigQuery to S3 (Parquet)
● Dealing with Streaming
● Questions
5. What is Apache Beam
“Apache Beam is an open source,
unified model for defining both batch
and streaming data-parallel processing
pipelines.”
Write once, run anywhere
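To make "write once, run anywhere" concrete, here is a minimal sketch of a Beam pipeline in Java. The runner (Dataflow, Flink, Spark, Direct...) is chosen at launch time via --runner, so the same code runs on any supported engine; the paths below are placeholders, not anything from the actual Chartboost jobs.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // The execution engine is picked via --runner at launch, not in code.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("Read", TextIO.read().from("gs://my-bucket/input/*.json"))      // placeholder path
     .apply("Uppercase", MapElements
         .into(TypeDescriptors.strings())
         .via((String line) -> line.toUpperCase()))
     .apply("Write", TextIO.write().to("gs://my-bucket/output/result"));    // placeholder path

    p.run();
  }
}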
6. Basic Requirements
- Reduce cluster provisioning overhead.
- Create generic and reusable code.
- The architecture must be agnostic of the source (streaming or batch).
- The architecture must support multiple configurable sinks (see the sketch below).
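A minimal sketch of how these requirements might translate into a single options interface with switchable sinks. The option names mirror the parameter tables later in this deck, but the interface itself is an assumption, not the actual Chartboost code.

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options interface: one generic pipeline, configurable sinks.
public interface GenericPipelineOptions extends PipelineOptions {
  @Description("Input path (GCS for batch) or subscription (streaming)")
  String getInputPath();
  void setInputPath(String value);

  @Description("Write results into BigQuery")
  @Default.Boolean(true)
  Boolean getWriteIntoBq();
  void setWriteIntoBq(Boolean value);

  @Description("Write results into GCS")
  @Default.Boolean(false)
  Boolean getWriteIntoGCS();
  void setWriteIntoGCS(Boolean value);
}

In the pipeline's main method, each sink would then be applied conditionally (e.g. if (options.getWriteIntoBq()) { ... }), so the same transform chain can feed any combination of sinks.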
import com.google.api.services.bigquery.model.TableRow;

public class TestTransformation {
  // Uppercases the src_name field of an incoming BigQuery row.
  public TableRow customTransformation(TableRow tr) {
    Object srcName = tr.get("src_name");
    if (srcName != null) {  // guard against rows without the field
      tr.set("src_name", srcName.toString().toUpperCase());
    }
    return tr;
  }
}

// Job instance name constants (cf. the batchJobInstance parameter on slide 17).
class Types {
  static final String TEST = "test";
}
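The Types constant next to the transformation hints at how a job instance name (the batchJobInstance parameter shown on slide 17) might select a transformation. The following is a purely hypothetical sketch of that wiring; the deck does not show the real resolution mechanism.

import com.google.api.services.bigquery.model.TableRow;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a job instance name to its transformation.
class TransformationRegistry {
  private static final Map<String, Function<TableRow, TableRow>> REGISTRY = new HashMap<>();
  static {
    REGISTRY.put(Types.TEST, new TestTransformation()::customTransformation);
  }

  // Unknown instances fall back to the identity transformation.
  static Function<TableRow, TableRow> forInstance(String batchJobInstance) {
    return REGISTRY.getOrDefault(batchJobInstance, tr -> tr);
  }
}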
15. How custom transformations work
@ProcessElement
public void processElement(ProcessContext context) {
  try {
    String element = context.element();
    if (element != null) {
      // Parse the raw JSON payload into the model class.
      JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class);
      if (jsonModel != null) {
        // Map the parsed details to a TableRow, then apply the custom transformation.
        TableRow tr = customTransformation(JsonParseToBigQuery.getInstance()
            .getJsonParse(jsonModel.getDetails(), howToParse));
        if (tr != null) {
          context.output(tr);
        }
      }
    }
  } catch (Exception e) {
    // Log and drop: a single bad record should not fail the whole bundle.
    LOG.error("Error on parsing", e);
  }
}
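For context, this is roughly how such a DoFn is attached to a pipeline. ParDo.of is the real Beam API; ParseJsonFn and the howToParse argument are assumed names based on this deck's parameters, not the actual classes.

import org.apache.beam.sdk.transforms.ParDo;

// Hypothetical wiring: raw JSON strings in, BigQuery TableRows out.
PCollection<String> rawEvents = p.apply("Read", TextIO.read().from(options.getInputPath()));
PCollection<TableRow> rows = rawEvents.apply("ParseAndTransform",
    ParDo.of(new ParseJsonFn(howToParse)));  // ParseJsonFn: assumed DoFn wrapping the processElement above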
16. Main Batch Jobs Parameters
Parameter           Default Value
inputPath           -
isParquet           true
outputTableSpec     -
howToParse          -
disposition         APPEND
isPartitionedTable  true
writeIntoBq         true
writeIntoGCS        false
outputDirectory     -
numShards           20
writeAsParquet      true
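A sketch of how several of these defaults might map onto Beam's BigQueryIO. The mapping is an assumption based on the parameter names (disposition → WriteDisposition, outputTableSpec → destination table, isPartitionedTable → time partitioning); rows and options come from the earlier fragments.

import com.google.api.services.bigquery.model.TimePartitioning;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Hypothetical mapping from the batch parameters to a BigQuery sink.
rows.apply("WriteToBq", BigQueryIO.writeTableRows()
    .to(options.getOutputTableSpec())                                      // outputTableSpec
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)  // disposition=APPEND
    .withTimePartitioning(new TimePartitioning().setType("DAY")));         // isPartitionedTable=true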
17. Example inputs for a batch job
Parameter         Input
inputPath         gs://bucket_x/*.parquet
batchJobInstance  test
outputDirectory   gs://bucket_y/test
outputTableSpec   project:dataset.table
writeIntoGCS      true
parquetSchema     original_app_id:STRING,app_id:LONG
howToParse        {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
18. Main Streaming Jobs Parameters
Parameter            Default Value
streamSource         pub_sub
inputTopic           -
subscription         -
kafkaBrokers         -
kafkaGroupId         -
filePartitionPolicy  DAILY
writeIntoBq          true
writeIntoGCS         false
outputDirectory      -
numShards            20
windowDuration       5m
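A sketch of how streamSource and windowDuration plausibly drive the streaming pipeline. PubsubIO, KafkaIO, and the windowing calls are real Beam APIs; the branching logic and the option getters are assumptions based on the parameter table.

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

// Hypothetical source selection based on streamSource.
PCollection<String> events;
if ("pub_sub".equals(options.getStreamSource())) {
  events = p.apply("ReadPubSub",
      PubsubIO.readStrings().fromSubscription(options.getSubscription()));
} else {
  events = p.apply("ReadKafka", KafkaIO.<String, String>read()
          .withBootstrapServers(options.getKafkaBrokers())
          .withTopic(options.getInputTopic())
          .withKeyDeserializer(StringDeserializer.class)
          .withValueDeserializer(StringDeserializer.class)
          .withoutMetadata())
      .apply(Values.create());
}
// windowDuration=5m: chop the unbounded stream into fixed 5-minute windows before writing files.
events.apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))));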
20. Problems / Tips
- Debugging failures was not always easy.
- If you want to create templates, remember that ValueProviders are only available at runtime (see the sketch below).
- Be careful with non-thread-safe classes.
- Default GCP instance types are okay, but try to use custom ones.
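Because templates only resolve ValueProviders when the job actually runs, templated options need the ValueProvider wrapper. A minimal sketch (the option name is assumed):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

public interface TemplateOptions extends PipelineOptions {
  @Description("Resolved when the template is executed, not when it is built")
  ValueProvider<String> getInputPath();
  void setInputPath(ValueProvider<String> value);
}

// Calling options.getInputPath().get() at graph-construction time fails for templates;
// instead, pass the ValueProvider itself to the IO, e.g. TextIO.read().from(options.getInputPath()).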
23. What is BigTable
“Bigtable is a compressed, high
performance, proprietary data storage
system built on Google File System, Chubby
Lock Service, SSTable (log-structured
storage like LevelDB) and a few other
Google technologies”
Key/value storage
24. BigTable table example
ColumnFamily: info
RowKey            Qualifier (name)   Qualifier (email)    Qualifier (phone)
ofsehn28u492      Bill Green         bgreen@gmail.com     555-958-382
kfgiuiu5937je3    -                  jdoe@gmail.com       555-738-234
iojcou9wujd77     Rick Sanchez       -                    -
25. BigQuery to BigTable job parameters
Parameter             Input
sqlQuery              SELECT X, Y, Z FROM ...
rowKeyMap             -
bqToBtMap             cf:qualifier:something,cf2...
bigTableInstanceId    chartboost
externalSinkProject   project-id-x
bigTableAppProfileId  batch
26. Problems / Tips
● For heavy-load jobs, always use BigTable application profiles (see the write sketch below).
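A hedged sketch of the BigQuery-to-BigTable write path using Beam's BigtableIO. The write() builder calls are real Beam API; TableRowToMutationsFn, the table id, and the use of rowKeyMap/bqToBtMap are assumptions based on the parameter table above.

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

// Convert each TableRow into (rowKey, mutations); in the real job, the field-to-qualifier
// mapping would presumably be driven by the rowKeyMap and bqToBtMap parameters.
PCollection<KV<ByteString, Iterable<Mutation>>> mutations =
    rows.apply("ToMutations", ParDo.of(new TableRowToMutationsFn(rowKeyMap, bqToBtMap)));

mutations.apply("WriteToBigtable", BigtableIO.write()
    .withProjectId(options.getExternalSinkProject())
    .withInstanceId(options.getBigTableInstanceId())   // e.g. chartboost
    .withTableId("my_table"));                         // table id assumed; not shown in the deck

Depending on the Beam version, the application profile (bigTableAppProfileId=batch) can be set through the Bigtable client options, which keeps heavy batch writes isolated from low-latency serving traffic.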
30. What we need to connect to AWS
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import org.apache.beam.sdk.io.aws.options.AwsOptions;

public interface Options extends BigQueryToParquetOptions {
}
----------------------------------------------------------------------------------------------------
// Build static AWS credentials from the pipeline options and hand them to Beam's AwsOptions.
BasicAWSCredentials awsCred = new BasicAWSCredentials(options.getAWSAccessKey(),
    options.getAWSSecretKey());
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred));
options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion());
31. How to read from BigQuery
private static PCollection<TableRow> executeSql(Pipeline p, String sql) {
  // DIRECT_READ uses the BigQuery Storage API and avoids export/extract quotas.
  return p.apply(BigQueryIO.readTableRows().fromQuery(sql)
      .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
      .usingStandardSql());
}
32. Problems / Tips
- Choose the right region in order to reduce latency and cost.
- To avoid extraction quota issues, use DIRECT_READ.
- FileIO only writes in this setup; Parquet goes out through FileIO.write() with a ParquetIO sink (see the sketch below).
- Be careful with complex types (arrays, nested arrays).
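A sketch of the Parquet write to S3 combining FileIO with ParquetIO.sink (both real Beam APIs). The Avro schema, the records collection, and the output bucket are placeholders; the numShards value mirrors the batch parameter table.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;

// records: PCollection<GenericRecord> converted from the BigQuery TableRows;
// avroSchema: org.apache.avro.Schema describing those records.
records
    .setCoder(AvroCoder.of(GenericRecord.class, avroSchema))
    .apply("WriteParquetToS3", FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(avroSchema))   // ParquetIO only provides a sink, hence "FileIO only writes"
        .to("s3://bucket_z/output/")       // placeholder bucket; works once AwsOptions is configured (slide 30)
        .withNumShards(20)
        .withSuffix(".parquet"));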