Slides for a talk given by Fede Fernández & Fran Pérez. It covers common tips for improving a Kafka and Spark application: improving table joins, operational parameters such as blockInterval and the number of partitions, serialization, and how byKey operations work under the hood.
9. groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups and combines pairs of data with the same key, based on a reduce operation; values are combined locally on each partition before the shuffle (see the sketch below).
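A minimal Scala sketch of the difference, using an illustrative word count (the data and object names are assumptions, not from the talk). reduceByKey sums within each partition before the shuffle, while groupByKey ships every pair across the network:

import org.apache.spark.{SparkConf, SparkContext}

object ByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("byKey-example").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(word => (word, 1))

    // groupByKey shuffles every (word, 1) pair, then counts on the reducer side.
    val countsWithGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey sums within each partition first, so only partial sums are shuffled.
    val countsWithReduce = pairs.reduceByKey(_ + _)

    countsWithGroup.collect().foreach(println)
    countsWithReduce.collect().foreach(println)
    sc.stop()
  }
}

Both produce the same counts; on large datasets the reduceByKey version moves far less data over the network.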
28. Serializers
● Java’s ObjectOutputStream framework. (Default)
● Custom serializers: extends Serializable & Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Take special care with JodaTime.
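A minimal sketch of switching to Kryo and registering custom classes; the Event case class, app name and settings are illustrative assumptions, not from the talk:

import org.apache.spark.SparkConf

case class Event(id: Long, payload: String)

object KryoExample {
  val conf = new SparkConf()
    .setAppName("kryo-example")
    // Replace the default Java serialization with Kryo.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Fail fast if a class is serialized without having been registered.
    .set("spark.kryo.registrationRequired", "true")
    // Register the custom classes that will be shipped to the executors.
    .registerKryoClasses(Array(classOf[Event]))
}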
30. Tuning: Garbage Collector
• Relevant for applications that rely heavily on memory.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
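A minimal sketch, assuming the executors run on a JVM that supports G1, of enabling the Garbage-First collector and GC logging through Spark configuration (the flag set is illustrative, not a recommendation from the talk):

import org.apache.spark.SparkConf

object GcTuningExample {
  val conf = new SparkConf()
    .setAppName("gc-tuning-example")
    // Switch the executors to G1 and log GC activity so it can be inspected.
    .set(
      "spark.executor.extraJavaOptions",
      "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    )
}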
31. Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores assigned to the application.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
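A worked example of the formula above, under assumed figures (10-second batches, 4 consumers, 20 total cores, 1 partition per core), expressed as a small Scala sketch:

import org.apache.spark.SparkConf

object BlockIntervalExample {
  val bi              = 10000            // batch interval in milliseconds
  val consumers       = 4                // streaming consumers (receivers)
  val cat             = 20               // total cores assigned to the application
  val partitionFactor = 1                // desired partitions per core
  val sparkCores      = cat - consumers  // cores left for Spark processing

  // blockInterval = (bi * consumers) / (pf * sc) = (10000 * 4) / (1 * 16) = 2500 ms
  val blockInterval = (bi * consumers) / (partitionFactor * sparkCores)

  val conf = new SparkConf().set("spark.streaming.blockInterval", s"${blockInterval}ms")
}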
33. Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: interval at which received data is split into blocks before being stored in Spark.
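A worked example of the partitioning formula, reusing the illustrative figures from the blockInterval sketch (4 consumers, 10-second batches, 2500 ms blocks):

object PartitioningExample {
  val consumers     = 4      // streaming consumers (receivers)
  val bi            = 10000  // batch interval in milliseconds
  val blockInterval = 2500   // block interval in milliseconds

  // partitions = consumers * bi / blockInterval = 4 * 10000 / 2500 = 16
  val partitions = consumers * bi / blockInterval
}

With these values each 10-second batch is split into 16 partitions, i.e. one per Spark core in the previous example.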
36. Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel