The document discusses Recursion Pharmaceuticals' migration of their image analysis pipeline from an on-premise Kafka Streams solution to a cloud-based workflow using Dagger, a workflow library built on Kafka Streams. The pipeline processes large volumes of high-throughput cell imaging data to extract features at different levels and perform downstream analysis. The migration addressed issues with the original system's batch processing approach and lack of real-time feedback. Examples are provided of implementing a sample workflow to extract site-level features using both low-level Kafka Streams APIs and the higher-level Dagger abstraction.
Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Pharmaceuticals), Kafka Summit SF 2019
1. Discovering Drugs with Kafka Streams
Ben Mabey, VP of Engineering (@bmabey)
Scott Nielsen, Director of Data Engineering
Kafka Summit SF 2019
42. Generate thumbnails
Image metrics
Fire and forget
Stream images to S3
Extract features
Metrics, models, reports, etc.
On-premise
Process experiments in batch
69. Migration Goals
Move orchestration and processing to cloud.
Faster feedback and less bursty workloads.
Preserve existing micro-services logic.
Make cheaper.
70. Let’s take a look at the logical pipeline that we needed to implement…
81. Experiment A
Images / channel level
thumbnails
image level metrics
site (all channels/images)
site level features
site metrics
well level features
metrics
plate level features
metrics
experiment features
metrics, models, reports, etc.
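The nesting of levels in this slide (image, site, well, plate, experiment) can be read as a rollup over hierarchical keys. The following is a minimal sketch in plain Python, not Recursion's code: the `ImageKey` fields, the `rollup` helper, and the use of a simple mean as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class ImageKey(NamedTuple):
    """Hypothetical address of one channel image."""
    experiment: str
    plate: int
    well: str
    site: int
    channel: int


# Number of leading key fields that identify each level of the pipeline.
LEVELS = {"experiment": 1, "plate": 2, "well": 3, "site": 4}


def rollup(values, level):
    """Average per-image values up to the requested level.

    `values` maps ImageKey -> float. The mean is a stand-in for whatever
    aggregation the real services perform.
    """
    depth = LEVELS[level]
    groups = defaultdict(list)
    for key, value in values.items():
        groups[tuple(key[:depth])].append(value)
    return {group: mean(vals) for group, vals in groups.items()}


# Made-up per-image metric values for two wells of one plate.
image_metrics = {
    ImageKey("A", 1, "B02", 1, 1): 2.0,
    ImageKey("A", 1, "B02", 1, 2): 4.0,
    ImageKey("A", 1, "B02", 2, 1): 6.0,
    ImageKey("A", 1, "C03", 1, 1): 10.0,
}

site_metrics = rollup(image_metrics, "site")
well_metrics = rollup(image_metrics, "well")
```

Truncating the key is what moves a record up one level, so the same helper serves site, well, plate, and experiment aggregation.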
88. 🗡 dagger 🗡
A workflow library written on top of Kafka Streams that orchestrates microservices.
Dagger, ya know, because it is all about the workflows represented as directed acyclic graphs, i.e. DAGs.
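To make the DAG idea concrete, here is a small sketch of a workflow expressed as a dependency graph and executed in dependency order. It uses Python's standard `graphlib` module (Python 3.9+) and is not Dagger's API, which this transcript does not show. The task names are labels borrowed from the pipeline diagram, and `run` and `handlers` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. The names are
# illustrative labels from the pipeline diagram, not Dagger identifiers.
workflow = {
    "thumbnails": {"images"},
    "image_metrics": {"images"},
    "site_features": {"images"},
    "well_features": {"site_features"},
    "plate_features": {"well_features"},
    "experiment_features": {"plate_features"},
}


def run(workflow, handlers):
    """Run every task once, after all of its dependencies have run.

    `handlers` maps a task name to a zero-argument callable; tasks
    without a handler are treated as no-ops in this sketch.
    """
    completed = []
    for task in TopologicalSorter(workflow).static_order():
        handlers.get(task, lambda: None)()
        completed.append(task)
    return completed


order = run(workflow, handlers={})
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, which is the "acyclic" part of the name enforced at definition time.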
93. Core logic in library is ~2800 LOC
All of our DAGs, including schema, task, and workflow definition: ~1700 LOC
New workflow system in 2017?
Not Invented Here syndrome?
96. Let’s look at a small workflow using Kafka Streams initially…
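The code slides for this workflow are not part of this transcript, so what follows is only a plain-Python analogy of the kind of stateful step a site-level workflow needs: collect channel images per site and emit the site once all of its channels have arrived. It does not use the Kafka Streams API. `SiteKey`, `ImageEvent`, `complete_sites`, and the value of `CHANNELS_PER_SITE` are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

SiteKey = Tuple[str, int]              # hypothetical (well, site) identifier
ImageEvent = Tuple[SiteKey, int, str]  # (site key, channel, image path)

CHANNELS_PER_SITE = 3  # assumption for the example


def complete_sites(
    events: Iterable[ImageEvent],
) -> Iterator[Tuple[SiteKey, Dict[int, str]]]:
    """Yield a site as soon as every one of its channels has arrived.

    `pending` plays the role a keyed state store would play in a stream
    processor: it holds partial sites between events.
    """
    pending: Dict[SiteKey, Dict[int, str]] = {}
    for site, channel, path in events:
        channels = pending.setdefault(site, {})
        channels[channel] = path  # a redelivered image overwrites itself
        if len(channels) == CHANNELS_PER_SITE:
            yield site, pending.pop(site)


# Made-up event stream: site 1 completes, site 2 is still waiting.
events = [
    (("B02", 1), 1, "b02_s1_c1.png"),
    (("B02", 2), 1, "b02_s2_c1.png"),
    (("B02", 1), 2, "b02_s1_c2.png"),
    (("B02", 1), 2, "b02_s1_c2.png"),  # duplicate delivery
    (("B02", 1), 3, "b02_s1_c3.png"),
]

ready = list(complete_sites(events))
```

Emitting each site as it completes, rather than waiting for the whole experiment, is the behavior that matches the stated goal of faster feedback and less bursty workloads.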
154. Migration Goals
✓ Move orchestration and processing to cloud.
✓ Faster feedback and less bursty workloads.
✓ Preserve existing micro-services logic.
✓ Make cheaper.
EC2 and Lambda -> Google Cloud preemptibles.