Ben Mabey
VP of Engineering
@bmabey
Discovering Drugs
with Kafka Streams
Scott Nielsen
Director of Data Engineering
Kafka Summit SF 2019
Decoding Biology
to Radically Improve Lives
© 2017 Recursion Pharmaceuticals
1000s of untreated
genetic diseases
[Chart: Moore's Law. Transistor Area (% of 1970 values), log scale, 1971-2015]
[Chart: Eroom's Law. R&D Spend / Drug (% of 2007 values), log scale, 1971-2010]
[Chart: Number of Drugs Approved in US (1993-2016)]
How can we fix this?
RecursionPharma.com
Over 7 million per week
hoechst (DNA)
concanavalin A (ER)
mitotracker (mitochondria)
WGA (golgi apparatus, cell membrane)
SYTO 14 (RNA, nucleoli)
phalloidin (actin fibers)
combined
How do these pretty
pictures help?
Healthy child: healthy cells
Child with rare genetic disease (Cornelia de Lange Syndrome): genetic disease model cells
Healthy | Disease | Disease + Drug?
Learn more…
Public Dataset: http://rxrx.ai
Nature article: "Machine learning brings cell imaging promises into focus", https://tinyurl.com/ml-cells
How is this data
produced?
308 wells/plate
4 sites/well
6 channels (images)/site
7,392 images per plate
~69GB per plate
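The plate figures on these slides multiply out as shown below. The per-image size is not stated in the talk; it is derived here from the rounded ~69GB figure, assuming decimal units, so treat it as a rough estimate only.

```latex
\begin{align*}
308 \ \text{wells} \times 4 \ \text{sites/well} &= 1{,}232 \ \text{sites per plate} \\
1{,}232 \ \text{sites} \times 6 \ \text{channels/site} &= 7{,}392 \ \text{images per plate} \\
\frac{69 \ \text{GB}}{7{,}392 \ \text{images}} &\approx 9.3 \ \text{MB per image}
\end{align*}
```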
Experiment A Experiment B Experiment C
Experiment D
Our “Series A” System
On-Premise
Stream images to S3
Generate thumbnails
Image metrics
Fire and forget
Experiment A
Extract Features
Process experiments in batch
metrics, models, reports, etc
Traditional, low-throughput biology
~6-12 plates per week, ~400-800GB
High-throughput
experiments
Robots
[Chart: plates and data volume over time]
100 plates, 6.9TB
300 plates, 20TB
Kafka Streams solution was launched
700 plates, 48TB
1,300 plates, 90TB
1,700 plates, 118TB
1,900 plates, 132TB
Today: 280TB
So what was wrong with the
original system?
Experiment A Experiment B Experiment C
Experiment D
Plates are not imaged in order
Migration Goals
Move orchestration and processing to cloud.
Faster feedback and less bursty workloads.
Preserve existing micro-services logic.
Make cheaper.
Let’s take a look at the
logical pipeline that we
needed to implement…
Images / channel level
  image level metrics
site (all channels/images)
  thumbnails
  site level features
  site metrics
well level features
  metrics
plate level features
  metrics
experiment features (Experiment A)
  metrics, models, reports, etc
Kafka Streams
was just released…
🗡 dagger 🗡
workflow library written on top of Kafka Streams that orchestrates microservices
Dagger, ya know, because it is all about the workflows represented as directed acyclic graphs, i.e. DAGs.
New workflow system in 2017?
Not Invented Here syndrome?
Core logic in library is ~2800 LOC
All of our DAGs, including schema, task, and workflow definition: ~1700 LOC
Let’s look at a small workflow
using Kafka Streams initially…
images_channel topic + experiment_metadata topic (table) -> images_site stream -> extract site features -> extracted_features topic
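// ExperimentMetadata, ChannelLevel, Site and AggState are application classes not shown on the slides.
final KTable<String, ExperimentMetadata> experimentMetadata = builder.table(
    EXPERIMENT_METADATA_TOPIC);
final KStream<String, ChannelLevel> images = builder.stream(
    CHANNEL_IMAGES_TOPIC);
final KStream<String, Site> sites = images
    // Group the channel images that belong to the same site.
    .groupBy((exp, channel) -> channel.site())
    // A session stays open while channels of that site keep arriving.
    .windowedBy(SessionWindows.with(Duration.ofHours(SESSION_WINDOW_HOURS)))
    .aggregate(
        () -> new AggState(),
        (site, channel, agg) -> agg.observe(channel.site(), channel.channel),
        (site, agg_a, agg_b) -> agg_a.merge(agg_b))
    // Drop the window from the key; merged sessions emit null tombstones.
    .toStream((windowedSite, agg) -> windowedSite.key())
    .filter((site, agg) -> agg != null)
    // Re-key by experiment id to match the metadata table (experimentId() is an assumed accessor).
    .selectKey((site, agg) -> agg.experimentId())
    .join(experimentMetadata,
        (agg, expMeta) -> agg.markCompleted(expMeta.numChannels))
    // KStream offers filter(), not filterValues(); keep only sites with every channel present.
    .filter((exp, agg) -> agg.isComplete())
    .mapValues(agg -> agg.site());
sites.to(SITE_IMAGES_TOPIC);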
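The topology above leans on an `AggState` class that the slides never show. A minimal sketch of what such a class could look like follows, using only the Java standard library. The field names, the `String` site id, and the `int` channel number are all assumptions made for illustration; the real class may differ.

```java
import java.util.HashSet;
import java.util.Set;

/** Small demo: channels arrive out of order and in two separate sessions. */
class AggStateDemo {
    public static void main(String[] args) {
        AggState first = new AggState().observe("plate7/B02/site3", 4).observe("plate7/B02/site3", 1);
        AggState second = new AggState().observe("plate7/B02/site3", 2).observe("plate7/B02/site3", 4);
        AggState merged = first.merge(second).markCompleted(6);
        System.out.println(merged.isComplete()); // false: only channels 1, 2 and 4 of 6 seen
        for (int c : new int[] {6, 3, 5}) {
            merged.observe("plate7/B02/site3", c);
        }
        System.out.println(merged.isComplete()); // true: all 6 channels seen
    }
}

/** Hypothetical stand-in for the AggState used in the topology; not the talk's code. */
final class AggState {
    private final Set<Integer> seenChannels = new HashSet<>();
    private String siteId;               // site these channel images belong to
    private int expectedChannels = -1;   // unknown until experiment metadata is joined

    /** Record one channel image for a site; repeated channels are counted once. */
    AggState observe(String site, int channel) {
        siteId = site;
        seenChannels.add(channel);
        return this;
    }

    /** Combine two partial sessions for the same site. */
    AggState merge(AggState other) {
        if (siteId == null) {
            siteId = other.siteId;
        }
        seenChannels.addAll(other.seenChannels);
        return this;
    }

    /** Attach the channel count taken from the experiment metadata. */
    AggState markCompleted(int numChannels) {
        expectedChannels = numChannels;
        return this;
    }

    /** True once every expected channel has been observed. */
    boolean isComplete() {
        return expectedChannels > 0 && seenChannels.size() >= expectedChannels;
    }

    String site() {
        return siteId;
    }
}
```

Tracking a set of channel numbers rather than a simple counter keeps the state correct when an image event is delivered more than once.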
Kafka Streams App and External Service communicate through two topics:
task input topic
task output topic
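The hand-off between the Kafka Streams app and an external service through a task input topic and a task output topic can be pictured with the toy model below. It is a sketch under loud assumptions: in-memory queues stand in for Kafka topics (so nothing here is durable), and the class and method names are invented for illustration rather than taken from dagger or the Kafka API.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

class ExternalTaskSketch {
    // In-memory stand-ins for the two topics; real topics are durable logs.
    final Queue<Map.Entry<String, String>> taskInput = new ArrayDeque<>();
    final Queue<Map.Entry<String, String>> taskOutput = new ArrayDeque<>();

    /** Streams-app side: publish a job for the external service. */
    void submit(String key, String payload) {
        taskInput.add(Map.entry(key, payload));
    }

    /** Service side: consume every pending job and publish its result under the same key. */
    void runService(Function<String, String> work) {
        Map.Entry<String, String> job;
        while ((job = taskInput.poll()) != null) {
            taskOutput.add(Map.entry(job.getKey(), work.apply(job.getValue())));
        }
    }

    /** Streams-app side: read results back, keyed like the jobs that produced them. */
    Map<String, String> collect() {
        Map<String, String> results = new LinkedHashMap<>();
        Map.Entry<String, String> done;
        while ((done = taskOutput.poll()) != null) {
            results.put(done.getKey(), done.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        ExternalTaskSketch topics = new ExternalTaskSketch();
        topics.submit("site-1", "img-a");
        topics.submit("site-2", "img-b");
        // The "service" here is a placeholder function, not real feature extraction.
        topics.runService(payload -> "features(" + payload + ")");
        System.out.println(topics.collect());
    }
}
```

Because the job and its result share a key, the app can match results to the stream records that triggered them without the service knowing anything about the workflow.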
How would you do the same
workflow in dagger?
Input topics & tables
Stream operations
Tasks
Output topics
{
  "name": "extract-site-level-features",
  "graph": {
    "images-channel":
      {"type": "topic-stream", "topic-name": "images_channels"},
    "experiment-metadata":
      {"type": "topic-table", "topic-name": "experiment_metadata"},
    "images-site":
      {"type": "stream-operation",
       "key-schema": "long", "value-schema": "job_site_level",
       "inputs": ["images-channel", "experiment-metadata"],
       "function": "aggregations/images-site-grouping"},
    "features-site":
      {"type": "external-task",
       "stream": "images-site",
       "task-name": "extract-features"},
    "features-output":
      {"type": "publish",
       "topic-name": "extracted_features",
       "stream": "features-site"}
  }
}
Specify function to be used
{:name "extract-site-level-features",
 :graph
 {:images-channel
  {:type :topic-stream, :topic-name "images_channels"},
  :experiment-metadata
  {:type :topic-table, :topic-name "experiment_metadata"},
  :images-site
  {:type :stream-operation,
   :key-schema :long, :value-schema "job_site_level",
   :inputs [:images-channel, :experiment-metadata],
   :function (fn [images-channel experiment-metadata] …)},
  :features-site
  {:type :external-task,
   :task-name "extract-features",
   :stream :images-site},
  :features-output
  {:type :publish,
   :stream :features-site,
   :topic-name "extracted_features"}}}
Inline function directly
Dagger is a compiler
🗡(workflow definition) -> Kafka Streams Topology
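One step any compiler from a graph definition to a topology needs is an order in which to build the nodes, so that each node is created after the streams it reads from. The sketch below shows that step with Kahn's algorithm on the node names from the example workflow. It is not dagger's implementation, which the slides do not show; the class name, method name, and the map-of-inputs representation are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DaggerCompileOrder {
    /** Returns node names so that every node appears after all of its inputs. */
    static List<String> buildOrder(Map<String, List<String>> inputsOf) {
        Map<String, Integer> remaining = new LinkedHashMap<>();   // unbuilt inputs per node
        Map<String, List<String>> dependents = new HashMap<>();   // input -> nodes reading it
        for (Map.Entry<String, List<String>> e : inputsOf.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
            for (String input : e.getValue()) {
                if (!inputsOf.containsKey(input)) {
                    throw new IllegalArgumentException("unknown input: " + input);
                }
                dependents.computeIfAbsent(input, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());   // source topics and tables have no inputs
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String d : dependents.getOrDefault(node, List.of())) {
                if (remaining.merge(d, -1, Integer::sum) == 0) {
                    ready.add(d);
                }
            }
        }
        if (order.size() != inputsOf.size()) {
            throw new IllegalStateException("workflow graph has a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        // Inputs per node, mirroring the "inputs" and "stream" entries of the example workflow.
        Map<String, List<String>> graph = new LinkedHashMap<>();
        graph.put("features-output", List.of("features-site"));
        graph.put("features-site", List.of("images-site"));
        graph.put("images-site", List.of("images-channel", "experiment-metadata"));
        graph.put("experiment-metadata", List.of());
        graph.put("images-channel", List.of());
        System.out.println(buildOrder(graph));
    }
}
```

The cycle check matters because a workflow definition is only valid as a DAG; a definition whose nodes feed each other in a loop is rejected before any topology is built.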
What would the entire pipeline
look like in dagger?
[Diagram: the full pipeline, from channel-level images through site, well, plate, and experiment features to metrics, models, reports, etc]
🗡 🗡
Our pipeline application that uses Dagger
What does the whole system
look like now?
On-Premise
Uploader
dagger is used here too!
Publish Image Events
🗡 🗡
Microservices
Publishers & Consumers
Autoscaled Workers
🗡 Transform & Load
BigQuery
SQL
Migration Goals
✓ Move orchestration and processing to cloud.
✓ Faster feedback and less bursty workloads.
✓ Preserve existing micro-services logic.
✓ Make cheaper.
EC2 and Lambda -> Google Cloud preemptibles.
Big data, small metadata…
Lessons learned…
Early Adopter Tax
Missed out on mature
workflow monitoring
Easy deployment!
Kafka Streams App External Service
task input topic
task output topic
Durable Log FTW
Thank you!
🗡
Come help us decode biology!
@RecursionPharma @bmabey

More Related Content

Similar to Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Pharmaceutical) Kafka Summit SF 2019

Elastic{ON} Seminar New York (2017)
Elastic{ON} Seminar New York (2017)Elastic{ON} Seminar New York (2017)
Elastic{ON} Seminar New York (2017)Franklin Angulo
 
Accelerate Your OpenStack Deployment
Accelerate Your OpenStack Deployment Accelerate Your OpenStack Deployment
Accelerate Your OpenStack Deployment NetApp
 
IVS CTO Night And Day 2018 Winter - AWS Startup Tech Office Hours
IVS CTO Night And Day 2018 Winter - AWS Startup Tech Office HoursIVS CTO Night And Day 2018 Winter - AWS Startup Tech Office Hours
IVS CTO Night And Day 2018 Winter - AWS Startup Tech Office HoursAmazon Web Services Japan
 
Tektronix keithley Product and Application update Q2 2016
Tektronix keithley Product and Application update Q2 2016Tektronix keithley Product and Application update Q2 2016
Tektronix keithley Product and Application update Q2 2016Jeff Sable
 
EnviroInsite training workshop - Overview of EnviroInsite Features
EnviroInsite training workshop - Overview of EnviroInsite FeaturesEnviroInsite training workshop - Overview of EnviroInsite Features
EnviroInsite training workshop - Overview of EnviroInsite FeaturesBruce Jacobs
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data PipelinesLynn Langit
 
Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to ...
Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to ...Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to ...
Scaling your Kafka streaming pipeline can be a pain - but it doesn’t have to ...HostedbyConfluent
 
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015Bipin Singh
 
DSR microservices
DSR microservicesDSR microservices
DSR microservicesSteve Upton
 
Scylla Summit 2018: OLAP or OLTP? Why Not Both?
Scylla Summit 2018: OLAP or OLTP? Why Not Both?Scylla Summit 2018: OLAP or OLTP? Why Not Both?
Scylla Summit 2018: OLAP or OLTP? Why Not Both?ScyllaDB
 
Production Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldProduction Readiness Strategies in an Automated World
Production Readiness Strategies in an Automated WorldSean Chittenden
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Enterprise Search Summit - Speeding Up Search
Enterprise Search Summit - Speeding Up SearchEnterprise Search Summit - Speeding Up Search
Enterprise Search Summit - Speeding Up SearchAzul Systems Inc.
 
AWS Summit Berlin 2013 - Keynote - 6wunderkinder
AWS Summit Berlin 2013 - Keynote - 6wunderkinderAWS Summit Berlin 2013 - Keynote - 6wunderkinder
AWS Summit Berlin 2013 - Keynote - 6wunderkinderAWS Germany
 
[CVPR2020] Simple but effective image enhancement techniques
[CVPR2020] Simple but effective image enhancement techniques[CVPR2020] Simple but effective image enhancement techniques
[CVPR2020] Simple but effective image enhancement techniquesJaeJun Yoo
 
BPippey_resume 2 Pg
BPippey_resume 2 PgBPippey_resume 2 Pg
BPippey_resume 2 PgBill Pippey
 
Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Netflix Edge Engineering Open House Presentations - June 9, 2016
Netflix Edge Engineering Open House Presentations - June 9, 2016Daniel Jacobson
 
Scaling, Tuning and Maintaining the Monolith
Scaling, Tuning and Maintaining the MonolithScaling, Tuning and Maintaining the Monolith
Scaling, Tuning and Maintaining the MonolithRoss McFadyen
 
Applying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing CodeApplying Java 8 Idioms to Existing Code
Applying Java 8 Idioms to Existing CodeC4Media
 

Discovering Drugs with Kafka Streams (Ben Mabey & Scott Nielsen, Recursion Pharmaceutical) Kafka Summit SF 2019

  • 1. Discovering Drugs with Kafka Streams. Ben Mabey, VP of Engineering (@bmabey); Scott Nielsen, Director of Data Engineering. Kafka Summit SF 2019
  • 5. © 2017 Recursion Pharmaceuticals 1000s of untreated genetic diseases Photo of our wall?
  • 7. [Chart: Moore’s Law. Transistor area (% of 1970 values), 1971 to 2015, alongside R&D spend per drug (% of 2007 values), 1971 to 2010]
  • 8. [Chart: Moore’s Law versus Eroom’s Law. Transistor area (% of 1970 values), 1971 to 2015, alongside R&D spend per drug (% of 2007 values), 1971 to 2010]
  • 10. How can we fix this?
  • 20. How do these pretty pictures help?
  • 21. Healthy child Child with rare genetic disease (Cornelia de Lange Syndrome)
  • 22. Healthy child Healthy cells Child with rare genetic disease (Cornelia de Lange Syndrome) Genetic disease model cells (Cornelia de Lange Syndrome)
  • 25. Learn more… Public dataset (http://rxrx.ai) and Nature article “Machine learning brings cell imaging promises into focus” (https://tinyurl.com/ml-cells)
  • 26. How is this data produced?
  • 29. 308 wells/plate × 4 sites/well × 6 channels (images)/site = 7,392 images per plate
  • 30. 308 wells/plate × 4 sites/well × 6 channels (images)/site = 7,392 images per plate, ~69GB per plate
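The per-plate numbers on the slide follow from simple multiplication, and the ~69GB figure implies a rough average image size. The sketch below reproduces that arithmetic; the class and method names are illustrative, and the decimal conversion (1 GB = 1000 MB) is an assumption, since the slide does not say which unit convention it uses.

```java
// Sketch of the per-plate arithmetic from the slide. Names are illustrative only.
final class PlateMath {

    // wells/plate x sites/well x channels (images)/site
    static int imagesPerPlate(int wellsPerPlate, int sitesPerWell, int channelsPerSite) {
        return wellsPerPlate * sitesPerWell * channelsPerSite;
    }

    // Average size of one image in MB, assuming decimal units (1 GB = 1000 MB).
    static double averageImageMegabytes(double plateGigabytes, int imagesPerPlate) {
        return plateGigabytes * 1000.0 / imagesPerPlate;
    }

    public static void main(String[] args) {
        int images = imagesPerPlate(308, 4, 6); // 7392, matching the slide
        System.out.println(images + " images per plate");
        System.out.printf("roughly %.1f MB per image%n", averageImageMegabytes(69.0, images));
    }
}
```

Under these assumptions each image averages a little over 9 MB.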
  • 31. Experiment A Experiment B Experiment C Experiment D
  • 35. Stream images to S3 On-Premise
  • 36. Generate thumbnails Image metrics Stream images to S3 On-Premise
  • 38. Generate thumbnails Image metrics Fire and forget Stream images to S3 On-Premise
  • 39. Generate thumbnails Image metrics Fire and forget Experiment A Stream images to S3 On-Premise
  • 40. Generate thumbnails Image metrics Fire and forget Experiment A Stream images to S3 Extract Features On-Premise Process experiments in batch
  • 41. Generate thumbnails Image metrics Fire and forget Stream images to S3 Extract Features On-Premise Process experiments in batch
  • 42. Generate thumbnails Image metrics Fire and forget Stream images to S3 Extract Features metrics, models, reports, etc On-Premise Process experiments in batch
  • 46. Traditional, low-throughput biology: ~6-12 plates per week, ~400-800GB
  • 47. © 2017 Recursion Pharmaceuticals High-throughput experiments Robots photo
  • 59. So what was wrong with the original system?
  • 60. Generate thumbnails Image metrics Extract Features metrics, models, reports, etc On-Premise Process experiments in batch
  • 61. Experiment A Experiment B Experiment C Experiment D Plates are not imaged in order
  • 64. Migration Goals Move orchestration and processing to cloud.
  • 66. Migration Goals Move orchestration and processing to cloud. Faster feedback and less bursty workloads.
  • 68. Migration Goals Move orchestration and processing to cloud. Faster feedback and less bursty workloads. Preserve existing micro-services logic.
  • 69. Migration Goals Move orchestration and processing to cloud. Faster feedback and less bursty workloads. Preserve existing micro-services logic. Make cheaper.
  • 70. Let’s take a look at the logical pipeline that we needed to implement…
  • 72. Images / channel level image level metrics
  • 73. Images / channel level site (all channels/images) thumbnails image level metrics
  • 74. Images / channel level site (all channels/images) thumbnails site level features image level metrics
  • 76. Images / channel level site (all channels/images) thumbnails site level features image level metrics site metrics
  • 77. well level features Images / channel level site (all channels/images) thumbnails site level features image level metrics site metrics
  • 78. well level features Images / channel level site (all channels/images) thumbnails site level features image level metrics site metrics metrics
  • 79. well level features Images / channel level site (all channels/images) thumbnails site level features image level metrics site metrics metrics plate level features metrics
  • 80. well level features Images / channel level site (all channels/images) thumbnails site level features experiment features image level metrics site metrics metrics plate level features metrics Experiment A
  • 81. well level features Images / channel level site (all channels/images) thumbnails site level features experiment features image level metrics site metrics metrics plate level features metrics metrics, models, reports, etc Experiment A
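The logical pipeline rolls data up level by level: channel images into sites, sites into wells, wells into plates, plates into experiments. A minimal sketch of that roll-up, assuming a hypothetical `ImageId` with hierarchical string keys (the deck does not show the real key scheme or types), might look like this:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: ImageId and its key format are assumptions, not the authors' schema.
final class RollUp {

    /** Hypothetical identifier for one channel-level image. */
    static final class ImageId {
        final String experiment;
        final int plate;
        final String well;
        final int site;
        final int channel;

        ImageId(String experiment, int plate, String well, int site, int channel) {
            this.experiment = experiment;
            this.plate = plate;
            this.well = well;
            this.site = site;
            this.channel = channel;
        }

        // Each level's key is a prefix of the level below it.
        String plateKey() { return experiment + "/plate-" + plate; }
        String wellKey()  { return plateKey() + "/" + well; }
        String siteKey()  { return wellKey() + "/site-" + site; }
    }

    // Groups channel-level images by the site they belong to.
    static Map<String, List<ImageId>> groupBySite(List<ImageId> images) {
        Map<String, List<ImageId>> bySite = new LinkedHashMap<>();
        for (ImageId image : images) {
            bySite.computeIfAbsent(image.siteKey(), k -> new ArrayList<>()).add(image);
        }
        return bySite;
    }

    public static void main(String[] args) {
        List<ImageId> images = new ArrayList<>();
        for (int site = 1; site <= 4; site++) {
            for (int channel = 1; channel <= 6; channel++) {
                images.add(new ImageId("experiment-a", 1, "B02", site, channel));
            }
        }
        groupBySite(images).forEach((key, group) ->
                System.out.println(key + " -> " + group.size() + " channel images"));
    }
}
```

The same grouping applied with `wellKey()` or `plateKey()` gives the well-level and plate-level roll-ups.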
  • 85. Kafka Streams was just released…
  • 87. 🗡 dagger 🗡 workflow library written on top of Kafka Streams that orchestrates microservices
  • 88. 🗡 dagger 🗡 workflow library written on top of Kafka Streams that orchestrates microservices Dagger, ya know, because it is all about the workflows represented as directed acyclic graphs, i.e. DAGs.
  • 89. 🗡 dagger 🗡 workflow library written on top of Kafka Streams that orchestrates microservices
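The deck does not show dagger's API, so the following is not dagger code. It is a generic sketch of the underlying idea: a workflow expressed as a directed acyclic graph of tasks, with an execution order derived from the dependencies (Kahn's algorithm). The class name and task names are assumptions chosen to echo the pipeline levels.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Generic DAG sketch; not the dagger library's API.
final class WorkflowDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    // Registers a task if it is not known yet.
    WorkflowDag task(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    // Declares that `to` may only run after `from` has finished.
    WorkflowDag edge(String from, String to) {
        task(from);
        task(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: repeatedly emit tasks whose dependencies are all done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String current = ready.remove();
            order.add(current);
            for (String next : downstream.get(current)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDag dag = new WorkflowDag()
                .edge("site_features", "well_features")
                .edge("well_features", "plate_features")
                .edge("plate_features", "experiment_features");
        System.out.println(dag.executionOrder());
    }
}
```

In a streaming setting the order is never computed up front like this; each task simply fires when its inputs arrive. The sketch only illustrates why the dependency graph has to be acyclic.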
  • 91. New workflow system in 2017? Not Invented Here syndrome?
  • 92. Core logic in library is ~2800 LOC New workflow system in 2017? Not Invented Here syndrome?
  • 93. Core logic in library is ~2800 LOC All of our DAGs, including schema, task, and workflow definition ~1700 LOC New workflow system in 2017? Not Invented Here syndrome?
  • 95. well level features Images / channel level site (all channels/images) thumbnails site level features experiment features image level metrics site metrics metrics plate level features metrics metrics, models, reports, etc
  • 96. Let’s look at a small workflow using Kafka Streams initially…
  • 97. extract site features images_channel topic experiment_metadata topic table extracted_features topic images_site stream
  • 98. extract site features images_channel topic experiment_metadata topic table extracted_features topic images_site stream
      // Experiment metadata read as a table
      final KTable<String, ExperimentMetadata> experimentMetadata =
          builder.table(EXPERIMENT_METADATA_TOPIC);
      // Channel-level images read as a stream
      final KStream<String, ChannelLevel> images =
          builder.stream(CHANNEL_IMAGES_TOPIC);
      // Collect the channel images of each site within a session window,
      // then emit the site once all expected channels have been observed
      final KStream<String, Site> sites = images
          .groupBy((exp, channel) -> channel.site())
          .windowedBy(SessionWindows.with(Duration.ofHours(SESSION_WINDOW_HOURS)))
          .aggregate(
              () -> new AggState(),
              (site, channel, agg) -> agg.observe(channel.site(), channel.channel),
              (site, agg_a, agg_b) -> agg_a.merge(agg_b))
          .join(experimentMetadata,
              (agg, expMeta) -> agg.markCompleted(expMeta.numChannels))
          .filterValues(agg -> agg.isComplete())
          .mapValues(agg -> agg.site());
      sites.to(SITE_IMAGES_TOPIC);
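The slide uses an `AggState` class without showing it. One plausible shape for it, written as plain Java with no Kafka dependency, is sketched below. Everything here is an assumption inferred from how the slide calls it: the site is represented as a `String`, channels as `int`, and a site counts as complete once the number of distinct channels seen reaches the expected count.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical implementation; the deck does not show the real AggState.
final class AggState {
    private String site;                                  // site this state belongs to
    private final Set<Integer> channels = new TreeSet<>(); // distinct channels seen so far
    private boolean complete;

    // Records one channel image; repeated deliveries of a channel are ignored by the set.
    AggState observe(String site, int channel) {
        this.site = site;
        channels.add(channel);
        return this;
    }

    // Combines two partial states, as needed when two session windows merge.
    AggState merge(AggState other) {
        if (site == null) {
            site = other.site;
        }
        channels.addAll(other.channels);
        return this;
    }

    // Marks the site complete once every expected channel has been seen.
    AggState markCompleted(int numChannels) {
        complete = channels.size() >= numChannels;
        return this;
    }

    boolean isComplete() { return complete; }

    String site() { return site; }

    public static void main(String[] args) {
        AggState state = new AggState();
        // Channels arrive out of order, with one duplicate.
        for (int channel : new int[] {3, 1, 3, 6, 2}) {
            state.observe("plate-1/B02/site-1", channel);
        }
        System.out.println(state.markCompleted(6).isComplete()); // four distinct channels so far
        state.merge(new AggState().observe("plate-1/B02/site-1", 4)
                                  .observe("plate-1/B02/site-1", 5));
        System.out.println(state.markCompleted(6).isComplete()); // all six channels present
    }
}
```

Tracking distinct channels in a set rather than keeping a counter means a redelivered image cannot make a site look complete early.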
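The topology on this slide relies on an AggState class that the deck never shows. As a rough illustration of what such a class might hold, here is a minimal sketch in plain Java. The String site id, the integer channel numbers, and the channelCount() helper are all assumptions made for this example, not the presenters' implementation.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the AggState used on the slide; field types are assumptions. */
final class AggState {
    private String siteId;                                  // null until the first channel is observed
    private final Set<Integer> channels = new TreeSet<>();  // distinct channel numbers seen so far
    private boolean complete;

    /** Records one channel image for a site; a repeated channel number is ignored by the set. */
    AggState observe(String siteId, int channel) {
        this.siteId = siteId;
        channels.add(channel);
        return this;
    }

    /** Combines two partial aggregates, as a session-window merger does when two sessions are joined. */
    AggState merge(AggState other) {
        AggState merged = new AggState();
        merged.siteId = (siteId != null) ? siteId : other.siteId;
        merged.channels.addAll(channels);
        merged.channels.addAll(other.channels);
        return merged;
    }

    /** Marks the aggregate complete once every expected channel has been observed. */
    AggState markCompleted(int expectedChannels) {
        complete = channels.size() >= expectedChannels;
        return this;
    }

    boolean isComplete() { return complete; }
    String siteId() { return siteId; }
    int channelCount() { return channels.size(); }

    public static void main(String[] args) {
        AggState a = new AggState().observe("site-1", 1).observe("site-1", 2);
        AggState b = new AggState().observe("site-1", 3).observe("site-1", 2);
        AggState merged = a.merge(b).markCompleted(3);
        System.out.println(merged.siteId() + " complete=" + merged.isComplete());
    }
}
```

Keeping the observed channels in a set means a redelivered channel event does not make a site look complete early, which matters when the completeness check is a simple count against the experiment metadata.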
  • 106. extract site features (diagram: images_channel topic; experiment_metadata topic (table); images_site stream; extracted_features topic). External task: Kafka Streams App; task input topic; External Service; task output topic
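The external-task pattern in this diagram can be modelled without a broker. The sketch below uses in-memory queues as stand-ins for the task input and task output topics. The class name, the helper methods, and the fake "features(...)" work function are hypothetical, and a real deployment would use Kafka producers and consumers instead.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

final class TaskRoundTrip {
    /** External service: drains the task input "topic" and publishes one result per task. */
    static void runWorker(Queue<Map.Entry<String, String>> taskInput,
                          Queue<Map.Entry<String, String>> taskOutput,
                          Function<String, String> work) {
        Map.Entry<String, String> task;
        while ((task = taskInput.poll()) != null) {
            // The result keeps the task's key so the streams app can correlate it.
            taskOutput.add(Map.entry(task.getKey(), work.apply(task.getValue())));
        }
    }

    /** Streams-app side: collects results from the task output "topic" by key. */
    static Map<String, String> collect(Queue<Map.Entry<String, String>> taskOutput) {
        Map<String, String> results = new LinkedHashMap<>();
        Map.Entry<String, String> result;
        while ((result = taskOutput.poll()) != null) {
            results.put(result.getKey(), result.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        Queue<Map.Entry<String, String>> in = new ArrayDeque<>();
        Queue<Map.Entry<String, String>> out = new ArrayDeque<>();
        in.add(Map.entry("site-1", "image-a"));
        in.add(Map.entry("site-2", "image-b"));
        runWorker(in, out, image -> "features(" + image + ")");
        System.out.println(collect(out));
    }
}
```

Because the worker only reads from one queue and writes to another, it needs no knowledge of the streams application, which is the property that lets existing services sit behind a pair of topics.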
  • 107. How would you do the same workflow in Dagger?
  • 112. extract site features (diagram: images_channel topic; experiment_metadata topic (table); images_site stream; extracted_features topic). Diagram parts: Input topics & tables; Stream operations; Tasks; Output topics
  • 113. extract site features (diagram: images_channel topic; experiment_metadata topic (table); images_site stream; extracted_features topic). Callout on the "function" entry: Specify function to be used
        {"name": "extract-site-level-features",
         "graph": {
           "images-channel": {"type": "topic-stream",
                              "topic-name": "images_channels"},
           "experiment-metadata": {"type": "topic-table",
                                   "topic-name": "experiment_metadata"},
           "images-site": {"type": "stream-operation",
                           "key-schema": "long",
                           "value-schema": "job_site_level",
                           "inputs": ["images-channel", "experiment-metadata"],
                           "function": "aggregations/images-site-grouping"},
           "features-site": {"type": "external-task",
                             "stream": "images-site",
                             "task-name": "extract-features"},
           "features-output": {"type": "publish",
                               "topic-name": "extracted_features",
                               "stream": "features-site"}}}
  • 125. extract site features (diagram: images_channel topic; experiment_metadata topic (table); images_site stream; extracted_features topic). Callout on the :function entry: Inline function directly
        {:name "extract-site-level-features",
         :graph
         {:images-channel {:type :topic-stream,
                           :topic-name "images_channels"},
          :experiment-metadata {:type :topic-table,
                                :topic-name "experiment_metadata"},
          :images-site {:type :stream-operation,
                        :key-schema :long,
                        :value-schema "job_site_level",
                        :inputs [:images-channel, :experiment-metadata],
                        :function (fn [images-channel experiment-metadata] …)},
          :features-site {:type :external-task,
                          :task-name "extract-features",
                          :stream :images-site},
          :features-output {:type :publish,
                            :stream :features-site,
                            :topic-name "extracted_features"}}}
  • 128. extract site features (diagram: images_channel topic; experiment_metadata topic (table); images_site stream; extracted_features topic)
  • 129. 🗡( ) Dagger is a compiler
  • 130. 🗡( ) Kafka Streams Topology Dagger is a compiler
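One thing a compiler from a graph definition to a topology plausibly has to do is visit nodes in dependency order, so that each stream exists before anything reads from it. The following is a sketch of that ordering step only, using a depth-first topological sort over the node names from the workflow definition. The class and method names are hypothetical, and this is an assumption about the general approach, not Dagger's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class GraphOrder {
    /** Returns node names so that every node appears after the nodes it reads from. */
    static List<String> buildOrder(Map<String, List<String>> upstream) {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        Set<String> visiting = new LinkedHashSet<>();
        for (String node : upstream.keySet()) {
            visit(node, upstream, done, visiting, order);
        }
        return order;
    }

    private static void visit(String node, Map<String, List<String>> upstream,
                              Set<String> done, Set<String> visiting, List<String> order) {
        if (done.contains(node)) return;
        if (!upstream.containsKey(node)) {
            throw new IllegalArgumentException("unknown node: " + node);
        }
        if (!visiting.add(node)) {
            throw new IllegalArgumentException("cycle at: " + node);
        }
        for (String dep : upstream.get(node)) {
            visit(dep, upstream, done, visiting, order);
        }
        visiting.remove(node);
        done.add(node);
        order.add(node); // emitted only after all of its inputs
    }

    /** The workflow graph reduced to "node -> nodes it reads from". */
    static Map<String, List<String>> slideGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("features-output", List.of("features-site"));
        g.put("features-site", List.of("images-site"));
        g.put("images-site", List.of("images-channel", "experiment-metadata"));
        g.put("images-channel", List.of());
        g.put("experiment-metadata", List.of());
        return g;
    }

    public static void main(String[] args) {
        System.out.println(buildOrder(slideGraph()));
    }
}
```

The same traversal also rejects definitions that reference a missing node or contain a cycle, which is the kind of check a declarative format makes possible before anything is deployed.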
  • 132. What would the entire pipeline look like in Dagger?
  • 133. Pipeline diagram labels: well level features; Images / channel level; site (all channels/images); thumbnails; site level features; experiment features; image level metrics; site metrics; metrics; plate level features; metrics; metrics, models, reports, etc.
  • 135. 🗡 🗡 Our pipeline application that uses Dagger
  • 136. What does the whole system look like now?
  • 137. Diagram labels: Generate thumbnails; Image metrics; Extract Features; metrics, models, reports, etc.; On-Premise; Process experiments in batch
  • 147. Diagram labels: On-Premise; 🗡 🗡 🗡 🗡; Microservices; Publishers & Consumers; Autoscaled Workers; Publish Image Events; Uploader
  • 148. Diagram labels: On-Premise; 🗡 🗡 🗡 🗡; BigQuery SQL; 🗡 Transform & Load 🗡; Microservices; Publishers & Consumers; Autoscaled Workers; Publish Image Events; Uploader
  • 154. Migration Goals: ✓ Move orchestration and processing to cloud. ✓ Faster feedback and less bursty workloads. ✓ Preserve existing micro-services logic. ✓ Make cheaper: EC2 and Lambda -> Google Cloud preemptibles.
  • 155. Big data, small metadata…
  • 159. Missed out on mature workflow monitoring
  • 162. Easy deployment! (diagram labels: On-Premise; 🗡 🗡 🗡 🗡; 🗡 Transform & Load 🗡; Uploader)
  • 163. Durable Log FTW (diagram labels: Kafka Streams App; External Service; task input topic; task output topic)
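A durable log helps with external tasks because a consumer that advances its offset only after a record is fully processed loses nothing when it crashes mid-task. The sketch below models that with an in-memory append-only list and a single committed offset. The class, its methods, and the simulated failure are hypothetical simplifications of what Kafka's consumer offsets provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class DurableLogDemo {
    private final List<String> log = new ArrayList<>(); // append-only: records are never removed
    private int committedOffset = 0;                    // index of the next unprocessed record

    void append(String record) {
        log.add(record);
    }

    /**
     * Processes records starting at the committed offset. The offset is advanced
     * only after a record succeeds, so a failure leaves that record to be retried.
     */
    List<String> consume(Predicate<String> failsOn) {
        List<String> processed = new ArrayList<>();
        while (committedOffset < log.size()) {
            String record = log.get(committedOffset);
            if (failsOn.test(record)) {
                break; // simulated crash: the offset is not advanced
            }
            processed.add(record);
            committedOffset++;
        }
        return processed;
    }

    public static void main(String[] args) {
        DurableLogDemo topic = new DurableLogDemo();
        topic.append("task-a");
        topic.append("task-b");
        topic.append("task-c");
        System.out.println(topic.consume(r -> r.equals("task-b"))); // first run stops at task-b
        System.out.println(topic.consume(r -> false));              // restart resumes from task-b
    }
}
```

Committing after processing gives at-least-once behaviour, so in a real system the task itself would need to tolerate being run twice.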
  • 165. 🗡
  • 166. Come help us decode biology! @RecursionPharma @bmabey