Presented at Supercomputing 2017 (SC17) on Nov 14, 2017, by Dmitri Zimine.
This talk is a story of bio-tech meeting DevOps to produce genomic computations, economically, and at scale.
Genomic computation is entering the mainstream of biotechnology, agriculture, and personal medicine, and demand for it is growing fast. It is also exploding the demand for compute resources: with inexpensive next-gen sequencing, some labs sequence over 1,000,000 billion bases per year, and genetic data banks are growing over 10x annually. How do we compute on genomic data at massive scale, and do it cost-efficiently?
In the presentation, we describe and demonstrate a serverless solution built with Docker, Docker Swarm, StackStorm, and other tools from the DevOps toolchain on AWS. The solution offers a new take on creating and running bioinformatic pipelines at high scale and at optimal cost.
http://sc17.supercomputing.org/presentation/?id=exforum106&sess=sess150
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
1. Genomic Computation at Scale
with Serverless, StackStorm, and Docker
SC17, 14 Nov 2017
Dmitri Zimine
Fellow @ Extreme Networks
@dzimine
Image by Miki Yoshihito, Creative Commons license
2. Genomic Sequencing and Annotation
ACGTGACCGGTACTGGTAACGTACA
CCTACGTGACCGGTACTGGTAACGT
ACGCCTACGTGACCGGTACTGGTAA
CGTATACACGTGACCGGTACTGGTA
ACGTACACCTACGTGACCGGTACTG
CTGGTAACGTATACCTCT...
Sequencer
Sequenced Genome
DNA Sample
Annotated Sequence
Compute
in silico
4. Victor Solovyev
Partner; leading scientist in computational biology
Victor Solovyev is a leading scientist in computational biology. His experience is a good mixture of academic positions, including professorships at Royal Holloway and KAUST, and various industry roles. His research on bioinformatics and genomic computation is published in Nature, Science, and Genome Research, and is highly cited.
As Chief Scientific Officer at Softberry, he leads software development for biomedical data analysis and research in computational biology. Softberry software products were used in over 2,000 research publications in 2016 alone. The Fgenesh program has been cited in ~3,200 scientific publications, Bprom in ~800, and the Fgenesb pipeline in ~500.
7. GAaaS – Genomic Annotation as a Service
Challenges:
• Offer annotation pipelines online
• Use the cloud, for large elastic capacity
• Handle scale: a spiky workload
• Do it economically
8. Agenda
• Problem & Solution: domain demands, technology selection & serverless, toolchain, solution overview
• Show & Tell: demo
• Discussion: lessons learned, what to keep & what to refactor, the path forward
10. Annotation Pipelines
A basic exome pipeline
delivering called variants from
raw sequence could consist of
as few as 12 steps, most of
which can be run in parallel,
but a real analysis will typically
involve several additional
downstream steps and
complex report generation.
Source: Brief Bioinform bbw020.
DOI: https://doi.org/10.1093/bib/bbw020
11. Annotation Pipelines
Properties:
• Steps are jobs/functions:
  • Run times may be hours and days
  • Diverse (a.k.a. “don’t run on the same box”)
• Workflow orchestration:
  • Logical patterns: splits, parallels, joins
  • Data flow: upstream results –> downstream inputs
• Scale dimensions: spiky load
  • Low volume of requests
  • Very high compute demand per request
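The orchestration properties above (splits, parallels, joins, upstream results feeding downstream inputs) map naturally onto StackStorm's Mistral-based workflow language. A hypothetical sketch of a tiny pipeline with one split, two parallel steps, and a join; the `gaaas.*` pack and action names are illustrative, not the talk's actual code:

```yaml
# Illustrative Mistral workflow: fetch -> (genes || promoters) -> report
version: '2.0'
gaaas.annotate:
  type: direct
  input:
    - sequence
  tasks:
    fetch:
      action: gaaas.fetch_sequence sequence=<% $.sequence %>
      publish:
        raw: <% task(fetch).result %>     # upstream result, published for downstream tasks
      on-success:
        - genes        # split: both branches
        - promoters    # start in parallel
    genes:
      action: gaaas.fgenesh seq=<% $.raw %>
      on-success:
        - report
    promoters:
      action: gaaas.bprom seq=<% $.raw %>
      on-success:
        - report
    report:
      join: all        # wait for both parallel branches before reporting
      action: gaaas.make_report
```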
13. What is Serverless?
Authoritative: Mike Roberts on martinfowler.com: https://goo.gl/bTfgfU
My summary:
• Function, not service: “down when done”
• Scale: elastic, infinite, transparent to the developer
• Pay-per-use consumption model
14. Serverless fits!
Typical serverless requirements:
• “Functions”, not “servers”: down when done
• Elastic scale: handle the spiky workload pattern
• BYOC*: package algorithms into containers
• Launch on a variety of events
*) BYOC – Bring Your Own Code (see the serverless compute manifesto, https://goo.gl/q9HsXB)
15. Serverless fits, but…
Typical serverless requirements:
• Elastic scale: handle the spiky workload pattern
• “Functions”, not “servers”: down when done
• BYOC*: package programs into containers, run everywhere
• Launch on a variety of events
Additional requirements:
• Long running times: hours
• Pipeline orchestration: execution logic and data passing
• Local dev environment, consistent and convenient
16. Why not <…>
AWS Lambda?
5-minute execution limit, but our jobs run for hours and days.
Azure?
No native support for Functions in Docker containers.*
OpenWhisk?
Lacks a powerful workflow engine to orchestrate pipelines (only sequences).
*) At the time of selection. I will cover what has changed in the Discussion.
19. Tool Chain
• Terraform provisions infra on AWS (WIP); Vagrant for local dev infra.
• Ansible deploys & configures software on the infra.
• Docker to containerize functions and push them to a local Docker Registry.
• StackStorm orchestrates pipeline executions, invokes Swarm to run functions, and dynamically scales Swarm on load.
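The local Docker Registry in this toolchain is just a stock `registry:2` container. A minimal sketch of how it could be stood up for local dev, as a docker-compose fragment (paths and names are assumptions):

```yaml
# Illustrative local-dev sketch: a private Docker Registry
# that function images are pushed to and pulled from.
version: "3"
services:
  registry:
    image: registry:2
    ports:
      - "5000:5000"          # reachable as localhost:5000
    volumes:
      - registry-data:/var/lib/registry
volumes:
  registry-data:
```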
22. Three Sides to the Serverless Story
• End User: submits a sequence, gets results, fast and cheap.
• Developer: packs algorithms in containers, defines pipelines.
• DevOps: provides infrastructure.
25. Developer: creates functions, defines pipelines
1. Create functions (BYOC), pack them into Docker images, push to the local Registry.
2. Define pipelines as StackStorm workflows.
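A "function" here is just a containerized program. A minimal sketch of step 1, assuming a hypothetical `bprom` binary with its model data, and the local Registry at `localhost:5000`:

```dockerfile
# Illustrative Dockerfile packaging one annotation algorithm as a function.
# The binary, data layout, and image name are assumptions, not the talk's code.
FROM ubuntu:16.04
# the algorithm binary and its model data
COPY bprom /opt/bprom/bprom
COPY data/ /opt/bprom/data/
ENTRYPOINT ["/opt/bprom/bprom"]
# Build and push to the local Registry (step 1 above):
#   docker build -t localhost:5000/gaaas/bprom:latest .
#   docker push localhost:5000/gaaas/bprom:latest
```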
26. User submits data; the system runs the pipeline & produces results
1. User sends sequence data.
2. StackStorm runs the workflow, schedules functions as jobs on Swarm (controller).
3. Swarm schedules services on the workers.
4. Docker pulls the functions’ images from the Registry.
5. Functions run in containers, produce data.
6. StackStorm sends results back to the user.
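Step 2 can be imagined as a StackStorm action that wraps the Docker CLI: a one-shot Swarm service with no restart policy gives the "down when done" semantics. A hypothetical action-metadata sketch (pack, name, and wrapping are illustrative, not the talk's actual implementation):

```yaml
# Illustrative StackStorm action: run one function as a
# run-to-completion Swarm service.
name: swarm_run_function
pack: gaaas                       # hypothetical pack name
description: Run a containerized pipeline function on the Swarm cluster.
runner_type: local-shell-cmd
parameters:
  image:
    type: string
    required: true
  cmd:
    immutable: true
    default: >
      docker service create --detach
      --restart-condition none
      {{ image }}
```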
30. Show & Tell, Part 2
Dynamically scaling the Swarm cluster on AWS, based on workload.
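One way the scale-on-workload part could be wired up in StackStorm: a rule that fires a scale-out action when a queue-depth sensor reports backlog. All names below (`gaaas.*` trigger, action, threshold) are illustrative assumptions, not the demo's actual code:

```yaml
# Illustrative StackStorm rule: add a Swarm worker when jobs queue up.
name: scale_out_on_load
pack: gaaas                       # hypothetical pack name
trigger:
  type: gaaas.pending_jobs_high   # emitted by a hypothetical queue-depth sensor
criteria:
  trigger.pending:
    type: greaterthan
    pattern: 10                   # illustrative threshold
action:
  ref: gaaas.add_swarm_worker     # e.g. boots an EC2 instance, joins it to Swarm
  parameters:
    count: 1
```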
32. Agenda
• Problem & Solution: domain demands, technology selection & serverless, toolchain, solution overview
• Show & Tell: demo
• Discussion: lessons learned, what to keep & what to refactor, the path forward
39. Path Forward: Options
Option 1: Kubernetes
• Use the Kubernetes pack from StackStorm Exchange
• Utilize k8s “run to completion” Jobs
• Deploy on AWS; minikube for local development
• Leverage the AWS autoscaler for elastic capacity
StackStorm handles the pipeline workflow, calls k8s Jobs.
Same app developer experience.
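A k8s "run to completion" Job maps directly onto one pipeline step. A hypothetical manifest sketch (names, image, and arguments are assumptions):

```yaml
# Illustrative run-to-completion Job for one pipeline step.
apiVersion: batch/v1
kind: Job
metadata:
  name: fgenesh-chunk-01            # illustrative name
spec:
  backoffLimit: 2                   # retry a failed step at most twice
  template:
    spec:
      restartPolicy: Never          # run to completion: "down when done"
      containers:
      - name: fgenesh
        image: registry.local/gaaas/fgenesh:latest   # assumed image
        args: ["--input", "/data/chunk-01.fa"]
```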
40. Path Forward: Options
Option 2: Azure
• Use Azure’s “self-orchestration” option with StackStorm
• Azure provides containers on demand (no VMs!)
• Per-container, per-second billing
StackStorm handles the pipeline workflow, calls Azure containers.
App developer experience stays the same.
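With Azure Container Instances, one function run becomes one container group that exists, and is billed, only while it runs. A hypothetical ACI spec sketch, deployable with something like `az container create -g <rg> -f spec.yaml`; the name, location, image, and sizes are assumptions:

```yaml
# Illustrative ACI spec for one function run, billed per second while it lives.
apiVersion: '2018-10-01'
name: bprom-run-42                  # illustrative name
location: westus
properties:
  osType: Linux
  restartPolicy: Never              # one run, then gone
  containers:
  - name: bprom
    properties:
      image: registry.local/gaaas/bprom:latest   # assumed image
      resources:
        requests:
          cpu: 2
          memoryInGB: 4
```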
41. Path forward: change to Azure Container Instances
1. User sends sequence data.
2. StackStorm runs the workflow, schedules functions as containers on Azure (Azure Container Service).
3. Azure schedules container instances.
4. Docker pulls the functions’ images from the Registry.
5. Functions run in containers, produce data.
6. StackStorm sends results back to the user.
44. StackStorm event-driven automation allows you to get your solution up and running quickly, so you can deliver business value fast, experiment, and innovate. Once you have it just right, you can build a more permanent version with microservices.
Sensors | Rules | Workflows | Actions
45. StackStorm is an innovation platform where we can build solutions, experiment, and learn, while delivering business value, before moving the implementation to dedicated services.