Presented at Supercomputing 2017 (SC17) on Nov 14, 2017, by Dmitri Zimine.
This talk is a story of bio-tech meeting DevOps to produce genomic computations, economically, and at scale.
Genomic computation is entering the mainstream of biotechnology, agriculture, and personal medicine, and demand for it is growing fast. It is also exploding the demand for compute resources: with inexpensive next-gen sequencing, some labs sequence over 1,000,000 billion bases per year, and genetic data banks are growing over 10x annually. How do we compute on genomic data at massive scale, and do it cost-efficiently?
In the presentation, we describe and demonstrate a serverless solution built with Docker, Docker Swarm, StackStorm, and other tools from the DevOps toolchain on AWS. The solution offers a new take on creating and running bioinformatic pipelines at high scale and at optimal cost.
http://sc17.supercomputing.org/presentation/?id=exforum106&sess=sess150
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
1. Genomic Computation at Scale
with Serverless, StackStorm, and Docker
SC17, 14 Nov 2017
Dmitri Zimine
Fellow @ Extreme Networks
@dzimine
Image by Miki Yoshihito, Creative Commons license
2. Genomic Sequencing and Annotation
ACGTGACCGGTACTGGTAACGTACA
CCTACGTGACCGGTACTGGTAACGT
ACGCCTACGTGACCGGTACTGGTAA
CGTATACACGTGACCGGTACTGGTA
ACGTACACCTACGTGACCGGTACTG
CTGGTAACGTATACCTCT...
Sequencer
Sequenced Genome
DNA Sample
Annotated Sequence
Compute
in silico
4. Victor Solovyev
Partner; leading scientist in computational biology
Victor Solovyev is a leading scientist in computational biology. His experience is a good mixture of academic positions, including professorships at Royal Holloway and KAUST, and various industry roles. His research on bioinformatics and genomic computation is published in Nature, Science, and Genome Research, and is highly cited.
As Chief Scientific Officer at Softberry, he leads software development for biomedical data analysis and research in computational biology. Softberry software products were used in over 2,000 research publications in 2016 alone. The Fgenesh program has been cited in ~3,200 scientific publications, Bprom in ~800, and the Fgenesb pipeline in ~500.
7. GAaaS – Genomic Annotation as a Service
Challenges:
• Offer annotation pipelines online
• Use the cloud, for large elastic capacity
• Handle scale: a spiky workload
• Do it economically
8. Agenda
• Problem & Solution: domain demands, technology selection & serverless, toolchain, solution overview
• Show & Tell: demo
• Discussion: lessons learned, what to keep & what to refactor, the path forward
10. Annotation Pipelines
A basic exome pipeline
delivering called variants from
raw sequence could consist of
as few as 12 steps, most of
which can be run in parallel,
but a real analysis will typically
involve several additional
downstream steps and
complex report generation.
Source: Brief Bioinform bbw020.
DOI: https://doi.org/10.1093/bib/bbw020
11. Annotation Pipelines
Properties:
• Steps are jobs/functions:
  • Run times may be hours and days
  • Diverse (a.k.a. “don’t run on the same box”)
• Workflow orchestration:
  • Logical patterns: splits, parallels, joins
  • Data flow: upstream results –> downstream inputs
• Scale dimensions: spiky load
  • Low volume of requests
  • Very high compute demand per request
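The orchestration properties above (splits, parallels, joins, upstream results feeding downstream inputs) map naturally onto StackStorm's Mistral-based workflow language. A hypothetical sketch of a tiny pipeline with one split, two parallel steps, and a join; the `gaaas.*` pack and action names are illustrative, not the talk's actual code:

```yaml
# Illustrative Mistral workflow: fetch -> (genes || promoters) -> report
version: '2.0'
gaaas.annotate:
  type: direct
  input:
    - sequence
  tasks:
    fetch:
      action: gaaas.fetch_sequence sequence=<% $.sequence %>
      publish:
        raw: <% task(fetch).result %>     # upstream result, published for downstream tasks
      on-success:
        - genes        # split: both branches
        - promoters    # start in parallel
    genes:
      action: gaaas.fgenesh seq=<% $.raw %>
      on-success:
        - report
    promoters:
      action: gaaas.bprom seq=<% $.raw %>
      on-success:
        - report
    report:
      join: all        # wait for both parallel branches before reporting
      action: gaaas.make_report
```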
13. What is Serverless?
Authoritative: Mike Roberts on martinfowler.com: https://goo.gl/bTfgfU
My summary:
• Function, not service: “down when done”
• Scale: elastic, infinite, transparent to the developer
• Pay-per-use consumption model
14. Serverless fits!
Typical serverless requirements:
• “Functions”, not “servers”: down when done
• Elastic scale: handle the spiky workload pattern
• BYOC*: package algorithms into containers
• Launch on a variety of events
*) BYOC – Bring Your Own Code (see the serverless compute manifesto, https://goo.gl/q9HsXB)
15. Serverless fits, but…
Typical serverless requirements:
• Elastic scale: handle the spiky workload pattern
• “Functions”, not “servers”: down when done
• BYOC*: package programs into containers, run everywhere
• Launch on a variety of events
Additional requirements:
• Long running times: hours
• Pipeline orchestration: execution logic and data passing
• Local dev environment, consistent and convenient
16. Why not <…>
AWS Lambda?
5-minute execution limit, but our jobs run for hours and days.
Azure?
No native support for Functions in Docker containers.*
OpenWhisk?
Lacks a powerful workflow engine to orchestrate pipelines (only sequences).
*) At the time of selection. I will cover what has changed in the Discussion.
19. Tool Chain
• Terraform provisions infra on AWS (WIP); Vagrant for local dev infra.
• Ansible deploys & configures software on the infra.
• Docker to containerize functions and push them to a local Docker Registry.
• StackStorm orchestrates pipeline executions, invokes Swarm to run functions, and dynamically scales Swarm on load.
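The local Docker Registry in this toolchain is just a stock `registry:2` container. A minimal sketch of how it could be stood up for local dev, as a docker-compose fragment (paths and names are assumptions):

```yaml
# Illustrative local-dev sketch: a private Docker Registry
# that function images are pushed to and pulled from.
version: "3"
services:
  registry:
    image: registry:2
    ports:
      - "5000:5000"          # reachable as localhost:5000
    volumes:
      - registry-data:/var/lib/registry
volumes:
  registry-data:
```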
22. Three Sides to the Serverless Story
• End User: submits a sequence, gets results, fast and cheap.
• Developer: packs algorithms in containers, defines pipelines.
• DevOps: provides infrastructure.
25. Developer: creates functions, defines pipelines
1. Create functions (BYOC), pack them into Docker images, push to the local Registry.
2. Define pipelines as StackStorm workflows.
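A "function" here is just a containerized program. A minimal sketch of step 1, assuming a hypothetical `bprom` binary with its model data, and the local Registry at `localhost:5000`:

```dockerfile
# Illustrative Dockerfile packaging one annotation algorithm as a function.
# The binary, data layout, and image name are assumptions, not the talk's code.
FROM ubuntu:16.04
# the algorithm binary and its model data
COPY bprom /opt/bprom/bprom
COPY data/ /opt/bprom/data/
ENTRYPOINT ["/opt/bprom/bprom"]
# Build and push to the local Registry (step 1 above):
#   docker build -t localhost:5000/gaaas/bprom:latest .
#   docker push localhost:5000/gaaas/bprom:latest
```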
26. User submits data; the system runs the pipeline & produces results
1. User sends sequence data.
2. StackStorm runs the workflow, schedules functions as jobs on Swarm (controller).
3. Swarm schedules services on the workers.
4. Docker pulls the functions’ images from the Registry.
5. Functions run in containers, produce data.
6. StackStorm sends results back to the user.
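Step 2 can be imagined as a StackStorm action that wraps the Docker CLI: a one-shot Swarm service with no restart policy gives the "down when done" semantics. A hypothetical action-metadata sketch (pack, name, and wrapping are illustrative, not the talk's actual implementation):

```yaml
# Illustrative StackStorm action: run one function as a
# run-to-completion Swarm service.
name: swarm_run_function
pack: gaaas                       # hypothetical pack name
description: Run a containerized pipeline function on the Swarm cluster.
runner_type: local-shell-cmd
parameters:
  image:
    type: string
    required: true
  cmd:
    immutable: true
    default: >
      docker service create --detach
      --restart-condition none
      {{ image }}
```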
30. Show & Tell, Part 2
Dynamically scaling the Swarm cluster on AWS, based on workload.
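One way the scale-on-workload part could be wired up in StackStorm: a rule that fires a scale-out action when a queue-depth sensor reports backlog. All names below (`gaaas.*` trigger, action, threshold) are illustrative assumptions, not the demo's actual code:

```yaml
# Illustrative StackStorm rule: add a Swarm worker when jobs queue up.
name: scale_out_on_load
pack: gaaas                       # hypothetical pack name
trigger:
  type: gaaas.pending_jobs_high   # emitted by a hypothetical queue-depth sensor
criteria:
  trigger.pending:
    type: greaterthan
    pattern: 10                   # illustrative threshold
action:
  ref: gaaas.add_swarm_worker     # e.g. boots an EC2 instance, joins it to Swarm
  parameters:
    count: 1
```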
32. Agenda
• Problem & Solution: domain demands, technology selection & serverless, toolchain, solution overview
• Show & Tell: demo
• Discussion: lessons learned, what to keep & what to refactor, the path forward
39. Path Forward: Options
Option 1: Kubernetes
• Use the Kubernetes pack from StackStorm Exchange
• Utilize k8s “run to completion” Jobs
• Deploy on AWS; minikube for local development
• Leverage the AWS autoscaler for elastic capacity
StackStorm handles the pipeline workflow, calls k8s Jobs.
Same app developer experience.
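A k8s "run to completion" Job maps directly onto one pipeline step. A hypothetical manifest sketch (names, image, and arguments are assumptions):

```yaml
# Illustrative run-to-completion Job for one pipeline step.
apiVersion: batch/v1
kind: Job
metadata:
  name: fgenesh-chunk-01            # illustrative name
spec:
  backoffLimit: 2                   # retry a failed step at most twice
  template:
    spec:
      restartPolicy: Never          # run to completion: "down when done"
      containers:
      - name: fgenesh
        image: registry.local/gaaas/fgenesh:latest   # assumed image
        args: ["--input", "/data/chunk-01.fa"]
```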
40. Path Forward: Options
Option 2: Azure
• Use Azure’s “self-orchestration” option with StackStorm
• Azure provides containers on demand (no VMs!)
• Per-container, per-second billing
StackStorm handles the pipeline workflow, calls Azure containers.
App developer experience stays the same.
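With Azure Container Instances, one function run becomes one container group that exists, and is billed, only while it runs. A hypothetical ACI spec sketch, deployable with something like `az container create -g <rg> -f spec.yaml`; the name, location, image, and sizes are assumptions:

```yaml
# Illustrative ACI spec for one function run, billed per second while it lives.
apiVersion: '2018-10-01'
name: bprom-run-42                  # illustrative name
location: westus
properties:
  osType: Linux
  restartPolicy: Never              # one run, then gone
  containers:
  - name: bprom
    properties:
      image: registry.local/gaaas/bprom:latest   # assumed image
      resources:
        requests:
          cpu: 2
          memoryInGB: 4
```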
41. Path forward: change to Azure Container Instances
1. User sends sequence data.
2. StackStorm runs the workflow, schedules functions as containers on Azure (Azure Container Service).
3. Azure schedules container instances.
4. Docker pulls the functions’ images from the Registry.
5. Functions run in containers, produce data.
6. StackStorm sends results back to the user.
44. StackStorm event-driven automation allows you to get your solution up and running quickly, so you can deliver business value fast, experiment, and innovate. Once you have it just right, you can build a more permanent version with microservices.
Sensors | Rules | Workflows | Actions
45. StackStorm is an innovation platform where we can build solutions, experiment, and learn, while delivering business value, before moving the implementation to dedicated services.