1. To Cloud or Not to Cloud?
Greg Lindahl, CTO
@glindahl – greg@blekko.com
2. About Us
• Web-scale search engine with our own crawl & index
• Public launch, November 2010
• $60 M raised
• 800 servers, 16 PB spinning rust, ½ PB flash disk
5. The wiring diagram
[Diagram components: Web, Crawler, Extractor, Ranker, Indexer, Lookup, Query Analyzer, Front End, Query, SERP, DIG, KB]
6. Hijacking a meetup topic
• Original topic was “virtualization or not”
• But really, virtualization is an implementation detail these days
– cloud => virtual
– virtual => public or private cloud (probably)
• This talk: Public cloud vs. not
• I’m trying to list a bunch of things that you should think about … your situation probably differs from mine
7. The question
• It’s 2007, and your CEO asks you: Should our new startup use this newfangled cloud computing stuff or not?
8. Why cloud at all?
• Flexible
– prototyping & development
– testing at scale
– scale up for high usage and back down later
• Turns CapEx into OpEx
– startups prefer paying over time
– “money tomorrow is cheaper than money today”, if you’re successful
{btw, plenty of banks will loan against equipment.}
9. Cloud win examples
• CommonCrawl.org has a web crawl dataset on EC2
– Map/Reduce job to read the whole thing is ~ $50
• Fewer ops people is actually true
• Your company can change direction
10. OK, so what’s bad?
• Examine the curve of Amazon’s pricing over time and per volume
• People think it’s a low-priced product, but it’s not.
• It’s value priced.
• Not enough competition, yet, to really drive Amazon’s margins down
• This is good for Amazon, maybe not for you.
11. 6 Reasons to not use Amazon
• Economy of scale in your favor?
• Your max::min ratio is not large enough
• Cloud IOPs are expensive
• Data is heavy if you use a lot of local disk
• SSDs are overpriced
• Ratio of disk capacity or bandwidth :: ssd :: memory :: compute may not be ideal for you
12. Economy of scale
• “Amazon has 100s of thousands of servers, so they can run them cheaper than I can.”
• But:
– you pay retail, not wholesale price
– there are diminishing returns with size
• At some point, it’s cheaper to do it yourself
• 100 servers? 50 servers?
{ blekko had 700 at launch… }
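One way to reason about the "100 servers? 50 servers?" question is a simple break-even calculation. The sketch below assumes a linear cost model: running your own servers carries a fixed monthly overhead (staff, colo contract) plus a per-server cost, while the cloud charges a higher per-server price with no fixed overhead. All names and dollar figures are hypothetical, chosen for illustration; they are not blekko's or Amazon's numbers.

```python
import math

def break_even_servers(fixed_monthly, own_per_server, cloud_per_server):
    """Smallest server count at which owning costs no more than renting.

    Returns None when the cloud's per-server price is not higher than
    the owned per-server cost, because owning then never catches up.
    """
    if cloud_per_server <= own_per_server:
        return None
    return math.ceil(fixed_monthly / (cloud_per_server - own_per_server))

# Hypothetical monthly figures, for illustration only.
n = break_even_servers(fixed_monthly=30_000,
                       own_per_server=150,
                       cloud_per_server=450)
print("break-even at", n, "servers")
```

Under this model, anything that raises the effective cloud price per useful server (idle resources, expensive IOPs) pushes the break-even point toward fewer servers.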
13. Your max::min ratio is not big enough
• Maybe you use 100x as many servers some days?
– Cloud is for you!
• How long do your usage spikes last?
• Can you predict them far enough in advance?
• How long does it take you to spin up a new node?
{blekko’s day::night is only 2x}
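The max::min argument can be sketched with a toy cost model: owning means paying for peak capacity every hour, while renting means paying a higher rate only for the servers in use. The 3x price ratio and both demand profiles below are assumptions made for illustration, not measured prices.

```python
def owned_cost(hourly_demand, owned_rate):
    """Owning: provision for the peak and pay for it every hour."""
    return max(hourly_demand) * owned_rate * len(hourly_demand)

def cloud_cost(hourly_demand, cloud_rate):
    """Renting: pay only for the servers actually in use each hour."""
    return sum(hourly_demand) * cloud_rate

# Hypothetical prices: a cloud server-hour costs 3x an owned one.
OWNED_RATE = 1.0
CLOUD_RATE = 3.0

# A 2x day::night profile: 12 hours at 100 servers, 12 hours at 200.
steady = [100] * 12 + [200] * 12
# A spiky profile: 100 servers, with a 100x burst for one hour.
spiky = [100] * 23 + [10_000]

for name, demand in [("steady", steady), ("spiky", spiky)]:
    print(name,
          "own:", owned_cost(demand, OWNED_RATE),
          "rent:", cloud_cost(demand, CLOUD_RATE))
```

With these assumed numbers, owning wins for the 2x profile and renting wins for the 100x burst. The model ignores spin-up time, which matters if a spike is shorter than the time needed to bring up and load new nodes.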
14. Cloud IOPs are expensive
• I/O Operations are expensive to start with
– “spinning rust” disks only seek so much
• Networked storage has low bandwidth compared to 10 attached disks
– 1 Gbyte/sec sustained – whoa!
• Networked disks are more expensive than local
– better failure behavior, whether I want it or not
15. Data is heavy if you use a lot of local disk
• I mean: it takes a loooooong time to copy a few tbytes of data onto your local disk over the network
– 1 gigabit: ½ tbyte/hour
– 10 gigabit: 5 tbytes/hour
– even filling your ½ tbyte SSD is kinda slow
• Slow spin-up/down of nodes hurts your ability to flex up and down
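The transfer rates above follow from simple unit conversion. The sketch below assumes the link runs at full line rate with no protocol overhead and uses decimal units (1 tbyte = 10^12 bytes), so real copies will be somewhat slower. The 20-tbyte node size in the usage line is a hypothetical figure.

```python
def tbytes_per_hour(link_gigabits_per_sec):
    """Best-case tbytes moved per hour over a link of the given speed."""
    bytes_per_sec = link_gigabits_per_sec * 1e9 / 8   # bits to bytes
    return bytes_per_sec * 3600 / 1e12                # per hour, in tbytes

def hours_to_copy(tbytes, link_gigabits_per_sec):
    """Best-case hours needed to copy `tbytes` of data over the link."""
    return tbytes / tbytes_per_hour(link_gigabits_per_sec)

for gbit in (1, 10):
    print(gbit, "gigabit:", tbytes_per_hour(gbit), "tbytes/hour")

# Hypothetical node holding 20 tbytes of local data.
print("20 tbytes over 1 gigabit:", round(hours_to_copy(20, 1), 1), "hours")
```

The computed rates are 0.45 and 4.5 tbytes/hour, which the slide rounds to ½ and 5.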
16. SSDs are overpriced (by cloud providers)
• SSDs are completely awesome for read-heavy analytics queries
• SSDs wear out with writes
• No cloud provider charges a fee for writes?
• Instead, they assume all their customers are average
• … and so they charge way too much to customers who are smart about not writing too much
{ blekko is great at not writing to our SSDs }
17. Ratios available might not fit your usage
• Amazon tries pretty hard:
– high memory, high-CPU, GPU, high I/O, high-storage
– weirder ones are less flexible
• It’s still easy to not fit into that set of cookie cutters
• Not fitting == wasted money
– idle resources that you’ve paid for
– moves the break-even point to smaller node count
{ blekko crawler nodes: 10 local disks (capacity, bandwidth, seeks), 2 ssds, 96 gigs ram}
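"Not fitting == wasted money" can be made concrete by computing how many instances of a fixed shape are needed to cover the scarcest resource, and how much of everything else then sits idle. The workload and instance shape below are hypothetical; only the 96 gigs of RAM echoes the crawler-node figure above, and the shape is not a real cloud offering.

```python
import math

def instances_needed(need, shape):
    """Instances required so that every resource in `need` is covered."""
    return max(math.ceil(need[r] / shape[r]) for r in need)

def idle_fraction(need, shape):
    """Fraction of each paid-for resource left unused."""
    n = instances_needed(need, shape)
    return {r: 1 - need[r] / (n * shape[r]) for r in need}

# Hypothetical workload and instance shape, for illustration only.
need = {"cpu_cores": 16, "ram_gb": 96, "disk_tb": 20}
shape = {"cpu_cores": 16, "ram_gb": 64, "disk_tb": 2}

print("instances:", instances_needed(need, shape))
for resource, idle in idle_fraction(need, shape).items():
    print(resource, "idle:", round(idle * 100), "%")
```

Here disk capacity is the binding resource, so the workload needs ten instances and leaves most of the purchased CPU and RAM idle.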
18. So…
• For us, it was easy to predict the right answer
• Our SWAG for launch day was 600 servers
– and our entire index in SSD
– and we can’t scale down from that
• Amazon wasn’t renting SSDs yet
• If you’re going to run your own servers, you need to start early
19. How about you?
• RT analytics is a complicated subject
• Two main thrusts
– Pre: pre-compute aggregate numbers, query those
– Mem: stick a subset of your big data that fits into ram or ssd, do complicated queries against those
{ blekko only does Pre }
20. Pre
• Needs to be wired into your stream of data generation, e.g. your webserver
• Summary data can be pretty small
• Doesn’t really matter where you put it
• Not much impact on the cloud/no-cloud decision
{ blekko pre-computes a lot of things using “combinators” in our home-grown NoSQL, optionally stuffing them into our SSD caching system }
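21. Combinators reduce the total work
[Diagram: Server 1 and Server 2, each running Process 1 and Process 2, above Disk 1, Disk 2, and Disk 3. Per-process increments +4, +3, +4, +7 are combined into +7 and +11, then into a single +18 at each of the three disks.]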
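The idea in the diagram can be sketched as merging add operations before they reach storage. This is a minimal illustration, not blekko's implementation: the `combine` helper and the `"hits"` key are hypothetical, and the grouping of deltas per server is read off the diagram's numbers (+4 and +3 giving +7, +4 and +7 giving +11).

```python
from collections import Counter

def combine(updates):
    """Merge a batch of (key, delta) add operations into one delta per key."""
    merged = Counter()
    for key, delta in updates:
        merged[key] += delta
    return list(merged.items())

# Per-process deltas, grouped per server as read from the diagram.
server1 = combine([("hits", 4), ("hits", 3)])
server2 = combine([("hits", 4), ("hits", 7)])

# Combine again across servers before writing to disk.
cluster = combine(server1 + server2)

print(server1, server2, cluster)
```

Because addition is associative and commutative, the deltas can be merged at each level in any order, so each disk sees a single +18 write in place of four separate updates.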
22. Mem
• Even a decimated subset of your fresh data can involve a lot of write bandwidth
– Sometimes referred to as “high velocity”
• High BW probably needs to go nearby your big data store
• Analytics probably isn’t going to influence the cloud/not-cloud decision
23. Discuss!
• Discuss
• For more about blekko’s setup:
– 3 part blog series at highscalability.com
– Please search [high scalability blekko] in your search engine of choice
– greg@blekko.com --- @glindahl