Henry Robinson gave a presentation on upcoming features for ZooKeeper. He discussed observers, which allow non-voting servers to scale client connections without impacting performance. He also covered dynamic ensembles, which would allow changing the ZooKeeper cluster membership without downtime. Finally, he announced that Cloudera's Distribution for Hadoop will include ZooKeeper packages to integrate it more fully.
2. ZooKeeper Futures
Expanding the Menagerie
Henry Robinson
Software engineer @ Cloudera
Hadoop meetup - 11/5/2009
3. Upcoming features for ZooKeeper
▪ Observers
▪ Dynamic ensembles
▪ ZooKeeper in Cloudera’s Distribution for Hadoop
4. Observers
▪ ZOOKEEPER-368
▪ Problem:
▪ Every node in a ZooKeeper cluster has to vote
▪ So increasing the cluster size increases the cost of write operations - see the arithmetic below
▪ But increasing the cluster size is currently the only way to get client scalability
▪ This creates a false tension between the number of clients and performance
▪ Ideally, the size of the voting cluster should only be increased to improve reliability
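To make the slide's point concrete (standard majority-quorum arithmetic, not numbers from the talk): a 3-server ensemble commits a write after 2 acknowledgements, a 5-server ensemble after 3, and a 9-server ensemble after 5 - so growing the cluster to absorb more clients makes every write wait on a larger quorum.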
5. Observers
▪ It’s worse than that
▪ Since clients are given a list of servers in the ensemble to connect to, the cluster is not isolated from being swamped by a large number of clients
▪ That is, if a swarm of clients connects to one server and kills it, the swarm moves on to another and does the same
▪ Now the same number of clients is shared amongst fewer servers!
▪ So if that many clients were enough to down a server originally, the prognosis is not good for those remaining
▪ Only ⌈n/2⌉ servers have to die before the cluster is no longer live (e.g. a 5-server ensemble loses liveness once 3 servers are down)
10. Observers
▪ Simple way to attack this problem: non-voting cluster members
▪ Act as a fan-in point for client connections by proxying requests to
the inner voting ensemble
▪ Doesn’t matter if they die (in the sense that liveness is preserved) -
cluster is still available for writes
▪ Write throughput stays roughly constant as number of Observers
increases
▪ So we can freely scale the number of Observers to meet the
requirements of the number of clients
11. Observers: More benefits
▪ Voting ensemble members must meet strict latency contracts in order not to be considered ‘failed’
▪ Therefore distributing ZooKeeper across many racks, or even datacenters, is problematic
▪ No such requirements are made of Observers
▪ So deploy the voting ensemble for reliability and low-latency communication, and everywhere you need a client, add an Observer
▪ Reads get served locally, so wide distribution isn’t too painful for some workloads (see the sketch after this slide)
▪ Likelihood of partition increases with the distribution of the ensemble, so availability is improved in some cases
▪ Good integration point for publish-subscribe, and for specific optimisations
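From a client's point of view an Observer looks like any other server: point the connection string at the nearby Observer and reads are answered from its local copy of the state. A minimal sketch using the standard ZooKeeper Java client - the hostname and znode path are placeholders:

    import org.apache.zookeeper.ZooKeeper;

    public class LocalRead {
        public static void main(String[] args) throws Exception {
            // Connect to the Observer deployed nearby, not to the remote
            // voting ensemble (hostname is a placeholder)
            ZooKeeper zk = new ZooKeeper("observer.local:2181", 3000, null);
            // Reads are served from the Observer's local state; a write
            // would be proxied on to the voting ensemble
            byte[] data = zk.getData("/some/znode", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }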
12. Observers: Current state
▪ This patch required a lot of structural work
▪ Hoping to get it into 3.3
▪ One major refactor patch committed
▪ Core patch up on ZOOKEEPER-368
▪ Check it out and add comments!
▪ Fully functional - you can apply the patch, update your configuration and start using Observers today (see the configuration sketch after this slide)
▪ Benchmarks show expected (and pleasing!) performance improvements
▪ To come in future JIRAs - performance tweaking (batching)
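A sketch of the configuration change, assuming the syntax used by the ZOOKEEPER-368 patch (hostnames are placeholders): the ':observer' suffix marks a non-voting member in every member's zoo.cfg, and each Observer additionally declares itself with peerType=observer.

    # zoo.cfg, on every member (hostnames are placeholders)
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    # the ':observer' suffix marks a non-voting member
    server.4=obs1.example.com:2888:3888:observer

    # additionally, in the Observer's own zoo.cfg:
    peerType=observer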
13. Dynamic Ensembles
▪ ZOOKEEPER-107
▪ Problem:
▪ What if you really do want to change the membership of your cluster?
▪ Downtime is problematic for a ‘highly-available’ service
▪ But failures occur and machines get repurposed or upgraded
14. Dynamic Ensembles
▪ We would like to be able to add or remove machines from the cluster without stopping the world
▪ Conceptually, this is reasonably easy - we have a mechanism for updating information on every server synchronously, and in order
▪ (it’s called ZooKeeper)
▪ In practice, this is rather involved:
▪ When is a new cluster ‘live’?
▪ Who votes on the cluster membership change?
▪ How do we deal with slow members?
▪ What happens when the leader changes?
▪ How do we find the cluster when it’s completely different?
15. Dynamic Ensembles
▪ Getting all this right is hard
▪ (good!)
▪ A fundamental change in how ZooKeeper is designed - much of the code is predicated on a static view of the cluster membership
▪ Ideally, we want to prove that the resulting protocol is correct
▪ The key observation is that membership changes must be voted on by both the old and the new configuration (see the sketch after this slide)
▪ So this is no magic bullet if the cluster is down
▪ Need to keep track of old configurations so that each vote can be tallied with the right quorum
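To illustrate that observation, a hypothetical sketch (not the ZOOKEEPER-107 code) of tallying acknowledgements during a membership change - a proposal only commits once it has a majority in both the old and the new configuration, which is why old configurations must be kept around:

    import java.util.Set;

    public class JointQuorum {
        // True if a majority of the given configuration has acked
        static boolean hasMajority(Set<Long> acks, Set<Long> config) {
            int count = 0;
            for (long serverId : config) {
                if (acks.contains(serverId)) count++;
            }
            return count > config.size() / 2;
        }

        // During a membership change, a proposal commits only when it
        // holds a majority in BOTH the old and the new configuration
        static boolean committed(Set<Long> acks, Set<Long> oldCfg, Set<Long> newCfg) {
            return hasMajority(acks, oldCfg) && hasMajority(acks, newCfg);
        }
    }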
16. Dynamic Ensembles
▪ Lots of discussion on the JIRA
▪ although no public activity for a couple of months
▪ I have code that pretty much works
▪ But waiting until Observers gets committed before I move focus completely to this
▪ Current situation not *too* bad; there are upgrade workarounds that are a bit scary theoretically but in practice work ok.
17. ZooKeeper Packages in CDH
▪ We maintain Cloudera’s Distribution for Hadoop
▪ Packages for Mapred, HDFS, HBase, Pig and Hive
▪ We see ZooKeeper as increasingly important to that stack, as well as having a wide variety of other applications
▪ Therefore, we’ve packaged ZooKeeper 3.2.1 and are making it a first-class part of CDH
▪ We’ll track the Apache releases, and also backport important patches
▪ Wrapped up in the service framework:
▪ /sbin/service zookeeper start
▪ RPMs and tarballs are done, DEBs to follow imminently
▪ Download RPMs at http://archive.cloudera.com/redhat/cdh/unstable/