Buy My Book, "The Manager's Path," Available March 2017!

Sunday, December 30, 2012

Make it Easy

One of my overriding principles is: make it easy for people to do the right thing.

This seems like it should be a no-brainer, but it was not always obvious to me. Early in my career I was a bit of a self-appointed build cop. The team I worked on was an adopter of some of the agile/extreme programming principles, and the result of that was a 40+ person team all working against the same code base, which was deployed weekly for 3 distinct business purposes. All development was done against trunk, using feature flags. We managed to do this through the heavy use of automated unit/integration testing; to check code in, you were expected to write tests of course, and to run the entire test suite to successful completion before checking in.

Unsurprisingly, people did this only to a certain level of compliance. It drove me crazy when people broke the build, especially in a way that indicated they had not bothered to run tests before they checked in. So I became the person that would nag them about it, call them out for breaking things, and generally intimidate my way into good behavior. Needless to say, that only worked so well. People were not malicious, but the tests took a LONG time to run (upwards of 4 hours at the worst), and on the older desktops you couldn't even get much work done while the test suite ran. In the 4 hours that someone was running tests another person might have checked in a conflicting change that caused errors; was the first person really supposed to re-merge and run tests for another 4 hours to make sure things were clean? It was an unsustainable situation. All my intimidation and bullying wasn't going to cause perfect compliance.

Even ignoring people breaking the build, this was an issue we needed to tackle. And so we did, taking several months improve the overall runtime and make things easier. We teased out test suites into specific ones for the distinct business purposes combined with a core test suite. We made it so that developers could run the build on distributed hardware from their local machine. We figured out how to run certain tests in parallel, and moved database-dependent tests into in-memory databases. The test run time went way down, and even better, folks could kick off the tests remotely and continue to work on their machine, so there was much less reason to try and sneak in an untested change. And lo and behold, compliance went way up. All the sudden my build cop duties were rarely required, and the whole team was more likely to take on that job rather than leaving it to me.

Make it easy goes up and down the stack, far beyond process improvements. I occasionally find myself at odds with folks that see the purity of implementing certain standards and ignore the fact that those standards, taken to extreme, make it harder for people to do the right thing. One example is REST standards. You can use the http verbs to modify the meanings of your endpoints and make them do different things, and from a computer-brain perspective, this is totally reasonable. But this can be very bad when you must add the human brain perspective to the mix. Recently an engineer proposed that we change some endpoints from being called /sysinfo (which would return OK or DEAD depending on whether a service was accepting requests), and /drain (which would switch the /sysinfo endpoint to always return DEAD), into one endpoint. That endpoint would be /sys/drain. When called with GET, it would return OK or DEAD. When called with PUT, it would act as the old drain.

To me, this is a great example of making something hard. I don't see the http verb, I see the name of the endpoint, and I see the potential for human error. If I'm looking for the status-giving endpoint, I would never guess that it would be the one called "drain", and I would certainly not risk trying to call it to find out. Even knowing what it does, I see myself accidentally calling the endpoint with GET, now I didn't drain my service before restarting it. Or I accidentally called it with PUT and now it's been taken out of the load balancer. To a computer brain, GET and PUT are very different, and hard to screw up, but when I'm typing a curl or using postman to call an endpoint, it's very easy for me as a human to make a mistake. In this case, we're not making it easy for people using the endpoints to do the right thing, we're making it easy for them to be confused, or worse, to act in error. And to what benefit? REST purity? Any quest for purity that ignores human readability does so at its peril.

All this doesn't mean I want to give everyone safety scissors. I generally prefer to use frameworks that force me and my team to do more implementation work rather than making it trivially easy. I want to make the "easy" path the one that forces folks to understand the implementation to a certain level of depth, and encourages using only the tools necessary for the job. This makes better developers of my whole team, and makes debugging production problems more science than magic, not to mention the advantage it gives you when designing for scale and general future-proofing.

Many great engineers are tripped up by human nature, when there's really no need to be. Look at your critical human-involving processes and think: am I making it easy for people to do the right thing here? Can I make it even easier? It might take more work up front on your part, or even more verbosity in your code, but it's worth it in the long run.

Thursday, December 20, 2012

Building a Global, Highly Available Service Discovery Infrastructure with ZooKeeper

This is the written version of a presentation I made at the ZooKeeper Users Meetup at Strata/Hadoop World in October, 2012 (slides available here). This writeup expects some knowledge of ZooKeeper.

The Problem:
Create a "dynamic discovery" service for a global company. This allows servers to be found by clients until they are shut down, remove their advertisement, or lose their network connectivity, at which point they are automatically de-registered and can no longer be discovered by clients. ZooKeeper ephemeral nodes are used to hold these service advertisements, because they will automatically be removed when the ZooKeeper client that made the node is closed or stops responding.

This service should be available globally, with expected "service advertisers" (servers advertising their availability, aka, writers) able to scale to the thousands, and "service clients" (servers looking for available services, aka, readers) able to scale to the tens of thousands. Both readers and writers may exist in any of three global regions: New York, London, or Asia. Each region has two datacenters with a fat pipe between them, and each region is connected to each other region, but these connections are much slower and less tolerant for piping large quantities of data.

This service should be able to withstand the loss of any one entire data center.

As creators of the infrastructure, we control the client that connects to this service. While this client wraps the ZooKeeper client, it does not have to support all of the ZooKeeper functionality.

Implications and Discoveries:
ZooKeeper requires a majority (n/2 + 1) of servers to be available and able to communicate with each other in order to form a quorum, and thus you cannot split a quorum across two data centers and guarantee that the quorum will be available with the loss of any one data center (because at least one data center will fail to have a pure majority of servers). To sustain the loss of a datacenter therefore you must split your cluster across 3 data centers.

Write speed dramatically decreases when the quorum must wait for votes to travel over the WAN. We also want to limit the number of heartbeats that must travel across the WAN. This means that both a ZooKeeper cluster with nodes spread across the globe is undesirable (due to write speed), and a ZooKeeper cluster with members only in one region is also undesirable (because writing clients outside of that region would have to continue to heartbeat over the WAN). Even if we decided to have a cluster in only one region, we would have to solve the problem that no region has more than 2 data centers, and we need 3 data centers to handle the loss/network partition of an entire data center.

Create 3 regional clusters to support discovery for each region. Each cluster has N-1 nodes split across the 2 local data centers, with the final node in the nearest remote data center.

By splitting the nodes this way, we guarantee that there is always availability if any one data center is lost or partitioned from the rest of the data centers. We also minimize the affects of the WAN on write speed by ensuring that the remote quorum member is never made into the leader node, and the general effect of the majority of nodes being local means that voting can complete (thus allowing writes to finish) without waiting for the vote from the WAN node in normal operating conditions.

3 Separate Global Clusters, One Global Service:
Having 3 separate global clusters works well for infrastructural reasons mentioned above, but it has the potential to be a headache for the users of the service. They want to be able to easily advertise their availability, and discover available servers preferably by those servers available first in their local region, and secondly in other remote regions if no local servers are available.

To do this, we wrapped our ZooKeeper client in such a way as to support the following paradigm:
Advertise Locally
Lookup Globally

Operations requiring a continuous connection to the ZooKeeper, such as advertise (which writes an ephemeral node) or watch are only allowed on the local discovery cluster. Using a virtual IP address we automatically route connections to the discovery service address of the local ZooKeeper cluster and write our ephemeral node advertisement here.

Lookups do not require a continuous connection to the ZooKeeper, and so we can support global lookups. Using the same virtual IP address we can connect to the local cluster to find local servers, and failing that use a deterministic fallback to remote ZooKeeper clusters to discover remote servers. The wrapped ZooKeeper client will automatically close its connection to the remote clusters after a period of client inactivity, so as to limit WAN heartbeat activity.

Lessons learned:
ZooKeeper as a Service (a shared ZooKeeper cluster maintained by a centralized infrastructure team to support many different clients) is a risky proposition. It is easy for a misbehaving client to take down an entire cluster by flooding it with requests or making too many connections and without a working hard quota enforcement system clients can easily push too much data into ZooKeeper. Since ZooKeeper keeps all of its nodes in memory, a client writing huge numbers of nodes with a lot of data in each can cause ZooKeeper to garbage collect or run out of memory, bringing down the entire cluster.

ZooKeeper has a few hard limits. Memory is a well-known limit, but another limit is the number of sockets for a server process (configured via the ulimit in *nix). If a node runs out of sockets due to too many client connections, it will basically cease to function without necessarily crashing. This is not surprising for anyone that has experienced this problem in other Java servers, but it is worth noting when scaling your cluster.

Folks using ZooKeeper to do this sort of dynamic discovery platform should note that if the services you are advertising are Java services, a long full GC pause can cause their session to the ZooKeeper cluster to time out and thus their advertisement will be deleted. This is generally probably a good thing, because a server that is doing a long-running full GC won't respond to client requests to connect, but it can be surprising if you are not expecting it.

Finally, I often get the question of how to set the heartbeats, timeouts, etc, to optimize a ZooKeeper cluster, and the answer is really that it depends on your network. I really recommend playing with Patrick Hunt's zk-smoketest in your data centers to figure out sensible limits for your cluster.