Elided Branches: November 2011

Saturday, November 12, 2011

A new computer

I'm writing this from my brand-new Acer netbook. It's a cute little machine, very light, long battery life, and so far seems to run Eclipse well enough for me to do some simple ZooKeeper work and other light Java development. It's also, hopefully, a symbol of a new chapter in my life.

A few months ago, I was reading The Creative Habit. I've been in search of my own creativity for a while now. You could say my career as a programmer meets the mythical 10,000 hours rule; after putting in several years of intense schooling followed by several years of focused work as a software developer, I finally started to consider myself an expert at writing general-purpose code. It's great to feel confident that you can code almost anything pretty well, but at some point I started wondering when this expertise would turn into truly creative output. I went into computer science for the cliched-but-true reason that it's a skill that can be applied into almost any sort of industry, and hoped it would allow me to build a fulfilling and lucrative career wherever I decided to go. And it has, except, where is my cool side-project? Beyond the creativity needed to architect solutions for work, I haven't found my groove.

So I started reading, and exploring, and trying to break out of my work-focused rut. In The Creative Habit, Twyla Tharpe recommends having the tool of creativity for your trade with you at all times. For a writer that might be a notebook and pencil, for a musician a tape recorder. But what is it for a creative developer? A poll of friends brought us to the conclusion that it might just be a small laptop and Python. A friend put it eloquently:

I'm picking "python" because it seems that the writer's pencil or the artist's sketchpad are more for making rough sketches than finished products, and python is one of my preferred languages for quickly hacking up prototypes.

Now here I am, I've finally taken the plunge, bought the little laptop, even started the blog to chronicle the process. 3, 2, 1, GO!

Monday, November 7, 2011

NoSQL and the Enterprise Developer

One of the people I follow on twitter, @strlen, posted a pretty good comment on Hacker News the other day. In it, he calls for NoSQL stores to become better than they currently are (a notion I doubt anyone would disagree with), and mentions some of the things he would like see evolving in the NoSQL landscape:

* Support for new and interesting distribution models. Allowing users to choose between eventual consistency, quorum protocols, primary copy replication and even transactional replication.

* Support for large, unstructured blob data[...]

* Most NoSQL systems support transactions within the scope of a single value (or document) via the use of quorums, serializing through a single master, etc... However, it'd be nice if something like MegaStore's Entity Groups (or Tablet Groups in Microsoft Azure Cloud SQL server) were supported.

* Secondary indices, whether internal or external (by shipping a changelog) to the system.

* True multi-datacenter support (local quorums if desired, async replication to the remote site) including across unreliable, high latency WAN links (disclosure: Voldemort supports this -- https://github.com/voldemort/voldemort/wiki/Multi-datacenter... )

These are all great points. In particular for the enterprise space (and especially the financial space), I think the first and last points are extremely interesting.

A major concern for the financials is business continuity. If a data center goes down, you had better be able to keep the critical parts of your business running. This has traditionally been done in a few different ways. One major way is through the use of SRDF disk, a rather slow and expensive facility that will automatically mirror data from one disk to a backup disk in a different site. For it to be performant at all, the two sites are generally kept pretty close together, with a fat link connecting them. But the overhead of the synchronous write and the cost of the disk are still meaningful, and the ultimate reality of dealing with SRDF failover of a database or file system is that frequently system administrators and DBAs need to get involved and the failover time can be quite slow. It satisfies certain regulatory requirements, and it satisfies the basic needs of business continuity, but rarely in a clean and easy to use fashion.

Now, many NoSQL systems can do some level of data replication across data centers. I personally chose to use Cassandra for a project because of the fact that I could choose write-level coherence that would guarantee writes hitting a quorum of global servers, thus assuring no data loss even in the event of a single data center failure. And hand-in-hand with point number one, this configurable read/write coherence meant that I could have a system that would always be available for reads even if a region was network partitioned from the other global regions, and would always guarantee that a quorum of servers would see a write before committing thus guaranteeing no loss of data.

Here's a tricky point quorum-based system designers should know: Many enterprises don't have data centers set up to support quorum-based systems in a local region. Often you will see 2 data centers per global region, meaning that if you need to run a quorum-based system and withstand the loss of any one data center (a general requirement for high availability business continuity), you need to have data crossing the WAN at some point. To a distributed systems programmer, this is agony. If only I had 3 data centers available in-region the possibilities for quorum-based systems to keep my data safe while still having relatively fast writes becomes so much more! But don't count on that being available to your clients.

A few glimmers of hope are on the horizon. Companies are aware of the cloud, and some are investigating whether they can use external cloud providers to host some computing. If this becomes a possibility, a cloud datacenter could become the third center in a quorum-based system. Regulators are also taking a closer look at data center locality, and wondering if there isn't too much of a concentration risk with two data centers so close together within a geographic region. This may prompt build out of additional data centers farther away in the states, but with better network connections than a cross-Atlantic link.

NoSQL folks looking for the enterprise and financial services markets, take heed. There's desire out there for what you are selling, but if it isn't easy to meet business continuity and regulatory requirements, you will never gain more than a niche position at these firms.

There's one other todo in the NoSQL space around authentication, but I will take the advice of my post reviewers and save that for a later rant.

Saturday, November 5, 2011

ZooKeeper 3.4: Lessons Learned

After several months on the planning block, it looks like ZooKeeper 3.4 is finally almost ready to be released. (Edit: Hooray! As of 11/22, release 3.4 is available!) I can say with confidence that all of the committers for the project have learned a lot from the course of this release. And most of it is in the form of "ouch, lessons learned".

First lesson: Solidify your new feature set early.
Going through the Jira, the earliest new feature for the 3.4 release is the uplift of the ZAB protocol to ZAB1.0. No small feature, to be sure, we were still debugging minor issues with it through the very end stages of our 3.4 work. We also added multi transactions, kerberos support, a read-only zookeeper, netty, windows support for C, and certainly others I'm forgetting. Some of these features were pretty simple uplifts, but some of them caused us build instability for months and a great deal of distraction. Many of these were added as "just one more feature". But many other features were neglected because "we're almost ready for 3.4" (as it turned out, often not actually the case). If we had decided early what new major features we were pushing for with 3.4, we could have concentrated our efforts more effectively and delivered much sooner.

Second lesson: When it's time to push, push.
Giving birth requires a period of concentrated pushing. If you think you can push a little now, then put it off for a few days, then a bit now, then a few weeks off... the baby will never come, and neither will the release. It took several attempts before the community finally rallied behind the efforts to get a release out, and we ended up losing a lot of momentum in the process. We didn't have a solid and pre-agreed-upon features to know when we were done, so things just kept getting in the way. When the attention on the release was off, a minor bug or feature request would come in and it just seemed so small, what was the harm?

Third lesson: Prioritize as a community, and stick to those priorities
This falls in with setting up a feature list early, but it goes beyond that. Our community was split between those who were very interested in seeing 3.4 released, and those who were working on major new changes or refactorings against trunk. As a result we all ended up feeling shortchanged. Contributors with new features did not get the attention their features needed, and many still sit in unreviewed patch form. Users that were hungry for the 3.4 release were frustrated with our lack of attention to getting it out. We had some massive new refactoring efforts that continued to happen on trunk during the course of the release process, which resulted in a frustrated committer base stuck backporting or forwardporting patches between increasingly divergent branches. These efforts found bugs, but not without some cost. Having unclear priorities divided the community, caused some tension, and ultimately slowed the whole release process down.

Fourth lesson: You can always do more releases, it doesn't all have to happen now
This is perhaps my own biggest takeaway from this process. I wish we had done much less, done it much faster, and been willing to release a 3.4 that was quickly followed by 3.4.1, 3.5, etc, as needed. Proponents of agile development and release practices have a good point; the more often you release, the less there is to go wrong and the easier it will be to fix if and when it does. It becomes a self-fulfilling prophecy. We don't release frequently so people want to cram as many new features in as possible, which slows down the releases, which results in pushes for more new features, which results in more bugs and slowed down releases, and on and on.

These lessons may seem obvious in retrospect, but they came at the price of many people's time and effort. I'm proud of our community for pulling together in the end, but I also hope that 3.5 will be a different and less arduous journey.

Buy My Book, "The Manager's Path," Available Now!