Buy My Book, "The Manager's Path," Available March 2017!

Saturday, December 31, 2011

2011: My Year of Open Source

2011 saw a lot of big events in my life. I got hip surgery early in the year. I found myself thinking of leaving my job of six-plus years in the summer. I actually left that job in the fall, and took the big leap into startup land in November. But when I think of 2011, I think the biggest change for me was my entry into the world of open source.

I would call my evolution as a developer a three phase project. First, getting all the fundamentals of computing beaten into me in undergrad and graduate school. Second, learning how to be productive in the working world, the gritty details of actually producing production code and solving problems that sometimes are purely technical but often are a matter of orchestration, attention to detail, and engineering. Finally, combining these two aspects, and putting these talents to use in something that touches developers all over the world.

I fell into the ZooKeeper community by happy chance. I had been given a project to implement a company-wide dynamic discovery service. The developers that had come before me had found ZooKeeper, but had the luxury of implementing a solution that didn't have to scale to the volume and geographic diversity of the whole firm. I had requirements for global scaling and entitlements that didn't seem to be common in the ZooKeeper community at that point, and so I was forced to do more than just comb over the documentation to design my system. I cracked open the source code and got to work learning how it really worked.

My first bug was a simple fix to the way failed authentication was communicated to the Java client library. I had to get approvals almost up to the CTO level to be able to participate, but it was worth the effort. Quickly, I started feeling more responsibility to the community. I was, after all, relying on this piece of software to give me a globally available 24/7 system, and I wanted to be able to support clients where downtime could mean trading losses in the millions. I owed it to my own infrastructure to help fix bugs, and really, it was fun. I love writing distributed systems, and the ZooKeeper code base is a pleasure to work with; a little baroque, but just enough to be a fun challenge, and most of the complexity is in the fundamental problem. 

Working in the community has not just been about fixing hard bugs. It's also been about those engineering and teamwork considerations that are challenging even on a co-located team, and working on a team that I communicated with entirely through email and Jira was intense. Lucky for me, the ZooKeeper community includes some of the most mature engineers I've ever had the pleasure of working with. We pull together to solve hard bugs, we all participate in answering questions, and we try to make everyone feel welcome to participate. I consider the community to be a textbook example of open source done right.

Working with this community, working in public after being sequestered in the tightly controlled environment of finance for so many years, flipped something in my brain. I realized that I wanted to be able to live out loud, as it were. I value openness, the ability to work with people all over the world, the ability to work in public, getting feedback and appreciation from the wider community of developers. It also gave me confidence that I could be productive outside of the comfort zone of the place I had worked for years, and that I could show a degree of leadership even without an official title.

In the end, this experience freed me from feeling tied to the corporate life I had been living. I feel open to choose the path I want as a developer. The startup world of today has embraced this open source mentality, which I think is one of the most exciting developments of the last five years. So, I chose to go to a new job that I knew would let me live out loud. 

If you're not already in the open source community, why not crack open your favorite open source project and make 2012 your year of open source? 

Wednesday, December 28, 2011

A quick one: Testing log messages

I'm writing a talk on unit testing for work, and it reminded me of one of the coolest things I learned from the ZK code base with respect to testing: testing log messages.

You probably don't want to rely too heavily on log messages for testing, but sometimes it's the only indication you have that a certain condition happened. So how do you test it?

    // Attach a WriterAppender so the log output can be inspected afterwards
    Layout layout = Logger.getRootLogger().getAppender("CONSOLE").getLayout();
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    WriterAppender appender = new WriterAppender(layout, os);
    Logger zlogger = Logger.getLogger("org.apache.zookeeper");
    zlogger.addAppender(appender);

    try {
        // ... exercise the code that should log "Majority server found" ...
    } finally {
        zlogger.removeAppender(appender);
    }

    // Scan the captured output for the expected message
    LineNumberReader r = new LineNumberReader(
            new StringReader(os.toString()));
    String line;
    Pattern p = Pattern.compile(".*Majority server found.*");
    boolean found = false;
    while ((line = r.readLine()) != null) {
        if (p.matcher(line).matches()) {
            found = true;
            break;
        }
    }
    Assert.assertTrue(
            "Majority server wasn't found while connected to r/o server",
            found);
From ReadOnlyModeTest. Kudos to Patrick Hunt (@phunt), the original author.

12/29 Edit:
Some folks have found this cringe-worthy. I agree. This is not a testing method that should be common in any code base, and for goodness' sake, if you can find a different way to ensure something happened, do so. But there are a few times when this kind of test splits the difference between expedience and coverage (i.e., I'd rather write a test to validate a log statement than just make the change and observe the log, or refactor the code base to expose the fact that the conditions causing the log were met just so I could test it).

Saturday, December 24, 2011

Effective performance testing

One of the major challenges I have faced in my career is keeping performance up in a large, rapidly-evolving system with many developers. As a Java developer writing enterprise and web applications, I generally have the luxury of not worrying about the minutiae of working within the L1 cache, or having to drop down to assembly to optimize my performance. But even enterprise applications have their performance limits, and users don't want to wait 10+ seconds to have results come back to them, or watch their website freeze due to stop-the-world garbage collection. In the land of e-commerce web apps, performance is even more essential. The bounce rate may double when a result takes 4 seconds vs 1 second, which means dollars we will never realize. So I take performance seriously.
We've all heard the old chestnut "premature optimization is the root of all evil". And indeed, when you find yourself spending half a day worrying about the O(n) time for an operation where N is never bigger than 10, you're probably falling victim to the siren call of premature optimization. A more subtle way I find that people fall into the premature optimization trap is by getting sucked in to profiling tools. I love a good profile as much as the next programmer, but profiling is devilish work that should only be undertaken when you know you have a problem to solve. But how do you know you have a problem to solve before your problem is affecting your users and costing you business?
The easiest thing to do is simply capture the runtime of important chunks of work. If you happen to be using a service model of system design, capturing the runtime of each of your exposed endpoints is a good start. For those using the Play framework (1.X), this is easily accomplished with a @Before in the controller that records the start time in a thread local, and an @After that logs the total time for the request. Now you can at least see what's running and how long it is taking.
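Setting the Play specifics aside, the underlying pattern is tiny. Here is a framework-free sketch of it (the class and method names are illustrative, not Play's API):

```java
import java.util.concurrent.TimeUnit;

// A framework-free sketch of the @Before/@After timing pattern described
// above. In Play 1.x the two methods would live in a controller and be
// annotated @Before and @After; the names here are illustrative.
class RequestTimer {
    private static final ThreadLocal<Long> START = new ThreadLocal<>();

    // Play's @Before hook would call this as the request comes in.
    static void markStart() {
        START.set(System.nanoTime());
    }

    // Play's @After hook would call this and log the result per endpoint.
    static long elapsedMillis() {
        long startNanos = START.get();
        START.remove(); // avoid leaking state across pooled request threads
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }
}
```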
Unfortunately, that alone does very little for detecting performance problems before they hit your live traffic. It also does very little to show you trends in performance without some additional work on your part. The first thing you have to do is actually save the information you're collecting over time. We live in a big data world, but you may not want to save every bit of log output from your services logs forever and ever. And most of the output will be uninteresting except as an aggregate trend. You want to see if your queries are trending up over time, as your data or user load grows, but you probably don't care that at 3:43pm on December 13th the getAvailableSkusForDate endpoint took 100ms to return 20 results. You need something that you can look at quickly, that you can keep around for a long time, and that will warn you in advance if something is going to cause you problems. I'm sure there are many ways to skin this cat, but the way that has worked best for me in the past is basic performance testing.

Starting Simple
Basic performance testing requires a small amount of setup. The main requirement is a data set that is as close as possible to your production data set. A nightly dump of the production database is great. This data source should be dedicated to the performance test for the duration of time it is running. The database (or file system or whatever) doesn't have to be exactly the same hardware as production, but it should be configured with appropriate indexes etc so as to most closely match production. For the most basic test, you can simply pick the queries you know to be worrisome, spin up the latest successful version of your trunk code, and time the execution of a set of important endpoint calls against this data. Record the results. You can use a Jenkins build to drive this whole process with fairly trivial setup, and at the laziest just record the results in a log message. Compare this to yesterday. If the difference is an increase of a certain threshold, or the total time exceeds some cutoff, fail the build and email the team. This is in fact the only thing I have done so far in my new role, but it already is providing me with more confidence that I can see a historic record over time of our most problematic queries.

Slightly More
Now that you're tracking the performance over time, a really slow change is not likely to slip into your codebase unnoticed. But really, what you would like to compare is not just today's trunk vs yesterday's trunk, but today's trunk vs today's production. Now things get a little bit trickier. To do this effectively, you may need to refresh the database twice if you are doing any modifications of the data. And if there has been schema evolution between prod and trunk, you need an automated way to apply that evolution to the data before running the trunk changes (this may apply to the basic case, too, but surely you have an automated way to apply schema changes already eh?). There are several nice things about comparing production to trunk. First of all, you can also use this as a regression test by validating that the results match in the two versions of the code. Second, you can feel pretty confident that you directly know that it is the code, and not the database or the hardware or anything else, that is causing performance differences between prod and trunk.
This setup will ultimately be as simple or as complex as your production system. If your production system is a small service talking to a database, setup is trivial. If, on the other hand, your production system is a complex beast of an application that pulls in data from several different sources, warms up large caches, or generally requires a lot of things to be just so to start up, the setup will be correspondingly complex. As with pretty much all good software engineering, the earlier you start running automated performance monitoring, the easier it is likely to be to set up and the more it will likely influence you to spend a little extra time making your system easy to deploy and automate.

Gaga for Performance Monitoring
I have only touched on the tip of the iceberg for performance testing and monitoring. There's a world of tools that sysops and dbas have to track the load of systems and performance of individual database queries. You can use log analysis tools like Splunk to identify hotspots when your system is running under real user load (a weakness of our basic performance testing framework). But I have found that all these complex tools cannot make up for the feeling of security and tracking that a good performance testing suite provides. So give it a shot. If nothing else, it can give you the confidence that all your fun time playing with profiling tools actually has a trackable difference in the performance of your code base.

Saturday, December 17, 2011

Valuing time, and teamwork

I have a guilty conscience about my self-perceived deficiencies when it comes to teamwork. I'm just not one of those people that sits at her desk until 10pm every night, or even most nights. I don't like eating dinner at work at all, in fact, and I'm not much for socializing at lunch either. It's not that I don't like my colleagues, or don't want to work hard. But I prefer a more focused day with a break for a meditative workout at lunch, I like to see my boyfriend for dinner at night, and while occasional (or even frequent) after work drinks are fun I enjoy having a social life that does not mostly revolve around work.

Perhaps as homage to my guilty conscience, or perhaps as testament to my impatience with wasted time, I put a lot of energy into doing things right the first time, and setting up systems that make it easier for everyone else to do things right the first time as well. Nothing is more awful to me than spending my time cleaning up foreseeable messes, so instead I've learned to set up automated builds, write and maintain extensive test suites, and monitor them like a hawk to ensure that they stay stable and informative.

What I have come to realize over the years is that when you view time spent as a proxy for dedication, you lose opportunities for productivity gains. We often forget that "laziness" is one of the trio of strengths/vices that makes a great programmer:
The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer. Also hence, this book. See also impatience and hubris. (Larry Wall, Programming Perl, p. 609)
It's seductively easy to fall into the trap of equating hours spent on work with value produced, but it's as silly as using lines of code as a measure of productivity. Valuing time, your own time and your colleagues' time, does not make you less valuable, or less of a team player. If anything, I think many technology teams need more people who know when to take a step back and put things in place so that they can go home at a reasonable hour and live a life outside of work. Which would you rather do: spend 8 hours writing productive code, or spend 6 hours writing productive code and 4 hours fixing bugs that you could've caught with better tests, an automated build, or performance monitoring?

I know what my answer is.

Saturday, December 10, 2011

Do the Right Thing

Two weeks into startup life, and things are going great. It's a fascinating change from enterprise development, and so far a very enjoyable one. It has already also provided a good life lesson that illustrates why I am a bit of a zealot about certain structure in the process of software development.

Almost the moment I came in to the company, I was pulled in to help on a project related to the way we compute availability for a given item on a date. This project had been started first by my boss, and had grown into a major effort that touched many parts of the system.

The logic and data design of the system was very thorough, and seemed like a sensible rewrite. A lot of the existing logic lived in a spaghetti monster of PHP code nestled into our front end, and the new system is written as a service using the Play framework. The general development pattern was to point the front end at both the existing logic (to make decisions) but also have it call the new back end service, and eventually swap everything to use the back end. So far, so good. But during development, two major mistakes were made.

First mistake: Not setting up an automated build from the moment the project was first put into source control
I was a little surprised to come in and see that while the system had tests, and developers were running the tests on their changes, there was no automated build set up to run them when people checked in. But, no big deal, I got our sysops guy to build me a machine and we installed Jenkins. Getting the Play build to run in Jenkins was the trivial matter of installing the play plugin and configuring a build to check the code out, clean it and run auto-test. So far, a matter of just a couple hours.
And then the build ran, and the tests failed. I figured they just had some failures that were recently added. But no, the tests ran fine on my local machine. And they ran fine on everyone else's local machines as well. The failures, after some debugging, seemed to boil down to dates. Instead of using Joda Time from the get-go, we had a bunch of logic around java.util.Calendar. The new machine was running on UTC, and despite seeming to set the timezone to New York, we had failures all over the place. So, after too many hours trying to solve the problem with piecemeal moves to Joda Time, I took a day to completely overhaul all the date logic to use Joda.
And still, the build was failing.
Now, I could have let it go at this point, but I had a nagging feeling. If our tests are failing on our UTC build box, what do you think our code is doing on our production UTC machines? Bad things, probably. So I kept digging away, and finally discovered that I needed to set the timezone as a -D parameter to the Play framework on startup. And after a long day and a half of struggling, we had a working build, and a much better understanding of how to properly use dates in our system. But this wouldn't have cost a day and a half of developer time if the build had been set up from day 0 of the project.
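For the curious, the failure mode is easy to reproduce with nothing but the standard library: java.util.Calendar silently picks up a time zone, so the same instant can land on different calendar dates on a New York box and a UTC box.

```java
import java.util.Calendar;
import java.util.TimeZone;

// A small illustration of the failure mode described above: the same
// instant in time falls on different calendar dates in different zones,
// so date logic built on Calendar behaves differently on a UTC machine.
class TzDemo {
    static int dayOfMonth(long epochMillis, String zoneId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(zoneId));
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.DAY_OF_MONTH);
    }
}
```

1 a.m. UTC on December 10th is still 8 p.m. on December 9th in New York, which is exactly the kind of off-by-one-day discrepancy that made the tests fail only on the UTC build box.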

Second mistake: Not thoroughly testing the framework interaction
The month of December is far and away the busiest month for the site. So it should come as no surprise that we would hit our new system with a lot of traffic during this time. At some point mid last week, our system suddenly slowed to a crawl and the database started spiking its load. Frantic hours were spent turning off logic before things finally calmed down. We assumed it was due to poor SQL optimization, and so we optimized the queries, but still things were going slow. Finally, we discovered that a particularly heavy-weight call was being made with user id 0, and long story short, bypassing the logic for that userid made everything hum again.
But why were we getting any calls at all with that userid? Must just be a bug in the spaghetti code of the front end system. Turns out that was true, but not in the way we expected.
The Play framework has a relatively nice way of developing. You write controller classes which expose endpoints. The parameters to these methods can be annotated with a nice @Required annotation. Now, we assumed that @Required meant that any call that didn't have that parameter would fail. But we never bothered to write a test for this fact. So, fast forward to Friday. I'm debugging a warning message that we seem to be getting far too often, when I realize that we have a bunch of calls coming in, and being executed, without some parameters. But they weren't marked as @Required. So I told the front-end devs that they needed to pass those parameters, and went to mark them @Required. And as per my zealotry, I started writing a test that would actually POST to the framework without the parameters, expecting it to be a quick matter of verifying the failure.
As I'm sure you can guess by now, the test didn't fail. Why not? Well, two reasons. One, @Required does nothing unless you explicitly write a check to see whether the method parameter validation failed. Nice. Two, those parameters were being declared as primitive types, but to do the validation we turned them into Objects, and as a result a missing parameter became a Long with value 0. Whoops. So up until now, we hadn't been raising any sort of error when parameters were missing, and we'd been populating our database with various 0 values unintentionally.
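To make the unboxing trap concrete, here's a framework-free sketch (the binder methods are invented for illustration, not Play internals): a missing parameter bound to a Long can stay null, but bound to a primitive long it has to become *some* value, and 0 is what you get.

```java
import java.util.HashMap;
import java.util.Map;

// A framework-free sketch of the trap described above. These binder
// methods are illustrative, not Play internals: a missing parameter
// bound to Long can be reported as null, but bound to a primitive long
// it must become some value, and that value is 0.
class ParamBinding {
    static Long bindBoxed(Map<String, String> params, String name) {
        String raw = params.get(name);
        return raw == null ? null : Long.valueOf(raw);
    }

    static long bindPrimitive(Map<String, String> params, String name) {
        String raw = params.get(name);
        return raw == null ? 0L : Long.parseLong(raw); // missing silently becomes 0
    }
}
```

Downstream code then can't distinguish "caller forgot the parameter" from "caller really meant user id 0", which is exactly how those 0 values ended up in the database.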

We've fixed it all now. But doing the right thing up front would have cost very little time, and saved us probably 10s of developer hours of work, not to mention the quite likely business cost incurred by site instability during our busiest month. Don't skimp on your testing, even in crunch mode, even for the boring parts of the system. It's just not worth the cost.

Sunday, December 4, 2011

Interviewing for Judgement

One of the things that occurs to me after my first week of startup work is how essential it is to hire people with good judgement. In an environment where even the interns have huge amounts of responsibility and everyone is under a lot of time pressure, the judgement to not only know how to make good technical choices but also to know when to ask for help is essential. Is the right fix to this bug to hack around it, or to take a moment to write a bit more scaffolding and fix it everywhere for good? Which schema changes require another set of eyes, and which are trivial? As a manager you're also a contributor, and you probably don't have the time to micromanage decision making, nor would you want to. But when you discover in code review that a feature you thought would be simple required several schema changes and some sketchy queries, you regret not having insight sooner in the process.

Hiring for judgement is hard. It's easy to hire for technical chops. We know how to screen out people who can't write code, and if you follow the general best practices of technical hiring (make them write code, at least pseudocode, a few times), you're unlikely to hire candidates who can't produce anything. But the code-heavy interview process does little to ferret out developers with judgement problems, the ones who can't prioritize work and don't know when to ask for help. If you're hiring for a company the size of Google it doesn't matter that much; you can always put a manager or tech lead in between that person and a deadline. Small companies don't have this luxury, and it is that much more important that the screening process captures more than just sheer technical ability.

I was hoping as I wrote this that I would come up with some ideas for hiring for judgement. But I have never had to hire for judgement until now. Judgement was something I had the luxury of teaching, or simply guessing at and hoping for the best. The internet suggests questions like
Tell me about a time when you encountered a situation where you had to decide whether it was appropriate for you to handle a problem yourself, or escalate it to your manager. What were the circumstances? How did you handle it? Would you handle the same situation the same way again, or would you do it differently this time?
This seems to me better than nothing, but not amazing.

So, over the next few months I'm going to experiment with questions like these and see how they play out. There's no way to do a rigorous study but at least I can start to get a feel for which questions can even tease out interesting answers. And if you have any questions you like, or thoughts on the matter, leave me a comment.

Saturday, November 12, 2011

A new computer

I'm writing this from my brand-new Acer netbook. It's a cute little machine, very light, long battery life, and so far seems to run Eclipse well enough for me to do some simple ZooKeeper work and other light Java development. It's also, hopefully, a symbol of a new chapter in my life.

A few months ago, I was reading The Creative Habit. I've been in search of my own creativity for a while now. You could say my career as a programmer meets the mythical 10,000 hours rule; after putting in several years of intense schooling followed by several years of focused work as a software developer, I finally started to consider myself an expert at writing general-purpose code. It's great to feel confident that you can code almost anything pretty well, but at some point I started wondering when this expertise would turn into truly creative output. I went into computer science for the cliched-but-true reason that it's a skill that can be applied in almost any industry, and hoped it would allow me to build a fulfilling and lucrative career wherever I decided to go. And it has, except, where is my cool side project? Beyond the creativity needed to architect solutions for work, I haven't found my groove.

So I started reading, and exploring, and trying to break out of my work-focused rut. In The Creative Habit, Twyla Tharp recommends having the tool of creativity for your trade with you at all times. For a writer that might be a notebook and pencil, for a musician a tape recorder. But what is it for a creative developer? A poll of friends brought us to the conclusion that it might just be a small laptop and Python. A friend put it eloquently:
I'm picking "python" because it seems that the writer's pencil or the artist's sketchpad are more for making rough sketches than finished products, and python is one of my preferred languages for quickly hacking up prototypes.
Now here I am, I've finally taken the plunge, bought the little laptop, even started the blog to chronicle the process. 3, 2, 1, GO!

Monday, November 7, 2011

NoSQL and the Enterprise Developer

One of the people I follow on Twitter, @strlen, posted a pretty good comment on Hacker News the other day. In it, he calls for NoSQL stores to become better than they currently are (a notion I doubt anyone would disagree with), and mentions some of the things he would like to see evolving in the NoSQL landscape:

* Support for new and interesting distribution models. Allowing users to choose between eventual consistency, quorum protocols, primary copy replication and even transactional replication.
* Support for large, unstructured blob data[...]
* Most NoSQL systems support transactions within the scope of a single value (or document) via the use of quorums, serializing through a single master, etc... However, it'd be nice if something like MegaStore's Entity Groups (or Tablet Groups in Microsoft Azure Cloud SQL server) were supported. 
* Secondary indices, whether internal or external (by shipping a changelog) to the system. 
* True multi-datacenter support (local quorums if desired, async replication to the remote site) including across unreliable, high latency WAN links (disclosure: Voldemort supports this)

These are all great points. In particular for the enterprise space (and especially the financial space), I think the first and last points are extremely interesting. 

A major concern for the financials is business continuity. If a data center goes down, you had better be able to keep the critical parts of your business running. This has traditionally been done in a few different ways. One major way is through the use of SRDF disk, a rather slow and expensive facility that will automatically mirror data from one disk to a backup disk in a different site. For it to be performant at all, the two sites are generally kept pretty close together, with a fat link connecting them. But the overhead of the synchronous write and the cost of the disk are still meaningful, and the ultimate reality of dealing with SRDF failover of a database or file system is that frequently system administrators and DBAs need to get involved and the failover time can be quite slow. It satisfies certain regulatory requirements, and it satisfies the basic needs of business continuity, but rarely in a clean and easy-to-use fashion.

Now, many NoSQL systems can do some level of data replication across data centers. I personally chose to use Cassandra for a project because of the fact that I could choose write-level coherence that would guarantee writes hitting a quorum of global servers, thus assuring no data loss even in the event of a single data center failure. And hand-in-hand with point number one, this configurable read/write coherence meant that I could have a system that would always be available for reads even if a region was network partitioned from the other global regions, and would always guarantee that a quorum of servers would see a write before committing thus guaranteeing no loss of data.

Here's a tricky point quorum-based system designers should know: many enterprises don't have data centers set up to support quorum-based systems in a local region. Often you will see two data centers per global region, meaning that if you need to run a quorum-based system and withstand the loss of any one data center (a general requirement for high-availability business continuity), you need to have data crossing the WAN at some point. To a distributed systems programmer, this is agony. If only I had three data centers available in-region, the possibilities for quorum-based systems to keep my data safe while still having relatively fast writes would open right up! But don't count on that being available to your clients.
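The arithmetic behind this point is worth spelling out. This little sketch (not tied to any particular store) checks whether a write quorum over N replicas survives losing the data center that holds some of them:

```java
// The quorum arithmetic behind the point above, as a sketch: with N
// replicas, a quorum is floor(N/2) + 1, and a write quorum survives
// losing a data center only if the surviving replicas still amount
// to a quorum on their own.
class QuorumMath {
    static int quorum(int totalReplicas) {
        return totalReplicas / 2 + 1;
    }

    // Can we still assemble a write quorum after losing the data center
    // that holds replicasInLostDc of the N replicas?
    static boolean survivesDcLoss(int totalReplicas, int replicasInLostDc) {
        return totalReplicas - replicasInLostDc >= quorum(totalReplicas);
    }
}
```

With three replicas spread over three data centers, losing one leaves 2 of 3, still a quorum. With only two data centers, one of them necessarily holds a majority of the replicas, so surviving its loss is impossible without replicas crossing the WAN.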

A few glimmers of hope are on the horizon. Companies are aware of the cloud, and some are investigating whether they can use external cloud providers to host some computing. If this becomes a possibility, a cloud datacenter could become the third center in a quorum-based system. Regulators are also taking a closer look at data center locality, and wondering if there isn't too much of a concentration risk with two data centers so close together within a geographic region. This may prompt build out of additional data centers farther away in the states, but with better network connections than a cross-Atlantic link.

NoSQL folks looking for the enterprise and financial services markets, take heed. There's desire out there for what you are selling, but if it isn't easy to meet business continuity and regulatory requirements, you will never gain more than a niche position at these firms.

There's one other todo in the NoSQL space around authentication, but I will take the advice of my post reviewers and save that for a later rant.

Saturday, November 5, 2011

ZooKeeper 3.4: Lessons Learned

After several months on the planning block, it looks like ZooKeeper 3.4 is finally almost ready to be released. (Edit: Hooray! As of 11/22, release 3.4 is available!) I can say with confidence that all of the committers for the project have learned a lot from the course of this release. And most of it is in the form of "ouch, lessons learned".

First lesson: Solidify your new feature set early.
Going through the Jira, the earliest new feature for the 3.4 release is the uplift of the ZAB protocol to ZAB 1.0. No small feature, to be sure; we were still debugging minor issues with it through the very end stages of our 3.4 work. We also added multi transactions, Kerberos support, a read-only ZooKeeper, Netty support, Windows support for C, and certainly others I'm forgetting. Some of these features were pretty simple uplifts, but some of them caused us build instability for months and a great deal of distraction. Many of these were added as "just one more feature". But many other features were neglected because "we're almost ready for 3.4" (as it turned out, often not actually the case). If we had decided early what new major features we were pushing for with 3.4, we could have concentrated our efforts more effectively and delivered much sooner.

Second lesson: When it's time to push, push.
Giving birth requires a period of concentrated pushing. If you think you can push a little now, then put it off for a few days, then a bit now, then a few weeks off... the baby will never come, and neither will the release. It took several attempts before the community finally rallied behind the efforts to get a release out, and we ended up losing a lot of momentum in the process. We didn't have a solid, pre-agreed-upon feature list to tell us when we were done, so things just kept getting in the way. When the attention on the release was off, a minor bug or feature request would come in and it just seemed so small, what was the harm?

Third lesson: Prioritize as a community, and stick to those priorities
This falls in with setting up a feature list early, but it goes beyond that. Our community was split between those who were very interested in seeing 3.4 released, and those who were working on major new changes or refactorings against trunk. As a result we all ended up feeling shortchanged. Contributors with new features did not get the attention their features needed, and many still sit in unreviewed patch form. Users that were hungry for the 3.4 release were frustrated with our lack of attention to getting it out. We had some massive new refactoring efforts that continued to happen on trunk during the course of the release process, which resulted in a frustrated committer base stuck backporting or forwardporting patches between increasingly divergent branches. These efforts found bugs, but not without some cost. Having unclear priorities divided the community, caused some tension, and ultimately slowed the whole release process down.

Fourth lesson: You can always do more releases, it doesn't all have to happen now
This is perhaps my own biggest takeaway from this process. I wish we had done much less, done it much faster, and been willing to release a 3.4 that was quickly followed by 3.4.1, 3.5, etc, as needed. Proponents of agile development and release practices have a good point; the more often you release, the less there is to go wrong and the easier it will be to fix if and when it does. It becomes a self-fulfilling prophecy. We don't release frequently so people want to cram as many new features in as possible, which slows down the releases, which results in pushes for more new features, which results in more bugs and slowed down releases, and on and on.

These lessons may seem obvious in retrospect, but they came at the price of many people's time and effort. I'm proud of our community for pulling together in the end, but I also hope that 3.5 will be a different and less arduous journey.