Saturday, December 31, 2011

2011: My Year of Open Source

2011 saw a lot of big events in my life. I got hip surgery early in the year. I found myself thinking of leaving my job of six plus years in the summer. I actually left that job in the fall, and took the big leap into startup land in November. But when I think of 2011, I think the most transformative influence on me was my entry into the world of open source.

I would call my evolution as a developer a three-phase project. First, getting all the fundamentals of computing beaten into me in undergrad and graduate school. Second, learning how to be productive in the working world: the gritty details of actually producing production code, and solving problems that are sometimes purely technical but more often a matter of orchestration, attention to detail, and engineering. Finally, combining these two aspects and putting those talents to use in something that touches developers all over the world.

I fell into the ZooKeeper community by happy chance. I had been given a project to implement a company-wide dynamic discovery service. The developers that had come before me had found ZooKeeper, but had the luxury of implementing a solution that didn't have to scale to the volume and geographic diversity of the whole firm. I had requirements for global scaling and entitlements that didn't seem to be common in the ZooKeeper community at that point, and so I was forced to do more than just comb over the documentation to design my system. I cracked open the source code and got to work learning how it really worked.

My first bug was a simple fix to the way failed authentication was communicated to the Java client library. I had to get approvals almost up to the CTO level to be able to participate, but it was worth the effort. Quickly, I started feeling more responsibility to the community. I was, after all, relying on this piece of software to give me a globally available 24/7 system, and I wanted to be able to support clients where downtime could mean trading losses in the millions. I owed it to my own infrastructure to help fix bugs, and really, it was fun. I love writing distributed systems, and the ZooKeeper code base is a pleasure to work with; a little baroque, but just enough to be a fun challenge, and most of the complexity is in the fundamental problem. 

Working in the community has not just been about fixing hard bugs. It's also been about those engineering and teamwork considerations that are tricky even on a co-located team, and working on a team that I communicated with entirely through email and Jira was intense. Lucky for me, the ZooKeeper community has some of the most mature engineers I've ever had the pleasure of working with. We pull together to solve hard bugs, we all participate in answering questions, and we try to make everyone feel welcome to participate. I consider the community to be the textbook example of open source done right.

Working with this community, working in public after being sequestered in the tightly controlled environment of finance for so many years, flipped something in my brain. I realized that I wanted to be able to live out loud, as it were. I value openness, the ability to work with people all over the world, the ability to work in public, getting feedback and appreciation from the wider community of developers. It also gave me confidence that I could be productive outside of the comfort zone of the place I had worked for years, and that I could show a degree of leadership even without an official title.

In the end, this experience freed me from feeling tied to the corporate life I had been living. I feel open to choose the path I want as a developer. The startup world of today has embraced this open source mentality, which I think is one of the most exciting developments of the last five years. So, I chose to go to a new job that I knew would let me live out loud. 

If you're not already in the open source community, why not crack open your favorite open source project and make 2012 your year of open source? 

Wednesday, December 28, 2011

A quick one: Testing log messages

I'm writing a talk on unit testing for work, and it reminded me of one of the coolest things I learned from the ZK code base with respect to testing: testing log messages.

You probably don't want to rely too heavily on log messages for testing, but sometimes it's the only indication you have that a certain condition happened. So how do you test it?



        // Borrow the layout from the existing console appender, and attach a
        // WriterAppender that captures log output into an in-memory buffer.
        Layout layout = Logger.getRootLogger().getAppender("CONSOLE").getLayout();
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        WriterAppender appender = new WriterAppender(layout, os);
        appender.setImmediateFlush(true);
        appender.setThreshold(Level.INFO);
        Logger zlogger = Logger.getLogger("org.apache.zookeeper");
        zlogger.addAppender(appender);

        try {
            // ... exercise the code that should produce the log message ...
        } finally {
            // Always detach the capturing appender so other tests are unaffected.
            zlogger.removeAppender(appender);
        }

        // Scan the captured output, line by line, for the expected message.
        os.close();
        LineNumberReader r = new LineNumberReader(new StringReader(os.toString()));
        String line;
        Pattern p = Pattern.compile(".*Majority server found.*");
        boolean found = false;
        while ((line = r.readLine()) != null) {
            if (p.matcher(line).matches()) {
                found = true;
                break;
            }
        }
        Assert.assertTrue(
                "Majority server wasn't found while connected to r/o server", found);
    }

From ReadOnlyModeTest. Kudos to Patrick Hunt (@phunt), the original author.


12/29 Edit:
Some folks have found this cringe-worthy. I agree. This is not a testing method that should be common in any code base, and for goodness' sake, if you can find a different way to ensure something happened, do so. But there are a few times when this kind of test splits the difference between expedience and coverage (i.e., I'd rather write a test to validate a log statement than just make the change and observe the log, or refactor the code base to expose the fact that the conditions causing the log were met just to be able to test it).

Saturday, December 24, 2011

Effective performance testing

One of the major challenges I have faced in my career is keeping performance up in a large, rapidly-evolving system with many developers. As a Java developer writing enterprise and web applications, I generally have the luxury of not worrying about the minutiae of working within the L1 cache, or having to drop down to assembly to optimize my performance. But even enterprise applications have their performance limits, and users don't want to wait 10+ seconds to have results come back to them, or watch their website freeze due to stop-the-world garbage collection. In the land of e-commerce web apps, performance is even more essential. The bounce rate may double when a result takes 4 seconds vs 1 second, which means dollars we will never realize. So I take performance seriously.
We've all heard the old chestnut "premature optimization is the root of all evil". And indeed, when you find yourself spending half a day worrying about the O(n) time for an operation where n is never bigger than 10, you're probably falling victim to the siren call of premature optimization. A more subtle way I find that people fall into the premature optimization trap is by getting sucked into profiling tools. I love a good profile as much as the next programmer, but profiling is devilish work that should only be undertaken when you know you have a problem to solve. But how do you know you have a problem to solve before your problem is affecting your users and costing you business?
The easiest thing to do is simply to capture the runtime of important chunks of work. If you happen to be using a service model of system design, capturing the runtime of each of your exposed endpoints is a good start. For those that happen to be using the Play framework (1.X), this is easily accomplished with a @Before in the controller that records the start time in a thread local, and an @After that logs the total time for the request. Now you can at least see what's running and how long it is taking.
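As a rough sketch (the class name and log format are mine, not from any real code base), the Play 1.X version might look something like this:

    import play.Logger;
    import play.mvc.After;
    import play.mvc.Before;
    import play.mvc.Controller;

    public class TimedController extends Controller {

        private static final ThreadLocal<Long> startTime = new ThreadLocal<Long>();

        @Before
        static void recordStartTime() {
            // Stash the request's start time on the current thread.
            startTime.set(System.currentTimeMillis());
        }

        @After
        static void logRequestTime() {
            Long start = startTime.get();
            if (start != null) {
                // request.action is the controller.method that handled the call.
                Logger.info("%s took %d ms", request.action, System.currentTimeMillis() - start);
                startTime.remove();
            }
        }
    }

Other controllers can pick these interceptors up through inheritance or the @With annotation, so every endpoint gets timed without touching individual actions.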
Unfortunately, that alone does very little for detecting performance problems before they hit your live traffic. It also does very little to show you trends in performance without some additional work on your part. The first thing you have to do is actually save the information you're collecting over time. We live in a big data world, but you may not want to save every bit of output from your services' logs forever and ever. And most of the output will be uninteresting except as an aggregate trend. You want to see if your queries are trending up over time, as your data or user load grows, but you probably don't care that at 3:43pm on December 13th the getAvailableSkusForDate endpoint took 100ms to return 20 results. You need something that you can look at quickly, that you can keep around for a long time, and that will warn you in advance if something is going to cause you problems. I'm sure there are many ways to skin this cat, but the way that has worked best for me in the past is basic performance testing.

Starting Simple
Basic performance testing requires a small amount of setup. The main requirement is a data set that is as close as possible to your production data set. A nightly dump of the production database is great. This data source should be dedicated to the performance test for the duration of time it is running. The database (or file system or whatever) doesn't have to be exactly the same hardware as production, but it should be configured with appropriate indexes, etc., so as to match production as closely as possible. For the most basic test, you can simply pick the queries you know to be worrisome, spin up the latest successful version of your trunk code, and time the execution of a set of important endpoint calls against this data. Record the results. You can use a Jenkins build to drive this whole process with fairly trivial setup, and at the laziest just record the results in a log message. Compare this to yesterday. If the difference is an increase over a certain threshold, or the total time exceeds some cutoff, fail the build and email the team. This is in fact the only thing I have done so far in my new role, but it already is providing me with more confidence that I can see a historical record of our most problematic queries over time.
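To make that concrete, here is a sketch of the laziest version of such a harness (the endpoint URL, the results file, and the 50% threshold are invented placeholders): a plain main method that Jenkins runs nightly, which exits non-zero when things look worse than yesterday.

    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Properties;

    public class EndpointPerfCheck {

        static long timeEndpoint(String url) throws IOException {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.getResponseCode();  // force the request to complete
            conn.disconnect();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws IOException {
            // Hypothetical worrisome endpoints, pointed at the production-like data set.
            String[] endpoints = {
                "http://perf-test-host:9000/availability?date=2011-12-24",
            };

            Properties yesterday = new Properties();
            File history = new File("perf-results.properties");
            if (history.exists()) {
                yesterday.load(new FileReader(history));
            }

            boolean failed = false;
            for (String url : endpoints) {
                long ms = timeEndpoint(url);
                String prev = yesterday.getProperty(url);
                // Fail if an endpoint got 50% slower than yesterday's recorded run.
                if (prev != null && ms > Long.parseLong(prev) * 1.5) {
                    System.err.println("REGRESSION: " + url + " went from " + prev + "ms to " + ms + "ms");
                    failed = true;
                }
                yesterday.setProperty(url, Long.toString(ms));
            }
            FileWriter out = new FileWriter(history);
            yesterday.store(out, "endpoint timings");
            out.close();

            if (failed) {
                System.exit(1);  // a non-zero exit fails the Jenkins build and triggers the email
            }
        }
    }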

Slightly More
Now that you're tracking the performance over time, a really slow change is not likely to slip into your codebase unnoticed. But really, what you would like to compare is not just today's trunk vs yesterday's trunk, but today's trunk vs today's production. Now things get a little bit trickier. To do this effectively, you may need to refresh the database twice if you are doing any modifications of the data. And if there has been schema evolution between prod and trunk, you need an automated way to apply that evolution to the data before running the trunk changes (this may apply to the basic case, too, but surely you have an automated way to apply schema changes already eh?). There are several nice things about comparing production to trunk. First of all, you can also use this as a regression test by validating that the results match in the two versions of the code. Second, you can feel pretty confident that you directly know that it is the code, and not the database or the hardware or anything else, that is causing performance differences between prod and trunk.
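A sketch of that prod-versus-trunk comparison, with the same caveat that the hosts, port, and endpoint are invented for illustration (a real harness would hit many endpoints and handle writes too):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ProdVsTrunkCheck {

        static class Result {
            final String body;
            final long millis;
            Result(String body, long millis) { this.body = body; this.millis = millis; }
        }

        // Time one GET and keep the body so the two versions can be compared.
        static Result fetch(String url) throws Exception {
            long start = System.currentTimeMillis();
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder body = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            in.close();
            return new Result(body.toString(), System.currentTimeMillis() - start);
        }

        public static void main(String[] args) throws Exception {
            String path = "/availability?date=2011-12-24";  // hypothetical endpoint
            Result prod = fetch("http://perf-prod-host:9000" + path);
            Result trunk = fetch("http://perf-trunk-host:9000" + path);

            // Regression check: the same data set should produce the same answer.
            if (!prod.body.equals(trunk.body)) {
                System.err.println("Results differ between prod and trunk for " + path);
                System.exit(1);
            }
            // Performance check: flag trunk if it is markedly slower than prod.
            if (trunk.millis > prod.millis * 1.5) {
                System.err.println("Trunk is more than 50% slower than prod for " + path);
                System.exit(1);
            }
        }
    }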
This setup will ultimately be as simple or as complex as your production system. If your production system is a small service talking to a database, setup is trivial. If, on the other hand, your production system is a complex beast of an application that pulls in data from several different sources, warms up large caches, or generally requires a lot of things to be just so to start up, the setup will be correspondingly complex. As with pretty much all good software engineering, the earlier you start running automated performance monitoring, the easier it is likely to be to set up and the more it will likely influence you to spend a little extra time making your system easy to deploy and automate.

Gaga for Performance Monitoring
I have only touched on the tip of the iceberg for performance testing and monitoring. There's a world of tools that sysops and dbas have to track the load of systems and performance of individual database queries. You can use log analysis tools like Splunk to identify hotspots when your system is running under real user load (a weakness of our basic performance testing framework). But I have found that all these complex tools cannot make up for the feeling of security and tracking that a good performance testing suite provides. So give it a shot. If nothing else, it can give you the confidence that all your fun time playing with profiling tools actually has a trackable difference in the performance of your code base.

Saturday, December 17, 2011

Valuing time, and teamwork

I have a guilty conscience about my self-perceived deficiencies when it comes to teamwork. I'm just not one of those people that sits at her desk until 10pm every night, or even most nights. I don't like eating dinner at work at all, in fact, and I'm not much for socializing at lunch either. It's not that I don't like my colleagues, or don't want to work hard. But I prefer a more focused day with a break for a meditative workout at lunch, I like to see my boyfriend for dinner at night, and while occasional (or even frequent) after work drinks are fun I enjoy having a social life that does not mostly revolve around work.

Perhaps as homage to my guilty conscience, or perhaps as testament to my impatience with wasted time, I put a lot of energy into doing things right the first time, and setting up systems that make it easier for everyone else to do things right the first time as well. Nothing is more awful to me than spending my time cleaning up foreseeable messes, so instead I've learned to set up automated builds, write and maintain extensive test suites, and monitor them like a hawk to ensure that they stay stable and informative.

What I have come to realize over the years is that when you view time spent as a proxy for dedication, you lose opportunities for productivity gains. We often forget that "laziness" is one of the trio of virtues/vices that makes a great programmer:
The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write labor-saving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer. Also hence, this book. See also impatience and hubris. (Larry Wall, p. 609)
It's seductively easy to fall into the trap of equating hours spent on work with value produced, but it's as silly as using lines of code as a measure of productivity. Valuing time, your own and your colleagues', does not make you less valuable, or less of a team player. If anything, I think many technology teams need more people that know when to take a step back and put things in place so that they can go home at a reasonable hour and live a life outside of work. Which would you rather do: spend 8 hours writing productive code, or spend 6 hours writing productive code and 4 hours fixing bugs that you could've caught with better tests, an automated build, or performance monitoring?

I know what my answer is.

Saturday, December 10, 2011

Do the Right Thing

Two weeks into startup life, and things are going great. It's a fascinating change from enterprise development, and so far a very enjoyable one. It has already also provided a good life lesson that illustrates why I am a bit of a zealot about certain structure in the process of software development.

Almost the moment I came in to the company, I was pulled in to help on a project related to the way we compute availability for a given item on a date. This project had been started first by my boss, and had grown into a major effort that touched many parts of the system.

The logic and data design of the system was very thorough, and seemed like a sensible rewrite. A lot of the existing logic lived in a spaghetti monster of PHP code nestled into our front end, and the new system is written as a service using the Play framework. The general development pattern was to have the front end call both the existing logic (to make decisions) and the new back end service, and eventually swap everything over to the back end. So far, so good. But during development, two major mistakes were made.

First mistake: Not setting up an automated build from the moment the project was first put into source control
I was a little surprised to come in and see that while the system had tests, and developers were running the tests on their changes, there was no automated build set up to run them when people checked in. But, no big deal, I got our sysops guy to build me a machine and we installed Jenkins. Getting the Play build to run in Jenkins was the trivial matter of installing the play plugin and configuring a build to check the code out, clean it and run auto-test. So far, a matter of just a couple hours.
And then the build ran, and the tests failed. I figured they just had some failures that were recently added. But no, the tests ran fine on my local machine. And they ran fine on everyone else's local machines as well. The failures, after some debugging, seemed to boil down to dates. Instead of using Joda Time from the get-go, we had a bunch of logic around java.util.Calendar. The new machine was running on UTC, and despite seeming to set the timezone to New York, we had failures all over the place. So, after too many hours trying to solve the problem with piecemeal moves to Joda Time, I took a day to completely overhaul all the date logic to use Joda.
And still, the build was failing.
Now, I could have let it go at this point, but I had a nagging feeling. If our tests are failing on our UTC build box, what do you think our code is doing on our production UTC machines? Bad things, probably. So I kept digging away, and finally discovered that I needed to set the timezone as a -D parameter to the Play framework on startup. And after a long day and a half of struggling, we had a working build, and a much better understanding of how to properly use dates in our system. But this wouldn't have cost a day and a half of developer time if the build had been set up from day 0 of the project.
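For anyone who hasn't been bitten by this before, a small sketch of the mismatch (the class name is illustrative, and -Duser.timezone is my assumption about one way to force the default; the exact parameter we used isn't shown here). Passing a TimeZone to Calendar.getInstance only affects that one instance; anything that falls back on the JVM default, including Joda Time's default zone, still uses the machine's zone:

    import java.util.Calendar;
    import java.util.TimeZone;
    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;

    public class TimezoneDemo {
        public static void main(String[] args) {
            // Passing a TimeZone only affects this one Calendar instance...
            Calendar nyCalendar = Calendar.getInstance(TimeZone.getTimeZone("America/New_York"));
            System.out.println("Calendar zone:     " + nyCalendar.getTimeZone().getID());

            // ...while anything relying on the JVM default, including Joda Time's
            // default zone, still uses the machine's zone (UTC on the build box).
            System.out.println("JVM default zone:  " + TimeZone.getDefault().getID());
            System.out.println("Joda default zone: " + DateTimeZone.getDefault());

            // Near midnight New York time, even "today" disagrees between the two:
            System.out.println("Today (default zone): " + new DateTime().toLocalDate());
            System.out.println("Today (New York):     "
                    + new DateTime(DateTimeZone.forID("America/New_York")).toLocalDate());
        }
    }

Starting the JVM with something like -Duser.timezone=America/New_York (or whatever mechanism your framework exposes for setting the default zone) makes the build box behave like the developer machines did.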

Second mistake: Not thoroughly testing the framework interaction
The month of December is far and away the busiest month for the site. So it should come as no surprise that we would hit our new system with a lot of traffic during this time. At some point mid last week, our system suddenly slowed to a crawl and the database started spiking its load. Frantic hours were spent turning off logic before things finally calmed down. We assumed it was due to poor sql optimization, and so we optimized the queries, but still things were going slow. Finally, we discovered that a particularly heavy-weight call was being made with user id 0, and long story short, bypassing the logic for that userid made everything hum again.
But why were we getting any calls at all with that userid? Must just be a bug in the spaghetti code of the front end system. Turns out that was true, but not in the way we expected.
The Play framework has a relatively nice way of developing. You write these controller classes which expose endpoints. The parameters to these methods can be annotated with this nice @Required annotation. Now, we assumed that @Required meant that any call that didn't have that parameter would fail. But we never bothered to write a test for this fact. So, fast forward to Friday. I'm debugging a warning message that we seem to be getting far too often, when I realize that we have a bunch of calls coming in, and being executed, without some parameters. But they weren't marked as @Required. So I told the front-end devs that they needed to pass those parameters, and went to make them @Required. And as per my zealotry, I started writing a test that would actually POST to the framework without the parameters, expecting it to be a quick matter of verifying the failure.
As I'm sure you can guess by now, the test didn't fail. Why not? Well, two reasons. One, you have to explicitly write a check for whether the method parameter validation failed in order for @Required to do anything. Nice. Second, those parameters were being declared as primitive types, but to do the validation we turned them into Objects, and as a result a missing parameter turned into a Long with value 0. Whoops. So up until now, we hadn't been causing any sort of error when parameters were missing, and we'd been populating our database with various 0 values unintentionally.
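For reference, a minimal sketch of the corrected shape of such an action in Play 1.X (the controller, endpoint, and parameter names are invented, not our real ones):

    import play.data.validation.Required;
    import play.mvc.Controller;

    public class AvailabilityController extends Controller {

        public static void availability(@Required Long userId, @Required String date) {
            // @Required only records a validation error; the action body still
            // runs unless we check for failures ourselves.
            if (validation.hasErrors()) {
                badRequest();
            }
            // Declaring userId as Long rather than long also matters: a primitive
            // parameter silently binds a missing value to 0 instead of showing up
            // as a validation failure.
            renderText("ok");
        }
    }

And of course, the test that POSTs to the endpoint without userId and asserts the failure is what catches it when someone forgets that check.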

We've fixed it all now. But doing the right thing up front would have cost very little time, and saved us probably 10s of developer hours of work, not to mention the quite likely business cost incurred by site instability during our busiest month. Don't skimp on your testing, even in crunch mode, even for the boring parts of the system. It's just not worth the cost.

Sunday, December 4, 2011

Interviewing for Judgement

One of the things that occurs to me after my first week of startup work is how essential it is to hire people with good judgement. In an environment where even the interns have huge amounts of responsibility and everyone is under a lot of time pressure, the judgement to not only know how to make good technical choices but also to know when to ask for help is essential. Is the right fix to this bug to hack around it, or to take a moment to write a bit more scaffolding and fix it everywhere for good? Which schema changes require another set of eyes, and which are trivial? As a manager you're also a contributor, and you probably don't have the time to micromanage decision making, nor would you want to. But when you discover in code review that a feature you thought would be simple required several schema changes and some sketchy queries, you regret not having insight sooner in the process.

Hiring for judgement is hard. It's easy to hire for technical chops. We know how to screen out people that can't write code, and if you follow the general best practices of technical hiring (make them write code, at least pseudocode, a few times), you're unlikely to hire candidates that can't produce anything. But the code-heavy interview process does little to ferret out developers with judgement problems, the ones that can't prioritize work and don't know when to ask for help. If you're hiring for a company the size of Google, it doesn't matter that much; you can always put a manager or tech lead in between that person and a deadline. Small companies don't have this luxury, and it is that much more important that the screening process captures more than just sheer technical ability.

I was hoping as I wrote this that I would come up with some ideas for hiring for judgement. But I have never had to hire for judgement until now. Judgement was something I had the luxury of teaching, or simply guessing at and hoping for the best. The internet suggests questions like
Tell me about a time when you encountered a situation where you had to decide whether it was appropriate for you to handle a problem yourself, or escalate it to your manager. What were the circumstances? How did you handle it? Would you handle the same situation the same way again, or would you do it differently this time?
This seems to me better than nothing, but not amazing.

So, over the next few months I'm going to experiment with questions like these and see how they play out. There's no way to do a rigorous study but at least I can start to get a feel for which questions can even tease out interesting answers. And if you have any questions you like, or thoughts on the matter, leave me a comment.