
Saturday, December 10, 2011

Do the Right Thing

Two weeks into startup life, and things are going great. It's a fascinating change from enterprise development, and so far a very enjoyable one. It has also already provided a good life lesson that illustrates why I am a bit of a zealot about certain kinds of structure in the software development process.

Almost from the moment I came into the company, I was pulled in to help on a project related to the way we compute availability for a given item on a date. The project had been started by my boss, and had grown into a major effort that touched many parts of the system.

The logic and data design of the new system were very thorough, and it seemed like a sensible rewrite. A lot of the existing logic lived in a spaghetti monster of PHP code nestled into our front end, while the new system was written as a service using the Play framework. The general development pattern was to have the front end keep using the existing logic to make its decisions, but also call the new back end service, and eventually swap everything over to the back end. So far, so good. But during development, two major mistakes were made.
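For readers who haven't run this kind of migration before, the pattern boils down to a shadow call: the legacy logic still makes the decision, while the new service is invoked alongside it so the two can be compared before the cutover. Here is a minimal sketch in Java (the real front end is PHP, and all of the names below are hypothetical, not from our actual codebase):

    import org.joda.time.LocalDate;

    // Hypothetical sketch of the shadow-call migration pattern; none of these
    // names come from the real codebase.
    public class AvailabilityChecker {

        interface LegacyAvailability { boolean isAvailable(long itemId, LocalDate date); }
        interface AvailabilityService { boolean isAvailable(long itemId, LocalDate date); }

        private final LegacyAvailability legacy;   // existing front-end logic
        private final AvailabilityService service; // new Play-backed service

        AvailabilityChecker(LegacyAvailability legacy, AvailabilityService service) {
            this.legacy = legacy;
            this.service = service;
        }

        boolean isAvailable(long itemId, LocalDate date) {
            boolean legacyAnswer = legacy.isAvailable(itemId, date);
            try {
                // The new service is exercised on every call, but its answer is
                // only compared and logged until we are ready to swap.
                boolean serviceAnswer = service.isAvailable(itemId, date);
                if (serviceAnswer != legacyAnswer) {
                    System.err.println("availability mismatch for item " + itemId + " on " + date);
                }
            } catch (RuntimeException e) {
                // The shadow call must never take the site down with it.
                System.err.println("availability service call failed: " + e.getMessage());
            }
            return legacyAnswer;
        }
    }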

First mistake: Not setting up an automated build from the moment the project was first put into source control
I was a little surprised to come in and see that while the system had tests, and developers were running them on their changes, there was no automated build set up to run them when people checked in. But, no big deal: I got our sysops guy to build me a machine and we installed Jenkins. Getting the Play build to run in Jenkins was trivial, just a matter of installing the Play plugin and configuring a build to check out the code, clean it, and run auto-test. So far, only a couple of hours of work.
And then the build ran, and the tests failed. I figured some failing tests had recently been added. But no, the tests ran fine on my local machine. And they ran fine on everyone else's local machines as well. The failures, after some debugging, seemed to boil down to dates. Instead of using Joda Time from the get-go, we had a bunch of logic built around java.util.Calendar. The new machine was running on UTC, and despite seeming to set the timezone to New York, we had failures all over the place. So, after too many hours trying to solve the problem with piecemeal moves to Joda Time, I took a day to completely overhaul all the date logic to use it.
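A quick sketch of the failure mode, for anyone who hasn't been bitten by it (illustrative code, not from our system): java.util.Calendar silently uses the JVM's default time zone, so the same instant can land on different calendar days on a New York laptop and a UTC build box, while Joda-Time makes the zone an explicit argument.

    import java.util.Calendar;
    import java.util.TimeZone;

    import org.joda.time.DateTimeZone;
    import org.joda.time.LocalDate;

    // Illustrative only: the same instant gives different answers to "what day
    // is it?" depending on the JVM's default time zone.
    public class DateZoneDemo {
        public static void main(String[] args) {
            // 2011-12-10 04:00 UTC, which is 2011-12-09 23:00 in New York.
            long instant = 1323489600000L;

            // Calendar picks up whatever TimeZone.getDefault() happens to be,
            // so this prints 9 on a New York machine and 10 on a UTC build box.
            Calendar cal = Calendar.getInstance();
            cal.setTimeInMillis(instant);
            System.out.println("Calendar day-of-month: " + cal.get(Calendar.DAY_OF_MONTH)
                    + " (default zone: " + TimeZone.getDefault().getID() + ")");

            // Joda-Time takes the zone explicitly, so the answer is the same
            // no matter where the JVM runs.
            LocalDate nyDate = new LocalDate(instant, DateTimeZone.forID("America/New_York"));
            System.out.println("New York date via Joda: " + nyDate);
        }
    }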
And still, the build was failing.
Now, I could have let it go at this point, but I had a nagging feeling. If our tests are failing on our UTC build box, what do you think our code is doing on our production UTC machines? Bad things, probably. So I kept digging away, and finally discovered that I needed to set the timezone as a -D parameter to the Play framework on startup. After a long day and a half of struggling, we had a working build, and a much better understanding of how to properly use dates in our system. But it wouldn't have cost a day and a half of developer time if the automated build had been there from day zero of the project.
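For the curious, the flag in question is most likely the standard JVM system property, something along the lines of -Duser.timezone=America/New_York passed through to the Play process at startup. A belt-and-suspenders variant, sketched below against Play 1.x's job API, pins the JVM and Joda defaults in code as well; this is a sketch of the general technique, not necessarily the exact fix we shipped.

    import java.util.TimeZone;

    import org.joda.time.DateTimeZone;

    import play.jobs.Job;
    import play.jobs.OnApplicationStart;

    // A sketch, not necessarily the fix described above: pin the JVM and Joda
    // default time zones when the Play application starts, so date logic stops
    // depending on the operating system's zone. The startup-flag equivalent is
    // the standard JVM property -Duser.timezone=America/New_York.
    @OnApplicationStart
    public class PinTimeZone extends Job {
        @Override
        public void doJob() {
            TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
            DateTimeZone.setDefault(DateTimeZone.forID("America/New_York"));
        }
    }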

Second mistake: Not thoroughly testing the framework interaction
The month of December is far and away the busiest for the site, so it should come as no surprise that we hit our new system with a lot of traffic during this time. At some point in the middle of last week, the system suddenly slowed to a crawl and the database load started spiking. Frantic hours were spent turning off logic before things finally calmed down. We assumed it was due to poorly optimized SQL, so we optimized the queries, but things were still slow. Finally, we discovered that a particularly heavyweight call was being made with user id 0, and, long story short, bypassing the logic for that user id made everything hum again.
But why were we getting any calls at all with that user id? Must just be a bug in the spaghetti code of the front end system, right? Turns out that was true, but not in the way we expected.
The Play framework has a relatively nice way of developing. You write controller classes that expose endpoints, and the parameters to those methods can be annotated with a nice @Required annotation. Now, we assumed that @Required meant any call missing that parameter would fail. But we never bothered to write a test for this fact. So, fast forward to Friday. I'm debugging a warning message that we seem to be getting far too often, when I realize that we have a bunch of calls coming in, and being executed, without some parameters. But those parameters weren't marked as @Required. So I told the front-end devs that they needed to pass those parameters, and went to mark them as @Required. And as per my zealotry, I started writing a test that would actually POST to the framework without the parameters, expecting it to be a quick matter of verifying the failure.
As I'm sure you can guess by now, the test didn't fail. Why not? Well, two reasons. One, for @Required to do anything at all, you have to explicitly check whether the method parameter validation failed. Nice. Two, those parameters were declared as primitive types, and in the course of validation they get boxed into Objects, which turned a missing parameter into a Long with value 0. Whoops. So up until now, we hadn't been raising any sort of error when parameters were missing, and we'd been populating our database with various 0 values unintentionally.
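For anyone else on Play 1.x, the shape of the fix looks roughly like the sketch below (the class, route, and parameter names are hypothetical): declare the parameter as a boxed Long so a missing value arrives as null rather than 0, and explicitly check validation.hasErrors(), without which @Required does nothing.

    import play.data.validation.Required;
    import play.mvc.Controller;

    // Hypothetical Play 1.x controller illustrating both gotchas: a boxed Long
    // keeps a missing parameter null (a primitive long would become 0), and the
    // validation.hasErrors() check is what actually gives @Required teeth.
    public class Availability extends Controller {

        public static void check(@Required Long userId, @Required String date) {
            if (validation.hasErrors()) {
                // Fail loudly instead of quietly proceeding with userId == 0.
                error(400, "missing required parameters");
            }
            // ... real availability logic would go here ...
            renderText("ok");
        }
    }

And the test I should have had from day one is just a functional test that POSTs without the parameters and asserts the request is rejected, along these lines:

    import org.junit.Test;

    import play.mvc.Http.Response;
    import play.test.FunctionalTest;

    // Sketch of the functional test; the route is hypothetical.
    public class AvailabilityTest extends FunctionalTest {
        @Test
        public void missingParametersAreRejected() {
            Response response = POST("/availability/check");
            assertStatus(400, response);
        }
    }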

We've fixed it all now. But doing the right thing up front would have cost very little time, and saved us probably tens of developer hours of work, not to mention the likely business cost of site instability during our busiest month. Don't skimp on your testing, even in crunch mode, even for the boring parts of the system. It's just not worth the cost.

1 comment:

  1. Have to totally agree. A continuous build is such an obvious little bit of work that saves everyone's life big time. It should be taught in orientation classes IMO.

