Elided Branches: February 2012

Wednesday, February 22, 2012

Networking woes in Java

The only major CS subject I never took a class in was networking. It's kind of ridiculous, looking back, that I took as many systems classes as I did but always eschewed networking. I do own a copy of UNIX Network Programming: Networking APIs: Sockets and XTI; Volume 1, bought at some point in the past when I knew I was going to be doing some distributed systems work and figured it would be a useful reference. But I can't say it's been my constant companion. For I have learned one thing in my years of Java systems coding:

Networking code is HARD.

Here's exhibit A: ZooKeeper monitoring misuses sockets. I spent a good chunk of time desperately trying to figure out why my monitoring commands were crapping out halfway through when run from NY to LN. Turns out, you can't safely expect to just close half a socket, leave the other half open, push some data to it and then close it while seeing all the data through to the other side. Not without a final handshake indicating the client has gotten all the data. Or at least, I think that's the case. The thing is, this will work well enough over a very fast network connection or with very little data. The guarantees around SO_LINGER etc. change kernel to kernel, and my reading at the time led me to think that the standard Linux kernel behavior in this case may well have changed over the years that ZooKeeper has been around. So we need to completely rip out and redo the monitoring code if we want to have any hope of this working right for other big, global deployments in the future.
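To make the "final handshake" idea concrete, here is a minimal sketch of one way to do it with plain java.net sockets. This is not ZooKeeper's code; the class name HalfCloseDemo, the roundTrip method and the 5-second timeout are all made up for illustration. The sender writes its data, half-closes with shutdownOutput(), and then waits to see the peer's close before fully closing its own socket, instead of closing immediately and hoping the kernel delivers the rest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class HalfCloseDemo {

    /** Runs a loopback sender that pushes `size` bytes; returns how many bytes the client read. */
    public static int roundTrip(final int size) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread sender = new Thread(() -> {
                try (Socket s = server.accept()) {
                    // Assumption: 5s is "long enough"; a stuck peer makes read() throw instead of hanging.
                    s.setSoTimeout(5000);
                    OutputStream out = s.getOutputStream();
                    out.write(new byte[size]);
                    out.flush();
                    // Half-close: signal end-of-data but keep the read side open.
                    s.shutdownOutput();
                    // The handshake: wait until the peer closes its side (read returns -1)
                    // before the try-with-resources block closes our socket.
                    InputStream in = s.getInputStream();
                    while (in.read() != -1) {
                        // discard anything the peer still sends
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            sender.start();

            int received = 0;
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                InputStream in = client.getInputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    received += n;
                }
            } // closing the client socket here is the "I got everything" signal
            sender.join();
            return received;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("client received " + roundTrip(1_000_000) + " bytes");
    }
}
```

On loopback this will behave fine either way, which is exactly the trap described above: the difference only shows up over a slow or long-haul link.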

Exhibit B is my current debugging nightmare. Part of our release last week involved a new backend Play service that itself connects to a different backend Play service to prepare results for our storefront. We noticed, several hours after launch, that the service started to throw exceptions that were ultimately caused by java.io.IOException: Too many open files. I know enough about Java to know that running out of file descriptors is often a Bad Thing.

So we're leaking sockets. Why? To date, we don't know. The underlying libraries are async-http-client and netty, but there's very little to indicate what is going on.¹ The sockets show up in netstat/lsof as ESTABLISHED TCP connections to the various storefront servers. But the storefront servers do not have most of these sockets open on their end. How are they ESTABLISHED with no partner? It's an ongoing mystery, one that we haven't been able to reproduce on any other machine (the current theory is bad network hardware/software at the lowest levels, but honestly that's just a shot in the dark and one that we can't verify without taking down a production service).

So, while I keep debugging, what are the takeaways here?

1) You shouldn't write your own socket handling code in Java. Really, no. Don't do it. Use Netty. It's very good. Of all the things not to reinvent yourself, I would put networking at the top of the list with a bullet. It's hard, and requires the kind of deep expertise that you can't fake. And, when you fake it, you may end up with something like our ZooKeeper monitoring, that seems to work for years while hiding small but significant bugs.

2) If you're a system architect writing any kind of web services/distributed system architecture, you should know your Unix socket monitoring commands. lsof is obtuse but powerful. netstat is simpler and still quite useful. This article has a few others, like ss and iftop. Know how to raise the ulimits for your processes in case you find yourself with a slow socket leak that you need time to debug.
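Alongside lsof and netstat, it can help to have the JVM report its own descriptor usage so a slow leak shows up in logs before the ulimit is hit. The following is a rough sketch, not what we run in production: the class name FdWatch is made up, and it relies on com.sun.management.UnixOperatingSystemMXBean, which is specific to HotSpot/OpenJDK-style JVMs on Unix-like systems, so it falls back to -1 anywhere that bean is not available.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdWatch {

    /** Returns {open, max} file descriptor counts, or {-1, -1} if this JVM does not expose them. */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // JVM-specific extension; only present on Unix-like platforms.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return new long[] { -1L, -1L };
    }

    public static void main(String[] args) {
        long[] counts = descriptorCounts();
        System.out.println("open fds: " + counts[0] + ", limit: " + counts[1]);
    }
}
```

Logging these two numbers every minute or so is enough to tell a steady leak from a burst, and to estimate how long you have before "Too many open files" appears.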

Have an idea what my bug is? I'd love to hear it! Leave me a comment or hit me up on twitter!

Edit 2/27: Looks like our bug was indeed on the cloud vendor side; possibly a misconfigured firewall. Moving to a new box and rebuilding the box we were on solved it.

¹ Thank God Play is at least using good networking libraries, because the last time I tested ZooKeeper, when it ran out of sockets the service hard failed with almost no indication of what happened.

Thursday, February 16, 2012

The value of physical objects

I led my first successful release at Rent the Runway today. It was a major revamp of our homepage for a segment of our users. My boss was on vacation and the release happened to fall on his travel day, so it was up to me to help guide the team to push it out.

A bit of background. The storefront of our site is a complex, hairy piece of code that currently controls far too much business logic and is very difficult to test and change. We're working on moving logic out of the storefront into a more service-oriented backend framework model, but it's slow going, and we can't just stop all work while we rewrite everything. So we're trying to make these incremental changes that involve many people, integrating in a complex and difficult-to-test front end. And all the Jira tickets and standups in the world didn't seem to be helping us get a handle on what was happening, who was doing what, and where we were getting stuck and missing our deadlines.

Enter the kanban board. We rolled this out Monday with the help of our product management team. There were fears that people wouldn't like it, that it would be an added piece of process that would bog things down further, that the developers wouldn't bother with it. Today proved all those fears wrong. Within the space of 10 hours, we used the board as a quick glance to see who had which bugs assigned, to see how progress was being made, to quickly reassign work, and to triage the remaining tickets to identify the launch-blockers and promote them.

And when we got it all pushed out, the developers took pictures of the board as evidence of all they managed to accomplish in a day.

We live in bits and bytes and screens and dashboards and webpages and text files. But sometimes there's no replacement for a physical thing that you can glance at, touch, move around, circle, and stand in front of with a colleague. Never underestimate that.

Thursday, February 9, 2012

Quick Wins: Monitoring Request Times in Play with Coda Metrics

My Twitter feed has been abuzz about Coda Hale's Metrics library for a while now. I decided to finally bite the bullet and try it out, and the result was a very nice quick win for our code base.

We're still using Play at work, and we have a service about to go into production that we've been monitoring through the oh-so-elegant method of "writing log messages". This is fine, but it doesn't tell you how long various request types are taking on average without doing a bit of log parsing, and I'm not much of a scripter.

Today, I promised that I would provide something slightly better to measure how our various endpoints are doing. Cue coda. I've been looking at it on and off for a couple of days, but kept getting hung up on wanting to do things like use the EhCache metrics gathering (not trivial in Play at first glance). Going back to basics, I decided after some thinking that the histograms would be the best thing to use. We were already grabbing method execution time for logging purposes, so all I had to do was insert that into a histogram and it would track the running times. Simple enough. But I want to create these histograms for each method type, and ideally, I just want to put it into our superclass controller that is already set up to capture the method timings and log.

Fortunately, Play has lots of nice information floating around in the "request" object of its controllers. Using that object, I can see what Controller subclass this request is destined for, as well as the method that will be called on that class. So I have enough information to create the histogram for each method, like so:

Histogram histo = Metrics.newHistogram(request.controllerClass, request.actionMethod, "requests");

Great. But I was a little tired, and thought that I needed to keep these around, so I stuck them in a ConcurrentHashMap associated with a unique key based on the controller class and the action method. Turns out though, if you look in the MetricsRegistry source code, you'll see that in fact you don't really need to do this at all. As long as the "MetricName" that would be generated for your metric is the same, the same metric will be used for the monitoring. Now THAT is the kind of clever code I like to see.
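The "same name, same metric" behavior is easy to picture as a get-or-create lookup keyed by name. Here is a small self-contained sketch of that pattern using only the JDK. It is not the MetricsRegistry source: TimingRegistry, Timing and timingFor are hypothetical names, and Timing is a stand-in that only tracks a count and a mean rather than a real histogram.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class TimingRegistry {

    /** Hypothetical stand-in for a histogram: tracks only count and total time. */
    public static final class Timing {
        private final AtomicLong count = new AtomicLong();
        private final AtomicLong totalMillis = new AtomicLong();

        public void update(long millis) {
            count.incrementAndGet();
            totalMillis.addAndGet(millis);
        }

        public long count() {
            return count.get();
        }

        /** Mean of recorded values; the two reads are not atomic together, which is fine for a sketch. */
        public double mean() {
            long c = count.get();
            return c == 0 ? 0.0 : (double) totalMillis.get() / c;
        }
    }

    private final ConcurrentMap<String, Timing> metrics = new ConcurrentHashMap<>();

    /** Same controller class and action always yield the same Timing instance. */
    public Timing timingFor(Class<?> controller, String action) {
        String name = controller.getName() + "." + action + ".requests";
        // computeIfAbsent creates the metric once and returns the existing one afterwards.
        return metrics.computeIfAbsent(name, key -> new Timing());
    }
}
```

With something like this in a base controller, the per-request code reduces to looking up the timing for the current controller class and action and calling update with the elapsed time you were already measuring for the log line.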

I decided to keep my ConcurrentHashMap around anyway, to save myself the (utterly trivial) overhead of creating the various objects passed in to the registry by newHistogram. The resulting code is embarrassingly simple. So simple, in fact, I wanted to make it more complicated and it took me 3 revisions to realize how little code I actually needed.

Here is the resulting BaseController, on GitHub, in a skeleton Play application.

I'm a bit sleep-deprived, so if I missed something, be sure to leave a comment or hit me up on twitter!

Wednesday, February 1, 2012

Developer Joy, Part 2

I got some good comments on my last post, including a few submissions. One idea that I intentionally left out was that of working on a project you're passionate about. In fact, I think the whole point of developer joy is that it has to transcend love for a project. It is the stuff that keeps us going despite working on code whose purpose we find boring. It is the stuff that makes writing dull-but-important software worthwhile. It's the stuff that, in short, every good tech lead should keep an eye on.

Sadly, most developers are not working on problems that they find intrinsically fascinating. I told myself in coming into computing as a career that one reason I was doing it was that it had application in so many areas. And getting to apply computing to the fashion industry is a problem I find rather fascinating (and never in a million years expected to be solving). But I spent many years in industries that bored me to death to get to where I am today, and I still enjoyed myself immensely doing it so long as I had enough elements of developer joy.

As an illustration of this joy, one of the more enjoyable projects I ever worked on was the distribution of the test runner for a code base I worked on. The code base had thousands of tests, most of which were slow integration tests, and we expected this test suite to run and pass both on every check-in and before developers checked in code. We'd gotten to a point where tests were taking 4 hours and most of a developer's local CPU cycles. We weren't willing to just get rid of tests, and we weren't able to refactor the external dependencies completely to make the individual tests themselves faster (ah, Sybase stored procedures!), so we set about figuring out how to run this build process across a farm of machines. I like distributed computing, but this problem was more about dealing with the distribution system we used, debugging strange transient errors when we pulled formerly-sequential tests apart, and setting up a bunch of infrastructure to get things working.

Why did this project end up being so fulfilling? Well, for one, the team working on the project was incredibly smart. The main developer taught me a ton about the underlying distributed system and in turn we had some great team sessions of brainstorming. The whole project was around testing, and of course we wrote tests for the project components themselves, so it was a well-developed system. We had plenty of time to actually work on the problem. And the solution we produced was pretty awesome. In short, it was a fun problem to solve not because I love build systems and tools (a common misconception about me, actually), but because I love solving hard problems, solving them well, and solving them with smart people.

Some developers are always going to need projects that they are passionate about. Those developers tend to form startups, go into academia, or find a spot in a big company with the resources to splash on their particular niche. But most of us work on problems that we solve to pay the bills. That doesn't mean there isn't joy in a day's work. When that work involves learning new things, perfecting your own expertise, being able to solve problems and see those solutions delivered, it gives us joy. Tech leads, CTOs, and founders sometimes forget that the people working for them may not have the same love for the project that they do. Keeping up the elements of developer joy helps bridge that inspiration gap.
