Wednesday, March 28, 2012

Yes, and...

I have a terrible habit. When confronted with a new idea, any idea, my knee-jerk instinct is to find its flaws. Or, to put it more succinctly, I'm a "no, wait!" kind of person. You want to add a new feature to my library? No, wait! That won't maintain API backwards-compatibility! No, wait! No one but you really needs that; why should we put it in a library used by everyone? No, wait! The work it will take us to review it outweighs its worth as a new feature.

There is a well-trod school of thought that says that to maintain creative collaboration, any time you want to say "no, wait!", "no, but", or even just "but", you should instead start the conversation with "Yes, and". Want to add a new feature to my library? Yes, and we will have to think about how to add it without breaking API backwards-compatibility. Yes, and we should see what other community members think about the usefulness of the feature. Yes, and we'll have to evaluate the work it will take to correctly implement and review.

"Yes, and" is a great baby step on the path to thought leadership and productive collaboration. But let's face it, it's easy to get a bit snarky with our yeses. Yes, and six months later, we'll have another release. Yes, and you can be on hand to fix the bugs that come in. Yes, and why don't we rewrite the whole thing in Scala while we're at it? (Watch Peggy struggle with this on Mad Men[16:50], when Megan shows her work)

So how do you really take "yes, and" to heart? My boss, who is a natural at this sort of thing, taught me one of his favorite tactics. It's not just saying yes. When a new idea is presented, find at least one positive thing to say about it, and praise that quality first.
 "That is an interesting idea. People using our service to do complex state maintenance could really use this feature, it would save them having to write a lot of code themselves."
The goal of this exercise is not only to keep the flow of positive collaboration going. It is also to force yourself to acknowledge the problem that the idea is trying to solve. You may not think the problem is worth solving, but this is something that your collaborator wants to explore, perhaps because it is causing them pain, perhaps because they have a different set of priorities than you, perhaps because they're simply trying to contribute to the discussion.

I will admit that despite knowing this is the right thing to do, I am absolutely terrible at following this advice in person. My knee-jerk nerd comes out, and she is very into scoring points by pointing out flaws. In the tech community, I'm hardly unique. Read the comments on almost any Hacker News thread if you don't believe me. And yet, the easiest place to start with this is online. You have time to think before you hit send on that email, submit on that comment, post on that tweet. Why not take the time to acknowledge the idea fully before pointing out its downsides? Just because you didn't think of it doesn't mean you lose by making it better.

Thursday, March 22, 2012

Java console monitoring basics: The "j" series

Think fast: It's 10pm, you have a production Java application on a box you can only ssh into, and it's in distress. This is the third time this month it's happened. You didn't write the code, and the joker who did didn't bother to put in any metrics for you to grab. What do you do? After cursing, but before giving up, restarting the process, and promising to debug it in the morning, you might want to go a round with the Java command-line monitoring tools.

Perhaps you are already familiar with these tools, but I've found that despite their incredible usefulness, many seasoned Java developers have never heard of this tool stack. They hide away in the JDK bin directory, but they can be your best friend when you are stuck with nothing but a console and a prayer.

First we have the lowly jps. It does what you would expect: shows you the Java processes running as your user (or all users, if you're root).
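A minimal session might look like this (the PIDs and the service class name here are invented for illustration; -l prints the full main class name, and -v will also show the JVM arguments):

    $ jps -l
    4512 com.example.analytics.ReportService
    9013 sun.tools.jps.Jps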

Moving up the stack, we have jinfo. Show me the VM version, the jars, and all the flags this process is running with, or the value of a particular flag. You may have this information elsewhere, but it's nice to have a shortcut.
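For instance, to check a single flag on the hypothetical PID from above (the flag and value shown are just an example):

    $ jinfo -flag MaxHeapSize 4512
    -XX:MaxHeapSize=2147483648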

More useful is jstack. Yeah, I'm sure you know how to kill -QUIT, but this is a nice way to teach the newbie how to get a stack trace without the risk that they'll accidentally kill the process. If you should be so lucky as to have an obvious deadlock, jstack will kindly point that out to you so you can go fix it. Stack traces are the bread and butter of figuring out what's wrong with a process. Take several, see how they change, or don't, and you'll come closer to finding your problem code.
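One simple way to do that comparison (again with our hypothetical PID) is to capture a couple of dumps a little while apart and diff them; the threads showing the same stack in both dumps are the ones worth staring at:

    $ jstack 4512 > threads-1.txt
    $ sleep 30
    $ jstack 4512 > threads-2.txt
    $ diff threads-1.txt threads-2.txt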

My personal all-time favorite is jstat. More specifically, jstat -gc <pid> 3s. That is, jstat of the garbage collector, printing new results every 3 seconds. Back in the days when I wrangled gigantic VMs, this tool was invaluable for spot-checking garbage collection. The output is admittedly hideous. For example:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT
37568.0 41984.0 6592.0  0.0   74816.0  38651.6   70272.0    39290.5   63616.0 55956.0     38    1.608   4      1.441    3.049

Yes, that unformatted barf is what you will see in your console when you run jstat -gc. But that barf tells you a lot. First things first, the most useful stuff is at the very end. See that "4" followed by "1.441"? Those are the stop-the-world full GC collections (FGC) and the total time they have taken (FGCT). If your application is running particularly slowly, or is frequently unresponsive, the FGC count will quite possibly be high, and the time related to it will be very high. Remember, your app is essentially dead while a full GC is running, so a high number of FGCs is a bad sign.

This also shows you the various used and total sizes of the generations. I'm not going to go into the details of Java garbage collection, but it is useful to be able to see the generations broken out, growing or shrinking. One pattern to watch out for is the case where you don't have enough survivor/eden space to handle all of the transient data a big chunk of work needs, not enough old capacity to take it all, but just enough freeable state to keep slowly moving forward with your computation. The result is never-ending full GCs: your process moves slightly forward, does a full GC, moves forward again, does a full GC, and moves forward again. In jstat this looks like an ever-increasing series of FGCs, where each one finishes but frees only a small amount of space in OU, while EU and the survivor spaces are constantly near-full. The details of the actual GC behavior may change with versions of the JVM (and the flags you are using), but the ability to monitor the behavior easily in real time is always useful.
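As a purely hypothetical illustration (these numbers are invented, not real output from any of our boxes), the thrashing pattern might look something like this: FGC ticking up on every sample, OU pinned just under OC, and EU never draining:

     S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT
    37568.0 41984.0 37568.0  0.0   74816.0  74810.2   70272.0    69984.3   63616.0 55956.0    212    8.902  118   301.417  310.319
    37568.0 41984.0 37568.0  0.0   74816.0  74812.8   70272.0    69855.1   63616.0 55956.0    213    8.934  119   304.217  313.151
    37568.0 41984.0 37568.0  0.0   74816.0  74809.5   70272.0    69917.6   63616.0 55956.0    214    8.967  120   307.020  315.987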

Finally, for the tools I usually use, there's jmap. jmap -heap is a nice way to get a pretty-print of the heap info. jmap -histo and jmap -dump are heavier commands that you might want to hold off on running until you're ready to restart your process anyway, because sometimes running them can hurt the process. If you're producing a ton of garbage and you don't know why, jmap can show you where the memory is going. jmap -dump will produce a file that you can push into something like MAT for analysis.
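The invocations I reach for look like this (the PID and dump file name are, as before, just placeholders):

    $ jmap -heap 4512
    $ jmap -histo 4512 | head -20
    $ jmap -dump:format=b,file=heap.bin 4512

The -histo output lists the biggest consumers first, which is why piping it through head is usually all you need.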

None of these tools is an answer for proper monitoring of your JVM, but they're great for a quick and dirty debug before you restart a troubled process, and they're something to make sure your whole team has in their toolbox.

Thursday, March 15, 2012

One Feature in One Month or Ten in Six?

The best story I can tell to illustrate the difference between project planning and technical design at a big company vs. a small startup happened today. I was talking to our head of analytics about a new project we are speccing out. This project will inevitably be the beachhead of a huge set of new features and systems that we'll want to write over the coming months. He sees this as an opportunity to start building something that will be able to accomplish all of those features in one multi-faceted system. And he's right: this is definitely a feature that we could easily implement in such a system.

However, it's also a feature that we can implement fairly cheaply using building blocks we already have. The cost difference in engineering this feature is pretty stark; my estimate is that implementing it with what we have takes roughly a month, most of which would be spent getting the presentation layer to look good. Implementing it as part of the larger system would probably take closer to six. If we built the larger system, we would get a lot more than just this feature. We would probably get ten features out of it, and the ability to really start powering several different things.

I'm still going with the one month solution.

On the surface, this seems like a pretty foolish trade. I know that I'm probably going to build the bigger system sooner or later, I know that I would probably get more value for my time building it now, and not building it now means writing some code that will eventually be thrown away. If I were doing this at a big company, I would always choose to do the six-month project. In fact, I would probably be busy thinking of all the other things we could do with another six months, and building up a plan and design that would emphasize the power and complexity of my solution. I'd be aiming for a system that people could use and grow for years to come.

But I don't work at a big company any more. There's no certainty that this larger feature set will actually move the needle on our business. We need to be able to see it in production and learn from the results before we commit resources to engineering a long-term solution. Priorities can change a lot in the span of a month, especially in a data-driven customer-facing business such as ours.

And ironically, I buy myself engineering time by going for the quick one-month win. That bit about the majority of the time going to front-end work is the trick. I can slap out RESTful endpoints to serve up and consume this data fairly quickly. While my front-end folks are busy making things pretty, we have the chance to catch our breath and evaluate, start laying groundwork, and eventually swap out the services without the front-end team ever having to know that things have changed. So while it seems like I'm doing one feature in one month, I'm really doing ten features in six and a half months. They may not be exactly the same ten features I would have built if I started tomorrow, but that's half the point: by getting one feature out fast, I give my business something concrete to evaluate, and I make it more likely that the ten features I eventually implement are the right ten for the job.
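To make the swap-out idea concrete, here is a minimal, hypothetical sketch. None of this is our actual code: the class and endpoint names are invented, and it uses the JDK's built-in HttpServer just to stay self-contained. The front end codes against the URL; the interface behind it is the seam where the quick version gets replaced by the bigger system later:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;

    public class QuickReportService {

        // The seam: the front end never knows which implementation is live.
        interface ReportSource {
            String reportJson();
        }

        // Month-one version, built from pieces we already have.
        static class SimpleReportSource implements ReportSource {
            public String reportJson() {
                return "{\"status\":\"ok\",\"items\":[]}"; // placeholder payload
            }
        }

        public static void main(String[] args) throws IOException {
            final ReportSource source = new SimpleReportSource(); // later: the big system
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/api/report", new HttpHandler() {
                public void handle(HttpExchange exchange) throws IOException {
                    byte[] body = source.reportJson().getBytes("UTF-8");
                    exchange.getResponseHeaders().set("Content-Type", "application/json");
                    exchange.sendResponseHeaders(200, body.length);
                    OutputStream os = exchange.getResponseBody();
                    os.write(body);
                    os.close();
                }
            });
            server.start(); // front end hits /api/report and never needs to change
        }
    }

As long as the URL and the payload shape stay stable, swapping SimpleReportSource for whatever the six-month system produces is invisible to the front-end team.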

Thursday, March 8, 2012

Why I'm Moving Away from the Play Framework

I've been using the Play framework since I started at RtR three months ago. Last week, I made the decision that no new services would be written in Play from that point forward. It started out as a great little framework that was pretty quick to learn and easy to use, but it's turned into something that I would not recommend anyone use for serious production applications. What happened?

First, I lost faith in the developers.

One of the first things that annoyed me about Play was the inability to run a single test from a Play test class inside your IDE (running the entire class worked fine). I suppose the thinking was that you will always run the Play test app, or something, but I prefer to leave my IDE as little as possible when I'm working. So, being the good little open source programmer that I am, instead of bitching I rolled up my sleeves and fixed the bug. It was a pretty trivial fix. I even wrote a test case. Then I put in a pull request and waited.
After submitting the pull request, I commented on the pull request, commented on the ticket, and finally sent an email to the mailing list. The response I got was basically that the team is too busy working on the next generation of the product to absorb fixes for the older generation. Having worked on open source projects myself, I understand what it's like to have limited bandwidth to look at changes. But if the project team's bandwidth is so limited that they can't even afford to look at small fixes like this, the project is basically abandoned. At that point I lost faith that I could rely on the community to support the 1.X branch of this product. Not necessarily a dealbreaker, but definitely a bad sign.

Then, I lost faith in the platform.

We started to hit some serious bugs in the platform during a big push on a complex service. First, our developers on Mac and Windows hit a bug similar to this, where they simply couldn't get the app to work no matter what they did. It worked fine on Linux, but even a clean checkout would fail to run for them. It was inexplicable, irritating, and we lost a couple of days of development work trying to get around it (rolling back checkins, pulling out modules, poring through stack traces). By this point, I had lost faith in the community, so I didn't see the point in going to them for help. Fortunately, we did finally get around it (it seemed to be a bug in the CRUD module), but we were all really frustrated and annoyed with the framework after that experience.

Finally, I lost faith in my own ability to debug the framework.

The issues above were enough for me to want to move off of Play for new projects. The thing that caused me to move off of Play for projects already in development (but not in production) was this: at some point, we had written a migration job in Play for a major data migration, and we discovered that the strangest thing would happen. The job would run across several job threads, and at some point, one of the threads would hang. But it would not hang in a way that I have ever seen a JVM thread hang. The thread was in RUNNABLE state, in a HashMap method (either get or put), and it was just sitting there. Not doing anything. No locks, no database or other IO, plenty of memory, plenty of resources, just sitting in that HashMap.get method, hanging out.
Now, maybe you've seen that before (and if you have, please leave a comment!). But I have seen a lot of JVM issues in my day, and this is a new one. There was no reason for this thread to be hung. And yet it was. I can debug just about anything you can throw at me in Java, but I had absolutely nowhere to start looking to debug this issue, except a vague suspicion that it was related to the way the framework was rewriting the classes under the covers. That is a dealbreaker, ladies and gents. I could probably have debugged why the module was causing the app to crap out for my developers, given enough time. But I cannot say with any certainty that I could debug whatever the hell was causing that thread to hang.
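For what it's worth, there is one classic, non-framework way to end up with a RUNNABLE thread pinned inside HashMap.get: hammering an unsynchronized HashMap from multiple threads can corrupt a bucket's linked list during a resize, leaving a cycle that get() then walks forever. Whether that's what bit us, I honestly can't say; this minimal sketch (all names invented, and the hang is non-deterministic) just shows the racy pattern:

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapSpin {
        // An unsynchronized map shared across threads: the recipe for trouble.
        static final Map<Integer, Integer> map = new HashMap<Integer, Integer>();

        public static void main(String[] args) {
            for (int t = 0; t < 4; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; ; i++) { // intentionally endless; runs until something breaks
                            map.put(i, i); // concurrent puts race on the internal resize
                            map.get(i);    // may eventually spin here forever
                        }
                    }
                }).start();
            }
            // If a thread does get stuck, jstack shows it RUNNABLE in
            // java.util.HashMap.get with no locks held -- just like the symptom above.
        }
    }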

If I felt the developers supporting Play were committed to building a real community of support around the 1.X version, I might have stayed with it longer. It's a giant pain to find something else that is easy, lightweight, supports JPA, and doesn't force me to write XML. But I can't use a product that I know has issues even I can't debug, maintained by a team I don't trust to keep the product at the standard my team needs to use it confidently in production.

Thursday, March 1, 2012

Three Reasons You Should Be Training Your Successor

It's Friday afternoon of a long week. Despite wanting to pack up your bag and go home, you know you need to review the new hire's code. It's their first week, so inevitably it will be some scary stuff for you to untangle. When you get in there, it is scary. But not in the way you expected. Instead of causing more bugs than they fixed, they've actually managed to fix every bug you gave them, and in the process redesign a hairy part of the system into something rather... elegant. You want to find fault somewhere, but you can't. They're really good, so good that you're feeling threatened. I want to convince you that this threat is an opportunity. When you find that up-and-comer on your team, instead of trying to prove that you are better than them, train them to replace you. Why? Here are three major reasons:

1) Career Growth 
It's hard to grow yourself and your career within a company if you hoard the knowledge or ownership of a particular project. The people who succeed best at growing into bigger roles are those who can point to a protégé waiting in the wings to fill the position they are leaving behind. This applies to us tech types who may not want to grow into managers but want the freedom to move on to new projects. It's easy to get stuck on the same codebase when you're the only expert in the area. Training other smart developers on the team lets you move on to new and exciting things as the opportunity arises, and with no guilt about leaving people in the lurch.
Even if you prefer to switch companies every few years, it is always great to be able to show a track record of several successful projects per company, and it gives you the opportunity to get a good breadth of experience.

2) Guru Points
Teaching and mentoring others makes us understand what we know even better. I always find that any time I have to give a teaching talk on a topic, I learn that topic a little bit more. Explaining things to others formalizes what you know, and in watching them learn they will almost always teach you something as well. Do you want to grow into the guru of your team or company? Mentor others and teach them all you know, and before long you may find yourself the most trusted advisor on the team.

3) A Stronger Network
Helping others grows your network and provides you with a pool of talent that you can draw on throughout your career. In a big company, knowing people on other teams who trust and respect you makes cross-company collaboration much more effective. If you are the sort to move between small companies every few years, it's likely you will need to hire talent to work with you and make those companies successful, and who better than your former mentees? And if one of those you mentor grows to become a great success? They are not likely to forget the people who helped get them where they are, and they can be a very valuable part of your professional network.

This post is dedicated to my mentor Mike, who taught me this rule and practically every other useful thing I know. Please leave a comment here or on Twitter!