
Tuesday, December 31, 2013

2013: The Constant Introspection of Management

I've been a serious manager for a little over a year now, and it has been my biggest challenge of 2013. I had managed small teams before, and I thought that would prepare me to lead multiple teams and manage managers myself; looking back, nothing could be further from the truth.

The hardest part for me of going from individual contributor/architect/tech lead to managing in anger has been the lack of certainty. I believe that, for experienced managers, management can have a level of rigor and certainty, but I'm not there yet. I am striving to be a compassionate manager, and that requires developing a level of emotional intelligence that I have never needed before. As a result, the past year has probably been one of the most emotionally draining of my life (and that is not even counting the baby I had in the middle of it).

I do not want to be a dispassionate leader who views people like pawns on a chessboard. But the emotional resilience that is required for management makes me understand how folks with that tendency may find it easier. A successful manager needs to care about her people without taking the things they do personally. Every person who quits feels like an indictment of all the ways you failed them. If only I had given them better projects, fought harder for their salary, coached them better, done more to make them successful! If you wonder why it sometimes seems that management roles get taken up by sociopaths, just think that for every interaction you have with a difficult coworker, your manager has probably had to deal with ten of them. It's not a surprise that a certain bloody-mindedness develops, or, more likely, survives.

In addition to the emotional angst of managing all those people, there's a general feeling of utter incompetence. As an engineer, I know how to design successful systems. I can look back on a career of successful projects, and I know many of the best practices for building systems and writing code. Right now, if I were to design a system that ultimately failed for a technical reason, I would be able to pinpoint where the mistakes were made. I am a beginner all over again when it comes to the big-league management game, and it's discouraging. I miss doing what I'm good at, building systems, and I'm afraid that I've given it up for something I may never do particularly well.

One of my friends who has faced the same struggle put it best. 
I like the autonomy/mastery/purpose model of drive. This feels like an issue with mastery. Not building means moving away from something where you have mastery to something new. There’s fear of losing mastery.[1]
As a new manager I believe you lose both autonomy and mastery for the time being, and arguably autonomy is lost forever. You are always only as good as your team, and while some decisions may ultimately rest on your shoulders, when you choose to take the "servant leadership" path you do sacrifice a great deal of autonomy. But I think for many engineers the loss of mastery hits hardest. When you've spent ten-plus years getting really, really good at designing and developing systems, and you leave that to think about people all day? It's hard, and no, there isn't always time for side projects to fill the gap. In an industry that doesn't always respect the skills of management, this is a tough pill to swallow. After all, I can become the greatest manager in the world, but if I wanted to work in that role at Google they would still give me highly technical interviews.

So why do it? In the end, it has to be about a sense of purpose. I want to have a bigger impact than I will ever be able to have as an architect or developer. I know that leading teams and setting business direction is the way to ultimately scratch the itch I have for big impact, for really making a lasting difference. And I know that a great manager can have a positive impact on many, many people. So here's to growing some management mastery, and making 2014 a year of purpose and impact.

Wednesday, July 31, 2013

Replatforming? The Proof is in the Hackday

It's pretty common for teams, especially startups, to get to a point with their tech stack where the platform they've been working on just isn't cutting it anymore. Maybe it won't scale, maybe it won't scale to the increased development staff, maybe it was all written so poorly you want to burn it to the ground and start fresh. And so the team takes on the heroic effort we know and love: replatforming.

When you're replatforming because the current system can't handle the necessary load, it's pretty easy to see if your effort was successful. Load test, or simply let your current traffic hit it and watch it hum along where you once were serving 10% failures. But if the replatforming is done to help development scaling, how do you know the effort was a success?

I accidentally discovered one answer to this question today. You see, Rent the Runway has been replatforming our systems for almost two years now. We've moved from a massive and massively complex Drupal system to a set of Java services fronted by a thin Ruby client. Part of the reason for this was load, although we arguably could've made Drupal handle the load better by modernizing certain aspects of our usage. But a major reason for the replatforming was that we simply weren't able to develop code quickly and cleanly, and the more developers we added the worse this got. We wanted to burn the whole thing to the ground and start fresh.

We didn't do this, of course. We were running a successful business on that old hideous platform. So we started, piece by piece, to hollow out the old Drupal and make it call Java services. Then, with the launch of Our Runway, we began to create our new client layer in Ruby using Sinatra. Soon we moved major pages of our site to be served by Ruby with Java on the backend. Finally, in early July, we moved our entire checkout logic into Java, at which point Drupal was left serving only a handful of pages and very little business logic.
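To make the shape of that incremental migration concrete, here is a minimal sketch of the kind of extracted endpoint I'm describing: a small Java resource that the remaining Drupal pages (or the Ruby layer) call over HTTP instead of reaching into the monolith's database. The names and types are purely illustrative, not our actual code, and I'm assuming a JAX-RS-style service framework.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical extracted endpoint; names are illustrative, not Rent the Runway's code.
@Path("/reservations")
@Produces(MediaType.APPLICATION_JSON)
public class ReservationResource {

    // Minimal placeholder types so the sketch stands on its own.
    public static class Reservation {
        public long id;
        public String sku;
        public String status;
    }

    public interface ReservationStore {
        Reservation find(long id);
    }

    private final ReservationStore store;

    public ReservationResource(ReservationStore store) {
        this.store = store;
    }

    @GET
    @Path("/{id}")
    public Reservation get(@PathParam("id") long id) {
        // Business logic lives behind the service API; the legacy front end
        // just renders whatever comes back over HTTP.
        return store.find(id);
    }
}
```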

So yesterday we had a hackday, our first since the replatforming. We do periodic hackdays, although we rarely push the entire team to participate, and yesterday's was one of the rarer full-team hackdays. Twenty-some tech team members and four analytics engineers participated, with demos this morning. I was blown away by what was accomplished. One project, by a team of four, enabled people to create virtual events based on hashtags and rent items to those events, pulling in data from other social media sources and giving attendees incentives to rent, such as credits or other goodies, as more people rented with that hashtag. This touched everything from reservations to checkout. We had several projects by solo developers that fixed major, nasty outstanding issues with our customer service apps, and some very nice data visualizations for both our funnel and our warehouse. All told, we had over ten projects that I would like to see in production, whether added to existing features, as new features, or simply on the data dashboard displayed in our office.

Compare this to the last full-team hackday and the differences are striking. That hackday had very few fully functional projects. Many were simulations of what we could do if we could access certain data or actions. Most didn't work, and only a handful were truly something we could productionize. The team didn't work any less hard, nor was it any less smart than it is now. The major difference between that hackday and now is that we've replatformed our system. Now adding new pages to the site is simple and doesn't require full knowledge of the legacy Drupal code. Getting at data and actions is a matter of understanding our service APIs, or possibly writing a new endpoint. Even our analytics data quality is better.

So if you're wondering whether your replatforming has really made a difference in your development velocity, try running a hackday and seeing what comes out. You may be surprised what you learn, and you'll almost certainly be impressed with what your team can accomplish when given creative freedom and a platform that doesn't resist every attempt at creativity.

Monday, May 20, 2013

ZooKeeper and the Distributed Operating System

From the draft folder, this sat moldering for a few months. Since I think the topic of "distributed coordination as a library/OS fundamental" has flared up a bit in recent conversations, I present this without further editing.

I'm preparing to give a talk at Ricon East about ZooKeeper, and have been thinking a lot about what to cover in this talk. The conference is focused on distributed systems at a somewhat advanced level, so I'm thinking about topics that expand beyond the basics of "what is ZooKeeper" and "how do you use it." After polling Twitter and getting some great feedback, I've decided to focus on the question that many architects face: When should I use ZooKeeper, and when is it overkill?

This topic is interesting to me in many ways. In my current job as VP of Architecture at Rent the Runway, we do not yet use ZooKeeper. There are things we could use it for, but in our world most of the distributed computing we do is pure horizontally scalable web services. We're not yet building out complex networks of servers with different roles that need to be centrally configured, managed, or monitored beyond what you can easily do with simple load balancers and Nagios. And many of the questions I answer on the ZooKeeper mailing list start with "can ZK do this?" The answer I prefer to give is almost always "yes, but keep these things in mind before you roll it out for this purpose." So that is what I want to dig into more in my talk.

I've been digging into a lot of the details of ZAB, Paxos, and distributed coordination in general as part of the talk prep, and hit on an interesting thought: What is the role of ZooKeeper in the world of distributed computing? You can see a very clear breakdown right now in the major distributed systems out there. There are those that are full platforms for certain types of distributed computing: the Hadoop ecosystem, Storm, Solr, Kafka, all of which use ZooKeeper as a service to provide key points of correctness and coordination that need higher transactional guarantees than these systems want to build intrinsically into their own core logic. Then there are the systems, mostly distributed databases, that implement their own coordination logic: MongoDB, Riak, Cassandra, to name a few. This coordination logic often makes different compromises than a true independent Paxos/ZAB implementation would make; for an interesting conversation, check out a Cassandra ticket on the topic.
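To make concrete the kind of coordination those platforms hand off to ZooKeeper, here is a minimal leader election sketch against the standard ZooKeeper Java client: each candidate creates an ephemeral sequential znode, and the lowest sequence number wins. This only shows the shape of the API; a real recipe would also watch its predecessor and handle session expiry, and the connection string and paths here are made up for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

import java.util.Collections;
import java.util.List;

// Minimal leader-election sketch. Assumes the /election parent znode already exists.
public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, event -> {});

        // Each candidate registers an ephemeral, sequentially numbered znode;
        // if this process dies, its znode disappears along with its session.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the lowest sequence number is the leader.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        boolean leader = me.endsWith(candidates.get(0));

        System.out.println(me + (leader ? " is the leader" : " is a follower"));
        zk.close();
    }
}
```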

Thinking about why you would want to use a standard coordination service versus implementing your own internal logic reminds me very much of the difference between modern SQL databases and the rest of the application world. The best RDBMSs are highly tuned beasts. They cut out the middleman as much as possible, taking over functionality from the OS and filesystem as it suits them to get absolutely the best performance for their workload. This makes sense. The competitive edge of the product they are selling is its performance under a very well-defined standard of operation (SQL with ACID guarantees), as well as its ease of operation. And in the new world of distributed databases, owning exactly the logic for distributed coordination (and understanding where that logic falls apart in the specific use cases for that system) will very likely be a competitive edge for a distributed database looking to gain a larger customer base. After all, installing and administering one thing (the database itself) is by definition simpler than installing and administering two things (the database plus something like ZooKeeper). It makes sense to burn your own developer dollars engineering around the edge cases, so as to make a simpler product for your customers.

But ignoring the highly tuned commercial case of distributed databases, I think that ZooKeeper, or a service like it, is a necessary core component of the "operating system" for distributed computing. It does not make sense for most systems to implement their own distributed coordination, any more than it makes sense to implement your own file system to run your RESTful web app. Remember, doing distributed coordination successfully requires more than just, say, a client library that perfectly implements Paxos. Even with such a library, you would need to design your application up front to think about high availability. You need to deploy it from the beginning with enough servers to make a sane quorum. You need to think about how the rest of your application's behavior (say, garbage collection, startup/shutdown conditions, misbehavior) will affect the functioning of your coordination layer. And for most of us, it doesn't make sense to do that up front. Even the developers at Google didn't always think in such terms; the original Chubby paper from 2006 mentions most of these reasons as driving the decision to create a service rather than a client library.

Love it or hate it, ZooKeeper or a service like it is probably going to be a core component of most complex distributed system deployments for the foreseeable future. Which is all the more reason to get involved and help us make it better.

Friday, February 8, 2013

Branching Is Easy. So? Git-flow Is Not Agile.

I've had roughly the same conversation four times now. It starts with the question of our deployment/development strategy, and some way in which it could be tweaked. Inevitably, someone will bring up the well-known git branching model blog post. They ask, why not use this git-flow workflow? It's very well laid out, and relatively easy to understand. Git makes branching easy, after all. The original blog post in fact contends that because branching and merging are extremely cheap and simple, they should be embraced.
As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else.
But here's the thing: There are reasons beyond tool support that would lead one to want to encourage or discourage branching and merging, and mere tool support is not reason enough to embrace a branch-driven workflow.

Let's take a moment to remember the history of git. It was developed by Linus Torvalds for use on the Linux project. He wanted something that could apply patches very quickly and support the kind of distributed workflow you really need when coordinating a huge distributed team. And he made something very, very, very fast, great for branching and distributed work, and difficult to corrupt.

As a result git has many virtues that align perfectly with the needs of a large distributed team. Such a team has potentially long cycles between an idea being discussed, developed, reviewed, and adopted. Easy and fast branching means that I can go off and work on my feature for a few weeks, pulling from master all the while, without having a huge headache when it comes time to finally merge that branch back into the core code base. In my work on ZooKeeper, I often wish I had bothered to keep a git-svn sync going, because reviewing patches in svn is tedious and slow. Git was made to solve my version control problems as an open source software provider.

But at my day job, things are different. I use git because a) git is FAST and b) GitHub. Fast makes so much of a difference that I'm willing to use a tool with a tortured command line syntax and some inherent complexity. GitHub just makes my life easier; I like the interface, and even through production outages I still enjoy using it. But branching is another story. My team is not a distributed team. We all sit in the same office, working on shared repositories. If you need a code review you can tap the shoulder of the person next to you and get one in 5 minutes. We release frequently; I'm trying to move us into a continuous delivery model that may eventually become continuous deployment if we can get the automation in place. And it is for all of these reasons that I do not want to encourage branching or have it as a major part of my workflow.

Feature branching can cause a lot of problems. A developer working on a branch is working alone. They might be frequently pulling in from master, but if everyone is working on their own feature branch, merge conflicts can still hit hard. Maybe they have set things up so that an automated build will still run through every push they make to that branch, but it's just as likely that tests are only being run locally and the minute this goes into master you'll see random failures due to the various gremlins of all software development. Worst of all, it's easy for them to work in the dark, shielded from the eyes of other developers. The burden of doing the right thing is entirely on the developer and good developers are lazy (or busy, or both). It's too easy to let things go for too long without code review, without integration, and without detecting small problems. From a workflow perspective, I want something that makes small problems come to light very early and obviously to the whole team, enabling inherent communication. Branching doesn't fit this bill.

Feature branching also encourages thinking about code and features as all or none. That makes sense when you are delivering a packaged, versioned product that others will have to download and install (say, Linux, or ZooKeeper, or maybe your iOS app). But if you are deploying code to a website, there is no need to think of the code in this binary way. It's reasonable to release incomplete code behind feature flags, flagged off, so that the new code stays integrated and can be tested in other environments. Learning how to write code in such a way that it is chunkable, flaggable, and almost always safe to go into production is a necessary skill set for frequent releases of any sort, and it's essential if you ever want to reach continuous deployment.
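For what flagged-off code can look like in practice, here is a minimal sketch; the flag name and the FeatureFlags interface are hypothetical, not any particular library or our actual code.

```java
import java.util.Set;

// Minimal feature-flag sketch; names are hypothetical.
public class CheckoutRenderer {

    // In a real system this would be backed by config, a database, or a flag service,
    // so flags can flip without a deploy.
    public interface FeatureFlags {
        boolean isEnabled(String flag);
    }

    private final FeatureFlags flags;

    public CheckoutRenderer(FeatureFlags flags) {
        this.flags = flags;
    }

    public String renderCheckout() {
        if (flags.isEnabled("new-checkout-flow")) {
            // Incomplete work can merge and deploy dark until the flag flips on.
            return renderNewCheckout();
        }
        return renderLegacyCheckout(); // current behavior stays the default
    }

    private String renderNewCheckout() { return "new checkout page"; }
    private String renderLegacyCheckout() { return "legacy checkout page"; }

    public static void main(String[] args) {
        Set<String> enabled = Set.of(); // nothing flipped on yet
        CheckoutRenderer renderer = new CheckoutRenderer(enabled::contains);
        System.out.println(renderer.renderCheckout()); // prints the legacy page
    }
}
```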

Release branching may still be a necessary part of your workflow, as it is in some of our systems, but even the release branching parts of the git-flow process seem a bit overly complex. I don't see the point in having a develop branch, nor do I see why you would care about keeping master pristine, since you can tag the points in the master timeline where you cut the release branch. (As an aside, the fact that the original post refers to "nightly builds" as the purpose of the develop branch should raise the eyebrows of anyone doing continuous integration.) If you're not doing full continuous deployment you need some sort of branch that indicates where you cut the code for testing and release, and hotfixes may need to go into up to two places (that release branch and master), but git-flow doesn't solve the problem of pushing fixes to multiple places. So why not just have master and release branches? You can keep your release branches around for as long as you need them to get live fixes out, and even longer for historical records if you so desire.

Git is great for branching. So what? Just because a tool offers a feature, and does it well, does not mean that feature is actually important for your team. Building a whole workflow around a feature just because you can is rarely a good idea. Use the workflow your team needs; don't cargo cult an important element of your development process.